Thursday, April 12, 2018

Compilation time – -std=c++98 vs. -std=c++11

I saw on Twitter that compiling the Ogre graphics engine takes more than twice as long with GCC as with Clang. My experience is that GCC and Clang usually compile at similar speeds, so I decided to look into why compiling Ogre is different.

It turned out that a big part of the difference comes from which C++ version the compilers use by default – clang-5.0 defaults to C++98, while gcc-7 defaults to a newer version. Forcing the compilers to use C++98 by passing -std=c++98 makes GCC compile Ogre in about half the time (668s vs. 1135s), while passing -std=c++11 nearly doubles the time Clang needs to compile it!

One reason for this difference is that some of the standard include files are more expensive in C++11 mode as they suck in more dependencies. For example, compiling a file containing just the line
#include <memory>
takes 0.16 seconds on my computer when using C++11
> time -p g++ -O2 -c test.cpp -std=c++11
real 0.16
user 0.14
sys 0.01
while compiling it as C++98 is faster
> time -p g++ -O2 -c test.cpp -std=c++98
real 0.02
user 0.01
sys 0.01
The 0.14-second difference may not seem that big, but it makes a difference when, as for Ogre, you are compiling more than 500 files, each taking about one second. The increased cost of including standard header files for C++11 compared to C++98 adds about 20% to the Ogre build time.

It is a bit unclear to me exactly where the rest of the slowdown comes from, but it seems to be spread all over the code (I tried to remove various classes in the Ogre code base, and removing 10% of the source code seems to affect both the fast and slow version by about 10%) so I assume this is just because the templates in the C++11 STL are more complex and the compiler needs to work a bit harder each time they are used...

Anyway, the difference in compilation time between -std=c++98 and -std=c++11 was much bigger than I had guessed, and I’ll now ensure I use -std=c++98 when building C++98 code.


Updated: The original blog post said that gcc-7 uses C++11 by default. That was wrong; it defaults to C++14.

Saturday, March 3, 2018

Detecting incorrect C++ STL usage

The GCC and LLVM sanitizers are great for finding problems in C++ code, but they do not detect problems with incorrect STL usage. For example, std::list::merge merges two sorted lists, so the following program is incorrect (as list2 is not sorted)
#include <iostream>
#include <list>

int main()
{
  std::list<int> list1 = {1, 2, 3, 4};
  std::list<int> list2 = {5, 3, 4, 2};
  list1.merge(list2);
  for (auto x : list1)
    std::cout << x << '\n';
}
but this cannot be detected by the sanitizers. The libstdc++ library does, however, have a “debug mode” that can be used to detect this kind of problem. The debug mode is enabled by passing -D_GLIBCXX_DEBUG to the compiler
g++ -O2 -D_GLIBCXX_DEBUG example.cpp
which enables assertions checking the preconditions, and the program fails at runtime
/scratch/gcc-7.2.0/install/include/c++/7.2.0/debug/list:716:
Error: elements in iterator range [__x.begin().base(), __x.end().base())
are not sorted.

Objects involved in the operation:
    iterator "__x.begin().base()" @ 0x0x7fff1754ea10 {
      type = std::__cxx1998::_List_iterator;
    }
    iterator "__x.end().base()" @ 0x0x7fff1754ea40 {
      type = std::__cxx1998::_List_iterator;
    }
Abort (core dumped)
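
The precondition that the debug mode checks can also be tested by hand when rebuilding everything with -D_GLIBCXX_DEBUG is not an option. The following is only a sketch: checked_merge is a name made up for this illustration, and it assumes the default ascending order used by merge.

```cpp
#include <algorithm>
#include <cassert>
#include <list>

// Hypothetical helper: merge src into dst only if both lists satisfy the
// precondition of std::list::merge (sorted in ascending order).
// Returns false and leaves both lists untouched otherwise.
inline bool checked_merge(std::list<int>& dst, std::list<int>& src)
{
  if (!std::is_sorted(dst.begin(), dst.end())
      || !std::is_sorted(src.begin(), src.end()))
    return false;
  dst.merge(src);
  return true;
}

// The lists from the example above: the merge is refused until list2 is sorted.
inline void check_checked_merge()
{
  std::list<int> list1 = {1, 2, 3, 4};
  std::list<int> list2 = {5, 3, 4, 2};
  const bool merged_unsorted = checked_merge(list1, list2);
  assert(!merged_unsorted);
  list2.sort();
  const bool merged_sorted = checked_merge(list1, list2);
  assert(merged_sorted);
  assert(list1.size() == 8 && list2.empty());
}
```

The check walks both lists, so it costs linear time on every call, which is the same kind of overhead the debug mode adds.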

There are a few constructs that are invalid according to the C++ standard but that libstdc++ handles as an extension. One such example is inserting a range of a list into the list the range points to, as in
#include <iostream>
#include <list>

int main()
{
  std::list<int> list1 = {1, 2, 3, 4};
  list1.insert(list1.begin(), list1.begin(), list1.end());
  for (auto x : list1)
    std::cout << x << '\n';
}
The debug mode does not report errors for these extensions when using -D_GLIBCXX_DEBUG, but the debug mode can be made more pedantic by adding -D_GLIBCXX_DEBUG_PEDANTIC
g++ -O2 -D_GLIBCXX_DEBUG -D_GLIBCXX_DEBUG_PEDANTIC example.cpp
which reports errors for these too.

One annoyance with the debug mode is that it changes the size of some standard class templates, so you cannot pass containers between translation units compiled with and without debug mode – this often means that you need to build the whole application with debug mode enabled.

The libstdc++ debug mode was introduced in GCC 3.4.

Sunday, February 4, 2018

GCC command options for debugging – -Og and -g3

I listened to a recent CppCast episode where Balázs Török talked about game development and mentioned that C++ abstractions make the code unusably slow in debug builds, and that he is skeptical of the debuggability of meta-classes as today's debuggers cannot even handle macros. I think both problems are solvable, and I would argue that GCC is already handling at least the first issue, provided the right command options are used.

Debugging optimized code

GCC can generate debug information when optimizing, so it is possible to run fully optimized code in the debugger. But this is not too useful in reality – many optimizations change the structure of the code, so it is often impossible to single-step in the resulting binary as instructions from different parts of the program are interleaved...

The GCC developers have traditionally tried to limit the damage done by the optimizers for the -O1 optimization level, but -O1 is often used for release builds too, so there is a limit to how many optimizations can be disabled without annoying too many developers – a new optimization level, -Og, was therefore introduced in GCC 4.8. The -Og optimization level enables the optimizations that do not interfere with debugging, and it may even result in a better debugging experience than when compiling without optimizations, as some optimization passes collect information useful for generating better debug information.

The difference between -O1 and -Og is that
  • -Og disables some optimizations, such as if-conversion, that simplify control flow, so the structure of the generated code is roughly the same as in the source code.
  • -Og disables some passes, such as -ftree-pta and -ftree-sra, that help other optimization passes by propagating information about memory accesses throughout the functions. The effect of this is that those passes now optimize with only the information available locally within each basic block.
  • -Og is less aggressive in the back end peephole optimizations, so each generated instruction is less likely to execute functionality from several statements in the source code.
How much this affects the performance depends a lot on coding style etc., but modern CPUs are great at hiding inefficiencies using branch-prediction, speculative execution, and store to load forwarding, so the difference between -Og and fully optimized code is often surprisingly small, even when using the STL.

Debug information for macros

GCC does not put information about macros in the debug information by default, but it is possible to add it by passing -g3 to the compiler. This makes GDB know of the macros and enables some macro-related commands. But the debugging experience is still not that great. I do not understand why better support has not been implemented – possibly because inline functions should be used instead of macros...

Sunday, January 21, 2018

GCC back end performance tuning

This is part six of a series “Writing a GCC back end”.

Cost model – TARGET_RTX_COSTS

The compiler often has different options for how it can optimize and emit the code. For example, dividing a 32-bit integer by the constant value 3
i = i / 3;
can be generated as a division instruction, but division instructions are slow, so it may be better to generate this as the equivalent of
i = (((int64_t)i * 0x55555556) >> 32) - (i >> 31);
which gives the same result. GCC decides which to generate by preparing RTL for the different alternatives, and queries the target’s cost function (if implemented) to determine which is the cheapest. Which alternatives are tried depends on which insns are defined in the target description, and what constraints the insns have, but the compiler will, in this case, ask for the costs of subexpressions of the form
(truncate:SI (lshiftrt:DI (mult:DI (sign_extend:DI (reg:SI 88))
                                   (const_int 0x55555556))
                          (const_int 32)))
and compares their combined costs with
(div:SI (reg:SI 88) (const_int 3))
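
That the multiplication form really gives the same result as the division can be checked at compile time. This is a small sketch; the function name div3_by_multiply is made up for the illustration, and it assumes that >> on a negative value is an arithmetic shift (true for GCC and Clang, and guaranteed from C++20).

```cpp
#include <cstdint>

// Signed division by 3 done with a multiplication and two shifts,
// using the same expression as in the text above.
constexpr std::int32_t div3_by_multiply(std::int32_t i)
{
  return static_cast<std::int32_t>(
             (static_cast<std::int64_t>(i) * 0x55555556) >> 32)
         - (i >> 31);
}

// Spot checks against the ordinary division, including the extremes.
static_assert(div3_by_multiply(7) == 7 / 3, "positive value");
static_assert(div3_by_multiply(-7) == -7 / 3, "negative value rounds toward zero");
static_assert(div3_by_multiply(INT32_MAX) == INT32_MAX / 3, "largest value");
static_assert(div3_by_multiply(INT32_MIN) == INT32_MIN / 3, "smallest value");
```

The subtraction of i >> 31 adds one for negative inputs, which turns the rounding toward negative infinity done by the shift into the rounding toward zero that C and C++ division requires.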

Implementing this cost function is relatively easy for simple architectures – it consists of a switch statement with a case for each operation, returning the cost expressed as the number of nop instructions (which usually means the number of cycles)
static bool
machine_rtx_costs (rtx x, machine_mode mode, int outer_code, int opno,
                   int *total, bool speed)
{
  switch (GET_CODE (x))
    {
    case CONST_INT:
      *total = 0;
      return true;

    case AND:
    case IOR:
    case XOR:
      *total = COSTS_N_INSNS (GET_MODE_SIZE (mode) > UNITS_PER_WORD ? 2 : 1);
      return false;

    case ABS:
      *total = COSTS_N_INSNS (FLOAT_MODE_P (mode) ? 1 : 3);
      return false;

    // ...

    default:
      return false;
    }
}

#undef TARGET_RTX_COSTS
#define TARGET_RTX_COSTS machine_rtx_costs
Returning true from the cost function means that it has written the cost of the whole RTL expression x to *total, and returning false means that *total covers only the outermost operation in the expression (in which case the cost function will be called separately on the operands).
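
The effect of the return value can be illustrated with a toy model. This is not GCC code: Expr, toy_rtx_costs, and expr_cost are names invented for this sketch, the costs are arbitrary, and the real caller in GCC does more work than this.

```cpp
#include <cassert>
#include <vector>

// Toy expression tree standing in for RTL (hypothetical, not GCC's rtx).
enum class Code { Reg, ConstInt, Plus, Mult, Mem };

struct Expr
{
  Code code;
  std::vector<Expr> operands;
};

// Toy target hook: returns true when *total is the cost of the whole
// expression, false when it only covers the outermost operation.
inline bool toy_rtx_costs(const Expr& x, int* total)
{
  switch (x.code)
    {
    case Code::ConstInt:
      *total = 0;
      return true;
    case Code::Mem:
      *total = 3;  // The address computation is included in the access.
      return true;
    case Code::Mult:
      *total = 4;
      return false;
    case Code::Plus:
      *total = 1;
      return false;
    default:
      *total = 0;
      return false;
    }
}

// Toy caller: adds the operand costs only when the hook returned false.
inline int expr_cost(const Expr& x)
{
  int total = 0;
  if (toy_rtx_costs(x, &total))
    return total;
  for (const Expr& op : x.operands)
    total += expr_cost(op);
  return total;
}

inline void check_expr_cost()
{
  const Expr reg{Code::Reg, {}};
  const Expr add{Code::Plus, {reg, reg}};
  const Expr mem{Code::Mem, {add}};
  assert(expr_cost(add) == 1);  // Cost of the addition, the registers are free.
  assert(expr_cost(mem) == 3);  // The addition inside the address is not counted.
}
```

In this model the addition costs 1 on its own but nothing extra when it appears inside a memory address, because the Mem case returns true and the operands are never visited.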

The cost function gets complicated fast as the CPU gets more complex with different costs depending on how the operations combine with other operations. For example, an addition may have different cost if it can be done as part of the addressing mode in a memory operation
(set (reg:SI 90)
        (mem:SI (plus:SI (reg:SI 88) (reg:SI 89))))
compared to if it is a normal addition
(set (reg:SI 90)
        (plus:SI (reg:SI 88) (reg:SI 89)))
But the cost function does not need to be too exact – there are many optimizations running after the optimization passes calling the cost function (and it is not obvious what the cost even means for superscalar out-of-order CPUs anyway...).

Cost model – more configuration options

There are about 30 additional macros guiding performance-related decisions during compilation. These cover various properties such as relative cost of different addressing modes and registers, how expensive branches are compared to arithmetic operations, and when the compiler should unroll/inline memcpy instead of calling the function in the library.

See “Describing Relative Costs of Operations” in “GNU Compiler Collection Internals” for a list of these macros.

Peephole optimizations

The define_peephole2 definition in the target description takes a sequence of insns and transforms them to a new sequence of insns, working in essentially the same way as define_expand. This is used to take advantage of target-specific instructions that the generic peephole optimizations cannot handle.

But there is in general not much need to write peephole optimizations – the insns describe exactly what they do in the RTL pattern, so GCC can reason about the insns and combine them when possible. So missing peephole optimizations are in general deficiencies in the machine description, such as missing define_insn, too conservative constraints (so that GCC does not believe the transformation is allowed), incorrect cost model (so it seems to be slower), etc.

Tuning optimization passes for the target architecture

GCC lets the user enable or disable optimization passes (using -f-options) and change different thresholds (using --param) when compiling. The default value for all of these can be set by macros in the target-specific configuration file gcc/common/config/machine/machine-common.c.

TARGET_OPTION_OPTIMIZATION_TABLE is used to enable or disable optimization passes at the various optimization levels. For example, this code snippet from the i386 backend enables -free for -O2 and higher optimization levels, and disables -fschedule-insns for all optimization levels
static const struct default_options machine_option_optimization_table[] =
  {
    /* Enable redundant extension instructions removal at -O2 and higher.  */
    { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
    /* Turn off -fschedule-insns by default.  It tends to make the
       problem with not enough registers even worse.  */
    { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },

    { OPT_LEVELS_NONE, 0, NULL, 0 }
  };

#undef TARGET_OPTION_OPTIMIZATION_TABLE
#define TARGET_OPTION_OPTIMIZATION_TABLE machine_option_optimization_table

TARGET_OPTION_DEFAULT_PARAMS is used to set the default value for --param parameters. For example, the default value of the parameter l1-cache-line-size is modified as
static void
machine_option_default_params (void)
{
  set_default_param_value (PARAM_L1_CACHE_LINE_SIZE, 16);
}

#undef TARGET_OPTION_DEFAULT_PARAMS
#define TARGET_OPTION_DEFAULT_PARAMS machine_option_default_params

The backend may need to change the default values for other options affecting how the compiler works. For example, it makes sense to make -fno-delete-null-pointer-checks the default for a microcontroller where address 0 is a valid address. This can be done by using TARGET_OPTION_INIT_STRUCT
static void
machine_option_init_struct (struct gcc_options *opts)
{
  opts->x_flag_delete_null_pointer_checks = 0;
}

#undef TARGET_OPTION_INIT_STRUCT
#define TARGET_OPTION_INIT_STRUCT machine_option_init_struct

Further reading

All functionality is described in “GNU Compiler Collection Internals”.

Tuesday, December 12, 2017

More about GCC instruction patterns

This is part five of a series “Writing a GCC back end”.

The previous post about GCC’s low-level IR contained only the minimum needed to get started – this post continues with a bit more of the functionality used in machine descriptions.

define_expand

The define_insn definition describes an insn implementing the named functionality (used when converting the higher level IR to RTL), but there are cases where the named functionality must expand to more than one insn, or when the RTL should be generated differently depending on the operands – this is handled by define_expand.

The general format of define_expand is essentially the same as for define_insn – it consists of a name, an RTL template, a condition string containing C++ code to enable/disable the instruction pattern, and a section of C++ code (called preparation statements) that can be used to generate RTL.

Our first example of how to use define_expand is from the Arm handling of rotlsi3 (“rotate left”) – Arm only has a “rotate right” instruction, but rotating left by x is the same as rotating right by 32-x, and we can generate this as:
(define_expand "rotlsi3"
  [(set (match_operand:SI              0 "s_register_operand" "")
        (rotatert:SI (match_operand:SI 1 "s_register_operand" "")
                     (match_operand:SI 2 "reg_or_int_operand" "")))]
  "TARGET_32BIT"
  {
    if (CONST_INT_P (operands[2]))
      operands[2] = GEN_INT ((32 - INTVAL (operands[2])) % 32);
    else
      {
        rtx reg = gen_reg_rtx (SImode);
        emit_insn (gen_subsi3 (reg, GEN_INT (32), operands[2]));
        operands[2] = reg;
      }
  })
The preparation statements are used to modify the second operand given to rotlsi3 to calculate the correct value for the code generated by the RTL template.
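
The identity behind this expansion is easy to check in plain C++. The sketch below mirrors the constant case of the preparation statements; rotate_right and rotate_left are names made up for the illustration, and the rotate count is assumed to be in the range 0 to 31.

```cpp
#include <cstdint>

// Rotate right by n bits; the count is reduced modulo 32.
constexpr std::uint32_t rotate_right(std::uint32_t x, unsigned n)
{
  return (n % 32 == 0) ? x : (x >> (n % 32)) | (x << (32 - n % 32));
}

// Rotate left expressed through rotate right: the count becomes
// (32 - n) % 32, as in the CONST_INT_P branch above.
// Assumes 0 <= n < 32.
constexpr std::uint32_t rotate_left(std::uint32_t x, unsigned n)
{
  return rotate_right(x, (32 - n) % 32);
}

static_assert(rotate_left(0x80000001u, 1) == 0x00000003u, "top bit wraps around");
static_assert(rotate_left(0x12345678u, 8) == 0x34567812u, "rotate by one byte");
static_assert(rotate_left(0x12345678u, 0) == 0x12345678u, "a count of 0 is a no-op");
```

The final % 32 matters for a count of 0, which would otherwise become a rotate right by 32.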

define_expand may choose to skip the generation of the RTL template by executing DONE in the preparation statements. An example of how this is used is the Moxie implementation of mulsidi3 that creates a 64-bit result from multiplying two 32-bit values. The Moxie back end generates this pattern as two instructions (one that calculates the upper 32 bits, and one that calculates the lower 32 bits) where all work is done by the preparation statements
(define_expand "mulsidi3"
  [(set (match_operand:DI 0 "register_operand" "")
        (mult:DI (sign_extend:DI (match_operand:SI 1 "register_operand" "0"))
                 (sign_extend:DI (match_operand:SI 2 "register_operand" "r"))))]
  ""
  {
    rtx hi = gen_reg_rtx (SImode);
    rtx lo = gen_reg_rtx (SImode);

    emit_insn (gen_mulsi3_highpart (hi, operands[1], operands[2]));
    emit_insn (gen_mulsi3 (lo, operands[1], operands[2]));
    emit_move_insn (gen_lowpart (SImode, operands[0]), lo);
    emit_move_insn (gen_highpart (SImode, operands[0]), hi);
    DONE;
  })
Note: This define_expand contains an RTL template, but it is not needed as RTL will not be generated from it. It would have been enough to specify the operands, as in
(define_expand "mulsidi3"
  [(match_operand:DI 0 "register_operand")
   (match_operand:SI 1 "register_operand")
   (match_operand:SI 2 "register_operand")]
  ""
  {
    rtx hi = gen_reg_rtx (SImode);
    rtx lo = gen_reg_rtx (SImode);

    emit_insn (gen_mulsi3_highpart (hi, operands[1], operands[2]));
    emit_insn (gen_mulsi3 (lo, operands[1], operands[2]));
    emit_move_insn (gen_lowpart (SImode, operands[0]), lo);
    emit_move_insn (gen_highpart (SImode, operands[0]), hi);
    DONE;
  })
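
What the two insns and the two moves compute can be written out in C++. This is only an illustration of the arithmetic, not of the Moxie back end itself; the function names are invented, and the code assumes two's complement conversions and an arithmetic right shift for negative values (true for GCC and Clang, and guaranteed from C++20).

```cpp
#include <cstdint>

// Low 32 bits of the product (the part calculated by mulsi3).
constexpr std::uint32_t mul_lo(std::int32_t a, std::int32_t b)
{
  return static_cast<std::uint32_t>(a) * static_cast<std::uint32_t>(b);
}

// High 32 bits of the signed 64-bit product (the part calculated by
// mulsi3_highpart).
constexpr std::int32_t mul_hi(std::int32_t a, std::int32_t b)
{
  return static_cast<std::int32_t>((static_cast<std::int64_t>(a) * b) >> 32);
}

// Assemble the 64-bit result from the two halves, corresponding to the
// moves into the low part and high part of operand 0.
constexpr std::int64_t mulsidi(std::int32_t a, std::int32_t b)
{
  return static_cast<std::int64_t>(
      (static_cast<std::uint64_t>(static_cast<std::uint32_t>(mul_hi(a, b))) << 32)
      | mul_lo(a, b));
}

static_assert(mulsidi(-3, 7) == -21, "small negative product");
static_assert(mulsidi(INT32_MAX, INT32_MAX)
              == static_cast<std::int64_t>(INT32_MAX) * INT32_MAX,
              "product that needs the high part");
```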

DONE works as a return statement, and it can be used in conditional code to let the preparation statements override the RTL template for special cases while using the RTL template for the general case. The Arm smaxsi3 instruction pattern (calculating the maximum of two signed integers) uses this to handle two special cases more efficiently:
(define_expand "smaxsi3"
  [(parallel [
    (set (match_operand:SI 0 "s_register_operand" "")
         (smax:SI (match_operand:SI 1 "s_register_operand" "")
                  (match_operand:SI 2 "arm_rhs_operand" "")))
    (clobber (reg:CC CC_REGNUM))])]
  "TARGET_32BIT"
  {
    if (operands[2] == const0_rtx || operands[2] == constm1_rtx)
      {
        /* No need for a clobber of the condition code register here.  */
        emit_insn (gen_rtx_SET (operands[0],
                                gen_rtx_SMAX (SImode, operands[1],
                                              operands[2])));
        DONE;
      }
  })
The problem it solves is that the general case translates to the instructions
cmp   r1, r2
movge r0, r1
movlt r0, r2
which clobbers the condition code, but the special cases max(x,0) and max(x,-1) can be generated as
bic   r0, r1, r1, asr #31
and
orr   r0, r1, r1, asr #31
which do not clobber CC. This optimization could be handled by later optimization passes, but doing it as early as possible (i.e. when generating the RTL) gives more freedom to all later passes to optimize cases that otherwise would have been prevented by the clobbering.
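
The two special cases correspond to the following bit operations, written here in C++ as a sketch. The function names are made up for the illustration, and the code assumes that >> on a negative value is an arithmetic shift, as the asr in the instructions is.

```cpp
#include <cstdint>

// max(x, 0), the bic form: x >> 31 is all ones for negative x and 0
// otherwise, so the AND with its complement clears negative values to 0.
constexpr std::int32_t max_with_zero(std::int32_t x)
{
  return x & ~(x >> 31);
}

// max(x, -1), the orr form: negative values get every bit set, giving -1.
constexpr std::int32_t max_with_minus_one(std::int32_t x)
{
  return x | (x >> 31);
}

static_assert(max_with_zero(5) == 5, "non-negative values are unchanged");
static_assert(max_with_zero(-5) == 0, "negative values become 0");
static_assert(max_with_minus_one(5) == 5, "non-negative values are unchanged");
static_assert(max_with_minus_one(-5) == -1, "negative values become -1");
```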

Note: The RTL always has the constant as the second operand for commutative binary operations, so the code does not need to check the first operand.

One other way to return from the preparation statements is to execute FAIL which has the effect of ignoring the instruction pattern in the same way as if the pattern had been disabled by the condition string. An example of this is the AArch64 movmemdi instruction pattern implementing a memory block move:
(define_expand "movmemdi"
  [(match_operand:BLK 0 "memory_operand")
   (match_operand:BLK 1 "memory_operand")
   (match_operand:DI 2 "immediate_operand")
   (match_operand:DI 3 "immediate_operand")]
  "!STRICT_ALIGNMENT"
  {
    if (aarch64_expand_movmem (operands))
      DONE;
    FAIL;
  })
This calls the target-specific aarch64_expand_movmem function that checks if it makes sense to expand the block move inline (that is, if the result will be relatively small) and generates a sequence of move insns if that is the case. If not, it just returns false, which makes this pattern call FAIL, and GCC will ignore this instruction pattern and generate a call to memcpy instead.

The unspec and unspec_volatile expression codes

The RTL template in define_insn contains expressions describing the functionality of the instruction pattern, which enables the optimizers to reason about the insn. Many architectures have instructions that cannot be described by this (or where it does not make sense to describe the functionality, such as instructions for AES encryption – the optimizers cannot take advantage of this description anyway). These cases are handled by describing the instructions using an unspec or unspec_volatile expression which the compiler treats as a black box – the only knowledge the compiler has is what is described by the predicates and register constraints.

One example of how this is used is the AArch64 set_fpsr insn that writes to the floating-point status register
(define_insn "set_fpsr"
  [(unspec_volatile [(match_operand:SI 0 "register_operand" "r")] UNSPECV_SET_FPSR)]
  ""
  "msr\\tfpsr, %0")
This describes a volatile instruction (i.e. an instruction with side effects) that takes a register operand. The last operand to the unspec and unspec_volatile expressions is an integer that identifies the instruction (the backend may have several different unspec instructions, and each gets a different number) – these are by convention defined as enumerations called unspec and unspecv
(define_c_enum "unspecv" [
    UNSPECV_EH_RETURN           ; Represent EH_RETURN
    UNSPECV_GET_FPCR            ; Represent fetch of FPCR content.
    UNSPECV_SET_FPCR            ; Represent assign of FPCR content.
    UNSPECV_GET_FPSR            ; Represent fetch of FPSR content.
    UNSPECV_SET_FPSR            ; Represent assign of FPSR content.
    UNSPECV_BLOCKAGE            ; Represent a blockage
    UNSPECV_PROBE_STACK_RANGE   ; Represent stack range probing.
  ])

Attributes

It is possible to add extra information to an insn (such as the length or scheduling constraints) that the back end may take advantage of when generating the code. This is done by defining attributes
(define_attr name list-of-values default)
where
  • name is a string containing the name of the attribute.
  • list-of-values is either a string that specifies a comma-separated list of values that can be assigned to the attribute, or an empty string to specify that the attribute takes numeric values.
  • default is an expression that gives the value of this attribute for insns whose definition does not include an explicit value for the attribute.
For example,
(define_attr "length" "" (const_int 2))
defines a numeric attribute “length” having the default value 2. The back end may define attributes with any name, but a few names have a specific usage in GCC. For example, the length attribute contains the instruction’s length measured in bytes, and is used when calculating branch distance for architectures where different instructions are used for “short” and “long” branches.

Attribute values are assigned to insns by attaching a set_attr to the define_insn as in
(define_insn "*call"
  [(call (mem:QI (match_operand:SI 0 "nonmemory_operand" "i,r"))
         (match_operand 1 "" ""))]
  ""
  "@
   jsra\\t%0
   jsr\\t%0"
  [(set_attr "length" "6,2")])
This gives the length attribute the value 6 if the first alternative was matched, and 2 if the second alternative was matched.

The attributes can be accessed from C++ code by calling the auto-generated function get_attr_name, as in
int len = get_attr_length (insn);
The return type of get_attr_name for attributes defined with a list-of-values is an enum of the possible values.

Further reading

All functionality is described in “GNU Compiler Collection Internals”.

Sunday, October 29, 2017

Excessive GCC memory usage for large std::bitset arrays

The C++ Slack channel had a discussion last week about the code
#include <array>
#include <bitset>
#include <cstddef>

constexpr std::size_t N = 100000;
std::array<std::bitset<N>, N> elems;

int main() {}
that makes GCC consume about 9 gigabytes of memory in the parsing phase of the compilation. This does not happen when using C-style arrays, so changing the definition of elems to
std::bitset<N> elems[N];
makes the code compile without needing an excessive amount of memory. So why does GCC consume all this memory while parsing, and only when using std::array?

The reason has to do with deficiencies in GCC’s implementation of constexpr. To see what is happening, we start by expanding the include files and removing everything not needed for the program:
namespace std
{
  typedef long unsigned int size_t;
}

namespace std __attribute__ ((__visibility__ ("default")))
{
  template<typename _Tp, std::size_t _Nm>
    struct __array_traits
    {
      typedef _Tp _Type[_Nm];
    };

  template<typename _Tp, std::size_t _Nm>
    struct array
    {
      typedef std::__array_traits<_Tp, _Nm> _AT_Type;
      typename _AT_Type::_Type _M_elems;
    };
}

namespace std __attribute__ ((__visibility__ ("default")))
{
  template<size_t _Nw>
    struct _Base_bitset
    {
      typedef unsigned long _WordT;

      _WordT _M_w[_Nw];

      constexpr _Base_bitset() noexcept
      : _M_w() { }
    };

  template<size_t _Nb>
    class bitset
    : private _Base_bitset<((_Nb) / (8 * 8) + ((_Nb) % (8 * 8) == 0 ? 0 : 1))>
    {
    };
}

constexpr std::size_t N = 100000;
std::array<std::bitset<N>, N> elems;

int main() {}
_Base_bitset has a constexpr constructor, and this makes GCC end up building the whole elems array in memory at compile time. The array is large (1,250,400,000 bytes) and GCC needs to use even more memory as it represents the array by building AST nodes for the elements.

The constexpr keyword does not mean that the compiler must evaluate at compile time – it only means that it can be evaluated at compile time if the result is used where only constant expressions are allowed. I had assumed that compile-time evaluation is not needed in this case, but GCC seems to always evaluate constexpr at compile time when instantiating templates. Anyway, GCC could use a more efficient representation that does not need to keep the whole array expanded in memory...

The C-style array is not expanded in memory at compile time as it is not defined as a template. The bitset class is still expanded, but it is small, and the compiler only wastes 12,504 bytes of memory by expanding it.
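
The sizes quoted above follow directly from the rounding-up expression in the bitset base class. The sketch below redoes the arithmetic, assuming 64-bit words of 8 bytes as in the expanded code; words_for_bits is a name made up for the illustration.

```cpp
#include <cstdint>

// Number of 64-bit words needed for nb bits, using the same rounding-up
// expression as the _Base_bitset template argument above.
constexpr std::uint64_t words_for_bits(std::uint64_t nb)
{
  return nb / (8 * 8) + (nb % (8 * 8) == 0 ? 0 : 1);
}

constexpr std::uint64_t N = 100000;
constexpr std::uint64_t bytes_per_bitset = words_for_bits(N) * 8;
constexpr std::uint64_t bytes_for_array = bytes_per_bitset * N;

static_assert(bytes_per_bitset == 12504, "size of one std::bitset<N>");
static_assert(bytes_for_array == 1250400000, "size of N such bitsets");
```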

Saturday, September 16, 2017

Useful GCC warning options not enabled by -Wall -Wextra

GCC can warn about questionable constructs in the source code, but most such warnings are not enabled by default – developers need to use the options -Wall and -Wextra to get all generally useful warnings. There are many additional warning options that are not enabled by -Wall -Wextra as they may produce too many false positive warnings or be targeted to a specific obscure use case, but I think a few of them (listed below) may be useful for general use.

-Wduplicated-cond

Warn about duplicated condition in if-else-if chains, such as
int foo(int a)
{
  int b;
  if (a == 0)
    b = 42;
  else if (a == 0)
    b = 43;
  return b;
}
Note: -Wduplicated-cond was added in GCC 6.

-Wduplicated-branches

Warn when an if-else has identical branches, such as
int foo(int a)
{
  int b;
  if (a == 0)
    b = 42;
  else
    b = 42;
  return b;
}
It also warns for conditional operators having identical second and third expressions
int foo(int a)
{
  int b;
  b = (a == 0) ? 42 : 42;
  return b;
}
Note: -Wduplicated-branches was added in GCC 7.

-Wlogical-op

Warn about use of logical operations where a bitwise operation probably was intended, such as
int foo(int a)
{
  a = a || 0xf0;
  return a;
}
It also warns when the operands of logical operations are the same
int foo(int a)
{
  if (a < 0 && a < 0)
    return 0;
  return 1;
}
Note: -Wlogical-op was added in GCC 4.3.

-Wrestrict

Warn when the compiler detects that an argument passed to a restrict or __restrict qualified parameter aliases another argument.
void bar(char * __restrict, char * __restrict);

void foo(char *p)
{
  bar(p, p);
}
Note: -Wrestrict was added in GCC 7.

-Wnull-dereference

Warn when the compiler detects paths that dereference a null pointer.
void foo(int *p, int a)
{
  int *q = 0;
  if (0 <= a && a < 10)
    q = p + a;
  *q = 1;  // q may be NULL
}
Note: -Wnull-dereference was added in GCC 6.

-Wold-style-cast

Warn if a C-style cast to a non-void type is used within a C++ program.
int *foo(void *p)
{
  return (int *)p;
}
Note: -Wold-style-cast was added before GCC 3.
Note: -Wold-style-cast is only available for C++.

-Wuseless-cast

Warn when an expression is cast to its own type within a C++ program.
int *foo(int *p)
{
  return static_cast<int *>(p);
}
Note: -Wuseless-cast was added in GCC 4.8.
Note: -Wuseless-cast is only available for C++.

-Wjump-misses-init

Warn if a goto statement or a switch statement jumps forward across the initialization of a variable, or jumps backward to a label after the variable has been initialized.
int foo(int a)
{
  int b;
  switch (a)
  {
  case 0:
    b = 0;
    int c = 42;
    break;
  default:
    b = c;  // c not initialized here
  }
  return b;
}
Note: -Wjump-misses-init was added in GCC 4.5.
Note: -Wjump-misses-init is only available for C – jumping over variable initialization is an error in C++.

-Wdouble-promotion

Warn when a value of type float is implicitly promoted to double.

Floating point constants have the type double, which makes it easy to accidentally compute in a higher precision than intended. For example,
float area(float radius)
{
  return 3.14159 * radius * radius;
}
does all the computation in double precision instead of float. There is normally no difference in performance between float and double for scalar x86 code (although there may be a big difference for small, embedded, CPUs), but double may be much slower after vectorization as only half the number of elements fit in the vectors compared to float values.
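
The promotion is visible in the type of the expression, and adding an f suffix to the constant avoids it. A small sketch (the 1.0f operand only stands in for radius inside decltype):

```cpp
#include <type_traits>

// The double constant promotes the float operand, so the multiplication is
// done in double and only the returned value is converted back to float.
static_assert(std::is_same<decltype(3.14159 * 1.0f), double>::value,
              "double * float is computed as double");

// With an f suffix the constant is a float and no promotion takes place.
static_assert(std::is_same<decltype(3.14159f * 1.0f), float>::value,
              "float * float stays float");

// The function from above with a float constant.
constexpr float area(float radius)
{
  return 3.14159f * radius * radius;
}
```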

Note: -Wdouble-promotion was added in GCC 4.6.

-Wshadow

Warn when a local variable or type declaration shadows another variable, parameter, type, or class member.
int result;

int foo(int *p, int len)
{
  int result = 0;  // Shadows the global variable
  for (int i = 0; i < len; i++)
    result += p[i];
  return result;
}
Note: -Wshadow was added before GCC 3.

-Wformat=2

The -Wformat option warns when calls to printf, scanf, and similar functions have an incorrect format string or when the arguments do not have the correct type for the format string. The option is enabled by -Wall, but it can be made more aggressive by adding -Wformat=2 which adds security-related warnings. For example, it warns for
#include <stdio.h>

void foo(char *p)
{
  printf(p);
}
that may be a security hole if the format string came from untrusted input and contains ‘%n’.
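
The usual way to silence the warning, and close the hole, is to pass the untrusted text as an argument to a constant "%s" format. A minimal sketch; print_untrusted and format_untrusted are names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Print untrusted text through a constant format string.
inline void print_untrusted(const char *p)
{
  std::printf("%s", p);
}

// The same idea when formatting into a buffer.
inline int format_untrusted(char *out, std::size_t size, const char *text)
{
  return std::snprintf(out, size, "%s", text);
}

// A "%n" inside the text is copied literally instead of being interpreted.
inline void check_format_untrusted()
{
  char buf[32];
  const int n = format_untrusted(buf, sizeof buf, "100%n done");
  assert(n == 10);
  assert(std::strcmp(buf, "100%n done") == 0);
}
```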

Note: -Wformat=2 was added in GCC 3.0.