Monday, October 17, 2016

Inlining — main is special

I wrote in my previous post that I assumed Jason compiled using the -O2 optimization level for the example in his CppCon 2016 talk, as it worked for me when I used -O3. But I was wrong – Jason used -O3, but his example was slightly different compared to mine. I used the function
int bar()
{
    return std::string("a").size() + std::string("b").size();
}
that optimizes to just returning a constant, while Jason used
int main()
{
    return std::string("a").size() + std::string("b").size();
}
which does not optimize. The only difference is the name of the function...

The reason for the difference is that GCC knows that main is special — it has the property that it is called only once. That is, it is a cold function, which makes the inlining less aggressive.

GCC is propagating the "called only once" information where possible, so functions for which the compiler can see all callers (such as static functions, functions in an anonymous namespace, or when compiling using -fwhole-program) are also marked as "called only once" if they are called exactly once from such a function. For example, the compiler can see that bar is called only once in
static int bar()
{
   return std::string("a").size() + std::string("b").size();
}

int main()
{
    return bar();
}
and the string code is not inlined. Removing static prevents GCC from marking bar as called only once (as it could be called from outside the translation unit), and the string code will now be inlined.

On the other hand, if the "called only once" function contains loops, then GCC can infer that it makes sense to optimize the content of the loop as it is executed multiple times. So for example
int main()
{
    int k = 0;
    for (int i = 0; i < 10000; i++)
        k += std::string("a").size() + std::string("b").size();
    return k;
}
gets the strings inlined, and it optimizes to
main:
    mov     eax, 20000
    ret

Sunday, October 16, 2016

C++ and code inlining

Jason Turner mentions in his CppCon 2016 talk that GCC optimizes the function
int foo()
{
    return std::string("a").size();
}
to just a return of a constant value
_Z3foov:
    movl    $1, %eax
    ret
while
int bar()
{
    return std::string("a").size() + std::string("b").size();
}
generates a less optimized result involving calls to constructors, destructors, etc. The difference between the two cases comes from how GCC handles inlining.

The basic idea behind the GCC inliner is to inline greedily as long as it estimates that the code size does not increase too much (where the growth limit depends on optimization level, if the function is hot/cold, etc.). One important special case is functions that are called exactly once — they do not increase code size if they are inlined, as the inlining just moves the code from the callee into the caller.

STL usage often expands to a large amount of code, for example, the string constructor used in the examples above calls
template<typename _CharT, typename _Traits, typename _Alloc>
  template<typename _InIterator>
    void
    basic_string<_CharT, _Traits, _Alloc>::
    _M_construct(_InIterator __beg, _InIterator __end,
                 std::forward_iterator_tag)
    {
      // NB: Not required, but considered best practice.
      if (__gnu_cxx::__is_null_pointer(__beg) && __beg != __end)
        std::__throw_logic_error(__N("basic_string::"
                                     "_M_construct null not valid"));

      size_type __dnew =
        static_cast<size_type>(std::distance(__beg, __end));

      if (__dnew > size_type(_S_local_capacity))
        {
          _M_data(_M_create(__dnew, size_type(0)));
          _M_capacity(__dnew);
        }

      // Check for out_of_range and length_error exceptions.
      __try
        { this->_S_copy_chars(_M_data(), __beg, __end); }
      __catch(...)
        {
          _M_dispose();
          __throw_exception_again;
        }

      _M_set_length(__dnew);
    }
This in turn calls several other functions, and more than 50 function calls need to be inlined in order to fully optimize foo. The temporary code increase of inlinling before the code gets optimized is over the limit allowed by -O2, but foo has exactly one string, so the compiler can inline this without increasing code size, and the compiler can eventually eliminate everything. But the bar function does not get the constructors inlined, as it constructs two strings...

This can have surprising effects when trying to understand how well code optimize. For example, we saw that foo optimizes to return a constant value, so let us add one more identical function
int foo()
{
    return std::string("a").size();
}

int foo2()
{
    return std::string("b").size();
}
We now construct two strings, which prevents inlining. So neither foo nor foo2 will be fully optimized when compiled with -O2.

The -O3 optimization level allows more inlining, so all examples in this blog post optimize fully when compiled using -O3 (so I assume Jason used -O2 for his examples). In general, -O2 does optimizations that do not involve a space-speed tradeoff, while -O3 allows the code size to increase if the resulting code is faster (which is good if you have enough cache).

To conclude:
  • It is not enough to look at trivial code snippets if you want to verify that your complex templated code is being optimized as you expect.
  • You may want to use -O3 rather than -O2 when compiling code for modern architectures.