Monday, June 19, 2017

A look at range-v3 code generation

I recently saw a Stack Overflow post that compared the speed of std::find_if on a vector vec
auto accumulated_length = 0L;
auto found = std::find_if(vec.begin(), vec.end(),
                          [&](auto const &val) {
                              accumulated_length += val;
                              return to_find < accumulated_length;
                          });
auto const found_index = std::distance(vec.begin(), found);

and the equivalent code using the range-v3 library
auto const found_index = ranges::distance(vec
    | ranges::view::transform(ranges::convert_to<long>{})
    | ranges::view::partial_sum()
    | ranges::view::take_while([=](auto const i) {
          return !(to_find < i);
      }));
Measuring the performance on an Intel Broadwell CPU using the Google benchmark library and this code compiled with the options
-O3 -march=native -std=c++14 -DNDEBUG
gives me the result
Benchmark                Time           CPU Iterations
------------------------------------------------------
BM_std/1024            311 ns        311 ns    2248354
BM_range/1024         2102 ns       2102 ns     332711
for gcc 7.1.0 and
BM_std/1024            317 ns        317 ns    2208547
BM_range/1024          809 ns        809 ns     864328
for clang 4.0.0. There are two obvious questions
  • Why is range-v3 slower than the STL?
  • Why is the difference so much bigger for GCC than for LLVM?

I also wanted to see if the STL added overhead, so I tried a simple C-style for-loop
long i, acc = 0;
for (i = 0; i < len; i++) {
    acc += p[i];
    if (to_find < acc)
        break;
}
found_index = i;
This runs in 439 ns – 40% slower than the STL version! – which adds the question
  • Why is the for-loop slower than the STL version?

Why is the for-loop slower?

GCC is generating the obvious assembly for the for-loop
.L4:
        movslq  (%r8,%rax,4), %rcx
        addq    %rcx, %rdx
        cmpq    %rsi, %rdx
        jg      .L7
.L3:
        addq    $1, %rax
        cmpq    %rdi, %rax
        jl      .L4
.L7:
        ...
I had expected the compiler to generate similar code for std::find_if, and that is what happens if it is used with an input iterator, but libstdc++ has an overload for random-access iterators which partially unrolls the loop
template<typename _RandomAccessIterator, typename _Predicate>
  _RandomAccessIterator
  __find_if(_RandomAccessIterator __first, _RandomAccessIterator __last,
            _Predicate __pred, random_access_iterator_tag)
  {
    typename iterator_traits<_RandomAccessIterator>::difference_type
      __trip_count = (__last - __first) >> 2;

    for (; __trip_count > 0; --__trip_count)
      {
        if (__pred(__first))
          return __first;
        ++__first;

        if (__pred(__first))
          return __first;
        ++__first;

        if (__pred(__first))
          return __first;
        ++__first;

        if (__pred(__first))
          return __first;
        ++__first;
      }

    switch (__last - __first)
      {
      case 3:
        if (__pred(__first))
          return __first;
        ++__first;
      case 2:
        if (__pred(__first))
          return __first;
        ++__first;
      case 1:
        if (__pred(__first))
          return __first;
        ++__first;
      case 0:
      default:
        return __last;
      }
  }
This partial unrolling gets rid of a large fraction of the comparisons and branches, which makes a big difference for this kind of micro-benchmark.

Why does GCC generate slow code for range-v3?

The range-v3 code generated by GCC have a few objects placed on the stack which adds some (useless) memory operations. The reason they are not optimized has to do with how GCC are optimizing structures and the order the optimization passes are being run.

The GCC “Scalar Replacement of Aggregates” (SRA) optimization pass splits structures into their elements. That is,
struct S {
  int a, b, c;
};

struct S s;

s.a = s.b = s.c = 0;
...
is transformed to the equivalent of
int a, b, c;

a = b = c = 0;
...
and the variables are then optimized and placed in registers in the same way as normal non-structure variables.

The compiler cannot split structures that have their address taken as it would then need to do expensive pointer tracking to find how each element is used, so such structures are kept on the stack. The GCC SRA pass is conservative and does not split a structure if any part of it has been captured by a pointer, such as
struct S s;

s.a = s.b = s.c = 0;
int *p = &s.a;
...
that could be split into
int a, b, c;

a = b = c = 0;
int *p = &a;
...
but that is not done by GCC.

It is usually not a problem that address-taking limits SRA as optimization passes such as constant propagation eliminates use of pointers when they are only used locally in a function, so code of the form
struct S s;

int *p = &s.a;
...
*p = 0;
is transformed to
struct S s;

...
s.a = 0;
which can then be optimized by SRA. But this requires that all paths to the use of p pass through the same initialization and that the compiler can see that they pass through the same initialization – we cannot easily eliminate the pointers for code such as
struct S s;
int *p;

if (cond)
    p = &s.a;
...
if (cond)
    *p = 0;
that need the compiler to track values to see that all executions of *p initializes p to &s.a.

And that is how the range-v3 code looks like after templates has been expanded and all functions inlined – the code does different initializations depending on if the range is empty or not and ends up with code segments of the form
if (begin != end) {
    // Initialize some variables
}

...

if (begin != end) {
    // Use the variables
}
I have a hard time trying to follow exactly what range-v3 is trying to do – the code expands to more than 700 functions, so I have only looked at the compiler’s IR after inlining and I do not know exactly how it look in the C++ source code – but the result is that the compiler fails to propagate some addresses due to this issue and three objects (one struct take_while_view and two struct basic_iterator) are still placed on the stack when the last SRA pass has been run.

GCC do eventually manage to simplify the code enough that SRA could eliminate all structures, but that is later in the optimization pipeline, after the last SRA pass has been run. I tested to add an extra late SRA pass – this eliminates the memory operations, and the function runs in 709 ns. Much better, but still only half the speed of the STL version.

Why is range-v3 slower than the STL?

Both GCC and LLVM generate the range-v3 code to something of the form
static long foo(const int *begin, const int *end, long to_find)
{
    long result = 0;
    const int *p = begin;
    if (begin != end) {
        result = *begin;
        while (1) {
            if (p == end)
                break;
            if (to_find < result)
                break;

            p++;
            if (p != end)
                result += *p;
        }
    }
    return p - begin;
}
that does one extra comparison in the loop body compared to the for-loop version. This kind of code is supposed to be simplified by the loop optimizers, but they are running relatively early in the optimization pipeline (partly so that later optimizations may take advantage of the improved loop structure, and partly as many optimizations makes life harder for the loop optimizer) so they are limited by the same issues mentioned in the previous section – that is, I assume the redundant comparison would be eliminated if the range-v3 library improved its handling of empty ranges etc.

Sunday, June 4, 2017

-fipa-pta

My previous blog post had a minimal description of -fipa-pta and I have received several questions about what it actually do. This blog post will try to give some more details...

Points-to analysis

Many optimizations need to know if two operations may access the same memory address. For example, the if-statement in
i = 5;
*p = -1;
if (i < 0)
  do_something();
can be optimized away if *p cannot modify i.

GCC tracks what the pointers may point to using the general ideas from the paper “Efficient Field-sensitive pointer analysis for C”. I will not describe the details – the first few pages of the paper do it better than I can do here – but the principle is that each pointer is represented by a set of locations it may point to, the compiler is generating set constraints representing each statement in the program, and then solving those constraints to get the actual set of locations the pointer may point to.

But this process is expensive, so GCC is normally doing this one function at a time and assumes that called functions may access any memory visible to them.

-fipa-pta

The -fipa-pta optimization takes the bodies of the called functions into account when doing the analysis, so compiling
void __attribute__((noinline))
bar(int *x, int *y)
{
  *x = *y;
}

int foo(void)
{
  int a, b = 5;
  bar(&a, &b);
  return b + 10;
}
with -fipa-pta makes the compiler see that bar does not modify b, and the compiler optimizes foo by changing b+10 to 15
int foo(void)
{
  int a, b = 5;
  bar(&a, &b);
  return 15;
}

A more relevant example is the “slow” code from the “Integer division is slow” blog post
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());
std::uniform_int_distribution<int> theIntDist(0, 99);

for (int i = 0; i < 1000000000; i++) {
  volatile auto r = theIntDist(randGenerator);
}
Compiling this with -fipa-pta makes the compiler see that theIntDist is not modified within the loop, and the inlined code can thus be constant-folded in the same way as the “fast” version – with the result that it runs four times faster.