Wednesday, October 20, 2021

This blog has moved

I have started blogging again!

But I am too annoyed at Blogger, so the new posts will be published in my new blog: https://kristerw.github.io/.

Sunday, February 2, 2020

Watching for software inefficiencies with Valgrind

Software inefficiencies often show up as wasteful memory operations, and many research tools can detect various cases of this. One such tool is described in the paper “Watching for Software Inefficiencies with Witch”, which uses the Intel CPU performance monitoring unit (PMU) and debug registers to find
  • dead stores – writing a value that is never read,
  • silent stores – writing the same value that was already present at the memory location,
  • silent loads – reading an unchanged value from memory that was previously read.
Most of the dead/silent memory operations are not problematic, and the analyzed program would most likely become slower if we tried to get rid of them by making the code more complicated. But there are cases, such as mostly dead stores of the form
*p = expensive_computation();
where it may make sense to reorganize the code to only do the expensive computation when the value is needed.
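One way to do that (a minimal sketch – expensive_computation and the surrounding structure are made up for illustration, not taken from any real program) is to compute the value lazily:
struct lazy_value {
  int valid;    /* 0 until the value has been computed */
  int value;
};

/* Instead of unconditionally doing "*p = expensive_computation();",
   callers ask for the value only when they actually need it. */
static int get_value(struct lazy_value *lv)
{
  if (!lv->valid) {
    lv->value = expensive_computation();
    lv->valid = 1;
  }
  return lv->value;
}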

I thought it would be fun to play with this, but the “Witch” tools need a custom Linux kernel which I found too complicated to set up… But I have always wanted to learn how to write a Valgrind tool, and writing a Witch-like tool seemed like a good beginner project (although the main contribution in the “Witch” paper is doing this without performance overhead – that is definitely not the case when using Valgrind…).

The deadstores Valgrind tool

My Valgrind tool, “deadstores”, gathers data from each instruction that reads or writes memory:
  • For stores, it records how many bytes were written, and how many of those bytes were read before the memory was overwritten (or made invalid by calling free, etc.). It also records how many times the instruction was executed, and how many of those wrote the same value as was there before.
  • For loads, it records how many times the instruction was executed, and how many of those read a value that had previously been read.
The reason dead stores are tracked on a byte granularity is that I want to be able to find obsolete, “write-only”, structure elements that are never read. For example, if we have a structure such as
struct {
  int a, b, c, d;
} foo;
and code that is initializing the structure as
memset(&foo, 0, sizeof(foo));
then the compiler will usually optimize the memset to one SIMD instruction that is clearing all 128 bits. If the program now only reads three of the values, then we will see that 4 of the 16 bytes are dead – instruction granularity would not detect this as the stored value is (partly) used.

Silent loads and stores are, however, tracked on instruction granularity. The reason is that byte granularity would report spurious silent loads/stores. For example, if sum is a 32-bit int variable containing the value 15, and we add 10 to it
sum += 10;
then this would be reported as 3 silent stores if we used byte granularity (as three of the bytes are 0 both before and after the operation).

How to use the tool

The deadstores tool is used in the same way as other Valgrind tools
valgrind --tool=deadstores [valgrind-options] your-prog [your-prog-options]
The program to analyze must be built with debug symbols (-g or better), and optimizations should be enabled in order to produce relevant data. You also need -fno-omit-frame-pointer if you are passing --use-stack-trace=yes (see below) to the tool.

The result is written to a JSON file containing all data collected during the run. There is an example python script script/print_result.py that extracts the top \(n\) instructions from each of “dead stores”, “silent stores”, and “silent loads”.

The name of the output file is by default deadstores.out.<pid> (where <pid> is the program’s process ID), but it can be changed with --deadstores-out-file=<filename>. The %p, %q, and %n format specifiers can be used to embed the process ID, the contents of an environment variable, and/or a sequence number in the name, in the same way as for the core option --log-file.

The tool by default tracks the instructions that read/write the memory, but this is often not too useful as it tends to report things like
0x054b0a25:
  bytes_written: 11916560
  bytes_read:      248126
  bytes_dead:    11668434
  nof_stores:      744785
  nof_silent:       19326
    at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
The dead stores here come from the memset implementation in libc, and the values are the sums of all calls to memset in the program. This is not that helpful if we want to find dead stores we can eliminate, but the tool can instead track the loads/stores by the stack trace, which splits the reported data for an instruction according to where it was called from
0x054b0a25:
  bytes_written:  1707072
  bytes_read:        1120
  bytes_dead:     1705952
  nof_stores:      106692
  nof_silent:        2500
    at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
    by 0x00b317aa: poison_pages() (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-page.c:2125)
    by 0x00b3362c: ggc_collect() (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-page.c:2226)
    by 0x00bb82d4: cgraph_node::finalize_function(tree_node*, bool) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cgraphunit.c:494)
    by 0x00a60e43: expand_or_defer_fn(tree_node*) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cp/semantics.c:4486)

0x054b0a25:
  bytes_written:   521808
  bytes_read:          12
  bytes_dead:      521796
  nof_stores:       32613
  nof_silent:           0
    at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
    by 0x00d127fe: ggc_internal_cleared_alloc(unsigned long, void (*)(void*), unsigned long, unsigned long) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-common.c:118)
    by 0x012b796e: make_node(tree_code) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc.h:143)
    by 0x012b7c96: build_tree_list(tree_node*, tree_node*) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/tree.c:3184)
    by 0x009e32d6: cp_parser_parameter_declaration_list(cp_parser*, int) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cp/parser.c:22525)

...
This is enabled by --use-stack-trace=yes, which in general makes the result much more useful (and the tool much slower…). The program to analyze must be compiled with frame pointer support (-fno-omit-frame-pointer) in order to get a reliable result.

Most of the Valgrind core options are also supported, such as --trace-children=yes which is useful when analyzing programs (such as gcc) that start subprocesses using the exec system call.

Limitations

Valgrind is not an optimal framework for writing this kind of tool. The reason is that it works by translating the assembly instructions to an internal, target-independent, representation before it is analyzed/instrumented by the tools, and this translation does not preserve all details of the instructions. This is not a problem for most instructions – for example, a simple addition instruction
add %rcx,%rax
is translated in a natural way to an IR instruction adding two values
t10 = Add64(t7,t4)
But some instructions are translated in a way that causes problems. For example, the bit test instruction
bt %rax,%r8
is translated to a sequence that stores 8 bytes at an unused address on the stack and reads back one byte of the result
t75 = GET:I64(48)          ; <-- Get stack pointer
t74 = Sub64(t75,0x120:I64)
PUT(48) = t74
STle(t74) = t73            ; <-- Store 8 bytes
t18 = And64(t36,0x3F:I64)
t78 = Sar64(t18,0x3:I8)
t77 = Add64(t74,t78)
t80 = And64(t18,0x7:I64)
t112 = 64to8(t80)
t79 = t112
t15 = LDle:I8(t77)         ; <-- Load 1 byte
t84 = GET:I64(168)
t113 = amd64g_calculate_rflags_all[mcx=0x9]{0x581732b0}(0x13:I64,t61,0x0:I64,t84):I64
t85 = t113
t114 = 8Uto64(t15)
t89 = t114
t88 = Shr64(t89,t79)
t87 = And64(t88,0x1:I64)
t90 = And64(t85,0x8D4:I64)
t86 = Or64(t90,t87)
PUT(144) = 0x0:I64
PUT(160) = 0x0:I64
PUT(152) = t86
PUT(168) = 0x0:I64
t91 = Add64(t74,0x120:I64)
PUT(48) = t91
This makes the deadstores tool treat this as an 8-byte store where 7 of those bytes are dead… I have worked around this particular case (by ignoring stores to offset 0 from the stack pointer – this holds the pointer to the stack frame for normal functions, so these memory accesses are not relevant for the tool anyway), but there may be additional cases I have not found yet…

One other limitation of the deadstores tool is that the tracking of silent loads only handles memory addresses that have been stored to, so it does not detect silent loads in read-only data segments. This could be fixed by tracking valid memory in the same way as done by the memcheck tool.

Implementation details

Structures

The tool keeps track of statistics for each instruction reading/writing memory (number of bytes written, number of times executed, etc.) and the status of each byte written (to track if the stored value is later read or re-written with the same value).

The statistics for each instruction is kept in the node structure
struct node {
  handle parent;
  handle left;
  handle right;
  enum color color;
  struct key key;

  // Store
  Long bytes_written;
  Long bytes_read;
  Long nof_stores;
  Long nof_silent_stores;

  // Load
  Long nof_loads;
  Long nof_silent_loads;
};
The node structures are stored in an array called nodes, and the nodes are organized as a red-black tree keyed by the address of the instruction (or rather, “generalized address” – it contains the top 5 addresses of the stack trace when --use-stack-trace=yes is used). The tree structure is managed by the parent, left, and right fields – these are conceptually pointers to node structures, but we are using handles (that are just the index into the nodes array) to make them valid even if we need to realloc the nodes array to increase its size.
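A minimal sketch of what a lookup in the tree then looks like (this is not the actual code – root, NO_NODE, and compare_keys are hypothetical names used only to illustrate the handle indirection):
static handle find_node(const struct key *key)
{
  handle h = root;
  while (h != NO_NODE) {
    int cmp = compare_keys(key, &nodes[h].key);
    if (cmp == 0)
      return h;
    h = (cmp < 0) ? nodes[h].left : nodes[h].right;
  }
  return NO_NODE;
}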

Each byte accessed by the program gets an addr_info structure for that address (these are allocated in blocks of 1<<20 addresses, and the blocks are stored in a hash table addr_info_table).
struct addr_info {
  unsigned store_instr_h : 31;
  unsigned is_written : 1;
};
The addr_info structure contains 1 bit telling if the byte is valid (that is, if it has been written to, and not been invalidated by the memory being freed, etc.), and a handle to the instruction that wrote to this byte. Programs do not use that many load/store instructions, so the handle is stored in 31 bits to make the structure fit in 32 bits. This makes a big difference when tracing memory-hungry programs...
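The mapping from a byte address to its addr_info entry can be sketched as below (this is not the actual implementation – table_lookup and table_insert are hypothetical helpers for the hash table, and VG_(calloc) is Valgrind’s allocation function):
static struct addr_info *addr2addr_info(Addr addr)
{
  /* Each block covers 1<<20 consecutive byte addresses. */
  Addr block_key = addr >> 20;
  struct addr_info *block = table_lookup(addr_info_table, block_key);
  if (block == NULL) {
    block = VG_(calloc)("deadstores.block", 1 << 20, sizeof(struct addr_info));
    table_insert(addr_info_table, block_key, block);
  }
  return &block[addr & ((1 << 20) - 1)];
}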

Instrumentation

Valgrind works by transforming the binary code to its own internal representation (IR), and the tools can then add instrumentation to the IR before Valgrind JIT-compiles and executes the code. The deadstores tool adds its instrumentation in the function ds_instrument.

The code in ds_instrument iterates through the IR to find the instructions that access memory – the most important are Ist_WrTmp (load) and Ist_Store (store), but there are a few others (such as the atomic compare-and-swap Ist_CAS). Each such instruction is instrumented using the Ist_Dirty IR instruction that is used to call a helper function. The deadstores tool has one such helper function for tracking each of “dead stores”, “silent stores”, and “loads” (each instruction that writes memory gets both the “dead stores” and “silent stores” helper functions, and the “silent stores” helper function must be called first).

The helper functions used by the deadstores tool take three parameters – the address and size of the memory access, and the address of the instruction. The function that tracks dead stores updates the counters for the instruction (that the instruction has been executed and has written size bytes) and the status of each written byte (that it has been written, and which instruction wrote the value).
static void dead_store_tracker(Addr addr, SizeT size, Addr iaddr)
{
  handle h = iaddr2handle(iaddr);
  nodes[h].nof_stores += 1;
  nodes[h].bytes_written += size;

  for (SizeT i = 0; i < size; i++) {
    struct addr_info *ai = addr2addr_info(addr + i);
    ai->is_written = True;
    ai->store_instr_h = h;
  }
}
The load tracking checks each byte’s store_instr_h to see which instruction wrote this byte. If this is different from 0, then the instruction corresponding to the handle has its bytes_read counter incremented, and the byte’s store_instr_h is set to 0 to mark it as read. If all of the bytes were written but have a zero store_instr_h, then the load is silent, and the load instruction’s counter is updated accordingly.
static void load_tracker(Addr addr, SizeT size, Addr iaddr)
{
  SizeT silent_load = 0;
  for (SizeT i = 0; i < size; i++) {
    struct addr_info *ai = addr2addr_info(addr + i);
    if (ai->store_instr_h != 0) {
      nodes[ai->store_instr_h].bytes_read++;
      ai->store_instr_h = 0;
    } else if (ai->is_written) {
      silent_load += 1;
    }
  }

  handle h = iaddr2handle(iaddr);
  nodes[h].nof_loads += 1;
  if (silent_load == size) {
    nodes[h].nof_silent_loads += 1;
  }
}
The tracking of silent stores needs a bit more work, as it must compare the original value in memory with the newly written value. This cannot be done in the helper function, so it is done by inserting new IR that does the comparison, and the helper function is not called unless the values are identical. This functionality is partially supported by Valgrind – the Ist_Dirty instruction has an optional guard parameter taking an IR expression determining if the function is to be called, so the only thing needed is to generate the comparison and pass it as the guard. And this comparison is straightforward to emit – if st is an IRStmt pointer to a 32-bit Ist_Store, then the IR for the comparison can be emitted as
IRExpr *data = st->Ist.Store.data;
IRType type = typeOfIRExpr(tyenv, data);
IRExpr *addr = st->Ist.Store.addr;
tl_assert(type == Ity_I32);
IRExpr *load = emit_load(sb, type, addr);
IRExpr *guard = emit_binop(sb, Iop_CmpEQ32, data, load);
where the emit_ functions are simple wrappers around the Valgrind API to get rid of some boilerplate code (sb is the IR superblock containing the instructions – this is one of the inputs to ds_instrument).
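The guard is then attached to the dirty call that invokes the helper function. Roughly (a simplified sketch – the real code must handle other store sizes, and the argument setup is abbreviated), this looks like
IRExpr **argv = mkIRExprVec_3(addr,
                              mkIRExpr_HWord(size),
                              mkIRExpr_HWord(iaddr));
IRDirty *di = unsafeIRDirty_0_N(3, "silent_store_tracker",
                                VG_(fnptr_to_fnentry)(silent_store_tracker),
                                argv);
di->guard = guard;   /* only call the helper when the old and new values are equal */
addStmtToIRSB(sb, IRStmt_Dirty(di));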

The only thing left to do for the silent store tracker is to check that all addresses have been written to (in order to prevent reporting spurious silent stores for, for example, the first write to malloced memory that by luck happens to contain the same value that is stored by the instruction), and update the instruction’s nof_silent_stores counter if they have.
static void silent_store_tracker(Addr addr, SizeT size, Addr iaddr)
{
  Bool is_written = True;
  for (SizeT i = 0; i < size; i++) {
    struct addr_info *ai = addr2addr_info(addr + i);
    is_written = is_written && ai->is_written;
  }
  if (is_written) {
    handle h = iaddr2handle(iaddr);
    nodes[h].nof_silent_stores += 1;
  }
}

Callbacks

It is possible to register callbacks that Valgrind calls for various things that cannot be done in the IR instrumentation (or that most tools must do, so it is better to have a common mechanism for this). Some examples of this are track_new_mem_stack and track_die_mem_stack that are called each time the stack pointer is decreased/increased, or the needs_malloc_replacement callbacks that are called when the program is calling malloc, free, etc.

The deadstores tool uses these to mark memory as invalid (that is, clearing the is_written flag) to prevent counters from being incorrectly updated by the silent load/store tracking when, for example, previously freed memory is returned from malloc.

TODO

There are many things that could be improved:
  • The tool is slow. I have not done any profiling, but I would guess most of the time is spent in iaddr2handle and addr2addr_info. It is easy to improve how this is done.
  • Add a test suite.
  • Better tracking of valid memory (as described in the “limitations” section above).
  • Better tracking of variables and structures (to automatically find write-only variables and structure elements).
  • Finding other types of inefficiencies, such as memcpy where the program could have used the original data.
But I do not expect to do this anytime soon (if ever…) as my main goal was just to learn how to write a Valgrind tool…

Sunday, April 28, 2019

How LLVM optimizes power sums

LLVM optimizes power sums, such as
int sum(int count)
{
  int result = 0;

  for (int j = 0; j < count; ++j)
    result += j*j;

  return result;
}
to code that calculates the result without a loop (godbolt)
sum(int):
        test    edi, edi
        jle     .LBB0_1
        lea     eax, [rdi - 1]
        lea     ecx, [rdi - 2]
        imul    rcx, rax
        lea     eax, [rdi - 3]
        imul    rax, rcx
        shr     rax
        imul    eax, eax, 1431655766
        add     eax, edi
        shr     rcx
        lea     ecx, [rcx + 2*rcx]
        lea     eax, [rax + rcx]
        add     eax, -1
        ret
.LBB0_1:
        xor     eax, eax
        ret
It handles more complex cases too (godbolt) – that is, the optimization is not just a simple pattern matcher. This post will show how the optimization is done.

Loop analysis – scalar evolution

There are many cases where compilers need to track how values are updated within a loop. For example, the loop vectorizer needs to check that the pointers are moved to the adjacent element in the next iteration, and check that no other pointer indexing may alias the range we are vectorizing.

Both GCC and LLVM do this in the same way in their scalar evolution passes, where each variable at iteration \(i\) (we start enumerating iterations from \(0\)) is represented as a function \(f_0(i)\) defined as a linear recurrence of the form
\[f_j(i) =
\begin{cases}
\phi_j & \text{if $i = 0$} \\
f_j(i-1) \odot_{j+1} f_{j+1}(i-1) & \text{if $i > 0$}
\end{cases}\] where \(\odot\in\{+,*\}\).

Example 1

The simplest case is a loop such as
void foo(int m, int *p)
{
  for (int j = 0; j < m; j++)
    *p++ = j;
}
The loop writes \(0\) to *p++ in the first iteration, \(1\) in the second, etc. So we can express the value written at iteration \(i\) as\[f(i) =
\begin{cases}
0 & \text{if $i = 0$} \\
f(i-1) + 1 & \text{if $i > 0$}
\end{cases}\]

Example 2

Polynomials in the iteration variable can also be expressed in this form.
void foo(int m, int k, int *p)
{
  for (int j = 0; j < m; j++)
    *p++ = j*j*j - 2*j*j + k*j + 7;
}
We will see below how to build the functions, but the result for the value stored in this loop is \[\begin{align}f_2(i) & =
\begin{cases}
2\phantom{f_0(i-1) + f_1(i-1)} & \text{if $i = 0$} \\
f_2(i-1) + 6 & \text{if $i > 0$}
\end{cases}\\
f_1(i) & =
\begin{cases}
k-1 & \text{if $i = 0$} \\
f_1(i-1) + f_2(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases}\\
f(i) = f_0(i) & =
\begin{cases}
7 & \text{if $i = 0$} \\
f_0(i-1) + f_1(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases}\end{align}\] One optimization we can see directly from these functions is that the value can be calculated by just three additions within the loop
void foo(int m, int k, int *p)
{
  int t0 = 7;
  int t1 = k-1;
  int t2 = 2;
  for (int j = 0; j < m; j++) {
    *p++ = t0;
    t0 = t0 + t1;
    t1 = t1 + t2;
    t2 = t2 + 6;
  }
}
which is a useful optimization for architectures where multiplication is expensive. This kind of code is, however, uncommon, so most compilers do not do this optimization – but they usually do it for simpler cases, such as
void foo(int m, int k, int *p)
{
  for (int j = 0; j < m; j++)
    *p++ = k*j + 7;
}
as constructs of the form k*j+7 are common in address calculations.
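Applying the same transformation as above, the multiplication can be replaced by one addition per iteration (a sketch of what the strength-reduced loop would look like):
void foo(int m, int k, int *p)
{
  int t = 7;                 /* k*0 + 7 */
  for (int j = 0; j < m; j++) {
    *p++ = t;
    t = t + k;               /* k*(j+1) + 7 */
  }
}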

Chains of recurrences

It is cumbersome to write the recursive functions all the time, so the functions are usually written in the form \(\{\phi_j, \odot_{j+1}, f_{j+1}\}\). For example \[\begin{align}f_2(i) & =
\begin{cases}
2\phantom{f_0(i-1) + f_1(i-1)} & \text{if $i = 0$} \\
f_2(i-1) + 6 & \text{if $i > 0$}
\end{cases} \phantom{xx}\text{is written as $\{2,+,6\}$}\\
f_1(i) & =
\begin{cases}
k-1 & \text{if $i = 0$} \\
f_1(i-1) + f_2(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases} \phantom{xx}\text{is written as $\{k-1,+,f_2\}$}\\
f(i) = f_0(i) & =
\begin{cases}
7 & \text{if $i = 0$} \\
f_0(i-1) + f_1(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases} \phantom{xx}\text{is written as $\{7,+,f_1\}$}\end{align}\] These can be chained, so \(f(i)\) can be written as a chain of recurrences (CR) \(\{7,+,\{k-1,+,\{2,+,6\}\}\}\). The inner curly braces are redundant, so the CR is usually written as a tuple \(\{7,+,k-1,+,2,+,6\}\).

Building the chains of recurrences

The chains of recurrences are built by iterating over the code and calculating the CR for the result of each operation (or marking it as unknown if we cannot handle the operation), using simplification rules such as\[\begin{align}c * \{\phi_0, +, \phi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{c * \phi_0, +, c * \phi_1\} \\
\{\phi_0, +, \phi_1\} + \{\psi_0, +, \psi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0 + \psi_0, +, \phi_1 + \psi_1\} \\
\{\phi_0, +, \phi_1\}* \{\psi_0, +, \psi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0 * \psi_0, +, \psi_1 * \{\phi_0, +, \phi_1\} + \phi_1 * \{\psi_0, +, \psi_1\} + \phi_1*\psi_1\} \\
\{\phi_0, +, \phi_1,+,0\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0, +, \phi_1\}\end{align}\] So for the loop from the sum function
for (int j = 0; j < count; ++j)
  result += j*j;
we start with j which we know has the CR \(\{0,+,1\}\) per Example 1. This is then used as j*j when calculating result, so we calculate the CR for j*j using the simplification rules as \[\begin{align}j*j& = \{0,+,1\} * \{0,+,1\} \\
& = \{0 * 0, +, 1 * \{0, +,1\} + 1 * \{0, +, 1\} + 1*1\} \\
& = \{0, +, 1,+,2\}\end{align}\] Similar calculations for result give us the CR \(\{0,+,0,+,1,+,2\}\) for the value at the beginning of the iteration, and \(\{0,+,1,+,3,+,2\}\) after adding j*j.
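Writing out that last addition using the rules above (the shorter CR is first padded with a trailing zero, which the last rule allows): \[\begin{align}\{0,+,0,+,1,+,2\} + \{0,+,1,+,2,+,0\} & = \{0+0,+,0+1,+,1+2,+,2+0\} \\ & = \{0,+,1,+,3,+,2\}\end{align}\]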

Doing the optimization

The optimization is done during induction variable simplification, and LLVM has transformed the function to a form more convenient for analysis and optimization
int sum(int count)
{
  int result = 0;

  if (count > 0) {
    int j = 0;
    do {
      result = result + j*j;
      ++j;
    } while (j < count);
  }

  return result;
}
or as it looks like in the LLVM IR
define i32 @sum(i32) {
  %2 = icmp sgt i32 %0, 0
  br i1 %2, label %3, label %6

; <label>:3:
  br label %8

; <label>:4:
  %5 = phi i32 [ %12, %8 ]
  br label %6

; <label>:6:
  %7 = phi i32 [ 0, %1 ], [ %5, %4 ]
  ret i32 %7

; <label>:8:
  %9 = phi i32 [ %13, %8 ], [ 0, %3 ]     ; {0,+,1}
  %10 = phi i32 [ %12, %8 ], [ 0, %3 ]    ; {0,+,0,+,1,+,2}
  %11 = mul nsw i32 %9, %9                ; {0,+,1,+,2}
  %12 = add nuw nsw i32 %11, %10          ; {0,+,1,+,3,+,2}
  %13 = add nuw nsw i32 %9, 1             ; {1,+,1}
  %14 = icmp slt i32 %13, %0
  br i1 %14, label %8, label %4
}
The compiler can see that the function returns 0 if count <= 0, otherwise it returns the value of result at loop iteration count-1.

One nice property of the chains of recurrences is that it is easy to calculate the value at a specific iteration – if we have a CR \(\{\phi_0,+,\phi_1,+,\ldots,+,\phi_n\}\), then the value at iteration \(i\) can be calculated as \[\begin{align}f(i) & = \sum_{j=0}^{n}\phi_j{i \choose j} \\ & = \phi_0 + \phi_1i + \phi_2{i(i-1)\over 2!} + \ldots + \phi_n{i(i-1)\cdots(i-n+1)\over n!}\end{align}\] Inserting the values for the CR \(\{0,+,1,+,3,+,2\}\) describing result gives us \[f(i) = i + {3i(i-1)\over 2} + {i(i-1)(i-2) \over 3}\] The compiler now only needs to insert code that calculates this with \(i =\) count-1, after the loop
result = count-1 + 3*(count-1)*(count-2)/2 + (count-1)*(count-2)*(count-3)/3;
but it needs to take some care to calculate in the correct precision (as temporary values may not fit in 32-bit integers). Integer division is slow, so it also does tricks to replace the divisions with multiplication and shift instructions. The result is the LLVM IR
  %4 = add i32 %0, -1
  %5 = zext i32 %4 to i33
  %6 = add i32 %0, -2
  %7 = zext i32 %6 to i33
  %8 = mul i33 %5, %7
  %9 = add i32 %0, -3
  %10 = zext i32 %9 to i33
  %11 = mul i33 %8, %10
  %12 = lshr i33 %11, 1
  %13 = trunc i33 %12 to i32
  %14 = mul i32 %13, 1431655766
  %15 = add i32 %14, %0
  %16 = lshr i33 %8, 1
  %17 = trunc i33 %16 to i32
  %18 = mul i32 %17, 3
  %19 = add i32 %15, %18
  %20 = add i32 %19, -1
Inserting this makes the loop become dead, so it is later removed by dead code elimination, and we eventually end up with the code
sum(int):
        test    edi, edi
        jle     .LBB0_1
        lea     eax, [rdi - 1]
        lea     ecx, [rdi - 2]
        imul    rcx, rax
        lea     eax, [rdi - 3]
        imul    rax, rcx
        shr     rax
        imul    eax, eax, 1431655766
        add     eax, edi
        shr     rcx
        lea     ecx, [rcx + 2*rcx]
        lea     eax, [rax + rcx]
        add     eax, -1
        ret
.LBB0_1:
        xor     eax, eax
        ret

Performance

This optimization is not always profitable. For example,
int sum(int count)
{
  int result = 0;

  for (int j = 0; j < count; ++j)
    result += j*j*j*j*j*j;

  return result;
}
is calculated by three 32-bit multiplications and one addition within the loop, while the optimized version needs six 64-bit multiplications, five 32-bit multiplications, and a slew of other instructions (godbolt), so the optimized version is slower for small values of count. I benchmarked on my PC, and count must be larger than 5 for the optimized version to be faster than the loop. Smaller CPUs with, for example, more expensive 64-bit multiplication, will need a higher count for the optimization to help. And CPUs not having instructions for 64-bit multiplication (godbolt) will need a much higher count.

One problem with this optimization is that it is hard for developers to make the compiler generate a loop for this if they know that the majority of values used in reality are small enough for the loop to be the fastest option. GCC does, therefore, not replace the final value of a loop if the expression is expensive
/* Do not emit expensive expressions.  The rationale is that
   when someone writes a code like

   while (n > 45) n -= 45;

   he probably knows that n is not large, and does not want it
   to be turned into n %= 45.  */
|| expression_expensive_p (def))
So GCC not doing this optimization is a feature – not a bug.

Further readings

Chains of recurrences:
Loop optimizations using chains of recurrences:
Optimizing division using multiplication and shift instructions:

Updated: The original post incorrectly called the power sums “geometric sums”.

Saturday, January 26, 2019

Building GCC 1.27

GCC 1.27 was released in 1988 and is the first version of GCC supporting the x86 CPU. I thought it would be fun to make it work on my desktop computer.

Mikhail Maltsev wrote a great blog post about this a while back, “Building and using a 29-year-old compiler on a modern system”. I used Mikhail’s work as a starting point, but I built on a 64-bit Ubuntu system, so I needed to update paths and as/ld options for running on a 64-bit OS, and I had much bigger problems with the ancient GCC not understanding the system headers. I also enabled the DBX debug format instead of the UNIX/32V SDB format that GDB does not understand. But I did not need to make that many changes to Mikhail’s patch.

It is amazing how well things work – the modern assembler, linker, and debugger handle the code generated by GCC 1.27 without any problems. And command-line options such as -O, -E, -S, -c, -g, -W, -pedantic, and -fomit-frame-pointer do what you expect from using a modern GCC. All the options are documented in the manual page – you can format the text by passing the gcc.1 file to man as
man -l gcc.1

How to build GCC 1.27 on Ubuntu

I have built the compiler on 64-bit Ubuntu Desktop 16.04 as described below.

Prerequisites

GCC 1.27 is a 32-bit program, so we need to install 32-bit compiler and runtime support
sudo apt install gcc-multilib

Downloading and preparing the source code

Download the source code and the patch, and apply the patch as
wget https://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.27.tar.bz2
wget https://gist.github.com/kristerw/b854b6d285e678452a44a6bcbf7ef86f/raw/gcc-1.27.patch
tar xf gcc-1.27.tar.bz2 
cd gcc-1.27
patch -p1 < ../gcc-1.27.patch

Configuring the source code

The compiler is configured by setting up symbolic links to the correct configuration files
ln -s config-i386v.h config.h
ln -s tm-i386v.h tm.h
ln -s i386.md md
ln -s output-i386.c aux-output.c
You may want to change where the compiler is installed by updating bindir and libdir in the Makefile. I set them to
bindir = /home/kristerw/compilers/gcc-1.27/bin
libdir = /home/kristerw/compilers/gcc-1.27/lib

Build and install

The compiler is built in two (or three) stages. We start by building it using the system compiler
make
We then build it again using the newly built compiler
make stage1
make CC=stage1/gcc CFLAGS="-O -Bstage1/ -Iinclude"
As a third, optional, step, we build it again using the second compiler and check that the resulting binaries are identical to the second compiler’s (if not, then the compiler has miscompiled itself).
make stage2
make CC=stage2/gcc CFLAGS="-O -Bstage2/ -Iinclude"
diff cpp stage2/cpp
diff gcc stage2/gcc
diff cc1 stage2/cc1
We are now ready to install the compiler
make install
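To check that the installed compiler works, we can compile a trivial program (a sketch – adjust the path to the bindir chosen above):
#include <stdio.h>

int main()
{
  printf("hello from gcc 1.27\n");
  return 0;
}
for example as /home/kristerw/compilers/gcc-1.27/bin/gcc -O -g hello.c -o hello.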

Saturday, December 29, 2018

Commodore 64 programming

I have done some Commodore 64 programming over the holidays. The C64 is old, but I think there are things that can be learned from dealing with all the hardware limitations (although there may be better ways of getting this experience with more recent platforms...)

C64 Hardware

The 6502 instruction set consists of 56 very simple instructions, and the CPU has been completely reverse engineered, so you can follow how it does its work – Visual 6502 even lets you visualize a transistor-level simulation of the CPU in the browser! The details of the CPU are described in the talk “Reverse Engineering the MOS 6502 CPU”:


The rest of the C64 hardware is also well described by a talk – “The Ultimate Commodore 64 Talk”:


There is lots of information about the C64 available on the net, but I have not found any convenient place that collects all the relevant information... But everything I needed (including examples/tutorials) was available at C64 Wiki and Codebase 64. The details of the Video Interface Chip are documented in “The MOS 6567/6569 video controller (VIC-II) and its application in the Commodore 64”, and Andre Weissflog’s chip emulators contain very readable code if you want clarification on what the various hardware registers do.

Getting started – raster bars

My first test was to implement two raster bars and some sprites moving around, just to verify that I understood the basics.


Much of C64 programming centers on working around hardware limitations by timing the code carefully. For example, the raster bars are drawn in the border of the screen, but the border can only have one color. It is, however, possible to modify the border color while the hardware is drawing the screen (the C64 hardware is designed for old CRT monitors that draw the picture one line at a time using an electron beam), so we can draw the raster bars by changing the border color exactly when a new line starts!

I had thought the raster bars would be trivial (just enable an interrupt on a line, and change the colors), but the C64 is not fast enough for this – it takes 9-16 cycles to enter the IRQ, so we are already a bit into the screen when we can change the color. And there are only 63 CPU cycles for each line, so we don’t have the time to set up a new interrupt for the next line anyway. We, therefore, need to first synchronize our code with the start of a line, and then write the code (using NOPs etc. to pad it out) so that we change the colors exactly every 63 cycles.

But there are more complications – some lines have less than 63 cycles available to do the work. The reason is that the VIC-II chip steals cycles from the CPU for lines where it must fetch extra data. There are two cases:
  • The first raster line of each text line must fetch the characters, which steals one cycle per character. These lines are usually called “bad lines”.
  • Each sprite that is enabled on the raster line steals 2 cycles.
There is, in addition, an overhead to this cycle stealing mechanism that may waste up to 3 cycles per raster line (depending on what the CPU is doing when the cycle stealing starts).

I was lazy and ensured that my raster bars were not on a bad line and that there were no sprites on the same raster line, but careful programming can handle this kind of thing too. The talk “Behind the scenes of a C64 demo” mentions some insane techniques, such as treating the cycle counter as an address and jumping to it (this jumps to different addresses depending on how many cycles have executed, and careful layout of the code can make this compensate for differences in execution time).


Source code

The source code for my test program is available below, and you can build it using the ACME Cross-Assembler as
acme -o test.prg test.asm
The resulting program test.prg can be run in the VICE emulator by loading it using the “Smart attach Disk/Tape” option.

Sunday, October 21, 2018

Optimization and performance measurement

The performance of programs depends surprisingly much on things that “should not” matter (such as the order files are passed to the linker), and modifying one part of the program may change the performance of other parts.

An example

We will look at the libquantum library as an example. Configuring and compiling this as
./configure
make demos
produces two demo applications. Running the shor application as
shor 456 427
takes 11.0 seconds [1] on my PC (Ubuntu 16.04, Intel Broadwell CPU).

The performance of the shor application is relatively sensitive to changes, and we can see surprising performance differences when modifying the code. For example, adding this function
void foo(int *a, int *b, int *c, int n)
{
  for (int i = 0; i < n; i++)
    *a++ = *b++ + *c++;
}
at the top of measure.c makes the program need 11.8 seconds to run – it has become 7% slower by just adding an unused function!

Why we get the performance difference

The only effect of adding the unused function is that some other functions are moved to different addresses [2], so it is natural to attribute the performance difference to “cache effects”. But there are several other hardware bottlenecks that can cause this kind of performance variability – this particular case is due to branch prediction.

Branch predictors keep track of the histories of branches by building various tables in hardware (see Dan Luu’s great blog post). But the branch prediction hardware looks only at a few bits of the addresses, so several branches may map to the same slot, with the result that the branch predictor cannot differentiate between them. Changing the size of one code segment makes other parts of the program move to different addresses, and some branches may be unlucky in that they now alias with branches behaving in a different way, which causes the performance to regress due to excessive misprediction.
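As a simplified illustration (the sizes and addresses here are made up – real predictors are more complicated, and the exact indexing varies between microarchitectures), consider a predictor that indexes its history table with a few low bits of the branch address:
/* Hypothetical 4096-entry predictor indexed by bits 2-13 of the branch address. */
unsigned predictor_index(unsigned long branch_addr)
{
  return (branch_addr >> 2) & 0xfff;
}
Two branches at, for example, 0x4005f8 and 0x4045f8 both get index 0x17e, so the predictor cannot tell them apart, and their histories interfere.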

Impact of this when optimizing

All books and conference talks about optimization tell you to always measure your change to verify that it actually improves the performance before committing. This is good advice, but just measuring the running time is not enough! For example, if our change to libquantum was a real improvement that made the code 5% faster, then we would still have seen a 2% regression due to the (unrelated) branch prediction issues. But this regression is just us being unlucky with code placement, and further changes may make us lucky again, so it would probably have been better long-term to do the change. Similarly, I have seen too many bad changes done because “we measured it, and this made the program faster (but we do not understand why...)”.

So it is often useful to dig deeper and understand why we get the performance result so that we can attack the root cause and not rely on being lucky – for the libquantum example we could ensure the branches in the sensitive part of the code are located near each other in memory (for example, by forcing the code to be inlined) or try to eliminate some branches by algorithmic changes.

Further reading



1. There are some randomized algorithms in libquantum, so I hardcoded a seed in order to measure the same work in each run:
--- shor.c.orig 2007-09-03 08:47:55.000000000 +0200
+++ shor.c 2018-10-20 21:18:45.026076220 +0200
@@ -37,7 +37,7 @@
   int N;
   int c,q,a,b, factor;
 
-  srand(time(0));
+  srand(123456);
 
   if(argc == 1)
     {
The running time of this is stable (11.0 seconds with a standard deviation of 0.014).
2. It could also have made the compiler make different inlining decisions etc., but I have verified that the only difference for this program is that some code is placed on different addresses compared to the original version.

Saturday, August 11, 2018

Inlining and microbenchmarking

Inlining is important for C++ performance, but the compiler must be careful not to increase the code size too much. GCC’s inlining process basically inlines functions sorted by priority until a growth percentage limit is hit (or until all relevant functions have been inlined), which works fine for large translation units where the parts of the code not helped by inlining can compensate for the code increase in the parts where inlining helps. But it works less well for small translation units that need much inlining, which is common when doing microbenchmarks.

Take for example this quick-bench benchmark from Bartlomiej Filipek’s blog post “Speeding Up string_view String Split Implementation” that measures the performance for different ways of splitting a string.

We can see that StringSplitStd is about 9% faster than StringViewSplit (they have the score 7829 and 7189), but the reason for this difference is that GCC has inlined everything for StringSplitStd, but not for StringViewSplit.

We get a different result if we run the benchmark with a different set of functions. For example, removing StringViewSplitStd and StringViewSplitPtr from the benchmark makes the compiler make different inlining decisions, and we get the same performance for both StringSplitStd and StringViewSplit (quick-bench).

It is a good idea when doing microbenchmarking to check that the compiler makes the same inlining decisions as when the code is used in a real use case (in a realistically sized translation unit).

Saturday, July 28, 2018

Don’t trust quick-bench results you see on the internet

It is easy to use quick-bench to benchmark small C++ functions, but it is hard to ensure they are measuring what is intended. Take this benchmark as an example. It is measuring different ways of transforming a string to lower case using code of the form
void by_index(char* s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        s[i] = tolower(s[i]);
}

static void LowerByIndex(benchmark::State& state)
{
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        char upper_string[] = "UPPER TEXT";
        by_index(upper_string, 10);

        // Make sure the variable is not optimized away by compiler
        benchmark::DoNotOptimize(upper_string);
    }
}

// Register the function as a benchmark
BENCHMARK(LowerByIndex);
We are going to look at how compilers may optimize it in ways that were probably not intended.

How a compiler may optimize the code

The following corresponds to how GCC optimizes this on Linux when using the libc++ header files [1].

The tolower function is defined in /usr/include/ctype.h as
extern const __int32_t **__ctype_tolower_loc (void)
    throw () __attribute__ ((__const__));

extern __inline __attribute__ ((__gnu_inline__)) int
tolower (int __c) throw ()
{
    return __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
}
This is inlined into by_index, which in turn is inlined into LowerByIndex
static void LowerByIndex(benchmark::State& state)
{
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        char upper_string[] = "UPPER TEXT";
        for (size_t i = 0; i < 10; ++i) {
            int __c = upper_string[i];
            upper_string[i] = __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
        }

        benchmark::DoNotOptimize(upper_string);
    }
}

A char has values in the range -128 to 127 when compiling for the x86 architecture, so the compiler determines that the comparisons in
upper_string[i] = __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
are always true (as __c is assigned from a char), and the line is simplified to
upper_string[i] = (*__ctype_tolower_loc())[__c];

The __ctype_tolower_loc function is decorated with the const attribute so the function call can be moved out of the loop, provided the loop body is always executed. Compilers typically represent for-loops as an if-statement followed by a do-while loop – for example
for (i = 0; i < n; ++i) {
    do_something();
}
is represented as
i = 0;
if (i < n) {
    do {
        do_something();
        ++i;
    } while (i < n);
}
This representation simplifies the work for the compiler as the pre-condition is separated from the actual loop, and constant expressions can now trivially be moved out of the loop. The if-statement is simplified to always true (or always false) when the iteration count is known at compile time. Rewriting the loops, and moving __ctype_tolower_loc out of the loops gives us the result
static void LowerByIndex(benchmark::State& state)
{
    auto it = state.begin();
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[] = "UPPER TEXT";
            size_t i = 0;
            do {
                int __c = upper_string[i];
                upper_string[i] = p[__c];
                ++i;
            } while (i < 10);

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
Note that the call to __ctype_tolower_loc is now outside of the code segment being measured!

The inner loop is small, so it is fully unrolled
static void LowerByIndex(benchmark::State& state) {
    auto it = state.begin();
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[] = "UPPER TEXT";
            int __c0 = upper_string[0];
            upper_string[0] = p[__c0];
            int __c1 = upper_string[1];
            upper_string[1] = p[__c1];
            int __c2 = upper_string[2];
            upper_string[2] = p[__c2];
            int __c3 = upper_string[3];
            upper_string[3] = p[__c3];
            int __c4 = upper_string[4];
            upper_string[4] = p[__c4];
            int __c5 = upper_string[5];
            upper_string[5] = p[__c5];
            int __c6 = upper_string[6];
            upper_string[6] = p[__c6];
            int __c7 = upper_string[7];
            upper_string[7] = p[__c7];
            int __c8 = upper_string[8];
            upper_string[8] = p[__c8];
            int __c9 = upper_string[9];
            upper_string[9] = p[__c9];

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
All accesses to upper_string are now to known positions, so the compiler can easily forward the values written to where they are read, and the generated code does not need to initialize or read from upper_string
static void LowerByIndex(benchmark::State& state) {
    auto it = state.begin();
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[10];
            upper_string[0] = p['U'];
            upper_string[1] = p['P'];
            upper_string[2] = p['P'];
            upper_string[3] = p['E'];
            upper_string[4] = p['R'];
            upper_string[5] = p[' '];
            upper_string[6] = p['T'];
            upper_string[7] = p['E'];
            upper_string[8] = p['X'];
            upper_string[9] = p['T'];

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
Finally, several characters are the same, so the compiler can CSE (Common Subexpression Elimination) them to only load these values once from p
static void LowerByIndex(benchmark::State& state)
{
    auto it = state.begin();
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[10];
            upper_string[0] = p['U'];
            upper_string[1] = upper_string[2] = p['P'];
            upper_string[3] = upper_string[7] = p['E'];
            upper_string[4] = p['R'];
            upper_string[5] = p[' '];
            upper_string[6] = upper_string[9] = p['T'];
            upper_string[8] = p['X'];

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
That is, the compiler has used its knowledge of the input to basically hard code the result (this may be OK, depending on what the benchmark is trying to measure). And it has moved some code out of the code segment being measured (which is probably not OK for the benchmark).

How to fix the benchmark

I usually prefer keeping the benchmarked function in a separate translation unit in order to guarantee that the compiler cannot take advantage of the code setting up the benchmark, but that does not work in quick-bench. One way to get a similar effect is to mark the function as noinline, but that only solves part of the problem – compilers do various interprocedural optimizations, and for GCC you should specify at least noclone too. Other compilers may need to be restricted in different ways.
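For GCC, this can be done with function attributes on the benchmarked function – for example (using by_index from the benchmark above):
__attribute__((noinline, noclone))
void by_index(char* s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        s[i] = tolower(s[i]);
}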

It may also be possible to hide information from the compiler by using volatile or functionality from the benchmarking framework (such as benchmark::DoNotOptimize and benchmark::ClobberMemory), but this may also introduce unintended behavior. For example, these workarounds make the code look “unusual” to the compiler, which may make various heuristics make different optimization decisions compared to normal usage.

In general, you need to spend some time analyzing the benchmark in order to determine what the result means (for example, are we measuring the difference in how fast different methods can transform a string, or are we only measuring the difference for the string “UPPER TEXT”?), or as Fabian Giesen says in “A whirlwind introduction to dataflow graphs”:
With microbenchmarks, like a trial lawyer during cross-examination, you should never ask a question you don’t know the answer to (or at least have a pretty good idea of what it is). Real-world systems are generally too complex and intertwined to understand from surface measurements alone. If you have no idea how a system works at all, you don’t know what the right questions are, nor how to ask them, and any answers you get will be opaque at best, if not outright garbage. Microbenchmarks are a useful tool to confirm that an existing model is a good approximation to reality, but not very helpful in building these models to begin with.


1. The libstdc++ headers use a normal function call for tolower instead of using the inlined version from ctype.h. You can see the optimization from this blog post using libstdc++ too by including ctype.h before any C++ header (but this is not possible in quick-bench, as it adds its own headers before the user code).