## Sunday, August 13, 2017

### Getting started with a GCC back end

This is part two of a series “Writing a GCC back end”.

Most CPU architectures have a common subset – they have instructions doing arithmetics and bit operations on a few general registers, an instruction that can write a register to memory, and an instruction that can read from memory and place the result in a register. It is therefore easy to make a compiler that can compile simple straight-line functions by taking an existing back end and restricting it to this common subset. This is enough to start running the test suite, and it is then straightforward to address one deficiency at a time (adding additional instructions, addressing modes, ABI, etc.).

My original thought was that the RISC-V back end would be a good choice as a starting point – the architecture is fully documented, and it is a new, actively maintained, backend that does not use legacy APIs. But the RISC-V back end has lots of functionality (such as support for multiple ISA profiles, 32- and 64-bit modes, and features such as position-independent code, exception handling and debug information) and the work of reducing it became unnecessarily complicated when I tried...

I now think it is better to start from one of the minimal back ends, such as the back end for the Moxie architecture. Moxie seems to be a good choice as there is also a blog series “How To Retarget the GNU Toolchain in 21 Patches” describing step-by-step how it was developed. The blog series is old, but GCC has a very stable API, so it is essentially the same now (I once updated a GCC backend from GCC 4.3 to GCC 4.9, which were released 6 years apart, and only a few lines needed to be modified...).

One thing missing from the Moxie blog series is how to build the compiler and how to configure and run the test-suite, but I blogged about that a while back in “Running the GCC test-suite for epiphany-sim”.

## Sunday, August 6, 2017

### The structure of a GCC back end

This is part one of a series “Writing a GCC back end”.

The GCC back end is configured in gcc/config.host and the implementation is placed in directories machine under gcc/config and gcc/common/config where “machine” is the name of the back end (for example, i386 for the x86 architecture).

The back end places some functionality in libgcc. For example, architectures that do not have an instruction for integer division will instead generate a call to a function __divsi3 in libgcc. libgcc is configured in libgcc/config.host and target-specific files are located in a directory machine under libgcc/config.

### gcc/config.gcc

config.gcc is a shell script that parses the target string (e.g. x86_64-linux-gnu) and sets variables pointing out where to find the rest of the back end and how to compile it. The variables that can be set are documented at the top of the config.gcc file.

The only variable that must be set is cpu_type that specifies machine. Most targets also set extra_objs that specifies extra object files that should be linked into the compiler, tmake_file that contains makefile fragments that compiles those extra objects (or sets makefile variables modifying the build), and tm_file that adds header files containing target-specific information.

A typical configuration for a simple target (such as ft32-unknown-elf) looks something like
cpu_type=ft32
extra_parts="$extra_parts crti.o crtn.o crtbegin.o crtend.o"  ### libgcc/config/machine The libgcc/config/machine directory contains extra files that may be needed for the target architecture. Simple implementations typically only contain a crti.S and crtn.S (crtbegin/crtend and the makefile support for building all of these have default implementation) and a file sfp-machine.h containing defaults for the soft-float implementation. 1. GCC is written in C++03 these days, but the structure has not been changed since it was written in C. ## Friday, August 4, 2017 ### Writing a GCC back end It is surprisingly easy to design a CPU (see for example Colin Riley’s blog series) and I was recently asked how hard it is to write a GCC back end for your new architecture. That too is easy — provided you have done it once before. But the first time is quite painful... I plan to write some blog posts the coming weeks that will try to ease the pain by showing what is involved in creating a “working” back end that is capable of compiling simple functions, give some pointers to how to proceed to make this production-ready, and in general provide the overview I would have liked before I started developing my backend (GCC has a good reference manual, “GNU Compiler Collection Internals”, describing everything you need to know, but it is a bit overwhelming when you start...) The series will cover the following (I’ll update the list with links to the posts as they become available) 1. The structure of a GCC back end • Which files you need to write/modify 2. Getting started with a GCC back end • Pointers to resources describing how to set up the initial back end 3. Low-level IR and basic instruction selection • How the low-level IR works • How the IR is lowered to instructions • How to write simple instruction patterns 4. Specifying registers • Size/number of registers • Register classes • Allocation order • ABI 5. More advanced instruction patterns • “All” functionality of the instruction patterns definitions • Examples of how to use this functionality 6. Peephole optimizations, etc. 7. Pipeline description 8. Cost model 9. ... ## Monday, July 24, 2017 ### Phoronix SciMark benchmarking results Phoronix recently published an article “Ryzen Compiler Performance: Clang 4/5 vs. GCC 6/7/8 Benchmarks”, and there are many results in that article that surprises me... One such is the result for SciMark that shows that GCC generates much slower code than LLVM – there is a big difference in several tests, and the composite score is 25% lower. I do not have any Ryzen CPU to test on, but my testing on Broadwell shows very little difference between GCC and LLVM when SciMark is compiled with -O3 -march=x86-64 as in the article, and the Ryzen microarchitecture should not introduce that big difference. And the reported numbers seem low... The Phoronix testing also shows strange performance variations between different GCC versions that I don’t see in my testing – I see a performance increase for each newer version of the compiler. The published test results are made running scripts available at OpenBenchmarking.org, and looking at the build script for SciMark shows that it is built as cc$CFLAGS -o scimark2 -O *.c -lm

Note the -O – this overrides the optimization level set by \$CFLAGS and explains at least some of the discrepancies in the test results.1 GCC maps -O to the -O1 optimization level that is meant to be a good choice to use while developing – it optimizes the code, but focuses as much on fast compilation time and good debug information as on producing fast code. LLVM maps -O to -O2 that is a “real” optimization level that prioritizes performance, so it is not surprising that LLVM produces faster code in this case.

So the benchmarking result does not show what is intended, and both compilers can do better than what the test results show...

1. I get similar results as the article when I use -O, but my result for FFT is very different...

## Thursday, July 20, 2017

### A load/store performance corner case

I have recently seen a number of “is X faster than Y?” discussions where micro benchmarks are used to determine the truth. But performance measuring is hard and may depend on seemingly irrelevant details...

Consider for example this code calculating a histogram
int histogram[256];

void calculate_histogram(unsigned char *p, int len)
{
memset(histogram, 0, sizeof(histogram));
for (int i = 0; i < len; i++)
histogram[p[i]]++;
}

The performance “should not” depend on the distribution of the values in the buffer p, but running this on a buffer with all bytes set to 0 and one buffer with random values gives me the result (using the Google benchmark library and this code)
Benchmark                Time           CPU Iterations
------------------------------------------------------
BM_cleared/4096       7226 ns       7226 ns      96737
BM_random/4096        2049 ns       2049 ns     343001

That is, running on random data is 3.5x faster compared to running on all-zero data! The reason for this is that loads and stores are slow, and the CPU tries to improve performance by executing later instructions out of order. But it cannot proceed with a load before the previous store to that address has been done,1 which slows down progress when all loop iterations read and write the same memory address histogram[0] .

This is usually not much of a problem for normal programs as they have more instructions that can be executed out of order, but it is easy to trigger this kind of CPU corner cases when trying to measure the performance of small code fragments, which results in the benchmark measuring something else than intended. Do not trust benchmark results unless you can explain the performance and know how it applies to your use case...

1. The CPU does “store to load forwarding” that saves cycles by enabling the load to obtain the data directly from the store operation instead of through memory, but it still comes with a cost of a few cycles.

## Tuesday, July 4, 2017

### Strict aliasing in C90 vs. C99 – and how to read the C standard

I often see claims that the strict aliasing rules were introduced in C99, but that is not true – the relevant part of the standard is essentially the same for C90 and C99. Some compilers used the strict aliasing rules for optimization well before 1999 as was noted in this 1998 post to the GCC mailing list (that argues that enabling strict aliasing will not cause many problems as most software already has fixed their strict aliasing bugs to work with those other compilers...)

### C99 – 6.5 Expressions

The C standard does not talk about “strict aliasing rules”, but they follow from the text in “6.5 Expressions”:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:73
• a type compatible with the effective type of the object,
• a qualified version of a type compatible with the effective type of the object,
• a type that is the signed or unsigned type corresponding to the effective type of the object,
• a a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
• an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
• a character type.

73 The intent of this list is to specify those circumstances in which an object may or may not be aliased.
Note the footnote that says that the intention of these rules is to let the compiler determine that objects are not aliased (and thus be able to optimize more aggressively).

### C90 – 6.3 Expressions

The corresponding text in C90 is located in “6.3 Expressions”:
An object shall have its stored value accessed only by an lvalue that has one of the following types:36
• the declared type of the object,
• a qualified version of the declared type of the object,
• a type that is the signed or unsigned type corresponding to the declared type of the object,
• a type that is the signed or unsigned type corresponding to a qualified version of the declared type of the object,
• an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
• a character type.

36 The intent of this list is to specify those circumstances in which an object may or may not be aliased.
It is similar to the text in C99, and it even has the footnote that says it is meant to be used to determine if an object may be aliased or not, so C90 allows optimizations using the strict aliasing rules.

But standard have bugs, and those can be patched by publishing technical corrigenda, so it is not enough to read the published standard to see what is/is not allowed. There are two technical corrigenda published for C90 (ISO/IEC 9899 TCOR1 and ISO/IEC 9899 TCOR2), and the TCOR1 updates the two first bullet points. The corrected version of the standard says
An object shall have its stored value accessed only by an lvalue that has one of the following types:36
• a type compatible with the declared type of the object,
• a qualified version of a type compatible with the declared type of the object,
• a type that is the signed or unsigned type corresponding to the declared type of the object,
• a type that is the signed or unsigned type corresponding to a qualified version of the declared type of the object,
• an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
• a character type.

36 The intent of this list is to specify those circumstances in which an object may or may not be aliased.
The only difference compared to C99 is that it does not talk about effective type, which makes it unclear how malloc:ed memory is handled as it does not have a declared type. This is discussed in the defect report DR 28 that asks if it is allowed to optimize
void f(int *x, double *y) {
*x = 0;
*y = 3.14;
*x = *x + 2;
}

to
void f(int *x, double *y) {
*x = 0;
*y = 3.14;
*x = 2; /* *x known to be zero */
}

if x and y point to malloc:ed memory, and the committee answered (citing the bullet point list from 6.3)
We must take recourse to intent. The intent is clear from the above two citations and from Footnote 36 on page 38: The intent of this list is to specify those circumstances in which an object may or may not be aliased.
Therefore, this alias is not permitted and the optimization is allowed.
In summary, yes, the rules do apply to dynamically allocated objects.
That is, the allocated memory gets its declared type when written and the subsequent reads must be done following the rules in the bullet-point list, which is essentially the same as what C99 says.

### One difference between C90 and C99

There is one difference between the C90 and C99 strict aliasing rules in how unions are handled – C99 allows type-punning using code such as
union a_union {
int i;
float f;
};

int f() {
union a_union t;
t.f = 3.0;
return t.i;
}

while this is implementation-defined in C90 per 6.3.2.3
[...] if a member of a union object is accessed after a value has been stored in a different member of the object, the behavior is implementation-defined.

Language lawyering is a popular sport on the internet, but it is a strange game where often the only winning move is not to play. Take for example DR 258 where the committee is asked about a special case in macro-expansion that is unclear. The committee answers
The standard does not clearly specify what happens in this case, so portable programs should not use these sorts of constructs.
That is, unclear parts of the standard should be avoided – not tried to get language lawyered into saying what you want.

And the committee is pragmatic; DR 464 is a case where the defect report asks to add an example for a construct involving the #line directive that some compilers get wrong, but the committee thought it was better to make it unspecified behavior
Investigation during the meeting revealed that several (in fact all that were tested) compilers did not seem to follow the interpretation of the standard as given in N1842, and that it would be best to acknowledge this as unspecified behavior.
So just because the standard says something does not mean that it is the specified behavior. One other fun example of this is DR 476 where the standard does not make sense with respect to the behavior of volatile:
All implementors represented on the committee were polled and all confirmed that indeed, the intent, not the standard, is implemented. In addition to the linux experience documented in the paper, at least two committee members described discussions with systems engineers where this difference between the standard vs the implementation was discussed because the systems engineers clearly depended on the implementation of actual intent. The sense was that this was simply a well known discrepency.

## Saturday, July 1, 2017

### Hard-coded hardware addresses in C/C++

I read a blog post “reinterpret_cast vs. constant expression” that discusses how to get rid of C-style casts for code such as
#define FOO ((struct S*)0xdff000)

But there is no need to have hard-coded addresses in the code – it is better to declare a normal structure
extern struct S hw_s;

and tell the linker to place it at address 0xdff000 using an assembly file containing the lines
.global hw_s
hw_s = 0xdff000

FOO can now be defined without a cast
#define FOO &hw_s

although it is probably better to use hw_s directly...

It is good to get rid of hard-coded addresses in C/C++ code even if you do not care about ugly casts. One reason is that the compiler cannot know which objects the hard-coded addresses point to, which restricts the data flow analysis. One other reason is that hard-coded addresses interact badly with instruction selection in the backend. This is especially true for code accessing hardware registers that expand to assignments of the form
*(volatile int *)(0xdff008) = 0;
*(volatile int *)(0xdff010) = 10;

The best way of generating the code depends on the CPU architecture, but it usually involves loading a base address into a register and storing using a “base + offset” addressing mode, so the compiler needs to split and re-combine the addresses (which is complicated as there are often restrictions on which offsets are valid, the cost of the base depends on the value, etc.). The ARM backend is good at this, but I have seen many cases where much slower and larger code than necessary is generated for more obscure architectures. For example, GCC 7.1 for RISC-V compiles
void foo(void)
{
*(volatile int *)(0xfffff00023400008) = 0;
*(volatile int *)(0xfffff00023400010) = 10;
}

to
foo:
lui a5,%hi(.LC0)
ld a5,%lo(.LC0)(a5)
li a4,10
sw zero,0(a5)
lui a5,%hi(.LC1)
ld a5,%lo(.LC1)(a5)
sw a4,0(a5)
ret
.LC0:
.dword -17591594647544
.LC1:
.dword -17591594647536

instead of the smaller and faster
foo:
lui a5,%hi(.LC0)
ld a5,%lo(.LC0)(a5)
li a4,10
sw zero,8(a5)
sw a4,16(a5)
ret
.LC0:
.dword -17591594647552

you get by writing through a normal structure.