Sunday, October 4, 2015

spirv-tools — status, problems, and plans

I have done some work on spirv-tools since the posts about the human-friendly SPIR-V representation and the Python API, so it makes sense to do a status update now.

Most of the development has been to actually make the code work. And I think it does now. For example, one of my tests is to disassemble all the shaders from the Mesa shader-db (containing shaders from some old games), assemble the result, and verify that the binaries are identical.

API

IR improvements

The major user-visible change in the API is that the IR has been modified so that
  • An ID is represented by an ID object. The ir.Id class contains a reference to the instruction defining the ID, so there is no need to use module.id_to_inst[id] each time you want to get the instruction (which you usually want each time you have an ID). The instruction is now accessed as id.inst (see the combined sketch after this list).
  • A literal number is represented as an integer.
  • A literal string is represented as a string.
  • An enumerated value is represented as a string of the enumeration name. I chose this representation in order to make it easy to see what the code does. For example
    if inst.operands[0] == 'FPFastMathMode':
    
    checks if a decoration is specifying the floating-point fast math flags.
  • A mask is represented as a list of strings of the enumeration names, and the empty list is used when no value is set in the mask. Checking if a bit is set is done as
    if 'NotNaN' in inst.operands[1]:
    
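Putting these together: here is a small sketch (assuming a module loaded as in the examples further down) that counts negated constants using the new representation:

nof_negated_constants = 0
for inst in module.instructions():
    if inst.op_name == 'OpSNegate':
        # The operand is an ir.Id object, so the defining instruction
        # is reached directly through .inst, without module.id_to_inst.
        op_inst = inst.operands[0].inst
        if op_inst.op_name == 'OpConstant':
            nof_negated_constants += 1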

Optimizations

I have also added optimization passes corresponding to the LLVM instcombine, constprop, die (Dead Instruction Elimination), and simplifycfg passes. And a mem2reg pass will be available soon.

I'm mostly working on the optimizations just to verify that the API makes sense, and some of the passes (constprop and instcombine) are essentially placeholders right now, but I will finish up the code when the final SPIR-V specification is released.
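As a sketch of how the passes are meant to be chained (the pass module names below follow the dead_code_elim.optimize(module) pattern shown in the API post further down, so treat the exact names as assumptions until the API is documented):

import read_spirv
import constprop, instcombine, dead_code_elim, simplify_cfg

with open('frag.spv', 'rb') as stream:
    module = read_spirv.read_module(stream)

# Each pass mutates the module in place.
for opt_pass in (constprop, instcombine, dead_code_elim, simplify_cfg):
    opt_pass.optimize(module)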

Plans

My short-term plan for the API:
  1. Make this a real Python package, installable with pip. And make it work for Python 3.x too.
  2. Add a mem2reg pass (including infrastructure for simple data flow analysis).
  3. Implement a better API for handling constants, types, and global instructions.
  4. Clean up the API. There are some things that need to be renamed and tweaked a little bit (for example, some functions having "uses" in their name treat decorations as a usage, and some do not).
  5. Document the API. Add examples/tutorials.

Assembler / disassembler

The biggest user-visible change in the assembler/disassembler is that global symbols now use normal ID tokens (such as %main) instead of prefixing the name with @ (such as @main). The original implementation used @ in order to simplify parsing of a more convenient syntax for declaring global variables
@gl_FragColor = Output <4 x f32> BuiltIn(FragColor)
but decorations are appended to normal instructions, so this is not much more convenient than using an OpVariable instruction
%gl_FragColor = OpVariable %44 BuiltIn(FragColor) Output
The only real difference is that the type must be specified as a pointer type for OpVariable, so it is not pretty-printed. (The reason is that the pointer type contains a storage class, and I have not found a good way to include it in a pretty-printed type. The storage class is an operand to OpVariable too, so this could be written as
%gl_FragColor = OpVariable *<4 x f32> BuiltIn(FragColor) Output
if the assembler is updated to infer the storage class from the instruction. But I'm not sure if that is a good idea or not...).


The assembler/disassembler are mostly done, but two things need to be implemented:
  1. Floating-point literals
  2. Correct name handling
And there are lots of minor things that could be improved...

Floating-point literals

The assembler is supposed to allow
%52 = OpFSub <4 x f32> %49, (0.5, 0.5, 0.5, 0.5)
instead of
%50 = OpConstant f32 0x3f000000
%51 = OpConstantComposite <4 x f32> %50, %50, %50, %50
%52 = OpFSub <4 x f32> %49, %51
but the current implementation only handles integer and Boolean literals.
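The assembler will need to map each floating-point literal to its IEEE 754 binary32 word; a sketch of that conversion (plain Python, nothing spirv-tools specific):

import struct

def float_to_word(value):
    # Bit-cast an IEEE 754 binary32 value to its 32-bit word encoding.
    return struct.unpack('<I', struct.pack('<f', value))[0]

assert float_to_word(0.5) == 0x3f000000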

Name handling

The assembler accepts named IDs (such as %main), and the disassembler uses the names from OpName debug instructions to create named IDs. But there are some problems with the implementation:
  • Name mangling needs to be implemented in order to handle polymorphism (the disassembler currently uses the numerical value for IDs if it finds several identical names in the binary, and the assembler returns errors for re-declared names).
  • ID names declared within a function should be local to the function.
  • How should the tools handle multiple names for the same ID? For example, if the name in OpEntryPoint differs from the name in an OpName debug instruction for the function. Or if one instruction is decorated with multiple OpName names.

Minor tweaks

SPIR-V spreads out some information over the file (e.g. decorations are placed at the beginning of the file, far from the instruction they refer to), and the goal of the textual representation is to collect it in a way that makes it easy to see all relevant details affecting each instruction, as well as suppressing irrelevant details. But it is a bit unclear how to make this as useful as possible...

Some examples of things to consider:
  • Is it better to be consistent or compact? Decorations are written after the instruction name, and an FPFastMathMode decoration is currently written as
    %52 = OpFSub <4 x f32> FPFastMathMode(NotNaN | NotInf) %49, %50
    
    The values are unique, so it could be written more compactly, without the FPFastMathMode keyword
    %52 = OpFSub <4 x f32> NotNaN | NotInf %49, %50
    
    But there may exist better ways of improving this...
  • Pointers are heavily used in Kernels, so it makes sense to pretty-print them. But how should the storage class be handled?
  • Should structures be pretty printed?
  • Are OpLoopMerge and OpSelectionMerge instructions necessary, or should the assembler insert them automatically when needed?
I need to look at lots of real-world shaders in order to get an idea of what makes sense, but that needs to wait for the SPIR-V specification to be released and shader compilers becoming available. And I need to find relevant modern shaders to look at...

Plans

My short-term plan for the assembler/disassembler:
  1. Implement floating-point literals
  2. Document the assembler syntax

Saturday, August 29, 2015

Instruction-less computation – technical details

This blog post continues the previous post by describing the technical details.

Configuring x86 hardware

The IA-32 hardware is configured by a combination of architectural registers and descriptors in memory. The construction of the movdbz machine uses four different kinds of descriptors:
  • memory translation tables
    The memory translation is done through a two-level page table, where the first level is called a "page directory". The CR3 register contains the physical address of the page directory to use.
  • TSS – Task-State Segment
    The TSS is a structure where the processor state (registers, status flags, etc.) is stored when the task is not running.
  • GDT – Global Descriptor Table
    The GDT is an array of "segment descriptors" containing a pointer to a segment (such as a TSS) and some flags. The task switching mechanism accesses the TSS by indexing the GDT. For example,
    ljmp $24, $0
    switches to a new task whose TSS is described at offset 24 in the GDT.
  • IDT – Interrupt Descriptor Table
    The IDT defines how each interrupt is handled by providing an interrupt descriptor for each interrupt type. We configure it to switch to a new task for #PF and #DF exceptions, and each descriptor contains an offset into the GDT corresponding to the target TSS.

Interrupts, task switching, and memory translation tables

The branch targets in the instructions are specified by the IDT, so it needs to be changed when stepping into a new instruction (i.e. when switching to a new task). The task switch cannot change it directly, but the IDT is placed in virtual memory, and the TSS contains CR3 that points to the page directory, so the new task may map a new page for the IDT, and thus modify the branch targets for the new instruction.

This does, however, give us new problems... The TSS contains the stack pointer which is used to encode the register value, so the TSS is now used to represent both a register and an instruction, and we need to separate them.

The TSS is defined as:

[figure: the 32-bit TSS layout from the Intel architecture manuals]

We can place this structure in memory so that it crosses a page boundary, where the stack pointer (ESP) is in one page, and CR3 in the other. That is, the TSS is mapped as a combination of a "register page" and an "instruction page".

The implementation places the TSS so that the page boundary is between ECX and EDX (ESP is thus placed 8 bytes from the top of the page).

Handling source and destination registers

The execution of one instruction in the movdbz machine is done in three steps:
  1. Read the TSS for the new instruction.
  2. Generate exception (as the program counter is invalid). If the stack pointer is not zero, write status and update the stack pointer.
  3. Store the TSS.
The registers are represented by the stack pointer stored in the TSS, but the instructions use two registers (source and destination), so the read and write of the stack pointer should be to different TSS structures.

The TSS is stored in virtual memory, and we may play some tricks with the memory translation. The "read TSS" step reads the state (including CR3) from the TSS, which updates the memory translation. We may use this to map a new register page for the TSS as a side effect of reading the TSS, and the "store TSS" step will now write to this new register page (i.e. the destination register). So we will in some sense have two TSS structures: one "source TSS" and one "destination TSS".

Task switching to itself

There is a hardware restriction that task switching must be done to a new task – it is not possible to switch to the currently running task. This means that instructions of the form
loop:    movdbz r0, r0, loop, done
are invalid. But this is not much of a problem as it is easily fixed by inserting a NOP instruction
loop:    movdbz r0, r0, nop, done
nop:     movdbz discard, 0, loop, loop

TSS segment descriptor busy flag

The IA-32 tries to prevent recursive task switching; it has a "busy flag" that is set in the TSS segment descriptor when a task is entered from an interrupt, and the flag must be cleared before the task can be run a second time.


We need to clear this flag, and we do this by mapping the GDT page containing the instruction's segment descriptor as the instruction page in the destination TSS. Writing the TSS will now overwrite the last six segment descriptors in that GDT page. The descriptor at offset 0x0ff8 is overwritten by the values of EAX and ECX (the two registers stored in the last eight bytes of the instruction page), so the instruction page initializes those registers with the original descriptor value, which clears the flag when they are written to the TSS.

This means that we can only have one TSS segment descriptor per page, and the GDT has a maximum size of 16 pages, so we can only have 16 segment descriptors (and thus only 16 TSS). But each instruction sets up the page tables, so we may remap these to correspond to different instructions over time. There are restrictions; each instruction needs to have its own and its destinations' TSS mapped, so their TSS segment descriptors must be different. The trapcc implementation handles this by graph coloring, but there are simpler ways (see "Generating assembler instructions" below).

Encoding of the instructions

We now have everything we need in order to describe the implementation.

instless_comp uses three TSS segment descriptors for the instructions, at GDT offsets 0x1ff8, 0x2ff8, and 0x3ff8, and has the corresponding TSSs mapped at addresses 0x40ffd0, 0x41ffd0, and 0x42ffd0 in the virtual address space.

The instructions are encoded in four pages:
  • Page Directory
    The page directory sets up five virtual address ranges:
    • STACK – the stack for the movdbz machine
    • INST – mappings for the movdbz instruction (where TSS and IDT are placed)
    • GDT – the GDT
    • X86 – the code and data for the CPU (only needed to access the TSS when exiting the movdbz machine)
    • PROG – the movdbz program
    It is only the INST address range that is specific to the instruction – the other ranges have identical mappings in all instructions.
  • Page Table
    This maps the IDT page, the destination TSS (destination register page and GDT page) for the current instruction, and the source TSS (source register page and instruction page) for the successor instructions, into the address space.
  • Instruction page
    The instruction page of the TSS is set up so that
    • CR3 points to the page directory.
    • The program counter (EIP) is set to an invalid address.
    • EAX and ECX have the value of the corresponding TSS segment descriptor.
    • EFLAGS, which contains miscellaneous flags for the system state, is set to the value used when running the CPU.
  • IDT page
    The IDT page contains an IDT that handles #PF and #DF exceptions by switching to a task corresponding to the destination instruction.
The movdbz machine is started by switching to the task corresponding to the first instruction, and it is possible to exit from the machine by letting it switch to the task identified by GDT index 0x18, which is the task for the original program running on the CPU.

Generating assembler instructions

The restrictions on the instructions make it hard to write assembly code, and are annoying when doing code generation. The assembler in instless_comp solves this by generating each assembler instruction as three "real" instructions, so that an instruction of the form
label:    movdbz rdest, rsrc, label_nonzero, label_zero
is generated as
label:    movdbz discard, 1, label+2, label+2
label+1:  movdbz discard, 1, label+2, label+2
label+2:  movdbz rdest, rsrc, label_nonzero, label_zero+1
The first instruction is always encoded with the TSS descriptor at offset 0x1ff8, the second at offset 0x2ff8, and the third at offset 0x3ff8, so this always satisfies the constraints without needing graph coloring. And it also handles the assembler instruction branching to itself, as that will branch from the last instruction to one of the two preceding NOP instructions.
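As a sketch of the expansion in code (the tuple encoding of an instruction is made up for the example):

def expand_movdbz(label, rdest, rsrc, label_nonzero, label_zero):
    # Two NOPs followed by the real instruction; the three slots are
    # encoded with the TSS descriptors at 0x1ff8, 0x2ff8, and 0x3ff8.
    return [
        (label,     'discard', 1,    label + 2,     label + 2),
        (label + 1, 'discard', 1,    label + 2,     label + 2),
        (label + 2, rdest,     rsrc, label_nonzero, label_zero + 1),
    ]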

We also know that the target of an assembler instruction is always a NOP, so its input register is always the constant 1, and we do not need to propagate the input registers to its predecessors.

Sunday, August 23, 2015

Instruction-less computation

The paper "The Page-Fault Weird Machine: Lessons in Instruction-less Computation" show that you can run programs on the X86 fault handling mechanism, i.e. without executing any instructions on the CPU!

The construction of this weird machine uses two properties of the IA-32 architecture:
  • Task switching
    The IA-32 has hardware support for task switching — storing the CPU state (registers and status flags), and resuming execution with a new memory map and previously stored CPU state. The interrupt controller may be configured so that a task switch happens when an exception is raised.
  • Interrupts write an error code to the stack
    The Page-Fault Exception (#PF) writes a 32-bit error code to the stack and decrements the stack pointer before handling the exception, which lets us modify the stack pointer without executing any CPU instructions. The stack pointer cannot wrap, so a Double Fault Exception (#DF) is raised instead of #PF if the stack pointer is less than four.
This can be used to create a machine with registers r0, r1, ..., and a move-branch-if-zero-or-decrement instruction
movdbz rdest, rsrc, label_nonzero, label_zero
that does the work corresponding to
if (rsrc == 0) {
    rdest = rsrc;
    goto label_zero;
} else {
    rdest = rsrc - 1;
    goto label_nonzero;
}
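To make the semantics concrete, here is a minimal interpreter for the machine (an illustration only; the program encoding, constant handling, and halt convention are made up for the example):

def run_movdbz(program, regs, label):
    # program maps a label to (rdest, rsrc, label_nonzero, label_zero);
    # an integer rsrc acts as a constant, and label None halts.
    while label is not None:
        rdest, rsrc, label_nonzero, label_zero = program[label]
        value = rsrc if isinstance(rsrc, int) else regs[rsrc]
        if value == 0:
            regs[rdest] = value
            label = label_zero
        else:
            regs[rdest] = value - 1
            label = label_nonzero
    return regs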
The movdbz instruction is Turing-complete (see e.g. Wikipedia "One instruction set computer"1, or this blog post that describes a BF compiler that only generates movdbz instructions). The paper comes with an implementation in trapcc, and I have made a somewhat simpler and more user-friendly implementation in instless_comp.2

A simplified description of the construction is that each register is represented by a task where the stack pointer contains the register's value multiplied by four. Each instruction is represented by an interrupt vector that switches to new tasks (including a new interrupt vector), where #DF switches to the task corresponding to label_zero and #PF switches to label_nonzero. All tasks have invalid program counters, so the new task will raise a new #PF exception (or #DF if the stack pointer is zero), and the machine steps into the next instruction. The implementation is, however, much more involved, and I describe the construction in detail in a separate blog post.

The stack pointer needs to point to valid memory (or we will get other interrupts), so the registers can only have values that are represented as valid stack addresses, and we cannot represent 32-bit values even if we map all memory for the stack (the decrement is done by subtracting four from the stack pointer, so we lose the two least significant bits...). It is natural to just map page 0 for the stack, which gives us registers that are 10 bits wide (a 4096-byte page holds 1024 four-byte stack slots).

It is possible to make the registers wider by adding more pages to the stack, but it does not help much; the machine can only do subtraction, so e.g. addition
a = b + c;
needs to be done by subtracting from a maximum value
a = MAX_VAL - (MAX_VAL - b - c);
And the machine can only decrement by one, so the subtractions need to be done by looping
tmp = MAX_VAL;
while (b--)
    tmp--;
while (c--)
    tmp--;
a = MAX_VAL;
while (tmp--)
    a--;
That is, the running time of addition is exponential in the bit width of the registers, so wide registers are not that useful.

The blog post is continued in part 2 that describes the technical details.



1. movdbz reduces trivially to the subtract-and-branch-if-nonzero instruction.
2. trapcc needs to do global analysis (graph coloring etc.) when generating the instructions in order to satisfy some annoying hardware constraints. My implementation uses a less efficient instruction encoding that makes it possible to handle each instruction independently (and makes it easier to write assembly manually).

Sunday, July 12, 2015

mov-only code generation

I stumbled on the M/o/Vfuscator compiler that only uses mov instructions in the generated code1 using ideas from the "mov is Turing-complete" paper.

It proves to be surprisingly easy to generate such code. This blog post shows how it is done using a somewhat idealized version of the CPU, but it should be easy to map it onto the real X86 ISA. The examples use registers r0, r1, …, and a mov instruction that can be in one of three forms
mov rdest, rsrc
mov rdest, [rsrc + roffset]      ; Load indexed
mov [rdest + roffset], rsrc      ; Store indexed
It is possible to use an immediate constant instead of rsrc, or for the rdest and roffset in the addresses. The width of the move can be 8 bits (movb) or 32 bits (movl). We'll only use 8-bit values for the computations — wider widths can be handled by working on one byte at a time. Pointers are assumed to be 32 bits wide.

if-statements

We'll begin with if-statements. Comparing two registers r1 == r2 and placing the result in r0 can be done as
movb [&scratch + r1], 0
movb [&scratch + r2], 1
movb r0, [&scratch + r1]
where scratch is an array
uint8_t scratch[256];
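This works because the second store overwrites the first exactly when r1 == r2. A Python model of the trick (illustrative only):

scratch = [0] * 256

def compare_eq(r1, r2):
    scratch[r1] = 0
    scratch[r2] = 1       # overwrites the 0 when r1 == r2
    return scratch[r1]    # 1 iff r1 == r2

assert compare_eq(5, 5) == 1
assert compare_eq(5, 6) == 0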
We cannot have any flow control when only using mov instructions, so the idea is to handle if-statements by executing both the true and false parts, and ignoring the result from the false part. E.g. code of the form
if (r1 == r2)
    r0 = r3;
else
    r0 = r4;
is compiled as if written
r0 = (r1 == r2) ? r3 : r4;
which can be done as
movb [&scratch + r1], 0
movb [&scratch + r2], 1
movb r0, [&scratch + r1]
movb [&scratch + 1], r3
movb [&scratch + 0], r4
movb r0, [&scratch + r0]
This works fine as long as everything is in registers, or both the true and false parts write to the same memory. The case where they write different memory needs to be handled by writing to a dummy location when no value should be stored. For example,
if (r1 == r2)
    a = r3;
is compiled to
movb [&scratch + r1], 0
movb [&scratch + r2], 4
movb r0, [&scratch + r1]
movl [&tmp + 4], &a
movl [&tmp + 0], &dummy
movl r0, [&tmp + r0]
movb [r0 + 0], r3
where tmp and dummy are defined as
uint32_t tmp[2];
uint8_t dummy;

Addition

Addition by a constant can be implemented by leveraging the addition done within the addressing mode, by indexing into a constant array
const uint8_t constants[512] = {
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
    0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
       ...
    0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff,
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
       ...
    0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff
};
so e.g. adding 23 to r0 is done as
movb r0, [&(constants+23) + r0]

Loops etc.

We now have nearly everything needed to implement a Turing machine where the tape is implemented as an array that is indexed by the current position. The only problem is that we can only do straight-line code... So we cheat by adding a jmp instruction at the end of our program, jumping back to the first instruction. This works fine for implementing a Turing machine, as it is typically written as
while (1) {
    if (current_state == 0) {
        /* Do the work for this state */
        current_state = /* State transition */
    } else if (current_state == 1) {
        /* Do the work for this state */
        current_state = /* State transition */
    } else if ...
}
But the same approach works for normal languages too, by treating each basic block as a state, and doing a state transition for each branch.

Program termination

Finally, we need a way to terminate the program. This is done by writing to address 0
movb [0 + 0], 0
which terminates the process with a segmentation fault. Conditional termination is done with the same dummy-variable trick as for the conditional store above.


1. This is not completely true – it needs exactly one jmp instruction in addition to the mov instructions.

Sunday, June 14, 2015

Proving correctness of the GCC match.pd — Part 2

The previous post talked about the background and the GCC GIMPLE IR. This post continues with a look at the format of match.pd.

match.pd contains transformation rules for the IR, written in a "lispy" domain-specific language. Each transformation rule is specified by a simplify expression that has two operands: a template expression that is matched against the IR, and a replacement expression that replaces the matching expression in the IR. For example
/* x & ~0 -> x */
(simplify
  (bit_and @0 integer_all_onesp)
  @0)
matches expressions corresponding to "x&~0" and replaces them with "x".

A template operand of the form @<number> is called a "capture" — it captures the operand so you can refer to it in other parts of the simplify expression. Using the same capture multiple times in a template means that the operands need to be identical, e.g.
/* x & x -> x */
(simplify
  (bit_and @0 @0)
  @0)
matches a bit_and with two identical operands.

The predicates, such as integer_all_onesp in the first example above, are the normal predicates used in the GCC middle-end. It is possible to capture operators and predicates too, so you may write
/* x | ~0 -> ~0 */
(simplify
  (bit_ior @0 integer_all_onesp@1)
  @1)
for the rule that changes "x|~0" to "~0".

Pattern matching for commutative operators may use the :c decoration to match the operands in any order. For example
/* x | ~0 -> ~0 */
(simplify
  (bit_ior:c @0 integer_all_onesp@1)
  @1)
will match both "x|~0" and "~0|x" (although it does not matter in this case, as the GCC IR always has the constant as its second argument for commutative operations).

There are cases where you cannot express the replacement using this syntax, so it is possible to generate the replacement expression using C++ code by placing it between curly braces
/* x ^ x -> 0 */
(simplify
  (bit_xor @0 @0)
  { build_zero_cst (type); })
The type of the outermost matching expression (i.e. the type of the bit_xor) is available in the variable type.

It is often the case that additional conditions need to be fulfilled in order to permit the replacement. This can be handled with an if expression:
/* x / -1 -> -x */
(simplify
  (exact_div @0 integer_minus_onep)
  (if (!TYPE_UNSIGNED (type))
   (negate @0)))
This replaces the original expression only if the type of the outermost expression is signed. The operand to if is a C-style expression, so it is possible to create complex conditions, such as
(if (!HONOR_SNANS (element_mode (type))
     && (!HONOR_SIGNED_ZEROS (element_mode (type))
         || !COMPLEX_FLOAT_TYPE_P (type)))
Many rules are identical for several operations, and you can handle this with a for expression
/* x / -1 -> -x */
(for div (trunc_div ceil_div floor_div round_div exact_div)
 (simplify
   (div @0 integer_minus_onep)
   (if (!TYPE_UNSIGNED (type))
    (negate @0))))
The if and for can be nested arbitrarily, and you can use them to construct complex rules where multiple simplifications are done within the same for or if expression.

Finally, there is some additional functionality, such as marking parts of templates as optional, but I'll wait with that until I have made some progress with the functionality described so far...

The next blog post will look at actually proving things.

Sunday, June 7, 2015

API for manipulating and optimizing SPIR-V binaries

I have cleaned up the API in the spirv-tools module I wrote about in a previous blog post. There is currently functionality for
  • Reading/writing SPIR-V binaries and my high level assembler
  • Iterating over instructions, functions, and basic blocks
  • Adding/removing/examining instructions, functions, and basic blocks
  • Some optimization passes (dead code elimination, and CFG simplification)
The idea behind the API is that it should correspond directly to the SPIR-V binary; the binary is conceptually represented as a list of SPIR-V instructions by the Module class, and each Instruction consists of the operation name, result ID, type ID, and operands. Iterating over the module returns instructions in the same order as in the binary. The API also has concepts of functions and basic blocks, and the Function and BasicBlock classes encapsulate sub-sequences of the binary's instructions.

Reading and examining binaries can be done with a minimum of code. For example, here is how to read a SPIR-V binary and count the number of load instructions:
#!/usr/bin/env python
import read_spirv

with open('frag.spv', 'rb') as stream:
    module = read_spirv.read_module(stream)

nof_loads = 0
for inst in module.instructions():
    if inst.op_name == 'OpLoad':
        nof_loads += 1

print 'Number of load instructions: ' + str(nof_loads)

The main use case for the API is analyzing binaries, test generation, etc., but it is useful for implementing optimizations too. For example, a simple peephole optimization for transforming integer "-(-x)" to "x" can be written as
for inst in module.instructions():
    if inst.op_name == 'OpSNegate':
        op_inst = module.id_to_inst[inst.operands[0]]
        if op_inst.op_name == 'OpSNegate':
            src_inst = module.id_to_inst[op_inst.operands[0]]
            inst.replace_uses_with(src_inst)
For each instruction, we check whether it is an OpSNegate instruction; if so, we access the predecessor instruction. If that too is an OpSNegate, then we replace all uses of the original instruction with the second instruction's predecessor. This leaves the OpSNegate dead, so you probably want to run the dead code elimination pass after this, which you do as
dead_code_elim.optimize(module)

A slightly more involved example is optimizing integer "x+x" to "x<<1":
for inst in module.instructions():
    if (inst.op_name == 'OpIAdd' and
           inst.operands[0] == inst.operands[1]):
        const1 = module.get_constant(inst.type_id, 1)
        sll = ir.Instruction(module, 'OpShiftLeftLogical',
                             module.new_id(), inst.type_id,
                             [inst.operands[0], const1.result_id])
        sll.copy_decorations(inst)
        inst.replace_with(sll)
Here we need to replace the OpIAdd instruction with a newly created OpShiftLeftLogical. One complication is that SPIR-V specifies the type partially by the PrecisionLow, PrecisionMedium, and PrecisionHigh decorations, so we need to copy the decorations each time we create a new instruction, as failing to do this may give a dramatic performance reduction on some architectures. I still think that SPIR-V should change how the type and precision modifiers are handled for graphical shaders...

module.get_constant returns an OpConstant or OpConstantComposite instruction with the specified type and value. For vector types, the value should in general have the same number of elements as the type, but it is allowed to pass a scalar value, which replicates the value over the vector width.
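For example, both of the following calls should produce the same <4 x f32> constant (vec4_type_id stands for the ID of a 4-element float vector type; the name is made up for the example):

c_vec = module.get_constant(vec4_type_id, [1.0, 1.0, 1.0, 1.0])
c_scalar = module.get_constant(vec4_type_id, 1.0)   # replicated over the width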

inst.replace_with(sll) replaces inst with sll in the basic block, updates all uses of inst to use sll, and destroys inst. It is safe to add/remove instructions while iterating. The only possible issue is that the iterator may not see instructions that are inserted in the current basic block, and you may see an instruction in its old place if it is moved within the current basic block. The predecessors/successors are however always correctly updated — it is only the iteration order that may be surprising.


Basic blocks are handled in a similar way to instructions. See the simplify_cfg pass for an example of how they are used in reality.


There is still some functionality missing (see the TODO file) that I'm planning to fix eventually. Please let me know if you need some specific functionality, and I'll bump its priority.

Sunday, May 31, 2015

Running the GCC test-suite for epiphany-sim

I wanted to run the GCC test-suite on Adapteva’s Epiphany architecture, but I could not find much useful information on how to do it. This post documents what I eventually managed to get running.

The GCC "simtest howto" (having examples/results from 2003 — I'll send a patch to update it...) suggests using a "combined tree" where the source code from GCC, binutils, GDB, and newlib are merged. I'd like to avoid this, as I want to be able to test with different revisions of the components, and I do not trust that I will get reproducible results with the combined tree (for example, both binutils and GDB includes libbfd, and I want to ensure that binutils is built with the correct version).

The instructions below build everything separately, using the latest released versions. It is assumed that DIST contains the path to the source code packages, and that PREFIX is the path where the resulting toolchain will be installed.

Building binutils

Binutils is built as
tar zxf ${DIST}/binutils-2.25.tar.gz
mkdir build_binutils && cd build_binutils
../binutils-2.25/configure --prefix=${PREFIX} --target=epiphany-elf
make -j4
make install
cd ..

Building GCC

GCC needs support from GMP, MPFR, etc. These can be handled as shared libraries, but I want to make sure I know which versions are used. The easiest way of handling this is to place the libraries' source code within the GCC source tree, which builds them as a part of GCC.
tar zxf ${DIST}/gcc-5.1.0.tar.gz
tar zxf ${DIST}/gmp-6.0.0a.tar.bz2 
mv gmp-6.0.0 gcc-5.1.0/gmp
tar zxf ${DIST}/mpc-1.0.3.tar.gz 
mv mpc-1.0.3 gcc-5.1.0/mpc
tar zxf ${DIST}/mpfr-3.1.2.tar.gz 
mv mpfr-3.1.2 gcc-5.1.0/mpfr
tar zxf ${DIST}/isl-0.14.tar.bz2 
mv isl-0.14 gcc-5.1.0/isl
The GCC source tree has a script contrib/download_prerequisites that downloads and extracts the correct versions of GMP etc.

We cannot build GCC before we have a full environment with newlib, but GCC is needed in order to build newlib. We, therefore, start by building a somewhat limited version of GCC that can be used to build the library.
mkdir build_gcc_tmp && cd build_gcc_tmp
../gcc-5.1.0/configure --prefix=${PREFIX} --target=epiphany-elf \
    --enable-languages="c" --with-newlib --without-headers
make -j4 all-gcc
make install-gcc
cd ..

Building newlib

Newlib can now be built as
tar zxf ${DIST}/newlib-2.2.0.tar.gz
mkdir build_newlib && cd build_newlib
env PATH="${PREFIX}/bin:${PATH}" \
    ../newlib-2.2.0/configure --prefix=${PREFIX} --target=epiphany-elf
env PATH="${PREFIX}/bin:${PATH}" make -j4 all
env PATH="${PREFIX}/bin:${PATH}" make install
cd ..

Building GCC again

The "real" GCC is built as
mkdir build_gcc && cd build_gcc
../gcc-5.1.0/configure --prefix=${PREFIX} --target=epiphany-elf \
    --enable-languages="c,c++" --with-newlib
make -j4
make install
cd ..

Building the simulator

The testing is done by running the compiled code on a simulator that is built as a part of GDB, but the GNU GDB distribution does not have support for Epiphany. We, therefore, use the epiphany-gdb-7.8 branch from https://github.com/adapteva/epiphany-binutils-gdb. This repository contains both GDB and some random version of binutils, but we only need the simulator:
unzip ${DIST}/epiphany-binutils-gdb-epiphany-gdb-7.8.zip
mkdir build_sim && cd build_sim
../epiphany-binutils-gdb-epiphany-gdb-7.8/configure \
    --prefix=${PREFIX} --target=epiphany-elf
make -j4 all-sim
make install-sim
cd ..

Running the GCC test-suite

Dejagnu has configuration files for running tests on simulators for most hardware architectures, but not for Epiphany, so we need to create a configuration file epiphany-sim.exp. I'm using the following, which is a modified version of arm-sim.exp:
# Load the generic configuration for this board. This will define a basic
# set of routines used to communicate with the board.
load_generic_config "sim"

# No multilib flags needed by default.
process_multilib_options ""

# basic-sim.exp is a basic description for the standard Cygnus simulator.
load_base_board_description "basic-sim"

# The name of the directory in the build tree where the simulator lives.
setup_sim epiphany

# The compiler used to build for this board. This has *nothing* to do
# with what compiler is tested if we're testing gcc.
set_board_info compiler "[find_gcc]"

# The basic set of flags needed to build "hello world" for this
# board. This board uses libgloss and newlib.
set_board_info cflags   "[libgloss_include_flags] [newlib_include_flags]"
set_board_info ldflags  "[libgloss_link_flags] [newlib_link_flags]"

# This board doesn't use a linker script.
set_board_info ldscript ""

# No support for signals.
set_board_info gdb,nosignals 1
This file needs to be added to dejagnu's search path through a global configuration file. But you do not really need to add the path in the configuration file — dejagnu automatically searches the boards directory located next to the configuration file. So it is enough to create an empty file ~/dejagnu/config.exp, and copy epiphany-sim.exp to ~/dejagnu/boards/epiphany-sim.exp.

The GCC test-suite can now be run as
cd build_gcc
env PATH="${PREFIX}/bin:${PATH}" DEJAGNU="~/dejagnu/config.exp" \
    make -j4 check-gcc RUNTESTFLAGS="--target_board=epiphany-sim"


This post was updated 2017-08-13 with a note about contrib/download_prerequisites.

Sunday, May 24, 2015

Human-friendly SPIR-V textual representation

SPIR-V is a binary IL that is not meant to be written by humans. But there are many cases where it is desirable to write/modify IL, so I have defined a textual representation that I believe is more convenient to work with than the raw disassembly format used in the SPIR-V specification.

I have chosen to use an LLVM-like representation, as I'm used to that format. A typical instruction is written as
%58 = OpIAdd s32 %57, %32
Constants may be written directly as operands to the instructions. For example, if %32 is a constant
%32 = OpConstant s32 1
then the instruction %58 above can be written as
%58 = OpIAdd s32 %57, 1
In the same way, decorations may be attached directly to the instructions instead of having separate decoration instructions at the top of the file. For example
OpDecorate %56, PrecisionMedium
%56 = OpFMul <4 x f32> %55, %54
can be written as
%56 = OpFMul PrecisionMedium <4 x f32> %55, %54
Names can be used instead of the <id> number
%tmp = OpLoad <4 x f32> %21
%56 = OpFMul <4 x f32> %tmp, %54
This makes the assembler allocate a numerical <id> and add debug information with the name. In general, you do not need to specify things that the assembler can generate by itself, such as the constant and decoration instructions above, or the CFG — the assembler reorders the basic blocks when needed.

The SPIR-V format spreads some information over several instructions in different parts of the binary. This textual representation allows collecting those into one statement, so global variables may be written as
@gl_VertexID = Input s32 PrecisionHigh BuiltIn(5) NoStaticUse
which generates instructions
OpName %16, "gl_VertexID"
OpDecorate %16, PrecisionHigh
OpDecorate %16, BuiltIn, 5
OpDecorate %16, NoStaticUse
%15 = OpTypePointer Input, s32
%16 = OpVariable %15 Input
and function definitions can in a similar way be written as
define <4 x f32> @foo(<4 x f32> %a) {
  ...
}
instead of
OpName %12, "foo"
OpName %11, "a"
%10 = OpTypeFunction <4 x f32>, <4 x f32>
%12 = OpFunction %8 0, %10
%11 = OpFunctionParameter <4 x f32>
  ...
OpFunctionEnd

As an example of what this looks like, I have disassembled a shader using my format; the shader is the same as the one used in the raw disassembly example in the SPIR-V specification.


An assembler/disassembler implementing most of the above is available in my spirv-tools github repository. The disassembler tries to take advantage of the syntactic sugar by default, which has the drawback that you do not have full control over <id> numbering etc., and you will in general get a different binary if you re-assemble the shader. But there is a command line option -r to disable this and output instructions exactly as in the binary, which is useful if you want to e.g. modify the code to trigger some special case in your compiler.

The implementation is rather rough right now, so it may not work on your favorite SPIR-V binary. But I'll spend some more time on this over the coming weeks (I plan to formalize and document the syntax, and fix the issues mentioned in the TODO file), so I expect to have a working assembler/disassembler well before the first Vulkan driver is available... :)

Sunday, May 17, 2015

Out of memory handling

I watched a video from CppCon 2014 where the speaker said during Q&A
[...] if you are on Linux, you know, malloc is never going to return NULL. It's always going to give you a chunk of memory, even if memory is full. It's going to say "I can get it from somewhere at some point", and if you actually runs out of memory, what happens is that the OS kills you.
I hear this a lot — that there is no need to handle out-of-memory conditions as you'll never get NULL from malloc, and the OS will kill your process anyway. But this is wrong; there are at least two cases where malloc will return NULL on Linux:
  • Per-process memory limits are configured, and the process is exceeding those.
  • A 32-bit application running under a 64-bit kernel is trying to use more than about 4 gigabytes of memory.
So you need to deal with malloc returning NULL.

I'm not saying that you must handle out-of-memory conditions gracefully, although I would argue it is a good idea (especially if you are developing libraries). But you should at least check if malloc fails, as dereferencing NULL invokes undefined behavior in C, and may lead to surprising results from compiler optimizations.1,2


1 Such as this old Linux 2.6.30 kernel exploit.
2 I cannot see how the compiler may introduce problems by exploiting the undefined behavior resulting from not checking for malloc failure, but I'm sure GCC will find a way...

Tuesday, May 12, 2015

Optimizing ESSL

The cartoon understanding of compiler design is that compilers consist of three parts:
  • front end — handling everything that is language specific
  • middle end — language- and hardware-independent optimizations
  • back end — code generation, independent of the language
One point I was trying to make in my two previous posts is that the situation is more complex in reality; the backend may take advantage of the ESSL precision qualifiers during instruction selection/scheduling, and this affects what the optimizations are allowed to do. So you cannot use a language/hardware-independent middle end if you have a sufficiently strange architecture and want to take advantage of the latitude ESSL gives you.

There are many other considerations when writing a high-performance compiler for some specific market/language/hardware architecture that may be surprising if you have not worked in that area. I'll give some examples below that have surprised me over the years.

Performance, power, and performance measurement

Mobile devices are power constrained, so the clock frequency is dynamically managed to prevent the GPU from running too hot. Different operations consume a different amount of power, and it is not obvious that the fastest shader measured in "number of cycles" is the fastest in "running time", as a slower shader using less power-hungry instructions may be run at a higher clock frequency. So the cycle count may deceive you when you are optimizing shaders.

It is actually very hard to get meaningful performance data when evaluating optimizations (on all systems — not only GPUs), and just implementing an optimization and observing the difference in run time may not tell you if the optimization is beneficial or not. My favorite paper on this is "Producing Wrong Data Without Doing Anything Obviously Wrong!" by Mytkowicz et al., which shows that the performance of real-world applications depends surprisingly much on luck in things like alignment and cache effects. For example, changing the order of files when linking gives up to 15% performance variance for applications in the SPEC CPU2006 benchmark suite. And the result is different for different environments, so you may see a healthy 5% performance uplift in your environment, while the change is actually harmful and makes it slower for most other environments. I have seen many optimization results that I believe are due to this rather than any real improvement...

Compilation speed

High end mobile games may have hundreds of shaders, and shader compilation is done at application start up, so it is important that the compiler is fast. This means that the optimization strategy should be different compared to a desktop compiler, as you need to be more careful in the tradeoff between optimization run time and potential benefit, and not slow down the compiler by handling cases that are unlikely to happen in real world shaders.

Mobile CPUs have improved a lot the last couple of years, but they are still lagging the desktop when it comes to out-of-order execution etc. This makes the abstraction penalty more painful on mobile processors, and you may want to take that into account when designing an ESSL compiler.

Optimizations 

Desktop compilers are insanely complex, but most of that complexity deals with things that do not happen in shaders; ESSL does not have pointers, so data tracking and alias analysis are easy. Shaders do not work on large arrays, so you do not need to transform loops to get better memory access patterns. Vectorization is essentially software-based warping, so that does not help warp-based GPUs. Etc. etc.

And shaders are by necessity small — all mobile phones have high resolution screens, and you cannot spend that many cycles on each pixel if you want a decent frame rate.1 There is not much opportunity for optimizations in small pieces of code, so the relevant optimizations are essentially what you had in an early 90's desktop compiler: inlining, simple loop unrolling, if-conversion, etc.

An important part of compiler development, which is usually glossed over in the compiler literature, is implementing peephole optimizations that map common code idioms to efficient instruction sequences. Application developers keep inventing strange code constructs, so this is a work package that is never finished. To take a random example from GCC: WebKit implements arithmetic right shift by 4 bits using the idiom
r = (v & ~15) / 16;
so GCC needed to add a rule to recognize this as an "arithmetic shift right" instruction. A big part of creating a good compiler is to handle "all" such cases, and graphical shaders have different constructs compared to typical C/C++ code, so you need to invest lots of time looking at real-world shaders.
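A quick Python check of the idiom (Python's arbitrary-precision integers follow two's complement semantics for & and >>, so this matches the 32-bit behavior on the tested range):

# (v & ~15) is a multiple of 16, so the division is exact and C's
# truncating '/' agrees with Python's flooring '//'.
for v in range(-(1 << 15), 1 << 15):
    assert (v & ~15) // 16 == v >> 4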


1 For example, 500MHz, 30fps, 1920x1080 translates to 8 cycles/pixel. Most GPUs have multiple cores (or whatever they are called — all GPU vendors have different terminology), so the cycle budget is larger for most devices. But still rather limited.

Sunday, April 26, 2015

Floating point, precision qualifiers, and optimization

ESSL permits optimizations that may change the value of floating point expressions (lowp and mediump precision change, reassociation of addition/multiplication, etc.), which means that identical expressions may give different results in different shaders. This may cause problems with e.g. alignment of geometry in multi-pass algorithms, so output variables may be decorated with the invariant qualifier to force the compiler to be consistent in how it generates code for them. The compiler is still allowed to do value-changing optimizations for invariant expressions, but it needs to do it in the same way for all shaders. This may give us interesting problems if optimizations and code generation are done without knowledge of each other...

Example 1

As an example of the problems we may get with invariant, consider an application that is generating optimized SPIR-V using an offline ESSL compiler, and uses the IR with a Vulkan driver having a simple backend. The backend works on one basic block at a time, and is generating FMA (Fused Multiply-Add) instructions when multiplication is followed by addition. This is fine for invariant, even though FMA changes the precision, as the backend is consistent and always generates FMA when possible (i.e. identical expressions in different shaders will generate identical instructions).

The application has a shader
#version 310 es

in float a, b, c;
out invariant float result;

void main() {
    float tmp = a * b;
    if (c < 0.0) {
       result = tmp - 1.0;
    } else {
       result = tmp + 1.0;
    }
}
This is generated exactly as written if no optimization is done; first a multiplication, followed by a compare and branch, and we have two basic blocks doing one addition each. But the offline compiler optimizes this with if-conversion, so it generates SPIR-V as if main was written as
void main()
{
    float tmp = a * b;
    result = (c < 0.0) ? (tmp - 1.0) : (tmp + 1.0);
}
The optimization has eliminated the branches, and the backend will now see that it can use FMA instructions as everything is in the same basic block.

But the application has one additional shader where main looks like
void main() {
    float tmp = a * b;
    if (c < 0.0) {
       foo();
       result = tmp - 1.0;
    } else {
       result = tmp + 1.0;
    }
}
The optimization cannot transform the if-statement here, as the basic blocks are too complex. So this will not use FMA, and will therefore break the invariance guarantee. 

Example 2

It is not only invariant expressions that are problematic — you may get surprising results from normal code too, when optimizations done offline and in the backend interact in interesting ways. For example, you can get different precision in different threads from "redundant computation elimination" optimizations. This happens for cases such as
mediump float tmp = a + b;
if (x == 0) {
  /* Code not using tmp */
  ...
} else if (x == 1) {
  /* Code using tmp */
  ...
} else {
  /* Code using tmp */
  ...
}
where tmp is calculated, but not used, for the case "x == 0". The optimization moves the tmp calculation into the two basic blocks where it is used
if (x == 0) {
  /* Code not using tmp */
  ...
} else if (x == 1) {
  mediump float tmp = a + b;
  /* Code using tmp */
  ...
} else {
  mediump float tmp = a + b;
  /* Code using tmp */
  ...
}
and the backend may now choose to use different precisions for the two mediump tmp calculations.

Offline optimization with SPIR-V

The examples above are of course silly — higher level optimizations should not be allowed to change control flow for invariant statements, and the "redundant computation elimination" does not make sense for warp-based architectures. But the first optimization would have been fine if used with a better backend that could combine instructions from different basic blocks. And not all GPUs are warp-based. That is, it is reasonable to do this kind of optimizations, but they need to be done in the driver where you have full knowledge about the backend and architecture.

My impression is that many developers believe that SPIR-V and Vulkan implies that the driver will just do simple code generation, and that all optimizations are done offline. But that will prevent some optimizations. It may work for a game engine generating IR for a known GPU, but I'm not sure that the GPU vendors will provide enough information on their architectures/backends that this will be viable either.

So my guess is that the drivers will continue to do all the current optimizations on SPIR-V too, and that offline optimizations will not matter...

Thursday, April 9, 2015

Precision qualifiers in SPIR-V

SPIR-V is a bit inconsistent in how it handles types for graphical shaders and compute kernels. Kernels use sized types, and there are explicit conversions when converting between sizes. Shaders use 32-bit types for everything, but there are precision decorations that indicate which size is really used, and conversions between sizes are done implicitly. I guess much of this is due to historical reasons in how ESSL defines its types, but I think it would be good to be more consistent in the IR.

ESSL 1 played fast and loose with types. For example, it has an integer type int, but the platform is allowed to implement it as floating point, so it is not necessarily true that "a+1 != a" for a sufficiently large a. ESSL 3 strengthened the type system, so for example high precision integers are now represented as 32-bit values in two's complement form. The rest of this post will use the ESSL 3 semantics.

ESSL does not care much about the size of variables; it has only one integer type "int" and one floating point type "float". But you need to specify which precision to use in calculations by adding precision qualifiers when you declare your variables, such as

highp float x;
Using highp means that the calculations must be done in 32-bit precision, mediump means at least 16-bit precision, and lowp means using at least 9 bits (yes, "nine". You cannot fit a lowp value in a byte). The compiler may use any size for the variables, as long as the precision is preserved.

So "mediump int" is similar to the int_least16_t type in C, but ESSL permits the compiler to use different precision for different instructions. It can for example use 16-bit precision for one mediump addition, and 32-bit for another, so it is not necessarily true that "a+b == a+b" for mediump integers a and b if the addition overflow 16 bits. The reason for having this semantics is to be able to use the hardware efficiently. Consider for example a processor having two parallel arithmetic units — one 16-bit and one 32-bit. If we have a shader where all instructions are mediump, then we could only reach 50% utilization by executing all instructions as 16-bit. But the backend can now promote half of them to 32-bit and thus be able to double the performance by using both arithmetic units.

SPIR-V represents this by always using a 32-bit type and decorating the variables and instructions with PrecisionLow, PrecisionMedium, or PrecisionHigh. The IR does not have any type conversions for the precision, as the actual type is the same and it is only the precision of the instruction that differs. But ESSL has requirements on conversions when changing precision in operations that are similar to how size changes are handled in other languages:

When converting from a higher precision to a lower precision, if the value is representable by the implementation of the target precision, the conversion must also be exact. If the value is not representable, the behavior is dependent on the type:
  • For signed and unsigned integers, the value is truncated; bits in positions not present in the target precision are set to zero. (Positions start at zero and the least significant bit is considered to be position zero for this purpose.)
  • For floating point values, the value should either clamp to +INF or -INF, or to the maximum or minimum value that the implementation supports. While this behavior is implementation dependent, it should be consistent for a given implementation.
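As an illustration of the integer rule (assuming an implementation that uses 16 bits for mediump):

def to_mediump_int(x):
    # Keep only the bit positions present in the target precision;
    # higher bits are dropped, as the quoted rule requires.
    return x & 0xffff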
It is of course fine to have the conversions implicit in the IR, but the conversions are explicit for the similar conversion fp32 to fp16 in kernels, so it is inconsistent. I would in general want the shader and kernel IR to be as similar as possible in order to avoid confusion when writing SPIR-V tools working on both types of IR, and I think it is possible to improve this with minor changes:
  • The highp precision qualifier means that the compiler must use 32-bit precision, i.e. a highp-qualified type is the same as the normal non-qualified 32-bit type. So PrecisionHigh does not tell the compiler anything; it just adds noise to the IR, and can be removed from SPIR-V.
  • Are GPUs really taking advantage of lowp for calculations? I can understand how lowp may be helpful for e.g. saving power in varying interpolation, and those cases are handled by having the PrecisionLow decoration on variables. But it seems unlikely to me that any GPU has added the extra hardware to do arithmetic in lowp precision, and I would assume all GPUs use 16-bit or higher for lowp arithmetic. If so, then PrecisionLow should not be a valid decoration for instructions.
  • The precision decorations are placed on instructions, but it seems better to me to have them on the type instead. If PrecisionLow and PrecisionHigh are removed, then PrecisionMedium is the only decoration left. But this can be treated as a normal 16-bit type from the optimizer's point of view, so we could instead permit both 32- and 16-bit types for graphical shaders, and specify in the execution model that it is allowed to promote 16-bit to 32-bit. Optimizations and type conversions can then be done in exactly the same way as for kernels, and the backend can promote the types as appropriate for the hardware.

Tuesday, April 7, 2015

Comments on the SPIR-V provisional specification

Below are some random comments/thoughts/questions from my initial reading of the SPIR-V provisional specification (revision 30).

Many of my comments are that the specification is unclear. I may agree that it is obvious what the specification means, but my experience from specification work is that it is often the case that everybody agrees that it is obvious, but they do not agree on what the obvious thing is. So I think the specification needs to be more detailed. Especially as one of the goals of SPIR-V is to "be targeted by new front ends for novel high-level languages", and those may generate constructs that are not possible in GLSL or OpenCL C, so it is important that all constraints are documented.

Some other comments are related to tradeoffs. I think the specification is OK, so my comments are mostly highlighting some limitations (and I may have chosen a different tradeoff for some of them...). It would be great to have the rationale described for this kind of decisions.

Const and Pure functions

Functions can be marked as Const or Pure. Const is described as
Compiler can assume this function has no side effects, and will not access global memory or dereference function parameters. Always computes the same result for the same argument values.
while Pure is described as
Compiler can assume this function has no side effect, but might read global memory or read through dereferenced function parameters. Always computes the same result for the same argument values.
I assume the intention is that the compiler is allowed to optimize calls to Const functions, such as moving function calls out of loops, CSEing function calls, etc. And similarly for Pure functions, as long as there are no writes to global memory that may affect the result.
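As an illustration of the kind of transformation I have in mind, GCC's const function attribute makes essentially the same promise, and lets the compiler hoist the call out of the loop (a sketch of the optimization, not of SPIR-V itself):
/* The C analogue of a Const function: no side effects, no reads of
   global memory, and the result depends only on the argument. */
extern int f(int) __attribute__((const));

int sum(int n, int x)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += f(x);   /* loop-invariant: may be evaluated once, before the loop */
    return s;
}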

But the specification talks about "global memory" without defining what it is. For example, are UniformConstant global variables included in this? Those cannot change, so we can do all the Const optimizations even if the function is reading from them. And what about WorkgroupLocal? That name does not sound like global memory, but accesses to it do of course still prevent these optimizations.

I would suggest that the specification be changed to explicitly list the storage classes permitted in Const and Pure functions...

Storage Classes

I'm a bit confused by the Uniform and Function storage classes...

The Uniform storage class is a required capability for Shader. But the GLSL uniform is handled by the UniformConstant storage class, so what are the usage and semantics of Uniform?

Function is described as "A variable local to a function" and is also a required capability for Shader. But OpenCL also has function-local variables... How are those handled? Why are they not handled in the same way for Shader and Kernel?

Restrict

The Restrict decoration is described as
Apply to a variable, to indicate the compiler may compile as if there is no aliasing.
This does not give the full picture, as you can also express that pointers do not alias, as described in the Memory Model section. But pointers have different semantics compared to variables, and that introduces some complications.

OpenCL C defines restrict to work in the same way as in C99, and that is different from the SPIR-V specification. What C99 says is, much simplified, that a value pointed to by a restrict-qualified pointer cannot be modified through a pointer not based on that restrict-qualified pointer. So two pointers can alias if they have the correct "based-on" relationship and follow some rules on how they are accessed. The frontend may of course decide not to decorate the pointers when it cannot express the semantics in the IR, but it is unclear to me that it is easy to detect the problematic cases.
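A small C99 example of the "based-on" relationship (my own illustration, not from either specification): q is based on the restrict-qualified p, so the overlapping accesses below are valid C99 even though the two pointers clearly alias:
void scale(float * restrict p, int n)
{
    float *q = p + 1;             /* q is based on p */
    for (int i = 0; i + 1 < n; i++)
        q[i] = p[i] * 2.0f;       /* p and q access overlapping memory -- valid C99 */
}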

I think this needs to be clarified along the lines of what the LLVM Language Reference Manual does for noalias.

Volatile

There is a Memory Access value Volatile that is described as
This access cannot be optimized away; it has to be executed.
This does not really make sense... The memory model is still mostly TBD in the document, but the principle in GPU programming is that you need atomics or barriers in order to make memory accesses consistent. So there is no way to observe the difference between a compiler that respects Volatile and one that does not.

My understanding is that the rationale for Volatile in SPIR-V is to be able to work around compiler bugs by decorating memory operations with Volatile and in that way disable some compiler transformations. If so, then I think it would be useful to document this in order to make it more likely that compilers do the right thing. After all, I would expect the project manager to tell the team to do more useful work than fixing a bug for which you cannot see the difference between correct and incorrect behavior.

It has historically been rather common for C compilers to miscompile volatile. A typical example involves optimizations such as store forwarding, which substitutes a loaded value with a previously stored value, where the developer forgot to check for volatility when writing the optimization. So a sequence such as
 7:             TypeInt 32 1
15:      7(int) Constant 0
                Store 14(tmp) 15 
16:      7(int) Load 11(b) 
17:      7(int) Load 14(tmp) 
18:      7(int) IMul 16 17
                Store 10(a) 18
corresponding to
volatile int tmp = 0;
a = b * tmp;
gets ID 17 substituted with the constant 0, and is then optimized to
 7:             TypeInt 32 1
15:      7(int) Constant 0
                Store 14(tmp) 15 
17:      7(int) Load 14(tmp) 
                Store 10(a) 15
which is not what is expected. But you can argue that this actually follows the SPIR-V specification — we have not optimized away the memory accesses!

Volatile and OpenCL

The OpenCL C specification says that
The type qualifiers const, restrict and volatile as defined by the C99 specification are supported.
which I interpret as meaning that volatile works in exactly the same way as in C99. And C99 says
An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine, except as modified by the unknown factors mentioned previously. What constitutes an access to an object that has volatile-qualified type is implementation-defined.
That is, the compiler is not allowed to reorder volatile memory accesses, even if it knows that they do not alias. So the definition of the SPIR-V Volatile needs to be strengthened if it is meant to be used for implementing the OpenCL volatile. Although I guess you may get around this with a suitable implementation-defined definition of what constitutes an access to an object...
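For example, C99 does not allow the two stores below to be reordered, even though the compiler can trivially prove that they do not alias; a SPIR-V consumer that only promises not to optimize the accesses away could still swap them:
extern volatile int a, b;

void poke(void)
{
    a = 1;
    b = 2;   /* C99: must happen after the store to a; SPIR-V: unclear */
}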

Differences between graphical shaders and OpenCL

The Validation Rules says that for graphical shaders
  • Scalar integer types can be parameterized only as:
    – 32-bit signed
    – 32-bit unsigned
while OpenCL cannot use the signed/unsigned distinction
  • OpTypeInt validation rules
    – The bit width operand can only be parameterized as 8, 16, 32 and 64 bit.
    – The sign operand must always be 0
I guess this lack of signed/unsigned information is the reason why there are Function Parameter Attributes called Zext and Sext described as
Value should be zero/sign extended if needed.
Both choices regarding the signed/unsigned information are fine for an IR, but why is SPIR-V treating graphics and OpenCL differently?
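The distinction the attributes encode matters because the two extensions produce different values from the same bit pattern, as this small C illustration (mine, not from the specification) shows:
#include <stdint.h>

uint32_t zext8(uint8_t v) { return (uint32_t)v; }           /* 0xff -> 255 */
int32_t  sext8(uint8_t v) { return (int32_t)(int8_t)v; }    /* 0xff -> -1  */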

Endianness

Khronos thinks that SPIR-V is an in-memory format, not a file format, which means that the words are stored in the host's native byte order. But one of the goals of SPIR-V is "enabling shared tools to generate or operate on it", so it will be passed in files between tools. The specification has a helpful hint that you can use the magic number to detect endianness, but that means that all tools need to do the (admittedly simple) extra work of handling both big and little endian.
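For reference, a sketch in C of the check every tool ends up doing (the helper names are mine; 0x07230203 is the SPIR-V magic number):
#include <stdint.h>
#include <stdbool.h>

#define SPIRV_MAGIC 0x07230203u

static uint32_t bswap32(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000ff00u)
         | ((w << 8) & 0x00ff0000u) | (w << 24);
}

/* The module comes from a host with the opposite byte order if the
   first word is the byte-swapped magic number. */
static bool needs_swap(uint32_t first_word)
{
    return first_word == bswap32(SPIRV_MAGIC);
}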

I think that the specification should define one file format encoding (preferably with a standardized file name extension), and say that all tools should use this encoding.

By the way, are there really any big endian platforms in the target market?

Sunday, March 29, 2015

SPIR-V

I'm going to play around a bit with SPIR-V, and the first step is to look at existing projects/blogs. Below is what I have found so far. Have I missed anything relevant?

Blogs

Redefining the shading languages ecosystem with SPIR-V
Great overview of the benefits of SPIR-V, and what problems it does not solve.

Software

glslang
Khronos' reference compiler front end for the OpenGL ES and OpenGL shading languages. This can now generate SPIR-V.

lunarglass
LunarGLASS: LunarG's LLVM Adaptation to Shader Compiler Stacks.
It compiles shaders to LLVM IR using the glslang frontend, and it can use the SPIR-V output from glslang, so it seems to contain most of the code needed to convert SPIR-V to LLVM IR.

https://github.com/cheery/spirthon
Translation from Python bytecode to SPIR-V.

This also contains
  • A SPIR-V decoder/encoder (assembler/disassembler)
  • A machine-readable specification for SPIR-V in JSON format that can be used to drive a decoder/encoder, and the script used to generate it.

https://github.com/Philip-Trettner/SpirvNet
SPIR-V generator for .NET IL written in C#.

https://github.com/jteeuwen/spirv
A SPIR-V encoder/decoder written in Go.

https://github.com/kusma/SPIR-V
Simple SPIR-V parser written in C.

https://github.com/Philip-Trettner/SpirvSpecToJson
SPIR-V HTML Specification to JSON converter written in C#.

https://github.com/GeirGrusom/spirv-emit
SPIR-V bytecode emitter written in C#.