Saturday, August 29, 2015

Instruction-less computation – technical details

This blog post continues the previous post by describing the technical details.

Configuring x86 hardware

The IA-32 hardware is configured by a combination of architectural registers and descriptors in memory. The construction of the movdbz machine uses four different kinds of descriptors:
  • memory translation tables
    The memory translation is done through a two-level page table, where the first level is called a "page directory". The CR3 register contains the physical address of the page directory to use.
  • TSS – Task-State Segment
    The TSS is a structure where the processor state (registers, status flags, etc.) is stored when the task is not running.
  • GDT – Global Descriptor Table
    The GDT is an array of "segment descriptors" containing a pointer to a segment (such as a TSS) and some flags. The task switching mechanism access the TSS by indexing the GDT. For example,
    ljmp $24, $0
    switches to a new task whose TSS is described at offset 24 in the GDT.
  • IDT – Interrupt Descriptor Table
    The IDT defines how each interrupt is handled by providing an interrupt descriptor for each interrupt type. We configure it to switch to a new task for #PF and #DF exceptions, and each descriptor contains an offset into the GDT corresponding to the target TSS.

Interrupts, task switching, and memory translation tables

The branch targets in the instructions are specified by the IDT, so it needs to be changed when stepping into a new instruction (i.e. when switching to a new task). The task switch cannot change it directly, but the IDT is placed in virtual memory, and the TSS contains CR3 that points to the page directory, so the new task may map a new page for the IDT, and thus modify the branch targets for the new instruction.

This does, however, give us new problems... The TSS contains the stack pointer which is used to encode the register value, so the TSS is now used to represent both a register and an instruction, and we need to separate them.

The TSS is defined as:


We can place this structure in memory so that it crosses a page boundary, where the stack pointer (ESP) is in one page, and CR3 in the other. That is, the TSS is mapped as a combination of a "register page" and an "instruction page".

The implementation places the TSS so that the page boundary is between ECX and EDX (ESP is thus placed 8 bytes from the top of the page).

Handling source and destination registers

The execution of one instruction in the movdbz machine is done in three steps:
  1. Read the TSS for the new instruction.
  2. Generate exception (as the program counter is invalid). If the stack pointer is not zero, write status and update the stack pointer.
  3. Store the TSS.
The registers are represented by the stack pointer stored in the TSS, but the instructions use two registers (source and destination), so the read and write of the stack pointer should be to different TSS structures.

The TSS is stored in virtual memory, and we may play some tricks with the memory translation. The "read TSS" step reads the state (including the CR3) from the TSS, which updates the memory translation. We may use this to map a new register page for the TSS as a side effect of reading the TSS, and the "store TSS" will now write to this new register page (i.e. the destination register). So we will in some sense have two TSS structures; one "source TSS" and one "destination TSS".

Task switching to itself

There is a hardware restriction that task switching must be done to a new task – it is not possible to switch to the currently running task. This means that instructions of the form
loop:    movdbz r0, r0, loop, done
are invalid. But this is not much of a problem as it is easily fixed by inserting a NOP instruction
loop:    movdbz r0, r0, nop, done
nop:     movdbz discard, 0, loop, loop

TSS segment descriptor busy flag

The IA-32 tries to prevent recursive task switching; it has a "busy flag" that is set in the TSS segment descriptor when a task is entered from an interrupt, and the flag must be cleared before the task can be run a second time.


We need to clear this flag, and we do this by mapping the GDT page containing the instruction's segment descriptor as the instruction page in the destination TSS. Writing the TSS will now overwrite the six last segment descriptors in that GDT page. The descriptor at offset 0x0ff8 is overwritten by the value of EAX and EDX, so the instruction page initializes those registers with the original descriptor value, which clears the flag when they are written to the TSS.

This means that we can only have one TSS segment descriptor per page, and the GDT has a maximum size of 16 pages, so we can only have 16 segment descriptors (and thus only 16 TSS). But each instruction sets up the page tables, so we may remap these to correspond to different instructions over time. There are restrictions; each instruction needs to have its own and its destinations' TSS mapped, so their TSS segment descriptors must be different. The trapcc implementation handles this by graph coloring, but there are simpler ways (see "Generating assembler instructions" below).

Encoding of the instructions

We now have everything we need in order to describe the implementation.

The instless_comp uses three TSS segment descriptors for the instructions, at GDT offset 0x1ff8, 0x2ff8, and 0x3ff8, and have the corresponding TSS mapped at address 0x40ffd0, 0x41ffd0, and 0x42ffd0 in the virtual address space.

The instructions are encoded in four pages:
  • Page Directory
    The page directory sets up five virtual address ranges:
    • STACK – the stack for the movdbz machine
    • INST – mappings for the movdbz instruction (where TSS  and IDT are placed)
    • GDT – the GDT
    • X86 – the code and data for the CPU (only needed to access the TSS when exiting the movdbz machine)
    • PROG – the movdbz program
    It is only the INST address range that is specific for the instruction – the other ranges have identical mappings in all instructions.
  • Page Table
    This maps the IDT page, the destination TSS (destination register page and GDT page) for the current instruction, and the source TSS (source register page and instruction page) for the successor instructions, into the address space.
  • Instruction page
    The instruction page of the TSS is set up so that
    • CR3 points to the page directory.
    • The program counter (EIP) is set to an invalid address.
    • EAX and EDX have the value of the corresponding TSS segment descriptor.
    • EFLAGS, that contains miscellaneous flags for the system state, is set to the value used when running the CPU.
  • IDT page
    The IDT page contains an IDT that handles #PF and #DF exceptions by switching to a task corresponding to the destination instruction.
The movdbz machine is started by switching to the task corresponding to the first instruction, and it is possible to exit from the machine by letting it switch to the task identified by GDT index 0x18, which is the task for the original program running on the CPU.

Generating assembler instructions

The restrictions on the instructions make it hard to write assembly code, and are annoying when doing code generation. The assembler in instless_comp solves this by generating each assembler instruction as three "real" instructions, so that an instruction of the form
label:    movdbz rdest, rsrc, label_nonzero, label_zero
is generated as
label:    movdbz discard, 1, label+2, label+2
label+1:  movdbz discard, 1, label+2, label+2
label+2:  movdbz rdest, rsrc, label_nonzero, label_zero+1
The first instruction is always encoded with the TSS descriptor at offset 0x1ff8, the second at offset  0x2ff8, and the third at offset 0x3ff8, so this always satisfies the constraints without needing graph coloring. And it also handles the assembler instruction branching to itself, as that will branch from the last instruction to one of the two preceding NOP instructions.

We also know that the target in an assembly instruction always is a NOP, so its input register is always the constant 1, and we do not need to propagate the input registers to its predecessors.

Sunday, August 23, 2015

Instruction-less computation

The paper "The Page-Fault Weird Machine: Lessons in Instruction-less Computation" show that you can run programs on the X86 fault handling mechanism, i.e. without executing any instructions on the CPU!

The construction of this weird machine is using two properties of the IA-32 architecture:
  • Task switching
    The IA-32 has hardware support for task switching — storing the CPU state (registers and status flags), and resuming execution with a new memory map and previously stored CPU state. The interrupt controller may be configured so that a task switch happens when an exception is raised.
  • Interrupts writes an error code to the stack
    The Page-Fault Exception (#PF) writes a 32-bit error code to the stack and decrements the stack pointer before handling the exception, which lets us modify the stack pointer without executing any CPU instructions. The stack pointer cannot wrap, so a Double Fault Exception (#DF) is raised instead of #PF if the stack pointer is less than four.
This can be used to  create a machine with registers r0, r1, ..., and a move-branch-if-zero-or-decrement instruction
movdbz rdest, rsrc, label_nonzero, label_zero
that is doing the work corresponding to
if (rsrc == 0) {
    rdest = rsrc;
    goto label_zero;
} else {
    rdest = rsrc - 1;
    goto label_nonzero;
}
The movdbz instruction is Turing-complete (see e.g. Wikipedia "One instruction set computer"1, or this blog post that describes a BF compiler that only generates movdbz instructions). The paper comes with an implementation in trapcc, and I have made a somewhat simpler and more user-friendly implementation in instless_comp.2

A simplified description of the construction is that each register is represented by a task where the stack pointer contains the register's value multiplied by four. Each instruction is represented by an interrupt vector that switches to new tasks (including a new interrupt vector), where #DF switches to the task corresponding to label_zero and #PF switches to label_nonzero. All tasks have invalid program counters, so the new task will raise a new #PF exception (or #DF if the stack pointer is zero), and the machine steps into the next instruction. The implementation is however much more involved, and I describe the construction in detail in a separate blog post.

The stack pointer needs to point to valid memory (or we will get other interrupts), so the registers can only have values that are represented as valid stack addresses, and we cannot represent 32-bit values even if we map all memory for the stack (the decrement is done by subtracting four from the stack pointer, so we lose the two least significant bits...). It is natural to just map page 0 for the stack to give us registers that are 10 bits wide.

It is possible to make the registers wider by adding more pages to the stack, but it does not help much; the machine can only do subtraction, so e.g. addition
a = b + c;
need to be done by subtracting from a maximum value
a = MAX_VAL - (MAX_VAL - b - c);
And the machine can only decrement by one, so the subtractions need to be done by looping
tmp = MAX_VAL;
while (b--)
    tmp--;
while (c--)
    tmp--;
a = MAX_VAL;
while (tmp--)
    a--;
That is, the running time of addition is exponential in the bit width of the registers, so wide registers are not that useful.

The blog post is continued in part 2 that describes the technical details.



1. movdbz reduces trivially to the subtract-and-branch-if-nonzero instruction.
2. The trapcc need to do global analysis (graph coloring etc.) when generating the instructions in order to satisfy some annoying hardware constraints. My implementation uses a less efficient instruction encoding that makes it possible to handle each instruction independently (and makes it easier to write assembly manually).