Thursday, July 7, 2016

Checked C

Microsoft Research recently announced a work-in-progress project, Checked C, that adds bounds checking to pointers and arrays:
Checked C adds new pointer types and array types that are bounds-checked, yet layout-compatible with existing pointer and array types. In keeping with the low-level nature of C, programmers control the placement of bounds information in data structures and the flow of bounds information through programs. Static checking enforces the integrity of the bounds information and allows the eliding of some dynamic checking. Dynamic checking enforces the integrity of memory accesses at runtime when static checking cannot. Checked C is backwards-compatible: existing C programs work “as is”. Programmers incrementally opt-in to bounds checking, while maintaining binary compatibility.
Below are my summary of, and comments on, the Checked C version 0.5 specification.

Overview

Checked C extends C with checked arrays and pointers where memory accesses are checked at runtime, and a runtime error is produced if the memory access is out of range. The compiler is allowed to (and expected to) eliminate checks that it knows are always true.

Checked arrays are created using the checked keyword, so
int x checked[5];
creates a checked array x having 5 elements. There are three new checked pointer types that are declared using syntax borrowed from C++:
  • array_ptr<T> — A pointer to an element of an array of type T values. This pointer type works as a normal C pointer (but with bounds checking).
  • ptr<T> — A pointer to a value of type T. Pointer arithmetic is not allowed on this pointer type.
  • span<T> — The span pointer works in the same way as the array_ptr, but it is represented differently in the generated code (see below).
As an example
ptr<int> p;
declares a checked pointer to an int.

The checked pointer types can have const and volatile modifiers, so a pointer to a constant integer is written as
ptr<const int> p;
while
int x;
const ptr<int> p = &x;
defines a pointer that cannot be modified.

The checked arrays and pointers are used in the same way as normal C arrays and pointers, so a checked pointer p can be dereferenced as *p, and an array_ptr<T> or span<T> pointer p can be dereferenced using expressions such as *(p+4) or p[4].

The array_ptr<T> and span<T> pointers need bounds to be defined before they may be dereferenced. Defining the bounds for a pointer p is done using the count and bounds keywords
  • p : count(len) — the number of elements that are accessible beginning at p
  • p : bounds(low, high) — the range of memory that can be accessed through p
The bounds are placed on the declaration, such as
array_ptr<int> p : count(n) = malloc((sizeof(int) * n);
Using this pointer as p[i] will conceptually add a check
dynamic_check(0 <= i && i < n);
right before the access.1 Pointer assignment transfer the bounds in the natural way, so q will get the bound from p in
array_ptr<int> q = p;
The array_ptr<T> and ptr<T> pointers have the same size as normal C pointers (such that sizeof(array_ptr<int>) is equal to sizeof(int*)) so checked pointers can be used without changing the layout of strucures. This means that the bounds are maintained locally by the compiler. The span<T> pointers do however keep the bounds within the type, so sizeof(span<int>) is larger than sizeof(int*).

It is possible to add additional constraints to handle things like this aligned memcpy
int aligned_memcpy(array_ptr<char> dest : count(len) where aligned(dest, 4),
                   array_ptr<char> src : count(len) where aligned(src, 4),
                   int len where len % 4 == 0);
although the constraint specifications seems to be a bit under-specified in the document, and I am not completely sure how they work in detail...

Undefined behavior

Doing all of this is a bit meaningless if code such as a[i+j] can make anything happen because of undefined behavior from overflow of i+j, so Checked C defines the behavior for some things that invokes undefined behavior in standard C.

Checked C requires that signed integer overflow produces a value or a runtime error:
  • To be able to maintain pointer bounds safety, it is important that signed integer overflow produce a defined value. When a signed integer expression produces an out-of-range value, either (1) the operation must convert that value to an in-range integer value or (2) the expression shall produce a runtime error. The conversion must be a function of only the input values of the expression.
  • Integer division by 0 shall also produce a runtime error or produce a defined value.
Checked C does also define pointers to work more like hardware pointers than what is the case in standard C. The checked pointers are treated in the same way as unsigned integers (all values are valid, even if they do not point at an object), but they produce runtime errors for pointer wrap and pointer arithmetic involving NULL. The rules for undefined behavior for unchecked pointers are modified in a similar way:
Unchecked pointers shall be treated as addresses of locations in memory, just as checked pointers are treated as addresses. The addresses shall be unsigned integers with a defined range of 0 to UINTPTR_MAX:
  • Comparison of pointers for all different kinds of pointers shall be defined as the corresponding integer comparison.
  • Subtraction p - r of two pointers p and r of type T where one pointer is a checked pointer and the other is an unchecked pointer shall be done following the rules for subtraction of checked pointers, treating the unchecked pointer as a checked pointer in those rules.

Bounds evaluation

The bounds are evaluated each time a pointer is checked, so the program need to be careful when updating variables used in a bounds declaration. The compiler must report an error when the bound is extended
int sum(array_ptr<int> start : bounds(start, end), array_ptr<int> end)
{
    end = end + 1; // bounds(start, end) does not hold after this,
                   // so program is rejected
    start[5] = 0;
    ...
}
but Checked C allows modifying bounds in this way, so for example
array_ptr<int> x : bounds(x, high) = ...
int sum = 0;
while (x < high) {
    sum += *x;
    x++;
}
is fine as the bound is reduced when x is incremented.

And there are more problems... For example, let e be an array_ref<T> with bound(x+1, x+5). This will not work when assigning
x = e;
as the range depends on x. Or consider this example
w = ...
where w : bounds(x, x + y);
int t = *w + (y = tmp);
The bounds for w depends on y, but y is modified in the same expression that dereferences w, and it is unclear if y is updated before or after w is checked. The compiler must reject the code for both of these examples.

A big part of the specification deals with this, and there are rules for which expressions are valid in bounds declarations, and how to do data flow analysis to verify that variables are allowed to be changed. But data flow analysis is expensive, so there are restriction that limit how much the compiler need to check, with the result that small changes to the code may push the program over the limit and thus fail to compile.

This would be so much simpler if the bounds were evaluated where declared. The compiler could place the bounds in hidden temporary variables, but this is rejected in the rationale:
We considered eager evaluation, but rejected it because it would turn array_ptr types into span types. When bounds expressions are always eagerly evaluated, the results need to be stored somewhere so that they can be used when v is used. For local variables, hidden temporary variables could be introduced. This breaks the design principle of not introducing hidden costs, though.
I do not understand what they mean by this... I would say that the current specification adds hidden costs as the bounds may be evaluated each time the pointer is used, while keeping the bounds in hidden variables will only evaluate them once. Hidden variables may increase the register pressure, but the current specification will most likely increase the live ranges for the variables used in the bounds, which also increases register pressure, so I do not expect a difference in reality.

Doing eager evaluation would however cause problems for array_ptr<T> pointers in structures. They are currently handled as
struct S {
    array_ptr<int> arr : count(len);
    int len;
}
where the variables used in the bounds calculations lives in the structure. I have not thought this through in detail, but I think it would make sense to forbid derefencing such pointers, and require the program to copy them to a local variable in order to use them. I do not think this is a big problem, as I would guess that most pointers in arrays are of the ptr<T> type, which can be used directly as they do not have bounds.


1. The real checking is somewhat more complex, and it also checks that the count(n) is valid (i.e. that n is less than (INTPTR_MAX/sizeof(int)), etc.)

Updated 2016-07-07: Added clarification in note 1.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.