3 Pointers
As in C, Cyclone pointers are just addresses. Operations on pointers,
such as *x, x->f, and x[e], behave the same
as in C, with the exception that run-time checks sometimes precede
memory accesses. (Exactly when and where these checks occur is
described below.) However, Cyclone prevents memory errors such as
dereferencing dangling pointers, so it may reject legal C operations
on pointers.
In order to enforce memory safety, Cyclone pointer types contain more
information than their C counterparts. In addition to the type of the
object pointed to, pointer types indicate:
For example, the type int *{7}`H
is for possibly-null
pointers to seven int objects in the heap. The syntax and
semantics of all this additional pointer information is now explained.
Then we introduce a new type for arrays of
unknown size. Pointer arithmetic is allowed only on this last
collection of types. Throughout, we mention planned improvements. We
end with a summary.
Cyclone's type system distinguishes between pointers that may be
NULL and those that may not.
Syntax and Semantics
The syntax is straightforward:
The * character is for pointers that may be NULL (as in
C), and the @ character is for pointers that may not be
NULL. So ``int * x = NULL;'' is accepted, but
``int @ x = NULL;'' is not.
Subtyping
For any type t, the type t@ is a subtype of t*.
The type of malloc(sizeof(t)) is t@, as is new e
where e has type t. Hence in the declaration,
``int *x = malloc(sizeof(int))'', there is an implicit legal
cast from t@ to t*. Note that even when t1 is a
subtype of t2, the type t1* is not necessarily a subtype
of t2*, nor is t1@ necessarily a subtype of t2@.
For example, int@@ is not a subtype of int*@. This
illegal code shows why:
void f(int @@ x) {
int *@ y = x; // would be legal were int *@ a subtype of int @@
*y = NULL; // legal because *y has type int *
**x; // seg faults even though the type of *x is int @
}
You can explicitly cast a value of type t* to
t@. Doing so will perform a run-time check. The
cast can be omitted, but the compiler emits a warning and performs the
run-time check. Because the current implementation does not
consider tests to change a t* to t@, such casts are
sometimes necessary to avoid spurious warnings, such as in this code:
extern void f(int @);
void g(int * x) {
if (x != NULL)
f((int @)x);
}
Implementation
A run-time null check is a simple comparison. If it fails (i.e., the
value is NULL), the exception Null_Exception is thrown.
A check is inserted whenever a t* is (explicitly or implicitly)
cast to a t@. Casting t@ to t* has no run-time
effect.
Safety demands that if x may be NULL, then *e,
e.f, e->f, and e[e2] are translated such that we
first check that e is not NULL. e is only
evaluated once. The only way to guarantee there is no check at
run-time is to use @ instead of *. For example, the
function on the left performs one check whereas the one on the right
performs three (both throw Null_Exception if passed
NULL):
int sum3(int *{3} x) { int sum3(int *{3} x) {
int @{3} y = x; return x[0]+x[1]+x[2];
return y[0]+y[1]+y[2]; }
}
Note that &e->f and &e[e2] check (if necessary) that
e is not NULL even though these constructs do not read
through e.
Future
-
We may use dataflow information to avoid spurious warning about
implicit casts from t* to t@ and to avoid
inserting unnecessary checks. However, the analysis is
non-trivial (due to the address-of operator, unstructured control
flow, and undefined evaluation order), and the C compiler may be
able to eliminate unnecessary checks for us.
- For debugging purposes, we may have Null_Exception
carry source-position information.
Syntax and Semantics
The type t@{37}
(similarly t*{37}
) describes pointers to
37 t values. In other words, if x has type
t@{37}
, then x[e] is safe so long as e is between
0 and 36, inclusive. If the {n}
is omitted, it is implicitly
{1}
. Currently, the number must be a compile-time
constant---see below for arrays of unknown size.
We are taking pains not to say t@{37}
describes an array of 37
t values because C (and therefore Cyclone) distinguishes arrays
and pointers in certain contexts. For example, a local declaration
``t@{37}
x;'' allocates space on the stack for a pointer
(which must hold a pointer to 37 t values) whereas
``t x[37];'' allocates space on the stack for 37 t
values.
Subtyping
Pointers to more objects are subtypes of pointers to fewer objects (of
the same type). An explicit cast is not necessary. Put another way,
we could say t@{37}
describes pointers to at least 37 t
values.
Implementation
The length information is not present at run-time, except implicitly
in run-time checks. That is, if e has type t @{37}
, the
compiler translates e[e2] to check that e2 is less than
37. e2 is evaluated only once. If e2 is a constant
expression, there is no run-time check. If e2 is a constant
expression not less than 37, it is a compile-time error.
Future
In the future, the bounds information on a pointer will not have to be
a compile-time constant. For example, you will be able to write
void f(int n) {}
int *{n} arr = new {for i < n : 37};
...
}
This addition is non-trivial because, in terms of the above example,
the variable n may be mutated later in the function. In general,
we are developing a general system where the sizes of pointer bounds
may be expressed in terms of expressions, yet the compiler can always
insert the correct bounds check or verify that the check is
unnecessary.
Currently, pointer arithmetic is only allowed on types of the form
t?. Soon we will allow adding a compile-time constant c
to t@{n}
(for example), with the type of the result being
t@{n-c}
. It will be a compile-time error if c > n.
Syntax and Semantics
The type t@`r
describes pointers into region `r. All regions start with the
` character so that they are not confused with identifiers. If
the region is omitted, the compiler inserts one. The region inserted
depends on where the type occurs, as described below.
The heap region (written `H) conceptually lives forever; in
practice, it is garbage-collected.
Every block (i.e., local scope) is a region. If you label a block
with L:, then the region's name is `L. Similarly, the
parameters to a function f are in a region named `f.
Thanks to region inference, you can point into regions without explicit
names. For example, you can say int *x = &y if y is a
local variable in an unlabeled block. Conceptually, the compiler
creates a label for the block and fills in the corresponding region
name for you. (The output need not actually have a label.)
Because every pointer has a type and every pointer type has a region,
a pointer cannot be mutated so that it points into a different region
than it did before the assignment. Often subtyping (see below) is
sufficient, but in some cases it is necessary to rewrite C code to use
different variables for pointers into different regions. Note that
there is no way for a global variable to hold a stack pointer.
Functions are implicitly polymorphic over the regions of their
arguments. For example, void f(int *`r); is a prototype that
can be passed a pointer into any accessible region. That is,
it can be passed a stack pointer or a heap pointer, so long as it is
not passed a dangling pointer.
Note that our example function f could not possibly assign its
argument to a global, whereas void g(int *`H); could. On the
other hand, g cannot be passed a stack pointer.
The rules the compiler uses for filling in regions when they are
omitted from pointer types are numerous, but they are designed to
avoid clutter in the common case:
-
In function-argument types, a fresh region name is used.
- In function-return types, `H is used.
- In type definitions, including typedef, `H is used.
- In function bodies, unification is used to infer the region
based on how the location assigned the pointer type is used in the
function.
In the future, we intend to change the rule for typedef so
that the meaning can be different at each use of the
typedef, as dictated by the other rules. Until then, be
warned that
typedef int * foo_t;
void g(foo_t);
is different than
void g(int *);
Also, note that these rules are exactly the same as the rules for
omitted regions in instantiations of parameterized types.
Subtyping
t *`r1 is a subtype of t *`r2
if `r1 is known to outlive `r2. In
particular, you can always cast a heap pointer to a pointer into
another region.
Implementation
A pointer's region is not stored with the
pointer at run-time. So there is no way to ask for the region into
which a pointer points. For stack regions there is no region object
at run-time per se, just the stack space for the objects. As
is normal with region-based systems, Cyclone does not prevent dangling
pointers. Rather, it prevents dereferencing dangling pointers.
But this is a subtle point.
So far, we have not provided a way to point to a number of objects
when the number is not known at compile-time.
Syntax and Semantics
The type t ? describes such pointers
to objects of type t. Such types may be assigned
NULL. They may be annotated with a region, which (as with other
pointer types) is the region into which the pointer points. Omitted
region annotations are filled in by the compiler. Clearly,
explicit bounds information makes no sense for these types. If
e has type t ?, then
e.size has type int and is the number of objects pointed
to by type t. (Actually, e.size is allowed for any
pointer type, but for other pointers it is evaluated at compile-time.)
The meaning of operations on t ? objects is best explained in
terms of the implementation.
Implementation
Unlike with types like t*{37}
,
the implementation stores bounds information with objects of type
t ?. Currently, a t ? object occupies three machine
words. Conceptually, the object maintains the starting address of the
collection of objects pointed to, the length of the collection, and
the current value of the pointer used for accessing memory. Pointer
arithmetic may cause the access pointer to be out-of-bounds; no error
occurs at this point. On the other hand, a subscript operation
e1[e2] where e1 has type t? checks that the
access-pointer of e1 plus e2 is within bounds of
e1. Both e1 and e2 are evaluated once. If the
bound is violated the exception Null_Exception is thrown.
When an object of type t? is assigned to, it gets the bounds
information from the ``right-hand side'' of the assignment. So
x=y copies all of y's fields to the fields of x.
Similarly x = y + 17, copies y's fields and then adds 17
to x's access pointer. Finally, x++ just increments
x's access pointer. As in C, pointer arithmetic is limited to
addition of constants and subtraction. The result of pointer
subtraction has type unsigned int, so there is no bounds
information.
Even though, t ? types are implemented as multi-word values,
comparison operations (e.g., ==) are defined on them---the
comparison is performed on the access pointers.
Conversions to/from t ? `r types from/to t*{n}`r
and
t@{n}`r
types exist. Converting to a t? type just uses
the t* or t@'s static type information to initialize the
bounds information. The cast may be implicit; no warning is given.
Converting to a t* or t@ type incurs a run-time check
that the access pointer has a value such that the target type's bounds
information is sound. If so, the access pointer is returned, else the
exception Null_Exception is thrown. Implicit casts of this form
cause the compiler to give a warning.
Future
We may add a ``cannot be NULL'' version of these types
for sake of completeness. More significantly, we intend to allow
user-defined types to have certain fields describe the bounds
information for other fields, rather than relying on types built into
the language.
A pointer type has one of the following forms, where t is a
type and n is a constant unsigned expression:
-
t *{n}`r
, a possibly NULL pointer to n
elements of type t in region `r
-
t @{n}`r
, a non-NULL pointer to n elements
of type t in region `r
- t ? `r, a pointer to an unknown number of elements of type
t in region `r. Implemented as a multi-word object.
If {n}
is omitted, it is {1}
. If the region is omitted,
the compiler inserts one. The region inserted depends on where the
type is written.
The easiest way to port code is to replace uses of t * with
t ?. Of course, this technique does not address region
annotations. Functions that can take only heap pointers (because the
pointers escape into data structures, for example) will need to add
`H annotations for the relevant parameters.
Of course, using t? delays errors until run-time and is less
efficient. Using t@ is the most efficient and
guarantees that Null_Exception will not be thrown.
Currently, code performing pointer arithmetic must use t?.