Nest
February 5th, 2006
| george: |
Hey Aaron! How’s the C programming coming?
|
| aaron: |
Not so good. It doesn’t support all the things I’m used to, from other programming languages. Where’s the garbage collection? Where’s the type safety? Where’s the lexically scoped functions?
|
| george: |
Yeah, ISO standard C is pretty anemic. Wait, lexus-what?
|
| aaron: |
Lexically scoped functions. You know, nested functions – the ability to put a function inside another function. Lots of fancy functional languages can do that, and I wish it were possible in C.
|
| george: |
Oh, right. Well, did you know that GCC supports nested functions as an extension? You can call a nested function inside its parent function, and you can also pass a pointer to the nested function to qsort() and other functions. And the nested function can access the local variables of its parent function.
Like, say we wanted to sort a bunch of ints with qsort(), but no other function will need to use the comparison function, AND we want to keep track of how many times the comparison function gets called. That’s really easy with a nested function:
int sort_ints(int* array, int count) {
int num_comparisons = 0;
int comparison_function(const void* a, const void* b) {
int first = *(const int*)a;
int second = *(const int*)b;
num_comparisons++; /* look, we're accessing a variable from our parent function */
if (first < second) return -1;
else return first > second;
}
qsort(array, count, sizeof *array, comparison_function);
return num_comparisons;
}
See? comparison_function() is defined inside sort_ints(). Pretty easy.
|
| aaron: |
Hang on. When I try to compile it, this is what happens:
prompt ) cc -c nested_function.c
nested_function.c: In function 'sort_ints':
nested_function.c:5: error: nested functions are not supported on MacOSX
|
| george: |
Wha? Nested functions not supported? That’s never happened before. In fact, I think that’s new with the latest developer tools release from Apple. Does it still work with gcc 3.3? Yes, it seems to. I wonder what changed? Why would Apple stop supporting this feature?
Maybe taking a look at how nested functions (used to) work will give us some clues.
So the tricky part of nested functions is accessing the variables in the containing function, such as num_comparisons in the above example. This is because the nested function could be called by some other function (through a pointer) at any point up the stack, so the variable could be at any offset from the stack pointer.
How does comparison_function() determine this offset so it knows where to find num_comparisons? Well, it can’t, because any number of bizarre functions could be called between sort_ints() and comparison_function(). The stack frame of sort_ints() has to be passed to comparison_function(), as a parameter, so that it can access num_comparisons.
Uh-oh, that sounds like trouble. qsort() is as dumb as a box of granola. It doesn’t know anything about extra parameters that have to be passed to comparison_function(). All it has is the function pointer, which is just an address in memory. How will comparison_function() get the stack frame containing num_comparisons, then?
Maybe it could be put in a global variable? But no, that won’t work – sort_ints() could be called recursively, so there could be multiple sort_ints() stack frames that we need to keep track of at once.
What if, instead of handing qsort() a pointer to the comparison_function() code, we instead passed a pointer to some data structure containing the stack frame of sort_ints() AND the address of comparison_function()? That data structure could turn around and call comparison_function() passing it the stack frame it needs. And the data structure could live on the sort_ints() stack, so we would have one per invocation of sort_ints(), which is just what we need.
But again, qsort() is as dumb as a baked potato. It wants to jump to whatever address we give it. So our "data structure" will have to be real, executable code that does this dirty work. Yes, we will have to generate executable code at runtime. Who wants to play compiler?
Hmm, this is as clear as cola. Maybe a diagram will help.
- sort_ints() passes the address of the generated code ("trampoline") to qsort.
- qsort() jumps to the address of the generated code, thinking it’s the real function! Ha ha ha!
- The generated code calculates the address of num_comparisons by subtracting a little from the instruction pointer, and then calls comparison_function(), passing it that address.
Believe it or not, this is what gcc actually does (did). Let’s take a simpler example so we can look at the code that gcc generates. Because gcc-3.3, which still supports nested functions, is only available on the PowerPC in OS X, we’ll look at PowerPC code.
Here’s some simple nested-function code:
void extern_function(void (*func_ptr)(void));
void outer_function(void) {
int x = 0;
void inner_function(void) {
x++;
}
extern_function(inner_function);
}
Here’s how we compile it. -static cuts through the position-independent code haze, making it clearer what’s going on:
prompt ) gcc-3.3 -Os -static -S nested_function.c
Here’s what we get:
_inner_function.0:
stw r11,-16(r1)
lwz r2,0(r11)
addi r2,r2,1
stw r2,0(r11)
blr
_outer_function:
mflr r0
lis r5,ha16(_inner_function.0)
stw r0,8(r1)
la r5,lo16(_inner_function.0)(r5)
stwu r1,-128(r1)
li r4,40
addi r3,r1,80
addi r6,r1,64
bl ___trampoline_setup
li r0,0
addi r3,r1,80
stw r0,64(r1)
bl _extern_function
lwz r0,136(r1)
addi r1,r1,128
mtlr r0
blr
|
| aaron: |
I see where inner_function() is defined. It looks pretty simple. It’s just loading four bytes pointed at by register 11, adding one to it, and then writing it back. I guess the stack frame pointer, that thing we’re jumping through all these hoops to get, must be passed in register 11.
|
| george: |
That’s right. Note that while the C code for inner_function() is inside outer_function(), the assembly functions aren’t nested. See that ___trampoline_setup business in outer_function()? That’s where the runtime code generation occurs. Here’s where ___trampoline_setup is defined. Let’s see what it does.
/* R3 = stack address to store trampoline */
/* R4 = length of trampoline area */
/* R5 = function address */
/* R6 = static chain */
.globl ___trampoline_setup
___trampoline_setup:
mflr r0 /* save return address */
bcl 20,31,LCF0 /* load up __trampoline_initial into r7 */
LCF0:
mflr r11
addis r7,r11,ha16(LTRAMP-LCF0)
lg r7,lo16(LTRAMP-LCF0)(r7)
subi r7,r7,4
li r8,trampoline_size /* verify trampoline big enough */
cmpg cr1,r8,r4
srwi r4,r4,2 /* # words to move (insns always 4-byte) */
addi r9,r3,-4 /* adjust pointer for lgu */
mtctr r4
blt cr1,Labort
mtlr r0
That code does some setup and also checks that the caller allocated enough space for the generated code (the "trampoline"). If something weird were to happen, such as the compiler getting out of sync with libc, then the dynamically generated code could produce a buffer overflow – this makes sure that doesn’t happen.
/* Copy the instructions to the stack */
Lmove:
lwzu r10,4(r7)
stwu r10,4(r9)
bdnz Lmove
Most of the generated code is boilerplate (thankfully). In fact, the only non-boilerplate code is the address of the nested function and something called the "static chain," which I don’t understand (yet). In that code, we are just copying the boilerplate from where its "prototype" lives into our block of instructions on the stack.
/* Store correct function and static chain */
stg r5,Lfunc(r3)
stg r6,Lchain(r3)
Our dynamic code generation is all of two instructions. In that code, we stick the address of the nested function into the (recently copied) code block.
/* Now flush both caches */
mtctr r4
Lcache:
icbi 0,r3
dcbf 0,r3
addi r3,r3,4
bdnz Lcache
/* Finally synchronize things & return */
sync
isync
blr
The PowerPC is picky about making you flush the instruction cache after changing memory that contains instructions. This is normally no problem, since it’s a very infrequent operation; in fact, it may even provide some protection against buffer overflow exploits (since you might have to flush the instruction cache before it will see the evil instructions you wrote). But here we have a legitimate reason to write instructions, so we flush the cache.
As an interesting side note, we jump through this hoop to get at the stack frame for the outer function. But we never actually have to store the stack frame in our dynamically generated code because the code itself lives on the stack. That is, the code just asks itself for the instruction pointer, and because the code is in the stack frame of the outer function, the instruction pointer points to exactly the place we want (after a bit of arithmetic). The location of the generated code is as important as the code itself.
|
| aaron: |
Back up a second, I think you just hit on the answer to our original question. Why did Apple stop supporting nested functions? Nested functions require you to take code at runtime and put it somewhere you can write to and then jump to, usually the stack. The same thing is true for buffer overflow exploits, right? To exploit a buffer overflow to do something malicious, you have to write your instructions and get it to jump to them. But if the code won’t jump to a place that’s writable…
|
| george: |
That’s right! If the stack isn’t executable, then buffer overflow exploits may be reduced in severity from a total system compromise to perhaps just a denial of service. Rather than, say, installing something evil, an overflowed program will more likely just crash.
So nested functions are an unfortunate and surprising casualty of a proactive stance on security. There’s a technote about them that has a little more information.
|
| aaron: |
I’m still not convinced there’s no better way to implement this, to save nested functions. What if the function malloc()ed a page, set it to be executable, and generated the code in there?
|
| george: |
We want to prevent the heap from being executable as well, and page granularity is pretty big; there would be a lot of unused executable space that’s just asking to be taken advantage of. Plus malloc() is expensive. But perhaps you’re right, and there is a better way. How do other compilers implement nested functions in C? How do other languages support them?
|
| aaron: |
Maybe our readers know, and will be kind enough to leave a comment.
|
The Internet!
π = 3.2860203432
I have got to say… though your entries are exceedingly technical, and though you’ve only written a single-digit number of entries, I really enjoy reading your weblog. I learn new things every single entry, and even though I may not understand all of it, it still rocks. (I never knew about nested functions, but I guess it doesn’t matter anymore since they’re not supported anymore.
)
Keep these entries coming, even if they only come around once a month.
– Simone
I’ve just taken a course in compiler technology (rough translation from the Swedish name…). Our teacher worked for a company that made pretty much the only Simula compiler in the world back in the days. (Which he always was very proud of, and he made very sure that everyone knew how much Simula was superior to C).
Anyhow, the way he taught us compilers, the… *thinks a bit* … activation record (header data for every stackable or heapable thing) for a function/block instance kept two pointers: one to the dynamically surrounding block, and one for the statically surrounding block. For example, qsort’s statical father for the above example would be the global scope, while the dynamical father would be the block of sort_ints(). Comparison function would have sort_ints() as static father, and qsort as dynamical father.
This way, nested functions would be a piece of cake. comparison_function would be invisible in the global scope, and it could access any data in its statically surrounding block (sort_ints) through the static father pointer.
The theory of the insides of his compilers always looked so clean… No need for runtime generation of executable kludges!
That was very interesting. I did not know GCC had support for nested functions, but I remember using them all the time in Turbo Pascal many years ago.
So if I follow you correctly, this could mean that in a future version of Mac OS X nested functions in other compiled programming languages (like Pascal) may stop working?
Aaron Jacobs
I’d like to agree with Simone and say that I really really enjoy your posts, *because* they are extremely technical. Post more often!
Pascal and Delphi supports nested functions on Intel (obviously). But, it’s been a while since I’ve filtered through the dcc (delphi command compiler) source, so I don’t quite remember how they did it. I’ll have to fire up my windows machine some day and take a look at the x86 gened from a nested function.
-corbin
wrf3
Pascal supports nested functions by the design of the language. So nested functions will always work in Pascal. Nested functions in C is a non-standard addition and this particular implementation was a “hack” to get it working.
In my experience, other compilers implement nested functions via the same way gcc does it for C — writing trampoline code onto the stack and executing it. This wasn’t a problem until fairly recently, since there was no such thing as disabling stack execution.
On processors which support disabling stack execution (like the Core Duo), your code can modify memory permission on a per-page basis. For example, you could write this trampoline and then use vm_protect to make the page containing the trampoline executable. Of course, since the trampoline is smaller than a page you’d also be making some code around it executable. The way to avoid this is to allocate a page-sized block on the heap for the trampoline, then make it executable and non-writable.
I’d love to see a GCC patch to do this for trampolines. It’d make -fnested-functions safe, whereas the current implementation turns off stack protection for the entire process.
http://www.gnu-pascal.de/crystal/gpc/en/mail13020.html is interesting: the GNU Pascal people seem to be considering at least three different solutions.
One of these is trampolines, just as discussed here. It has the advantage of allowing interfacing with C and other bare-metal languages.
Another is ‘descriptors’, which seem to be an AIX thing and which I don’t understand.
A third is ‘fat pointers’. I think the idea here is that a pointer to a function is actually two pointers: a pointer to the code, and a pointer to the stack frame to use for its lexical scope. It’s not obvious to me whether that would affect how every function is called, or if it could be limited to functions called through function pointers. It would also have the weird side effect of making function pointers twice as big as data pointers; would that be within the C spec? Even if so, it would break a lot of peoples’ assumptions.
Jens Ayton
The old Mac OS PEF (Prefered Executable Format, oh the irony) calling conventions, based on the reccomended conventions for PowerPCs, used a form of fat pointer called a transition vector (Tvector). This consisted of a pointer to the in-memory address of the function and a pointer to the table of contents of the function’s code fragment (used to look up globals). Every cross-fragment function call or pointer-based function call went through glue code which loaded the TOC pointer into RTOC (R2, IIRC), loaded the address of the Tvector into R12 and then called the function.
Function pointers were implemented as pointers to the Tvector, and a function pointer call (*Ptr)(foo) would be treated by calling the equivalent of __ptr_glue(Ptr, foo). (In answer to the question about the C spec: pointers may be different sizes, as long as void * is the size of the biggest pointer type; but this indirection makes it a non-issue.)
Normally a Tvector would just be two 32-bit words in the code fragment’s data section. However, the fact that the address of the Tvector was put in a register before the call meant that the Tvector could contain additionall arbitrary fields, and as the Tvector contained no executable code it could easily be built on the stack, even with protection. A nested function code would then generate code along these lines:
typedef struct { void *entryPoint, *TOC; } TVector;
typedef struct { void *entryPoint, *TOC; char *parentStack; } NestTVector;
void outer_function(void) {
int x = 0;
TVector tv_inner_function;
NestTVector tvn_inner_function;
tv_inner_function = (TVector *)&__outer_function__inner_function;
tvn_inner_function.entryPoint = tv_inner_function.entryPoint;
tvn_inner_function.TOC = GetRTOC();
tvn_inner_function.stack = GetStackPointer();
extern_function(inner_function);
}
void __outer_function__inner_function(void) {
NestTVector *tvn_inner_function;
tvn_inner_function = &GetR12();
*(int *)(tvn_inner_function.parentStack + offsetForX)++;
}
Jens Ayton
Tweaks: that should be extern_function((void *)&tvn_inner_function) instead of extern_function(inner_function), and (NestTVector *)GetR12(), not &GetR12().
I don’t have an answer to your question, but I do have a new question. Isn’t this same exploit possible with any code that uses a function pointer?
If another part of the code has a buffer overflow exploit and I inject my new instructions at the location in memory where the function pointer is located I can do exactly the same thing. Right…?
In reference to Lee’s question, overwriting the value of a function pointer with valid instructions won’t cause those instructions to be executed. Instead, the byte values of those instructions would be treated as a value pointing at some other instructions, and the code would jump there.
So it’s possible for a buffer overflow to do some damage by exploiting existing code – if you write a SendToAllMyContactsThenFormatMyHardDrive() function, then a buffer overflow exploit could lead to that being called unexpectedly. But the no-execute protection makes it more difficult to inject new code.
I should also at this point do a shameless plug for my Stack Smash Protection patches for Apple’s GCC 3.3
I can’t remember where I read that, but I’ve seen the suggestion of a separate stack for trampolines which would be located in executable memory. This seems the best option as it doesn’t require much changes but still allow the “data” stack to not be executable.
nobody special
what an interesting series of articles!
i would be interested in reading your comments on the following discussion regarding the differences between the GNU and NeXT objc runtime:
http://www.objc.net/blogger/
thanks.
Anon
OK, how about this: create a seperate stack, just for trampolines. Leave it write-protected, but executable. When you need to create a trampoline, you specifically enable write access, build the trampoline, and then lock it down again (perhaps this should be in priviliged code?).
Looking at OpenBSD’s man page for GCC, it appears they also disable trampolines by default, they have to be re-enabled with -ftrampolines. Also, to quote : “trampoline code marks the smallest possible area around the trampoline stub executable using mprotect(2), since the stack area is by default non-executable.”
http://www.openbsd.org/cgi-bin/man.cgi?query=gcc-local&sektion=1
Yorick
ABIs which implement C function pointers as pointers to descriptors rather than to the code directly indeed have no problem with this. For instance, the 64-bit PowerPC ABI used by Linux does this (I’m not sure about Apple’s). The cost is a slight overhead for each indirect function call. Thus the C call
(*function_pointer)();
which translates to, in the 32-bit case,
mtctr function_pointer
bctrl
would instead be, with descriptors,
ld r1,0(function_pointer) ; code address
mtctr r1
ld r2,8(function_pointer) ; data pointer (not sure about the register here)
bctrl
but this is precisely what is desired for function closures. Not only are there no problems with non-executable stacks, but it is much more efficient since there is no need to use expensive cache flush and barrier instructions. And the “trampoline” (which is just another descriptor) is much smaller.
It is possible to have a secondary stack just for the closure data, as other people have pointed out, and the original paper, on which GCC’s implementation is based, mentions this (http://people.debian.org/~aaronl/Usenix88-lexic.pdf). The problem is with longjmp() which must be aware of the system, or else the stacks will get out of sync.
Ironically, generating code at runtime is slightly easier on x86 than PowerPC because the former have coherent instruction and data caches, so no explicit cache flush is generally needed.
Trevor
“How do other languages support them?”
There is a reasonably strong correlation between languages that support lexically-scoped functions and those that run on some kind of virtual machine. A decent virtual machine is the trivial answer to this problem, since you can architect the machine to make code insertion impossible (at least in traditional senses, e.g. writing data on the stack).
I can’t see any method for doing this kind of thing if:
- you want to compile directly to hardware, and
- the pointer representation can’t be made wider, and
- all code pages must be read-only, and
- code needs to be re-entrant
PsyMar
All I have to say is that this coding style seems like a bit of a kludge in itself, although I guess it could be used to hide subroutines of functions from other functions — but if you’re going to use function pointers to get around that, which is what requires the trampoline, it completely destroys any OOP basis for doing it in the first place. I don’t like it.
The only solution I can think of is for the nested function to have its own copy of the variables of the containing function, along with a bool that is set when it’s being called normally (not via a pointer) by the containing function; if it is, then it uses the containing function’s memory storage for these (the reference distance from the stack pointer will be fixed), and if it isn’t, then it uses the local space.
Daniel B. Giffin
If you’re willing to change the calling conventions — e.g., if you’re compiling a different language — you can just let your “function pointers” point to a data-structure that includes both a “stack frame” and a pointer to the code. We call this a “closure”.
To be clear, this is essentially the solution that ridiculous_fish described:
What if, instead of handing qsort() a pointer to the comparison_function() code, we instead passed a pointer to some data structure containing the stack frame of sort_ints() AND the address of comparison_function()? That data structure could turn around and call comparison_function() passing it the stack frame it needs. And the data structure could live on the sort_ints() stack, so we would have one per invocation of sort_ints(), which is just what we need.
The difference is that the trampoline code (and necessity for writable-and-executable memory) is no longer necessary: because of the different calling convention, the caller (e.g., qsort, still dumb as rocks) essentially sets the stack pointer to the one stored in the closure before jumping to the code. The code is all created at compile-time and lives in your read-only section.
(A fun trick is to allocate those “stack frames” and closures on a garbage-collected heap, which lets you acutally return such inner functions back out of the dynamic context (control stack) that created them, to be called whenever you like. So functions can create and return new, parameterized functions. Or “functions are first-class values.”)
(Another fun trick is to “capture” the current “heap-allocated” stack data-structure and store it away for later, even returning it out of the dynamic context as above; called a “continuation”, this could be “invoked” (returned into, if you will) any time later (even multiple times!). For example, you could implement a sort of multi-threading system this way. Or a backtracking algorithm.)
(The above “tricks” are fundamental features of Scheme. I believe (recent) Smalltalk does the same, with more explicit stack-frame introspection.)
Alex Nekrasov
My understanding is that forbidding writable memory to be executable does not really save from exploits.
A buffer overrun can be used to put a pointer to exec() and format context so that exec() will call up a terminal or any other existing program (including mail and format command line utils). This context would need to be put where the return pointer had been.
This limits the possibilities somehow, but does not remove the threat completely.
If I’m not making sense I’d like to know abou tit.
thanks,
Alex
kgw
Snow Leopard will introduce “blocks” (known as closures in other languages) to gcc. This will allow for the creation of functions with runtime state, which allows for elegant qsort patterns, among other things. I can’t wait!
Look up “return-oriented programming”. On many systems, marking a stack as non-executable merely raises the bar slightly for the attacker. It seems the possibilities are not limited. Anything goes.
In short, we fill the stack with pointers to existing snippets of trusted code (e.g. libc, BIOS), and exploit the behaviour of the return instruction. In particular, a return increments the stack pointer for us, and also jumps to the next snippet. It turns out on several popular systems there is enough variety in the trusted code to do any computation. We never need to inject our own code.
There are tools that automate all this. In other words, we can write source and have it compile into a bunch of carefully chosen addresses.
Prohibiting code on the stack is a toothless security measure. Furthermore, GCC should use return-oriented programming tricks to support nested functions on systems that insist on non-executable stacks.