Barrier
February 17th, 2007

"That's a lot of cores. And while 80-core floating point monsters like that aren't likely to show up in an iMac any time soon, multicore chips in multiprocessor computers are here today. Single chip machines are so 2004. Programmers better get crackin'. The megahertz free ride is over - and we have work to do."

There, that's the noisome little pitch everyone's been spreading like so much thermal paste. As if multiprocessing is something new! But of course it's not - heck, I remember Apple shipping dualies more than ten years ago, like the Power Macintosh 9500. Multiprocessing is more accessible (read: cheaper) now, but it's anything save new. It's been around long enough that we should have it figured out by now.

So what's my excuse? I admit it - I don't get multiprocessing, not, you know, really get it, and that's gone on long enough. It's time to get to the bottom of it - or if not to the bottom, at least deep enough that my ears start popping.

Threadinology

Where to start, where to start...well, let's define our terms. Ok, here are the things I mean when I say the following, uh, things:

  • Threads are just preemptively scheduled contexts of execution that share an address space. But you already know what threads are. Frankly, for my purposes, they're all pretty much the same whether you're using Objective-C or C++ or Java on Mac OS X or Linux or Windows...
  • Threading means creating multiple threads. But you often create multiple threads for simpler control flow or to get around blocking system calls, not to improve performance through true simultaneous execution.
  • Multithreading is the physically simultaneous execution of multiple threads for increased performance, which requires a dualie or more. Now things get hard.

Yeah, I know. "Multithreading is hard" is a cliché, and it bugs me, because it's not some truism describing a fundamental property of nature - it's something we did. We made multithreading hard because we optimized so heavily for the single threaded case.

What do I mean? Well, processor speeds have so outrun memory speeds that we started guessing at what's in memory so the processor doesn't have to waste time checking. "Guessing" is a loaded term; a nicer phrase might be "making increasingly aggressive assumptions" about the state of memory. And by "we," I mean both the compiler and the processor - both make things hard, as we'll see. We'll figure this stuff out - but they're going to try to confuse us. Oh well. Right or wrong, this is the bed we've made, and now we get to lie in it.

while (1) {
   x++;
   y++;
}

x should always be at least as big as y, right? Right?

Blah blah. Let's look at some code. We have two variables, variable1 and variable2, that both start out at 0. The writer thread will do this:

Writer thread

while (1) {
   variable1++;
   variable2++;
}
They both started out at zero, so variable1, at every point in time, will always be the same as variable2, or larger - but never smaller. Right?

The reader thread will do this:

Reader thread

while (1) {
   local2 = variable2;
   local1 = variable1;
   if (local2 > local1) {
      print("Error!");
   }
}

That's odd - why does the reader thread load the second variable before the first? That's the opposite order from the writer thread! But it makes sense if you think about it.

See, it's possible that variable1 and/or variable2 will be incremented by the writer thread between the loads from the reader thread. If variable2 gets incremented, that doesn't matter - variable2 has already been read. If variable1 gets incremented, then that makes variable1 appear larger. So we conclude that variable2 should never be seen as larger than variable1, in the reader thread. If we loaded variable1 before variable2, then variable2 might be incremented after the load of variable1, and we would see variable2 as larger.

Analogy to the rescue. Imagine some bratty kid playing polo while his uptight father looks on with a starched collar and monocle (which can only see one thing at a time, and not even very well, dontcha know). The kid smacks the ball, and then gallops after it, and so on. Now the squinting, snobby father first finds the ball, then looks around for his sproggen. But! If in the meantime, the kid hits it and takes off after it, Dad will find the kid ahead of where he first found the ball. Dad doesn't realize the ball has moved, and concludes that his budding athlete is ahead of the ball, and running the wrong way! If, however, Dad finds the kid first, and then the ball, things will always appear in the right order, if not the right place. The order of action (move ball, move kid) has to be the opposite from the order of observation (find kid, find ball).

Threadalogy

So we've got two threads operating on two variables, and we think we know what'll happen. Let's try it out, on my G5:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

unsigned variable1 = 0;
unsigned variable2 = 0;
#define ITERATIONS 50000000
void *writer(void *unused) {
        for (;;) {
                variable1 = variable1 + 1;
                variable2 = variable2 + 1;
        }
}
void *reader(void *unused) {
        struct timeval start, end;
        gettimeofday(&start, NULL);
        unsigned i, failureCount = 0;
        for (i=0; i < ITERATIONS; i++) {
                unsigned v2 = variable2;
                unsigned v1 = variable1;
                if (v2 > v1) failureCount++;
        }
        gettimeofday(&end, NULL);
        double seconds = end.tv_sec + end.tv_usec / 1000000. - start.tv_sec - start.tv_usec / 1000000.;
        printf("%u failure%s (%2.1f percent of the time) in %2.1f seconds\n",
               failureCount, failureCount == 1 ? "" : "s",
               (100. * failureCount) / ITERATIONS, seconds);
        exit(0);
        return NULL;
}
int main(void) {
        pthread_t thread1, thread2;
        pthread_create(&thread1, NULL, writer, NULL);
        pthread_create(&thread2, NULL, reader, NULL);
        for (;;) sleep(1000000);
        return 0;
}
What do we get when we run this?
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
How do we know that the reader thread won't see a variable in some intermediate state, midway through being updated? We have to know that these particular operations are atomic. On PowerPC and x86 machines, 32 bit writes to aligned addresses are guaranteed atomic. Other types of memory accesses are not always atomic - in particular, 64 bit writes (say, of a double precision floating point value) on a 32 bit PowerPC are not atomic. We have to check the documentation to know.
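
Here's a quick illustration of the 64 bit case - a sketch of my own, not code from the article. One thread flips a 64 bit variable between two bit patterns while another checks that it only ever sees one of the two. Built as a 32 bit binary, where the store may be split into two 32 bit halves, the reader can catch a torn, half-written value; on a 64 bit machine with aligned stores you most likely won't see any:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

volatile unsigned long long shared = 0; /* volatile, so the loops aren't optimized away */

void *torn_writer(void *unused) {
        for (;;) {
                shared = 0ULL;
                shared = 0xFFFFFFFFFFFFFFFFULL;
        }
        return NULL;
}

void *torn_reader(void *unused) {
        unsigned i, tornCount = 0;
        for (i = 0; i < 50000000; i++) {
                unsigned long long v = shared;
                if (v != 0ULL && v != 0xFFFFFFFFFFFFFFFFULL) tornCount++;
        }
        printf("%u torn read%s\n", tornCount, tornCount == 1 ? "" : "s");
        exit(0);
        return NULL;
}

int main(void) {
        pthread_t thread1, thread2;
        pthread_create(&thread1, NULL, torn_writer, NULL);
        pthread_create(&thread2, NULL, torn_reader, NULL);
        for (;;) sleep(1000000);
        return 0;
}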

So, we're done?

Our expectations were confirmed! The writer thread ordered its writes so that the first variable would always be at least as big as the second, and the reader thread ordered its reads the opposite way to preserve that invariant, and everything worked as planned.

But we might just be getting lucky, right? I mean, if thread1 and thread2 were always scheduled on the same processor, then we wouldn't see any failures - a processor is always self-consistent with how it appears to order reads and writes. In other words, a particular processor remembers where and what it pretends to have written, so if you read from that location with that same processor, you get what you expect. It's only when you read with processor1 from the same address where processor2 wrote - or pretended to write - that you might get into trouble.

So let's try to force thread1 and thread2 to run on separate processors. We can do that with the utilBindThreadToCPU() function, in the CHUD framework. That function should never go in a shipping app, but it's useful for debugging. Here it is:

void *writer(void *unused) {
        utilBindThreadToCPU(0);
        for (;;) {
                variable1 = variable1 + 1;
                variable2 = variable2 + 1;
        }
}
void *reader(void *unused) {
        utilBindThreadToCPU(1);
        struct timeval start, end;
        gettimeofday(&start, NULL);
        ...
int main(void) {
        pthread_t thread1, thread2;
        chudInitialize();
        pthread_create(&thread1, NULL, writer, NULL);
        pthread_create(&thread2, NULL, reader, NULL);
        while (1) sleep(1000000);
        return 0;
}
To run it:
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds

NOW are we done?

Still no failures. Hmm... But wait - processors operate on cache lines, and variable1 and variable2 are right next to each other, so they probably share the same cache line - that is, they get brought in together and treated the same by each processor. What if we separate them? We'll put one on the stack and leave the other where it is.

unsigned variable1 = 0;
#define ITERATIONS 50000000
void *writer(unsigned *variable2) {
        utilBindThreadToCPU(0);
        for (;;) {
                variable1 = variable1 + 1;
                *variable2 = *variable2 + 1;
        }
        return NULL;
}
void *reader(unsigned *variable2) {
        utilBindThreadToCPU(1);
        struct timeval start, end;
        gettimeofday(&start, NULL);
        unsigned i;
        unsigned failureCount = 0;
        for (i=0; i < ITERATIONS; i++) {
                unsigned v2 = *variable2;
                unsigned v1 = variable1;
                if (v2 > v1) failureCount++;
        }
        gettimeofday(&end, NULL);
        double seconds = end.tv_sec + end.tv_usec / 1000000. - start.tv_sec - start.tv_usec / 1000000.;
        printf("%u failure%s (%2.1f percent of the time) in %2.1f seconds\n", failureCount, failureCount == 1 ? "" : "s", (100. * failureCount) / ITERATIONS, seconds);
        exit(0);
        return NULL;
}
int main(void) {
        pthread_t thread1, thread2;
        unsigned variable2 = 0;
        chudInitialize();
        pthread_create(&thread1, NULL, writer, &variable2);
        pthread_create(&thread2, NULL, reader, &variable2);
        while (1) sleep(1000000);
        return 0;
}
So now, one variable is way high up on the stack and the other is way down low in the .data section. Does this change anything?
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
Still nothing! I'm not going to have an article after all! Arrrghhh **BANG BANG BANG BANG**
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
fish ) ./a.out
50000000 failures (100.0 percent of the time) in 0.1 seconds

Hey, there it is! Most of the time, every test passes, but that last time, every test failed.

Our Enemy the Compiler

The lesson here is something you already knew, but I'll state it anyways: Multithreading bugs are very delicate. There is a real bug here, but it was masked first by the fact that the kernel scheduled both threads on the same processor, and then by the fact that the variables were too close together in memory - and once those two issues were removed, (un)lucky timing usually masked the bug anyways. In fact, if I didn't know there was a bug there, I'd never have found it - and I still have my doubts!

So first of all, why would every test pass or every test fail? If there's a subtle timing bug, we'd expect most tests to pass, with a few failing - not all or nothing. Let's look at what gcc is giving us for the reader function:

        lis r9,0x2fa
        ori r2,r9,61568
        mtctr r2
L8:
        bdnz L8
        lis r2,0x2fa
        ori r2,r2,61568
        mullw r2,r0,r2

Hey! The entire body of that big long 50 million iteration loop has been hoisted out, leaving just an empty countdown loop (the lone bdnz at L8) - essentially fifty million no-ops. Instead of adding one or zero each time through the loop, it calculates the one or zero once, and then multiplies it by 50 million.

gcc is loading from variable1 and variable2 exactly once, and comparing them exactly once, and assuming their values do not change throughout the function - which would be a fine assumption if there weren't also other threads manipulating those variables.

This is an example of what I mentioned above, about the compiler making things difficult by optimizing so aggressively for the single threaded case.

Well, you know the punchline - to stop gcc from optimizing aggressively, you use the volatile keyword. So let's do that:

volatile unsigned variable1 = 0;
#define ITERATIONS 50000000
void *writer(volatile unsigned *variable2) {
        utilBindThreadToCPU(0);
        for (;;) {
                variable1 = variable1 + 1;
                *variable2 = *variable2 + 1;
        }
        return NULL;
}
void *reader(volatile unsigned *variable2) {
        utilBindThreadToCPU(1);
        struct timeval start, end;
        ...
What does this change get us?
fish ) ./a.out
12462711 failures (24.9 percent of the time) in 3.7 seconds

It's much slower (expected, since volatile defeats optimizations), but more importantly, it fails intermittently instead of all or nothing. Inspection of the assembly shows that gcc is generating the straightforward sequence of loads and stores that you'd expect.

Our Enemy the Processor

Is this really the cross-processor synchronization issue we're trying to investigate? We can find out by binding both threads to the same CPU:

void *writer(unsigned *variable2) {
        utilBindThreadToCPU(0);
        ...
void *reader(unsigned *variable2) {
        utilBindThreadToCPU(0);
        ...
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.4 seconds

The tests pass all the time - this really is a cross-processor issue.

So somehow variable2 is becoming larger than variable1 even though variable1 is always incremented first. How's that possible? It's possible that the writer thread, on processor 0, is writing in the wrong order - it's writing variable2 before variable1 even though we explicitly say to write variable1 first. It's also possible that the reader thread, on processor 1, is reading variable1 before variable2, even though we tell it to do things in the opposite order. In other words, the processors could be reading and writing those variables in any order they feel like instead of the order we tell them to.

Pop and Lock?

What's the usual response to cross-processor synchronization issues like this? A mutex! Let's try it.
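
The article doesn't show the locking code, but it presumably looks something like this - a sketch of my own using a pthread mutex, with variable1, ITERATIONS, and utilBindThreadToCPU() as in the earlier listings:

#include <pthread.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void *writer(volatile unsigned *variable2) {
        utilBindThreadToCPU(0);
        for (;;) {
                /* Take the lock around the pair of writes... */
                pthread_mutex_lock(&mutex);
                variable1 = variable1 + 1;
                *variable2 = *variable2 + 1;
                pthread_mutex_unlock(&mutex);
        }
        return NULL;
}

void *reader(volatile unsigned *variable2) {
        utilBindThreadToCPU(1);
        struct timeval start, end;
        gettimeofday(&start, NULL);
        unsigned i, failureCount = 0;
        for (i=0; i < ITERATIONS; i++) {
                /* ...and around the pair of reads, so the reader can never
                   see variable2 ahead of variable1. */
                pthread_mutex_lock(&mutex);
                unsigned v2 = *variable2;
                unsigned v1 = variable1;
                pthread_mutex_unlock(&mutex);
                if (v2 > v1) failureCount++;
        }
        gettimeofday(&end, NULL);
        ...

(The spinlock timing mentioned below would presumably come from swapping these calls for OSSpinLockLock()/OSSpinLockUnlock() from <libkern/OSAtomic.h>.)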

fish ) ./a.out
0 failures (0.0 percent of the time) in 479.5 seconds

It made the tests pass, all right - but it was 130 times slower! A spinlock does substantially better, at 20 seconds, but that's still 440% worse than no locking - and spinlocks won't scale. Surely we can do better.

Even the kitchen

Our problem is this: our processors are doing things in a different order than we tell them to, and not informing each other. Each processor is only keeping track of its own shenanigans! For shame! We know of two super-horrible ways to fix this: force both threads onto the same CPU, which is a very bad idea, or use a lock, which is a very slow idea. So what's the right way to make this work?

What we really want is a way to turn off the reordering for that particular sequence of loads and stores. They don't call it "turn off reordering", of course, because that might imply that reordering is bad. So instead they call it just plain "ordering". We want to order the reads and writes. Ask and ye shall receive - the mechanism for that is called a "memory barrier".

And boy, does the PowerPC have them. I count at least three: sync, lwsync, and the hilariously named eieio. Here's what they do:

  • sync is the sledgehammer of the bunch - it orders all reads and writes, no matter what. It works, but it's slow.
  • lwsync (for "lightweight sync") is the newest addition. It's limited to plain ol' system memory, but it's also faster than sync.
  • eieio ("Enforce In-Order execution of I/O") is weird - it orders writes to "device" memory (like a memory mapped peripheral) and regular ol' system memory, but each separately. We only care about system memory, and IBM says not to use eieio just for that. Nevertheless, it should still order our reads and writes like we want.
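
In gcc inline assembly, each of these is a one-liner. Here's a sketch of how you might wrap them (the macro names are mine; the "memory" clobber also tells gcc itself not to reorder across the barrier, something the article's own barrier() macro below gets from the variables being declared volatile):

/* Sketches of wrappers for the three PowerPC barrier instructions. */
#define hw_sync()   __asm__ volatile ("sync"   ::: "memory")  /* orders everything */
#define hw_lwsync() __asm__ volatile ("lwsync" ::: "memory")  /* system memory only, cheaper */
#define hw_eieio()  __asm__ volatile ("eieio"  ::: "memory")  /* device and system stores, separately */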

Because we're not working with devices, lwsync is what we're after. Processor 0 may be writing variable2 before variable1, even though our code says otherwise, so we'll insert a memory barrier between the two writes to prevent that:

Do we need a memory barrier after the write to variable2 as well? No - that would guard against the possibility of the next increment landing on variable1 before the previous increment hits variable2. But the goal is to keep variable1 at least as large as variable2, so it's OK if that happens.

volatile unsigned variable1 = 0;
#define barrier() __asm__ volatile ("lwsync")
#define ITERATIONS 50000000
void *writer(volatile unsigned *variable2) {
        utilBindThreadToCPU(0);
        for (;;) {
                variable1 = variable1 + 1;
                barrier();
                *variable2 = *variable2 + 1;
        }
        return NULL;
}

So! Let's run it!

fish ) ./a.out
260 failures (0.0 percent of the time) in 0.9 seconds

So we reduced the failure count from 12462711 to 260. Much better, but still not perfect. Why are we still failing at times? The answer, of course, is that just because processor 0 writes in the order we want is no guarantee that processor 1 will read in the desired order. Processor 1 may issue the reads in the wrong order, and processor 0 may write in between those two reads. We need a memory barrier in the reader thread, to force the reads into the right order as well:

void *reader(volatile unsigned *variable2) {
        struct timeval start, end;
        utilBindThreadToCPU(1);
        gettimeofday(&start, NULL);
        unsigned i;
        unsigned failureCount = 0;
        for (i=0; i < ITERATIONS; i++) {
                unsigned v2 = *variable2;
                barrier();
                unsigned v1 = variable1;
                if (v2 > v1) failureCount++;
        }
        gettimeofday(&end, NULL);
        double seconds = end.tv_sec + end.tv_usec / 1000000. - start.tv_sec - start.tv_usec / 1000000.;
        printf("%u failure%s (%2.1f percent of the time) in %2.1f seconds\n",
               failureCount, failureCount == 1 ? "" : "s",
               (100. * failureCount) / ITERATIONS, seconds);
        exit(0);
        return NULL;
}
fish ) ./a.out
0 failures (0.0 percent of the time) in 4.2 seconds

That did it!

The lesson here is that if you care about the order of reads or writes by one thread, it's because you care about the order of writes or reads by another thread. Both threads need a memory barrier. Memory barriers always come in pairs, or triplets or more. (Of course, if both threads are in the same function, there may only be one memory barrier that appears in your code - as long as both threads execute it.)

This should not come as a surprise: locks have the same behavior. If only one thread ever locks, it's not a very useful lock.

31 Flavors

What's that? You noticed that the PowerPC comes with three different kinds of memory barriers? Right - the more aggressively reads and writes get scheduled out of order, the more expensive it becomes to order them all - so the PowerPC lets you request various less expensive partial orderings, for performance. Processors that reorder memory accesses more aggressively offer even more barrier flavors. At the extreme end is the DEC Alpha, which sports read barriers with device memory ordering, read barriers without, write barriers with, write barriers without, page table barriers, and various birds in fruit trees. The Alpha's memory model guarantees so little that it is said to define the Linux kernel memory model - that is, the set of barriers available in the kernel source matches the Alpha's instruction set. (Of course, many of them get compiled out when targeting a different processor.)

And on the other end, we have strongly ordered memory models that do very little reordering, like the - here it comes - x86. No matter how many times I ran that code, even on a double-dualie Intel Mac Pro, I never saw any failures. Why not? My understanding (and here it grows fuzzy) is that early multiprocessing setups were strongly ordered because modern reordering tricks weren't that useful - memory was still pretty fast, y'know, relatively speaking, so there wasn't much win to be had. So developers blithely assumed the good times would never end, and we've been wearing the backwards compatibility shackles ever since.

Actually - and here my understanding is especially thin - while the x86 is strongly ordered in general, I believe that Intel has managed to slip some weakly ordered operations in sideways, through the SIMD unit. These are referred to as "streaming" or "nontemporal" instructions. And when writing to specially tagged "write combining" memory, like, say, memory mapped VRAM, the rules are different still.
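
For the curious, here's roughly what using them looks like with the SSE2 intrinsics - a sketch of my own, not something from the article. The point is that a "streaming" store is weakly ordered even on the otherwise strongly ordered x86, so publishing data that way needs an explicit store fence:

#include <emmintrin.h>   /* _mm_stream_si32 (SSE2); _mm_sfence (SSE) comes along with it */

void publish(int *slot, volatile int *ready) {
        /* Nontemporal store: bypasses the cache and may be reordered
           with respect to later ordinary stores. */
        _mm_stream_si32(slot, 42);
        /* Without this store fence, another processor could see *ready == 1
           before the streaming store to *slot becomes visible. */
        _mm_sfence();
        *ready = 1;
}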

But that doesn't answer the question of why x86_64, y'know, the 64 bit x86 implementation in the Core 2s and all, isn't more weakly ordered - or doesn't at least reserve the right to be weaker. That's what IA64 - remember Itanium? - did: early models were strongly ordered, but the official memory model was weak, for future proofing. Why didn't AMD follow suit with x86_64? My only guess (emphasis on guess) is that it was a way of jockeying for position against Itanium, when the 64 bit future for the x86 was still fuzzy. AMD's strongly ordered memory model means better compatibility and less hair-pulling when porting x86 software to 64 bit, and that made x86_64 more attractive compared to the Itanium. A pity, at least for Apple, since of course all of Apple's software runs on the weak PowerPC - there's no compatibility benefit to be had. So it goes. Is my theory right?

Makin' a lock, checkin' it twice

Ok! I think we're ready to take on that perennial bugaboo of Java programmers - the double checked lock. How does it go again? Let's see it in Objective-C:

+ getSharedObject {
    static id sharedObject;
    if (! sharedObject) {
        LOCK;
        if (! sharedObject) {
            sharedObject = [[self alloc] init];
        }
        UNLOCK;
    }
    return sharedObject;
}

What's the theory? We want to create a single shared object, exactly once, while preventing conflict between multiple threads. The hope is that we can do a quick test to avoid taking the expensive lock. If the static variable is set, which it will be most of the time, we can return the object immediately, without taking the lock.

Sometimes memory barriers are needed to guard against past or future reads and writes that occur in, say, the function that's calling your function. Reordering can cross function and library boundaries!

This sounds good, but of course you already know it's not. Why not? Well, if you're creating this object, you're probably initializing it in some way - at the very least, you're setting its isa (class) pointer. And then you're turning around and writing it back to the sharedObject variable. But these can happen in any order, as seen from another processor - so when the getSharedObject method is called from some other processor, it can see the sharedObject variable as set, and happily return the object before its class pointer is even valid. Cripes.

But now you know we have the know-how to make this work, no? How? The problem is that we need to order the writes within the alloc and init methods relative to the write to the sharedObject variable - the alloc and init writes must come first, the write to sharedObject last. So we store the object into a temporary local variable, insert a memory barrier, and then copy from the temporary to the shared object. This time, I'll use Apple's portable memory barrier function:

+ getSharedObject {
    static id sharedObject;
    if (! sharedObject) {
        LOCK;
        if (! sharedObject) {
            id temp = [[self alloc] init];
            OSMemoryBarrier();
            sharedObject = temp;
        }
        UNLOCK;
    }
    return sharedObject;
}

There! Now we're guaranteed that the initializing thread really will write to sharedObject after the object is fully initialized. All done.

Hmm? Oh, nuts! I forgot my rule - memory barriers come in pairs. If thread A initializes the object, it goes through a memory barrier, but if thread B then comes along, it will see the object and return it without any barrier at all. Our rule tells us that something is wrong, but what? Why's that bad?

Well, thread B's going to do something with the shared object, like send it a message, and that requires at the very least accessing the isa class pointer. But we know the isa pointer really was written to memory first, before the sharedObject pointer, and thread B got ahold of the sharedObject pointer, so logically, the isa pointer should be written, right? The laws of physics require it! Isn't that, like, you put an object in a box and hand it to me, and then I open the box to find that you haven't put something into it yet! It's a temporal paradox!

The answer is that, yes, amazingly, dependent reads like that can be performed seemingly out of order, but not on any processors that Apple ships. I've only heard of it happening on - you guessed it - the Alpha. Crazy, huh?

So where should the memory barrier go? The goal is to order future reads - reads that occur after this sharedObject function returns - against the read from the sharedObject variable. So it's gotta go here:

+ getSharedObject {
    static id sharedObject;
    if (! sharedObject) {
        LOCK;
        if (! sharedObject) {
            id temp = [[self alloc] init];
            OSMemoryBarrier();
            sharedObject = temp;
        }
        UNLOCK;
    }
    OSMemoryBarrier();
    return sharedObject;
}

Now, this differs slightly from the usual solution, which stores the static variable into a temporary in all cases. However, for the life of me I can't figure out why that's necessary - the placement of the second memory barrier above seems correct to me, assuming the compiler doesn't hoist the final read of sharedObject above the memory barrier (which it shouldn't). If I screwed it up, let me know how, please!

Do we want it?

That second memory barrier makes the double checked lock correct - but is it wise? As we discussed, it's not technically necessary on any machine you or I are likely to encounter. And, after all, it does incur a real performance hit if we leave it in. What to do?

The Linux kernel defines a set of fine-grained barrier macros that get compiled in or out appropriately (we would want a "data dependency barrier" in that case). You could go that route, but my suggestion is to just leave a semi-standard comment to help you locate these places in the future. That will help future-proof your code, but more importantly, it forces you to reason carefully about the threading issues, and to record your thoughts. You're more likely to get it right.

+ getSharedObject {
    static id sharedObject;
    if (! sharedObject) {
        LOCK;
        if (! sharedObject) {
            id temp = [[self alloc] init];
            OSMemoryBarrier();
            sharedObject = temp;
        }
        UNLOCK;
    }
    /* data dependency memory barrier here */
    return sharedObject;
}
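
For comparison, the Linux-kernel-style route mentioned above boils down to a macro that compiles to a real instruction only on architectures that need it - something like this sketch (the macro name and details are mine; the kernel's own equivalent is smp_read_barrier_depends()):

/* A data-dependency read barrier: a no-op everywhere except on
   architectures, like the Alpha, that can reorder dependent loads. */
#if defined(__alpha__)
#define data_dependency_barrier() __asm__ volatile ("mb" ::: "memory")
#else
#define data_dependency_barrier() do { } while (0)
#endif
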
Wrapping up!
Skimmers skip to here.

Now are we done?

I think so, Mr. Subheading. Let's see if we can summarize all this:


  • The compiler and the processor both conspire to defeat your threads by moving your code around! Be warned and wary! You will have to do battle with both.
  • Even so, it is very easy to mask serious threading bugs. We had to work hard, even in highly contrived circumstances, to get our bug to poke its head out even occasionally.
  • Ergo, testing probably won't catch these types of bugs. That makes it more important to get it right the first time.
  • Locks are the heavy tanks of threading tools - powerful, but slow and expensive, and if you're not careful, you'll get yourself stuck in a deadlock.
  • Memory barriers are a faster, non-blocking, deadlock free alternative to locks. They take more thought, and aren't always applicable, but your code'll be faster and scale better.
  • Memory barriers always come in logical pairs or more. Understanding where the second barrier has to go will help you reason about your code, even if that particular architecture doesn't require a second barrier.

Further reading

Seriously? You want to know more? Ok - the best technical source I know of is actually a document called "memory-barriers.txt" that comes with the Linux kernel source, in its Documentation directory. Thanks to my co-worker for finding it and directing me to it.

Things I wanna know

I'm still scratching my head about some things. Maybe you can help me out.
  • Why is x86_64 strongly ordered? Is my theory about gaining a competitive edge over Itanium reasonable?
  • Is my double checked lock right, even though it doesn't use a temporary variable in the usual place?
  • What's up with the so-called "nontemporal" streaming instructions on x86?
Leave a comment if you know! Thanks!