Fish tank with Puck the Fish Fish tank with Puck the Fish Water on the floor spelling Fish
 

Barrier

February 17th, 2007

"That’s a lot of cores. And while 80-core floating point monsters like that aren’t likely to show up in an iMac any time soon, multicore chips in multiprocessor computers are here today. Single chip machines are so 2004. Programmers better get crackin’. The megahertz free ride is over – and we have work to do."

There, that’s the noisome little pitch everyone’s been spreading like so much thermal paste. As if multiprocessing is something new! But of course it’s not – heck, I remember Apple shipping dualies more than ten years ago, as the Power Macintosh 9500. Multiprocessing is more accessible (read: cheaper) now, but it’s anything save new. It’s been around long enough that we should have it figured out by now.

So what’s my excuse? I admit it – I don’t get multiprocessing, not, you know, really get it, and that’s gone on long enough. It’s time to get to the bottom of it – or if not to the bottom, at least deep enough that my ears start popping.

Threadinology

Where to start, where to start…well, let’s define our terms. Ok, here’s the things I mean when I say the following, uh, things:

  • Threads are just preemptively scheduled contexts of execution that share an address space. But you already know what threads are. Frankly, for my purposes, they’re all pretty much the same whether you’re using Objective-C or C++ or Java on Mac OS X or Linux or Windows…
  • Threading means creating multiple threads. But you often create multiple threads for simpler control flow or to get around blocking system calls, not to improve performance through true simultaneous execution.
  • Multithreading is the physically simultaneous execution of multiple threads for increased performance, which requires a dualie or more. Now things get hard.

Yeah, I know. "Multithreading is hard" is a cliché, and it bugs me, because it is not some truism describing a fundamental property of nature, but it’s something we did. We made multithreading hard because we optimized so heavily for the single threaded case.

What do I mean? Well, processor speeds outrun memory so much that we started guessing at what’s in memory so the processor doesn’t have to waste time checking. "Guessing" is a loaded term; a nicer phrase might be "make increasingly aggressive assumptions" about the state of memory. And by "we," I mean both the compiler and the processor – both make things hard, as we’ll see. We’ll figure this stuff out – but they’re going to try to confuse us. Oh well. Right or wrong, this is the bed we’ve made, and now we get to lie in it.

while (1) {
   x++;
   y++;
}

x should always be at least as big as y, right? Right?

Blah blah. Let’s look at some code. We have two variables, variable1 and variable2, that both start out at 0. The writer thread will do this:

Writer thread

while (1) {
   variable1++;
   variable2++;
}

They both started out at zero, so therefore variable1, at every point in time, will always be the same as variable2, or larger – but never smaller. Right?

The reader thread will do this:

Reader thread

while (1) {
   local2 = variable2;
   local1 = variable1;
   if (local2 > local1) {
      print("Error!");
   }
}

That’s odd – why does the reader thread load the second variable before the first? That’s the opposite order from the writer thread! But it makes sense if you think about it.

See, it’s possible that variable1 and/or variable2 will be incremented by the writer thread between the loads from the reader thread. If variable2 gets incremented, that doesn’t matter – variable2 has already been read. If variable1 gets incremented, then that makes variable1 appear larger. So we conclude that variable2 should never be seen as larger than variable1, in the reader thread. If we loaded variable1 before variable2, then variable2 might be incremented after the load of variable1, and we would see variable 2 as larger.

Analogy to the rescue. Imagine some bratty kid playing polo while his uptight father looks on with a starched collar and monocle (which can only see one thing at a time, and not even very well, dontcha know). The kid smacks the ball, and then gallops after it, and so on. Now the squinting, snobby father first finds the ball, then looks around for his sproggen. But! If in the meantime, the kid hits it and takes off after it, Dad will find the kid ahead of where he first found the ball. Dad doesn’t realize the ball has moved, and concludes that his budding athlete is ahead of the ball, and running the wrong way! If, however, Dad finds the kid first, and then the ball, things will always appear in the right order, if not the right place. The order of action (move ball, move kid) has to be the opposite from the order of observation (find kid, find ball).

Threadalogy

So we’ve got two threads operating on two variables, and we think we know what’ll happen. Let’s try it out, on my G5:

unsigned variable1 = 0;
unsigned variable2 = 0;

#define ITERATIONS 50000000

void *writer(void *unused) {
        for (;;) {
                variable1 = variable1 + 1;
                variable2 = variable2 + 1;
        }
}

void *reader(void *unused) {
        struct timeval start, end;
        gettimeofday(&start, NULL);
        unsigned i, failureCount = 0;
        for (i=0; i < ITERATIONS; i++) {
                unsigned v2 = variable2;
                unsigned v1 = variable1;
                if (v2 > v1) failureCount++;
        }
        gettimeofday(&end, NULL);
        double seconds = end.tv_sec + end.tv_usec / 1000000. - start.tv_sec - start.tv_usec / 1000000.;
        printf("%u failure%s (%2.1f percent of the time) in %2.1f seconds\n",
               failureCount, failureCount == 1 ? "" : "s",
               (100. * failureCount) / ITERATIONS, seconds);
        exit(0);
        return NULL;
}

int main(void) {
        pthread_t thread1, thread2;
        pthread_create(&thread1, NULL, writer, NULL);
        pthread_create(&thread2, NULL, reader, NULL);
        for (;;) sleep(1000000);
        return 0;
}

What do we get when we run this?

fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
How do we know that the reader thread won’t see a variable in some intermediate state, midway through being updated? We have to know that these particular operations are atomic. On PowerPC and x86 machines, 32 bit writes to aligned addresses are guaranteed atomic. Other types of memory accesses are not always atomic – in particular, 64 bit writes (say, of a double precision floating point value) on a 32 bit PowerPC are not atomic. We have to check the documentation to know.

So, we’re done?

Our expectations were confirmed! The writer thread ordered its writes so that the first variable would always be at least as big as the second, and the reader thread ordered its reads the opposite way to preserve that invariant, and everything worked as planned.

But we might just be getting lucky, right? I mean, if thread1 and thread2 were always scheduled on the same processor, then we wouldn’t see any failures – a processor is always self-consistent with how it appears to order reads and writes. In other words, a particular processor remembers where and what it pretends to have written, so if you read from that location with that same processor, you get what you expect. It’s only when you read with processor1 from the same address where processor2 wrote – or pretended to write – that you might get into trouble.

So let’s try to force thread1 and thread2 to run on separate processors. We can do that with the utilBindThreadToCPU() function, in the CHUD framework. That function should never go in a shipping app, but it’s useful for debugging. Here it is:


void *writer(void *unused) {
        utilBindThreadToCPU(0);
        for (;;) {
                variable1 = variable1 + 1;
                variable2 = variable2 + 1;
        }
}

void *reader(void *unused) {
        utilBindThreadToCPU(1);
        struct timeval start, end;
        gettimeofday(&start, NULL);
        ...

int main(void) {
        pthread_t thread1, thread2;
        chudInitialize();
        unsigned variable2 = 0;
        pthread_create(&thread1, NULL, writer, &variable2);
        pthread_create(&thread2, NULL, reader, &variable2);
        while (1) sleep(1000000);
        return 0;
}

To run it:

fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds

NOW are we done?

Still no failures. Hmm… But wait – processors operate on cache lines, and variable1 and variable2 are right next to each other, so they probably share the same cache line – that is, they get brought in together and treated the same by each processor. What if we separate them? We’ll put one on the stack and leave the other where it is.

unsigned variable1 = 0;

#define ITERATIONS 50000000

void *writer(unsigned *variable2) {
        utilBindThreadToCPU(0);
        for (;;) {
                variable1 = variable1 + 1;
                *variable2 = *variable2 + 1;
        }
        return NULL;
}

void *reader(unsigned *variable2) {
        utilBindThreadToCPU(1);
        struct timeval start, end;
        gettimeofday(&start, NULL);
        unsigned i;
        unsigned failureCount = 0;
        for (i=0; i < ITERATIONS; i++) {
                unsigned v2 = *variable2;
                unsigned v1 = variable1;
                if (v2 > v1) failureCount++;
        }
        gettimeofday(&end, NULL);
        double seconds = end.tv_sec + end.tv_usec / 1000000. - start.tv_sec - start.tv_usec / 1000000.;
        printf("%u failure%s (%2.1f percent of the time) in %2.1f seconds\n", failureCount, failureCount == 1 ? "" : "s", (100. * failureCount) / ITERATIONS, seconds);
        exit(0);
        return NULL;
}

int main(void) {
        pthread_t thread1, thread2;
        unsigned variable2 = 0;
        chudInitialize();
        pthread_create(&thread1, NULL, writer, &variable2);
        pthread_create(&thread2, NULL, reader, &variable2);
        while (1) sleep(1000000);
        return 0;
}

So now, one variable is way high up on the stack and the other is way down low in the .data section. Does this change anything?

fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds

Still nothing! I’m not going to have an article after all! Arrrghhh **BANG BANG BANG BANG**

fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
fish ) ./a.out
50000000 failures (100.0 percent of the time) in 0.1 seconds

Hey, there it is! Most of the time, every test passes, but that last time, every test failed.

Our Enemy the Compiler

The lesson here is something you already knew, but I’ll state it anyways: Multithreading bugs are very delicate. There is a real bug here, but it was masked by the fact that the kernel scheduled them on the same processor, and then by the fact that the variables were too close together in memory, and once those two issues were removed, (un)lucky timing usually masked the bug anyways. In fact, if I didn’t know there was a bug there, I’d never have found it – and I still have my doubts!

So first of all, why would every test pass or every test fail? If there’s a subtle timing bug, we’d expect most tests to pass, with a few failing – not all or nothing. Let’s look at what gcc is giving us for the reader function:

        lis r9,0x2fa
        ori r2,r9,61568
        mtctr r2
L8:
        bdnz L8
        lis r2,0x2fa
        ori r2,r2,61568
        mullw r2,r0,r2

Hey! The entire extent of that big long 50 million iteration loop has been hoisted out, leaving just the blue bits – essentially fifty million no-ops. Instead of adding one or zero each time through the loop, it calculates the one or zero once, and then multiplies it by 50 million.

gcc is loading from variable1 and variable2 exactly once, and comparing them exactly once, and assuming their values do not change throughout the function – which would be a fine assumption if there weren’t also other threads manipulating those variables.

This is an example of what I mentioned above, about the compiler making things difficult by optimizing so aggressively for the single threaded case.

Well, you know the punchline – to stop gcc from optimizing aggressively, you use the volatile keyword. So let’s do that:

volatile unsigned variable1 = 0;

#define ITERATIONS 50000000

void *writer(volatile unsigned *variable2) {
        utilBindThreadToCPU(0);
        for (;;) {
                variable1 = variable1 + 1;
                *variable2 = *variable2 + 1;
        }
        return NULL;
}

void *reader(volatile unsigned *variable2) {
        utilBindThreadToCPU(1);
        struct timeval start, end;
        ...

What does this change get us?

fish ) ./a.out
12462711 failures (24.9 percent of the time) in 3.7 seconds

It’s much slower (expected, since volatile defeats optimizations), but more importantly, it fails intermittently instead of all or nothing. Inspection of the assembly shows that gcc is generating the straightforward sequence of loads and stores that you’d expect.

Our Enemy the Processor

Is this really the cross-processor synchronization issues we’re trying to investigate? We can find out by binding both threads to the same CPU:


void *writer(unsigned *variable2) {
        utilBindThreadToCPU(0);
        ...

void *reader(unsigned *variable2) {
        utilBindThreadToCPU(0);
        ...
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.4 seconds

The tests pass all the time – this really is a cross-processor issue.

So somehow variable2 is becoming larger than variable1 even though variable1 is always incremented first. How’s that possible? It’s possible that the writer thread, on processor 0, is writing in the wrong order – it’s writing variable2 before variable1 even though we explicitly say to write variable1 first. It’s also possible that the reader thread, on processor 1, is reading variable1 before variable 2, even though we tell it to do things in the opposite order. In other words, the processors could be reading and writing those variables in any order they feel like instead of the order we tell them to.

Pop and Lock?

What’s the usual response to cross-processor synchronization issues like this? A mutex! Let’s try it.

fish ) ./a.out
0 failures (0.0 percent of the time) in 479.5 seconds

It made the tests pass, all right – but it was 130 times slower! A spinlock does substantially better, at 20 seconds, but that’s still 440% worse than no locking – and spinlocks won’t scale. Surely we can do better.

Even the kitchen

Our problem is this: our processors are doing things in a different order than we tell them to, and not informing each other. Each processor is only keeping track of its own shenanigans! For shame! We know of two super-horrible ways to fix this: force both threads onto the same CPU, which is a very bad idea, or to use a lock, which is a very slow idea. So what’s the right way to make this work?

What we really want is a way to turn off the reordering for that particular sequence of loads and stores. They don’t call it "turn off reordering", of course, because that might imply that reordering is bad. So instead they call it just plain "ordering". We want to order the reads and writes. Ask and ye shall receive – the mechanism for that is called a "memory barrier".

And boy, does the PowerPC have them. I count at least three: sync, lwsync, and the hilariously named eieio. Here’s what they do:

  • sync is the sledgehammer of the bunch – it orders all reads and writes, no matter what. It works, but it’s slow.
  • lwsync (for "lightweight sync") is the newest addition. It’s limited to plain ol’ system memory, but it’s also faster than sync.
  • eieio ("Enforce In-Order execution of I/O") is weird – it orders writes to "device" memory (like a memory mapped peripheral) and regular ol’ system memory, but each separately. We only care about system memory, and IBM says not to use eieio just for that. Nevertheless, it should still order our reads and writes like we want.

Because we’re not working with devices, lwsync is what we’re after. Processor 0 is writing variable2 after variable1, so we’ll insert a memory barrier to prevent that:

Do we need a memory barrier after the write to variable2 as well? No – that would guard against the possibility of the next increment landing on variable1 before the previous increment hits variable2. But the goal is to make sure that variable1 is larger than variable2, so it’s OK if that happens.
volatile unsigned variable1 = 0;

#define barrier() __asm__ volatile ("lwsync")

#define ITERATIONS 50000000

void *writer(volatile unsigned *variable2) {
        utilBindThreadToCPU(0);
        for (;;) {
                variable1 = variable1 + 1;
                barrier();
                *variable2 = *variable2 + 1;
        }
        return NULL;
}

So! Let’s run it!

fish ) ./a.out
260 failures (0.0 percent of the time) in 0.9 seconds

So we reduced the failure count from 12462711 to 260. Much better, but still not perfect. Why are we still failing at times? The answer, of course, is that just because processor 0 writes in the order we want is no guarantee that processor1 will read in the desired order. Processor 1 may issue the reads in the wrong order, and processor 0 would write in between those two reads. We need a memory barrier in the reader thread, to force the reads into the right order as well:

void *reader(volatile unsigned *variable2) {
        struct timeval start, end;
        utilBindThreadToCPU(1);
        gettimeofday(&start, NULL);
        unsigned i;
        unsigned failureCount = 0;
        for (i=0; i < ITERATIONS; i++) {
                unsigned v2 = *variable2;
                barrier();
                unsigned v1 = variable1;
                if (v2 > v1) failureCount++;
        }
        gettimeofday(&end, NULL);
        double seconds = end.tv_sec + end.tv_usec / 1000000. - start.tv_sec - start.tv_usec / 1000000.;
        printf("%u failure%s (%2.1f percent of the time) in %2.1f seconds\n",
               failureCount, failureCount == 1 ? "" : "s",
               (100. * failureCount) / ITERATIONS, seconds);
        exit(0);
        return NULL;
}
fish ) ./a.out
0 failures (0.0 percent of the time) in 4.2 seconds

That did it!

The lesson here is that if you care about the order of reads or writes by one thread, it’s because you care about the order of writes or reads by another thread. Both threads need a memory barrier. Memory barriers always come in pairs, or triplets or more. (Of course, if both threads are in the same function, there may only be one memory barrier that appears in your code – as long as both threads execute it.)

This should not come as a surprise: locks have the same behavior. If only one thread ever locks, it’s not a very useful lock.

31 Flavors

What’s that? You noticed that the PowerPC comes with three different kinds of memory barriers. Right – as reads and writes get scheduled increasingly out of order, the more expensive it becomes to order them – so the PowerPC allows you to request various less expensive partial orderings, for performance. Processors that schedule I/O out of order more aggressively offer even more barrier flavors. At the extreme end is the DEC Alpha, that sports read barriers with device memory ordering, read barriers without, write barriers with, write barriers without, page table barriers, and various birds in fruit trees. The Alpha’s memory model guarantees so little that it is said to define the Linux kernel memory model – that is, the set of available barriers in the kernel source match the Alpha’s instruction set. (Of course, many of them get compiled out when targetting a different processor.)

Actually – and here my understanding is especially thin – while x86 is strongly ordered in general, I believe that Intel has managed to slip some weakly ordered operations in sideways, through the SIMD unit. These are referred to as "streaming" or "nontemporal" instructions. And when writing to specially tagged "write combining" memory, like, say, memory mapped VRAM, the rules are different still.

And on the other end, we have strongly ordered memory models that do very little reordering, like the – here it comes – x86. No matter how many times I run that code, even on a double-dualie Intel Mac Pro, I never saw any failures. Why not? My understanding (and here it grows fuzzy) is that early multiprocessing setups were strongly ordered because modern reordering tricks weren’t that useful – memory was still pretty fast, y’know, relatively speaking, so there wasn’t much win to be had. So developers blithely assumed the good times would never end, and we’ve been wearing the backwards compatibility shackles ever since.

But that doesn’t answer the question of why x86_64, y’know, the 64 bit x86 implementation in the Core 2s and all, isn’t more weakly ordered – or at least, reserve the right to be weaker. That’s what IA64 – remember Itanium? – did: early models were strongly ordered, but the official memory model was weak, for future proofing. Why didn’t AMD follow suit with x86_64? My only guess (emphasis on guess) is that it was a way of jockeying for position against Itanium, when the 64 bit future for the x86 was still fuzzy. AMD’s strongly ordered memory model means better compatibility and less hair-pulling when porting x86 software to 64 bit, and that made x86_64 more attractive compared to the Itanium. A pity, at least for Apple, since of course all of Apple’s software runs on the weak PowerPC – there’s no compatibility benefit to be had. So it goes. Is my theory right?

Makin’ a lock, checkin’ it twice

Ok! I think we’re ready to take on that perennial bugaboo of Java programmers – the double checked lock. How does it go again? Let’s see it in Objective-C:

+ getSharedObject {
    static id sharedObject;
    if (! sharedObject) {
        LOCK;
        if (! sharedObject) {
            sharedObject = [[self alloc] init];
        }
        UNLOCK;
    }
    return sharedObject;
}

What’s the theory? We want to create a single shared object, exactly once, while preventing conflict between multiple threads. The hope is that we can do a quick test to avoid taking the expensive lock. If the static variable is set, which it will be most of the time, we can return the object immediately, without taking the lock.

Sometimes memory barriers are needed to guard against past or future reads and writes that occur in, say, the function that’s calling your function. Reordering can cross function and library boundaries!

This sounds good, but of course you already know it’s not. Why not? Well, if you’re creating this object, you’re probably initializing it in some way – at the very least, you’re setting its isa (class) pointer. And then you’re turning around and writing it back to the sharedObject variable. But these can happen in any order, as seen from another processor – so when the getSharedObject method is called from some other processor, it can see the sharedObject variable as set, and happily return the object before its class pointer is even valid. Cripes.

But now you know we have the know-how to make this work, no? How? The problem is that we need to order the writes within the alloc and init methods relative to the write to the sharedObject variable – the alloc and init writes must come first, the write to sharedObject last. So we store the object into a temporary local variable, insert a memory barrier, and then copy from the temporary to the shared object. This time, I’ll use Apple’s portable memory barrier function:

+ getSharedObject {
    static id sharedObject;
    if (! sharedObject) {
        LOCK;
        if (! sharedObject) {
            id temp = [[self alloc] init];
            OSMemoryBarrier();
            sharedObject = temp;
        }
        UNLOCK;
    }
    return sharedObject;
}

There! Now we’re guaranteed that the initializing thread really will write to sharedObject after the object is fully initialized. All done.

Hmm? Oh, nuts! I forgot my rule – write barriers come in pairs. If thread A initializes the object, it goes through a memory barrier, but if thread B then comes along, it will see the object and return it without any barrier at all. Our rule tells us that something is wrong, but what? Why’s that bad?

Well, thread B’s going to do something with the shared object, like send it a message, and that requires at the very least accessing the isa class pointer. But we know the isa pointer really was written to memory first, before the sharedObject pointer, and thread B got ahold of the sharedObject pointer, so logically, the isa pointer should be written, right? The laws of physics require it! Isn’t that, like, you put an object in a box and hand it to me, and then I open the box to find that you haven’t put something into it yet! It’s a temporal paradox!

The answer is that, yes, amazingly, dependent reads like that can be performed seemingly out of order, but not on any processors that Apple ships. I’ve only heard of it happening in the – you guessed it – the Alpha. Crazy, huh?

So where should the memory barrier go? The goal is to order future reads – reads that occur after this sharedObject function returns – against the read from the sharedObject variable. So it’s gotta go here:

+ getSharedObject {
    static id sharedObject;
    if (! sharedObject) {
        LOCK;
        if (! sharedObject) {
            id temp = [[self alloc] init];
            OSMemoryBarrier();
            sharedObject = temp;
        }
        UNLOCK;
    }
    OSMemoryBarrier();
    return sharedObject;
}

Now, this differs slightly from the usual solution, which stores the static variable into a temporary in all cases. However, for the life of me I can’t figure out why that’s necessary – the placement of the second memory barrier above seems correct to me, assuming the compiler doesn’t hoist the final read of sharedObject above the memory barrier (which it shouldn’t). If I screwed it up, let me know how, please!

Do we want it?

That second memory barrier makes the double checked lock correct – but is it wise? As we discussed, it’s not technically necessary on any machine you or I are likely to encounter. And, after all, it does incur a real performance hit if we leave it in. What to do?

The Linux kernel defines a set of fine-grained barrier macros that get compiled in or out appropriately (we would want a "data dependency barrier" in that case). You could go that route, but my suggestion is to just leave a semi-standard comment to help you locate these places in the future. That will help future-proof your code, but more importantly, it forces you to reason carefully about the threading issues, and to record your thoughts. You’re more likely to get it right.

+ getSharedObject {
    static id sharedObject;
    if (! sharedObject) {
        LOCK;
        if (! sharedObject) {
            id temp = [[self alloc] init];
            OSMemoryBarrier();
            sharedObject = temp;
        }
        UNLOCK;
    }
    /* data dependency memory barrier here */
    return sharedObject;
}
Wrapping up!
Skimmers skip to here.

Now are we done?

I think so, Mr. Subheading. Let’s see if we can summarize all this:


Locks are like tanks – powerful, slow, safe, expensive, and prone to getting you stuck.
  • The compiler and the processor both conspire to defeat your threads by moving your code around! Be warned and wary! You will have to do battle with both.
  • Even so, it is very easy to mask serious threading bugs. We had to work hard, even in highly contrived circumstances, to get our bug to poke its head out even occasionally.
  • Ergo, testing probably won’t catch these types of bugs. That makes it more important to get it right the first time.
  • Locks are the heavy tanks of threading tools – powerful, but slow and expensive, and if you’re not careful, you’ll get yourself stuck in a deadlock.
  • Memory barriers are a faster, non-blocking, deadlock free alternative to locks. They take more thought, and aren’t always applicable, but your code’ll be faster and scale better.
  • Memory barriers always come in logical pairs or more. Understanding where the second barrier has to go will help you reason about your code, even if that particular architecture doesn’t require a second barrier.

Further reading

Seriously? You want to know more? Ok – the best technical source I know of is actually a document called “memory-barriers.txt” that comes with the Linux kernel source. You can get it here. Thanks to my co-worker for finding it and directing me to it.

Things I wanna know

I’m still scratching my head about some things. Maybe you can help me out.

  • Why is x86_64 strongly ordered? Is my theory about gaining a competitive edge over Itanium reasonable?
  • Is my double checked lock right, even though it doesn’t use a temporary variable in the usual place?
  • What’s up with the so-called "nontemporal" streaming instructions on x86?

Leave a comment if you know! Thanks!

 

The Internet!

π = 3.2860203432

x86-64’s strong ordering was a practical rather than a philosophical consideration.

Basically, AMD found themselves between a rock and a hard place. They had based their company around producing very good IA32-compliant processors. And now Intel and IBM were moving ahead into 64-bit computing with entirely new architectures. Heavily patented new architectures. AMD has licenses to produce Intel’s 80486, 80386, etc. CPUs. They don’t have licenses for the Pentium line. The Athlon line was AMD moving out of the role of secondary manufacturer and into the role of designer by producing their own superscalar implementations of the IA32 ISA.

However, since they have no licenses for them, and are unlikely to get licenses, a competitive AMD-produced IA64- or PowerPC-compatible processor would require designing completely original processors that are both fully compatible and better performing than the offerings from Intel or IBM. Which would be both very difficult and very expensive to do.

Hence, AMD opted to extend the life of the x86 architecture by extending it to include 64-bit addressing. They attempted to do that with the fewest number of changes. he basic premise being that fewer changes in the ISA would mean fewer changes required in compilers and operating systems to support the new ISA. And that fewer changes would give them the largest amount of backwards-compatibility to x86-32 ‘for free’.

Which is why x86-64 is strongly-ordered, has a small number of registers, and all the other x86 evils programmers taught MIPS and PowerPC assembly in school complain about…

Vincent Bernardi

Thanks for a very interesting post! An article as well written as this one deserves a lot of publicity. You should make a pdf version of it; it would make a very good teaching resource.

Thanks again,
V.

Soeren

Great post (as always). Interesting and entertaining.
Thanks!

Jens Ayton

(Wow, this comment text is pretty small. The usual web designer’s false assumption that I lied when I told the browser what size text I want, taken to extremes. Bah, humbug.)

Nitpicks:
shouldn’t “Processor 0 is writing variable2 after variable1, so we’ll insert a memory barrier to prevent that:” say “…to ensure that…”?

The link “usual solution” [http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html] is broken.

The problem you ran into under “Our Enemy the Compiler” is an important one for test cases and benchmarks in general. I remember fondly the time someone who shall remain anonymous (because I can’t be bothered to dig through IRC logs) “proved” that long doubles are as fast as doubles on PowerPCs. His project produced similar assembly.

Remco

Thanks for writing this piece. I enjoyed and learned a lot.

You missed one important point. When you introduced your memory barrier, the construct you used (__asm__ volatile) also instructed gcc not to optimize loads or stores across that barrier. In absence of that kind of barrier (which is sometimes called an “optimization barrier”) the compiler is free to reorder loads and stores, as long as the apparent behavior is the same, which can thwart your attempts to write lock-free threads.

http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Extended-Asm.html

Remco

While I’m at it I have a question: Are there high-level computer languages around that deal with multi cores implicitly and can generate optimized and multi-threaded code?

Ugh, threads make my head spin (excuse the pun).

I’m not entirely sure, but I believe something like io, with it’s actor model, avoids a number of these strange memory problems.

Great article and a very interesting read. I think that how memory barriers work and why we need them can be very difficult for one to wrap one’s head around; at some point we all end up writing an article about them to help us sort it out. This one is mine.

E

In the last example, could you just use an atomic compare & swap function rather than the memory barrier?

Great post. Very informative and well written.

James Bailey

I just wanted to point out that the double checked locking problem in Java is now fixed with the correct semantics on volatile when using Java 5.

http://en.wikipedia.org/wiki/Double-checked_locking

// Works with acquire/release semantics for volatile
// Broken under Java 1.4 and earlier semantics for volatile
class Foo {
private volatile Helper helper = null;
public Helper getHelper() {
if (helper == null) {
synchronized(this) {
if (helper == null)
helper = new Helper();
}
}
return helper;
}
}

It’s not really agressive optimization for the single-threaded case that’s our enemy. The main problem is that all (widely used) computers in existence today have a von-Neumann architecture. As a result, you have registers, L1 cache, L2 cache, etc…

In other words, data, at any given time, lives in several places. Which is a recipe for disaster. Keeping data consistent in multiple places is painful, even in the single-threaded case. It’s just that we’ve figured out how to let the hardware do it in most cases when we’re single-threaded.

Remco: To get really fast, highly optimized code you need to make assumptions about the hardware — those assumptions tend to break the program eventually. To get a program that operates properly at all times, you need to avoid all of those assumptions and thus get slow programs. Now, that’s a broad generalization, but I think it tends towards true.

There is a lot of work going into Erlang ( http://en.wikipedia.org/wiki/Erlang_programming_language ) at the moment. Designed from the ground up as a concurrent language it tries to codify safe practices directly into the language. (Although that didn’t work out too well for Ada and it’s ironclad type safety.)

ken

Colin:

IO doesn’t solve any of this with its actors, because its actors all run on a single OS thread. It cannot take advantage of multiple cores, and there is no real concurrency.

To the extent that the actor model does solve this, it’s because access to an object’s data is always done from the actor’s one thread. That will take care of anything where all of the ordering-sensitive data is owned by a single actor.

Trackbacks don’t seem to work on your site, therefore let me point you to my response here:

http://www.thinkingparallel.com/2007/02/19/please-dont-rely-on-memory-barriers-for-synchronization/

Short version: please add big, fat warning signs to this article that it is not for beginners, but for advanced readers with experience with concurrent programming only! Thanks!

HÃ¥kan W

First and foremost: Very interesting and informative post, thank you!

I think the fact that you have to go to such great lengths in C to get true multithreaded code /bug-free/ is a very heavy argument that C-derivatives will not be interesting in a future where you want to really harness the power of many processors. It simply doesn’t seem realistic, just like sane people don’t write desktop apps in assembly (any more).

Erlang uses the Actor model, and its runtime distributes computation and IO across processes accordingly. It’s quite prepared for the multicore future. (It might not be the best solution for multithreading, but it’s far, far better than trying to haemorrhage multithreading support into C/C++/Objective-C/etc.)

Please, please read Edward Lee’s excellent paper “The Problem with Threads” and think _very_ hard about using shared-state concurrency before dooming yourself to spending the rest of your life in gdb typing ‘thread apply bt all’ over and over. I have some of my own commentary on it at . There are other ways to do concurrency besides shared-state concurrency.

I normally love ridiculous_fish’s blog entries, but encouraging using shared-state concurrency in this day and age is a dangerous idea. Just say no to using mutexes, memory barriers and locks, and try to use share-nothing with message-passing or _any_ form of concurrency instead.

The designers of the Lua language said it best:

“… we did not (and still do not) believe in the standard multithreading model, which is preemptive concurrency with shared memory: we still think that no one can write correct programs in a language where ‘a=a+1’ is not deterministic.”

Found a reply to your post on reddit.com that basically says, “Memory barriers are for those who really know what they’re doing!”: http://www.thinkingparallel.com/2007/02/19/please-dont-rely-on-memory-barriers-for-synchronization/

Glen Simmons

Picking a nit – In the section “Makin’ a lock, checkin’ it twice”, you need to initialize your static variable to nil.

Chris

Yes, threading is hard, but I believe it’s hard because people continue to think about it in a serial way. I’ve been writing lots of threaded and parallel programs over the last couple of years, and I’ve had the best luck in writing programs that dump dependence upon locks and contracts.

I’ve adopted the actor model, which lets threads communicate, but not at the expense of locking for the communication. I’ve found that letting a bunch of actors work on tasks at their own pace, with non-blocking communication (usually by some sort of director) to check on when things are done avoids synchronization problems, and also takes advantage of whatever system level optimizations the compiler or operating system are trying to perform.

Not a programmer myself

Anyone have any insight into how much easier Mac OS X Leopard’s NSOperation and NSOperationqueue API’s make things?

They’re mentioned in Apple’s ‘Session 000 – Mac OS X State of the Union’ video, (free with a free Apple developers membership), roughly from 13:00:00 onwards.

Richard Kettering

For those who, like me, don’t know diddly about multi-threading, synchronization, and all that jazz, the following summarizes a few key things, and links to related topics.

http://en.wikipedia.org/wiki/Memory_barrier

Thanks for the article, btw!

Random guy

It’s not really agressive optimization for the single-threaded case that’s our enemy. The main problem is that all (widely used) computers in existence today have a von-Neumann architecture. As a result, you have registers, L1 cache, L2 cache, etc…

In other words, data, at any given time, lives in several places. Which is a recipe for disaster. Keeping data consistent in multiple places is painful, even in the single-threaded case. It’s just that we’ve figured out how to let the hardware do it in most cases when we’re single-threaded.

There are two different problems when dealing with shared-memory multiprocessing: the one you address, and the one this blog entry addresses. They’re related, but distinct, which is why they have two different labels:

The problem addressed by this entry is one of memory consistency (basically, what order memory accesses can happen in). Despite its importance when dealing with multithreading, it is largely a single-processor issue: the necessary ordering constraints are completely independent of other processors in the system. Memory consistency is handled by a combination of architectural restrictions (e.g. no reordering of memory accesses) and program annotations (e.g. memory fences).

The problem you address is one of memory coherence (making sure that multiple copies of data are consistent — which is annoying overloading of the term “consistent.”) This is largely handled in hardware through coherence protocols. Ideally, this is all under-the-hood, and only affects the performance of your program (maintaining memory coherence when there is a lot of sharing can be quite expensive), not the correctness.

Thanx for an article that goes deep into the guts of our computing environment.
I do make use of nolocking shared data and like the fact that it’s so much faster.
Even nowadays you can’t have enough processing power and even if it’s just to help the processor to go to sleep faster and SAVE ENERGY :-)

Erlang has worked out extremely well for me for concurrent apps. I would never dream of going back to godawful pthread stuff. If I need more performance I’d just start up more nodes (probably on other machines).

I don’t agree that it “codifies safe practices into the language”. It provides very good examples of what’s safe and what isn’t by way of the “standard library”, but it definitely lets you screw things up if you want to. Not in the same way as C, because it’s a functional language with single assignment, but it doesn’t try all that hard to protect you from yourself if you don’t know what you’re doing. You can write all of the non-tail recursive functions you want, have race conditions, mutable non-transactional state (via ets), etc.

That said, the way you generally use Erlang (OTP behaviors) is essentially problem free. I wrote a system with Erlang that handles several million HTTP requests per day per node (my *first* Erlang project). It doesn’t crash (even with arbitrary invalid input, errors are isolated and non-fatal), it works correctly, it only uses up about 40mb of RAM per node (there’s a few RAM databases that account for at least 90% of that), and it actually doesn’t even use measurable amounts of CPU (from top). When I want to fix bugs or add features I just upload new code and reload the module. In a running node. 0msec downtime. Stick that in your pthread and smoke it :)

Frank

Hmm … Erlang, io, … what about Scheme48? It’s said to have “a quite an advanced multi-threading system with user-defined schedulers, channels and software transactional memory” ( http://groups.google.com/group/fa.caml/msg/387160fd9a77a8e3 , http://s48.org/1.5/manual/manual-Z-H-9.html#node_chap_7 , http://www-pu.informatik.uni-tuebingen.de/users/sperber/papers/adding-threads.pdf )

Mark Tandel

A very different approach to parallel programming is the functional skeleton approach.

http://ocamlp3l.inria.fr/eng.htm
http://www.inria.fr/actualites/inedit/inedit57_indusb.en.html

Nathan

I’m a little confused by why the second OSMemoryBarrier() is required in the last example, but after some pondering, I think I’ve figured it out. If I’m right, perhaps you could expound on explanation a little to make it more clear for people like me.

In your example, thread B is running code like:

1. sharedObjectB = [ExampleObjectClass sharedObject];
2. [sharedObjectB exampleMethod];

What you are saying is that thread B will run line 1 and enter +sharedObject. But the processor running thread B, knowing that line 2 will access the sharedObject->isa pointer to call its method, thinks it is being helpful by reading sharedObject->isa while line 1 is still executing. So, line 2 reads the sharedObject->isa variable before thread A has initialized sharedObject. Thread A then runs [self init], and after it has finished, line 1 reads sharedObject. It is initialized (now), so it skips the OSMemoryBarrier() contained in the if statement, which would have forced line 2 to re-read sharedObject->isa (I assume). It then continues to line 2, finds the pre-initialized sharedObject->isa waiting for it and executes line 2 using whatever random value existed in memory before sharedObject->isa was initialized.

Now that I write it out, I’m pretty sure that is correct. Perhaps you could add something like:
“…not on any processors Apple ships. Thread B will read the sharedObject isa pointer before it reads the sharedObject pointer in anticipation of using it. Thus, it can read the isa pointer before thread A has initialized sharedObject and then read the sharedObject pointer after thread A has initilized it, causing it to skip the first OSMemoryBarrier and not re-read the isa pointer.”

Thanks for another great article. I just wish Apple would pay you to write more of them.

I too am a fan of avoiding shared state. I prefer to have threads communicate with one another via lock-free queues, but sometimes shared states is unavoidable.

In this case, I would think one could implement this method as:

+ getSharedObject {
static volatile id sharedObject;
id localSO = sharedObject;
if (! localSO) {
id temp = [[self alloc] init];
if (!CompareAndSwap(localSO, temp, &sharedObject)) {
// oops, someone beat us to the punch
[temp release];
}
}
return sharedObject;
}

Sam

The correct implementation of a singleton in Java does not require a double-checked lock or volatile keyword.

private static Singleton singleton = null ;

private Singleton(){}

public static synchronized Singleton getInstance()
{
if( singleton == null )
singleton = new Singleton() ;
return singleton ;
}

Nils

Great Post :)

I tried out the first example on my Linux box (intel core duo) with volatile and variable2 not next to the first in the stack. I got 0 failures all the time, but I guess this is because both threads run on the same core. I did not figure out how to move them to different cores and how to check it so far..

Phil

Just for the record:

Can Programming Be Liberated from the von Neumann Style?

http://www.stanford.edu/class/cs242/readings/backus.pdf

alex.r.

Sam:
That implementation may be correct but it is slow, because you are required to acquire a lock for every call to getInstance().

This is useless once the first call has been completed. Let’s pretend Thread 1 and 2 execute in parallel but Thread 1 calls getInstance() right before Thread 2:
Thread 1 : getInstance() : acquires Lock
Thread 2 : getInstance() : waits for lock

If the singleton variable is not null, Thread 1 won’t modify it so there’s no reason for Thread 2 to wait before entering the method.

The “double read” combined with “volatile” makes sure we only get the lock when we actually need it. Locking is an expansive thing to do for no reason.

Sam

alex.r.

Fair enough, you can actually replace the need for lazy instanciation, still better than a double-checked lock.

John D.

How about an efficient thread safe static initialization? Would this make a difference?

http://avitzur.hax.com/2006/11/efficient_thread_safe_static_i.html

Having seen the Lua and Io comments, I’ll add to the queue of comments.

Lua does not –as of its standard libraries– take a stand on how multithreading be made. Coroutines it offers are essentially within-the thread elements, and useful as such.

But with a small addon module to allow chunks of code ‘thrown’ to other Lua states, I can see the language scale easily to 1000s of cores, if so wanted. This development is surely favourable to small footprint languages such as it.

Anyone wanting to have a go at N core Lua, send me a line. :)

tomsci

Cripes, I’m a bit concerned this is the first I recall hearing about memory barriers! I (thought I) knew about out-of-order execution, but the multithreaded implications of that had escaped me.

V interesting post, this is something I’ll file under the category of “Things I hope I never need to use but if I do I now know to go and look it up before even attempting it myself”

Helluva post, Fish. Thanks.

Friskywitz

I got asked several questions pertaining to this exact subject on my third phone interview from Apple Computers.

I said mutex… and should have read this blog first. D’oh!

lwsync ? Wasn’t that called isync once ? :)

I thought that an UNLOCK would imply a sync and that therefore the memory barrier would be redundant ? At least that’s how I implemented it (isync in lock, sync in unlock).

dina

Thanks for this great post. I have been debugging code on multiple platforms and very often need to know what processor is used in the specific platform and what memory synchronizations are missing on that platform(since the bugs are usually caused when the programmer forgets to explicitly add barriers that are not implemented on the platform). Here is the small list I have and would love it if people can add to it:
1. x86, x86-64 : strongly ordered
2. SPARC : Writes are not ordered with respect to reads and writes.
3. Itanium : Writes are not ordered. Reads (?)
4. HPUX : ?

Mark Travis

I’m working on a project involving mutexes. I suspect that I’m doing it wrong, but I can’t seem to force data inconsistency in my own tests. I’m basically doing message passing of arbitrary data, as follows:

Here’s what I’m basically doing:

int srcpayload, destpayload;
int *srcptr, *queueptr, *destptr;

thread 1:
payload=5;
srcptr=&payload;

pthread_mutex_lock(&whatevermutex);
srcptr=queueptr;
pthread_mutex_unlock(&whatevermutex);

thread 2:
pthread_mutex_lock(&whatevermutex);
destptr=queueptr;
pthread_mutex_unlock(&whatevermutex);

destpayload=*destptr;

————————–

What I want is for destpayload to always equal srcpayload. But since I’m only copying pointers within mutex locks, I’m afraid that it’s possible that the assignment of the payload to the pointer might not be guaranteed to be correct on the destination side. Is pthreads’ mutex logic smart enough to ensure that any assignment of a value to a pointer’s destination is within a proper memory barrier? My development platform is Linux on amd64, but I want to do the right thing from a portability standpoint.

You seem like a guy (guys?) (gal?) (gal-guys?) who would know this.

Mark Travis

I typo’d my above pseudo-code.

thread 1 should read as follows:
payload=5;
srcptr=&payload;

pthread_mutex_lock(&whatevermutex);
queueptr=srcptr;
pthread_mutex_unlock(&whatevermutex);

————————-

Thanks

Mark Travis

I think I found an answer to my question–as long as proper fencing is done, such as mfence on x86, prior to the attempted read (LOAD) in thread 2, then the correct value should be referenced by the destination pointer.

Any mutex should do the trick, since I understand that pthread_mutex_lock calls a load-store barrier whenever it’s invoked.