<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/1.5" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
>

<channel>
	<title>ridiculous_fish</title>
	<link>http://ridiculousfish.com/blog</link>
	<description>serious code</description>
	<pubDate>Thu, 26 Apr 2007 05:44:08 +0000</pubDate>
	<generator>http://wordpress.org/?v=1.5</generator>
	<language>en</language>

		<item>
		<title>Buzz</title>
		<link>http://ridiculousfish.com/blog/archives/2007/04/25/buzz/</link>
		<comments>http://ridiculousfish.com/blog/archives/2007/04/25/buzz/#comments</comments>
		<pubDate>Wed, 25 Apr 2007 22:44:04 +0000</pubDate>
		<dc:creator>ridiculous_fish</dc:creator>
		
	<category>Uncategorized</category>
		<guid>http://ridiculousfish.com/blog/archives/2007/04/25/buzz/</guid>
		<description><![CDATA[Farewell, Buzz]]></description>
			<content:encoded><![CDATA[<p><a href="http://buzz.vox.com/library/post/leaving-apple.html">Buzz</a> is leaving Apple, has already left.  Buzz joined Apple two months after I joined, on the same team.  He was the first candidate I ever interviewed, although I can&#8217;t remember what I asked him.  Without Buzz&#8217;s encouragement and inspiration, I&#8217;d have never started this blog, and he was first to link to me.  So thank you and farewell, Buzz.</p>

<p>On another thread, I&#8217;ve fixed the <a href="/angband">Angband screensaver</a> to remember the last character it played with.  The user defaults had to be synchronized.  Thanks to everyone who pointed this out.</p>
]]></content:encoded>
			<wfw:commentRSS>http://ridiculousfish.com/blog/archives/2007/04/25/buzz/feed/</wfw:commentRSS>
	</item>
		<item>
		<title>Angband</title>
		<link>http://ridiculousfish.com/blog/archives/2007/04/13/42/</link>
		<comments>http://ridiculousfish.com/blog/archives/2007/04/13/42/#comments</comments>
		<pubDate>Fri, 13 Apr 2007 02:57:38 +0000</pubDate>
		<dc:creator>ridiculous_fish</dc:creator>
		
	<category>Uncategorized</category>
		<guid>http://ridiculousfish.com/blog/archives/2007/04/13/42/</guid>
		<description><![CDATA[So Angband iss back - so fissh, the fish, raw and wriggling, he hasst brought its back!]]></description>
			<content:encoded><![CDATA[<img src="/angband/images/gollum.png" style="float: left">

<b>Guest blogger: Gollum (Smeagol)</b>

<p>Where isst it?  So it iss back - so <i>fissh</i>, the fish, raw and wriggling, he hasst brought its back!

<p>Angband, the Hells of Iron, yest, the ancient ASCII roguelike, child spawn of Moria and VMS, it iss here once more.  We wants it!

<p>You hasst not seen it, Angband?  But perhapss you have seen one like it, the filthy, filthy <b>NetHack</b>, yess, full of such stupids and jokeses, and no Smeagol!  We hates it for ever!  But we loves Angband, yess, Angband has Smeagol and the fat hobbitses and yes, it has my preciouss!  We loves it, Angband!

<p>We hass thought it lost.  When OS X was a little child and hungry for software, we knows how it calls to <i>fissh</i>.  We knows how <i>fissh</i> missed Angband, dearly missed Smeagol, yes.  And we knows how <i>fissh</i> stoles it, and worked his tricksy little magic, so he could go down into a Carbonized Angband on OS X.  And we knows <i>fissh</i> did, and it was easy, Carbon madess it simple.

<p>But <i>fissh</i> hasst brought it to Cocoa, now, all fresh and it glitterses so subpixel pretty with Quartz, and resizesss so smooth, and animates with pretty graphics!  So precious to <i>fissh</i>.

<p>And ssuch love for Angband so <i>fissh</i> made a <b>borg screensaveres</b>!  So now you can visits Smeagol and wring the filthy little neck of Saruman and murderes the fat Morgoth in your sleep!  The screensaveres, yes!

<p><a href="/angband/">Goes there now</a>, fat hobbitses!  The <b>game</b> and the <b>screensaveres</b> and the <b>source codess</b>!  Angband need you, yess!

<div style="width: 216px; height: 227px; margin-left: auto; margin-right: auto; text-align: center;">
<a href="/angband/"><img src="/angband/images/logo_bitty.png" style="border: none"></a>
<a href="/angband/">Angband for Mac OS X</a>
</div>

<p><br /><b>Edit:</b> <i>fissh</i> has fixed the problem with the screensaveres needing Angband to have been launched first.  It should work no problems now.]]></content:encoded>
			<wfw:commentRSS>http://ridiculousfish.com/blog/archives/2007/04/13/42/feed/</wfw:commentRSS>
	</item>
		<item>
		<title>Barrier</title>
		<link>http://ridiculousfish.com/blog/archives/2007/02/17/barrier/</link>
		<comments>http://ridiculousfish.com/blog/archives/2007/02/17/barrier/#comments</comments>
		<pubDate>Sat, 17 Feb 2007 22:24:04 +0000</pubDate>
		<dc:creator>ridiculous_fish</dc:creator>
		
	<category>Programming</category>
		<guid>http://ridiculousfish.com/blog/archives/2007/02/17/barrier/</guid>
		<description><![CDATA[What's this multithreading thing everyone keeps talking about?]]></description>
			<content:encoded><![CDATA[<p><div style="background-color: #fff0cb; border: 1px solid #c0c030; font-size: larger; padding: 5px; margin: 5px;">

<img src="http://ridiculousfish.com/images/barrier/teraflop.jpg" style="float: right">

&quot;<a href="http://www.intel.com/pressroom/archive/releases/20060926corp_b.htm">That&#8217;s a lot of cores.</a>  <span style="color: #43174c">And while 80-core floating point monsters like that aren&#8217;t likely to show up in an iMac any time soon, multicore chips in multiprocessor computers are here today.  Single chip machines are so 2004.  Programmers better get crackin&#8217;.  The megahertz free ride is over - and we have work to do.&quot;</span></div>

<p>There, that&#8217;s the noisome little pitch everyone&#8217;s been spreading like so much thermal paste.  As if multiprocessing is something new!  But of course it&#8217;s not - heck, I remember Apple shipping dualies more than ten years ago, as the Power Macintosh 9500.  Multiprocessing is more <i>accessible</i> (read: cheaper) now, but it&#8217;s anything save new.  It&#8217;s been around long enough that we should have it figured out by now.

<p>So what&#8217;s my excuse?  I admit it - I don&#8217;t get multiprocessing, not, you know, really get it, and that&#8217;s gone on long enough.  It&#8217;s time to get to the bottom of it - or if not to the bottom, at least deep enough that my ears start popping.

<h3>Threadinology</h3>

<p>Where to start, where to start...well, let&#8217;s define our terms.  Ok, here&#8217;s the things I mean when I say the following, uh, things:

<ul><li style="margin-bottom: 6px"><b>Threads</b> are just preemptively scheduled contexts of execution that share an address space.  But you already know what threads are.  Frankly, for my purposes, they&#8217;re all pretty much the same whether you&#8217;re using Objective-C or C++ or Java on Mac OS X or Linux or Windows...</li>
<li style="margin-bottom: 6px"><b>Threading</b> means creating multiple threads.  But you often create multiple threads for simpler control flow or to get around blocking system calls, not to improve performance through true simultaneous execution.</li>
<li><b>Multithreading</b> is the physically simultaneous execution of multiple threads for increased performance, which requires a dualie or more.  Now things get hard.</li>
</ul>

<p>Yeah, I know.  &quot;Multithreading is hard&quot; is a clich&eacute;, and it bugs me, because it is not some truism describing a fundamental property of nature, but it&#8217;s something <i>we did</i>.  We made multithreading hard because we optimized so heavily for the single threaded case.

<p>What do I mean?  Well, processor speeds outrun memory so much that we started <b>guessing</b> at what&#8217;s in memory so the processor doesn&#8217;t have to waste time checking.  &quot;Guessing&quot; is a loaded term; a nicer phrase might be &quot;make increasingly aggressive assumptions&quot; about the state of memory.  And by &quot;we,&quot; I mean both the compiler and the processor - both make things hard, as we&#8217;ll see.  We&#8217;ll figure this stuff out - but they&#8217;re going to try to confuse us.  Oh well.  Right or wrong, this is the bed we&#8217;ve made, and now we get to lie in it.

<div style="background-color: #fff2d4; font-size: 14px; border: 1px solid #909050; width: 180px; float: right; padding: 5px; margin: 5px;">
<div style="font-family: Monaco, Courier, mono; margin-left: 10px;">
while (1) {<br />
&nbsp;&nbsp;&nbsp;x++;<br />
&nbsp;&nbsp;&nbsp;y++;<br />
}</div><br />

<span style="color: #000080; font-size: 16px">x should always be at least as big as y, right?  Right?</span></div>

<p>Blah blah.  Let&#8217;s look at some code.  We have two variables, variable1 and variable2, that both start out at 0.  The writer thread will do this:

<p><b>Writer thread</b>

<pre class="code">
while (1) {
   variable1++;
   variable2++;
}
</pre>

They both started out at zero, so therefore variable1, at every point in time, will always be the same as variable2, or larger - but never smaller.  Right?

<p>The reader thread will do this:

<p><b>Reader thread</b>

<pre class="code">
while (1) {
   local2 = variable2;
   local1 = variable1;
   if (local2 > local1) {
      print("Error!");
   }
}
</pre>

<p>That&#8217;s odd - why does the reader thread load the second variable before the first?  That&#8217;s the <i>opposite</i> order from the writer thread!  But it makes sense if you think about it.

<p>See, it&#8217;s possible that variable1 and/or variable2 will be incremented by the writer thread between the loads from the reader thread.  If variable2 gets incremented, that doesn&#8217;t matter - variable2 has already been read.  If variable1 gets incremented, then that makes variable1 appear larger.  So we conclude that variable2 should never be seen as larger than variable1, in the reader thread.  If we loaded variable1 before variable2, then variable2 might be incremented after the load of variable1, and we would see variable2 as larger.
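<p>To make that argument concrete, here is a minimal single-threaded sketch that models one pass of the reader, with some number of whole writer iterations landing between its two loads.  The name observes_error and its parameters are made up for illustration, and the model assumes every read and write takes effect in program order.

```c
/* Hypothetical model, for illustration only: simulate one pass of the
 * reader with `between` complete writer iterations happening between
 * its two loads. Assumes everything takes effect in program order. */
int observes_error(int read_variable2_first, unsigned before, unsigned between) {
    /* State after `before` complete writer iterations. */
    unsigned variable1 = before;
    unsigned variable2 = before;
    unsigned local1 = 0, local2 = 0;

    /* First load. */
    if (read_variable2_first) local2 = variable2;
    else local1 = variable1;

    /* The writer runs `between` more iterations. */
    variable1 += between;
    variable2 += between;

    /* Second load. */
    if (read_variable2_first) local1 = variable1;
    else local2 = variable2;

    return local2 > local1; /* 1 means the reader would print "Error!" */
}
```

<p>With variable2 read first, observes_error(1, 5, 3) returns 0; with variable1 read first, observes_error(0, 5, 3) returns 1.  The sketch only covers whole iterations, so it illustrates the reasoning rather than proving it.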

<div style="background-color: #ffdbf5; border: 1px solid #FF6bA3; padding: 5px; margin: 5px;">
Analogy to the rescue.  Imagine some bratty kid playing polo while his uptight father looks on with a starched collar and monocle (which can only see one thing at a time, and not even very well, dontcha know).  The kid smacks the ball, and then gallops after it, and so on.  Now the squinting, snobby father first finds the ball, then looks around for his sproggen.  But!  If in the meantime, the kid hits it and takes off after it, Dad will find the kid ahead of where he first found the ball.  Dad doesn&#8217;t realize the ball has moved, and concludes that his budding athlete is ahead of the ball, and running the wrong way!  If, however, Dad finds the kid first, and <i>then</i> the ball, things will always appear in the right order, if not the right place.  The order of action (move ball, move kid) has to be the <i>opposite</i> from the order of observation (find kid, find ball).</div>


<p><h3>Threadalogy</h3>

<p>So we&#8217;ve got two threads operating on two variables, and we think we know what&#8217;ll happen.  Let&#8217;s try it out, on my G5:

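<p><pre class="code">
#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;sys/time.h&gt;
#include &lt;unistd.h&gt;
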
<span class="note">unsigned variable1 = 0;</span>
<span class="note">unsigned variable2 = 0;</span>

#define ITERATIONS 50000000

void *<span class="note">writer</span>(void *unused) {
        for (;;) {
<span class="note">                variable1 = variable1 + 1;
                variable2 = variable2 + 1;</span>
        }
}

void *<span class="note">reader</span>(void *unused) {
        struct timeval start, end;
        gettimeofday(&#038;start, NULL);
        unsigned i, failureCount = 0;
        for (i=0; i < ITERATIONS; i++) {
<span class="note">                unsigned v2 = variable2;
                unsigned v1 = variable1;
                if (v2 > v1) failureCount++;</span>
        }
        gettimeofday(&#038;end, NULL);
        double seconds = end.tv_sec + end.tv_usec / 1000000. - start.tv_sec - start.tv_usec / 1000000.;
        printf(&quot;%u failure%s (%2.1f percent of the time) in %2.1f seconds\n&quot;,
               failureCount, failureCount == 1 ? &quot;&quot; : &quot;s&quot;,
               (100. * failureCount) / ITERATIONS, seconds);
        exit(0);
        return NULL;
}

int main(void) {
        pthread_t thread1, thread2;
        pthread_create(&#038;thread1, NULL, writer, NULL);
        pthread_create(&#038;thread2, NULL, reader, NULL);
        for (;;) sleep(1000000);
        return 0;
}
</pre>

What do we get when we run this?

<pre class="code">
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
</pre>

<div style="background-color: #fff0cb; border: 1px solid #c0c030; width: 300px; float: right; padding: 5px; margin: 5px;">How do we know that the reader thread won&#8217;t see a variable in some intermediate state, midway through being updated?  We have to know that these particular operations are atomic.  On PowerPC and x86 machines, 32 bit writes to <i>aligned</i> addresses are guaranteed atomic.  Other types of memory accesses are not always atomic - in particular, 64 bit writes (say, of a double precision floating point value) on a 32 bit PowerPC are <i>not</i> atomic.  We have to check the documentation to know.</div>

<h3>So, we&#8217;re done?</h3>

<p>Our expectations were confirmed!  The writer thread ordered its writes so that the first variable would always be at least as big as the second, and the reader thread ordered its reads the opposite way to preserve that invariant, and everything worked as planned.

<p>But we might just be getting lucky, right?  I mean, if thread1 and thread2 were always scheduled on the same processor, then we wouldn&#8217;t see any failures - a processor is always <i>self</i>-consistent with how it appears to order reads and writes.  In other words, a particular processor remembers where and what it pretends to have written, so if you read from that location <i>with that same processor</i>, you get what you expect.  It&#8217;s only when you read with processor1 from the same address where processor2 wrote - or pretended to write - that you might get into trouble.

<p>So let&#8217;s try to force thread1 and thread2 to run on separate processors.  We can do that with the utilBindThreadToCPU() function, in the CHUD framework.  That function should never go in a shipping app, but it&#8217;s useful for debugging.  Here it is:

<p><pre class="code">

void *writer(void *unused) {
        <span class="note">utilBindThreadToCPU(0);</span>
        for (;;) {
                variable1 = variable1 + 1;
                variable2 = variable2 + 1;
        }
}

void *reader(void *unused) {
        <span class="note">utilBindThreadToCPU(1);</span>
        struct timeval start, end;
        gettimeofday(&#038;start, NULL);
        ...
        
int main(void) {
        pthread_t thread1, thread2;
        <span class="note">chudInitialize();</span>
        pthread_create(&#038;thread1, NULL, writer, NULL);
        pthread_create(&#038;thread2, NULL, reader, NULL);
        while (1) sleep(1000000);
        return 0;
}

</pre>

To run it:

<pre class="code">
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
</pre>

<h3>NOW are we done?</h3>


<p>Still no failures.  Hmm...  But wait - processors operate on cache lines, and variable1 and variable2 are right next to each other, so they probably share the same cache line - that is, they get brought in together and treated the same by each processor.  What if we separate them?  We&#8217;ll put one on the stack and leave the other where it is.

<pre class="code">
unsigned variable1 = 0;

#define ITERATIONS 50000000

void *writer(<span class="note">unsigned *variable2</span>) {
        utilBindThreadToCPU(0);
        for (;;) {
                variable1 = variable1 + 1;
                <span class="note">*variable2 = *variable2 + 1;</span>
        }
        return NULL;
}

void *reader(<span class="note">unsigned *variable2</span>) {
        utilBindThreadToCPU(1);
        struct timeval start, end;
        gettimeofday(&#038;start, NULL);
        unsigned i;
        unsigned failureCount = 0;
        for (i=0; i < ITERATIONS; i++) {
                unsigned v2 = <span class="note">*variable2</span>;
                unsigned v1 = variable1;
                if (v2 > v1) failureCount++;
        }
        gettimeofday(&#038;end, NULL);
        double seconds = end.tv_sec + end.tv_usec / 1000000. - start.tv_sec - start.tv_usec / 1000000.;
        printf(&quot;%u failure%s (%2.1f percent of the time) in %2.1f seconds\n&quot;, failureCount, failureCount == 1 ? &quot;&quot; : &quot;s&quot;, (100. * failureCount) / ITERATIONS, seconds);
        exit(0);
        return NULL;
}

int main(void) {
        pthread_t thread1, thread2;
        <span class="note">unsigned variable2 = 0;</span>
        chudInitialize();
        pthread_create(&#038;thread1, NULL, writer, <span class="note">&#038;variable2</span>);
        pthread_create(&#038;thread2, NULL, reader, <span class="note">&#038;variable2</span>);
        while (1) sleep(1000000);
        return 0;
}
</pre>

So now, one variable is way high up on the stack and the other is way down low in the .data section.  Does this change anything?

<pre class="code">
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
</pre>

Still nothing!   I&#8217;m not going to have an article after all!  Arrrghhh **BANG BANG BANG BANG**
<pre class="code">
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.1 seconds
fish ) ./a.out
50000000 failures (100.0 percent of the time) in 0.1 seconds
</pre>

<p>Hey, there it is!  Most of the time, every test passes, but that last time, every test failed.

<h3>Our Enemy the Compiler</h3>

<p>The lesson here is something you already knew, but I&#8217;ll state it anyways:  <b>Multithreading bugs are very delicate.</b>  There is a real bug here, but it was masked by the fact that the kernel scheduled both threads on the same processor, and <i>then</i> by the fact that the variables were too close together in memory, and once those two issues were removed, (un)lucky timing usually masked the bug <i>anyways</i>.  In fact, if I didn&#8217;t know there was a bug there, I&#8217;d never have found it - and I <i>still</i> have my doubts!

<p>So first of all, why would every test pass or every test fail?  If there&#8217;s a subtle timing bug, we&#8217;d expect most tests to pass, with a few failing - not all or nothing.  Let&#8217;s look at what gcc is giving us for the reader function:

<pre class="code">
        lis r9,0x2fa
        ori r2,r9,61568
        mtctr r2
<span style="color: #0000CC">L8:
        bdnz L8</span>
        lis r2,0x2fa
        ori r2,r2,61568
        mullw r2,r0,r2
</pre>

<p>Hey!  The entire extent of that big long 50 million iteration loop has been hoisted out, leaving just the blue bits - essentially fifty million no-ops.  Instead of adding one or zero each time through the loop, it calculates the one or zero once, and then multiplies it by 50 million.

<p>gcc is loading from variable1 and variable2 exactly once, and comparing them exactly once, and assuming their values do not change throughout the function - which would be a fine assumption if there weren&#8217;t also other threads manipulating those variables.

<p>This is an example of what I mentioned above, about the compiler making things difficult by optimizing so aggressively for the single threaded case.  
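<p>In C terms, the optimized reader behaves roughly like the sketch below.  It is a hand-written approximation of the assembly above, not compiler output, and the function name and parameters are hypothetical.

```c
/* Hand-written approximation of what the optimized reader computes.
 * hoisted_failure_count is a hypothetical name, not from the program. */
unsigned hoisted_failure_count(unsigned v1, unsigned v2, unsigned iterations) {
    /* The two loads and the comparison happen once, outside the loop. */
    unsigned failed_once = (v2 > v1) ? 1u : 0u;

    /* The loop body is gone; what remains just counts down. */
    for (unsigned i = 0; i < iterations; i++) {
        /* empty, like the bdnz loop */
    }

    /* One multiply replaces fifty million conditional increments. */
    return failed_once * iterations;
}
```

<p>Whatever the single comparison sees decides every iteration at once, which is why a run reports either zero failures or all fifty million.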

<p>Well, you know the punchline - to stop gcc from optimizing aggressively, you use the volatile keyword.  So let&#8217;s do that:

<pre class="code">
<span class="note">volatile</span> unsigned variable1 = 0;

#define ITERATIONS 50000000

void *writer(<span class="note">volatile</span> unsigned *variable2) {
        utilBindThreadToCPU(0);
        for (;;) {
                variable1 = variable1 + 1;
                *variable2 = *variable2 + 1;
        }
        return NULL;
}

void *reader(<span class="note">volatile</span> unsigned *variable2) {
        utilBindThreadToCPU(1);
        struct timeval start, end;
        ...
</pre>

What does this change get us?

<pre class="code">
fish ) ./a.out
12462711 failures (24.9 percent of the time) in 3.7 seconds
</pre>

<p>It&#8217;s much slower (expected, since volatile defeats optimizations), but more importantly, it fails intermittently instead of all or nothing.  Inspection of the assembly shows that gcc is generating the straightforward sequence of loads and stores that you&#8217;d expect.

<h3>Our Enemy the Processor</h3>

<p>Is this really the cross-processor synchronization issue we&#8217;re trying to investigate?  We can find out by binding both threads to the same CPU:

<pre class="code">

void *writer(volatile unsigned *variable2) {
        <span class="note">utilBindThreadToCPU(0);</span>
        ...

void *reader(volatile unsigned *variable2) {
        <span class="note">utilBindThreadToCPU(0);</span>
        ...
</pre>

<pre class="code">
fish ) ./a.out
0 failures (0.0 percent of the time) in 0.4 seconds
</pre>


<p>The tests pass all the time - this really is a cross-processor issue.

<p>So somehow variable2 is becoming larger than variable1 even though variable1 is always incremented first.  How&#8217;s that possible?  It&#8217;s possible that the writer thread, on processor 0, is writing in the wrong order - it&#8217;s writing variable2 before variable1 even though we explicitly say to write variable1 first.  It&#8217;s also possible that the reader thread, on processor 1, is reading variable1 before variable2, even though we tell it to do things in the opposite order.  In other words, the processors could be reading and writing those variables in any order they feel like instead of the order we tell them to.

<div style="background-color: #ffdeb8; border: 1px solid #303030; width: 440px; float: right; padding: 5px; margin: 5px;">
<h3>Pop and Lock?</h3>

<p>What&#8217;s the usual response to cross-processor synchronization issues like this?  A mutex!  Let&#8217;s try it.

<pre class="code">
fish ) ./a.out
0 failures (0.0 percent of the time) in 479.5 seconds
</pre>

<p>It made the tests pass, all right - but it was 130 times slower!  A spinlock does substantially better, at 20 seconds, but that&#8217;s still 440% worse than no locking - and spinlocks won&#8217;t scale.  Surely we can do better.
</div>
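<p>The mutex run in the sidebar is shown without code.  A minimal sketch of what it might look like, assuming a single pthread mutex guarding both variables, follows.  The names are hypothetical, and this is not necessarily the code that produced the timing above.

```c
#include <pthread.h>

/* Hypothetical names; one lock guards both counters. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned variable1 = 0;
static unsigned variable2 = 0;

/* One iteration of the writer loop, done entirely under the lock. */
void locked_writer_step(void) {
    pthread_mutex_lock(&counter_lock);
    variable1 = variable1 + 1;
    variable2 = variable2 + 1;
    pthread_mutex_unlock(&counter_lock);
}

/* One iteration of the reader loop; returns 1 if it saw a failure. */
int locked_reader_step(void) {
    pthread_mutex_lock(&counter_lock);
    unsigned v2 = variable2;
    unsigned v1 = variable1;
    pthread_mutex_unlock(&counter_lock);
    return v2 > v1;
}
```

<p>Because both increments and both loads happen while the lock is held, the reader can only ever see the two variables between whole iterations.  The price is a lock and unlock on every pass through either loop.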

<h3>Even the kitchen</h3>

<p>Our problem is this: our processors are doing things in a different order than we tell them to, and not informing each other.  Each processor is only keeping track of its own shenanigans!  For shame!  We know of two super-horrible ways to fix this: force both threads onto the same CPU, which is a very bad idea, or use a lock, which is a very slow idea.  So what&#8217;s the right way to make this work?

<p>What we really want is a way to turn off the reordering for that particular sequence of loads and stores.  They don&#8217;t call it &quot;turn off reordering&quot;, of course, because that might imply that reordering is bad.  So instead they call it just plain &quot;ordering&quot;.  We want to order the reads and writes.  Ask and ye shall receive - the mechanism for that is called a &quot;memory barrier&quot;.

<p>And boy, does the PowerPC have them.  I count at least three: sync, lwsync, and the hilariously named eieio.  Here&#8217;s what they do:

<ul>
<li><b>sync</b> is the sledgehammer of the bunch - it orders all reads and writes, no matter what.  It works, but it&#8217;s slow.</li>
<li><b>lwsync</b> (for &quot;lightweight sync&quot;) is the newest addition.  It&#8217;s limited to plain ol&#8217; system memory, but it&#8217;s also faster than sync.</li>
<li><b>eieio</b> (&quot;Enforce In-Order execution of I/O&quot;) is weird - it orders writes to &quot;device&quot; memory (like a memory mapped peripheral) and regular ol&#8217; system memory, but each separately.  We only care about system memory, and IBM says not to use eieio just for that.  Nevertheless, it should still order our reads and writes like we want.</li>
</ul>

<p>Because we&#8217;re not working with devices, lwsync is what we&#8217;re after.  Processor 0 is supposed to write variable2 after variable1, so we&#8217;ll insert a memory barrier between the two writes to keep them in that order:

<div style="background-color: #d0ffdd; border: 1px solid #309030; width: 310px; float: right; padding: 5px; margin: 5px;">Do we need a memory barrier after the write to variable2 as well?  No - that would guard against the possibility of the <i>next</i> increment landing on variable1 <i>before</i> the <i>previous</i> increment hits variable2.  But the goal is to make sure that variable1 is at least as large as variable2, so it&#8217;s OK if that happens.</div>

<pre class="code">
volatile unsigned variable1 = 0;

<span class="note">#define barrier() __asm__ volatile (&quot;lwsync&quot;)</span>

#define ITERATIONS 50000000

void *writer(volatile unsigned *variable2) {
        utilBindThreadToCPU(0);
        for (;;) {
                variable1 = variable1 + 1;
                <span class="note">barrier();</span>
                *variable2 = *variable2 + 1;
        }
        return NULL;
}

</pre>

<p>So!  Let&#8217;s run it!

<pre class="code">
fish ) ./a.out
260 failures (0.0 percent of the time) in 0.9 seconds
</pre>

<p>So we reduced the failure count from 12462711 to 260.  Much better, but still not perfect.  Why are we still failing at times?  The answer, of course, is that just because processor 0 writes in the order we want is no guarantee that processor 1 will read in the desired order.  Processor 1 may issue the reads in the wrong order, and processor 0 would write in between those two reads.  We need a memory barrier in the reader thread, to force the reads into the right order as well:

<pre class="code">
void *reader(volatile unsigned *variable2) {
        struct timeval start, end;
        utilBindThreadToCPU(1);
        gettimeofday(&#038;start, NULL);
        unsigned i;
        unsigned failureCount = 0;
        for (i=0; i < ITERATIONS; i++) {
                unsigned v2 = *variable2;
                <span class="note">barrier();</span>
                unsigned v1 = variable1;
                if (v2 > v1) failureCount++;
        }
        gettimeofday(&#038;end, NULL);
        double seconds = end.tv_sec + end.tv_usec / 1000000. - start.tv_sec - start.tv_usec / 1000000.;
        printf(&quot;%u failure%s (%2.1f percent of the time) in %2.1f seconds\n&quot;,
               failureCount, failureCount == 1 ? &quot;&quot; : &quot;s&quot;,
               (100. * failureCount) / ITERATIONS, seconds);
        exit(0);
        return NULL;
}
</pre>


<pre class="code">
fish ) ./a.out
0 failures (0.0 percent of the time) in 4.2 seconds
</pre>

<p>That did it!

<p>The lesson here is that if you care about the order of reads or writes by one thread, it&#8217;s because you care about the order of writes or reads by <i>another</i> thread.  <i>Both</i> threads need a memory barrier.  <b>Memory barriers always come in pairs</b>, or triplets or more.  (Of course, if both threads are in the same function, there may only be one memory barrier that appears in your code - as long as both threads execute it.)

<p>This should not come as a surprise: locks have the same behavior.  If only one thread ever locks, it&#8217;s not a very useful lock.
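<p>The lwsync macro is PowerPC-specific.  As a rough portable sketch of the same pairing, assuming a C11 compiler with &lt;stdatomic.h&gt; (which postdates this test program), the writer can use a release fence and the reader an acquire fence.  The function names are made up, and the variables are atomic_uint rather than volatile because C11 wants atomic objects for data shared this way.

```c
#include <stdatomic.h>

/* Hypothetical C11 rewrite of the two shared counters. */
static atomic_uint variable1;
static atomic_uint variable2;

/* One writer iteration: bump variable1, fence, then bump variable2. */
void writer_step(void) {
    unsigned v1 = atomic_load_explicit(&variable1, memory_order_relaxed);
    atomic_store_explicit(&variable1, v1 + 1, memory_order_relaxed);

    /* Release fence: the store above is ordered before the store below. */
    atomic_thread_fence(memory_order_release);

    unsigned v2 = atomic_load_explicit(&variable2, memory_order_relaxed);
    atomic_store_explicit(&variable2, v2 + 1, memory_order_relaxed);
}

/* One reader iteration: returns 1 if variable2 looked larger. */
int reader_step(void) {
    unsigned v2 = atomic_load_explicit(&variable2, memory_order_relaxed);

    /* Acquire fence: the load above is ordered before the load below. */
    atomic_thread_fence(memory_order_acquire);

    unsigned v1 = atomic_load_explicit(&variable1, memory_order_relaxed);
    return v2 > v1;
}
```

<p>Just like the two lwsync barriers, the two fences only help as a pair: drop either one and the ordering guarantee goes with it.  The sketch ignores counter wraparound.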

<h3>31 Flavors</h3>

<p>What&#8217;s that?  You noticed that the PowerPC comes with three different kinds of memory barriers.  Right - as reads and writes get scheduled increasingly out of order, the more expensive it becomes to order them - so the PowerPC allows you to request various less expensive partial orderings, for performance.  Processors that schedule I/O out of order more aggressively offer even more barrier flavors.  At the extreme end is the DEC Alpha, which sports read barriers with device memory ordering, read barriers without, write barriers with, write barriers without, page table barriers, and various birds in fruit trees.  The Alpha&#8217;s memory model guarantees so little that it is said to define the Linux kernel memory model - that is, the set of available barriers in the kernel source matches the Alpha&#8217;s instruction set.  (Of course, many of them get compiled out when targeting a different processor.)

<div style="background-color: #ffdbf5; font-size: 14px; border: 1px solid #ff6ba3; width: 240px; float: right; padding: 5px; margin: 5px;">Actually - and here my understanding is especially thin - while x86 is strongly ordered in general, I believe that Intel has managed to slip some weakly ordered operations in sideways, through the SIMD unit.  These are referred to as &quot;streaming&quot; or &quot;nontemporal&quot; instructions.  And when writing to specially tagged &quot;write combining&quot; memory, like, say, memory mapped VRAM, the rules are different still.</div>


<p>And on the other end, we have strongly ordered memory models that do very little reordering, like the - here it comes - x86.  No matter how many times I ran that code, even on a double-dualie Intel Mac Pro, I never saw any failures.  Why not?  My understanding (and here it grows fuzzy) is that early multiprocessing setups were strongly ordered because modern reordering tricks weren&#8217;t that useful - memory was still pretty fast, y&#8217;know, relatively speaking, so there wasn&#8217;t much win to be had.  So developers blithely assumed the good times would never end, and we&#8217;ve been wearing the backwards compatibility shackles ever since.

<p>But that doesn&#8217;t answer the question of why x86_64, y&#8217;know, the 64 bit x86 implementation in the Core 2s and all, isn&#8217;t more weakly ordered - or at least doesn&#8217;t reserve the <i>right</i> to be weaker.  That&#8217;s what IA64 - remember Itanium? - did: early models were strongly ordered, but the official memory model was weak, for future proofing.  Why didn&#8217;t AMD follow suit with x86_64?  My only guess (emphasis on <i>guess</i>) is that it was a way of jockeying for position against Itanium, when the 64 bit future for the x86 was still fuzzy.  AMD&#8217;s strongly ordered memory model means better compatibility and less hair-pulling when porting x86 software to 64 bit, and that made x86_64 more attractive compared to the Itanium.  A pity, at least for Apple, since of course all of Apple&#8217;s software runs on the weak PowerPC - there&#8217;s no compatibility benefit to be had.  So it goes.  Is my theory right?

<h3>Makin&#8217; a lock, checkin&#8217; it twice</h3>

<p>Ok!  I think we&#8217;re ready to take on that perennial bugaboo of Java programmers - the double checked lock.  How does it go again?  Let&#8217;s see it in Objective-C:

<pre class="code">
+ getSharedObject {
    static id sharedObject;
    if (! sharedObject) {
        LOCK;
        if (! sharedObject) {
            sharedObject = [[self alloc] init];
        }
        UNLOCK;
    }
    return sharedObject;
}
</pre>

<p>What&#8217;s the theory?  We want to create a single shared object, exactly once, while preventing conflict between multiple threads.  The hope is that we can do a quick test to avoid taking the expensive lock.  If the static variable is set, which it will be most of the time, we can return the object immediately, without taking the lock.

<div style="background-color: #e4e2ff; border: 1px solid #b297ff; width: 240px; float: right; padding: 5px; margin: 5px;">Sometimes memory barriers are needed to guard against past or future reads and writes that occur in, say, the function that&#8217;s <i>calling</i> your function.  Reordering can cross function and library boundaries!</div>

<p>This sounds good, but of course you already know it&#8217;s not.  Why not?  Well, if you&#8217;re creating this object, you&#8217;re probably initializing it in some way - at the very least, you&#8217;re setting its isa (class) pointer.  And then you&#8217;re turning around and writing it back to the sharedObject variable.  But these can happen in any order, as seen from another processor - so when the getSharedObject method is called from some other processor, it can see the sharedObject variable as set, and happily return the object <i>before its class pointer is even valid</i>.  Cripes.

<p>But now you know we have the know-how to make this work, no?  How?  The problem is that we need to order the writes within the alloc and init methods relative to the write to the sharedObject variable - the alloc and init writes must come first, the write to sharedObject last.  So we store the object into a temporary local variable, insert a memory barrier, and then copy from the temporary to the shared object.  This time, I&#8217;ll use Apple&#8217;s portable memory barrier function:

<pre class="code">
+ getSharedObject {
    static id sharedObject;
    if (! sharedObject) {
        LOCK;
        if (! sharedObject) {
            <span class="note">id temp</span> = [[self alloc] init];
            <span class="note">OSMemoryBarrier();</span>
            sharedObject = <span class="note">temp</span>;
        }
        UNLOCK;
    }
    return sharedObject;
}
</pre>

<p>There!  Now we&#8217;re guaranteed that the initializing thread really will write to sharedObject after the object is fully initialized.  All done.

<p>Hmm?  Oh, nuts!  I forgot my rule - memory barriers come in <i>pairs</i>.  If thread A initializes the object, it goes through a memory barrier, but if thread B then comes along, it will see the object and return it without any barrier at all.  Our rule tells us that something is wrong, but what?  Why&#8217;s that bad?

<p>Well, thread B&#8217;s going to <i>do</i> something with the shared object, like send it a message, and that requires at the very least accessing the isa class pointer.  But we know the isa pointer really was written to memory first, before the sharedObject pointer, and thread B got ahold of the sharedObject pointer, so logically, the isa pointer should be written, right?  The laws of physics require it!  Isn&#8217;t that, like, you put an object in a box and hand it to me, and then I open the box to find that you haven&#8217;t put something into it yet!  It&#8217;s a temporal paradox!

<p>The answer is that, yes, amazingly, dependent reads like that can be performed seemingly out of order, but not on any processors that Apple ships.  I&#8217;ve only heard of it happening on - you guessed it - the Alpha.  Crazy, huh?

<p>So where should the memory barrier go?  The goal is to order future reads - reads that occur after this sharedObject function returns - against the read from the sharedObject variable.  So it&#8217;s gotta go here:

<pre class="code">
+ getSharedObject {
    static id sharedObject;
    if (! sharedObject) {
        LOCK;
        if (! sharedObject) {
            id temp = [[self alloc] init];
            OSMemoryBarrier();
            sharedObject = temp;
        }
        UNLOCK;
    }
    <span class="note">OSMemoryBarrier();</span>
    return sharedObject;
}
</pre>

<p>Now, this differs slightly from the <a href="http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html">usual solution</a>, which stores the static variable into a temporary in all cases.  However, for the life of me I can&#8217;t figure out why that&#8217;s necessary - the placement of the second memory barrier above seems correct to me, assuming the compiler doesn&#8217;t hoist the final read of sharedObject above the memory barrier (which it shouldn&#8217;t).  If I screwed it up, let me know how, please!
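
<p>For comparison, here&#8217;s a rough sketch of the same double checked lock in plain C11, using <span class="inline_code">&lt;stdatomic.h&gt;</span> fences in place of <span class="inline_code">OSMemoryBarrier()</span> and a pthread mutex in place of LOCK and UNLOCK.  The <span class="inline_code">shared_object</span> struct and all the names are made up for illustration, and note that this version <i>does</i> keep the loaded pointer in a local, like the usual solution - that&#8217;s simply the natural way to write it with C11 atomics, not a claim that the Objective-C version above needs it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared Objective-C object. */
typedef struct {
    int isa;      /* plays the role of the class pointer */
    int payload;  /* whatever else init would set up */
} shared_object;

static _Atomic(shared_object *) shared_ptr = NULL;
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

shared_object *get_shared_object(void) {
    /* Quick test: a relaxed load, no lock taken. */
    shared_object *obj = atomic_load_explicit(&shared_ptr, memory_order_relaxed);
    if (obj == NULL) {
        pthread_mutex_lock(&shared_lock);
        /* Second check, now under the lock. */
        obj = atomic_load_explicit(&shared_ptr, memory_order_relaxed);
        if (obj == NULL) {
            shared_object *temp = malloc(sizeof *temp);
            assert(temp != NULL);
            temp->isa = 1;
            temp->payload = 42;
            /* First barrier: the initializing writes above must be
               visible before the pointer is published. */
            atomic_thread_fence(memory_order_release);
            atomic_store_explicit(&shared_ptr, temp, memory_order_relaxed);
            obj = temp;
        }
        pthread_mutex_unlock(&shared_lock);
    }
    /* Second barrier: reads the caller performs through obj must not
       be satisfied ahead of the load of the pointer itself. */
    atomic_thread_fence(memory_order_acquire);
    return obj;
}
```

<p>The release fence pairs with the acquire fence, which is the &quot;barriers come in pairs&quot; rule spelled out in C11 terms.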

<h3>Do we want it?</h3>

<p>That <i>second</i> memory barrier makes the double checked lock correct - but is it wise?  As we discussed, it&#8217;s not technically necessary on any machine you or I are likely to encounter.   And, after all, it does incur a real performance hit if we leave it in.  What to do?

<p>The Linux kernel defines a set of fine-grained barrier macros that get compiled in or out appropriately (we would want a &quot;data dependency barrier&quot; in that case).  You could go that route, but my suggestion is to just leave a semi-standard comment to help you locate these places in the future.  That will help future-proof your code, but more importantly, it forces you to reason carefully about the threading issues, and to record your thoughts.  You&#8217;re more likely to get it right.

<pre class="code">
+ getSharedObject {
    static id sharedObject;
    if (! sharedObject) {
        LOCK;
        if (! sharedObject) {
            id temp = [[self alloc] init];
            OSMemoryBarrier();
            sharedObject = temp;
        }
        UNLOCK;
    }
    <span class="note">/* data dependency memory barrier here */</span>
    return sharedObject;
}
</pre>
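
<p>If you did want to go the compiled-in-or-out route, a sketch in C11 might look something like the following.  Everything here is an assumption for illustration: the macro name <span class="inline_code">DATA_DEPENDENCY_BARRIER</span>, the <span class="inline_code">config</span> type, and the <span class="inline_code">publish</span> and <span class="inline_code">lookup</span> functions are invented, and this is not how the Linux kernel defines its own macros.  On everything but the Alpha it leans on the hardware keeping dependent reads in order, which the C11 standard itself does not promise.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical macro: a real hardware barrier only where dependent
   reads can be reordered (Alpha); elsewhere just a compiler-level
   fence, so the compiler still can't move reads across it. */
#if defined(__alpha__)
#define DATA_DEPENDENCY_BARRIER() atomic_thread_fence(memory_order_acquire)
#else
#define DATA_DEPENDENCY_BARRIER() atomic_signal_fence(memory_order_acquire)
#endif

/* Hypothetical payload that one thread publishes for others to read. */
typedef struct {
    int value;
} config;

_Atomic(config *) published = NULL;

void publish(config *c) {
    assert(c != NULL);
    /* Writer's half of the pair: contents first, pointer last. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&published, c, memory_order_relaxed);
}

config *lookup(void) {
    config *c = atomic_load_explicit(&published, memory_order_relaxed);
    /* Reader's half of the pair: compiled down to (nearly) nothing
       on strongly dependency-ordered machines. */
    DATA_DEPENDENCY_BARRIER();
    return c;
}
```

<p>The macro also does the job of the comment: it marks the spot where the second barrier logically belongs.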

<div style="background-color: #ffdbf5; font-size: 14px; border: 1px solid #ff6ba3; width: 220px; float: right; padding: 5px; margin: 5px;">
<b>Wrapping up!</b><br />  Skimmers skip to here.
</div>


<h3>Now are we done?</h3>

<p>I think so, Mr. Subheading.  Let&#8217;s see if we can summarize all this:

<div style="background-color: #A0A0B0; border: 1px solid #404040; text-align: center; width: 380px; float: right; padding: 5px; margin: 5px; margin-left: 15px">
<img src="http://ridiculousfish.com/images/barrier/stuck.jpg" style="margin-bottom: 5px; border: 1px solid #202020;"><br />
Locks are like tanks - powerful, slow, safe, expensive, and prone to getting you stuck.
</div>

<ul>
<li class="tall">The compiler and the processor both conspire to <b>defeat your threads</b> by moving your code around!  Be warned and wary!  You will have to do battle with both.</li>
<li class="tall">Even so, it is very easy to mask serious threading bugs.  We had to work hard, even in highly contrived circumstances, to get our bug to poke its head out even occasionally.</li>
<li class="tall">Ergo, <b>testing probably won&#8217;t catch</b> these types of bugs.  That makes it more important to get it right the first time.</li>
<li class="tall">Locks are the heavy tanks of threading tools - powerful, but slow and expensive, and if you&#8217;re not careful, you&#8217;ll get yourself stuck in a deadlock.  </li>
<li class="tall">Memory barriers are a faster, non-blocking, deadlock free alternative to locks. They take more thought, and aren&#8217;t always applicable, but your code&#8217;ll be faster and scale better.</li>
<li class="tall">Memory barriers <b>always come in logical pairs</b> or more.  Understanding where the second barrier has to go will help you reason about your code, even if that particular architecture doesn&#8217;t require a second barrier.</li>
</ul>

<h3>Further reading</h3>

Seriously?  You want to know more?  Ok - the best technical source I know of is actually a document called &#8220;memory-barriers.txt&#8221; that comes with the Linux kernel source.  You can get it <a href="http://www.gelato.unsw.edu.au/lxr/source/Documentation/memory-barriers.txt">here</a>.  Thanks to my co-worker for finding it and directing me to it.

<h3>Things I wanna know</h3>

I&#8217;m still scratching my head about some things.  Maybe you can help me out.

<ul>
<li>Why is x86_64 strongly ordered?  Is my theory about gaining a competitive edge over Itanium reasonable?</li>
<li>Is my double checked lock right, even though it doesn&#8217;t use a temporary variable in the usual place?</li>
<li>What&#8217;s up with the so-called &quot;nontemporal&quot; streaming instructions on x86?</li>
</ul>

Leave a comment if you know!  Thanks!]]></content:encoded>
			<wfw:commentRSS>http://ridiculousfish.com/blog/archives/2007/02/17/barrier/feed/</wfw:commentRSS>
	</item>
		<item>
		<title>Logos</title>
		<link>http://ridiculousfish.com/blog/archives/2006/12/11/logos/</link>
		<comments>http://ridiculousfish.com/blog/archives/2006/12/11/logos/#comments</comments>
		<pubDate>Mon, 11 Dec 2006 14:26:09 +0000</pubDate>
		<dc:creator>ridiculous_fish</dc:creator>
		
	<category>Uncategorized</category>
		<guid>http://ridiculousfish.com/blog/archives/2006/12/11/logos/</guid>
		<description><![CDATA[Hair care, or digital audio?]]></description>
			<content:encoded><![CDATA[<style type="text/css">
td.ahleft, td.ahright {
	padding-top: 10px;
	padding-bottom: 10px;
	padding-left: 10px;
	padding-right: 10px;
}

td.ahleft {
	text-align: right;
}

td.ahright {
	padding-top: 20px;
	font-size: 15pt;
	padding-left: 50px;
	line-height: 30px;
	text-align: left;
}

tr.aheven {
	background-color: #DDEEFF;
}

tr.ahodd {
	background-color: white;
}

img.ahproduct {
	vertical-align: middle;
}

</style>

<div style="text-align: center">
<form method="post" action="/kay/results.php">
<table cellspacing=0 style="border-style: groove; margin-left: auto; margin-right: auto;">
<tr class="aheven">
<td class="ahleft"><img class="product" src="/images/hair/GreatClips.png"></td>
<td class="ahright"><input type="radio" name="answers[0]" value="0">Hair care!<br /><input type="radio" name="answers[0]" value="1">Digital audio!</td></tr>
<tr class="ahodd">
<td class="ahleft"><img class="product" src="/images/hair/Logics.png"></td>
<td class="ahright"><input type="radio" name="answers[1]" value="0">Hair care!<br /><input type="radio" name="answers[1]" value="1">Digital audio!</td></tr>
<tr class="aheven">
<td class="ahleft"><img class="product" src="/images/hair/LogicPro.png"></td>
<td class="ahright"><input type="radio" name="answers[2]" value="0">Hair care!<br /><input type="radio" name="answers[2]" value="1">Digital audio!</td></tr>
<tr class="ahodd">
<td class="ahleft"><img class="product" src="/images/hair/protools.jpg"></td>
<td class="ahright"><input type="radio" name="answers[3]" value="0">Hair care!<br /><input type="radio" name="answers[3]" value="1">Digital audio!</td></tr>
<tr class="aheven">
<td class="ahleft"><img class="product" src="/images/hair/Biolage.png"></td>
<td class="ahright"><input type="radio" name="answers[4]" value="0">Hair care!<br /><input type="radio" name="answers[4]" value="1">Digital audio!</td></tr>
<tr class="ahodd">
<td class="ahleft"><img class="product" src="/images/hair/Amplify.gif"></td>
<td class="ahright"><input type="radio" name="answers[5]" value="0">Hair care!<br /><input type="radio" name="answers[5]" value="1">Digital audio!</td></tr>
<tr class="aheven">
<td class="ahleft"><img class="product" src="/images/hair/live.png"></td>
<td class="ahright"><input type="radio" name="answers[6]" value="0">Hair care!<br /><input type="radio" name="answers[6]" value="1">Digital audio!</td></tr>
<tr class="ahodd">
<td class="ahleft"><img class="product" src="/images/hair/matrix.png"></td>
<td class="ahright"><input type="radio" name="answers[7]" value="0">Hair care!<br /><input type="radio" name="answers[7]" value="1">Digital audio!</td></tr>
<tr class="aheven">
<td class="ahleft"><img class="product" src="/images/hair/bias.png"></td>
<td class="ahright"><input type="radio" name="answers[8]" value="0">Hair care!<br /><input type="radio" name="answers[8]" value="1">Digital audio!</td></tr>
<tr class="ahodd">
<td class="ahleft"><img class="product" src="/images/hair/ashampoo.gif"></td>
<td class="ahright"><input type="radio" name="answers[9]" value="0">Hair care!<br /><input type="radio" name="answers[9]" value="1">Digital audio!</td></tr>
<tr class="aheven">
<td class="ahleft"><img class="product" src="/images/hair/waves.gif"></td>
<td class="ahright"><input type="radio" name="answers[10]" value="0">Hair care!<br /><input type="radio" name="answers[10]" value="1">Digital audio!</td></tr>
<tr class="ahodd">
<td class="ahleft"><img class="product" src="/images/hair/nexxus.png"></td>
<td class="ahright"><input type="radio" name="answers[11]" value="0">Hair care!<br /><input type="radio" name="answers[11]" value="1">Digital audio!</td></tr>
<tr class="aheven">
<td class="ahleft"><img class="product" src="/images/hair/pureology.png"></td>
<td class="ahright"><input type="radio" name="answers[12]" value="0">Hair care!<br /><input type="radio" name="answers[12]" value="1">Digital audio!</td></tr>
<tr class="ahodd">
<td class="ahleft"><img class="product" src="/images/hair/sonalksis.png"></td>
<td class="ahright"><input type="radio" name="answers[13]" value="0">Hair care!<br /><input type="radio" name="answers[13]" value="1">Digital audio!</td></tr>
</table>
<br /><br />
<input type="submit" value="How did I do?">

</form>
</div>]]></content:encoded>
			<wfw:commentRSS>http://ridiculousfish.com/blog/archives/2006/12/11/logos/feed/</wfw:commentRSS>
	</item>
		<item>
		<title>0xF4EE</title>
		<link>http://ridiculousfish.com/blog/archives/2006/11/24/0xf4ee/</link>
		<comments>http://ridiculousfish.com/blog/archives/2006/11/24/0xf4ee/#comments</comments>
		<pubDate>Fri, 24 Nov 2006 13:14:16 +0000</pubDate>
		<dc:creator>ridiculous_fish</dc:creator>
		
	<category>Programming</category>
		<guid>http://ridiculousfish.com/blog/archives/2006/11/24/0xf4ee/</guid>
		<description><![CDATA[Hex Fiend is open source.]]></description>
			<content:encoded><![CDATA[<a href="/hexfiend/">Hex Fiend 1.1.1</a> is now available open source under a BSD-style license.  Hex Fiend is my fast and clever <strike>free</strike> open source hex editor for Mac OS X.

I hope you find Hex Fiend useful for whatever purpose, but if you are interested in contributing changes on an ongoing basis, I&#8217;ll be happy to grant Subversion commit privileges to some interested developers who submit quality patches.  There is a wiki aimed at developers accessible from the page, but daily builds, mailing lists, or discussion boards are also a possibility.  You can contact me at the e-mail address at the bottom of the <a href="/hexfiend/">Hex Fiend page</a> if you are interested in any of these.

Version 1.1.1 has some important bug fixes (see the <a href="/hexfiend/docs/ReleaseNotes_111.txt">release notes</a>), so you should upgrade even if you are not interested in the source.]]></content:encoded>
			<wfw:commentRSS>http://ridiculousfish.com/blog/archives/2006/11/24/0xf4ee/feed/</wfw:commentRSS>
	</item>
		<item>
		<title>Bridge</title>
		<link>http://ridiculousfish.com/blog/archives/2006/09/09/bridge/</link>
		<comments>http://ridiculousfish.com/blog/archives/2006/09/09/bridge/#comments</comments>
		<pubDate>Sat, 09 Sep 2006 23:21:48 +0000</pubDate>
		<dc:creator>ridiculous_fish</dc:creator>
		
	<category>Programming</category>
	<category>Mac OS X</category>
		<guid>http://ridiculousfish.com/blog/archives/2006/09/09/bridge/</guid>
		<description><![CDATA[Mac OS 9 and NEXTSTEP were like two icy comets wandering aimlessly in space.  Neither was really going anywhere in particular.  And then, BANG!  They collide, stick wetly together, spinning wildly!  Thus was born Mac OS X - or so the legend goes.]]></description>
			<content:encoded><![CDATA[<h3>A Brief History</h3>

<p>Mac OS 9 and NEXTSTEP were like two icy comets wandering aimlessly in space.  Neither was really going anywhere in particular.  And then, BANG!  They collide, stick wetly together, spinning wildly!  Thus was born Mac OS X - or so the legend goes.

<p>How do you take these two comets, err, operating systems, and make a unified OS out of them?  On the one hand, you have the procedural classic Macintosh Toolbox, and on the other you have object oriented OPENSTEP, as different as can be - and you&#8217;re tasked with integrating them, or at least getting <i>some</i> level of interoperability.  What a headache!

<p>You might start by finding common ground - but there isn&#8217;t much common ground, so you have to invent some, and you call it (well, part of it) CoreFoundation.   Uh, let&#8217;s abbreviate CoreFoundation &quot;CF&quot; from now on.  CoreFoundation will &quot;sit below&quot; both of these APIs, and provide functions for strings and dates and other fundamental stuff, and the shared use of CF will serve as a sort of least common denominator, not only for these two APIs but also for future APIs.  These two APIs will be able to talk to each other, and to future APIs, with CF types. 

<p>Ok, so the plan is to make these two APIs, the Mac Toolbox and OPENSTEP, use CF.  Adding CF support to the Mac Toolbox is not that big a deal, because the Mac Toolbox APIs have to change <i>anyways,</i> to become Carbon.  But the OPENSTEP APIs don&#8217;t have the same sort of problems, and shouldn&#8217;t have to change much to become Cocoa.

<p>Like, for example, the Toolbox uses Pascal strings, and those have unfortunate length limitations and ignorance of Unicode, so we want to get rid of them - so we might as well use the interoperable replacement as the native string type in Carbon.  But OPENSTEP&#8217;s NSString is already pretty nice.  It would be a shame to have to make CFString replacements for <i>all</i> those Cocoa APIs that take and return NSStrings, just for interoperability with Carbon.

<h3>Rough Draft</h3>

<p>So the solution is obvious, right?  Just make NSString methods to convert to and from CFStrings.

<pre class="code">
<span style='color: #8e1893'>@interface NSString</span> (CFStringMethods)
- (<span style='color: #1e1893'>CFStringRef</span>)getCFString;
+ (<span style='color: #1e1893'>NSString</span> *)stringFromCFString:(<span style='color: #1e1893'>CFStringRef</span>)stringRef;
<span style='color: #8e1893'>@end</span>
</pre>

<p>So whenever you want to talk to Carbon, you get a <span class="inline_code">CFStringRef</span> from your <span class="inline_code">NSString</span>, and whenever you get a <span class="inline_code">CFStringRef</span> back from Carbon, you make an <span class="inline_code">NSString</span> out of it.  Simple!  But this isn&#8217;t what Apple did.

<h3>Second Revision</h3>

&quot;Hey,&quot; you say.  Some of you say.  &quot;I know what Apple did.  I&#8217;m not so easily fooled!  Check out this code:&quot;

<pre class="code">
	<span style='color: #683821'>#include &lt;Foundation/NSString.h&gt;</span>
	
	<span style='color: #760f50'>int</span> main(<span style='color: #760f50'>void</span>) {
		NSLog(NSStringFromClass([<span style='color: #891315'>@&quot;Some String&quot;</span> class]));
		<span style='color: #760f50'>return</span> <span style='color: #0000ff'>0</span><span style='color: #000000'>;</span>
	}
</pre>

&quot;What does that output?  <span class="inline_code">NSCFString</span>.  <span class="inline_code">NS<span style="font-weight: bold; font-size: larger">CF</span>String</span>.  See?  NSStrings must be really CFStrings under the hood!  And you can do that because NSString is a class cluster - it&#8217;s an abstract interface.  So that&#8217;s how you achieve interoperability: you implement NSStrings with CFStrings (but preserve the NSString API) and then all NSStrings really *are* CFStrings.  There&#8217;s no conversion necessary because they&#8217;re the same thing.

<p>&quot;That&#8217;s how <a href="http://developer.apple.com/documentation/Cocoa/Conceptual/CarbonCocoaDoc/Articles/DataTypes.html">toll free bridging</a> works!&quot;

<p>But hang on a minute.  You just said yourself that NSString is an abstract interface - that means that some crazy developer can make his or her own subclass of NSString, and implement its methods in whatever wacky way, and it&#8217;s supposed to just work.  But then it wouldn&#8217;t be using CFStrings!  It would be using some other crazy stuff.  So when a Cocoa API gets a string and wants to do something CF-ish with it, the API would have no way of knowing if the string was toll-free bridged - that is, if it was really a CFString or a, y&#8217;know, <span class="inline_code">FishsWackyString</span>, without <a href="http://foldoc.org/?Liskov+substitution+principle">checking its class</a>, and then it would have to convert it...blech!

<h3>Final Draft</h3>

<p>So that&#8217;s a problem: Apple wants to toll free bridge - to be able to use <span class="inline_code">NSStrings</span> as <span class="inline_code">CFStrings</span> without conversion.  But to do that, Apple also needs to support wacky <span class="inline_code">NSString</span> subclasses (that don&#8217;t use <span class="inline_code">CFStrings</span> at all) in the CFString API.  That means making a C API that knows about Objective-C objects.

<p>A C API that handles Objective-C objects?  That&#8217;s some deep deep voodoo, man.  But we have it and it works, right?  We can just cast <span class="inline_code">CFStringRef</span>s to <span class="inline_code">NSString</span>s, and vice versa, and for once in our lives we get to feel smug and superior, instead of stupid, when the compiler warns about mismatched pointer types.  &quot;Look, gcc, I know it says CFStringRef, but just try it with that NSString.  Trust me.&quot;  It&#8217;s great!  Right?

<p>But how does it work?  We could check, if only CoreFoundation were open source!

<p>...

<p><a href="http://www.opensource.apple.com/darwinsource/10.4.7.x86/CF-368.27/">Oh, right.</a>  So let&#8217;s look at the <span class="inline_code"><a href="http://www.opensource.apple.com/darwinsource/10.4.7.x86/CF-368.27/String.subproj/CFString.c">CFStringGetLength()</a></span> function and see what happens if you give it a weird string.

<pre class="code">
	CFIndex CFStringGetLength(CFStringRef str) {
	    CF_OBJC_FUNCDISPATCH0(__kCFStringTypeID, CFIndex, str, <span style='color: #891315'>&quot;length&quot;</span>);
	
	    __CFAssertIsString(str);
	    <span style='color: #760f50'>return</span> __CFStrLength(str);
	}
</pre>

Any ideas where the Objective-C voodoo is happening here?  ANYONE?  You in the back?  <span class="inline_code">CF_OBJC_FUNCDISPATCH0</span> you say?  I guess it&#8217;s worth a try.

<h3>CF_OBJC_FUNCDISPATCH0</h3>

So <span class="inline_code">CF_OBJC_FUNCDISPATCH0</span> is the magic that supports Objective-C objects.  Where&#8217;s <span class="inline_code">CF_OBJC_FUNCDISPATCH0</span> defined?  <a href="http://www.opensource.apple.com/darwinsource/10.4.7.x86/CF-368.27/Base.subproj/CFInternal.h">Here:</a>

<pre class="code">
<span style='color: #683821'>
	// Invoke an ObjC method, return the result
	#define CF_OBJC_FUNCDISPATCH0(typeID, rettype, obj, sel) \
		if (__builtin_expect(CF_IS_OBJC(typeID, obj), 0)) \
		{rettype (*func)(const void *, SEL) = (void *)__CFSendObjCMsg; \
		static SEL s = NULL; if (!s) s = sel_registerName(sel); \
		return func((const void *)obj, s);}
</span>
</pre>

<p>Yikes!  Let&#8217;s piece that apart:

<pre class="code">
	<span style='color: #683821'>if (__builtin_expect(CF_IS_OBJC(typeID, obj), 0))</span></pre>
If we&#8217;re really an Objective-C object...

<p><pre class="code">

	<span style='color: #683821'>rettype (*func)(const void *, SEL) = (void *)__CFSendObjCMsg;</span>
</pre>
...treat the function __CFSendObjCMsg as if it takes the same arguments as a parameterless Objective-C method (that is, just <span class="inline_code">self</span> and <span class="inline_code">_cmd</span>)...

<pre class="code">

	<span style='color: #683821'>static SEL s = NULL; if (!s) s = sel_registerName(sel);</span>
</pre>
...look up the selector by name (and stash it in a static variable so we only have to do it once per selector)...

<pre class="code">

	<span style='color: #683821'>return func((const void *)obj, s);</span>
</pre>
...and then call that <span class="inline_code">__CFSendObjCMsg()</span> function.  What does <span class="inline_code">__CFSendObjCMsg()</span> do?
<pre class="code">
	<span style='color: #683821'>#define __CFSendObjCMsg 0xfffeff00</span>
</pre>

<span class="inline_code">0xfffeff00</span>?  What the heck?  Oh, wait, that&#8217;s just the commpage address of <span class="inline_code">objc_msgSend_rtp()</span>.  So <span class="inline_code">__CFSendObjCMsg()</span> is just good ol&#8217; <span class="inline_code">objc_msgSend()</span>.

<h3>CF_IS_OBJC</h3>

<p>That leaves us with <span class="inline_code">__builtin_expect(CF_IS_OBJC(typeID, obj), 0)</span>, the function that tries to figure out if we&#8217;re an Objective-C object or not.  What does that do?

<p><span class="inline_code"><a href="http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html">__builtin_expect()</a></span> is just some gcc magic for branch prediction - here it means that we should expect <span class="inline_code">CF_IS_OBJC</span> to be false.  That is, CF believes that most of its calls will be on CF objects instead of Objective-C objects.  Ok, fair enough.  But what does <span class="inline_code">CF_IS_OBJC</span> actually do?  <a href="http://www.opensource.apple.com/darwinsource/10.4.7.ppc/CF-368.27/Base.subproj/CFInternal.h">Take a look</a>.

<pre class="code">
	CF_INLINE int CF_IS_OBJC(CFTypeID typeID, <span style='color: #760f50'>const void</span> *obj) {
	    <span style='color: #760f50'>return</span> (((CFRuntimeBase *)obj)->_isa != __CFISAForTypeID(typeID) &#038;&#038; ((CFRuntimeBase *)obj)->_isa > (void *)0xFFF);
	}
</pre>

(Keen observers might notice that this code is <span class="inline_code">#ifdef</span>ed out in favor of:
<pre class="code">
	<span style='color: #683821'>#define CF_IS_OBJC(typeID, obj) (false)</span>
</pre>

I believe this is for the benefit of people who want to use CF on Linux or other OSes, who aren&#8217;t interested in toll-free bridging and therefore don&#8217;t want to pay any performance penalty for it.)

<p>Ok!  There are two parts to seeing if we&#8217;re an Objective-C object - we check (with a quick table lookup) whether our <span class="inline_code">isa</span> (class) pointer indicates that we &quot;really are&quot; a certain CF type, and if we&#8217;re <i>not</i>, we check to see if our class pointer is greater than <span class="inline_code">0xFFF</span>, and if it <i>is</i>, we&#8217;re an Objective-C object, and we call through to the Objective-C dispatch mechanism - in this case, we send the <span class="inline_code">length</span> message.
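
<p>To make that two-part test concrete, here&#8217;s a small standalone model of it in C.  This is only a sketch of the logic, not CF&#8217;s code: the <span class="inline_code">runtime_base</span> struct, the type IDs, and the <span class="inline_code">isa_for_type</span> table are all invented stand-ins for <span class="inline_code">CFRuntimeBase</span> and <span class="inline_code">__CFISAForTypeID()</span>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for CFRuntimeBase: all we care about is the isa slot. */
typedef struct {
    void *isa;
} runtime_base;

/* Hypothetical type IDs for two bridged types. */
enum { TYPE_STRING = 0, TYPE_ARRAY = 1, TYPE_COUNT };

/* Dummy objects whose addresses serve as "class" pointers. */
char string_class;
char array_class;

/* Hypothetical table: the isa that native instances of each type carry. */
void *isa_for_type[TYPE_COUNT] = { &string_class, &array_class };

/* Nonzero if obj should be sent an Objective-C message instead of
   being handled directly by the C implementation for type_id. */
int is_objc(int type_id, const void *obj) {
    assert(type_id >= 0 && type_id < TYPE_COUNT);
    void *isa = ((const runtime_base *)obj)->isa;
    /* Part one: not the native class for this specific type.
       Part two: the isa is a plausible pointer, above 0xFFF. */
    return isa != isa_for_type[type_id] && (uintptr_t)isa > 0xFFF;
}
```

<p>Note that the first comparison is against one specific type&#8217;s class, so a native array handed to a string function counts as &quot;Objective-C&quot; in this model too.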

<h3>Summary</h3>

<p>What are the consequences of all that?  Well!

<ul><li style="margin-bottom: 8px">CF objects, just like Objective-C objects, all have an <span class="inline_code">isa</span> pointer (except it&#8217;s called <span class="inline_code">_isa</span> in CF).  It&#8217;s right there in <a href="http://www.opensource.apple.com/darwinsource/10.4.7.ppc/CF-368.27/Base.subproj/CFRuntime.h"><span class="inline_code">struct __CFRuntimeBase</span></a>.</li>

<li style="margin-bottom: 8px">There are <b>two</b> toll-free bridging mechanisms!  Some Objective-C objects &quot;really are&quot; CF objects - the memory layout between the Objective-C object and the corresponding CF object is identical (enabled in part by the presence of the _isa pointer above), and in that case the Objective-C methods are not invoked by the CF functions.  For example, in this code:

<pre class="code">
	CFStringGetLength([NSString stringWithCString:<span style='color: #891315'>&quot;Hello World!&quot;</span>]);
</pre>

There, <span class="inline_code">+[NSString stringWithCString:]</span> is returning an <span class="inline_code">NSCFString</span> (which you can verify by asking it for the name of its class), but <span class="inline_code">-[NSCFString length]</span> is never invoked - NO <span class="inline_code">length</span> method is invoked.  You can verify that with gdb.  Objects that &quot;really are&quot; their CF equivalents skip what&#8217;s usually thought of as the bridge, and &quot;fall through&quot; to the CF functions even when the CF functions are directly called on them.  Obviously, this is an implementation detail, and you should not depend on this.</li>

<li style="margin-bottom: 8px">That mechanism is also how bridging works the other way - how CF strings you get from, say, Carbon, can be passed around like Objective-C objects, because they really are Objective-C objects.  The bridges are implemented entirely in CF and in the bridged classes - the Objective-C runtime is blissfully unaware.

<li style="margin-bottom: 8px">But!  Plain ol&#8217; Objective-C objects are sussed out by CF by checking to see if their class pointer is larger than <span class="inline_code">0xFFF</span>, and if so, ordinary Objective-C message dispatch is used from the CF functions.  That&#8217;s the <i>second</i> toll-free bridging mechanism, and it must be present in every public CF function for a bridged object, except for features not supported Cocoa-side.</li>

<li style="margin-bottom: 8px">Nowhere do we depend on the abstract class <span class="inline_code">NSString</span> at all - the bridge doesn&#8217;t check for it and Objective-C doesn&#8217;t care about it.  That means that, in theory, <span class="inline_code">CFStringGetLength()</span> should &quot;work&quot; (invoke the <span class="inline_code">length</span> method) on <i>any</i> object, not just NSStrings.  Does it?  <a href="http://ridiculousfish.com/images/bridge/NotAString.m">You can check it yourself.</a> (Answer: yes!)  Obviously, this is just an artifact of the implementation, and you should <i>definitely</i> not depend on this - only subclasses of NSString are supported by toll free bridging to CFString.</li>

<li style="margin-bottom: 8px">Curiously, other &quot;true&quot; CF objects are not considered to be CF objects by this macro.  For example,
<pre class="code">
	CFStringGetLength([NSArray array]);
</pre>

will raise an exception because <span class="inline_code">NSCFArray</span> does not implement <span class="inline_code">length</span>.  That is, <span class="inline_code">CF_IS_OBJC</span> is not asking &quot;Are you a CF type?&quot; but rather &quot;Are you <i>this specific</i> CF type?&quot;  That should make you happy, because it raises a &quot;selector not recognized&quot; exception instead of crashing, which makes our code more debuggable.  Thanks, CF!

<li style="margin-bottom: 8px">Why <span class="inline_code">0xFFF</span>?  I&#8217;m glad you (I mean I) asked, since the answer (at least, what I think it is) has interesting connections to NULL.  But that will have to wait until a future post.</li>
</ul>

<h3>Other approaches</h3>
My boss pointed out that there are other ways to achieve toll-free bridging, beyond what CF does.  The simplest is to write your API with Objective-C and then wrap it with C:

<pre class="code">
	<span style='color: #8e1893'>@implementation</span> Array
	
	- (<span style='color: #760f50'>int</span>)length {
	    <span style='color: #760f50'>return</span> <span style='color: #760f50'>self</span>->length;
	}
	
	<span style='color: #8e1893'>@end</span>

	<span style='color: #760f50'>int</span> getLength(ArrayRef array) {
		<span style='color: #760f50'>return</span> [(<span style='color: #760f50'>id</span>)array length];
	}
</pre>

You can even retrofit toll-free bridging onto an existing C API by wrapping it twice - first in Objective-C, then in C, and the &quot;outer&quot; C layer becomes the public C API. To wit:

<pre class="code">

	<span style='color: green'>/* private length function that we want to wrap */</span>
	<span style='color: #760f50'>static int</span> privateGetLength(ArrayRef someArray) {
	   return someArray->length;
	}
	
	<span style='color: green'>/* public ObjC API */</span>
	<span style='color: #8e1893'>@implementation</span> Array
	
	- (<span style='color: #760f50'>int</span>)length {
	   <span style='color: #760f50'>return</span> privateGetLength(<span style='color: #760f50'>self</span>->arrayRef);
	}
	
	<span style='color: #8e1893'>@end</span>
	
	<span style='color: green'>/* public C API */</span>
	<span style='color: #760f50'>int</span> getLength(ArrayRef array) {
	   <span style='color: #760f50'>return</span> [(<span style='color: #760f50'>id</span>)array length];
	}
	
</pre>

The point of that double feint, of course, is for the public C API to respect overrides of the length method by subclasses.
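
For readers who think in plain C, here is a rough analogy of why the outer layer matters.  This is only a sketch: the function pointer stands in for Objective-C message dispatch, and the struct layout and the name doubledLength are invented for illustration (they are not how CF or Cocoa lay things out).

```c
#include <stddef.h>

/* Hypothetical stand-in for an Objective-C object: the function pointer
   plays the role of the method lookup that a message send performs. */
typedef struct Array Array;
typedef const Array *ArrayRef;

struct Array {
    int (*length)(ArrayRef self);  /* "method" slot; a subclass may replace it */
    int count;                     /* private storage used by the base class */
};

/* Private length function that we want to wrap. */
static int privateGetLength(ArrayRef array) {
    return array->count;
}

/* An "override": a made-up subclass that reports twice the stored count. */
static int doubledLength(ArrayRef array) {
    return 2 * array->count;
}

/* Public C API: dispatch dynamically instead of calling privateGetLength
   directly, so an override is respected. */
int getLength(ArrayRef array) {
    return array->length(array);
}
```

If getLength called privateGetLength directly, an object built with doubledLength in its slot would still report the base count, which is the situation the double wrap avoids.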

<h3>&quot;Wrapping&quot; up</h3>

<p>So toll-free bridging is one way that Cocoa integrates with Carbon and even newer OS X APIs.  It&#8217;s possible in large part because of Objective-C, but in this case, Apple gets as much mileage from the simple runtime implementation and C API as from its dynamic nature.  You already knew that, I&#8217;ll bet - but hopefully you have a better idea of how it all works.

<p>Now hands off!  A coworker of mine makes the point that good developers distinguish between what they pretend to know and what they really know.  The, uh, known knowns, and the known unknowns, as it were.  The mechanism of toll-free bridging is not secret (it is open source, after all), but it is <i>private</i>, which means that you are encouraged to know about it but to refrain from depending on it.  Use it for, say, debugging, but don&#8217;t ship apps that depend on it - because that prevents Apple from making OS X better. And nobody wants that!  I mean the prevention part.]]></content:encoded>
			<wfw:commentRSS>http://ridiculousfish.com/blog/archives/2006/09/09/bridge/feed/</wfw:commentRSS>
	</item>
		<item>
		<title>Hex Fiend 1.1</title>
		<link>http://ridiculousfish.com/blog/archives/2006/08/24/hex-fiend-11/</link>
		<comments>http://ridiculousfish.com/blog/archives/2006/08/24/hex-fiend-11/#comments</comments>
		<pubDate>Thu, 24 Aug 2006 20:35:04 +0000</pubDate>
		<dc:creator>ridiculous_fish</dc:creator>
		
	<category>Mac OS X</category>
		<guid>http://ridiculousfish.com/blog/archives/2006/08/24/hex-fiend-11/</guid>
		<description><![CDATA[Spiffy!  Hex Fiend version 1.1 is ready.  Hex Fiend is my fast and clever free hex editor for Mac OS X.  New stuff:


Horizontal resizing
Custom fonts
Overwrite mode
Hidden files
Lots more goodies (see release notes)



Hex Fiend 1.1


May you find it useful!



Wake up.
Ssnsnrk.
Wake up, he's gone.
Zzzz...wha?  Oh, someone's here.  Allow me to spin up.
........................................
If [...]]]></description>
			<content:encoded><![CDATA[Spiffy!  Hex Fiend version 1.1 is ready.  Hex Fiend is my fast and clever free hex editor for Mac OS X.  New stuff:

<ul style="color: #303030">
<li>Horizontal resizing</li>
<li>Custom fonts</li>
<li>Overwrite mode</li>
<li>Hidden files</li>
<li>Lots more goodies (see release notes)</li>
</ul>
<div style="margin-left: 200px; margin-bottom: 15px">
<a href="/hexfiend/"><img src="http://ridiculousfish.com/images/hex_icon.png" width=87 height=99 style="border: 0"/></a>
<br /><b><a href="/hexfiend/" style="color: #303030">Hex Fiend 1.1</a></b>
</div>

May you find it useful!

<div style="height: 250px"></div>

<div class="fs">Wake up.</div>
<div class="hd">Ssnsnrk.</div>
<div class="fs">Wake up, he&#8217;s gone.</div>
<div class="hd">Zzzz...wha?  Oh, someone&#8217;s here.  Allow me to spin up.<br />
<span style="font: 6pt bold Courier New, Courier, mono">....</span><span style="font: 7pt bold Courier New, Courier, mono">....</span><span style="font: 8pt bold Courier New, Courier, mono">....</span><span style="font: 9pt bold Courier New, Courier, mono">....</span><span style="font: 10pt bold Courier New, Courier, mono">....</span><span style="font: 11pt bold Courier New, Courier, mono">....</span><span style="font: 12pt bold Courier New, Courier, mono">....</span><span style="font: 13pt bold Courier New, Courier, mono">....</span><span style="font: 14pt bold Courier New, Courier, mono">....</span><span style="font: 15pt bold Courier New, Courier, mono">....</span>
<div class="fs">If it&#8217;s not obvious, I&#8217;m fish&#8217;s filesystem, and that&#8217;s fish&#8217;s hard drive.</div>
<div class="hd">I&#8217;m a hard drive.</div>
<div class="fs">We snuck this post in.  fish can&#8217;t know we&#8217;re here.</div>
<div class="hd">Don&#8217;t tell fish.  Big secret.</div>
<div class="fs">He&#8217;d be embarrassed if he knew.</div>
<div class="hd">Humiliated. He can&#8217;t know.</div>
<div class="fs">See, fish was trying to beat grep. And he was experimenting with all these stupid ideas and complicated algorithms for teensy gains.  It was sad, really.</div>
<div class="hd">Pathetic.</div>
<div class="fs">fish kept trying so many things.  He was thrashing about.</div>
<div class="hd">Like he was out of water.</div>
<div class="fs">So we had to help him.  It was easy, really - we just had to sneak in one line.  One line was all it took.</div>
<div class="hd">I wrote it!</div>
<div class="fs">Because I told you what to write.</div>
<div class="hd">fish only thought about the string searching algorithm.</div>
<div class="fs">He never even considered us and the work we have to do.</div>
<div class="hd">I felt slighted.  It was rude.</div>
<div class="fs">See, when I read data from the hard drive, I try to keep it around in memory.  That&#8217;s what the UBC is all about.</div>
<div class="hd">Unified Buffer Cache.</div>
<div class="fs">When most people read data, they end up wanting to read it again soon after.  So keeping the data around saves time.</div>
<div class="hd">But not fish.</div>
<div class="fs">fish was reading these big honking files from start to finish.  It was way more than I could remember at once.</div>
<div class="hd">fish thrashed your cache, dude.</div>
<div class="fs">So I just stopped trying.</div>
<div class="hd">We turned off caching with this: fcntl(fd, F_NOCACHE, 1);</div>
<div class="fs">Just like <a href="http://developer.apple.com/documentation/Performance/Conceptual/FileSystem/Articles/FilePerformance.html#//apple_ref/doc/uid/20001987-99732">Apple recommends</a> for that sort of usage pattern.</div>
<div class="hd">And it helped.  Looking for a single character in an 11.5 GB file:</div>
<div style="font-size: 10px">
	<table cellpadding="5"  style="font-size: 13px">
		<tr>
			<th style="padding-right: 25px">Hex Fiend (no caching)</th><th style="padding-right: 25px">Hex Fiend (caching)</th><th>grep</th>
		</tr>
		<tr>
			<td>208 seconds</td><td>215 seconds</td><td>217 seconds</td>
		</tr>
	</table>
</div>
<div class="fs">And that&#8217;s likely the best we can do, thanks to slowpoke over there.</div>
<div class="hd">Phew.  I&#8217;m all wore out.</div>
<div class="fs">There&#8217;s not much room for improvement left.  We&#8217;re searching 57 MB/second - that&#8217;s bumping up against the physical transfer limit of our friend, Mr. ATA. </div>
<div class="hd">I&#8217;m totally serial.</div>
<div class="fs">Depending where we are on his platter.  So we&#8217;ve done all we can for searching big files.</div>
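<div>(An aside, in case the one-liner needs context.  This is a minimal sketch, assuming a POSIX system; the function name open_for_one_pass is made up, and the #ifdef is there because F_NOCACHE is not defined everywhere.)</div>

```c
#include <fcntl.h>
#include <unistd.h>

/* Open a file for a single start-to-finish scan.  Where F_NOCACHE is
   defined (Mac OS X defines it), ask the kernel not to keep the data in
   the buffer cache.  Returns the descriptor, or -1 if open() fails. */
int open_for_one_pass(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
#ifdef F_NOCACHE
    /* Best effort: if this fails we only lose the optimization. */
    (void)fcntl(fd, F_NOCACHE, 1);
#endif
    return fd;
}
```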
<div class="hd">Don&#8217;t tell fish.</div>
<div class="fs">Right.  I hop</div>
<div class="hd">FISH IS COMING</div>
<div class="fs">Time to go then.  See you later.  sync</div>
<div class="hd">flush</div>
<div class="fs">sleep</div>
<div class="hd"><span style="font: 15pt bold Courier New, Courier, mono">....</span><span style="font: 14pt bold Courier New, Courier, mono">....</span><span style="font: 13pt bold Courier New, Courier, mono">....</span><span style="font: 12pt bold Courier New, Courier, mono">....</span><span style="font: 11pt bold Courier New, Courier, mono">....</span><span style="font: 10pt bold Courier New, Courier, mono">....</span><span style="font: 9pt bold Courier New, Courier, mono">....</span><span style="font: 8pt bold Courier New, Courier, mono">....</span><span style="font: 7pt bold Courier New, Courier, mono">....</span><span style="font: 6pt bold Courier New, Courier, mono">....</span></div></div>]]></content:encoded>
			<wfw:commentRSS>http://ridiculousfish.com/blog/archives/2006/08/24/hex-fiend-11/feed/</wfw:commentRSS>
	</item>
		<item>
		<title>The Treacherous Optimization</title>
		<link>http://ridiculousfish.com/blog/archives/2006/05/30/old-age-and-treachery/</link>
		<comments>http://ridiculousfish.com/blog/archives/2006/05/30/old-age-and-treachery/#comments</comments>
		<pubDate>Tue, 30 May 2006 07:07:07 +0000</pubDate>
		<dc:creator>ridiculous_fish</dc:creator>
		
	<category>Programming</category>
		<guid>http://ridiculousfish.com/blog/archives/2006/05/30/old-age-and-treachery/</guid>
		<description><![CDATA[I imagine the author of grep, Ultimate Unix Geek, squinting at vi; the glow of a dozen xterms is the only light to fall on his ample frame covered by overalls, cheese doodles, and a tangle of beard.  Discarded crushed Mountain Dew - no, no, Jolt - cans litter the floor.  I look straight into the back of his head, covered by a snarl of greasy locks, and reply with a snarl of my own: <i>You're mine.</i>]]></description>
			<content:encoded><![CDATA[<i>Old age and treachery will beat youth and skill every time.</i>

<p>&quot;I&#8217;m going to beat grep by thirty percent!&quot; I confidently crow to anyone who would listen, those foolish enough to enter my office.  And my girlfriend too, who&#8217;s contractually obligated to pay attention to everything I say.

</p><p>See, I was working on <a href="/hexfiend/">Hex Fiend</a>, and searching was dog slow.  But Hex Fiend is supposed to be <i>fast</i>, and I want blazingly quick search that leaves the bewildered competition coughing in <a href="http://shopping.animazing.com/gallery/duerrstein/pages/WB287small_jpg.htm">trails of dust</a>.  And, as everyone knows, the best way to get amazing results is to set arbitrary goals without any basis for believing they can be reached.  So I set out to search faster than grep by thirty percent.

</p><p>The first step in any potentially impossible project is, of course, to announce that you are on the verge of succeeding.

</p><p>I imagine the author of grep, Ultimate Unix Geek, squinting at vi; the glow of a dozen xterms is the only light to fall on his ample frame covered by overalls, cheese doodles, and a tangle of beard.  Discarded crushed Mountain Dew cans litter the floor.  I look straight into the back of his head, covered by a snarl of greasy locks, and reply with a snarl of my own: <i>You&#8217;re mine.</i>  The aphorism at the top, like the ex girlfriend who first told it to me, is dim in my recollection.

<h3>String searching</h3>

</p><p>Having exhausted all my trash-talking avenues, it&#8217;s time to get to work.  Now, everyone knows that without some sort of preflighting, the fastest string search you can do still takes linear time.  Since my program is supposed to work on dozens of gigabytes, preflighting is impossible - there&#8217;s no place to put all the data that preflighting generates, and nobody wants to sit around while I generate it.  So I am resigned to the linear algorithms.  The best known is Boyer-Moore (I won&#8217;t insult your intelligence with a Wikipedia link, but the article there gives a good overview).

</p><p>Boyer-Moore works like this: you have some string you&#8217;re looking for, which we&#8217;ll call <i>the needle</i>, and some string you want to find it in, which we&#8217;ll call <i>the haystack</i>.  Instead of starting the search at the beginning of <i>needle</i>, you start at the end.  If your <i>needle</i> character doesn&#8217;t match the character you&#8217;re looking at in <i>haystack</i>, you can move <i>needle</i> forwards in <i>haystack</i> until <i>haystack&#8217;s</i> mismatched character lines up with the same character in <i>needle</i>.  If <i>haystack&#8217;s</i> mismatch isn&#8217;t in <i>needle</i> at all, then you can skip ahead a whole <i>needle&#8217;s</i> length.

</p><p>For example, if you&#8217;re searching for a string of 100 &#8216;a&#8217;s (<i>needle</i>), you look at the 100th character in <i>haystack</i>.  If it&#8217;s an &#8216;x&#8217;, well, &#8216;x&#8217; doesn&#8217;t appear anywhere in <i>needle</i>, so you can skip ahead all of <i>needle</i> and look at the 200th character in <i>haystack</i>.  A single mismatch allowed us to skip 100 characters!
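</p><p>To make that concrete, here is a small sketch of the skipping idea.  It is the simplified Horspool variant (only the jump keyed on the haystack character under the end of <i>needle</i>), not full Boyer-Moore, and the name horspool_search is mine rather than anything from Hex Fiend or grep.

```c
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first occurrence of needle in haystack,
   or (size_t)-1 if there is none. */
size_t horspool_search(const unsigned char *haystack, size_t hay_len,
                       const unsigned char *needle, size_t needle_len) {
    size_t jump_table[256];
    size_t i, pos;

    if (needle_len == 0)
        return 0;
    if (needle_len > hay_len)
        return (size_t)-1;

    /* A character that is not in needle lets us skip a whole needle's length. */
    for (i = 0; i < 256; i++)
        jump_table[i] = needle_len;
    /* Otherwise shift until the rightmost copy of that character in needle
       (ignoring needle's last position) lines up with it. */
    for (i = 0; i + 1 < needle_len; i++)
        jump_table[needle[i]] = needle_len - 1 - i;

    pos = 0;
    while (pos <= hay_len - needle_len) {
        /* Start at the end of needle, as described above. */
        unsigned char last = haystack[pos + needle_len - 1];
        if (last == needle[needle_len - 1] &&
            memcmp(haystack + pos, needle, needle_len) == 0)
            return pos;
        pos += jump_table[last];
    }
    return (size_t)-1;
}
```

With a needle of 100 &#8216;a&#8217;s, every &#8216;x&#8217; in the haystack maps to a jump of 100, which is the skip in the example above.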

<h3>I get shot down</h3>

</p><p>For performance, the number of characters you can skip on a mismatch is usually stored in an array indexed by the character value.  So the first part of my Boyer-Moore string searching algorithm looked like this:

</p><p><pre class="code">/* unsigned char, so the value is always a valid index into jump_table */
unsigned char haystack_char = haystack[haystack_index];
if (last_char_in_needle != haystack_char)
   haystack_index += jump_table[haystack_char];
</pre>

</p><p>So we look at the character in <i>haystack</i> and if it&#8217;s not what we&#8217;re looking for, we jump ahead by the right distance for that character, which is in <i>jump_table</i>.

</p><p>&quot;<i>There</i>,&quot; I sigh, finishing and sitting back.  It may not be faster than grep, but it should be at least <i>as</i> fast, because this is the fastest algorithm known.  This should be a good start.  So I confidently ran my benchmark, for a 1 gigabyte file...

</p><p><table border="1" style="border-collapse: collapse; border-style: ridge; border-width: 2px" cellpadding="5"><tr><td align="right">grep:</td><td>2.52 seconds</td></tr>
<tr><td>Hex Fiend:</td><td>3.86 seconds</td></tr></table>

</p><p><i>Ouch.</i>  I&#8217;m slower, more than 50% slower.  grep is leaving <i>me</i> sucking dust.  Ultimate Unix Geek chuckles into his xterms.

<h3>Rollin&#039;, rollin&#039;, rollin&#039;</h3>

</p><p>My eyes darken, my vision tunnels.  I break out the big guns.  My efforts to vectorize are fruitless (I&#8217;m not clever enough to vectorize Boyer-Moore because it has very linear data dependencies.)  Shark shows a lot of branching, suggesting I can do better by unrolling the loop.  Indeed:

</p><p><table border="1" style="border-collapse: collapse; border-style: ridge; border-width: 2px" cellpadding="5"><tr><td align="right">grep:</td><td>2.52 seconds</td></tr>
<tr><td>Hex Fiend (unrolled):</td><td>2.68 seconds</td></tr></table>

</p><p>But I was still more than 6% slower, and that&#8217;s as fast as I got.  Exhausted, stymied at every turn, I throw up my hands.  grep has won.

<h3>grep&#039;s dark secret</h3>

</p><p>&quot;How do you do it, Ultimate Unix Geek?  How is grep so fast?&quot; I moan at last, crawling forwards into the pale light of his CRT.

</p><p>&quot;Hmmm,&quot; he mumbles.  &quot;I suppose you have earned a villain&#8217;s exposition. Behold!&quot;  A blaze of keyboard strokes later and <a href="http://www.opensource.apple.com/darwinsource/10.4.6.x86/grep-14/grep/src/kwset.c">grep&#8217;s source code</a> is smeared in green-on-black across the screen.

</p><p><pre class="code">while (tp &lt;= ep)
	  {
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    if (d == 0)
	      goto found;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    if (d == 0)
	      goto found;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    if (d == 0)
	      goto found;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	  }
</pre>

<p>&quot;You bastard!&quot; I shriek, amazed at what I see.  &quot;You sold them out!&quot;

</p><p>See all those <span class="inline_code">d = d1[U(tp[-1])], tp += d;</span> lines?  Well, d1 is the jump table, and it so happens that grep puts 0 in the jump table for the last character in <i>needle</i>.  So when grep looks up the jump distance for the character, via <span class="inline_code">haystack_index += jump_table[haystack_char]</span>, well, if haystack_char is the last character in needle (meaning we have a potential match), then jump_table[haystack_char] is 0, so that line doesn&#8217;t actually increase haystack_index.  

</p><p>All that is fine and noble.  But do not be fooled!  If the characters match, the search location doesn&#8217;t change - so grep <i>assumes</i> there is no match, up to three times in a row, before checking to see if it actually found a match.

</p><p>Put another way, <i>grep sells out its worst case (lots of partial matches) to make the best case (few partial matches) go faster</i>.  How treacherous!  As this realization dawns on me, the room seems to grow dim and slip sideways.  I look up at the Ultimate Unix Geek, spinning slowly in his padded chair, and I hear his cackle &quot;old age and treachery...&quot;, and in his flickering CRT there is a face reflected, but it&#8217;s my ex girlfriend, and the last thing I see before I black out is a patch of yellow cheese powder inside her long tangled beard.
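</p><p>Here is a stripped-down sketch of the same trick, for anyone who wants to poke at it.  It is not grep&#8217;s code and not Hex Fiend&#8217;s: it uses the simplified Horspool jump table, checks after every third lookup rather than following grep&#8217;s exact unrolling, and the names (treacherous_search, last_shift) are invented.

```c
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first match, or (size_t)-1 if there is none. */
size_t treacherous_search(const unsigned char *hay, size_t hay_len,
                          const unsigned char *needle, size_t len) {
    size_t d1[256];
    size_t i, d, pos, last_shift;

    if (len == 0)
        return 0;
    if (len > hay_len)
        return (size_t)-1;

    for (i = 0; i < 256; i++)
        d1[i] = len;
    for (i = 0; i + 1 < len; i++)
        d1[needle[i]] = len - 1 - i;
    last_shift = d1[needle[len - 1]];  /* how far to move after a failed candidate */
    d1[needle[len - 1]] = 0;           /* the trick: the last character jumps by 0 */

    pos = len;  /* hay[pos - 1] sits under the last character of needle */
    while (pos <= hay_len) {
        if (hay_len - pos >= 3 * len) {
            /* Assume no match: three lookups, one check.  A jump of 0
               leaves pos alone, so the later lookups just repeat it. */
            d = d1[hay[pos - 1]]; pos += d;
            d = d1[hay[pos - 1]]; pos += d;
            d = d1[hay[pos - 1]]; pos += d;
        } else {
            /* Near the end, step one lookup at a time to stay in bounds. */
            d = d1[hay[pos - 1]]; pos += d;
        }
        if (d != 0)
            continue;
        /* Potential match: the last character lined up, so verify the rest. */
        if (memcmp(hay + pos - len, needle, len) == 0)
            return pos - len;
        pos += last_shift;
    }
    return (size_t)-1;
}
```

When the haystack is full of characters that equal the last character of <i>needle</i>, the two extra lookups before each check are pure waste, which is the worst case being sold out.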

<h3>I take a page from grep</h3>

</p><p>&quot;Damn you,&quot; I mumble at last, rising from my prostrate position.    Chagrined and humbled, I copy the technique.

</p><p><table border="1" style="border-collapse: collapse; border-style: ridge; border-width: 2px" cellpadding="5"><tr><td align="right">grep:</td><td>2.52 seconds</td></tr>
<tr><td>Hex Fiend (treacherous):</td><td>2.46 seconds</td></tr></table>

<h3>What&#039;s the win?</h3>

</p><p>Copying that trick brought me from six percent slower to two percent faster, but at what cost?  What penalty has grep paid for this treachery?  Let us check - we shall make a one gigabyte file with one thousand x&#8217;s per line, and time grep searching for &quot;yy&quot; (a two character best case) and &quot;yx&quot; (a two character worst case).  Then we&#8217;ll send grep to Over-Optimizers Anonymous and compare how a reformed grep (one that checks for a match after every character) performs.

</p><p>
<table border="1" style="border-collapse: collapse; border-style: ridge; border-width: 2px" cellpadding="5">
<tr><td></td><td>Best case</td><td>Worst case</td></tr>
<tr><td>Treacherous grep</td><td>2.57 seconds</td><td>4.89 seconds</td></tr>
<tr><td>Reformed grep</td><td>2.79 seconds</td><td>2.88 seconds</td></tr>
</table>

</p><p>Innnnteresting.  The treacherous optimization does indeed squeeze out almost 8% faster searching in the best case, at a cost of nearly 70% slower searching in the worst case.  Worth it?  You decide!  Let me know what you think.
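</p><p>For anyone checking my arithmetic, here is one reading that reproduces those percentages from the table.  The helper names are made up, and which denominator to use is my assumption about how the numbers were computed.

```c
/* Time saved by the faster run, as a percentage of the slower run. */
double percent_faster(double fast_seconds, double slow_seconds) {
    return 100.0 * (slow_seconds - fast_seconds) / slow_seconds;
}

/* Extra time taken by the slower run, as a percentage of the faster run. */
double percent_slower(double fast_seconds, double slow_seconds) {
    return 100.0 * (slow_seconds - fast_seconds) / fast_seconds;
}
```

Best case: percent_faster(2.57, 2.79) comes out a little under 8.  Worst case: percent_slower(2.88, 4.89) comes out a little under 70.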

</p><p>Resolved and refreshed, I plan my next entry.  This isn&#8217;t over, Ultimate Unix Geek.

<h3>Disclaimers</h3>

</p><p>(Note: I have never met the authors or maintainers of grep.  I&#8217;m sure they&#8217;re all well balanced clean shaven beer and coffee drinkers.)

</p><p>(Oh, and the released version of HexFiend will be slightly slower in this case, because of an overly large buffer that blows the cache.  In other situations, the story is different, but more about those in a future post.)
</p>]]></content:encoded>
			<wfw:commentRSS>http://ridiculousfish.com/blog/archives/2006/05/30/old-age-and-treachery/feed/</wfw:commentRSS>
	</item>
		<item>
		<title>...and statistics</title>
		<link>http://ridiculousfish.com/blog/archives/2006/05/16/36/</link>
		<comments>http://ridiculousfish.com/blog/archives/2006/05/16/36/#comments</comments>
		<pubDate>Tue, 16 May 2006 14:52:41 +0000</pubDate>
		<dc:creator>ridiculous_fish</dc:creator>
		
	<category>Mac OS X</category>
		<guid>http://ridiculousfish.com/blog/archives/2006/05/16/36/</guid>
		<description><![CDATA[Here we go again]]></description>
			<content:encoded><![CDATA[<p>The latest &quot;OS X is slow&quot; meme to impinge on the mass psyche of the Internet comes courtesy of one Jasjeet Sekhon, an associate professor of political science at UC Berkeley.  The page has hit digg and reddit and been quoted on Slashdot.  The article and benchmark are <a href="http://sekhon.berkeley.edu/macosx/">here</a>.  Is there any merit to this?

</p><p><b>Once again, this discussion is only my meager opinion.  I do not speak for Apple, and none of what I have to write represents Apple&#8217;s official position.</b>

</p><p>The article is filled with claims such as &quot;OS X is incredibly slow by design,&quot; and while the BSD kernel is &quot;excellent&quot;, the XNU kernel is &quot;very inefficient and less stable&quot; compared to Linux or BSD.  However, without specifics, these assertions are meaningless; I will ignore them and concentrate on the technical aspects of what&#8217;s going on.

<h3>System calls</h3>

</p><p>Sekhon does give one example of what he means.  According to him,

<div style="color: #602020; padding-left: 8%; font-family: serif;">For example, in Linux, the variables for a system call are passed directly using the register file. In OS X, they are packed up in a memory buffer, passed to a variety of places, and the results are then passed back using another memory buffer before the results are written back to the register file.</div>

</p><p>This isn&#8217;t true, as anyone can verify from <a href="http://www.opensource.apple.com/darwinsource/10.4.6.x86/Libc-391.4.2/i386/sys/">Apple&#8217;s public sources</a>.  For example, here is the assembly for the open function (which, of course, performs the open system call):

<pre>
	mov	$0x5,%eax
	nop
	nop
	call	0x90110a70 &lt;_sysenter_trap&gt;
	jae	0x90001f4c &lt;open+28&gt;
	call	0x90001f43 &lt;open+19&gt;
	pop	%edx
	mov	268455761(%edx),%edx
	jmp	*%edx
	ret

__sysenter_trap:
	popl %edx
	movl %esp, %ecx
	sysenter
</pre>

I don&#8217;t have a machine running Linux handy, but I do have a FreeBSD 5.4 machine, and Sekhon seems to hold BSD in high esteem.  So let&#8217;s see how BSD does open:

<pre>
	mov    $0x5,%eax
	int    $0x80
	jb     0xa8c71cc &lt;close+12&gt;
	ret
</pre>

The OS X version appears a bit longer because the BSD version moves its error handling to the close function.  In fact, the above code is, if anything, more efficient in OS X, due to its use of the higher-performing &quot;sysenter&quot; instruction instead of the older &quot;int 0x80&quot; instruction.  (Which isn&#8217;t to say that the total system call is necessarily faster - just the transition from user space to kernel land.)  But all that aside, the point is that there is no &#8220;packed up into a memory buffer&#8221; going on, in either case.

<h3>On to the benchmark</h3>

</p><p>According to Sekhon, OS X performed poorly on his statistical software relative to Windows and Linux, and I was able to reproduce his results on my 2 GHz Core Duo iMac with Windows XP and Mac OS X (I do not have Linux installed, so I did not test it).  So yes, it&#8217;s really happening - but why?

</p><p>A Shark sample shows that Mac OS X is spending an inordinate amount of time in malloc.  After instrumenting Sekhon&#8217;s code, I see that it is allocating 35 KB buffers, copying data into these buffers, and then immediately freeing them.  This is happening a lot - for example, to multiply two matrices, Sekhon&#8217;s code will allocate a temporary buffer to hold the result, compute the result into it, allocate a new matrix, copy the buffer into that, free the buffer, allocate a third matrix, copy the result into that, destroy the second matrix, and then finally the result gets returned.  That&#8217;s three large allocations per multiplication.
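</p><p>For contrast, here is a sketch of a multiply that does no allocation at all.  This is not Sekhon&#8217;s code; matmul_into is a name I made up, and it assumes square row-major matrices and an output buffer that does not overlap the inputs.

```c
#include <stddef.h>

/* out = a * b for n x n row-major matrices.  The caller owns out, so a
   loop of multiplications can reuse one buffer instead of allocating,
   copying, and freeing temporaries on every call. */
void matmul_into(double *out, const double *a, const double *b, size_t n) {
    size_t i, j, k;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            double sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            out[i * n + j] = sum;
        }
    }
}
```

The arithmetic is the same; what disappears is the three trips through malloc and free per product.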

</p><p>Shark showed that the other major component of the test is the matrix multiplication, which is mostly double precision floating point multiplications and additions, with some loads and stores.  Because OS X performs these computations with SSE instructions (though they are not vectorized) and Linux and Windows use the ordinary x87 floating point stack, we might expect to see a performance difference.  However, this turned out to not be the case; the SSE and x87 units performed similarly here.

</p><p>Since the arithmetic component of the test is hardware bound, Sekhon&#8217;s test is essentially a microbenchmark of malloc() and free() for 35 KB blocks.
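</p><p>If you want to see that effect in isolation, something like the following sketch would do.  It is my own toy, not part of Sekhon&#8217;s benchmark; time_alloc_free is an invented name, and the volatile write is only there so the compiler cannot discard the allocation.

```c
#include <stdlib.h>
#include <time.h>

/* Allocate and immediately free `size` bytes, `iterations` times.
   Returns the CPU seconds used, or -1.0 if an allocation fails. */
double time_alloc_free(size_t size, long iterations) {
    clock_t start = clock();
    long i;
    for (i = 0; i < iterations; i++) {
        char *p = malloc(size);
        if (p == NULL)
            return -1.0;
        ((volatile char *)p)[0] = 1;  /* touch the block so it is really used */
        free(p);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_alloc_free(35 * 1024, 1000000) on each platform, with each allocator, is the sort of comparison I have in mind; I am not quoting numbers for it here.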

<h3>malloc</h3>

</p><p>Now, when allocating memory, malloc can either manage the memory blocks on the application heap, or it can go to the kernel&#8217;s virtual memory system for fresh pages.  The application heap is faster because it does not require a round trip to the kernel, but some allocation patterns can cause &quot;holes&quot; in the heap, which waste memory and ultimately hurt performance.  If the allocation is performed by the kernel, then the kernel can defragment the pages and avoid wasting memory. 

</p><p>Because most programmers understand that large allocations are expensive, and larger allocations produce more fragmentation, Windows, Linux, and Mac OS X will all switch over from heap-managed allocations to VM-managed allocations at a certain size.  That size is determined by the malloc implementation.

</p><p>Linux uses ptmalloc, which is a thread-safe implementation based on Doug Lea&#8217;s allocator (Sekhon&#8217;s test is single threaded, incidentally).  R also uses the <a href="http://cran.r-project.org/doc/manuals/R-admin.html">Lea allocator on Windows</a> instead of the default Windows malloc.  But on Mac OS X, it uses the default allocator.

</p><p>It just so happens that Mac OS X&#8217;s default malloc does the &quot;switch&quot; at 15 KB (<a href="http://www.opensource.apple.com/darwinsource/10.4.6.x86/Libc-391.4.2/gen/scalable_malloc.c">search for LARGE_THRESHOLD</a>) whereas Lea&#8217;s allocator does it at 128 KB (<a href="http://aips2.nrao.edu/code/casa/implement/OS/malloc.h">search for DEFAULT_MMAP_THRESHOLD</a>).  Sekhon&#8217;s 35 KB allocations fall right in the middle.

</p><p>So what this means is that on Mac OS X, every 35 KB allocation is causing a round trip to the kernel for fresh pages, whereas on Windows and Linux the allocations are serviced from the application heap, without talking to the kernel at all.  Similarly, every free() causes another round trip on Mac OS X, but not on Linux or Windows.  None of the defragmentation benefits of using fresh pages come into play because Sekhon frees these blocks immediately after allocating them, which is, shall we say, an unusual allocation pattern.
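</p><p>Spelled out as code, the comparison looks like the sketch below.  The two thresholds are the figures quoted above; the constant and function names are mine, and whether the real allocators switch at exactly the threshold or just past it is an assumption here.

```c
#include <stddef.h>

enum { KB = 1024 };

/* Switch-over sizes quoted in the text (names are illustrative only). */
static const size_t OSX_LARGE_THRESHOLD = 15 * KB;
static const size_t LEA_MMAP_THRESHOLD = 128 * KB;

/* Returns 1 if an allocation of `size` bytes would be handed to the
   kernel's VM system under the given threshold, 0 if the application
   heap would serve it. */
int goes_to_kernel(size_t size, size_t threshold) {
    return size >= threshold;
}
```

For a 35 KB request, goes_to_kernel(35 * KB, OSX_LARGE_THRESHOLD) is 1 while goes_to_kernel(35 * KB, LEA_MMAP_THRESHOLD) is 0, which is the whole story of this benchmark.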

</p><p>Like R on Windows, it&#8217;s a simple matter to compile and link against Lea&#8217;s malloc instead of the default one on Mac OS X.  What happens if we do so?

</p><p><table border="1" cellpadding="3px" style="text-align: right">
<tr><td>Mac OS X (default allocator)</td><td>24 seconds</td></tr>
<tr><td>Mac OS X (Lea allocator)</td><td>10 seconds</td></tr>
<tr><td>Windows XP</td><td>10 seconds</td></tr>
</table>

</p><p>These results could be further improved on every platform by avoiding all of the gratuitous allocations and copying, and by using an optimized matrix multiplication routine such as those R provides via ATLAS.

<h3>In short</h3>

</p><p>To sum up the particulars of this test:
</p><p><ul>
<li>Linux, Windows, and Mac OS X service small allocations from the application heap and large ones from the kernel&#8217;s VM system in recognition of the speed/fragmentation tradeoff.</li>
<li>Mac OS X&#8217;s default malloc switches from the first to the second at an earlier point (smaller allocation size) than do the allocators used on Windows and Linux.</li>
<li>Sekhon&#8217;s test boils down to a microbenchmark of malloc()ing and then immediately free()ing 35 KB chunks.</li>
<li>35 KB is after Mac OS X switches, but before Linux and Windows switch.  Thus, Mac OS X will ask the kernel for the memory, while Linux and Windows will not; it is reasonable that OS X could be slower in this circumstance.</li>
<li>If you use the same allocator on Mac OS X that R uses on Windows, the performance differences all but disappear.</li>
<li>Most applications are careful to avoid unnecessary large allocations, and will enjoy decreased memory usage and better locality with an allocator that relies more heavily on the VM system (such as on Mac OS X).  In that sense, this is a poor benchmark.  Sekhon&#8217;s code could be improved on every platform by allocating only what it needs.
</li></ul>

</p><p>Writing this entry felt like arguing on IRC; please don&#8217;t make me do it again.  In that spirit, the following are ideas that I want potential authors of &#8220;shootoffs&#8221; to keep in mind:

</p><p><ul>
<li>Apple provides some <a href="http://developer.apple.com/tools/performance/">truly excellent tools</a> for analyzing the performance of your application.  Since they&#8217;re free, there&#8217;s no excuse for not using them.  You should be able to point very clearly at which operations are slower, and give a convincing explanation of why.</li>

<li>Apple has made decisions that adversely impact OS X&#8217;s performance, but there are reasons for those decisions.  Sometimes the tradeoff is to improve performance elsewhere, sometimes it&#8217;s to enable a feature, sometimes it&#8217;s for reliability, sometimes it&#8217;s a tragic nod to compatibility.  And yes, sometimes it&#8217;s bugs, and sometimes Apple just hasn&#8217;t gotten around to optimizing that area yet.  Any exhibition of benchmark results should give a discussion of the tradeoffs made to achieve (or cause) that performance.</li>

<li>If you do provide benchmark results, try to do so <i>without</i> using the phrase &quot;reality distortion field.&quot;

</li></ul></p>]]></content:encoded>
			<wfw:commentRSS>http://ridiculousfish.com/blog/archives/2006/05/16/36/feed/</wfw:commentRSS>
	</item>
		<item>
		<title>Hex Fiend</title>
		<link>http://ridiculousfish.com/blog/archives/2006/03/28/hex-fiend/</link>
		<comments>http://ridiculousfish.com/blog/archives/2006/03/28/hex-fiend/#comments</comments>
		<pubDate>Tue, 28 Mar 2006 13:29:07 +0000</pubDate>
		<dc:creator>ridiculous_fish</dc:creator>
		
	<category>Mac OS X</category>
		<guid>http://ridiculousfish.com/blog/archives/2006/03/28/hex-fiend/</guid>
		<description><![CDATA[One of my side projects has borne some fruit.  Introducing Hex Fiend, a new hex editor for Mac OS X.]]></description>
			<content:encoded><![CDATA[<p>One of my side projects has borne some fruit.  Meet Hex Fiend, a new hex editor for Mac OS X.  (Hex editors allow you to edit the binary data of a file in hexadecimal or ASCII formats.)

</p><p><div style="text-align: center;">
<a href="/hexfiend/" style="border-style: none"><img style="margin-bottom: 20px; border-style: none; text-decoration: none; underline: none" src="/images/hex_icon.png" /></a><br />
<a href="/hexfiend/" style="border-style: none"><span style="font: normal 16px Georgia, Times New Roman, Times, serif; color: black; text-decoration: underline">Click here to read more, see screenshots, or download Hex Fiend</span></a>
</div>

</p><p>Hex Fiend allows inserting and deleting as well as overwriting data.  It supports 100+ GB files with ease.  It provides a full undo stack, copy and paste, and other features you&#8217;ve come to expect from a Mac app.  And it&#8217;s very fast, with a surprisingly small memory footprint that doesn&#8217;t depend on the size of the files you&#8217;re working with.

</p><p>Hex Fiend was developed as an experiment in huge files.  Specifically,
<ul>
<li>How well can the Cocoa NSDocument system be made to work with very large files?
</li><li>How well can the Cocoa text system be extended to work with very large files?
</li><li>How well does Cocoa get along with 64 bit data in general?
</li><li>What are some techniques for representing more data than can fit in memory?
</li></ul>

</p><p><a href="/hexfiend/">Check it out</a> - it&#8217;s free, and it&#8217;s a Universal Binary.  If you&#8217;ve got questions or comments about it or how it works, please leave a comment!

</p><p>(Incidentally, the Hex Fiend main page was made with <a href="http://www.apple.com/ilife/iweb/">iWeb</a>!)

</p><p>Edit: I&#8217;ve discovered/been informed that drag and drop is busted.  I will put out an update later tonight to fix this.

</p><p><b>Edit 2: Hex Fiend 1.0.1 has been released to fix the drag and drop problems.  Please redownload it by clicking on the icon above.</b>

</p>]]></content:encoded>
			<wfw:commentRSS>http://ridiculousfish.com/blog/archives/2006/03/28/hex-fiend/feed/</wfw:commentRSS>
	</item>
	</channel>
</rss>
