The Treacherous Optimization

String searching

Having exhausted all my trash-talking avenues, it's time to get to work. Now, everyone knows that without some sort of preflighting, the fastest string search you can do still takes linear time. Since my program is supposed to work on dozens of gigabytes, preflighting is impossible - there's no place to put all the data that preflighting generates, and nobody wants to sit around while I generate it. So I am resigned to the linear algorithms. The best known is Boyer-Moore (I won't insult your intelligence with a Wikipedia link, but the article there gives a good overview).

Boyer-Moore works like this: you have some string you're looking for, which we'll call the needle, and some string you want to find it in, which we'll call the haystack. Instead of starting the search at the beginning of needle, you start at the end. If your needle character doesn't match the character you're looking at in haystack, you can move needle forwards in haystack until haystack's mismatched character lines up with the same character in needle. If haystack's mismatch isn't in needle at all, then you can skip ahead a whole needle's length.

For example, if you're searching for a string of 100 'a's (needle), you look at the 100th character in haystack. If it's an 'x', well, 'x' doesn't appear anywhere in needle, so you can skip ahead all of needle and look at the 200th character in haystack. A single mismatch allowed us to skip 100 characters!

I get shot down

For performance, the number of characters you can skip on a mismatch is usually stored in an array indexed by the character value. So the first part of my Boyer-Moore string searching algorithm looked like this:

char haystack_char = haystack[haystack_index];
if (last_char_in_needle != haystack_char)
   haystack_index += jump_table[haystack_char];

So we look at the character in haystack and if it's not what we're looking for, we jump ahead by the right distance for that character, which is in jump_table.

"There," I sigh, finishing and sitting back. It may not be faster than grep, but it should be at least as fast, because this is the fastest algorithm known. This should be a good start. So I confidently ran my benchmark, for a 1 gigabyte file...

grep: 2.52 seconds

Hex Fiend: 3.86 seconds

Ouch. I'm slower, more than 50% slower. grep is leaving me sucking dust. Ultimate Unix Geek chuckles into his xterms.

Rollin', rollin', rollin'

My eyes darken, my vision tunnels. I break out the big guns. My efforts to vectorize are fruitless (I'm not clever enough to vectorize Boyer-Moore because it has very linear data dependencies.) Shark shows a lot of branching, suggesting I can do better by unrolling the loop. Indeed:

grep: 2.52 seconds

Hex Fiend (unrolled): 2.68 seconds

But I was still more than 6% slower, and that's as fast as I got. Exhausted, stymied at every turn, I throw up my hands. grep has won.

grep's dark secret

"How do you do it, Ultimate Unix Geek? How is grep so fast?" I moan at last, crawling forwards into the pale light of his CRT.

"Hmmm," he mumbles. "I suppose you have earned a villian's exposition. Behold!" A blaze of keyboard strokes later and grep's source code is smeared in green-on-black across the screen.

while (tp < = ep)
	  {
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    if (d == 0)
	      goto found;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    if (d == 0)
	      goto found;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    if (d == 0)
	      goto found;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	  }

"You bastard!" I shriek, amazed at what I see. "You sold them out!"

See all those d = d1[U(tp[-1])], tp += d; lines? Well, d1 is the jump table, and it so happens that grep puts 0 in the jump table for the last character in needle. So when grep looks up the jump distance for the character, via haystack_index += jump_table[haystack_char], well, if haystack_char is the last character in needle (meaning we have a potential match), then jump_table[haystack_char] is 0, so that line doesn't actually increase haystack_index.

All that is fine and noble. But do not be fooled! If the characters match, the search location doesn't change - so grep assumes there is no match, up to three times in a row, before checking to see if it actually found a match.

Put another way, grep sells out its worst case (lots of partial matches) to make the best case (few partial matches) go faster. How treacherous! As this realization dawns on me, the room seemed to grow dim and slip sideways. I look up at the Ultimate Unix Geek, spinning slowly in his padded chair, and I hear his cackle "old age and treachery...", and in his flickering CRT there is a face reflected, but it's my ex girlfriend, and the last thing I see before I black out is a patch of yellow cheese powder inside her long tangled beard.

I take a page from grep

"Damn you," I mumble at last, rising from my prostrate position. Chagrined and humbled, I copy the technique.

grep: 2.52 seconds

Hex Fiend (treacherous): 2.46 seconds

What's the win?

Copying that trick brought me from six percent slower to two percent faster, but at what cost? What penalty has grep paid for this treachery? Let us check - we shall make a one gigabyte file with one thousand x's per line, and time grep searching for "yy" (a two character best case) and "yx" (a two character worst case). Then we'll send grep to Over-Optimizers Anonymous and compare how a reformed grep (one that checks for a match after every character) performs.

Best case Worst case

Treacherous grep 2.57 seconds 4.89 seconds

Reformed grep 2.79 seconds 2.88 seconds

Innnnteresting. The treacherous optimization does indeed squeeze out almost 8% faster searching in the best case, at a cost of nearly 70% slower searching in the worst case. Worth it? You decide! Let me know what you think.

Resolved and refreshed, I plan my next entry. This isn't over, Ultimate Unix Geek.

Disclaimers

(Note: I have never met the authors or maintainers of grep. I'm sure they're all well balanced clean shaven beer and coffee drinkers.)

(Oh, and the released version of HexFiend will be slightly slower in this case, because of an overly large buffer that blows the cache. In other situations, the story is different, but more about those in a future post.)

	Best case	Worst case
Treacherous grep	2.57 seconds	4.89 seconds
Reformed grep	2.79 seconds	2.88 seconds