Fish tank with Puck the Fish Fish tank with Puck the Fish Water on the floor spelling Fish
 

The Treacherous Optimization

May 30th, 2006

Old age and treachery will beat youth and skill every time.

"I’m going to beat grep by thirty percent!" I confidently crow to anyone who would listen, those foolish enough to enter my office. And my girlfriend too, who’s contractually obligated to pay attention to everything I say.

See, I was working on Hex Fiend, and searching was dog slow. But Hex Fiend is supposed to be fast, and I want blazingly quick search that leaves the bewildered competition coughing in trails of dust. And, as everyone knows, the best way to get amazing results is to set arbitrary goals without any basis for believing they can be reached. So I set out to search faster than grep by thirty percent.

The first step in any potentially impossible project is, of course, to announce that you are on the verge of succeeding.

I imagine the author of grep, Ultimate Unix Geek, squinting at vi; the glow of a dozen xterms is the only light to fall on his ample frame covered by overalls, cheese doodles, and a tangle of beard. Discarded crushed Mountain Dew cans litter the floor. I look straight into the back of his head, covered by a snarl of greasy locks, and reply with a snarl of my own: You’re mine. The aphorism at the top, like the ex girlfriend who first told it to me, is dim in my recollection.

String searching

Having exhausted all my trash-talking avenues, it’s time to get to work. Now, everyone knows that without some sort of preflighting, the fastest string search you can do still takes linear time. Since my program is supposed to work on dozens of gigabytes, preflighting is impossible – there’s no place to put all the data that preflighting generates, and nobody wants to sit around while I generate it. So I am resigned to the linear algorithms. The best known is Boyer-Moore (I won’t insult your intelligence with a Wikipedia link, but the article there gives a good overview).

Boyer-Moore works like this: you have some string you’re looking for, which we’ll call the needle, and some string you want to find it in, which we’ll call the haystack. Instead of starting the search at the beginning of needle, you start at the end. If your needle character doesn’t match the character you’re looking at in haystack, you can move needle forwards in haystack until haystack’s mismatched character lines up with the same character in needle. If haystack’s mismatch isn’t in needle at all, then you can skip ahead a whole needle’s length.

For example, if you’re searching for a string of 100 ‘a’s (needle), you look at the 100th character in haystack. If it’s an ‘x’, well, ‘x’ doesn’t appear anywhere in needle, so you can skip ahead all of needle and look at the 200th character in haystack. A single mismatch allowed us to skip 100 characters!

I get shot down

For performance, the number of characters you can skip on a mismatch is usually stored in an array indexed by the character value. So the first part of my Boyer-Moore string searching algorithm looked like this:

char haystack_char = haystack[haystack_index];
if (last_char_in_needle != haystack_char)
   haystack_index += jump_table[haystack_char];

So we look at the character in haystack and if it’s not what we’re looking for, we jump ahead by the right distance for that character, which is in jump_table.

"There," I sigh, finishing and sitting back. It may not be faster than grep, but it should be at least as fast, because this is the fastest algorithm known. This should be a good start. So I confidently ran my benchmark, for a 1 gigabyte file…

grep: 2.52 seconds
Hex Fiend: 3.86 seconds

Ouch. I’m slower, more than 50% slower. grep is leaving me sucking dust. Ultimate Unix Geek chuckles into his xterms.

Rollin', rollin', rollin'

My eyes darken, my vision tunnels. I break out the big guns. My efforts to vectorize are fruitless (I’m not clever enough to vectorize Boyer-Moore because it has very linear data dependencies.) Shark shows a lot of branching, suggesting I can do better by unrolling the loop. Indeed:

grep: 2.52 seconds
Hex Fiend (unrolled): 2.68 seconds

But I was still more than 6% slower, and that’s as fast as I got. Exhausted, stymied at every turn, I throw up my hands. grep has won.

grep's dark secret

"How do you do it, Ultimate Unix Geek? How is grep so fast?" I moan at last, crawling forwards into the pale light of his CRT.

"Hmmm," he mumbles. "I suppose you have earned a villian’s exposition. Behold!" A blaze of keyboard strokes later and grep’s source code is smeared in green-on-black across the screen.

while (tp < = ep)
	  {
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    if (d == 0)
	      goto found;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    if (d == 0)
	      goto found;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	    if (d == 0)
	      goto found;
	    d = d1[U(tp[-1])], tp += d;
	    d = d1[U(tp[-1])], tp += d;
	  }

"You bastard!" I shriek, amazed at what I see. "You sold them out!"

See all those d = d1[U(tp[-1])], tp += d; lines? Well, d1 is the jump table, and it so happens that grep puts 0 in the jump table for the last character in needle. So when grep looks up the jump distance for the character, via haystack_index += jump_table[haystack_char], well, if haystack_char is the last character in needle (meaning we have a potential match), then jump_table[haystack_char] is 0, so that line doesn't actually increase haystack_index.

All that is fine and noble. But do not be fooled! If the characters match, the search location doesn't change - so grep assumes there is no match, up to three times in a row, before checking to see if it actually found a match.

Put another way, grep sells out its worst case (lots of partial matches) to make the best case (few partial matches) go faster. How treacherous! As this realization dawns on me, the room seemed to grow dim and slip sideways. I look up at the Ultimate Unix Geek, spinning slowly in his padded chair, and I hear his cackle "old age and treachery...", and in his flickering CRT there is a face reflected, but it's my ex girlfriend, and the last thing I see before I black out is a patch of yellow cheese powder inside her long tangled beard.

I take a page from grep

"Damn you," I mumble at last, rising from my prostrate position. Chagrined and humbled, I copy the technique.

grep: 2.52 seconds
Hex Fiend (treacherous): 2.46 seconds

What's the win?

Copying that trick brought me from six percent slower to two percent faster, but at what cost? What penalty has grep paid for this treachery? Let us check - we shall make a one gigabyte file with one thousand x's per line, and time grep searching for "yy" (a two character best case) and "yx" (a two character worst case). Then we'll send grep to Over-Optimizers Anonymous and compare how a reformed grep (one that checks for a match after every character) performs.

Best case Worst case
Treacherous grep 2.57 seconds 4.89 seconds
Reformed grep 2.79 seconds 2.88 seconds

Innnnteresting. The treacherous optimization does indeed squeeze out almost 8% faster searching in the best case, at a cost of nearly 70% slower searching in the worst case. Worth it? You decide! Let me know what you think.

Resolved and refreshed, I plan my next entry. This isn't over, Ultimate Unix Geek.

Disclaimers

(Note: I have never met the authors or maintainers of grep. I'm sure they're all well balanced clean shaven beer and coffee drinkers.)

(Oh, and the released version of HexFiend will be slightly slower in this case, because of an overly large buffer that blows the cache. In other situations, the story is different, but more about those in a future post.)

 

The Internet!

π = 3.28266551101

Reinder

Some possible improvements spring to my mind:
- you should be able to offset tp by one, replacing “tp[-1]” by *tp (check first whether your compiler managed to do that, and whether this actually makes a difference on your CPU)
- vectorize the search by running searches starting at offsets 0, N, 2N, 3N for haystacks of size N + length( searchString) – 1 in parallel (N would have to be fairly large relative to the search string; measure first whether there is memory bandwidth available)

I also wonder whether the grep programmer used actual data to arrive at that 2,3,3,2 pattern.

Scott

I think you mean “prostrate” (with an “r”)

Vid Boi

So the question is, how common are these partial matches in the real world (or for the average search in hex fiend)? If rare then perhaps the performance gain by the “treacherous grep” technique is worth it. If partial matches are common, then I would definitely favor the “reformed grep.”

How embarassing – thank you, Scott.

“clean shaven beer” ?

I like my beer fuzz-free

how does fgrep compare?
It uses a different algorithm doesn’t it?

Steve Israelson

Can’t you use something other than character comparisons? Seems to me comparing 8 bytes at a time would be faster if they were aligned. You could make 8 byte shifted versions so you can always compare byte aligned data. Then switch to your method for the remainder bytes at the non aligned start and end of the pattern.

[...] Ridiculous Fish takes on grep Posted by Jon Baer Wed, 31 May 2006 03:20:18 GMT Ridiculous Fish takes on grep: “I love Ridiculous Fish’s blog. It makes [...]

I did some snooping around with Google, and I think I’ve located a picture of your nemesis:

http://ducky.net/~kim/web_pictures/jacob_1year/UncleMike_paino_Jacob.jpg

(He’s the taller one).

nice site design. i like these little squares. very nice.
i rarely comment.

Grep cheated! Now you are in the land of statistical arguments for your test cases. It seems time for you to set up some automatic test harnessing for your optimizations (as in testing the one hundred most common cases of using grep). Do the grep sources point to a cache of standard test cases? I would think that grepping source code, email folders, and text documents would form the basis of the most common cases. Also, it would be interesting if grep turned out to be optimized for English, since you can make some guesses about the statistical frequencies of letters. Perhaps it would be possible to run grep with optimization options for different languages. As in grep –hintlang=fr bonjour *.txt

Good work. How about submitting your improvements to the GNU grep project?

Chris

Just wanting to find out where PI is by now. Ignore me.

Bob sponge

Another thing about grep is that is has some smart IO. Normally I suspect that you fread() the data and then search. This has two copies, one from file descriptor to stdio buffer and one from stdio buffer to your buffer. Grep does file-reading+searching at once.

OTOH, if you mmap the files you shouldn’ have this problem.

Now where is Patric?

Pointy haired

Grep is just like a rrd graph that is updated every n seconds, but with searches instead of graphs :)

If you’re looking to determine whether a sacrifice is worth it, I’d suggest considering that grep and HexFiend have different problem domains.

grep is predominantly for text (significantly non-random and weighted heavily towards certain types of data pattern)

HexFiend, since it’s targetting files in excess of 1.21 gigawatts, will encounter compressed, pseudo-random and other high entropy data.

Perhaps grep’s sacrifice is worth it but HexFiend still shouldn’t follow suit.

You might be interested in “Fast String Searching”, published in Software — Practice and Experience, Volume 21(11), 1221-1248 (November 1991). One of the authors (Hume) was a part of the old grep vs. egrep speed wars back in the day.

The abstract: Since the Boyer-Moore algorithm was described in 1977, it has been the standard benchmark for the practical string search literature. Yet this yardstick compares badly with current practice. We describe two algorithms that perform 47% fewer comparisons and are about 4.5 times faster across a wide range of architectures and compilers. These new variants are members of a family of algorithms based on the skip loop structure of the preferred, but often neglected, fast form of Boyer-Moore. We present a taxonomy for this family, and describe a toolkit of components that can be used to design an algorithm most appropriate for a given set of requirements.

Nota bene: I’m fairly certain cheese doodles weren’t present.

David Cooke

There’s a description of a bunch of the string-searching algorithms out there at
http://www-igm.univ-mlv.fr/~lecroq/string/node1.html

[...] June 1, 2006 @ 03:11:40 · Filed under del.icio.us ridiculous_fish » Blog Archive » The Treacherous Optimization Interes [...]

I started a long comment here, then decided it was actually more of a blog entry. http://fob.po8.org/node/173

BTW, one of the cool things about open source is that you can talk to these folks. Drop Mike a line at if you haven’t already; I’m pretty sure he’ll be highly amused by your article. Nice writing, great content! Thanks much for sharing.

I Saw MIke Once

Mike has always been well-groomed and clean cut, never scraggly or Cheeto-covered. The one thing missing today is hair. Mike used to have hair. Those were truly the “Good Ol’ Days”.

[...] night unwind rewind linkdump.20060601 Ridiculous Fish on “Treacherous grep“… Funny stuff. “How do you do it, [...]

[...] performance no matter what the result is. If you’re struggling for inspiration, try here—doesn’t help directly but might prompt some lateral th [...]

[...] ple for how you should approach the next technical interview question you’re asked. Link. This link was tagged Geeky. [...]

There are two ‘easy’ ways to optimize string search, which is especially useful for long keys, if you know the length of string within you are willing to search for a key (i.e you are not just searching until you hit ).

Those two ideas can be combined.
One is to not check by char, but by word (i.e integer).
For instance, instead of:

register char *s, *o;
while (*p)
{
for (o = p, s = key; *p && *(s++) == *(p++); )
continue;

if (!(*s))
return o;
}

return NULL;

say sLen = strlen(p) you can do something like:

register unsigned int nWords = sLen / sizeof(uint), *word, i, *kW;
register char *s, *o;

for (*p)
{
for (word = (unsigned int *)p, kW = (unsigned int *)key, i = 0; i = sizeof(unsigned int)

I haven’t actually tried out this code, this is on top of my head, but given’s processor’s ease of manipulating integers, it should be faster.

Another way to speed it up even more is to utilize the Duff’s device algorithm [http://en.wikipedia.org/wiki/Duff's_device ], which will reduce the checks, big time, especially for long strings and/or keys.

The idea behind Duff’s Device (a clever one at that) is that you can use big steps and small steps. The gain here is that you don’t have to check for the iterator value for every iteration, instead you check once every step (say, every 8 iterations).

Anyway, nice post:)

Bad news. Looks like the spam is starting to sluice through the holes in the very large number mesh.

Although it’s got a haiku kind of feel to it. Very zen. I feel like I want to buy some Mark “Forex” Twain online (but only the distinctly native molten silver kind).

Bob Frank

One question. I was wondering about the worst case. In a file with lots of partial matches, when the needle finds a hit, would it be better to test the first letter of the needle before scanning the entire needle?

If the haystack were random, then it shouldn’t matter much what the second test is needle[0] or needle[lenght-2], but for lots of partial matches I wonder if the first char would provide faster discrimination between hits and misses.

Keep up the good bloging!

Also see HAKMEM (aka MIT AIM 239) item 179… :)

jwb

This might already be obvious, and it depends on the contents of needle, but grep (GNU grep, anyway) runs many times more quickly with LANG=C. This could matter if your normally use a UTF8 environment.

Why not code in both methods and let the user decided in the preferences menu which to use? You might also want to have a quick analyzing of each file and guess which method would be quicker and use that.

Heck, you can even record statistics on the type and size of files analyzed and methods used and send them to a DB and let your user base SHOW you which method is more suitable. :)

Awesome entry. I learned quite a lot.

The problem with improving your algorithm based on grep’s implementation is that if you do end up with substantial improvements, you’re going to feel obligated to submit a patch to grep, right? And once you do that, you won’t be faster than grep…

BTW, Eugene, please, dear lord, no to the user preference part. I know hex editors are for geeks, but there’s a limit. Any geek who’s using the hex editor is doing so because they don’t want to write their own. Which probably means they don’t want to worry about the specifics of search algorithm optimization.

Yeah, don’t make a user preference for the search algorithm. Nobody cares that much. :-)

But you know what might be nifty? You could try to automatically pick the right version at runtime. Keep a counter that tracks how many partial matches you’re getting. Every so often (32KiB of data? more? less?) switch algorithms based on the value of the counter. If you’re getting a lot, switch over to the non-treacherous version. If it’s very few, switch to the treacherous version.

That wouldn’t add a lot of overhead — one register increment, not even in the inner loop, which on modern CPUs might literally be no difference in speed — and it seems like it would let you do the right thing for a wide variety of search strings. It would even work well for a complicated file whose entropy changes dramatically throughout, like an ELF or other executable that includes code and data sections.

NB: actual implementation and benchmarking of this suggestion is left as an exercise for the blogger.

[...] p/” rel=”bookmark”>
Making war on grep

Totally geek post in ridiculous fish.

Posted by bfp in Blog Thi [...]

This was interesting reading. Thanks for writing it up!

yoda

Fish wrote: “I imagine the author of grep, Ultimate Unix Geek, squinting at vi; the glow of a dozen xterms is the only light to fall on his ample frame…”

You imagine wrong: grep was developed long before vi (vi came from Berkeley Unix), and very long before X Windows. The developer of grep probably used the “ed” editor and a 80×24 CRT display.

I thought about giving it a shot.

assuming that:
char *buffer haystack
size_t length the length of haystack

int needleLen = Strlen(needle), k;
register char *p;

switch (needleLen)
{
case 0:
return NULL;

case 1:
for (p = buffer; *p; p++)
{
if (*p == *needle)
return p;
}
return NULL;
break;

}

register const int needleSafetyBouncer = needleLen – 1;
register int i;
register byte delta[256];
register const char *end = buffer + length;
for (i = 0; i =0 ; k++, i–)
{
if (delta[(byte)needle[i]] == needleLen)
delta[(byte)needle[i]] = k;
}

for (p = buffer; p X where X around 48 or so, and use Duff’s device to reduce the checks for i going down to zero, in order to speed it up even more.

Hm, turns out wordpress is cutting off the entries / code:) oh, well

singulair

singulair side effects weight gain advair singulair asthma medicine singulair

singulair and weight gain weight gain singulair pediatric gain singulair weight

Is http://crm114.sf.net kind of a “reformed grep” ? (… an easier way though may be to use NSStringRegEx, http://homepage.mac.com/jrc/contrib/ )

I’m here with Mike Haertel, GNU grep author and the author of the performance hack that bedeviled you so. He Googled your article somehow a while ago and was highly amused.

Mike wants you (and the very first commenter) to know that the sequence of skips he chose was carefully tuned by timing a grep for “vt100″ in /etc/termcap. At the time, this was a very important benchmark.

Mike has lots of grep speed tricks left up his sleeve, and has promised to share some with me at some point. I’ll keep you posted.

Anonymous

test

Sonic

Hi I want to recommend you very useful rapidshare search http://loadingvault.com. You can find there a lot of new movies, games and music. Enjoy it!

Micro-optimizations are all well and good, but you can beat grep by 100% on dual-core systems for haystacks that are large in comparison to the needles being searched for in them:

If the haystack length is n and the needle length is m, let p = (n – m)/2 and q = p + m – 2. Now spawn two threads, each doing Boyer-Moore, one from 0 to q and the other from p to n – 1 in your big char array. If one thread finds a match, halt both.

This parallelizes the job and divides it exactly in two. Note that if the needle is in the haystack it will be found; if an occurrence contained a character one thread wouldn’t search and another the other wouldn’t search, it would have to contain the character before p and the character after q, plus the m – 1 characters from p (inclusive) to q (inclusive), and so have length m + 1, yet it has length m.

Of course, the serial version should be run if the haystack is too small.

Generalizing this to N-threaded versions is not too difficult, and is left as an exercise for the reader.