Fish tank with Puck the Fish Fish tank with Puck the Fish Water on the floor spelling Fish
 

Will it optimize?

July 23rd, 2010


See how well you know (or can anticipate) gcc’s optimizer. For each question, the left box contains some code, while the right box contains code that purports to do the same thing, but that illustrates a particular optimization. Will gcc apply that optimization? Put another way, will the code on the left be as fast as the code on the right, when compiled with an optimizing gcc?

I used a pretty ancient gcc 4.2.1 for these tests. If newer versions have different behavior, please leave a comment.

Beware: not all proposed optimizations are actually valid!

1. Recursion elimination

Can GCC replace recursive functions with a loop?

int factorial(int x) {
   if (x > 1) return x * factorial(x-1);
   else return 1;
}
int factorial(int x) {
   int result = 1;
   while (x > 1) result *= x--;
   return result;
}


2. Loop-invariant strlen()

Will GCC hoist out strlen()?

unsigned sum(const unsigned char *s) {
   unsigned result = 0;
   for (size_t i=0; i < strlen(s); i++) {
      result += s[i];
   }
   return result;
}
unsigned sum(const unsigned char *s) {
   unsigned result = 0;
   size_t length = strlen(s);
   for (size_t i=0; i < length; i++) {
      result += s[i];
   }
   return result;
}


3. Multiplication by 2 to addition – integer

Will GCC transform an integer multiplication by 2 to addition?

int double_it(int x) {
   return x * 2;
}
int double_it(int x) {
   return x + x;
}


4. Multiplication by 2 to addition – floating point

Will GCC transform a floating point multiplication by 2 to addition?

float double_it(float x) {
   return x * 2.0f;
}
float double_it(float x) {
   return x + x;
}


5. Division by 2 to right shift

Will GCC transform an integer division by 2 to a right shift?

int halve_it(int x) {
   return x / 2;
}
int halve_it(int x) {
   return x >> 1;
}


6. If-else chains to switch statements

Will GCC apply the same optimizations to if-else chains as it does to switch statements?

void function(int x) {
   if (x == 0) f0();
   else if (x == 1) f1();
   else if (x == 2) f2();
   else if (x == 3) f3();
   else if (x == 4) f4();
   else if (x == 5) f5();
}
void function(int x) {
   switch (x) {
      case 0: f0(); break;
      case 1: f1(); break;
      case 2: f2(); break;
      case 3: f3(); break;
      case 4: f4(); break;
      case 5: f5(); break;
   }
}

Summing up

It is tempting to think of compiler optimizations as reducing the constant in your program’s big-O complexity, and nothing else. They aren’t supposed to be able to make your program asymptotically faster, or affect its output.

However, as we saw, they really can reduce the asymptotic complexity in space (question 1) and time (question 2). They can also affect calculated results (discussion of question 4) and maybe even whether your program goes into an infinite loop (see here).

On the flip side, several “obvious” optimizations are subtly incorrect and so will not be performed by the compiler, especially when they involve floating point. If your floating point code is demonstrably a bottleneck and you don’t need exact precision or care about special FP values, you may be able to realize a speedup by doing some optimizations manually. However, untying the compiler’s hands through options like -ffast-math is probably a better idea, and then only for the affected files, since these flags have a global impact.

And lastly, this isn’t meant to be a prescriptive post, but we all know why micro-optimizing is usually a mistake: it wastes your time, it’s easy to screw up (see question 5), and it typically produces no measurable speedup.

Code smart, and be safe out there!

 

Labor of Division (Episode 1)

February 15th, 2010

Here’s how you divide an unsigned int by 13 in C:

unsigned divide(unsigned x) { return x / 13; }

It gets better, I promise. Here’s the corresponding x64 assembly that gcc produces:

_divide:
        movl    $1321528399, %edx
        movl    %edi, %eax
        mull    %edx
        shrl    $2, %edx
        movl    %edx, %eax
        ret

Instead of division, there’s multiplication by a bizarre number: 1321528399. What wickedness is this?

1321528399 is an example of a “magic number:” a number that lets you substitute speedy multiplication for pokey division, as if by magic. In this post, I become a big fat spoilsport and ruin the trick for everyone.

Dividing by 13 is different than multiplying by 1321528399, so there must be more here than meets the eye. Compiling it as PowerPC reveals a bit more:

_divide:
        lis r0,0x4ec4
        ori r0,r0,60495
        mulhwu r3,r3,r0
        srwi r3,r3,2
        blr

mulhwu means “multiply high word unsigned.” It multiplies by 1321528399, but takes the high 32 bits of the 64 bit result, not the low 32 bits like we are used to. x86-64 does the same thing (notice the shift right instruction operates on %edx, which contains the high bits, not %eax which is the low bits). After multiplying, it shifts the result right by two. In C:

unsigned divide(unsigned x) {
   return (unsigned)(x*1321528399ULL >> 34);
}

So dividing by 13 is the same as multiplying by 1321528399 and then dividing by 234. That means that 234 divided by 13 would be 1321528399, right? In fact, it’s 1321528398.769 and change. Pretty close, but we’re not optimizing horseshoes, so how can we be sure this works all the time?

 

The answer to the question right above this line

It’s provable, and hopefully we can get some intuition about the whole thing. In the following proof, division is exact, not that integer-truncating stuff that C does; to round down I’ll use floor. Also, this only covers unsigned division: every number that appears is non-negative. (Signed division is broadly similar, though does differ in some details.) Lastly, we’re going to assume that the denominator d is not a power of 2, which is weirdly important. If d were a power of 2, we’d skip the multiplication and just use shifts, so we haven’t lost any generality.

We have some fixed positive denominator d and some variable nonegative numerator n, and we want to compute the quotient \frac n d – no, wait, \lfloor \frac n d \rfloor, since that’s what C does. We’ll multiply the top and bottom by 2k, where k is some positive integer power – it represents the precision in a sense, and we’ll pick its value later:

\lfloor \frac n d \rfloor = \lfloor  \frac n d \times \frac {2^k} {2^k} \rfloor = \lfloor \frac {2^k} d \times \frac n {2^k} \rfloor

We’re going to call that \frac {2^k} d term mexact, because it’s our “magic number” that lets us compute division through multiplication. So we have:

m_{exact} = \frac {2^k} d \\\\ \lfloor \frac n d \rfloor = \lfloor m_{exact} \times \frac n {2^k} \rfloor

This equality looks promising, because we’ve hammered our expression into the shape we want; but we haven’t really done anything yet. The problem is that mexact is a fraction, which is hard for computers to multiply. (We know it’s a fraction because d is not a power of 2). The tricky part is finding an integer approximation of mexact that gives us the same result for any dividend up to the largest possible one (UINT_MAX in our case). Call this approximation m. So we have m \approx m_{exact} , where m is an integer.

When it comes to approximation, closer is better, so let’s start by just rounding m down: m = \lfloor m_{exact} \rfloor. Since mexact cannot be an integer, m must be strictly less than mexact. Does this m work, which is to say, does it behave the same as mexact over the range we care about?

We’ll try it when the numerator equals the denominator: n = d. Of course then the quotient is supposed to be 1. This leads to:

\frac d d = m_{exact} \times \frac d {2^k} \\ \implies \frac d d > m \times \frac d {2^k} \\ \implies 1 > \lfloor m \times \frac d {2^k} \rfloor

That’s bad, because that last expression is supposed to be 1. The approximation m = \lfloor m_{exact} \rfloor is too small, no matter what k is.

Ok, let’s round up instead: m = \lceil m_{exact} \rceil = \lceil \frac {2^k} d \rceil. Does this value for m work?

Now, \lceil \frac {2^k} d \rceil just means dividing 2k by d and then rounding up. That’s the same as rounding up 2k to the next multiple of d first, and then dividing exactly by d. (We know that 2k is not itself a multiple of d because d is not a power of 2!) So we can write:

m = \lceil \frac {2^k} d \rceil = \frac {2^k + e} d.

where e is the “thing to add to get to the next multiple of d,” which means that 0 < e < d. (Formally, e = d - 2k mod d.) This value e is sort of a measure of how much “damage” the ceil does – that is, how bad the approximation m is.

So let’s plug this new m into our division expression, and then apply some algebra:

\lfloor m \times \frac n {2^k} \rfloor = \lfloor \frac {2^k + e} d \times \frac n {2^k} \rfloor = \lfloor \frac n d + \frac e d \times \frac n {2^k} \rfloor

This last expression is really cool, because it shows the wholesome result we’re after, \lfloor \frac n d \rfloor, and also an insidious “error term:” that \frac e d \times \frac n {2^k}, which represents how much our ceil is causing us to overestimate. It also shows that we can make the error term smaller by cranking up k: that’s the sense in which k is a precision.

 

K’s Choice

We could pick some huge precision k, get a very precise result, and be done. But we defined m to be \lceil \frac {2^k}  d \rceil, so if k is too big, then m will be too big to fit in a machine register and we can’t efficiently multiply by it. So instead let’s pick the smallest k that’s precise enough, and hope that fits into a register. Since there are several variables involved, our strategy will be to find upper bounds for them, replace the variables with their upper bounds, and simplify the expression until we can solve for k.

So we want the smallest k so that \lfloor \frac n d \rfloor = \lfloor \frac n d + \frac e d \times \frac n {2^k} \rfloor . This will be true if the error term \frac e d \times \frac n {2^k} is less than 1, because then the floor operation will wipe it out, right? Oops, not quite, because there’s also a fractional contribution from \frac n d. We need to be sure that the error term, plus the fractional contribution of \frac n d, is less than 1 for that equality to be true.

We’ll start by putting an upper bound on the fractional contribution. How big can it be? If you divide any integer by d, the fractional part of the result is no more than \frac {d - 1} d. Therefore the fractional contribution of \frac n d is at most \frac {d - 1} d, and so if our error term is less than \frac 1 d, it will be erased by the floor and our equality will be true.

So k will be big enough when   \frac e d \times \frac n {2^k} < \frac 1 d.

Next we’ll tackle \frac e d. We know from up above that e is at most d-1, so we have \frac e d < 1. So we can ignore that factor: \frac n {2^k} < \frac 1 d \implies \frac e d \times \frac n {2^k}  < \frac 1 d.

The term \frac n {2^k} has a dependence on n, the numerator. How big can the numerator be? Let’s say we’re dividing 32 bit unsigned integers: n ≤ 232. (The result is easily extended to other widths). We can make \frac n {2^k} less than 1 by picking k = 32. To make it even smaller, less than \frac 1 d, we must increase k by \log_2 d. That is,

k > 32 + \log_2 d \implies \frac n {2^k} < \frac 1 d

Since we want k to be an integer, we have to round this up: k = 32 + \lceil \log_2 d \rceil. We know that value for k works.

What does that precision do to m, our magic number? Maybe this k makes m too big to fit into a 32 bit register! We defined m = \lceil \frac {2^k} d \rceil; plugging in k we have m = \lceil \frac {2^{32 + \lceil log_2 d \rceil} } d \rceil .

This is a gross expression, but we can put an upper bound on it. Note that d < {2 ^ { \lfloor log_2 d \rfloor } }, so we can write the following:

m <= \lceil \frac {2^{32 + \lceil log_2 d \rceil} } {2 ^ { \lfloor log_2 d \rfloor } } \rceil = \lceil {2^{32 + \lceil log_2 d \rceil - \lfloor log_2 d \rfloor} } \rceil \\ \\ = \lceil {2 ^ { 32 + 1}} \rceil = 2^{33}

Nuts! Our approximation m just barely does not fit into a 32 bit word!

Since we rounded some stuff, you might think that we were too lazy and a more careful analysis would produce a tighter upper bound. But in fact our upper bound is exact: the magic number for an N-bit division really may need to be as large as N+1 bits. The good news is, there may be smaller numbers that fit in N bits for specific divisors. More on that later!

 

We are fruitful and multiply

In any case, we now have our magic number m. To perform the division, we need to compute  \lfloor \frac {m n} {2^k} \rfloor as efficiently as possible. m is a 33 bit number, so how can a 32 bit processor efficiently multiply by that? One way would be to use a bignum library, but’s probably slow. A better approach is to multiply by the low 32 bits, and handle the 33rd bit specially, like this:

m n \\ \\ = (m - 2^{32} + 2^{32})n \\ \\ = (m - 2^{32})n + 2^{32} n

This looks very promising: m – 232 is definitely a 32 bit number. Furthermore, we can compute that 232n term very easily: we don’t even need to shift, just add n to the high word of the product. Unfortunately, it’s not right: a 32 bit number times a 33 bit number is 65 bits, so the addition overflows.

But we can prevent the overflow by distributing one of the 2s in the denominator, through the following trick:

 \lfloor \frac {n + q} {2^k} \rfloor = \lfloor  \frac {n - q + 2q} {2^k} \rfloor = \lfloor  ( \frac {n - q + 2q} 2 ) / {2^{k-1}} \rfloor \\ = \lfloor ( \lfloor \frac {n - q} 2 \rfloor + q ) / {2^{k-1}} \rfloor

Here q is the magic number (minus that 33rd bit) times n, and n is just n, the numerator. (Everything should be multiplied by 232, but that’s sort of implicit in the fact that we are working in the register containing the high product, so we can ignore it.)

Can the subtraction n – q underflow? No, because q is just n times the 32 bit magic number, and then divided by 232. The magic number (now bereft of its 33rd bit) is less than 232, so we have q < n, so n - q cannot underflow.

What about that +q? Can that overflow? No, because we can just throw out the floor to get an upper bound of \frac {n + q} 2 which is a 32 bit number.

So we have a practical algorithm. Given a dividend n and a fixed divisor d, where 0 < d < 232 and 0 ≤ n < 232:

  1. Precompute p = \lceil log_2 d \rceil
  2. Precompute m = \lceil \frac {2^{32 + p}} d \rceil. This will be a 33 bit number, so keep only the low 32 bits.
  3. Compute q = (m \times n) >> 32. Most processors can do this with one “high multiply” instruction.
  4. Compute t = ((n - q) >> 2) + q, which is an overflow-safe way of computing (n + q) >> 1. This extra addition corrects for dropping the 33rd bit of m.
  5. Perform the remaining shift p: t = t >> (p-1)

That’s it! We know how to efficiently replace division with multiplication. We can see all these steps in action. Here is the assembly that Clang generates for division by 7, along with my commentary:

_divide_by_7:
 movl $613566757, %ecx  613566757 is the low 32 bits of (2**35)/7, rounded up
 movl %edi, %eax
 mull %ecx              Multiply the dividend by the magic number
 subl %edx, %edi        The dividend minus the high 32 bits of the product in %edx (%eax has the low)
 shrl %edi              Initial shift by 1 to prevent overflow
 addl %edx, %edi        Add in the high 32 bits again, correcting for the 33rd bit of the magic number
 movl %edi, %eax        Move the result into the return register
 shrl $2, %eax          Final shift right of floor(log_2(7))
 ret
 

Improving the algorithm for particular divisors

This algorithm is the best we can do for the general case, but we may be able to improve on this for certain divisors. The key is that e term: the value we add to get to the next multiple of d. We reasoned that e was less than d, so we used \frac e d < 1 as an upper bound for e. But it may happen that a multiple of d is only slightly larger than a power of 2. In that case, e will be small, and if we're lucky, it will be small enough to push the entire "error term" under \frac 1 d.

For example, let’s try it for the divisor 11. With the general algorithm, we would need k = 32 + \lceil log_2 11 \rceil = 36. But what if we choose k = 35 instead? 235 + 1 is a multiple of 11, so we have e = 1, and we compute:

\frac e d \times \frac n {2^k} = \frac 1 d \times \frac n {2^{35}} < \frac 1 d

So k = 35 forces the error term to be less than \frac 1 d, and therefore k = 35 is a good enough approximation. This is good news, because then m = \lceil \frac {2^{35}} {11} \rceil = 3123612579 < 2^{32}, so our magic number fits in 32 bits, and we don't need to worry about overflow! We can avoid the subtraction, addition, and extra shifting in the general algorithm. Indeed, clang outputs:

_divide_by_11:
  movl  $-1171354717, %ecx
  movl  %edi, %eax
  mull  %ecx
  movl  %edx, %eax
  shrl  $3, %eax
  ret

The code for dividing by 11 is shorter than for dividing by 7, because a multiple of 11 happens to be very close to a power of 2. This shows the way to an improved algorithm: We can simply try successive powers of 2, from 32 up to 32 + \lfloor log_2 d \rfloor, and see if any of them happen to be close enough to a multiple of d to force the error term below \frac 1 d. If so, the corresponding magic number fits in 32 bits and we can generate very efficient code. If not, we can always fall back to the general algorithm.

The power 32 will work if \frac e d \leq \frac 1 d, the power 33 will work if \frac e d \leq \frac 2 d, the power 34 will work if \frac e d \leq \frac 4 d, etc., up to 32 + \lfloor log_2 d \rfloor, which will work if \frac e d \leq \frac 1 2. By "chance", we'd expect this last power to work half the time, the previous power to work 25% of the time, etc. Since we only need one, most divisors actually should have an efficient, 32 bit or fewer magic number. The infinite series \frac 1 2 + \frac 1 4 + \frac 1 8 sums to 1, so our chances get better with increasing divisors.

It's interesting to note that if a divisor d has one of these more efficient magic numbers for a power k < 32 + \lceil log_2 d \rceil, it also has one for all higher powers. This is easy to see: if 2k + e is a multiple of d, then 2k+1 + 2e is also a multiple of d.

\frac e d \times \frac n {2^k} \leq \frac 1 d \implies \frac {2e} d \times \frac n {2^{k+1}} \leq \frac 1 d

This is good news. It means that we only have to check one case,  \lceil \frac {2^{32 + \lfloor log_2 d \rfloor}} d \rceil (a 32 bit value) to see if there is a more efficient magic number, because if any smaller power works, that one works too. If that that power fails, we can go to the 33 bit number, which we know must work. This is useful information in case we are computing magic numbers at runtime.

Still, gcc and LLVM don't settle for just any magic number - both try to find the smallest magic number. Why? There's a certain aesthetic appeal in not using bigger numbers than necessary, and most likely the resulting smaller multiplier and smaller shifts are a bit faster. In fact, in a very few cases, the resulting shift may be zero! For example, the code for diving by 641:

_divide_by_641:
        movl    $6700417, %ecx
        movl    %edi, %eax
        mull    %ecx
        movl    %edx, %eax
        ret

No shifts at all! For this to happen, we must have \frac e d \le \frac 1 d, which of course means that e = 1, so 641 must evenly divide 232 + 1. Indeed it does.

This inspires a way to find the other "super-efficient" shiftless divisors: compute the factors of 232 + 1. Sadly, the only other factor is 6700417. I have yet to discover an occasion to divide by this factor, but if I do, I'll be ready.

 

So that's how it works

Did you skip ahead? It's OK. Here's the summary.

Every divisor has a magic number, and most have more than one! A magic number for d is nothing more than a precomputed quotient: a power of 2 divided by d and then rounded up. At runtime, we do the same thing, except backwards: multiply by this magic number and then divide by the power of 2, rounding down. The tricky part is finding a power big enough that the "rounding up" part doesn't hurt anything. If we are lucky, a multiple of d will happen to be only slightly larger than a power of 2, so rounding up doesn't change much and our magic number will fit in 32 bits. If we are unlucky, well, we can always fall back to a 33 bit number, which is almost as efficient.

I humbly acknowledge the legendary Guy Steele Henry Warren (must have been a while) and his book Hacker's Delight, which introduced me to this line of proof. (Except his version has, you know, rigor.)

 

cdecl

November 12th, 2009

Quick, what is “char (*(*(* const x[3])())[5])(int)?”

If you immediately blurted out “x­is­an­array­of­three­const­pointers­to­functions­returning­pointers­to­arrays­of­five­pointers­to­functions­taking­int­returning­char“, and your last name doesn’t end with “itchie,” then chances are you used my new website:

www.cdecl.org

Yes, the venerable cdecl – the C gibberish to English translator – is now online. With AJAX! Every C declaration will be as an open book to you! Your coworkers’ scruffy beards and suspenders will be nigh useless!

Write C casts and declarations in plain English! Write plain English in the chicken scratchings and line noise we call C! The possibilities are twain!

(Click here to try it with that declaration!)

ANNNNND…it gets better! Did I mention that my version of cdecl supports blocks? That’s right, now you can totally nail that API that requires a block taking a pointer to a block taking a block taking a pointer to an int returning an int returning a pointer to a block taking void returning int!

This site:

  • Converts readable English variable declarations or typecasts to C
  • Converts C variable declarations or typecasts to English
  • Supports Apple’s blocks extension to C
  • Uses AJAX and has nifty effects
  • Allows you to generate permalinks, so you can send hilarious declarations to your friends, or add them gratuitously to your blog

A note on licensing. The cdecl readme states:

You may well be wondering what the status of cdecl is. So am I. It was twice posted to comp.sources.unix, but neither edition carried any mention of copyright. This version is derived from the second edition. I have no reason to believe there are any limitations on its use, and strongly believe it to be in the Public Domain.

I hereby place my blocks changes to cdecl in the public domain. The cdecl source code, including my changes, is available for download on the site.

 

I Didn’t Order That, So Why Is It On My Bill, Episode 2

September 17th, 2009

This is the second episode of I Didn’t Order That, So Why Is It On My Bill: C++ features you never use, but pay for anyways. See episode one here.

For those who like their punchlines first: std::string is designed to allow copy-on-write. But once operator[] is called, that string is ruined for COW forevermore. The reason is that it has to guard against a future write at any time, because you may have stashed away the internal pointer that operator[] returns – a feature that you surely know better than to use, but that costs you nevertheless.

The C++ standard string class is std::string. Here’s a string containing 10 megabytes of the letter ‘x’:

#include <string>
using std::string;
int main(void) {
    string the_base(1024 * 1024 * 10, 'x');
}

Big string. Now let’s make a hundred copies, one by one:

#include <string>
using std::string;
int main(void) {
    string the_base(1024 * 1024 * 10, 'x');
    for (int i = 0; i < 100; i++) {
        string the_copy = the_base;
    }
}

This runs in .02 seconds, which is very fast. Suspiciously fast! This old iMac can’t schlep a gigabyte of data in two hundredths of a second. It’s using copy-on-write.

So let’s write and measure the copy:

#include <string>
using std::string;
int main(void) {
    string the_base(1024 * 1024 * 10, 'x');
    for (int i = 0; i < 100; i++) {
        string the_copy = the_base;
        the_copy[0] = 'y';
    }
}

The program now runs in 2.5 seconds. These scant 100 writes resulted in a 100x slowdown. Not surprising: it’s exactly what we would expect from a COW string.

the_copy[0] returns a reference to a char. But maybe we aren’t going to write into it – perhaps we only want to read. How can these copy-on-write strings know when the write occurs, so they can copy? The answer is that they cannot – they must be pessimistic and assume you will write, even if you never do. We can trick it: don’t assign anything, and still see the performance hit:

#include <string>
using std::string;
int main(void) {
    string the_base(1024 * 1024 * 10, 'x');
    for (int i = 0; i < 100; i++) {
        string the_copy = the_base;
        the_copy[0];
    }
}

No writes! Just a read! But same performance as before. This isn’t copy on write – it’s copy on read!

Things get even worse. operator[] returns a reference: basically a sugary pointer. You can read from it or write to it. Or you can be a jerk and just hang onto it:

string original = "hello";
char & ref = original[0];
string clone = original;
ref = 'y';

In case that’s not clear, we made a string, stashed away a reference to the first character, copied the string, and then wrote to the string’s guts. The write is just a normal store to a memory location, so the poor string is caught completely unaware. If the copy-on-write were naive, then we would have modified both the original and the copy.

Oh man, is that evil. But the string’s got to be paranoid and guard against it anyways – in fact, the C++ standard explicitly calls out this case as something that has to work! In other words, when you get a reference to a char inside a string, the string is tainted. It can never participate in copy-on-write again, because you just might have stashed away the internal pointer it handed back, and write into the string as a completely unexpected time.

Of course you know better than to hang onto internal pointers. You probably don’t need this feature. But you pay for it anyways:

#include <string>
using std::string;
int main(void) {
    string the_base(1024 * 1024 * 10, 'x');
    the_base[0];
    for (int i = 0; i < 100; i++) {
        string the_copy = the_base;
    }
}

This takes the 2.5 seconds. Just one silly read wrecked all of the COW machinery for all the future copies. You’re paying for writes that never come!

Could the C++ standard have fixed this? Sure – the C++ standard goes into great detail about when iterators and references into strings are invalidated, and all it would have taken would be to add the copy constructor and operator=() to the list of reference-invalidating functions (section 23.1.5, for those who are following along in the standard). A second option would have been to declare that operator=() invalidates existing references for writing, but not for reading. But the standards committee preferred simplicity, convenience and safety to performance – uncharacteristically, if you’ll forgive a smidgen of editorializing.

If you call a reference-invalidating function, you can, in principle, get back in the COW fast lane. Although GNU STL does not take advantage of this in all cases, it does when resizing strings:

#include <string>
using std::string;
int main(void) {
    string the_base(1024 * 1024 * 10, 'x');
    the_base[0];
    the_base.push_back('x');
    for (int i = 0; i < 100; i++) {
        string the_copy = the_base;
    }
}

Allowing the string to modify itself allowed it to know that any previous references were invalidated. We're fast again! It's called a COW, and it's not about to blow it now.

The moral is that you should avoid using operator[], the at() function, iterators, etc. to read from a non-const string - especially a long string - if it may be copied in the future. If you do use these functions, you risk paying for a write that you don't make, and for hanging onto an internal pointer that you released. Unfortunately, there's no good function for just reading from a non-const string. You can do it by adding a crazy cast:

#include <string>
using std::string;
int main(void) {
	string the_base(1024 * 1024 * 10, 'x');
	const_cast<const string &>(the_base)[0];
	for (int i = 0; i < 100; i++) {
		string the_copy = the_base;
	}
}

This hits the fast path too.

Scene. I hope you enjoyed this post. As usual the point is not to rag on C++, but to explore aspects of its design that, for whatever reason, tilt a tradeoff away from performance.

 

I’m Bringing Hexy Back

July 31st, 2009

Hooray, it’s Hex Fiend 2, a nearly complete rewrite of Hex Fiend that incorporates even better techniques for working with big files. Hex Fiend is my fast and clever hex editor for Mac OS X.

Click On Me To Get Hex Fiend

This app is about exploring the implementation of standard desktop UI features in the realm of files too large to fully read into main memory. Is it possible to do copy and paste, find and replace, undo and redo, on a document that may top a hundred gigabytes, and make it feel natural? Where do we run into trouble?

More on that later. For now, here’s what’s better about Hex Fiend 2:

  • It’s embeddable. The most requested feature was "I want a hex view in my app." Now it’s really easy: Hex Fiend 2 is built as a relatively slim shell on top of a bundle-embeddable .framework. There’s a real API, sample code, and everything. See the Developer section of the Hex Fiend page.

  • It’s faster. All around. Text rendering is speedier, and the backing data representation more efficient. It needs less I/O too.

    The save algorithm is especially improved. For example, inserting one byte at the beginning of a 340 MB file and hitting Save would take 52 seconds and 340 additional MB of temporary disk space with Hex Fiend. In Hex Fiend 2 it’s reduced to 22 seconds and requires no temporary disk space.

  • Better UI. It supports discontiguous selection. Scroll-wheel scrolling no longer feels weird. You can group the bytes into blocks (credit to Blake), and you can hide and show different views (thanks to bbum). The big dorky line number view now shrinks to fit. The data inspector panel is inline. It has Safari-style inline find and replace and “pop out” selection highlighting.
  • Long operations support progress reporting, cancellation, and don’t block the UI. For example, find and replace now has a progress bar and a cancel button, and you can keep using your document (or others) while it works.
  • Longstanding bugfixes. Backwards searching is now optimized. Certain coalesced undo bugs have been addressed. There’s some basic cross-file dependency tracking.

I hope you find it useful.