Will it optimize?
July 23rd, 2010
See how well you know (or can anticipate) gcc’s optimizer. For each question, the left box contains some code, while the right box contains code that purports to do the same thing, but that illustrates a particular optimization. Will gcc apply that optimization? Put another way, will the code on the left be as fast as the code on the right, when compiled with an optimizing gcc?
I used a pretty ancient gcc 4.2.1 for these tests. If newer versions have different behavior, please leave a comment.
Beware: not all proposed optimizations are actually valid!
1. Recursion elimination
Can GCC replace recursive functions with a loop?
int factorial(int x) {
if (x > 1) return x * factorial(x-1);
else return 1;
}
int factorial(int x) {
int result = 1;
while (x > 1) result *= x--;
return result;
}
2. Loop-invariant strlen()
Will GCC hoist out strlen()?
unsigned sum(const unsigned char *s) {
unsigned result = 0;
for (size_t i=0; i < strlen(s); i++) {
result += s[i];
}
return result;
}
unsigned sum(const unsigned char *s) {
unsigned result = 0;
size_t length = strlen(s);
for (size_t i=0; i < length; i++) {
result += s[i];
}
return result;
}
3. Multiplication by 2 to addition – integer
Will GCC transform an integer multiplication by 2 to addition?
int double_it(int x) {
return x * 2;
}
int double_it(int x) {
return x + x;
}
4. Multiplication by 2 to addition – floating point
Will GCC transform a floating point multiplication by 2 to addition?
float double_it(float x) {
return x * 2.0f;
}
float double_it(float x) {
return x + x;
}
5. Division by 2 to right shift
Will GCC transform an integer division by 2 to a right shift?
int halve_it(int x) {
return x / 2;
}
int halve_it(int x) {
return x >> 1;
}
6. If-else chains to switch statements
Will GCC apply the same optimizations to if-else chains as it does to switch statements?
void function(int x) {
if (x == 0) f0();
else if (x == 1) f1();
else if (x == 2) f2();
else if (x == 3) f3();
else if (x == 4) f4();
else if (x == 5) f5();
}
void function(int x) {
switch (x) {
case 0: f0(); break;
case 1: f1(); break;
case 2: f2(); break;
case 3: f3(); break;
case 4: f4(); break;
case 5: f5(); break;
}
}
Summing up
It is tempting to think of compiler optimizations as reducing the constant in your program’s big-O complexity, and nothing else. They aren’t supposed to be able to make your program asymptotically faster, or affect its output.
However, as we saw, they really can reduce the asymptotic complexity in space (question 1) and time (question 2). They can also affect calculated results (discussion of question 4) and maybe even whether your program goes into an infinite loop (see here).
On the flip side, several “obvious” optimizations are subtly incorrect and so will not be performed by the compiler, especially when they involve floating point. If your floating point code is demonstrably a bottleneck and you don’t need exact precision or care about special FP values, you may be able to realize a speedup by doing some optimizations manually. However, untying the compiler’s hands through options like -ffast-math is probably a better idea, and then only for the affected files, since these flags have a global impact.
And lastly, this isn’t meant to be a prescriptive post, but we all know why micro-optimizing is usually a mistake: it wastes your time, it’s easy to screw up (see question 5), and it typically produces no measurable speedup.
Code smart, and be safe out there!
The Internet!
π = 3.28661035382
regehr
Awesome post, I like the quiz format.
Excellent post! Loved the quiz format; I hope it proves educational for many fellow fishes.
Enzo90910
Thanks for another excellent post. You don’t blog a lot, but man is it worth the wait!
Anonymous
Ha! Nice article!
The only problem is that on my (admittedly crappy) laptop screen (1024×768) the code overflows from the left box to the right.
Cheers!
Alexis
Ok, my opinion is that you should add a button “I think I mustn’t bother about that”.
« Premature optimization is the root of all evil ».
Stuart Dootson
Cool post – I was especially surprised by the recursion thing! As you say, even functional language compilers don’t necessarily do that well (I’m thinking of and wondering about GHC (Haskell, as I’m sure you know) as I type that…)
And why am I not surprised that someone brings out the ‘Premature optimisation’ ‘quote’ as an argument that you should never care about optimisations…
Scenario: you’re writing code for an embedded system with limited compute resource…do optimisations matter? Hell yeah.
Scenario: you’re writing code for Windows that may be running on systems several years older (and crappier) than you brand new dev. box…do optimisations matter? If you care about the user experience, hell yeah.
And the point is, by knowing *what the compiler can do*, it means you know what micro-optimisations you don’t have to bother with that would cloud the meaning of your nice clean code…
Does gcc really not hoist function calls out of loops if properly marked (e.g. with __attribute__((pure)))? I’d be quite surprised to learn that.
Anonymous
Really great post!
And yeah, the code examples overflow on my 1024×768 iPad screen
No old devices necessary
Iain
6. If-else chains to switch statements
This example isn’t optimized with gcc 4.4.3 either.
What a great idea. And a good start of the day – 6/6 was surprisingly rewarding
Anonymous
I doubt that strlen() is lifted for the reason you state. NOT because it is a built-in. The glibc headers are what define it as inline assembly, making lifting an asm reordering optimization. The other reason might be that due to the const-ness of the function argument, telling GCC there is no possibility of side-effects.
Dear Anonymous,
being const is not *nearly* good enough. The function
int add_something(const int *i) { static int j = 0; return *i + j++; }
has a const argument, but is not remotely free of side effects. Hence why I suggested __attribute__((pure)).
Aku
Gcc does hoist functions that were declared with __attribute__((pure)), and in particular it successfully optimizes example 2 even with -fno-builtin as long as you have the appropriate headers. (Tested gcc-4.2.4 glibc-2.11)
gawker
Awesome work! I love it and have learnt a little bit more about gcc
Anonymous, gcc can and by default will ignore header declarations and definitions of many standard library functions and replace them with built-ins where this is useful. For more information, see http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
(Besides, the fish is probably not using glibc. strlen() is not inlined in Mac system headers.)
Anonymous
Very informative, thats for posting
A nice variation of question 2 is when you add foo(); inside the loop and define void foo() {} in some other source module — then GCC can no longer hoist the strlen because function foo might modify s somehow (GCC cannot normally look into other source modules). Even if s was a stack-based local variable of the sum function, GCC still doesn’t optimize (maybe because foo could modify s via pointer trickery from its own stack frame).
Anonymous
The RSS entry for this seems to be broken in Google Reader, displaying a bunch of CSS junk
Though the buttons wouldn’t work anyway…
Anonymous
I think #2 is incorrect. Another thread could modify the string. My understanding is that the definition of const allows for such a thing. If I’m wrong, then either one should not be allowed to modify data when there are any const aliases, or else it should be impossible to pass non-const data to a function expecting const data. My gcc (4.3.3) does so without warning. Granted, anything which breaks this particular optimization is probably a bug anyway.
Anonymous
Excellent article, except for one thing. Unless I’m crazy (and please let me know if I am), the optimization in question 2 does not reduce the asymptotic complexity of function. Either way the loop iterates ‘length’ times. Since the length of string s doesn’t change, the cost of strlen(s) is a constant. Therefore, you’re only adding O(length) running time in the unoptimized version (the constant strlen(s) * ‘length’ iterations). The running time of the optimized version is still O(length), so there’s no asymptotic difference.
Wow I got #5 right, I actually feel proud of myself
David Emery
I got 5 of 6, missed the last one.
But I’d hope that the GCC compiler were implementing to the actual C standard, and not K&R.
Hamilton-Lovecraft
Anonymous, the cost of strlen itself is not constant, it’s O(length), so the unoptimized version is O(length^2).
nick botulism
nice post but the quiz format was superfluous. just tell me the information.
Anonymous
@Hamilton-Lovecraft, the cost of strlen() is indeed O(length). However, (unless you modify the const string s from another thread as another Anonymous suggested) the length of the string is constant, thus giving O(1).
Nice post. You should check out some of the vector optimizations the compiler can do.
Meta: I like the quiz format. Your page design is clever but the site is slow to load.
Stripes
@ Stuart:
The premature optimization quote (to me) means “if making it fast means in any way less readable, less maintainable, or evan a chance of less correct then it needs to be justified with a benchmark”. Both a benchmark that the new code is indeed faster, and another benchmark showing that it makes a useful difference in some reasonable use case.
Just because my target system is “only 800Mhz” doesn’t mean I should make an initializer that gets called 3 times less maintainable for a 2ns per call speedup. Just because my system “only has 64K RAM” doesn’t mean I should shrink a struct by 2 bytes if there are only 10 instances of the struct and it grows the code size by 24 bytes. (well, unless I have split I and D spaces, then depending on how short of RAM I am the 24 bytes could be justified).
If your optimizations are benchmark driven then it doesn’t matter if you are on a slow system or not. You will focus your speedups on the things that give the biggest bang for the buck. The real driving difference I see on a “slow system” is what the final acceptance tests might be, and that you may find speeding up the slowest 3% of things might not cut it like a desktop system, you might need to speed up the slowest 5% or 8%.
Benchmark first though, you will save a ton of time.
Chris C
Anon: However, the cost of computing strlen(N) is N
There is actually an interesting sidenote about #2. While in general you want to just put any invariant calculations in a for loops test statement and let the compiler hoist them, in the case of Objective C constructions you need to manually hoist them since the compiler doesn’t know multiple calls into a given dispatch are invariant, it and it can’t inline the method and determine that because the method it thinks it will be calling could change at runtime. I mention this because I suspect a lot of your readers are Objective C programmers.
So in the very similiar case of:
unsigned sum(NSString *s) {
unsigned result = 0;
for (size_t i=0; i < [s length]; i++) {
result += [s characterAtIndex:i];
}
return result;
}
Will not be optimized into:
unsigned sum(NSString *s) {
unsigned result = 0;
size_t length = [s length]
for (size_t i=0; i < length; i++) {
result += [s characterAtIndex:i];
}
return result;
}
since it is not safe without semantic knowledge of NSString that goes beyond what the compiler can know. The same goes for things like array iteration, though for..in and the new block based enumerators ameliorate most of that.
This does demonstrate an important point though. Despite the work Apple has done to speed up Objective C dispatch (and it is really quite fast), microbenching it against function calls and claiming there is no significant difference in general use is a folly, since Objective C method dispatch actively thwarts a number of significant optimizations, including inlining. For any code where inlining causes a performance increase Objective C methods will be slower than C function calls in ways that microbenching and -O0 builds do not demonstrate.
Alex
Now it’s time for a similar quiz on LLVM!
MikeA
Back in the g11 1.21 .. 1.41 days, which was a bit before C89 was finalized, the answer to #5 would have been “Answer D, some of the above”. At that time, division where one operand was negative was “implementation defined”, as was right shift of a negative number. gcc _assumed_ that machine division (which it used for ( x / 2 ) acted like div(), which was not true for some machines. So it applied its “add some fudge and shift right” optimization even if the target machine’s divide would have mapped perfectly onto an arithmetic right shift. Of course, it did this only if the divisor was a power of two, or more accurately, if it _knew_ the divisor was a power of two. So the line
x = y / z;
would (if y was negative) return a different answer depending on whether the optimizer beilved that z was a power of two. Big fun!
Michael Eastham
Small nitpick on #5: I believe the C standard does not specify whether >> is an arithmetic or logical right shift. On some architectures that optimization could have a result much worse than rounding the wrong way!
Pierre Habouzit
@Michael Eastham:
IIRC C defines that right shift on signed integers is implementation defined.
And in all “sane” compilers, right shift on a signed integer does bit-sign extension.
For GCC this is documented in http://gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/Integers-implementation.html#Integers-implementation, quoting:
Bitwise operators act on the representation of the value including both the sign and value bits, where the sign bit is considered immediately above the highest-value value bit. Signed `>>’ acts on negative numbers by sign extension.
Colin S. Miller
For #4, there is an optimisation that can be applied. As IEEE-floating points are stored as
±(1+mantissa) * 2 ^ exponent
then after checking for NaN, ±inf, ±0, then the floating-point number can be loaded as a raw
bitfield, the exponent increment by 1, checked for overflow, and then written back as
floating point.
Ezra
What compiler settings did you use for this?
I wasn’t able to reproduce the first case using gcc 4.2.1 on MacOS X 10.6.4; even with -O3 the generated assembly (viewed with otool) still has a recursive call to the function itself. It looks like it has done some kind of clever unrolling, but I don’t think it has recognized and eliminated a recursive call. Is your environment different?
I would like to re-iterate what one of the Anonymous commenters said, in case #2 the value of ’s’ can change if it’s mutated by a different thread, just because it’s const in ’sum’ doesn’t mean it’s const throughout the program.
This may still technically be a valid optimization since C++98 does not refer to multi-threading (unlike C++0x) but it should be mentioned.
Anonymous
I think the real moral of this story is that GCC should have a way to tell you what optimizations it could/did do. (Does it? I haven’t used GCC seriously in a little while.) Other compilers that I’ve worked with (not for C) do this, and it’s awesome.
It almost makes it a little game: “OK, the code compiled and ran just fine, but the optimizer is warning me that it can’t do optimization {foo} with function {bar} … how can I do a tiny bit of work to convince the compiler to make this massively faster for me?” (Think for a minute, add 2 hint words somewhere, compile, celebrate!)
P.S., Loved the quiz format. Code samples were a bit wide for my screen, though.
@Motti even in C++0x the optimization is valid (AFAIK) because the compiler is still permitted to cache any non-volatile values. For this reason, /* global variable */ bool quit = false; and then while (!quit) { … } in some thread may never terminate, even if the flag is set to true. In order to make it behave correctly, you have to make the quit variable volatile (this is not exactly guaranteed by the standard, though) or use proper synchronization (the compiler is forced to assume that the value has changed during the synchronization). Caching the return value of a function marked pure (by a GCC attribute) is permitted is similar manner as caching the value of the quit variable.
Where I talked of caching previously, the optimizer may simply remove or modify the code so that the effect is the same as if the variable was simply cached (e.g. in CPU register) but actually it is not checked at all.
Anonymous
I think there is a rationale to not optimizing if-else chain into switch statement: there is a chance that if-else chain was already optimized “by hand” with programmer. Say, if x becoming equal to zero 90% of time, and equal to one 8% of time etc.etc., then if-else will be faster than switch.
If other words, the programmer probably knew what they did.
Nice stuff. A must-have for some developers. I have to admit I don’t know GCC details but the other questions I answered right. We do a lot wrong in our code here in the office!
JasonG
Great! and very interesting
Great post, I enjoyed that, even if I completely bombed at the quiz (two questions right… sob).
kiran
Great Post, Liked the Quiz Format
Anonymous
This is a great article in that it convinces me not to rely on compiler optimizations but to do them myself, avoiding unexplained performance variations when seemingly benign code modifications are made.
On the other hand, it would be extremely useful (IMHO) to have a compiler option that would suggest optimizations, perhaps using the same format as you presented, from which I could choose to code manually. I could then compile without optimization on various platforms and compilers (and versions) and reasonably expect the same results and be not subjected to unpredictable optimizations.
madhu
Great post…its given nice insight into compilers
djh
I was expecting a twist in Q6, along the lines of:
#define f0() x++
#define f1() x++
#define f2() x++
#define f3() x++
#define f4() x++
The notion of tail-recursive functions is still useful because its somehow the most simple subcase that is always optimizable in that way. While the above optimizations use the fact that there is only an additional instruction that doesnt consume stack. This is a well-known optimization and actually I would be surprised if most mature compilers for functional languages wouldnt do it.