Fish tank with Puck the Fish Fish tank with Puck the Fish Water on the floor spelling Fish
 

Compiled Bitmaps

June 22nd, 2005

Ancient programmers speak in hushed tones of the legend known as the "compiled bitmap." Ok, no they don’t, but there is such a thing. It’s a technique that was sometimes used in the DOS days for rapidly drawing an image to the screen. But how can you compile a bitmap? And is there any value remaining in this technique, or should it be left in the dustbin along side TSRs and DOSSHELL? Let’s find out!

Blitting

Drawing an image is called "blitting." The usual way we blit is to load a pixel from the source image buffer, store it in the destination, and repeat until we’ve blitted every pixel.

But with compiled bitmaps, the data for the image is in the instruction stream itself. So instead of "load a pixel from the source," we say "load red," and then store it as usual. Then "load green", or whatever the next color is, and store that, etc. A compiled bitmap is a function that knows how to draw a specific image.

Instruction counts

So an advantage of a compiled bitmap is that there’s no need to load data from an image when we blit; the image data is in the instruction stream itself. But this is ameliorated by the fact that we’ll have many more instructions in the function. How many more? That depends on the processor architecture, of course, but since I’m a Mac guy, I’ll stick to PowerPC.

To load a single 32 bit pixel into a register requires two instructions: one to load each 16-bit half. And storing a pixel requires another instruction. By taking advantage of the literal offset field in the stw instruction, we can avoid having to increment our destination pointer (except when we overflow the 16 bit field every sixteen thousand pixels). The PowerPC also has a special instruction form, the store word with update instructions, stwu, that lets us increment the target pointer for free. But my tests showed that to be slower than stw, which makes sense, since it introduces more inter-instruction dependencies.

But of course, if the next pixel is the same color as the current one, there’s no need to load a new color into the register: you can just store the current value again. So sometimes we can get away with only one instruction per pixel. So the number of instructions per pixel in a compiled bitmap ranges between one and three, depending on the complexity of the image and the sophistication of our compiler.

This compares favorably to ordinary blitting, where we have more instructions per pixel (load pixel, store pixel, check for loop exit, and pointer updates). So at first blush, a compiled bitmap may appear ~ 33% faster than ordinary blitting.

Bandwidth

But this isn’t the whole story, of course; a processor is only as fast as it can be fed. How much memory bandwidth do we use? In standard blitting, the instruction stream is a very short loop which fits entirely into the instruction cache. The main bandwidth hog is reading and writing the pixels: 32 bits read, 32 bits written. That’s 32 bits each way, per pixel.

With a compiled bitmap, the instruction stream no longer fits in the cache, so the instructions themselves become the bandwidth bottleneck. 32 bits for each instruction (remember, there’s one to three), and 32 bits to write out each pixel, means between 32 and 96 bits read and 32 bits written per pixel. Based on that, we might expect a compiled bitmap to be up to three times slower than memcpy().

Test approach

I wrote a test program that blits three different images, at three different sizes, using each of three different methods. The images were the standard OS X Stones background image, an image with random pixels generated via random(), and a solid blue image. I blit (blitted?) to the screen using the CGDirectDisplay API, and repeated the tests to an in-memory buffer. The blitters were a memcpy() driven blitter, a bitmap compiler, and (for the screen only) the draw method of an NSBitmapImageRep. In each case, I blit 50,000 times and timed the number of seconds it took. Division gives the frames per second, and multiplying by the number of pixels gives pixels per second.

Here is the Xcode project I wrote. I place it in the public domain. (Note that it outputs frames per second, not pixels per second. You can calculate pixels per second by multiplying the framerate by the number of pixels in a frame. Note that I lowered the test count to 500 from 50,000 so you don’t tie up your computer for too long by accident. I also left in the stwu version of the compiler in case you’re interested.)

Results

Here’s what I got.

Pixels/second for blitting to memory:

Image Size
128×128 256×256 512×512
memcpy Stones 2,582,493,355 401,030,682 244,721,918
Compiled Bitmap Stones 961,846,601 116,857,588 116,034,503
memcpy Random 2,599,446,090 392,847,697 244,445,199
Compiled Bitmap Random 732896429 116,561,007 115,185,567
memcpy Solid 2,592,274,147 431,870,370 245,302,313
Compiled Bitmap Solid 990,974,688 351,106,794 234,451,230
Pixels/second for blitting to the screen:

Image Size
128×128 256×256 512×512
memcpy Stones 40,984,006 40,984,335 40,933,309
CoreGraphics Stones 20,115,402 10,588,428 10,611,671
Compiled Bitmap Stones 13,321,951 13,300,135 13,303,299
memcpy Random 40,996,351 40,994,852 40,935,011
CoreGraphics Random 20,128,212 10,581,521 10,602,114
Compiled Bitmap Random 13,326,014 13,299,336 13,300,511
memcpy Solid 40,993,883 40,998,191 40,936,245
CoreGraphics Solid 20,125,009 10,574,159 10,602,177
Compiled Bitmap Solid 13,324,779 13,321,780 13,310,426

Man, that’s boring. All right, let’s draw some graphs.


Count the digits…one…two…ok, yeah, note that the scales are very different here. We can blit two and a half billion pixels to memory in the same amount of time as it takes to blit forty million pixels to the screen.

So it’s pretty obvious that memcpy() is faster than my bitmap compiler. The only point where my compiled bitmap is competitive is blitting a large solid color.

Ex Post Facto Explanation

Was my Latin right? Darn if I know. Anyways, I noticed that memcpy() is using a sequence of unrolled lvx and stvx instructions. That means Altivec acceleration! It may be part of the reason memcpy() is so fast. Unfortunately, the only way to fill a vector register with literal values is to assemble them in memory and then load them in, so it would not be easy to vectorize the bitmap compiler, except in special cases. The results might be somewaht different on an Altivec-lacking G3.

Blitting the solid image to memory via a compiled bitmap is between two and three times faster than the non-solid images, and is competitive with memcpy(), except in the 128×128 case. I don’t know why memcpy() is so much faster in the 128×128 case. I also don’t know why blitting to the screen is so much faster with memcpy() compared with the other techniques. If you know or have ideas, please comment on them!

It’s also very likely that I’ve missed a major optimization opportunity. I’m far from a guru when it comes to the PowerPC. If you spot one, please post it in the comments!

Looking ahead

My clumsy tests seem to confirm that compiled bitmaps aren’t very fast. But note that the PowerPC has fairly big instructions, so it may not be the best candidate for a bitmap compiler. It takes twelve bytes of PowerPC instructions to write a 32 bit literal to a location in memory, but only six bytes of x86 instructions, which halves our bandwidth requirements. When the ICBMs ship, I’ll write a bitmap compiler for them and repeat these tests.

 

The Internet!

π = 3.2860203432

Once again, a fantastically interesting read. Thank you for posting. =D

Jonathan Paisley

I’d heard of compiled bitmaps for blitting partially transparent images (with binary per-pixel alpha — so no blending). I wonder what the tradeoffs would be in that case. The compiled blitter can omit writing transparent pixels altogether, whereas other blitters (ignoring graphics hardware acceleration) would have to read the entire source bitmap.

good job, peter! cool article

Wow, this is a real trip down memory lane. Are you going to cover color-register animation in your next post? :D

I’ve come to realize that some of the techniques I’ve forgotten over the years really are best forgotten…

-jcr

The fastest way to blit would be to use the PowerPCs FPU to copy 64bit values. Trust me I’ve been using it for years.

ok Altivec … if you’ve got it

Anonymous

Hmm… yeah, but on x86 there’s also the string instructions, too…

Hi,

just curious: Do you have an ICBM yet? I’d really be interested in hearing whether it works better for them.

Cheers,
– Uli

Oh, and to Mr Paisley: I’d guess that you’d get about the same results as RLEing your data would give you… If you know a row is transparent, then you can just skip it. SO I’d doubt it’s an optimisation specific to compiled bitmaps.

Paul

blitting to the screen may be so faster because of a write-combining capability …

There is a nice S/N here :-) !

Because of the cache, using an RLE decoder ( that should be less than 32bytes long if the source and destination pixel formats are the same ) will be much more efficient than having some generated code to blit the RLE encoded bitmap.

Back in the old days of DOS I wrote a transparent sprite blitter. My main concern were the speed and the fact that the sprites could be partially outside of the screen. So I had a mask that was not compressed, but contained the RLE values of the mask. i.e: ALLLLLLL. Where the bit A determined if the run is transparent or not, and the bits L determined the length of the run.

Also as noted above, using the FPU is the way to go if all you have is 16 or 32 bits registers otherwise. Which makes me wonder what kind of CPU you have since my 21yo Atari has 16 32bits registers.

You should check :

http://www.azillionmonkeys.com/qed/blockcopy.html
http://www.azillionmonkeys.com/qed/asmexample.html (esp: Sprite data copying)

They are aimed at x86 processors but they provide actual code with annotations of which pipeline will execute which instructions, and

Alex Nekrasov

compiled bitmap is by nature something you’d do on a very low performance HW, which Macs with their instruction cache and Altivec aren’t.

If you try the same on an ARM7 the method may be still useful.

On the other hand, embedded applications are usually restricted to a certain executable size and suffer from boot-up performance problems, so we usually do bitmap serialization instead of compilation. Which is an altogether different technique, so my comment may not be very informative.

Maynard Handley

At the time this article was written, THE problem with blitting on macs was the behavior of PCI/AGP.
The important points to note are
- PCI and AGP were way slower than main memory
- these buses were 32 bits wide
- they multiplexed data and address over the same lines BUT, and this is important, they could be run in a streaming mode where you set the first address, and then just shoveled out data, with the address implicitly incrementing
- the chipsets in use by macs at this team had horrible limitations in their interactions with these buses

What does this all mean? It means the fastest way to write to write to screen was in an ascending stream of writes of maximal width.
You want an ascending stream of writes to utilize the implicit address increment. (This means that the fancy sort of blit patterns might appear to make sense when decoding certain types of data, eg interlaced images, or images comprised of square tiles, should not be used — just start at the top and proceed in raster order).
You want as wide a write as possible to keep the queue between the memory bus and the PCI bus as full as possible, which in turn means the bridge chip is most likely to keep shoveling data to PCI rather than deciding there has been enough of a break between writes that its time for a new address to be sent.

The intel bridge chips of the time (and of course still) provided write combining, so that no matter how you wrote your data to the PCI/AGP the chip would glom it into the largest possible units and then send these optimally. This meant that it didn’t much matter on PCs if you blitted using bytes, shorts, doubles, SSE or anything else. Those chipsets also seemed to allow for the maximal possible run length of data to PCI before resending an address. I could never get exact details on what the mac chips did because Apple, in their constant attempt to provide the best possible developer experience, never released the details, but the mac chips seemed to be operating on some sort of counter based system, so that after N memory transactions on the memory bus side they would start a new PCI transaction. (It’s possible that they were sufficiently lame that N was, in this case, 1 — so a word write would result in a PCI addr+data transaction, a double write would give us addr+2data, and VMX addr+4data — meanwhile Intel chips were happily doing addr+ (godawful number of) data, even when fed a stream of bytes! Remember this, kids, when you think that Apple could so so much better a job going to using its own chipsets for either Macs or iOS devices.)

In other words at that time, it was ALL about how you write the data — how you read/generate the data was a second order effect.
It looks to me like CoreGraphics is doing the right thing (for some value of “right”) at 128×128, but not at larger sizes, and that this is a bug. The right thing is presumably using doubles to read and write the data. Possibly the 256 and 512 images were not 8-aligned, and so CoreGraphics fell back to UINT32 reads and writes. (Though this is a bug — the correct thing in that case would be to use misaligned double reads, which will take a fairly minor hit on the memory read side, and aligned double writes.) Of course the larger right thing to do would be to update CoreGraphics to use VMX (again with unaligned reads handled using the two reads + permute VMX jiggery pokery), and VMX aligned writes. I don’t know exactly what stage of OSX this was at, but let’s hope that was done before the switch to Intel and then 10.6 made PPC optimizations irrelevant.

I’m out of this game now, so I don’t know the characteristics of PCI-e, but I suspect variants of the above ideas remain valid for iOS devices. Sadly Apple remains stuck in its same old ways of not providing developers with rich hardware specifications which describe these sorts of details and thereby allow one to write optimal code for the hardware.

=======================

BTW is there a bug in your pi tracking code? I wrote some quick Mathematica:
Prime[Range[num]] // (#^2 – 1)/#^2 & // FoldList[Times, 1, #] & // Sqrt[6/#] & // N

to track how pi is approximated as the number of primes increases and we get something like
{2.44949, 2.82843, 3., 3.06186, 3.09359, 3.10646, 3.11569, 3.12109,
3.12542, 3.12838, 3.13024, 3.13187, 3.13302, 3.13395, 3.1348,
3.13551, 3.13607, 3.13652, 3.13694, 3.13729, 3.1376, 3.13789,
3.13814, 3.13837, 3.13857, 3.13874, 3.13889, 3.13904, 3.13918, … 3.14119 (at num primes=100)}
Note specifically that the number is always below pi.

I then ran a simulation of the idea — generate 1000,000 random integers between 0 and 10^35, test if they were relatively prime or not, yadda yadda,
Sum[ Boole@CoprimeQ[RandomInteger[num2], RandomInteger[num2]], {i, num1}] // Sqrt[num1*6/#] & // N

(which took a few seconds — Mathematica plus modern CPUs are a pretty amazing combination)
and got on three successive runs 3.14276, 3.14127 and 3.14271
Obviously you have rather fewer than 1000,000 samples; but even when I dropped to 1000 samples I was getting values like 3.18 or 3.15 — nothing as extreme as 3.28.
It seems like either people are deliberately gaming the system with the numbers they provide (so they are not truly random) or you have a bug in your code. It might be interesting to do the calculations again with numbers that are slightly less gamed — using values input by different people against each other, or adding or subtracting one from the inputs.