<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Compiled Bitmaps</title>
	<atom:link href="http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/feed/" rel="self" type="application/rss+xml" />
	<link>http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/</link>
	<description>serious code</description>
	<lastBuildDate>Sat, 31 Jul 2010 18:07:58 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Maynard Handley</title>
		<link>http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/comment-page-1/#comment-1077</link>
		<dc:creator>Maynard Handley</dc:creator>
		<pubDate>Wed, 28 Jul 2010 23:23:23 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=18#comment-1077</guid>
		<description>At the time this article was written, THE problem with blitting on macs was the behavior of PCI/AGP. 
The important points to note are 
- PCI and AGP were way slower than main memory
- these buses were 32 bits wide
- they multiplexed data and address over the same lines BUT, and this is important, they could be run in a streaming mode where you set the first address, and then just shoveled out data, with the address implicitly incrementing
- the chipsets in use by macs at this time had horrible limitations in their interactions with these buses

What does this all mean? It means the fastest way to write to the screen was in an ascending stream of writes of maximal width.
You want an ascending stream of writes to utilize the implicit address increment. (This means that the fancy sorts of blit patterns that might appear to make sense when decoding certain types of data, eg interlaced images, or images composed of square tiles, should not be used -- just start at the top and proceed in raster order.)
You want as wide a write as possible to keep the queue between the memory bus and the PCI bus as full as possible, which in turn means the bridge chip is most likely to keep shoveling data to PCI rather than deciding there has been enough of a break between writes that it&#039;s time for a new address to be sent.

The Intel bridge chips of the time (and of course still) provided write combining, so that no matter how you wrote your data to the PCI/AGP the chip would glom it into the largest possible units and then send these optimally. This meant that it didn&#039;t much matter on PCs if you blitted using bytes, shorts, doubles, SSE or anything else. Those chipsets also seemed to allow for the maximal possible run length of data to PCI before resending an address. I could never get exact details on what the mac chips did because Apple, in their constant attempt to provide the best possible developer experience, never released the details, but the mac chips seemed to be operating on some sort of counter-based system, so that after N memory transactions on the memory bus side they would start a new PCI transaction. (It&#039;s possible that they were sufficiently lame that N was, in this case, 1 --- so a word write would result in a PCI addr+data transaction, a double write would give us addr+2data, and VMX addr+4data --- meanwhile Intel chips were happily doing addr+ (godawful number of) data, even when fed a stream of bytes! Remember this, kids, when you think that Apple could do so much better a job by going to its own chipsets for either Macs or iOS devices.)

In other words at that time, it was ALL about how you write the data --- how you read/generate the data was a second order effect.
It looks to me like CoreGraphics is doing the right thing (for some value of &quot;right&quot;) at 128x128, but not at larger sizes, and that this is a bug. The right thing is presumably using doubles to read and write the data. Possibly the 256 and 512 images were not 8-aligned, and so CoreGraphics fell back to UINT32 reads and writes. (Though this is a bug --- the correct thing in that case would be to use misaligned double reads, which will take a fairly minor hit on the memory read side, and aligned double writes.) Of course the larger right thing to do would be to update CoreGraphics to use VMX (again with unaligned reads handled using the two reads + permute VMX jiggery pokery), and VMX aligned writes. I don&#039;t know exactly what stage of OSX this was at, but let&#039;s hope that was done before the switch to Intel and then 10.6 made PPC optimizations irrelevant.

I&#039;m out of this game now, so I don&#039;t know the characteristics of PCI-e, but I suspect variants of the above ideas remain valid for iOS devices. Sadly Apple remains stuck in its same old ways of not providing developers with rich hardware specifications which describe these sorts of details and thereby allow one to write optimal code for the hardware.

=======================


BTW is there a bug in your pi tracking code? I wrote some quick Mathematica:
Prime[Range[num]] // (#^2 - 1)/#^2 &amp; // FoldList[Times, 1, #] &amp; // Sqrt[6/#] &amp; // N

to track how pi is approximated as the number of primes increases and we get something like
{2.44949, 2.82843, 3., 3.06186, 3.09359, 3.10646, 3.11569, 3.12109, 
3.12542, 3.12838, 3.13024, 3.13187, 3.13302, 3.13395, 3.1348, 
3.13551, 3.13607, 3.13652, 3.13694, 3.13729, 3.1376, 3.13789, 
3.13814, 3.13837, 3.13857, 3.13874, 3.13889, 3.13904, 3.13918, ... 3.14119 (at num primes=100)}
Note specifically that the number is always below pi.

I then ran a simulation of the idea --- generate 1,000,000 random integers between 0 and 10^35, test if they were relatively prime or not, yadda yadda,
Sum[ Boole@CoprimeQ[RandomInteger[num2], RandomInteger[num2]], {i, num1}] // Sqrt[num1*6/#] &amp; // N

(which took a few seconds --- Mathematica plus modern CPUs are a pretty amazing combination)
and got on three successive runs 3.14276, 3.14127 and 3.14271
Obviously you have rather fewer than 1,000,000 samples; but even when I dropped to 1000 samples I was getting values like 3.18 or 3.15 --- nothing as extreme as 3.28.
It seems like either people are deliberately gaming the system with the numbers they provide (so they are not truly random) or you have a bug in your code. It might be interesting to do the calculations again with numbers that are slightly less gamed --- using values input by different people against each other, or adding or subtracting one from the inputs.</description>
		<content:encoded><![CDATA[<p>At the time this article was written, THE problem with blitting on macs was the behavior of PCI/AGP.<br />
The important points to note are<br />
- PCI and AGP were way slower than main memory<br />
- these buses were 32 bits wide<br />
- they multiplexed data and address over the same lines BUT, and this is important, they could be run in a streaming mode where you set the first address, and then just shoveled out data, with the address implicitly incrementing<br />
- the chipsets in use by macs at this time had horrible limitations in their interactions with these buses</p>
<p>What does this all mean? It means the fastest way to write to the screen was in an ascending stream of writes of maximal width.<br />
You want an ascending stream of writes to utilize the implicit address increment. (This means that the fancy sorts of blit patterns that might appear to make sense when decoding certain types of data, eg interlaced images, or images composed of square tiles, should not be used &#8212; just start at the top and proceed in raster order.)<br />
You want as wide a write as possible to keep the queue between the memory bus and the PCI bus as full as possible, which in turn means the bridge chip is most likely to keep shoveling data to PCI rather than deciding there has been enough of a break between writes that it&#8217;s time for a new address to be sent.</p>
<p>The Intel bridge chips of the time (and of course still) provided write combining, so that no matter how you wrote your data to the PCI/AGP the chip would glom it into the largest possible units and then send these optimally. This meant that it didn&#8217;t much matter on PCs if you blitted using bytes, shorts, doubles, SSE or anything else. Those chipsets also seemed to allow for the maximal possible run length of data to PCI before resending an address. I could never get exact details on what the mac chips did because Apple, in their constant attempt to provide the best possible developer experience, never released the details, but the mac chips seemed to be operating on some sort of counter-based system, so that after N memory transactions on the memory bus side they would start a new PCI transaction. (It&#8217;s possible that they were sufficiently lame that N was, in this case, 1 &#8212; so a word write would result in a PCI addr+data transaction, a double write would give us addr+2data, and VMX addr+4data &#8212; meanwhile Intel chips were happily doing addr+ (godawful number of) data, even when fed a stream of bytes! Remember this, kids, when you think that Apple could do so much better a job by going to its own chipsets for either Macs or iOS devices.)</p>
<p>In other words at that time, it was ALL about how you write the data &#8212; how you read/generate the data was a second order effect.<br />
It looks to me like CoreGraphics is doing the right thing (for some value of &#8220;right&#8221;) at 128&#215;128, but not at larger sizes, and that this is a bug. The right thing is presumably using doubles to read and write the data. Possibly the 256 and 512 images were not 8-aligned, and so CoreGraphics fell back to UINT32 reads and writes. (Though this is a bug &#8212; the correct thing in that case would be to use misaligned double reads, which will take a fairly minor hit on the memory read side, and aligned double writes.) Of course the larger right thing to do would be to update CoreGraphics to use VMX (again with unaligned reads handled using the two reads + permute VMX jiggery pokery), and VMX aligned writes. I don&#8217;t know exactly what stage of OSX this was at, but let&#8217;s hope that was done before the switch to Intel and then 10.6 made PPC optimizations irrelevant.</p>
<p>I&#8217;m out of this game now, so I don&#8217;t know the characteristics of PCI-e, but I suspect variants of the above ideas remain valid for iOS devices. Sadly Apple remains stuck in its same old ways of not providing developers with rich hardware specifications which describe these sorts of details and thereby allow one to write optimal code for the hardware.</p>
<p>=======================</p>
<p>BTW is there a bug in your pi tracking code? I wrote some quick Mathematica:<br />
Prime[Range[num]] // (#^2 - 1)/#^2 &amp; // FoldList[Times, 1, #] &amp; // Sqrt[6/#] &amp; // N</p>
<p>to track how pi is approximated as the number of primes increases and we get something like<br />
{2.44949, 2.82843, 3., 3.06186, 3.09359, 3.10646, 3.11569, 3.12109,<br />
3.12542, 3.12838, 3.13024, 3.13187, 3.13302, 3.13395, 3.1348,<br />
3.13551, 3.13607, 3.13652, 3.13694, 3.13729, 3.1376, 3.13789,<br />
3.13814, 3.13837, 3.13857, 3.13874, 3.13889, 3.13904, 3.13918, &#8230; 3.14119 (at num primes=100)}<br />
Note specifically that the number is always below pi.</p>
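<p>The running product above can also be sketched in Python. This is an illustrative translation of the Mathematica one-liner, not code from the post: the helper names <code>first_primes</code> and <code>pi_estimates</code> are made up for the example, and plain trial division is assumed to be fast enough for a hundred or so primes.</p>
<pre>
```python
import math


def first_primes(count):
    """Return the first `count` primes by trial division (fine for small counts)."""
    primes = []
    candidate = 2
    while len(primes) != count:
        # candidate is prime if no smaller prime divides it
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def pi_estimates(count):
    """Running values of sqrt(6 / product of (p^2 - 1) / p^2) over the first primes.

    The first entry uses the empty product (1), like FoldList[Times, 1, ...].
    """
    product = 1.0
    estimates = [math.sqrt(6.0 / product)]
    for p in first_primes(count):
        product *= (p * p - 1) / (p * p)
        estimates.append(math.sqrt(6.0 / product))
    return estimates


if __name__ == "__main__":
    print([round(value, 5) for value in pi_estimates(3)])
```
</pre>
<p>Every factor is below 1, so the product only shrinks toward 6/pi^2 and each estimate stays below pi, which is the behaviour noted above.</p>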
<p>I then ran a simulation of the idea &#8212; generate 1,000,000 random integers between 0 and 10^35, test if they were relatively prime or not, yadda yadda,<br />
Sum[ Boole@CoprimeQ[RandomInteger[num2], RandomInteger[num2]], {i, num1}] // Sqrt[num1*6/#] &amp; // N</p>
<p>(which took a few seconds &#8212; Mathematica plus modern CPUs are a pretty amazing combination)<br />
and got on three successive runs 3.14276, 3.14127 and 3.14271<br />
Obviously you have rather fewer than 1,000,000 samples; but even when I dropped to 1000 samples I was getting values like 3.18 or 3.15 &#8212; nothing as extreme as 3.28.<br />
It seems like either people are deliberately gaming the system with the numbers they provide (so they are not truly random) or you have a bug in your code. It might be interesting to do the calculations again with numbers that are slightly less gamed &#8212; using values input by different people against each other, or adding or subtracting one from the inputs.</p>
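<p>The same simulation can be sketched in Python for anyone without Mathematica. This is a rough equivalent under stated assumptions, not the blog's pi tracking code: the function name <code>estimate_pi_from_coprimes</code> and its <code>seed</code> parameter are invented for the example, and the inputs are assumed to be uniformly random, which is exactly what human-supplied numbers may not be.</p>
<pre>
```python
import math
import random


def estimate_pi_from_coprimes(samples, upper, seed=None):
    """Estimate pi from the fraction of random integer pairs that are coprime.

    Pairs are drawn uniformly from 0..upper inclusive. The coprime fraction
    tends to 6 / pi^2, so pi is roughly sqrt(6 * samples / coprime_count).
    Raises ZeroDivisionError if no pair is coprime (only likely for tiny runs).
    """
    rng = random.Random(seed)  # seed is only for reproducible runs
    coprime_count = 0
    for _ in range(samples):
        a = rng.randint(0, upper)
        b = rng.randint(0, upper)
        if math.gcd(a, b) == 1:
            coprime_count += 1
    return math.sqrt(6.0 * samples / coprime_count)


if __name__ == "__main__":
    for n in (1000, 100000):
        print(n, estimate_pi_from_coprimes(n, 10**35))
```
</pre>
<p>As a rough guide, a delta-method approximation puts the standard deviation of this estimate at about 0.04 for 1000 truly random pairs, so values like 3.15 or 3.18 are unremarkable, while 3.28 would be several standard deviations out.</p>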
]]></content:encoded>
	</item>
	<item>
		<title>By: Alex Nekrasov</title>
		<link>http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/comment-page-1/#comment-86</link>
		<dc:creator>Alex Nekrasov</dc:creator>
		<pubDate>Tue, 27 Feb 2007 12:45:23 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=18#comment-86</guid>
		<description>A compiled bitmap is by nature something you&#039;d do on very low-performance hardware, which Macs with their instruction cache and Altivec aren&#039;t.

If you try the same on an ARM7 the method may still be useful.

On the other hand, embedded applications are usually restricted to a certain executable size and suffer from boot-up performance problems, so we usually do bitmap serialization instead of compilation. Which is an altogether different technique, so my comment may not be very informative.</description>
		<content:encoded><![CDATA[<p>A compiled bitmap is by nature something you&#8217;d do on very low-performance hardware, which Macs with their instruction cache and Altivec aren&#8217;t.</p>
<p>If you try the same on an ARM7 the method may still be useful.</p>
<p>On the other hand, embedded applications are usually restricted to a certain executable size and suffer from boot-up performance problems, so we usually do bitmap serialization instead of compilation. Which is an altogether different technique, so my comment may not be very informative.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: p01</title>
		<link>http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/comment-page-1/#comment-85</link>
		<dc:creator>p01</dc:creator>
		<pubDate>Wed, 19 Jul 2006 14:11:33 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=18#comment-85</guid>
		<description>Because of the cache, using an RLE decoder (which should be less than 32 bytes long if the source and destination pixel formats are the same) will be much more efficient than having some generated code to blit the RLE-encoded bitmap.

Back in the old days of DOS I wrote a transparent sprite blitter. My main concerns were speed and the fact that the sprites could be partially outside of the screen. So I had a mask that was not compressed, but contained the RLE values of the mask, i.e. ALLLLLLL, where the bit A determined whether the run is transparent or not, and the bits L determined the length of the run.


Also as noted above, using the FPU is the way to go if all you have otherwise is 16- or 32-bit registers. Which makes me wonder what kind of CPU you have, since my 21yo Atari has 16 32-bit registers.

You should check :

 - http://www.azillionmonkeys.com/qed/blockcopy.html
 - http://www.azillionmonkeys.com/qed/asmexample.html (esp: Sprite data copying)

They are aimed at x86 processors but they provide actual code with annotations of which pipeline will execute which instructions, and</description>
		<content:encoded><![CDATA[<p>Because of the cache, using an RLE decoder (which should be less than 32 bytes long if the source and destination pixel formats are the same) will be much more efficient than having some generated code to blit the RLE-encoded bitmap.</p>
<p>Back in the old days of DOS I wrote a transparent sprite blitter. My main concerns were speed and the fact that the sprites could be partially outside of the screen. So I had a mask that was not compressed, but contained the RLE values of the mask, i.e. ALLLLLLL, where the bit A determined whether the run is transparent or not, and the bits L determined the length of the run.</p>
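<p>A mask of that shape could be walked roughly as follows. This is a sketch under assumptions rather than the original DOS blitter: the name <code>blit_masked_row</code> is hypothetical, one byte per pixel is assumed, A = 1 is taken to mean an opaque run (the comment does not say which way round it was), and clipping against the screen edge is left out.</p>
<pre>
```python
def blit_masked_row(mask, src, dst, dst_start=0):
    """Copy the opaque runs of `src` into `dst`, skipping transparent runs.

    Each mask byte is ALLLLLLL: the top bit A is the run type and the low
    seven bits L are the run length in pixels.
    Assumption: A == 1 marks an opaque run, A == 0 a transparent one.
    No clipping is done; the caller must make sure the row fits in `dst`.
    """
    pos = 0  # current pixel offset within the row
    for byte in mask:
        # divmod by 128 splits the byte into the A bit and the 7-bit length
        opaque, length = divmod(byte, 128)
        if opaque:
            start = dst_start + pos
            dst[start:start + length] = src[pos:pos + length]
        pos += length
    return dst


if __name__ == "__main__":
    row = bytes([1, 2, 3, 4, 5, 6])
    mask = bytes([2, 128 + 3, 1])  # 2 transparent, 3 opaque, 1 transparent
    screen = bytearray(8)
    print(list(blit_masked_row(mask, row, screen, dst_start=1)))
```
</pre>
<p>The loop touches only opaque runs, so transparent pixels cost one mask byte per run instead of a test per pixel.</p>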
<p>Also as noted above, using the FPU is the way to go if all you have otherwise is 16- or 32-bit registers. Which makes me wonder what kind of CPU you have, since my 21yo Atari has 16 32-bit registers.</p>
<p>You should check :</p>
<p> &#8211; <a href="http://www.azillionmonkeys.com/qed/blockcopy.html" rel="nofollow">http://www.azillionmonkeys.com/qed/blockcopy.html</a><br />
 &#8211; <a href="http://www.azillionmonkeys.com/qed/asmexample.html" rel="nofollow">http://www.azillionmonkeys.com/qed/asmexample.html</a> (esp: Sprite data copying)</p>
<p>They are aimed at x86 processors but they provide actual code with annotations of which pipeline will execute which instructions, and</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Paul</title>
		<link>http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/comment-page-1/#comment-84</link>
		<dc:creator>Paul</dc:creator>
		<pubDate>Thu, 18 May 2006 14:30:57 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=18#comment-84</guid>
		<description>blitting to the screen may be so much faster because of a write-combining capability ...

There is a nice S/N here :-) !</description>
		<content:encoded><![CDATA[<p>blitting to the screen may be so much faster because of a write-combining capability &#8230;</p>
<p>There is a nice S/N here <img src='http://ridiculousfish.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  !</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Uli Kusterer</title>
		<link>http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/comment-page-1/#comment-83</link>
		<dc:creator>Uli Kusterer</dc:creator>
		<pubDate>Thu, 30 Mar 2006 11:35:40 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=18#comment-83</guid>
		<description>Oh, and to Mr Paisley: I&#039;d guess that you&#039;d get about the same results as RLEing your data would give you... If you know a row is transparent, then you can just skip it. So I&#039;d doubt it&#039;s an optimisation specific to compiled bitmaps.</description>
		<content:encoded><![CDATA[<p>Oh, and to Mr Paisley: I&#8217;d guess that you&#8217;d get about the same results as RLEing your data would give you&#8230; If you know a row is transparent, then you can just skip it. So I&#8217;d doubt it&#8217;s an optimisation specific to compiled bitmaps.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Uli Kusterer</title>
		<link>http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/comment-page-1/#comment-82</link>
		<dc:creator>Uli Kusterer</dc:creator>
		<pubDate>Thu, 30 Mar 2006 11:32:38 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=18#comment-82</guid>
		<description>Hi,

 just curious: Do you have an ICBM yet? I&#039;d really be interested in hearing whether it works better for them.

Cheers,
-- Uli</description>
		<content:encoded><![CDATA[<p>Hi,</p>
<p> just curious: Do you have an ICBM yet? I&#8217;d really be interested in hearing whether it works better for them.</p>
<p>Cheers,<br />
&#8211; Uli</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anonymous</title>
		<link>http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/comment-page-1/#comment-81</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Sat, 21 Jan 2006 04:02:12 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=18#comment-81</guid>
		<description>Hmm... yeah, but on x86 there&#039;s also the string instructions, too...</description>
		<content:encoded><![CDATA[<p>Hmm&#8230; yeah, but on x86 there&#8217;s also the string instructions, too&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Samh</title>
		<link>http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/comment-page-1/#comment-80</link>
		<dc:creator>Samh</dc:creator>
		<pubDate>Wed, 24 Aug 2005 12:32:15 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=18#comment-80</guid>
		<description>ok Altivec ... if you&#039;ve got it</description>
		<content:encoded><![CDATA[<p>ok Altivec &#8230; if you&#8217;ve got it</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Samh</title>
		<link>http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/comment-page-1/#comment-79</link>
		<dc:creator>Samh</dc:creator>
		<pubDate>Tue, 23 Aug 2005 14:17:10 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=18#comment-79</guid>
		<description>The fastest way to blit would be to use the PowerPC&#039;s FPU to copy 64-bit values. Trust me, I&#039;ve been using it for years.</description>
		<content:encoded><![CDATA[<p>The fastest way to blit would be to use the PowerPC&#8217;s FPU to copy 64-bit values. Trust me, I&#8217;ve been using it for years.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John C. Randolph</title>
		<link>http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/comment-page-1/#comment-78</link>
		<dc:creator>John C. Randolph</dc:creator>
		<pubDate>Sun, 26 Jun 2005 10:08:32 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=18#comment-78</guid>
		<description>Wow, this is a real trip down memory lane.   Are you going to cover color-register animation in your next post?  :D

I&#039;ve come to realize that some of the techniques I&#039;ve forgotten over the years really are best forgotten...

-jcr</description>
		<content:encoded><![CDATA[<p>Wow, this is a real trip down memory lane.   Are you going to cover color-register animation in your next post?  <img src='http://ridiculousfish.com/blog/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
<p>I&#8217;ve come to realize that some of the techniques I&#8217;ve forgotten over the years really are best forgotten&#8230;</p>
<p>-jcr</p>
]]></content:encoded>
	</item>
</channel>
</rss>
