<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments for ridiculous_fish</title>
	<atom:link href="http://ridiculousfish.com/blog/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://ridiculousfish.com/blog</link>
	<description>serious code</description>
	<lastBuildDate>Sat, 31 Jul 2010 18:07:58 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>Comment on Will it optimize? by djh</title>
		<link>http://ridiculousfish.com/blog/archives/2010/07/23/will-it-optimize/comment-page-1/#comment-1084</link>
		<dc:creator>djh</dc:creator>
		<pubDate>Sat, 31 Jul 2010 18:07:58 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=679#comment-1084</guid>
		<description>I was expecting a twist in Q6, along the lines of:

#define f0()  x++
#define f1()  x++
#define f2()  x++
#define f3()  x++
#define f4()  x++</description>
		<content:encoded><![CDATA[<p>I was expecting a twist in Q6, along the lines of:</p>
<p>#define f0()  x++<br />
#define f1()  x++<br />
#define f2()  x++<br />
#define f3()  x++<br />
#define f4()  x++</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Spam by Anonymous</title>
		<link>http://ridiculousfish.com/blog/archives/2005/08/09/spam/comment-page-8/#comment-1083</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Sat, 31 Jul 2010 06:36:42 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=23#comment-1083</guid>
		<description>boobah</description>
		<content:encoded><![CDATA[<p>boobah</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Will it optimize? by madhu</title>
		<link>http://ridiculousfish.com/blog/archives/2010/07/23/will-it-optimize/comment-page-1/#comment-1082</link>
		<dc:creator>madhu</dc:creator>
		<pubDate>Fri, 30 Jul 2010 13:38:41 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=679#comment-1082</guid>
		<description>Great post... it&#039;s given nice insight into compilers</description>
		<content:encoded><![CDATA[<p>Great post&#8230; it&#8217;s given nice insight into compilers</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Will it optimize? by Anonymous</title>
		<link>http://ridiculousfish.com/blog/archives/2010/07/23/will-it-optimize/comment-page-1/#comment-1081</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Thu, 29 Jul 2010 16:38:43 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=679#comment-1081</guid>
		<description>This is a great article in that it convinces me not to rely on compiler optimizations but to do them myself, avoiding unexplained performance variations when seemingly benign code modifications are made.

On the other hand, it would be extremely useful (IMHO) to have a compiler option that would suggest optimizations, perhaps using the same format as you presented, from which I could choose to code manually.  I could then compile without optimization on various platforms and compilers (and versions) and reasonably expect the same results and not be subjected to unpredictable optimizations.</description>
		<content:encoded><![CDATA[<p>This is a great article in that it convinces me not to rely on compiler optimizations but to do them myself, avoiding unexplained performance variations when seemingly benign code modifications are made.</p>
<p>On the other hand, it would be extremely useful (IMHO) to have a compiler option that would suggest optimizations, perhaps using the same format as you presented, from which I could choose to code manually.  I could then compile without optimization on various platforms and compilers (and versions) and reasonably expect the same results and not be subjected to unpredictable optimizations.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Will it optimize? by kiran</title>
		<link>http://ridiculousfish.com/blog/archives/2010/07/23/will-it-optimize/comment-page-1/#comment-1080</link>
		<dc:creator>kiran</dc:creator>
		<pubDate>Thu, 29 Jul 2010 16:37:08 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=679#comment-1080</guid>
		<description>Great Post, Liked the Quiz Format</description>
		<content:encoded><![CDATA[<p>Great Post, Liked the Quiz Format</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Will it optimize? by Oisín</title>
		<link>http://ridiculousfish.com/blog/archives/2010/07/23/will-it-optimize/comment-page-1/#comment-1079</link>
		<dc:creator>Oisín</dc:creator>
		<pubDate>Thu, 29 Jul 2010 15:30:49 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=679#comment-1079</guid>
		<description>Great post, I enjoyed that, even if I completely bombed at the quiz (two questions right... sob).</description>
		<content:encoded><![CDATA[<p>Great post, I enjoyed that, even if I completely bombed at the quiz (two questions right&#8230; sob).</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Will it optimize? by JasonG</title>
		<link>http://ridiculousfish.com/blog/archives/2010/07/23/will-it-optimize/comment-page-1/#comment-1078</link>
		<dc:creator>JasonG</dc:creator>
		<pubDate>Thu, 29 Jul 2010 12:48:58 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=679#comment-1078</guid>
		<description>Great! and very interesting</description>
		<content:encoded><![CDATA[<p>Great! and very interesting</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Compiled Bitmaps by Maynard Handley</title>
		<link>http://ridiculousfish.com/blog/archives/2005/06/22/compiled-bitmaps/comment-page-1/#comment-1077</link>
		<dc:creator>Maynard Handley</dc:creator>
		<pubDate>Wed, 28 Jul 2010 23:23:23 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=18#comment-1077</guid>
		<description>At the time this article was written, THE problem with blitting on macs was the behavior of PCI/AGP. 
The important points to note are 
- PCI and AGP were way slower than main memory
- these buses were 32 bits wide
- they multiplexed data and address over the same lines BUT, and this is important, they could be run in a streaming mode where you set the first address, and then just shoveled out data, with the address implicitly incrementing
- the chipsets in use by macs at this time had horrible limitations in their interactions with these buses

What does this all mean? It means the fastest way to write to the screen was in an ascending stream of writes of maximal width. 
You want an ascending stream of writes to utilize the implicit address increment. (This means that the fancy sorts of blit patterns that might appear to make sense when decoding certain types of data, e.g. interlaced images, or images comprised of square tiles, should not be used -- just start at the top and proceed in raster order).
You want as wide a write as possible to keep the queue between the memory bus and the PCI bus as full as possible, which in turn means the bridge chip is most likely to keep shoveling data to PCI rather than deciding there has been enough of a break between writes that it&#039;s time for a new address to be sent. 

The Intel bridge chips of the time (and of course still) provided write combining, so that no matter how you wrote your data to the PCI/AGP the chip would glom it into the largest possible units and then send these optimally. This meant that it didn&#039;t much matter on PCs if you blitted using bytes, shorts, doubles, SSE or anything else. Those chipsets also seemed to allow for the maximal possible run length of data to PCI before resending an address. I could never get exact details on what the mac chips did because Apple, in their constant attempt to provide the best possible developer experience, never released the details, but the mac chips seemed to be operating on some sort of counter-based system, so that after N memory transactions on the memory bus side they would start a new PCI transaction. (It&#039;s possible that they were sufficiently lame that N was, in this case, 1 --- so a word write would result in a PCI addr+data transaction, a double write would give us addr+2data, and VMX addr+4data --- meanwhile Intel chips were happily doing addr+ (godawful number of) data, even when fed a stream of bytes! Remember this, kids, when you think that Apple could do so much better a job by using its own chipsets for either Macs or iOS devices.)

In other words, at that time it was ALL about how you write the data --- how you read/generate the data was a second-order effect.
It looks to me like CoreGraphics is doing the right thing (for some value of &quot;right&quot;) at 128x128, but not at larger sizes, and that this is a bug. The right thing is presumably using doubles to read and write the data. Possibly the 256 and 512 images were not 8-aligned, and so CoreGraphics fell back to UINT32 reads and writes. (Though this is a bug --- the correct thing in that case would be to use misaligned double reads, which will take a fairly minor hit on the memory read side, and aligned double writes.) Of course the larger right thing to do would be to update CoreGraphics to use VMX (again with unaligned reads handled using the two reads + permute VMX jiggery pokery), and VMX aligned writes. I don&#039;t know exactly what stage of OSX this was at, but let&#039;s hope that was done before the switch to Intel and then 10.6 made PPC optimizations irrelevant.

I&#039;m out of this game now, so I don&#039;t know the characteristics of PCI-e, but I suspect variants of the above ideas remain valid for iOS devices. Sadly Apple remains stuck in its same old ways of not providing developers with rich hardware specifications which describe these sorts of details and thereby allow one to write optimal code for the hardware.

=======================


BTW is there a bug in your pi tracking code? I wrote some quick Mathematica:
Prime[Range[num]] // (#^2 - 1)/#^2 &amp; // FoldList[Times, 1, #] &amp; // Sqrt[6/#] &amp; // N

to track how pi is approximated as the number of primes increases and we get something like
{2.44949, 2.82843, 3., 3.06186, 3.09359, 3.10646, 3.11569, 3.12109, 
3.12542, 3.12838, 3.13024, 3.13187, 3.13302, 3.13395, 3.1348, 
3.13551, 3.13607, 3.13652, 3.13694, 3.13729, 3.1376, 3.13789, 
3.13814, 3.13837, 3.13857, 3.13874, 3.13889, 3.13904, 3.13918, ... 3.14119 (at num primes=100)}
Note specifically that the number is always below pi.

I then ran a simulation of the idea --- generate 1,000,000 random integers between 0 and 10^35, test if they were relatively prime or not, yadda yadda, 
Sum[ Boole@CoprimeQ[RandomInteger[num2], RandomInteger[num2]], {i, num1}] // Sqrt[num1*6/#] &amp; // N

(which took a few seconds --- Mathematica plus modern CPUs are a pretty amazing combination)
and got on three successive runs 3.14276, 3.14127 and 3.14271
Obviously you have rather fewer than 1,000,000 samples; but even when I dropped to 1000 samples I was getting values like 3.18 or 3.15 --- nothing as extreme as 3.28. 
It seems like either people are deliberately gaming the system with the numbers they provide (so they are not truly random) or you have a bug in your code. It might be interesting to do the calculations again with numbers that are slightly less gamed --- using values input by different people against each other, or adding or subtracting one from the inputs.</description>
		<content:encoded><![CDATA[<p>At the time this article was written, THE problem with blitting on macs was the behavior of PCI/AGP.<br />
The important points to note are<br />
- PCI and AGP were way slower than main memory<br />
- these buses were 32 bits wide<br />
- they multiplexed data and address over the same lines BUT, and this is important, they could be run in a streaming mode where you set the first address, and then just shoveled out data, with the address implicitly incrementing<br />
- the chipsets in use by macs at this time had horrible limitations in their interactions with these buses</p>
<p>What does this all mean? It means the fastest way to write to the screen was in an ascending stream of writes of maximal width.<br />
You want an ascending stream of writes to utilize the implicit address increment. (This means that the fancy sorts of blit patterns that might appear to make sense when decoding certain types of data, e.g. interlaced images, or images comprised of square tiles, should not be used &#8212; just start at the top and proceed in raster order).<br />
You want as wide a write as possible to keep the queue between the memory bus and the PCI bus as full as possible, which in turn means the bridge chip is most likely to keep shoveling data to PCI rather than deciding there has been enough of a break between writes that it&#8217;s time for a new address to be sent. </p>
<p>The Intel bridge chips of the time (and of course still) provided write combining, so that no matter how you wrote your data to the PCI/AGP the chip would glom it into the largest possible units and then send these optimally. This meant that it didn&#8217;t much matter on PCs if you blitted using bytes, shorts, doubles, SSE or anything else. Those chipsets also seemed to allow for the maximal possible run length of data to PCI before resending an address. I could never get exact details on what the mac chips did because Apple, in their constant attempt to provide the best possible developer experience, never released the details, but the mac chips seemed to be operating on some sort of counter-based system, so that after N memory transactions on the memory bus side they would start a new PCI transaction. (It&#8217;s possible that they were sufficiently lame that N was, in this case, 1 &#8212; so a word write would result in a PCI addr+data transaction, a double write would give us addr+2data, and VMX addr+4data &#8212; meanwhile Intel chips were happily doing addr+ (godawful number of) data, even when fed a stream of bytes! Remember this, kids, when you think that Apple could do so much better a job by using its own chipsets for either Macs or iOS devices.)</p>
<p>In other words, at that time it was ALL about how you write the data &#8212; how you read/generate the data was a second-order effect.<br />
It looks to me like CoreGraphics is doing the right thing (for some value of &#8220;right&#8221;) at 128&#215;128, but not at larger sizes, and that this is a bug. The right thing is presumably using doubles to read and write the data. Possibly the 256 and 512 images were not 8-aligned, and so CoreGraphics fell back to UINT32 reads and writes. (Though this is a bug &#8212; the correct thing in that case would be to use misaligned double reads, which will take a fairly minor hit on the memory read side, and aligned double writes.) Of course the larger right thing to do would be to update CoreGraphics to use VMX (again with unaligned reads handled using the two reads + permute VMX jiggery pokery), and VMX aligned writes. I don&#8217;t know exactly what stage of OSX this was at, but let&#8217;s hope that was done before the switch to Intel and then 10.6 made PPC optimizations irrelevant.</p>
<p>I&#8217;m out of this game now, so I don&#8217;t know the characteristics of PCI-e, but I suspect variants of the above ideas remain valid for iOS devices. Sadly Apple remains stuck in its same old ways of not providing developers with rich hardware specifications which describe these sorts of details and thereby allow one to write optimal code for the hardware.</p>
<p>=======================</p>
<p>BTW is there a bug in your pi tracking code? I wrote some quick Mathematica:<br />
Prime[Range[num]] // (#^2 - 1)/#^2 &amp; // FoldList[Times, 1, #] &amp; // Sqrt[6/#] &amp; // N</p>
<p>to track how pi is approximated as the number of primes increases and we get something like<br />
{2.44949, 2.82843, 3., 3.06186, 3.09359, 3.10646, 3.11569, 3.12109,<br />
3.12542, 3.12838, 3.13024, 3.13187, 3.13302, 3.13395, 3.1348,<br />
3.13551, 3.13607, 3.13652, 3.13694, 3.13729, 3.1376, 3.13789,<br />
3.13814, 3.13837, 3.13857, 3.13874, 3.13889, 3.13904, 3.13918, &#8230; 3.14119 (at num primes=100)}<br />
Note specifically that the number is always below pi.</p>
<p>I then ran a simulation of the idea &#8212; generate 1,000,000 random integers between 0 and 10^35, test if they were relatively prime or not, yadda yadda,<br />
Sum[ Boole@CoprimeQ[RandomInteger[num2], RandomInteger[num2]], {i, num1}] // Sqrt[num1*6/#] &amp; // N</p>
<p>(which took a few seconds &#8212; Mathematica plus modern CPUs are a pretty amazing combination)<br />
and got on three successive runs 3.14276, 3.14127 and 3.14271<br />
Obviously you have rather fewer than 1,000,000 samples; but even when I dropped to 1000 samples I was getting values like 3.18 or 3.15 &#8212; nothing as extreme as 3.28.<br />
It seems like either people are deliberately gaming the system with the numbers they provide (so they are not truly random) or you have a bug in your code. It might be interesting to do the calculations again with numbers that are slightly less gamed &#8212; using values input by different people against each other, or adding or subtracting one from the inputs.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on &#8230;and statistics by ML</title>
		<link>http://ridiculousfish.com/blog/archives/2006/05/16/36/comment-page-2/#comment-1076</link>
		<dc:creator>ML</dc:creator>
		<pubDate>Tue, 27 Jul 2010 11:18:42 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/archives/2006/05/08/36/#comment-1076</guid>
		<description>While this post is somewhat old, I wanted to add another 2 cents. In our system we have had several problems with malloc performance on Mac OS X. Since we are delivering a library, we don&#039;t want to fiddle too much with the system malloc or the environment. We discovered the problem when we added multi-threading to our library. On Windows and Linux we got a nice speed-up (almost linear), but for Mac OS X the speed-up was much much worse. The reason turned out to be inefficient malloc. Adding a small caching layer for allocated blocks in our library solved the problem, and gave a nice speed-up even for single-threaded programs. On Windows and Linux the layer gave a small but not significant speed-up. 

Another problem was for some users that did lots of benchmarking in the same process. The program allocates a small amount of memory each time around, but in their case this was slightly larger each time. This eventually led to an Out of Memory error, even though all memory was released after each round in the benchmark!</description>
		<content:encoded><![CDATA[<p>While this post is somewhat old, I wanted to add another 2 cents. In our system we have had several problems with malloc performance on Mac OS X. Since we are delivering a library, we don&#8217;t want to fiddle too much with the system malloc or the environment. We discovered the problem when we added multi-threading to our library. On Windows and Linux we got a nice speed-up (almost linear), but for Mac OS X the speed-up was much much worse. The reason turned out to be inefficient malloc. Adding a small caching layer for allocated blocks in our library solved the problem, and gave a nice speed-up even for single-threaded programs. On Windows and Linux the layer gave a small but not significant speed-up. </p>
<p>Another problem was for some users that did lots of benchmarking in the same process. The program allocates a small amount of memory each time around, but in their case this was slightly larger each time. This eventually led to an Out of Memory error, even though all memory was released after each round in the benchmark!</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Will it optimize? by Wezko</title>
		<link>http://ridiculousfish.com/blog/archives/2010/07/23/will-it-optimize/comment-page-1/#comment-1075</link>
		<dc:creator>Wezko</dc:creator>
		<pubDate>Tue, 27 Jul 2010 09:56:09 +0000</pubDate>
		<guid isPermaLink="false">http://ridiculousfish.com/blog/?p=679#comment-1075</guid>
		<description>Nice stuff. A must-have for some developers. I have to admit I don&#039;t know GCC details but the other questions I answered right. We do a lot wrong in our code here in the office!</description>
		<content:encoded><![CDATA[<p>Nice stuff. A must-have for some developers. I have to admit I don&#8217;t know GCC details but the other questions I answered right. We do a lot wrong in our code here in the office!</p>
]]></content:encoded>
	</item>
</channel>
</rss>
