...and statistics
May 16th, 2006

The latest "OS X is slow" meme to impinge on the mass psyche of the Internet comes courtesy of one Jasjeet Sekhon, an associate professor of political science at UC Berkeley. The page has hit digg and reddit and been quoted on Slashdot. The article and benchmark is here. Is there any merit to this?

Once again, this discussion is only my meager opinion. I do not speak for Apple, and none of what I have to write represents Apple's official position.

The article is filled with claims such as "OS X is incredibly slow by design," and that while the BSD kernel is "excellent", the XNU kernel is "very inefficient and less stable" compared to Linux or BSD. However, without specifics, these assertions are meaningless; I will ignore them and concentrate on the technical aspects of what's going on.

System calls

Sekhon does give one example of what he means. According to him,

For example, in Linux, the variables for a system call are passed directly using the register file. In OS X, they are packed up in a memory buffer, passed to a variety of places, and the results are then passed back using another memory buffer before the results are written back to the register file.

This isn't true, as anyone can verify from Apple's public sources. For example, here is the assembly for the open function (which, of course, performs the open system call):

	mov	$0x5,%eax			# 5 is the open system call number
	nop
	nop
	call	0x90110a70 <_sysenter_trap>	# transition into the kernel via sysenter
	jae	0x90001f4c <_open+28>		# carry clear (no error): jump ahead to the ret
	call	0x90001f43 <_open+19>		# error path: grab the program counter,
	pop	%edx				#   load the address of the error handler,
	mov	268455761(%edx),%edx		#   and jump to it so errno gets set
	jmp	*%edx
	ret
	
__sysenter_trap:
	popl %edx				# return address, handed to the kernel in %edx
	movl %esp, %ecx				# user stack pointer, handed over in %ecx
	sysenter				# fast transition into the kernel
I don't have a machine running Linux handy, but I do have a FreeBSD 5.4 machine, and Sekhon seems to hold BSD in high esteem. So let's see how BSD does open:
	mov    $0x5,%eax	# 5 is the open system call number
	int    $0x80		# transition into the kernel via the older int 0x80
	jb     0xa8c71cc 	# carry set (error): branch to shared error-handling code
	ret
The OS X version appears a bit longer because the BSD version moves its error handling to the close function. In fact, the above code is, if anything, more efficient in OS X, due to its use of the higher-performing "sysenter" instruction instead of the older "int 0x80" instruction. (Which isn't to say that the total system call is necessarily faster - just the transition from user space to kernel land.) But all that aside, the point is that there is no "packed up into a memory buffer" going on, in either case.

On to the benchmark

According to Sekhon, OS X performed poorly on his statistical software relative to Windows and Linux, and I was able to reproduce his results on my 2 GHz Core Duo iMac with Windows XP and Mac OS X (I do not have Linux installed, so I did not test it). So yes, it's really happening - but why?

A Shark sample shows that Mac OS X is spending an inordinate amount of time in malloc. After instrumenting Sekhon's code, I see that it is allocating 35 KB buffers, copying data into these buffers, and then immediately freeing them. This is happening a lot - for example, to multiply two matrices, Sekhon's code will allocate a temporary buffer to hold the result, compute the result into it, allocate a new matrix, copy the buffer into that, free the buffer, allocate a third matrix, copy the result into that, destroy the second matrix, and then finally the result gets returned. That's three large allocations per multiplication.
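In C, that pattern looks roughly like the following sketch. The Matrix type and the helper names are my own, invented for illustration; this is not Sekhon's actual code.

	#include <stdlib.h>
	#include <string.h>
	
	/* A hypothetical matrix type and helpers, for illustration only. */
	typedef struct { int rows, cols; double *data; } Matrix;
	
	Matrix *make_matrix(int rows, int cols) {
	    Matrix *m = malloc(sizeof *m);
	    m->rows = rows;
	    m->cols = cols;
	    m->data = malloc((size_t)rows * cols * sizeof *m->data);
	    return m;
	}
	
	void free_matrix(Matrix *m) { free(m->data); free(m); }
	
	/* The pattern described above: three large allocations and two copies
	   to hand back a single product, with the scratch buffer freed
	   immediately after it is filled. */
	Matrix *multiply(const Matrix *a, const Matrix *b) {
	    size_t n = (size_t)a->rows * b->cols;
	    double *scratch = malloc(n * sizeof *scratch);     /* allocation #1 */
	    /* ... compute the product into scratch ... */
	    Matrix *tmp = make_matrix(a->rows, b->cols);       /* allocation #2 */
	    memcpy(tmp->data, scratch, n * sizeof *scratch);
	    free(scratch);                                     /* freed right away */
	    Matrix *result = make_matrix(a->rows, b->cols);    /* allocation #3 */
	    memcpy(result->data, tmp->data, n * sizeof *scratch);
	    free_matrix(tmp);
	    return result;
	}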

Shark showed that the other major component of the test is the matrix multiplication itself, which is mostly double precision floating point multiplications and additions, with some loads and stores. Because OS X performs these computations with SSE instructions (though they are not vectorized) while Linux and Windows use the ordinary x87 floating point stack, we might expect to see a performance difference. However, this turned out not to be the case; the SSE and x87 units performed similarly here.

Since the arithmetic component of the test is hardware bound, Sekhon's test is essentially a microbenchmark of malloc() and free() for 35 KB blocks.
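To see this for yourself, the whole thing can be reduced to a few lines of C. This is my own sketch, not Sekhon's code; it simply allocates, touches, and immediately frees 35 KB blocks in a loop, which is where his test spends its time.

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>
	
	/* Repeatedly allocate, touch, and immediately free a 35 KB block,
	   which is what the benchmark spends its time doing. */
	int main(void) {
	    const size_t size = 35 * 1024;
	    const int iterations = 1000000;
	    clock_t start = clock();
	    for (int i = 0; i < iterations; i++) {
	        char *block = malloc(size);
	        memset(block, 0, size);    /* touch the memory, as the copies do */
	        free(block);
	    }
	    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
	    printf("%d malloc/free pairs of %zu bytes: %.2f seconds\n",
	           iterations, size, seconds);
	    return 0;
	}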

malloc

Now, when allocating memory, malloc can either manage the memory blocks on the application heap, or it can go to the kernel's virtual memory system for fresh pages. The application heap is faster because it does not require a round trip to the kernel, but some allocation patterns can cause "holes" in the heap, which waste memory and ultimately hurt performance. If the allocation is performed by the kernel, then the kernel can defragment the pages and avoid wasting memory.

Because most programmers understand that large allocations are expensive, and larger allocations produce more fragmentation, Windows, Linux, and Mac OS X will all switch over from heap-managed allocations to VM-managed allocations at a certain size. That size is determined by the malloc implementation.
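Conceptually, the decision looks something like the sketch below. The threshold, the function name, and the fallback to the system malloc are all made up for illustration; no real allocator is this simple.

	#include <stdlib.h>
	#include <sys/mman.h>
	
	/* A toy version of the small/large split; the threshold and function
	   name are hypothetical and do not match any real allocator's internals. */
	#define TOY_LARGE_THRESHOLD (64 * 1024)
	
	void *toy_malloc(size_t size) {
	    if (size >= TOY_LARGE_THRESHOLD) {
	        /* Large request: ask the kernel for fresh zero-filled pages. */
	        void *pages = mmap(NULL, size, PROT_READ | PROT_WRITE,
	                           MAP_ANON | MAP_PRIVATE, -1, 0);
	        return pages == MAP_FAILED ? NULL : pages;
	    }
	    /* Small request: carve the block out of the heap without entering
	       the kernel (stubbed out here with the system malloc). */
	    return malloc(size);
	}

A real allocator also has to remember which path each block came from, so that free() can either hand the pages back to the kernel with munmap() or return the block to the heap's free lists.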

Linux uses ptmalloc, which is a thread-safe implementation based on Doug Lea's allocator (Sekhon's test is single threaded, incidentally). On Windows, R (the statistical environment Sekhon's benchmark runs on) also uses the Lea allocator instead of the default Windows malloc. But on Mac OS X, it uses the default allocator.

It just so happens that Mac OS X's default malloc does the "switch" at 15 KB (search for LARGE_THRESHOLD) whereas Lea's allocator does it at 128 KB (search for DEFAULT_MMAP_THRESHOLD). Sekhon's 35 KB allocations fall right in the middle.

So what this means is that on Mac OS X, every 35 KB allocation is causing a round trip to the kernel for fresh pages, whereas on Windows and Linux the allocations are serviced from the application heap, without talking to the kernel at all. Similarly, every free() causes another round trip on Mac OS X, but not on Linux or Windows. None of the defragmentation benefits of using fresh pages come into play because Sekhon frees these blocks immediately after allocating them, which is, shall we say, an unusual allocation pattern.

Just as R does on Windows, we can compile and link against Lea's malloc instead of the default allocator on Mac OS X. What happens if we do so?

Mac OS X (default allocator): 24 seconds
Mac OS X (Lea allocator): 10 seconds
Windows XP: 10 seconds
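If you want to try the same substitution yourself, here is a sketch of one way it might be done; the file names are hypothetical, and this is not necessarily how R's Windows build does it. Doug Lea's malloc.c, compiled without its USE_DL_PREFIX option, supplies its own malloc, free, calloc, and realloc, so linking its object file into a program lets those definitions stand in for the default allocator.

	/* A sketch of trying the substitution by hand; the file names are
	 * hypothetical.  Build roughly like so:
	 *
	 *     cc -O2 -c malloc.c            (Doug Lea's allocator)
	 *     cc -O2 alloctest.c malloc.o
	 */
	#include <stdio.h>
	#include <stdlib.h>
	
	int main(void) {
	    char *block = malloc(35 * 1024);   /* now serviced by Lea's allocator */
	    block[0] = 1;                      /* touch it */
	    printf("35 KB block at %p\n", (void *)block);
	    free(block);
	    return 0;
	}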

These results could be further improved on every platform by avoiding all of the gratuitous allocations and copying, and by using an optimized matrix multiplication routine such as those R provides via ATLAS.
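As an illustration of what avoiding the gratuitous allocations might look like, here is a sketch (my own, not a patch to Sekhon's code or to R) that computes the product directly into the matrix that gets returned, using the same hypothetical Matrix type as before. The naive inner loop could of course be replaced by a call into an optimized routine such as ATLAS's.

	#include <stdlib.h>
	
	/* Same hypothetical Matrix type as in the earlier sketch. */
	typedef struct { int rows, cols; double *data; } Matrix;
	
	/* One allocation and no copies: compute the product directly into the
	   matrix that will be returned.  (Assumes a->cols == b->rows.) */
	Matrix *multiply_direct(const Matrix *a, const Matrix *b) {
	    Matrix *result = malloc(sizeof *result);
	    result->rows = a->rows;
	    result->cols = b->cols;
	    result->data = calloc((size_t)a->rows * b->cols, sizeof *result->data);
	    for (int i = 0; i < a->rows; i++)
	        for (int k = 0; k < a->cols; k++)
	            for (int j = 0; j < b->cols; j++)
	                result->data[(size_t)i * b->cols + j] +=
	                    a->data[(size_t)i * a->cols + k] *
	                    b->data[(size_t)k * b->cols + j];
	    return result;
	}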

In short

To sum up the particulars of this test:

  • Linux, Windows, and Mac OS X service small allocations from the application heap and large ones from the kernel's VM system in recognition of the speed/fragmentation tradeoff.
  • Mac OS X's default malloc switches from the first to the second at an earlier point (smaller allocation size) than do the allocators used on Windows and Linux.
  • Sekhon's test boils down to a microbenchmark of malloc()ing and then immediately free()ing 35 KB chunks.
  • 35 KB falls above the point where Mac OS X switches, but below the point where Linux and Windows do. Thus, Mac OS X will ask the kernel for the memory while Linux and Windows will not, and it is reasonable that OS X could be slower in this circumstance.
  • If you use the same allocator on Mac OS X that R uses on Windows, the performance differences all but disappear.
  • Most applications are careful to avoid unnecessary large allocations, and will enjoy decreased memory usage and better locality with an allocator that relies more heavily on the VM system (such as on Mac OS X). In that sense, this is a poor benchmark. Sekhon's code could be improved on every platform by allocating only what it needs.

Writing this entry felt like arguing on IRC; please don't make me do it again. In that spirit, the following are ideas that I want potential authors of "shootoffs" to keep in mind:

  • Apple provides some truly excellent tools for analyzing the performance of your application. Since they're free, there's no excuse for not using them. You should be able to point very clearly at which operations are slower, and give a convincing explanation of why.
  • Apple has made decisions that adversely impact OS X's performance, but there are reasons for those decisions. Sometimes the tradeoff is to improve performance elsewhere, sometimes it's to enable a feature, sometimes it's for reliability, sometimes it's a tragic nod to compatibility. And yes, sometimes it's bugs, and sometimes Apple just hasn't gotten around to optimizing that area yet. Any exhibition of benchmark results should give a discussion of the tradeoffs made to achieve (or cause) that performance.
  • If you do provide benchmark results, try to do so without using the phrase "reality distortion field."