…and statistics
May 16th, 2006
The latest "OS X is slow" meme to impinge on the mass psyche of the Internet comes courtesy of one Jasjeet Sekhon, an associate professor of political science at UC Berkeley. The page has hit digg and reddit and been quoted on Slashdot. The article and benchmark are here. Is there any merit to this?
Once again, this discussion is only my meager opinion. I do not speak for Apple, and none of what I have to write represents Apple’s official position.
The article is filled with claims such as "OS X is incredibly slow by design," and while the BSD kernel is "excellent", the XNU kernel is "very inefficient and less stable" compared to Linux or BSD. However, without specifics, these assertions are meaningless; I will ignore them and concentrate on the technical aspects of what’s going on.
System calls
Sekhon does give one example of what he means. According to him,
For example, in Linux, the variables for a system call are passed directly using the register file. In OS X, they are packed up in a memory buffer, passed to a variety of places, and the results are then passed back using another memory buffer before the results are written back to the register file.
This isn’t true, as anyone can verify from Apple’s public sources. For example, here is the assembly for the open function (which, of course, performs the open system call):
mov $0x5,%eax
nop
nop
call 0x90110a70 <_sysenter_trap>
jae 0x90001f4c
call 0x90001f43
pop %edx
mov 268455761(%edx),%edx
jmp *%edx
ret
__sysenter_trap:
popl %edx
movl %esp, %ecx
sysenter
I don’t have a machine running Linux handy, but I do have a FreeBSD 5.4 machine, and Sekhon seems to hold BSD in high esteem. So let’s see how BSD does open:
mov $0x5,%eax
int $0x80
jb 0xa8c71cc
ret
The OS X version appears a bit longer because the BSD version moves its error handling to the close function. In fact, the above code is, if anything, more efficient in OS X, due to its use of the higher-performing "sysenter" instruction instead of the older "int 0x80" instruction. (Which isn’t to say that the total system call is necessarily faster – just the transition from user space to kernel land.) But all that aside, the point is that there is no “packed up into a memory buffer” going on, in either case.
On to the benchmark
According to Sekhon, OS X performed poorly on his statistical software relative to Windows and Linux, and I was able to reproduce his results on my 2 GHz Core Duo iMac with Windows XP and Mac OS X (I do not have Linux installed, so I did not test it). So yes, it’s really happening – but why?
A Shark sample shows that Mac OS X is spending an inordinate amount of time in malloc. After instrumenting Sekhon’s code, I see that it is allocating 35 KB buffers, copying data into these buffers, and then immediately freeing them. This is happening a lot – for example, to multiply two matrices, Sekhon’s code will allocate a temporary buffer to hold the result, compute the result into it, allocate a new matrix, copy the buffer into that, free the buffer, allocate a third matrix, copy the result into that, destroy the second matrix, and then finally the result gets returned. That’s three large allocations per multiplication.
Shark showed that the other major component of the test is the matrix multiplication, which is mostly double precision floating point multiplications and additions, with some loads and stores. Because OS X performs these computations with SSE instructions (though they are not vectorized) and Linux and Windows use the ordinary x87 floating point stack, we might expect to see a performance difference. However, this turned out to not be the case; the SSE and x87 units performed similarly here.
Since the arithmetic component of the test is hardware bound, Sekhon’s test is essentially a microbenchmark of malloc() and free() for 35 KB blocks.
malloc
Now, when allocating memory, malloc can either manage the memory blocks on the application heap, or it can go to the kernel’s virtual memory system for fresh pages. The application heap is faster because it does not require a round trip to the kernel, but some allocation patterns can cause "holes" in the heap, which waste memory and ultimately hurt performance. If the allocation is performed by the kernel, then the kernel can defragment the pages and avoid wasting memory.
Because most programmers understand that large allocations are expensive, and larger allocations produce more fragmentation, Windows, Linux, and Mac OS X will all switch over from heap-managed allocations to VM-managed allocations at a certain size. That size is determined by the malloc implementation.
Linux uses ptmalloc, which is a thread-safe implementation based on Doug Lea’s allocator (Sekhon’s test is single threaded, incidentally). R also uses the Lea allocator on Windows instead of the default Windows malloc. But on Mac OS X, it uses the default allocator.
It just so happens that Mac OS X’s default malloc does the "switch" at 15 KB (search for LARGE_THRESHOLD) whereas Lea’s allocator does it at 128 KB (search for DEFAULT_MMAP_THRESHOLD). Sekhon’s 35 KB allocations fall right in the middle.
So what this means is that on Mac OS X, every 35 KB allocation is causing a round trip to the kernel for fresh pages, whereas on Windows and Linux the allocations are serviced from the application heap, without talking to the kernel at all. Similarly, every free() causes another round trip on Mac OS X, but not on Linux or Windows. None of the defragmentation benefits of using fresh pages come into play because Sekhon frees these blocks immediately after allocating them, which is, shall we say, an unusual allocation pattern.
Like R on Windows, it’s a simple matter to compile and link against Lea’s malloc instead of the default one on Mac OS X. What happens if we do so?
| Mac OS X (default allocator) | 24 seconds |
| Mac OS X (Lea allocator) | 10 seconds |
| Windows XP | 10 seconds |
These results could be further improved on every platform by avoiding all of the gratuitous allocations and copying, and by using an optimized matrix multiplication routine such as those R provides via ATLAS.
In short
To sum up the particulars of this test:
- Linux, Windows, and Mac OS X service small allocations from the application heap and large ones from the kernel’s VM system in recognition of the speed/fragmentation tradeoff.
- Mac OS X’s default malloc switches from the first to the second at an earlier point (smaller allocation size) than do the allocators used on Windows and Linux.
- Sekhon’s test boils down to a microbenchmark of malloc()ing and then immediately free()ing 35 KB chunks.
- 35 KB is after Mac OS X switches, but before Linux and Windows switch. Thus, Mac OS X will ask the kernel for the memory, while Linux and Windows will not; it is reasonable that OS X could be slower in this circumstance.
- If you use the same allocator on Mac OS X that R uses on Windows, the performance differences all but disappear.
- Most applications are careful to avoid unnecessary large allocations, and will enjoy decreased memory usage and better locality with an allocator that relies more heavily on the VM system (such as on Mac OS X). In that sense, this is a poor benchmark. Sekhon’s code could be improved on every platform by allocating only what it needs.
Writing this entry felt like arguing on IRC; please don’t make me do it again. In that spirit, the following are ideas that I want potential authors of “shootoffs” to keep in mind:
- Apple provides some truly excellent tools for analyzing the performance of your application. Since they’re free, there’s no excuse for not using them. You should be able to point very clearly at which operations are slower, and give a convincing explanation of why.
- Apple has made decisions that adversely impact OS X’s performance, but there are reasons for those decisions. Sometimes the tradeoff is to improve performance elsewhere, sometimes it’s to enable a feature, sometimes it’s for reliability, sometimes it’s a tragic nod to compatibility. And yes, sometimes it’s bugs, and sometimes Apple just hasn’t gotten around to optimizing that area yet. Any exhibition of benchmark results should give a discussion of the tradeoffs made to achieve (or cause) that performance.
- If you do provide benchmark results, try to do so without using the phrase "reality distortion field."
The Internet!
Matt
Can we assume you sent Mr. Sekhon a helpful email?
Interesting article (particularly the malloc part). Though I’m not so sure you’ve disproved his claim that “For example, in Linux, the variables for a system call are passed directly using the register file. In OS X, they are packed up in a memory buffer, passed to a variety of places, and the results are then passed back using another memory buffer before the results are written back to the register file.”
What you posted makes me wonder if he might have been referring to something that happens in the system call trap before the correct system call handler gets a hold of the CPU. Now, I’ve never looked at the OS X source, so you may already know that isn’t the case, either. In either case, you should check that (if you haven’t) and post that information as well.
Very nicely done, sir. I applaud your thorough disputation of Mr. Sekhon’s claims and the gentle, factual manner in which you debunked them.
Also… It’s great to hear from you again.
Mark
Excellent article, thanks.
by the end of the day, after all the technical explanations, the results are there and the actual users just have to cope with the speed there is…
I have an 800 MHz G4/17″ and was disappointed with the speed from day 1… my old G3 iMac with OS 9 was much faster than the new iMac (at that time) I just bought…
Thanks for the explanations. They’re very interesting. As a casual programmer at best I wasn’t aware of those problems at all. That said, I haven’t done anything performance-critical yet…
As in this case the performance differs significantly, I wonder how often we can expect to see such phenomena? Would there be an easy way to increase OS X’s switching level to a higher number and see how/whether performance of normal applications is affected?
Stuart
kermit: You have missed the entire point of “all the technical explanations”. The point is that it’s slower *only* because it uses a very specific and unusual pattern of memory allocation that falls on different sides of a tradeoff in the allocators it used on each platform. All allocation schemes are a tradeoff of some kind, because there’s no best one-size-fits-all solution. That’s why anyone who uses a specific memory allocation pattern and really cares about performance can (your idea that people are stuck with the default is not correct) and should use an allocator that is optimized for their needs.
The benchmark results had nothing to do with the OS, and everything to do with allocator choices.
stuart: I am not a programmer, but I understood that one should optimize. In real life I can’t get rid of the impression that apps are written less efficiently. In the past, why could we have apps that needed a 10th of the memory that most apps are demanding now? Most of those apps aren’t doing so much more…(?). Besides that (I know this is off topic), why is Mac OS X slower on the same/similar hardware than others (Linux, Windows, OS 9)? I mean the interface: windows, response etc.
Great article, nicely done..
Has Mr. Sekhon read this?
CS Professor
As a professor of computer science, it is refreshing
to see someone take the time to actually _understand_
why two programs performed differently, instead of
making odd claims about Steve Jobs. Well done!
Jacob
kermit,
Basically, the interface is “slower” because it a) does more complicated rendering, allowing things like transparency and drop shadows, and b) doesn’t sacrifice quality in display shortcuts—for instance, if you resize a Windows window, the content does not resize until you release your mouse, while on OS X, the content resizes on-the-fly. In this case, “slower” means choosing quality and consistency over brute speed, a trade-off that many prefer.
q00p
Brilliant! Keep up the good work…
J Osborne
” In the past, why could we have apps that needed a 10th of the memory that most apps are demanding now?”
For one, people are more demanding. In 1982 a word processor could display all the text on screen in one font and flow the text edge to edge, and only do font stuff and proper text flows when you printed! In the last 20 years that has become unacceptable. In 1982 a word processor that let you put pictures in the text was a “page layout program”. In 2006 a word processor that doesn’t let you put pictures in the text is called “crap!”. Spellcheckers used to be another program you bought and ran before you printed. Now they are expected to be everywhere and run all the time. Similar patterns follow for other things. All of these extra demands cost something.
Second, programmers don’t have a lot of time to write applications, so they will focus on what people want, and they will make it efficient enough for whatever computer they can in the time they have. If that happens to not be efficient enough for the average computer they can normally convince management to give them “a little more time” to make it faster. If it is totally fast enough on the average computer do you think they will be able to convince management to delay shipping the product for a little while just so they can make it even faster? And would you want them to?
jtl
Sekhon is correct about Darwin’s system calls. You showed the libc code executed at user. He was talking about the system call handler executed within the kernel. I’ve compared Darwin vs. Linux vs. L4 on PowerPC for system calls using performance counters, and Darwin has massive overhead compared to the other two, including cache and TLB footprint.
YES!
Thank you for owning that incompetent benchmarking troll.
fez @ ummyeah.com
I seem to remember Jim Maggee at Apple saying that Darwin likes to allocate larger blocks of memory up front compared to other OSes. I assume this has to do with running Cocoa apps; it is probably more efficient that way.
Actually that should be Jim Magee, though I am not sure why he would be commenting on user-land stuff, maybe I’m thinking of Quinn ….
Nicely done.. Another great article..
Post more often!
hyperden
These comparisons are interesting, but really, many users choose an operating system because they prefer a certain interface, not always because it is faster than another….
Sekhon is correct about Darwin’s system calls. You showed the libc code executed at user. He was talking about the system call handler executed within the kernel. I’ve compared Darwin vs. Linux vs. L4 on PowerPC for system calls using performance counters, and Darwin has massive overhead compared to the other two, including cache and TLB footprint.
Ah, so I was right. By any chance have you attempted to quantify that? Windows comparisons would also be nice.
Anonymous
“Ah, so I was right. By any chance have you attempted to quantify that? Windows comparisons would also be nice.”
How about AnandTech’s benchmarks of system bottlenecks using Lmbench 2.04? Darwin is slower than the 2.6 linux kernel in every one sometimes by several times. For example, OS X is between 2 and 5 times slower in creating new threads!
http://www.anandtech.com/mac/showdoc.aspx?i=2436&p=8
Sekhon is right about Darwin system calls.
jacob, j osborne: thanks for clarifying but I think you’ve confirmed my first post. The end user has to settle with the speed there is…
jtl
Ah, so I was right. By any chance have you attempted to quantify that? Windows comparisons would also be nice.
Yes, I’ve quantified it. To collect highly accurate data, I even wrote my own syscall stubs in assembler to bypass libc overhead (because on L4 the kernel has very little overhead). I also measured IPC (Mach IPC vs. L4 IPC vs. Linux pipe vs. BSD pipe). But I’ve never published the data because Apple is well aware of their performance problems, and I don’t relish in posting negative data.
STFU
Seems the only solution the mac zealots would accept would be that osx isnt dog slow. How…..juvenile
Johannes Fortmann
As always, your articles are a wonderful read, giving concise explanations and flawless citing. Thank you!
BTW, I’ve even thought of sharking this thing myself, but I didn’t have the nerve to get his modifications to R running.
Oh, and thanks again for HexFiend!
Moses
“…juvenile” shows their maturity using “STFU” as their name for the post.
Would you be satisfied driving a pedal-powered soap-box to work every day because it gets an unbelievable infinite miles per gallon of gasoline? So much for 100% optimization.
Me thinks you have made a few sacrifices, like maybe more comfortable seats, a radio, maybe even a CD player.
You show me an OS, that is equally stable and has the interface features (Expose, Dashboard, World of Warcraft in a window) and I’ll take a look.
I personally am not interested in killing processes in task manager every 20 minutes and rebooting twice a day on a plain-jane OS.
* Switched to Mac 17 Months ago. Haven’t looked back.
How about AnandTech’s benchmarks of system bottlenecks using Lmbench 2.04? Darwin is slower than the 2.6 linux kernel in every one sometimes by several times. For example, OS X is between 2 and 5 times slower in creating new threads!
I remember that. That was laughable. Did you by chance notice that they ran OS X and Linux on DIFFERENT CPUs (different architectures, even), and compared them?
Yes, I’ve quantified it. To collect highly accurate data, I even wrote my own syscall stubs in assembler to bypass libc overhead (because on L4 the kernel has very little overhead). I also measured IPC (Mach IPC vs. L4 IPC vs. Linux pipe vs. BSD pipe). But I’ve never published the data because Apple is well aware of their performance problems, and I don’t relish in posting negative data.
You really should post it. I want to know
I would be interested to see the results for a rather large allocation on each of the OS’s. Does anyone have the time to do such a test?
I would be interested to see the results for a rather large allocation on each of the OS’s. Does anyone have the time to do such a test?
“Rather large” being allocations from the OS memory manager (kernel mode)?
Buck
Hey! Thanks for this article! Looking forward to more of these! It was a great and a very refreshing read.
Moses:
Which “plain-jane OS” are you referring to? I’ve got a Windows XP box that’s been up for about two months now… and while it’s not exactly 200 days like my Linux fileserver, it’s easily more than 12 hours… and I haven’t had to Task Manager a single app. If this is happening to your computer, you’ve either got a hardware problem or you run software for idiots (super smilies and the like). I guess I’m expecting too much though, how would you know anything about an OS you haven’t used in a year and a half (17 months) anyway?
JV
I remember that. That was laughable. Did you by chance notice that they ran OS X and Linux on DIFFERENT CPUs (different architectures, even), and compared them?
No, they also had the same set of benchmarks on THE SAME POWER PC cpu comparing Linux vs. OSX and the performance degradations are definitely there. If you read the article closely you will also notice that they had contacted both Apple engineers and IBM engineers who’ve pretty much confirmed that OSX really did have some issues.
And those issues aren’t just about optimization for certain usage patterns, they are present in some fundamental routines in the simplest of cases as shown in Anandtech’s microbenchmark comparisons.
Imagine if you have a shell script that in turn heavily exercised those problematic system calls (lots of scripts, lots of forking, which includes subprocess calls, pipes, etc.). You can’t really “optimize” that shell script, and even by this article’s admission, sometimes there are penalties you can’t avoid.
Alex Chejlyk
Without benchmarks, Macs feel slower than Linux and Windows PC’s. One of the largest problems is the Mac hardware – always behind the curve. Look at the fastest Mac you can buy then take a look at the fastest processors available – Mac’s are like 6 months behind – that makes them slower – benchmarks aside. Couple the older hardware with a cpu/gpu intensive GUI, it just runs more slowly than the competition. To top it off, they charge way too much (imo) for the systems. The hardware is the same as what you can get in a high end pc – just marked up an additional 30%.
Macs are slower….
My 2 cents,
Alex
No, they also had the same set of benchmarks on THE SAME POWER PC cpu comparing Linux vs. OSX and the performance degradations are definitely there.
Read it again. You might also want to have a look at the system configurations. Where exactly does it say it used Linux on PPC, on either of those pages?
This still doesn’t address the threading issues that OS X has.
Those of you who are arguing about Darwin’s syscalls being slower have utterly missed the point. The POINT is that this is a perfect example of knowing just enough to be dangerous, but not enough to actually solve the problem. The slowdown is completely, utterly, unrelated to syscalls or the OS memory management. Period. It’s like saying your code runs slower because OSX has more whitespace in its header files — it’s just not true.
In fact, it’s a perfect example of the kind of conclusion-jumping that leads to premature optimization. If you don’t really understand why your code is not performing well, you will never be able to speed it up.
There is also a subtext here of “well, it performs well on platform X where I originally wrote it, and doesn’t perform well on platform Y, so therefore Y sucks.” No, dumbass. Different systems make different design choices. There’s no way X and Y will have identical performance with all sets of input unless they are identical. If your code doesn’t perform well on platform Y, really what that says to someone in the industry is that you didn’t do a good job writing your code.
JV
A wrong hypothesis does not negate the results. Furthermore, most of the evidence shows that if there’s any optimization that needs to be done, it’s most likely IN OSX itself rather than the user programs. Even without the correct conclusion of why these slowdowns happen, you can easily see where it happens. And it happens at these low level operations, such as system calls, which is as granular as user programs can get while still being portable.
In any case, why do you claim it’s not related to system calls or memory allocation when it has been traced down to these points, which have been shown to have the largest performance differences (i.e. the largest time delta when timing parts of the same program running on different OSes)? There has to be something along its code path, and therefore related, to cause the slowdowns, even if it were asynchronous (i.e. the code path triggers some other thread to do something, then sometime later waits for that thread).
Anonymous
well done. very interesting reading.
Hakime
Anandtech did not provide any solid proof of the statement that OS X is slow to create threads. They used Lmbench to do some benchmarks and say that, but LMbench (you can check at the tool web site http://www.bitmover.com/lmbench/ or contact the authors of the tool) does not measure anything about threads, ANYTHING AT ALL.
Write a small program that creates, say, 60 threads at once (like in the case of their MySQL test), each of which increments a shared variable by 1, and run it; you won’t see any difference in performance between OS X and Linux. The statement of Anandtech is simply WRONG, made by people who do not have a clear understanding of how things work. The observed performance on OS X with MySQL has other causes that I am investigating, but it’s not about creating threads per se.
please_fix_osx
ok, so these are no problems but “tradeoffs” and “design decisions” and whatnot. would someone please explain the idea behind the “design decision” that realloc() does _nothing_ when the old buffer is larger than the new size? try allocating lots of memory many times, than try shrinking the allocated memory. put this in a loop. mac os x will crash, $foo won’t. WTF?
———————-
void *realloc(void *old_ptr, size_t new_size) {
    malloc_zone_t *zone;
    size_t old_size = 0;
    if (!old_ptr) return malloc_zone_malloc(inline_malloc_default_zone(), new_size);
    zone = find_registered_zone(old_ptr, &old_size);
    if (zone && (old_size >= new_size)) return old_ptr;
    if (!zone) zone = inline_malloc_default_zone();
    return malloc_zone_realloc(zone, old_ptr, new_size);
}
———————-
http://cvs.opendarwin.org/cgi-bin/cvsweb.cgi/Libc/gen/malloc.c?rev=1.1.1.1&content-type=text/x-cvsweb-markup&cvsroot=apple
test code:
———————-
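#include <stdio.h>
#include <stdlib.h>

#define NUM_ALLOCATIONS 100000
#define ALLOC_SIZE 10485760
#define ALLOC_RESIZE 1492

int main(int argc, char **argv) {
    int i;
    /* NOTE: the loop header and the malloc/realloc calls were lost from the
       posted listing; they are reconstructed from the surviving lines. */
    /* exiting will free all this leaked memory */
    for (i = 0; i < NUM_ALLOCATIONS; i++) {
        size_t orig_size = ALLOC_SIZE;
        size_t new_size = ALLOC_RESIZE;
        void *orig_ptr = malloc(orig_size);
        void *new_ptr;
        if (orig_ptr == NULL) {
            printf("failure to malloc %d\n", i);
            abort();
        }
        /* shrink the 10 MB block to 1492 bytes; it is never freed */
        new_ptr = realloc(orig_ptr, new_size);
        printf("%zu[%p] -> %zu[%p]\n",
               orig_size, orig_ptr, new_size, new_ptr);
        if (new_ptr == NULL) {
            printf("failure to realloc %d\n", i);
            abort();
        }
    }
    return 0;
}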
———————-
http://mail.python.org/pipermail/python-bugs-list/2005-January/027027.html
i know that this is consistent with the man page. but i have no idea how this is supposed to make any sense. please enlighten me. the rdf is kinda weak where i live and i’d rather use osx because it is in fact better than other OSes.
Jacob
kermit, also note that Windows Vista and XGL on Linux should cause similar speed hits to the GUI. Slower, but better GUIs are where the whole industry is going. Apple is just ahead of the curve.
kelly
“Thanks to the help of a variety of developers working at Apple and elsewhere, the large OS X performance gap reported here has been eliminated. This can be accomplished by either algorithmic changes or the use of an alternative memory allocator for OS X. Some performance differences remain between the three operating systems, and this page will be updated shortly to report the new benchmarks. ”
I wonder if you make a claim about how slow vista is microsoft developers will help you?
Shaun
Hakime, didn’t Dominic Giampolo at Apple suggest the slowdown was because of MySQL buffering and the way Apple is ultra-safe in flushing buffers in its HFS+ filesystem?
I wonder what the results would have been if Anand had used the same filesystem across all its tests. I too wish these benchmarkers would use the profiling tools when testing. It’s really quite shocking that they punt the results out yet guess why there are huge disparities.
daniel Lord
Interesting rabbit hole to go down with all this (including the original ‘benchmark’ that caused it) but in the end, it is actual end-to-end performance that counts. As far as the latest benchmarks show, OS X on my Intel Core Duo MacBook Pro can run applications I use every day just as fast as any comparable Windows laptop. And it crashes less, doesn’t get viruses, looks ‘purtie’, has a far-superior UI, and runs UNIX and bash. Wahoo! Beyond that, while academic, I don’t really give a fig about timing system calls in isolation.
Hello
Try launching new threads and processes from OSX. And then get a coffee, because it’s that slow.
Tachyon
MacOS X _IS_ slow compared to other modern UNIX like OS’s. The main reason being the stupid decision to use a Mach kernel under the main kernel.
Years of bad programming habits from the Pre OS-X days.
Whoo! Your article is spreading more than Sekhon’s, it seems. Which is great, I really love the insightful stuff you write, but I usually never see it linked anywhere… (I’m the original digger)
jtl
Those of you who are arguing about Darwin’s syscalls being slower have utterly missed the point. The POINT is that this is a perfect example of knowing just enough to be dangerous, but not enough to actually solve the problem.
No, the point is that the beginning of this blog entry is factually incorrect. He is trying to disprove Sekhon with bad technical analysis, and teaching the blog readers bad facts, in regards to Darwin’s system calls. Sekhon is absolutely correct in that Darwin system calls are insanely slow, and his technical reason is also correct. Anyone can see the Darwin system call handling here:
http://cvs.opendarwin.org/cgi-bin/cvsweb.cgi/src/xnu/bsd/dev/ppc/systemcalls.c?rev=1.3&content-type=text/x-cvsweb-markup
In contrast is Linux’s system call handler:
http://lxr.linux.no/source/arch/ppc/kernel/entry.S
See DoSyscall.
They have fundamentally different approaches to kernel development.
jtl
of course, the rest of the blog entry is good stuff …
ridiculous_fish
Hi JTL, thanks for reading!
Thanks also for your correction regarding the in-kernel behavior of OS X vis-a-vis Linux. It does look like Linux has less to do than does OS X upon entering the kernel, but it also seems that OS X is more featureful here. A few examples that come to mind:
Mac OS X has a 4/4 user/kernel address space split; Linux (in its default configuration) has a 3/1 split, with the kernel memory mapped into the application memory. This means that copying data into or out of the kernel for system calls is faster in Linux, but that applications have 25% less address space available (and drivers 75% less).
Mac OS X supports both Mach and BSD system calls, enabling features like Mach messaging, which really is quite nifty. (Linux’s IPC is comparatively anemic, as far as I know.) Of course, supporting both of these system calls (through the same trap mechanism, no less) incurs overhead.
Mac OS X supports 64 bit processes on a 32 bit kernel, which adds some overhead in the call process. I don’t believe Linux can do this at all, but please correct me if I’m wrong.
If you will permit me to be a little snarky, quick, which kernel has the “massive overhead:”
The kernel that needs to copy data across address space boundaries every system call.
The kernel that consumes a gigabyte of address space in every process.
Perhaps both! Perhaps it depends on your point of view.
Incidentally, the OS X file you linked was changed for Tiger. The new one is here. (Sorry about the required login.)
An additional point is that the syscall entry stub itself has very little to do with the performance of the actual system call. Most system calls do something complicated, and the use of a few instructions more or less at the entry point is dwarfed by the cost of the actual operation.
StatsMonkey
Thank you for investigating Sekhon’s claims. I found his article a few months ago while trying to figure out why R was so slow on my beloved Mac and how I could (perhaps) recompile R to make it faster. All I found were Sekhon’s rants and dubious claims, and no answers to my questions. Personally, I’d rather go get a cup of coffee while doing something computationally intensive on my Mac in R than switch to the (*cough*) PC that sits in the back corner of my desk. I’m looking forward to speedier stats results! Many thanks!
Michael Long
Makes you wonder about the developer, doesn’t it? If I were doing a program that allocated and deallocated a ton of fixed-sized blocks I’d spend 5 minutes implementing a pool to manage those requests, and bypass the OS calls entirely.
Of course, then his program would run even faster on all platforms, and what would he have to complain about?
I’d say consuming a gigabyte of address space in every process is not a big concern…because I’m expecting everyone to be using 64-bit machines soon. So in your trade-off, I’d say Linux wins.
I’m not sure that’s an accurate trade-off, though. Doesn’t Linux still copy data in from userspace before doing anything with it? (In each system call; you wouldn’t see it in the generic entry code.) I don’t think xnu invented this practice.
John C. Randolph
Scott,
The practice of copying data to move it across the user/kernel barrier goes back a very long time. IIRC, it’s been that way since the earliest UNIX implementations. I remember a project called IOLite, which was I believe the first copy-free I/O system for BSD UNIX. You should be able to find the paper with Google, if you want a good description of what’s wrong with UNIX’s design in this area.
-jcr
JSS
I have updated my benchmarks. I have significantly improved the efficiency of my code on all platforms, and I now link against dmalloc on OS X. But an OS X performance issue remains.
http://sekhon.berkeley.edu/macosx/
Suggestions are most welcome.
JS.
Nice article. I sorta understood it.
Louis Duran
Wow, even poli-sci professors at Berkeley know about computing. When’s the last time a poli-sci professor you know compared malloc to gmalloc?