Mystery
June 3rd, 2005
I’m sure you’ve seen it too, ’cause it was on Slashdot and if you’re fishing here, you’re definitely an online junkie. I’m talking about that Anandtech article, of course. The one that tries to compare OS X to Linux and a PowerPC to an x86. Lemme see…this one. No more mysteries, they promise!
None of it’s pleasant, but what’s the worst part? The mySQL results. I know it’s painful – you don’t have to look again. All right. So why was the G5, at best, 2/3 the speed of any of the other machines?
I don’t have an official or authoritative answer. But I think this might have a lot to do with it.
When you commit a transaction to a database, you want to be sure that the data is fully written. If your machine loses power half a second after the transaction completes, you want to know that the data made it to disk. To enable this, Mac OS X provides the F_FULLFSYNC command, which you call with fcntl(). This forces the OS to write all the pending data to the disk drive, and then forces the disk drive to write all the data in its write cache to the platters. Or that is, it tries to – some ATA and Firewire drives lie and don’t actually flush the cache. (The check’s in the mail, really…)
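Here is a minimal sketch of the two levels of flushing described above, written in Python rather than C for brevity. The function name `durable_write` is made up for illustration, and the code assumes Python's `fcntl` module only exposes `F_FULLFSYNC` on Mac OS X; elsewhere it falls back to a plain `fsync()`. This is not how MySQL does it, just the general shape of the idea.

```python
import fcntl
import os


def durable_write(path: str, data: bytes) -> str:
    """Write data, then flush as far toward the platters as the OS allows.

    Returns the name of the strongest flush that was issued.
    (Hypothetical helper, for illustration only.)
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # push Python's own buffer down to the OS
        # fsync() hands pending data to the drive; the drive may still
        # be holding it in its write cache afterwards.
        os.fsync(f.fileno())
        full_sync = getattr(fcntl, "F_FULLFSYNC", None)
        if full_sync is None:
            return "fsync"  # constant not available on this platform
        try:
            # Mac OS X: additionally ask the drive to flush its write cache.
            fcntl.fcntl(f.fileno(), full_sync)
        except OSError:
            return "fsync"  # file system or drive refused the request
        return "F_FULLFSYNC"
```

Something like `durable_write("/tmp/txn.log", b"commit")` returns which flush actually happened, which makes the platform difference easy to see.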
F_FULLFSYNC is pretty slow. But if OS X didn’t do it, you might end up with no data written or partial data written, even out-of-order writes, if you lose power suddenly.
Well! mySQL performs an F_FULLFSYNC on OS X, and not on Linux; as far as I know Linux doesn’t provide a way to do this.
It’s true that mySQL calls fsync() on both, but fsync() doesn’t force the drive to flush its write cache, so it doesn’t necessarily write out the data. Check out http://dev.mysql.com/doc/mysql/en/news-4-1-9.html and Dominic’s comments at the bottom. Oh, and if you missed it, above, look at this Apple mailing list post.
So OS X takes a performance hit in order to fulfill the contract of transactions. Linux is faster, but if you lose your wall juice, your transaction may not have been written, or may have been only partially written, even though it appeared to succeed. And that’s my guess as to the main reason OS X benchmarked slower on mySQL.
Again, this isn’t an official explanation, and I’m not qualified to give one. But given that Anandtech missed this issue entirely, I’m not sure they are either.
What about Anandtech’s theory, here? Could the mySQL benchmark be due to the LMbench results? I must confess, this part left me completely bewildered.
Whew, I’m a bit wore out. I’ll leave you to draw your own conclusions, and I hope you post them in the comments.
The Internet!
π = 3.2828694983
Randy
Interesting theories, however, I think there is a fly in the soup, because the Anandtech SQL test is all reads for the most part,
http://www.anandtech.com/IT/showdoc.aspx?I=2291&p=3
so that almost can’t be it in this case.
However, reads under OS X are surprisingly *much* slower than writes on a G5 w/SATA. Slower in fact than an IDE drive on a PC. That makes perfect sense in light of what they found, and the fact they claim
“we focused on “read” performance. This means that our benchmarks do not try to write information in the tables, but rather, always fetch and report information from one or more tables. ”
IOW, if they wanted to make the Mac look bad on disk I/O, they picked the perfect way to do it.
I do think you are right about the pthread though, it’s a red herring. Micro-benchmarks that show thread creation being slow don’t matter if the app is well written (i.e. thread pooling instead of constantly starting and tearing down threads). They were guessing, I suspect, and not very well.
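To illustrate the pooling point, here is a rough sketch in Python (not taken from Apache or MySQL; every name in it is made up) that contrasts starting one thread per task with reusing a small pool. Both do the same work, but only the first pays the thread creation cost on every task.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def tiny_task(results: list, i: int) -> None:
    # Stand-in for handling one request; list.append is thread-safe in CPython.
    results.append(i * i)


def thread_per_task(n: int) -> list:
    """Create and tear down a fresh thread for every task."""
    results = []
    for i in range(n):
        t = threading.Thread(target=tiny_task, args=(results, i))
        t.start()
        t.join()
    return results


def pooled(n: int, workers: int = 4) -> list:
    """Reuse a fixed set of worker threads for all tasks."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(n):
            pool.submit(tiny_task, results, i)
    # Leaving the 'with' block waits for all submitted tasks to finish.
    return results
```

Timing the two with a large n on a given OS shows how much thread creation cost matters there; an app built like `pooled` barely notices it.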
They should have just run a straight raw I/O test to the drive. Sequential reads and writes to the OS X file system have surprising characteristics: the read side is just pitifully slow, hence the problem with their SQL test.
I don’t think we can rule out F_FULLFSYNC just yet. On my (admittedly slow) iMac G4, I can do 22 one-byte writes per second when each one is followed by an F_FULLFSYNC, for a total of 22 bytes/second! (Removing the F_FULLFSYNCs gives me 220 KB/sec.) I admit that example is contrived, but the point is that F_FULLFSYNC could be a bottleneck even with a very low data rate, depending on how your transactions come in.
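For anyone who wants to repeat that kind of measurement, here is a rough sketch in Python. It is not the program used for the numbers above. The function name is made up, and it assumes `fcntl.F_FULLFSYNC` is only present on Mac OS X, so on other systems it falls back to `fsync()` and measures something weaker.

```python
import fcntl
import os
import tempfile
import time


def writes_per_second(n_writes: int = 20, sync: bool = True) -> float:
    """Time n one-byte writes, optionally flushing after each one."""
    full_sync = getattr(fcntl, "F_FULLFSYNC", None)  # None off Mac OS X
    with tempfile.NamedTemporaryFile() as f:
        fd = f.fileno()
        start = time.perf_counter()
        for _ in range(n_writes):
            os.write(fd, b"x")
            if sync:
                if full_sync is not None:
                    fcntl.fcntl(fd, full_sync)  # flush the drive cache too
                else:
                    os.fsync(fd)  # weaker fallback: OS buffers only
        elapsed = time.perf_counter() - start
    return n_writes / elapsed if elapsed > 0 else float("inf")
```

Comparing `writes_per_second(sync=True)` against `writes_per_second(sync=False)` on the same disk gives the ratio in question; the absolute numbers will depend heavily on the drive.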
On http://www.anandtech.com/mac/showdoc.aspx?i=2436&p=6 they claim that their max read rate was 600 KB/s, and their max write rate was 23 KB/s. 600 KB/sec reading seems low enough to not be the bottleneck even if OS X is bad at reading from disk. But the 23 KB/s writing might be, if it’s numerous small transactions followed by F_FULLFSYNCs. And yes, I’m guessing here.
To muddy the waters even further, they claimed (as you showed) that they “always fetch” and “do not try to write,” but then why is there a write rate at all? Were they generalizing, is the writing just incidental (logs?), or have I misinterpreted something?
I’m sure left with a lot of questions!
Dan
What about the poor apache results? They’re entirely reads. apache 1.3 has pre-spawned handlers so fork overhead isn’t a factor.
Mark
From the article:
“When we asked Apple for a reaction, they told us that some database vendors, Sybase and Oracle, have found a way around the threading problems”
So it seems that not only did they (wrongly, it seems) come to the conclusion that thread spawning was the problem, but they managed to get someone from Apple to effectively confirm it?
On another point, something that stood out for me is that the horrible results came from two pieces of software (Apache, MySQL), both of which have a lot of threads waiting on sockets. Which made me wonder, is this a funnel thing?
The fcntl() factor can make things rather slow for SQLite too, particularly if you don’t know what you’re doing. In fact, you can see the impact of this during the initial save of a SQLite-based Core Data app.
and if thread creation is really the bottleneck, mysql has a thread cache that will basically do the equivalent of apache’s pre-forking of processes.
i’m not sure that i believe the every-unix-does-it explanation of mac os x’s weak fsync(). the first sentence of the man page for fsync() on my linux box is: fsync copies all in-core parts of a file to disk, and waits until the device reports that all parts are on stable storage.
i know there are caveats to that with regard to disks that lie about the data having been flushed to disk, but to the best of my knowledge, linux really does behave as described.
i don’t think the culprit has been identified here. personally, i suspect there is something strange going on with mac os x’s non-blocking i/o. i’ve seen a non-blocking read on a unix socket, which returned no data, take 15 seconds.
Jim: no, Linux does not behave that way. fsync on Linux does not make any guarantees beyond those it makes on any other system. Check out Brad Fitzpatrick’s post about his diskchecker.pl .
Peter: it would be nice to have at least either a Preview button or a mention of what you allow in comments in terms of HTML (or whether you use another markup syntax), and preferably both. Absent either, I can only try to see if this posts correctly…
Chris
It’s not the calls to fsync that are the problem, it’s the fact that F_FULLFSYNC is used on OS X, but not on Linux (because it doesn’t exist on Linux). At least, that’s the hypothesis.
Dan
Again, F_FULLFSYNC does not explain apache’s terrible performance on OSX.
I can’t explain the Apache benchmark, because I don’t understand it, because all they provide is a single number, without units, and no explanation.
With a concurrency of 5, the G5 scored a 216. 216 what? What are we measuring? How are we measuring it? How was Apache installed? How was Apache tuned? What were we serving? How many requests were made?
Don’t you think it’s odd that the “Concurrency” column has no visible impact on the results?
Apachebench doesn’t output a unified score. It tells you a lot of other information, but there’s no clear path for extracting a single meaningful measurement from it.
If someone can actually explain what those scores mean, I would be grateful.
Here’s a sample Apachebench output, from running on my lil’ G4 iMac.
This is ApacheBench, Version 1.3d apache-1.3
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.0.100 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Finished 500 requests
Server Software: Apache/1.3.33
Server Hostname: 192.168.0.100
Server Port: 80
Document Path: /
Document Length: 1456 bytes
Concurrency Level: 200
Time taken for tests: 5.971 seconds
Complete requests: 500
Failed requests: 0
Broken pipe errors: 0
Total transferred: 945360 bytes
HTML transferred: 735280 bytes
Requests per second: 83.74 [#/sec] (mean)
Time per request: 2388.40 [ms] (mean)
Time per request: 11.94 [ms] (mean, across all concurrent requests)
Transfer rate: 158.33 [Kbytes/sec] received
Connnection Times (ms)
min mean[+/-sd] median max
Connect: 0 109 360.3 33 3048
Processing: 272 1800 704.2 1822 3489
Waiting: 235 1799 703.9 1821 3489
Total: 272 1909 825.8 1992 5417
Percentage of the requests served within a certain time (ms)
50% 1992
66% 2066
75% 2327
80% 2437
90% 2824
95% 3192
98% 3390
99% 5415
100% 5417 (last request)
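If a single number has to come out of a report like the one above, the “Requests per second” line is the obvious candidate. Here is a small sketch in Python that pulls it out and cross-checks it against the request count and elapsed time. The function names are made up, and nothing here says this is how Anandtech arrived at their scores.

```python
import re


def _field(ab_output: str, label: str) -> float:
    """Return the first number following 'label:' on a line of ab output."""
    match = re.search(
        rf"^{re.escape(label)}:\s+([\d.]+)", ab_output, re.MULTILINE
    )
    if match is None:
        raise ValueError(f"no '{label}' line found")
    return float(match.group(1))


def requests_per_second(ab_output: str) -> float:
    """The mean rate as ab itself reports it."""
    return _field(ab_output, "Requests per second")


def recomputed_rate(ab_output: str) -> float:
    """Complete requests divided by total test time, as a sanity check."""
    return _field(ab_output, "Complete requests") / _field(
        ab_output, "Time taken for tests"
    )
```

For the run above, 500 requests in 5.971 seconds works out to about 83.74 per second, which agrees with the line ab printed.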
Chris
Dan: I was addressing jim winstead’s comments. The web server performance issues are certainly a different problem. Ridiculous Fish is right about the thread model, of course: it’s not nearly as convoluted as the article makes it sound; it’s pretty straightforward, in fact. So, assuming the article’s results are valid (and I’d like to see someone who knows their facts redo these tests before accepting them as such) the problem isn’t clear, but I would assume has more to do with coarse-grained network locking.
Wes Felter
The latest version of Linux performs the equivalent of a F_FULLFSYNC on fsync(), but most distros probably haven’t picked it up yet.
jido
Jim: the man page for fsync simply says “waits until the device reports that all parts are on stable storage”. As the links in the article describe, that is not good enough to prevent data loss. For this purpose you should use a drive-specific command (F_FULLFSYNC) that is not always available.
Hakime SEDDIK
Here are some tests done by PC Magazine of Apache running on OS X and Xserve with WebBench. As you can see, the results are very strong, completely different from what AnandTech tries to make us believe. It’s very strange that the results are so different. PC Magazine tested Panther, and as Tiger has gotten rid of the funnels (Tiger implements fine-grained locking), the performance may be even better. So what is going on here?
http://www.pcmag.com/article2/0,1759,1630329,00.asp
Jussi
I tried to ask some questions about the apache test in anandtech forums and I got a partial answer. The testing methodology was simple:
ab -n 100000 -c x http://localhost # where x is concurrency
I did not get an answer to my later question about what the number given as the result meant. Nor did he say anything about the Apache configuration, which hopefully is about identical, or about what was served. I did not actually ask, but they should have provided the information up front anyway.
Then I did some testing on my PowerBook using the methodology above (with a smaller n, though). I noticed that getting http://localhost is about five times slower (hits/s) than getting a static HTML page from there. This is due to the fact that index.html in the basic OS X configuration is not static; it dynamically chooses the right language version.
More about the apache test. It really does not seem to be a real life situation, request from the localhost to a single html-page. But it sure is same for everyone (given that the httpd.conf was about the same on both machines)
On my machine httpd is not using threads; there are lots of httpd processes, so it was not the world’s best example to show OS X’s bad threading. I also tried to use Shark to do some profiling but I did not really understand much about that. Quite a large portion of CPU was used inside the kernel. Other than that I can’t say.
tester
The problem with ab on Mac OS X is that if you use the -n option with a number bigger than 1000, the execution stalls from time to time … it’s a bug probably specific to the Mac OS X – ab combo. My guess is that a buffer is full or not large enough, and so you run into timing issues.
Jussi is right about the language resolving …
Anyway if you run ab with -n 1000 -c 150 on a Mac OS X client Powerbook 1,5 GHz you easily go to +- 1000 req/s …
If I find the time I will test this on a PM G5 2.7 with Mac OS X server, but I guess you easily break the 4000 req/s
Look at:
http://www.apple.com/xserve/performance.html
tester
So maybe I didn’t clarify how I came to this conclusion of ab being unreliable at higher -n numbers like 100000.
First I did use 1000 and it was fast, I repeated this several times and noticed on one incident it seemed to hang but it completed with a low requests per second as a result.
If you wait, ab becomes fast again?! If you use -n 10000 on a clean reboot, it runs fast (on my PB, that is) the first time, but the second time (if you don’t wait too long) it stalls at around 6000 … if you look with top, nothing is really happening …
So at first I thought Apache was in trouble with spawning threads as the article claims, but I happen to have other clients on the network, so I fired one up remotely, while ab was struggling to continue on the machine I was doing the test … and lo and behold it runs fast, so Apache was not in trouble … but ab …
If you run ab with -n 100000 it will stall repeatably and you always have a request/s between 50 to 300 … it’s that simple.
Jussi
Very good catch, tester. I tested with so small n that there was no problem. Using bigger n I can see the stalling too. It was brought to my knowledge that the number in the Anandtech table was requests per second. This stalling behaviour seems to explain the result.
This stalling problem is definitely worth investigating, I’m not so sure that it’s ab that is broken, I managed to get the stalling behaviour also with other clients, simple python and rebol scripts which were getting a page from my powerbook with concurrency of 1. I was using my good old P166 router running OpenBSD as the client. I would guess that ab and the others are stalling because they don’t get page(s) they request and stall. My little scripts actually fail eventually stating problems with connecting, maybe they hit some time-out and fail.
I can’t say why this is happening. Maybe ab, python, rebol and OpenBSD are broken or maybe because there’s something wrong with OS X’s networking or Apache.
I have no time, expertise or proper hardware to look further into this but I think Apple definitely should do it, there is something really strange happening. I’m not filing a bug as I am not 100% sure of my networking settings are correct, maybe ridiculous fish could file it?
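For anyone who wants to poke at this from another machine, here is a sketch of the kind of simple one-at-a-time client described above. It is a hypothetical stand-in, not the actual script used in those tests. It times the TCP connect separately from the whole request, so a stall can be pinned on one or the other. The function name and defaults are made up.

```python
import socket
import time


def time_request(host: str, port: int = 80, path: str = "/",
                 timeout: float = 10.0) -> tuple:
    """Fetch one page with HTTP/1.0 and return
    (connect_seconds, total_seconds, bytes_received)."""
    start = time.perf_counter()
    # create_connection raises an OSError subclass if the connect times out.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.perf_counter()
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        received = 0
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # server closed the connection: response complete
                break
            received += len(chunk)
    done = time.perf_counter()
    return connected - start, done - start, received
```

Calling it in a loop and printing the first value of each result shows whether the slow requests are slow to connect or slow to be served.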
freshfish
to me, in daily desktop use, it seems like tiger is really su***ng in I/O, compared to panther. don’t know if this has anything to do with spotlight, but it feels much more “kernellish” and mds/mdimport are not racing for cpu (instead, kernel_task takes at least 50%). i never had ditto make my mac (g5/bi/1.8) unusable on panther when transferring files, in tiger, this is the case, even switching applications is a pain. if the same happens with apache and mysql, osx isn’t going to be an option for the server market until this issue is solved. i have high load projects involving mysql/apache/php which i’d love to host on an xserve. but preliminary tests with siege -b have shown very low performance on a G5/bi, even with a fink-worker-apache2 build…
i’ll post specific results after an in-depth test session (still not giving up…).
Jussi
Sorry for spamming, but I checked the ab results a bit better now and also ab is slow when _connecting_, not getting the actual page. This is very strange, maybe a hidden DoS prevention system? How does one turn it off?
Freshfish, I’ve noticed too on my PowerBook that copying big files seems to do something strange to the virtual memory, and other applications are swapped out. mds and its friends are quite slow too…
tester
The stalling of ab is client related: when such a stall occurs, you can use another machine on the network that can access the web server under test, and you can even run ab there without any problem and get very good performance from the web server …
I`m not claiming the problem is not Mac OS X related, just that the server is still responsive to other requests from other clients on the network …
By the way, A standard Mac OS X Server install on a dual 2,7 gives around 9 000 connections/s …
I’ll try to find a Linux client to see whether the problem is with ab or something else …
If you run ab with -n 1000 or -n 10000 you will see the stalls as well, but only if you run the test multiple times at short intervals from the same machine. It seems as if some stack needs to be flushed before it runs fast again, but it is definitely client related.
tester
ab from linux client runs with -n 100000 without any problems and result is;
6050 requests/s
with -n 10000 it gives 5666 requests/s
As already mentioned on Mac OS X with -n 10000 it gives around 9000 requests/s
Jonas Maebe
The stalls may be due to an exhaustion of kernel buffers for network connections. I’m not sure which sysctl you should increase to solve that though.
Robert C. Schwab
I was very concerned by the Anandtech article. I have worked for IBM for 27 years and I started work for CDC 64 bit machines in 1970 and on AIX RS6000 for about 10 years, and you are correct about forks (new process) versus threads in the same address space.
What you say makes sense, which is what we call blocking, meaning we will WAIT for the I/O to complete. Can I perform an ASYNCHRONOUS I/O, where I don’t wait and evaluate the EVENT at a later time, if I prefer to speed things up for other applications? Thanks for the write-up!
Have they eliminated file system issues for the difference in read performance? HFS+ is not organised like a typical UNIX file system, some operations are faster and others are slower.
It might be worthwhile to compare running off HFS+ and off UFS.
Solutions?
Let’s say the F_FULLFSYNC command (which the OS X Server team is probably referring to as the F***in’ FULLFSYNC command by now) is indeed the culprit. What do you suggest that Apple does going forward?
Do they continue their line of argument, saying that all good *nix OSes do it, therefore we will too? Or do they become more Linux-like, and kill the command?
I, for one, would like to see Apple offer customers a choice, a setting in Mac OS X Server that could turn the F_FULLFSYNC command on or off like a switch. If you care about data integrity when running your MySQL-based apps, turn it on. If you want to smoke your buddy that’s a Linux zealot in a MySQL smackdown, turn it off.
Is this a feasible solution? If not, I would love to hear of one that is.
It gets a little annoying after a while to read a thousand diagnoses, without a single doctor ever issuing a prescription…
Maynard Handley
Without commenting on the rest of the article, people, don’t make snarly comments without knowing what you are talking about — you’re then just as guilty as AnandTech. In particular ON LINUX fork() IS what you use to create threads. More specifically, Linux
(a) uses a substantially more integrated thread/process model than most unixes. Most unixes used to talk about threads as “light-weight processes” but on Linux they really are that. The same kernel data structure describes both, with only a flag or two distinguishing them
(b) going along with this commonality, the central call for creating threads is the clone() call. fork() is, I believe, simply implemented as a call to clone with the appropriate set of flags.
Thus in the Linux world when people talk about fork()/clone() they reasonably refer to the generic task of spawning a new unit of control, whether that’s a thread or a process. It’s not accurate terminology for other unixes but it’s not worth making a big deal over; it simply shows that the author’s primary frame of reference is Linux not other unixes.
Stick to criticizing real problems.
My recent investigations in the MySQL source code shows that the F_FULLFSYNC only is in place in the InnoDB table handlers, not in the MyISAM, which makes this argument quite unimportant. Since MyISAM is still faster than InnoDB, most deployed web applications use MyISAM. It would be interesting to know which table handler the Anandtech people tested…
Dan
It’s beginning to strongly point to kernel locking issues as the performance bottleneck in OSX.
Linux 2.2 and 2.4 had the same kind of performance issues, overall performance increased significantly with fine-grained locking in 2.6.
Maybe apple will finally fix it in 10.5.
stingmerman
The new Anandtech article still contains the AB flaws and tries to use their flawed benchmark to support their MySQL findings, which to me still indicates that they do not know what they are doing. How can we trust their other conclusions? It just doesn’t make sense that it is a thread creation destruction problem, especially since MySQL uses thread pooling? It would be nice if somebody who knew what they were doing narrowed it down:
Is it a configuration issue, is it a 4.1 issue resolved in 4.2? Is it a bug in the testing methodology like the AB bug? I just feel dirty after reading their article, like I have been abused.
First of all, I’d like to say that this blog has been very helpful in our search for what exactly is wrong with Mac OS X. Thanks!
But I couldn’t help but react to stingmerman:
“The new Anandtech article still contains the AB flaws and tries to use their flawed benchmark to support their MySQL findings, which to me still indicates that they do not know what they are doing”
First of all, read what Dominik Wagner says:
“My recent investigations in the MySQL source code shows that the F_FULLFSYNC only is in place in the InnoDB table handlers, not in the MyISAM, which makes this argument quite unimportant”
We use MyISAM and I clearly indicated that in the article. No Fsync problem.
About Apachebench: “Why exactly does the client stall? Is it really a bug or is it running out of some resources? We didn’t delve deeper, as we are developing a less synthetic, closer to the real world benchmark to test web servers.”
We posted those benchmarks to show that the apachebench problem does not exist in linux and that the G5 does well there. I clearly indicated that you should take the apachebench results with a bit of salt. So attacking the article on that is a bit lame.
Jussi
Hi, Johann.
Some testing I and some others did pointed in the direction that the Apachebench bug is in the client side. I don’t have an opinion if it is in ab itself or OS X. Under OpenBSD ab runs also stall in a similar way, I assume it to be the same problem.
I’m glad to hear that you are developing better ways to test web server performance, because the one you are using now is probably the most non-real-world test there can be. Web servers in a production environment are seldom used from localhost only, requesting one page only. Also, good tests should have as few changing variables as possible. When testing, only the server should differ; make sure that the clients, pages served and Apache configurations are identical.
Using one separate client for the Apache testing would probably have made a great difference. If your better tests are not ready before the next article, please at least use one invariant client. It would show whether the problem is really on the client or the server side.
It is certainly interesting!
Trackbacks have been disabled on this post due to spam. Please feel free to leave comments.
Vincent Bernardi
This comment is mostly to discover what the current value of Pi is, although I have many doubts as to the result. I guess I’ll see after having pressed “Submit”.
More to the point, has there been any progress on the subject since last year? For example, has any serious scientific paper been published on the comparative performance issues of the Mac OS X and GNU/Linux system calls?
Vincent Bernardi
Shows how much I know. 3.2 is already much more accurate than I would have guessed, seeing how the randomness seems flawed