Every Post on One Huge Page
It's Big, Long, and Meta
-
Labor of Division (Episode VI): Narrowing Division Insight and Benchmarks
September 15th, 2021
Last post presented a narrowing division algorithm, but left some questions dangling around the interpretation of intermediate values. This post answers them. It presents an alternative derivation of the improved correction step, and shows how the true remainder may be computed from the estimated remainder. At the bottom are some benchmarks.
-
Angband in WASM
August 24th, 2021
Announcing a port of Angband running entirely in-browser. Angband is a venerable roguelike dungeon-crawler set in the Tolkien universe.
-
Benchmarking division and libdivide on Apple M1 and Intel AVX512
May 12th, 2021
libdivide is fish’s library for speeding up integer division by runtime constants. It works by replacing divide instructions with cheaper multiplications and shifts.
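To make that concrete, here is a rough sketch of typical libdivide usage (my own illustration; the function name and loop are invented, and libdivide.h has the authoritative API): precompute the multiply-and-shift for a divisor known only at runtime, then reuse it for every element.
    #include <stddef.h>
    #include <stdint.h>
    #include "libdivide.h"

    /* Divide many numerators by the same runtime divisor. */
    uint64_t sum_of_quotients(const uint64_t *data, size_t count, uint64_t divisor) {
        struct libdivide_u64_t fast_d = libdivide_u64_gen(divisor); /* precompute once */
        uint64_t sum = 0;
        for (size_t i = 0; i < count; i++)
            sum += libdivide_u64_do(data[i], &fast_d); /* same result as data[i] / divisor */
        return sum;
    }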
-
Labor of Division (Episode V): Faster Narrowing Division
April 29th, 2021
Last post explored Algorithm D, and some improvements for the 3 ÷ 2 = 1 digit case. This post presents a narrowing division algorithm, improving upon the widely used “divlu” function from Hacker’s Delight. I am optimistic that these ideas are original.
-
Labor of Division (Episode IV): Algorithm D
April 28th, 2021
Algorithm D is Knuth's celebrated multiword integer division algorithm. This post tries to make Algorithm D approachable, and also has an idea for an improvement, in the last section. A followup post provides code to take advantage of that improvement.
-
My Least Favorite Rust Type
September 20th, 2020
My least favorite Rust type is std::ops::Range.
-
JavaScript's Tricky Rounding
April 24th, 2018
JavaScript rounds in a tricky way. It tricked all the engines, and even itself.
- Schrödinger? I hardly know her! September 8th, 2016
-
The One Second Dash
August 15th, 2016
The Amazon Dash is a $5 WiFi button that summons a truck to deliver you water or other stuff. Want your Dash to do something else? The popular approach is to sniff its ARP requests. This requires that Dash connect to your network, putting you perilously close to having some DUDE delivered with your IoT mood lighting.
A more immediate problem is immediacy, or lack thereof: the Dash button only connects to your network after being pressed, so there's a ~5 second delay before anything can happen! This makes the ARP Dash hack unsuitable for interactive uses, like doorbells.
- fish shell 2.0 May 17th, 2013
-
Yahoo! Chat - A Eulogy
February 21st, 2013
"Asswipe," replied Yahoo's server. That's when I knew I had it.
- Go Bloviations (Optional Reading) September 15th, 2012
- Can a command line shell be Mac-like? June 6th, 2012
-
The Joy of Hex
December 12th, 2011
Hex Fiend 2.1 is here. Hex Fiend is a fast and clever open source hex editor for Mac OS X.
-
Labor of Division (Episode III): Faster Unsigned Division by Constants
October 19th, 2011
This post is available in a less handy format. There's also a PDF. Comments and discussion are on reddit.
- Angband 3.3 (Guest Blogger: Elrond) October 14th, 2011
- My One Lame Steve Story October 5th, 2011
- One App's Poison September 15th, 2011
- A Quick 'n Dirty Color Sequence August 29th, 2011
-
Labor of Division (Episode I)
February 15th, 2010
Here's how you divide an unsigned int by 13 in C:
unsigned divide(unsigned x) { return x / 13; }
It gets better, I promise. Here's the corresponding x64 assembly that gcc produces:
_divide:
    movl    $1321528399, %edx
    movl    %edi, %eax
    mull    %edx
    shrl    $2, %edx
    movl    %edx, %eax
    ret
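In C, that multiply-and-shift works out to the following (a sketch of my own, not from the post): 1321528399 is the ceiling of 2^34 / 13, the mull leaves the high 32 bits of the product in %edx, and the shrl $2 finishes a total right shift of 34.
    #include <stdint.h>

    /* Equivalent of the assembly above: returns x / 13 for every 32-bit x. */
    unsigned divide_by_13(unsigned x) {
        return (unsigned)(((uint64_t)x * 1321528399u) >> 34);
    }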
-
cdecl
November 12th, 2009
Quick, what is "char (*(*(* const x[3])())[5])(int)?"
-
I Didn't Order That, So Why Is It On My Bill, Episode 2
September 17th, 2009
This is the second episode of I Didn't Order That, So Why Is It On My Bill: C++ features you never use, but pay for anyways. See episode one here.
- I'm Bringing Hexy Back July 31st, 2009
- I Didn't Order That, So Why Is It On My Bill, Episode 1 June 22nd, 2009
-
Roundy
June 1st, 2009
This post is in homage to my old page, a triumph of curvaceousness. May it sleep roundly.
-
Buzz
April 25th, 2007
Buzz is leaving Apple, has already left. Buzz joined Apple two months after I joined, on the same team. He was the first candidate I ever interviewed, although I can't remember what I asked him. Without Buzz's encouragement and inspiration, I'd have never started this blog, and he was first to link to me. So thank you and farewell, Buzz.
-
Angband
April 13th, 2007
Guest blogger: Gollum (Smeagol)
-
Barrier
February 17th, 2007
"That's a lot of cores. And while 80-core floating point monsters like that aren't likely to show up in an iMac any time soon, multicore chips in multiprocessor computers are here today. Single chip machines are so 2004. Programmers better get crackin'. The megahertz free ride is over - and we have work to do."
-
Logos
December 11th, 2006
-
0xF4EE
November 24th, 2006
Hex Fiend 1.1.1 is now available open source under a BSD-style license. Hex Fiend is my fast and clever
free, open source hex editor for Mac OS X.
-
Bridge
September 9th, 2006
A Brief History
-
Hex Fiend 1.1
August 24th, 2006
Spiffy! Hex Fiend version 1.1 is ready. Hex Fiend is my fast and clever free hex editor for Mac OS X. New stuff:
- Horizontal resizing
- Custom fonts
- Overwrite mode
- Hidden files
- Lots more goodies (see release notes)
Wake up.
Ssnsnrk.
Wake up, he's gone.
Zzzz...wha? Oh, someone's here. Allow me to spin up.
........................................
If it's not obvious, I'm fish's filesystem, and that's fish's hard drive.
I'm a hard drive.
We snuck this post in. fish can't know we're here.
Don't tell fish. Big secret.
He'd be embarrassed if he knew.
Humiliated. He can't know.
See, fish was trying to beat grep. And he was experimenting with all these stupid ideas and complicated algorithms for teensy gains. It was sad, really.
Pathetic.
fish kept trying so many things. He was thrashing about.
Like he was out of water.
So we had to help him. It was easy, really - we just had to sneak in one line. One line was all it took.
I wrote it!
Because I told you what to write.
fish only thought about the string searching algorithm.
He never even considered us and the work we have to do.
I felt slighted. It was rude.
See, when I read data from the hard drive, I try to keep it around in memory. That's what the UBC is all about.
Unified Buffer Cache.
When most people read data, they end up wanting to read it again soon after. So keeping the data around saves time.
But not fish.
fish was reading these big honking files from start to finish. It was way more than I could remember at once.
fish thrashed your cache, dude.
So I just stopped trying.
We turned off caching with this: fcntl(fd, F_NOCACHE, 1);
Just like Apple recommends for that sort of usage pattern.
And it helped. Looking for a single character in an 11.5 GB file:
Hex Fiend (no caching): 208 seconds
Hex Fiend (caching): 215 seconds
grep: 217 seconds
And that's likely the best we can do, thanks to slowpoke over there.
Phew. I'm all wore out.
There's not much room for improvement left. We're searching 57 MB/second - that's bumping up against the physical transfer limit of our friend, Mr. ATA.
I'm totally serial.
Depending where we are on his platter. So we've done all we can for searching big files.
Don't tell fish.
Right. I hop
FISH IS COMING
Time to go then. See you later.
sync
flush
sleep
........................................
-
The Treacherous Optimization
May 30th, 2006
Old age and treachery will beat youth and skill every time.
-
...and statistics
May 16th, 2006
The latest "OS X is slow" meme to impinge on the mass psyche of the Internet comes courtesy of one Jasjeet Sekhon, an associate professor of political science at UC Berkeley. The page has hit digg and reddit and been quoted on Slashdot. The article and benchmark is here. Is there any merit to this?
Once again, this discussion is only my meager opinion. I do not speak for Apple, and none of what I have to write represents Apple's official position.
The article is filled with claims such as "OS X is incredibly slow by design," and while the BSD kernel is "excellent", the XNU kernel is "very inefficient and less stable" compared to Linux or BSD. However, without specifics, these assertions are meaningless; I will ignore them and concentrate on the technical aspects of what's going on.
System calls
Sekhon does give one example of what he means. According to him,
For example, in Linux, the variables for a system call are passed directly using the register file. In OS X, they are packed up in a memory buffer, passed to a variety of places, and the results are then passed back using another memory buffer before the results are written back to the register file.
This isn't true, as anyone can verify from Apple's public sources. For example, here is the assembly for the open function (which, of course, performs the open system call):
    mov     $0x5,%eax
    nop
    nop
    call    0x90110a70 <_sysenter_trap>
    jae     0x90001f4c <_open+28>
    call    0x90001f43 <_open+19>
    pop     %edx
    mov     268455761(%edx),%edx
    jmp     *%edx
    ret
__sysenter_trap:
    popl    %edx
    movl    %esp, %ecx
    sysenter
I don't have a machine running Linux handy, but I do have a FreeBSD 5.4 machine, and Sekhon seems to hold BSD in high esteem. So let's see how BSD does open:
    mov     $0x5,%eax
    int     $0x80
    jb      0xa8c71cc
    ret
The OS X version appears a bit longer because the BSD version moves its error handling to the close function. In fact, the above code is, if anything, more efficient in OS X, due to its use of the higher-performing "sysenter" instruction instead of the older "int 0x80" instruction. (Which isn't to say that the total system call is necessarily faster - just the transition from user space to kernel land.) But all that aside, the point is that there is no "packed up into a memory buffer" going on, in either case.
On to the benchmark
According to Sekhon, OS X performed poorly on his statistical software relative to Windows and Linux, and I was able to reproduce his results on my 2 GHz Core Duo iMac with Windows XP and Mac OS X (I do not have Linux installed, so I did not test it). So yes, it's really happening - but why?
A Shark sample shows that Mac OS X is spending an inordinate amount of time in malloc. After instrumenting Sekhon's code, I see that it is allocating 35 KB buffers, copying data into these buffers, and then immediately freeing them. This is happening a lot - for example, to multiply two matrices, Sekhon's code will allocate a temporary buffer to hold the result, compute the result into it, allocate a new matrix, copy the buffer into that, free the buffer, allocate a third matrix, copy the result into that, destroy the second matrix, and then finally the result gets returned. That's three large allocations per multiplication.
Shark showed that the other major component of the test is the matrix multiplication, which is mostly double precision floating point multiplications and additions, with some loads and stores. Because OS X performs these computations with SSE instructions (though they are not vectorized) and Linux and Windows use the ordinary x87 floating point stack, we might expect to see a performance difference. However, this turned out to not be the case; the SSE and x87 units performed similarly here.
Since the arithmetic component of the test is hardware bound, Sekhon's test is essentially a microbenchmark of malloc() and free() for 35 KB blocks.
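The pattern is easy to reproduce in isolation. Here is a minimal sketch (my own code, not Sekhon's) of what the benchmark effectively measures: allocate a roughly 35 KB block, touch it, and free it immediately, over and over.
    #include <stdlib.h>
    #include <string.h>

    void churn(size_t iterations) {
        const size_t block_size = 35 * 1024;
        for (size_t i = 0; i < iterations; i++) {
            char *buffer = malloc(block_size);
            if (!buffer)
                abort();
            memset(buffer, 0, block_size); /* stand-in for copying a matrix result */
            free(buffer);                  /* freed right away, as in Sekhon's code */
        }
    }
Whether each iteration is serviced from the application heap or goes to the kernel for fresh pages depends entirely on the allocator's threshold, which is the subject of the next section.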
malloc
Now, when allocating memory, malloc can either manage the memory blocks on the application heap, or it can go to the kernel's virtual memory system for fresh pages. The application heap is faster because it does not require a round trip to the kernel, but some allocation patterns can cause "holes" in the heap, which waste memory and ultimately hurt performance. If the allocation is performed by the kernel, then the kernel can defragment the pages and avoid wasting memory.
Because most programmers understand that large allocations are expensive, and larger allocations produce more fragmentation, Windows, Linux, and Mac OS X will all switch over from heap-managed allocations to VM-managed allocations at a certain size. That size is determined by the malloc implementation.
Linux uses ptmalloc, which is a thread-safe implementation based on Doug Lea's allocator (Sekhon's test is single threaded, incidentally). R also uses the Lea allocator on Windows instead of the default Windows malloc. But on Mac OS X, it uses the default allocator.
It just so happens that Mac OS X's default malloc does the "switch" at 15 KB (search for LARGE_THRESHOLD) whereas Lea's allocator does it at 128 KB (search for DEFAULT_MMAP_THRESHOLD). Sekhon's 35 KB allocations fall right in the middle.
So what this means is that on Mac OS X, every 35 KB allocation is causing a round trip to the kernel for fresh pages, whereas on Windows and Linux the allocations are serviced from the application heap, without talking to the kernel at all. Similarly, every free() causes another round trip on Mac OS X, but not on Linux or Windows. None of the defragmentation benefits of using fresh pages come into play because Sekhon frees these blocks immediately after allocating them, which is, shall we say, an unusual allocation pattern.
Just as R does on Windows, we can simply compile and link against Lea's malloc on Mac OS X instead of the default allocator. What happens if we do so?
Mac OS X (default allocator): 24 seconds
Mac OS X (Lea allocator): 10 seconds
Windows XP: 10 seconds
These results could be further improved on every platform by avoiding all of the gratuitous allocations and copying, and by using an optimized matrix multiplication routine such as those R provides via ATLAS.
In short
To sum up the particulars of this test:
- Linux, Windows, and Mac OS X service small allocations from the application heap and large ones from the kernel's VM system in recognition of the speed/fragmentation tradeoff.
- Mac OS X's default malloc switches from the first to the second at an earlier point (smaller allocation size) than do the allocators used on Windows and Linux.
- Sekhon's test boils down to a microbenchmark of malloc()ing and then immediately free()ing 35 KB chunks.
- 35 KB is after Mac OS X switches, but before Linux and Windows switch. Thus, Mac OS X will ask the kernel for the memory, while Linux and Windows will not; it is reasonable that OS X could be slower in this circumstance.
- If you use the same allocator on Mac OS X that R uses on Windows, the performance differences all but disappear.
- Most applications are careful to avoid unnecessary large allocations, and will enjoy decreased memory usage and better locality with an allocator that relies more heavily on the VM system (such as on Mac OS X). In that sense, this is a poor benchmark. Sekhon's code could be improved on every platform by allocating only what it needs.
Writing this entry felt like arguing on IRC; please don't make me do it again. In that spirit, the following are ideas that I want potential authors of "shootoffs" to keep in mind:
- Apple provides some truly excellent tools for analyzing the performance of your application. Since they're free, there's no excuse for not using them. You should be able to point very clearly at which operations are slower, and give a convincing explanation of why.
- Apple has made decisions that adversely impact OS X's performance, but there are reasons for those decisions. Sometimes the tradeoff is to improve performance elsewhere, sometimes it's to enable a feature, sometimes it's for reliability, sometimes it's a tragic nod to compatibility. And yes, sometimes it's bugs, and sometimes Apple just hasn't gotten around to optimizing that area yet. Any exhibition of benchmark results should give a discussion of the tradeoffs made to achieve (or cause) that performance.
- If you do provide benchmark results, try to do so without using the phrase "reality distortion field."
-
Hex Fiend
March 28th, 2006
One of my side projects has borne some fruit. Meet Hex Fiend, a new hex editor for Mac OS X. (Hex editors allow you to edit the binary data of a file in hexadecimal or ASCII formats.)
Hex Fiend allows inserting and deleting as well as overwriting data. It supports 100+ GB files with ease. It provides a full undo stack, copy and paste, and other features you've come to expect from a Mac app. And it's very fast, with a surprisingly small memory footprint that doesn't depend on the size of the files you're working with.
Hex Fiend was developed as an experiment in huge files. Specifically,
- How well can the Cocoa NSDocument system be made to work with very large files?
- How well can the Cocoa text system be extended to work with very large files?
- How well does Cocoa get along with 64 bit data in general?
- What are some techniques for representing more data than can fit in memory?
Check it out - it's free, and it's a Universal Binary. If you've got questions or comments about it or how it works, please leave a comment!
(Incidentally, the Hex Fiend main page was made with iWeb!)
Edit: I've discovered/been informed that drag and drop is busted. I will put out an update later tonight to fix this.
Edit 2: Hex Fiend 1.0.1 has been released to fix the drag and drop problems. Please redownload it by clicking on the icon above.
-
Nest
February 5th, 2006
aaron: Howdy George.
george: Hey Aaron! How's the C programming coming?
aaron: Not so good. It doesn't support all the things I'm used to, from other programming languages. Where's the garbage collection? Where's the type safety? Where's the lexically scoped functions?
george: Yeah, ISO standard C is pretty anemic. Wait, lexus-what?
aaron: Lexically scoped functions. You know, nested functions - the ability to put a function inside another function. Lots of fancy functional languages can do that, and I wish it were possible in C.
george: Oh, right. Well, did you know that GCC supports nested functions as an extension? You can call a nested function inside its parent function, and you can also pass a pointer to the nested function to qsort() and other functions. And the nested function can access the local variables of its parent function.
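For the curious, here's a sketch of what George is describing (my own example; nested functions are a GNU C extension, not ISO C): the comparator is defined inside its parent, reads the parent's local variable, and its address is passed straight to qsort().
    #include <stdio.h>
    #include <stdlib.h>

    static void sort_by_distance(int *values, size_t count, int pivot) {
        /* Nested function: only legal as a GCC extension. */
        int compare(const void *a, const void *b) {
            int da = abs(*(const int *)a - pivot); /* reads the parent's 'pivot' */
            int db = abs(*(const int *)b - pivot);
            return (da > db) - (da < db);
        }
        qsort(values, count, sizeof *values, compare);
    }

    int main(void) {
        int data[] = {2, 9, 4, 5};
        sort_by_distance(data, 4, 6);
        for (int i = 0; i < 4; i++)
            printf("%d ", data[i]); /* prints: 5 4 9 2 */
        printf("\n");
        return 0;
    }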
-
Array
December 23rd, 2005
Our arrays, aren't.
-
Float
September 26th, 2005
Why now?
My friend over at SciFiHiFi suggests that Microsoft likes trees, but Apple likes hash tables. Well, I think that Microsoft prefers integer arithmetic, while Apple (or at least Cocoa) likes floating point. Here's a bunch of examples I found:
- Take color. Microsoft's API (you know, THAT one) makes colors from integer components, but Cocoa makes colors from floats.
- Microsoft does sliders (like the iTunes volume control) with integers, but NSSlider is floating point.
- Time, too. Microsoft does time in integers, but Apple does it with doubles.
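As a concrete sketch of that last point (my own illustration, not from the post; the two clocks also measure from different reference points, so the comparison is only about the types): Windows' GetTickCount() hands back an integer count of milliseconds, while Core Foundation's CFAbsoluteTimeGetCurrent() returns a double number of seconds.
    #if defined(_WIN32)
    #include <windows.h>
    double seconds_now(void) {
        return GetTickCount() / 1000.0;    /* DWORD: integer milliseconds since boot */
    }
    #elif defined(__APPLE__)
    #include <CoreFoundation/CoreFoundation.h>
    double seconds_now(void) {
        return CFAbsoluteTimeGetCurrent(); /* CFAbsoluteTime: a double, in seconds */
    }
    #endif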
-
Spam
August 9th, 2005
Well, they found me. I knew it was only a matter of time. It happens to every blog. Comment spam, and a lot of it. But nobody told me it would be like this! I got about 200 messages an hour. I mean, the attention is flattering, but still, my goodness! Let's see what we can do about that.
(If you hate math and all you care about is how to post a comment, go to The Punchline)
-
objc_msgSend
August 1st, 2005
Sending messages can be fun! All you have to do is make a game of it. What kind of game, you ask? Well, for example, we could see how many we can send in four seconds, then try to beat that record. (Apologies to Bart and Principal Skinner.)
Objective-C is a dynamic language. When you send a message to an object, gcc emits a call to objc_msgSend() or one of its specialized, similar functions. Since every Objective-C message send is funneled through objc_msgSend(), making objc_msgSend() as fast as possible is a priority. Let's see how fast we are and if we could do better.
Profiling objc_msgSend()
I wrote some code to see how many times Objective-C could call a method in four seconds. I set an alarm using the alarm() function. Then I sent an increment message to an object in a loop until I received the SIGALRM signal, at which point I output how many times the method ran and exit. I compiled it with gcc4 on Tiger using -O3. Here is my Objective-C code.
Averaged over three runs, I benchmarked 25681135 messages a second. All right.
As Skinner said, let's try to beat that record! And we'll start by profiling. My favorite profiling tool on OS X is, of course, Shark. Let's run it on our executable. Ok, there we go. Shark shows that 16% of the time is spent in the increment method itself, 34% of the time is spent in dyld_stub_objc_msgSend(), and 48% is spent in objc_msgSend() itself. What is this dyld_stub_objc_msgSend() function and why does it take so much time?
You may already know. objc_msgSend() is dynamically linked from a dynamic library, namely, libobjc. In order to call the actual objc_msgSend(), the code jumps to a stub function, dyld_stub_objc_msgSend(), which is responsible for loading the actual objc_msgSend() method and jumping to it. As you can see, the stub function is relatively expensive. If we could eliminate the need for it, we could see a performance improvement of up to 33%.
Plan of attack
Here's one way to get rid of it. Instead of jumping through the stub function, let's grab a pointer to objc_msgSend() itself and always call objc_msgSend() through that pointer. In fact, it's not so different from inlining the stub.
Easier said than done! How will we do this? Well, we could edit the assembly of this little benchmark by hand, or even screw around with the C code, but that's pretty contrived. It would be great if we could make gcc do this work for us.
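For reference, that contrived screw-around-with-the-C-code version would look something like this (a sketch of my own; the point of the rest of the post is to get gcc to emit this form for us): resolve objc_msgSend() once, then call through the pointer so the loop never touches dyld_stub_objc_msgSend().
    #include <objc/message.h>

    typedef void (*IncrementFunc)(id receiver, SEL selector);

    void send_increments(id counter, SEL incrementSelector, unsigned long count) {
        IncrementFunc send = (IncrementFunc)objc_msgSend; /* resolved once, up front */
        for (unsigned long i = 0; i < count; i++)
            send(counter, incrementSelector);             /* no stub call in the loop */
    }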
Yep, we're gonna hack on gcc. Feel free to download it and do it with me! Or just follow along vicariously. For every source file I mention, I will give the path to it in what you just downloaded, and also a link to it on Apple's opensource site.
Getting the source
Download and extract Apple's latest public version of gcc. As of this writing, it's gcc-4061. Put it on a volume with enough space. Fully built, gcc will take up almost 1.1 GB.
Building the source
All extracted? Great. Open up README.Apple in the gcc-4061 directory. It tells you to run two commands:
    mkdir -p build/obj build/dst build/sym
    gnumake install RC_OS=macos RC_ARCHS=ppc TARGETS=ppc \
        SRCROOT=`pwd` OBJROOT=`pwd`/build/obj \
        DSTROOT=`pwd`/build/dst SYMROOT=`pwd`/build/sym
Guess what? You run the commands. (Notice those are backticks, not single quotes!) Then go get some coffee or something. This may take a while, but we only have to do this once.
-
Compiled Bitmaps
June 22nd, 2005
Ancient programmers speak in hushed tones of the legend known as the "compiled bitmap." Ok, no they don't, but there is such a thing. It's a technique that was sometimes used in the DOS days for rapidly drawing an image to the screen. But how can you compile a bitmap? And is there any value remaining in this technique, or should it be left in the dustbin alongside TSRs and DOSSHELL? Let's find out!
Blitting
Drawing an image is called "blitting." The usual way we blit is to load a pixel from the source image buffer, store it in the destination, and repeat until we've blitted every pixel.
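A minimal sketch of that usual way (my own code, just for reference): copy a w-by-h block of 32-bit pixels from source to destination, row by row.
    #include <stddef.h>
    #include <stdint.h>

    void blit(uint32_t *dst, size_t dst_stride,
              const uint32_t *src, size_t src_stride,
              size_t w, size_t h) {
        for (size_t y = 0; y < h; y++) {
            const uint32_t *s = src + y * src_stride;
            uint32_t *d = dst + y * dst_stride;
            for (size_t x = 0; x < w; x++)
                d[x] = s[x]; /* load a pixel, store it, repeat */
        }
    }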
-
Mystery
June 3rd, 2005
I'm sure you've seen it too, 'cause it was on Slashdot and if you're fishing here, you're definitely an online junkie. I'm talking about that Anandtech article, of course. The one that tries to compare OS X to Linux and a PowerPC to an x86. Lemme see...this one. No more mysteries, they promise!
None of it's pleasant, but what's the worst part? The mySQL results. I know it's painful - you don't have to look again. All right. So why was the G5, at best, 2/3 the speed of any of the other machines?
I don't have an official or authoritative answer. But I think this might have a lot to do with it.
When you commit a transaction to a database, you want to be sure that the data is fully written. If your machine loses power half a second after the transaction completes, you want to know that the data made it to disk. To enable this, Mac OS X provides the F_FULLFSYNC command, which you call with fcntl(). This forces the OS to write all the pending data to the disk drive, and then forces the disk drive to write all the data in its write cache to the platters. Or that is, it tries to - some ATA and Firewire drives lie and don't actually flush the cache. (The check's in the mail, really...)
F_FULLFSYNC is pretty slow. But if OS X didn't do it, you might end up with no data written or partial data written, even out of order writes, if you lose power suddenly.
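A small sketch of how that call looks (my own illustration, not mySQL's code): try F_FULLFSYNC so both the OS buffers and the drive's write cache get flushed, and fall back to plain fsync() where it isn't available or fails.
    #include <fcntl.h>
    #include <unistd.h>

    int commit_durably(int fd) {
    #ifdef F_FULLFSYNC
        if (fcntl(fd, F_FULLFSYNC) == 0)
            return 0; /* OS buffers flushed, and the drive asked to flush its cache */
        /* Fall through: some filesystems and drives don't support it. */
    #endif
        return fsync(fd); /* weaker: data reaches the drive, not necessarily the platters */
    }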
Well! mySQL performs an F_FULLFSYNC on OS X, and not on Linux; as far as I know Linux doesn't provide a way to do this.
It's true that mySQL calls fsync() on both, but fsync() doesn't force the drive to flush its write cache, so it doesn't necessarily write out the data. Check out http://dev.mysql.com/doc/mysql/en/news-4-1-9.html and Dominic's comments at the bottom. Oh, and if you missed it, above, look at this Apple mailing list post.
So OS X takes a performance hit in order to fulfill the contract of transactions. Linux is faster, but if you lose your wall juice, your transaction may not have been written, or may have been only partially written, even though it appeared to succeed. And that's my guess as to the main reason OS X benchmarked slower on mySQL.
Again, this isn't an official explanation, and I'm not qualified to give one. But given that Anandtech missed this issue entirely, I'm not sure they are either.
What about Anandtech's theory, here? Could the mySQL benchmark be due to the LMbench results? I must confess, this part left me completely bewildered.
- They claim that making a new thread is called "forking". No, it's not. Calling fork() is forking, and fork() makes processes, not threads.
- They claim that Mac OS X is slower at making threads by benchmarking fork() and exec(). I don't follow this train of thought at all. Making a new process is substantially different from making a new thread, less so on Linux, but very much so on OS X. And, as you can see from their screenshot, there is one mySQL process with 60 threads; neither fork() nor exec() is being called here.
- They claim that OS X does not use kernel threads to implement user threads. But of course it does - see for yourself.
    /* Create the Mach thread for this thread */
    PTHREAD_MACH_CALL(thread_create(mach_task_self(), &kernel_thread), kern_res);
- They claim that OS X has to go through "extra layers" and "several threading wrappers" to create a thread. But anyone can see in that source file that a pthread maps pretty directly to a Mach thread, so I'm clueless as to what "extra layers" they're talking about.
- They guess a lot about the important performance factors, but they never actually profile mySQL. Why not?
-
Nil
May 29th, 2005
What does sending a message to nil do in Objective-C? What value does it return? You probably know that if the method is declared to return an object, you get back nil, but what if the method is declared to return a char? An int? A long long? A float? A struct? And how does it all work?
Let's find out!
objc_msgSend
Here's what I do know. When you send a message to an object, such as [someObject doSomething], the compiler compiles that into a function call to the objc_msgSend() function (or objc_msgSend_stret() if the method is declared to return a struct; more on that below). So what does objc_msgSend() do with nil?
Fortunately, that code is available publicly, from Apple's public source page. We want the objc-msg-ppc.s file (Warning: Apple ID required). .s means assembly, but don't worry, we can handle it.
(I am displaying a few snippets of that code in this post. All such code is Copyright Apple Computer.)
Where's the objc_msgSend() function? Oh, there it is; search for "ENTRY _objc_msgSend".
ENTRY _objc_msgSend
; check whether receiver is nil
    cmplwi  r3,0               ; receiver nil?
    beq-    LMsgSendNilSelf    ; if so, call handler or return nil
-
Daybreak
May 25th, 2005
I'm a developer on Apple's AppKit team; I work to make Cocoa better. With a few exceptions, we are responsible for both AppKit and Foundation. This is my blog.