Fish tank with Puck the Fish Fish tank with Puck the Fish Water on the floor spelling Fish
 

Array

December 23rd, 2005

Our arrays, aren’t.

There, that’s known as a teaser. You see it in television all the time. “Find out which common household plant devours pets at night…but first.” And then you have to sit through and watch the stuff about Brad and Angelina shacking up / Shaq driving his new Hummer / Hummer’s new fragrance before they get to the good stuff, the petnivorous plants. I don’t know what I’d do without TiVo. And here I’ve done the same thing as the networks. Shame on me. I’ll get to it, I promise.

But first

But first, let’s talk about data structures. I’ll try to make this more interesting than Donald Trump’s new reality fragrance SUV. Data structures are really important, we’re lead to believe, so important that entire classes in CS curricula discuss nothing else all year. Gosh! Let’s look at all the data structures we have available in Java.

HashSet TreeSet LinkedHashSet ArrayList
LinkedList PriorityQueue HashMap TreeMap
LinkedHashMap WeakHashMap IdentityHashMap CopyOnWriteArrayList
CopyOnWriteArraySet EnumSet EnumMap ConcurrentLinkedQueue
LinkedBlockingQueue ArrayBlockingQueue PriorityBlockingQueue DelayQueue
SynchronousQueue ConcurrentHashMap

That’s 22, by my count, excluding the legacy classes like Vector.

Here’s the data structures we have in standard C++:

vector deque list slist
bit_vector set map multiset
multimap hash_set hash_map hash_multiset
hash_multimap stack queue priority_queue
bitset

That’s, uh, 17. Here’s what I noticed while writing these up: most of these classes tell you how they work, right in the name. CopyOnWriteArraySet is a set that implements copy-on-write and is backed by an array. ArrayList is a list implemented via an array. hash_multimap is a multi-map implemented via hashing. And so on.

CoreFoundation

I’m going to compare those to CoreFoundation. CoreFoundation, if you’re a bit hazy, is Apple’s framework that "sits below" Carbon and Cocoa, and is one of the few places where these two APIs meet. CoreFoundation has collections, so that other APIs like Quartz that take and return collections don’t need separate Carbon and Cocoa interfaces. But the real reason I’m talking about CoreFoundation is that it’s open source.

Here’s all the CoreFoundation collections, leaving out the immutable variants:

CFMutableDictionary CFMutableBag CFMutableBitVector CFMutableSet
CFMutableArray CFBinaryHeap CFMutableTree

That’s only 7! And these are all the files? Doesn’t Apple know how important data structures are? And most of those names don’t even tell you how they work. CFMutableBag? What’s that? Where’s the, like, CFMutableTreeBag and CFMutableHashBag? But at least some do tell you how they work, like CFBinaryHeap and CFMutableArray. Right?

Right?

The array that wasn’t

Take a look in CFArray, at this strange comment:

	The access time for a value in the array is guaranteed to be at
	worst O(lg N) for any implementation, current and future, but will
	often be O(1) (constant time). Linear search operations similarly
	have a worst case complexity of O(N*lg N), though typically the
	bounds will be tighter, and so on. Insertion or deletion operations
	will typically be linear in the number of values in the array, but
	may be O(N*lg N) clearly in the worst case in some implementations.
	There are no favored positions within the array for performance;
	that is, it is not necessarily faster to access values with low
	indices, or to insert or delete values with high indices, or
	whatever.

It’s like Apple skipped out on some sophomore CS lectures. Everyone knows that arrays don’t have logarithmic lookups – they have constant time lookups. But not these "arrays!" (necessarily!) In fact, you might notice that the guarantees Apple does make are weak enough for CFArray to be a hash table or binary tree. Talk about implementation latitude! Is there any payoff?

Let’s try some operations and see how CFArray compares. Well take an array with 100,000 values in it and time inserting, deleting, and getting items, each from the beginning, end, and random positions, for CFArrays, an STL vector, and a "naïve" C array. Then we’re going to do it again starting with 200,000 elements, then 300,000, etc. up to one million. Whoo, that’s a lot of permutations. I don’t want to do all that. Lovely assistant?

Lov. Asst.: Mmmmm, sir?

Go write that thing I said.

Lov. Asst.: Yes, sir.

Isn’t she great?

Lov. Asst.: Done, sir.

Thank you. I’ve enclosed my source code here; if you want to reproduce my results, run the test.py file inside it. As I said, we’re doing various operations while varying the initial size of the array. All of the results are expressed as a multiple of the result for an array with 100,000 values. That means that if the green line is higher than the red one, it doesn’t mean that the green operation was more expensive than the red one in absolute terms, but that the green one slows down more (as a percentage) than the red one as the array size increases. A line going up means that the operation becomes more expensive with larger arrays, and a line staying flat means its expense does not change.

Graphs

Here’s what I get for my quicky naïve C array:

Ok, so what does this mean again? The green line is at about 50 for a length of 1,000,000, so inserting data at random locations in my C array is about 50 times slower than doing it for a length of 100,000. (Ouch!) But no real surprises here; insertion at the end is constant (flat), insertion at random locations or at the beginning is roughly linear.

Here’s what I get for the STL vector template:

Right out of a textbook. Insertions and deletions take linear time, except for at the end where they take constant time, and lookups are always constant.

Now for CFArray:

Say what? I was so surprised by this, I had to rerun my test, but it seems to be right. Let’s catalog this unusual behavior:

  • Insertion or deletion at the beginning or end all take linear time up to 300,000 elements, at which point they start taking constant time; the operations for large arrays are no more expensive than for arrays of 300,000 elements.
  • Insertion or deletion at random locations starts out taking linear time, but suddenly becomes less expensive and switches to constant time. Insertion in an array of 1,000,000 elements is as expensive as for 100,000 elements, but 200,000 is more expensive than either.
  • Walking forwards or backwards takes constant time (as expected), but walking in a random order appears to take linear time up to about 300,000 elements – something very unusual (usually it’s equally as fast to access any element). This suggests that there is some remembered state or cache when accessing elements in a CFArray. (And CFStorage, used by CFArray, seems to confirm that.)

So it sure looks like CFArray is switching data structure implementations around 30,000 elements. And I believe there’s lots more implementations that I haven’t discovered, for smaller arrays and for immutable arrays.

So what?

I guess I’ve got to come to a conclusion or three. Ok, here we go, in order of increasing broadness:

  1. CFArrays and NSArrays are good at some things that arrays are traditionally bad at. Don’t go rolling your own dequeue just because you plan to insert a lot of stuff in the beginning of an array. Try CFArray or NSArray first, if they’re easier, and optimize if it’s too slow or doesn’t fit.
  2. Apple often (though not always) names a class in terms of what it does, not how it does it. This is the spirit of Objective-C dynamic messaging: a message says what to do, but not how to do it. The way it does it may be different from what you expect. Which leads me to:
  3. Don’t second guess Apple, because Apple has already second guessed YOU. In a good way, of course.

What about PHP?

PHP fakes you out with arrays as well – they’re associative or numeric, and can be used as hash tables. Something strange is going on there too. Now I think from reading this and looking at some of the Zend code that arrays in PHP are straightforward hash tables with a cursor for fast enumeration and don’t do any of the polymorphic trickery that CFArray does. But I’m not sure; if you know more about PHP’s array implementation I’d love it if you posted in the comments!

 

The Internet!

π = 3.28266551101

Jim

I’m new to Objective C (well, just about any C as a practical matter). This is the single most useful blog I’ve come across. Thanks.

Being pretty familiar with PHP arrays from the PHP side, you intrigued me enough for me to look at PHP’s array code in the Zend engine.

It’s a hash table and a linked list at the same time: each element of the table contains a pointer to next and previous elements, making them a *ordered* linked list. There is nothing polymorphic in there, nor would I think it could give much improvements since PHP arrays are never really accessed by index, always by key. It’s more like an ordered dictionary. :-)

[...] OU Filed under: General, Programing, Apple, Macintosh — Jay @ 11:49 am From ridiculous_fish » Blog Archive » Array: “Don’t second guess Apple, because Apple ha [...]

[...] is your blog worth?” Around the web Sharing iTunes iLeech Array Shellcode This entry was posted on Saturday, December 24th, 200 [...]

Maarten Sneep

Hi, I think you left out a zero in the numbers when describing the CF results. Makes it a bit more confusing than it needs to be.

Maarten

ridiculous_fish

Thanks, Maarten! I’ve fixed that.

nicolas

Love your blog !
What about sorting ?
I’d love to see your cunning analysis compare NSArray sorts with objective C sort methods and C arrays with qsort().
(Unless I’m wrong, you’re an apple guy), you should be funded by your group to run a weekly (?) entry here !

Does this really matter? I mean, in the five years I’ve been developing, I’ve never used arrays that have more than 1,000 elements. Yes, I believe there is some software that actually has 100,000+ element data structures, but I think they are few and far between. Are they?

I’ve been working on an app that deals with 100,000+ element arrays lately, so this was a very interesting read. Thank you!

To respond to nicolas’ comment: NSArray’s sorting methods appear comparable to qsort at least for between 20,000 and 40,000 objects. I found that for my purposes (sorting arrays that are usually *mostly* sorted), implementing a category on NSMutableArray which used the standard C library’s mergesort was significantly faster. Given what I’ve learned of NSArray today, I think I have some more benchmarking to do.

My link on this post is to what benchmarks I’ve done.

If you want to read some interesting array-implementation stuff, take a look at the excellent paper on the implementation of Lua 5.0:
http://www.tecgraf.puc-rio.br/~lhf/ftp/doc/jucs05.pdf
Lua’s arrays are basically like PHPs (array plus dictionary) but they’ve put in some very interesting space and speed optimizations; which is important, because Lua uses arrays for everything. (I recommend Lua in general; it’s a very compact, elegant language.)

You know, in response to the question about whether this really matters — it occurs to me, in the course of my own development of Cocoa apps, that there is one particular application where this really matters: the two applications that help you track down memory leaks, ObjectAlloc and MallocDebug.

These two applications store an insane amount of data about all of your objects and where they’re being retained and released and stuff, and this could EASILY eclipse the 100,000 lower limit that has been presented in the graphs above. Actually, I can easily see it eclipsing the 1,000,000 lower limit in terms of arrays and stuff. The number of CFStrings that you use in an application is just so immense that even a few minutes of operation would put you on the radar of the above graphs.

So there are at least SOME applications. :)

– Simone

Of related interest may be Martin Fowler’s page on Duck Interface. To quote the page that quotes Kent Beck’s Smalltalk Best Practice Patterns, “One of the first objects many people write when they come to Smalltalk is Stack. Stack is the basic data structure, fabled in song, story, and hundreds of papers about theoretical programming languages… there is no Stack class in any of the basic images. I’ve seen one written any number of times, but they never seem to last long.”

nicolas

thanks zac for your reply !
I’d like to add one observation.
It may seem obvious but it can’t hurt to repeat it :
The biggest optimization I’ve made on an app that used sorted arrays was : limiting the amount of times I sorted the damn arrays !
I spent some time trying to optimize the sorts or the sorted insertions (binary insert etc). This gained very little cycles…. But calling the sort functions every 100ms instead of every 25ms made a HUGE difference ^_^

Have you thought about doing the same benchmark with NSArray that you did with CFArray? I mean, just to see the overhead that Objective-C adds.

Daniel Berlin

So, what is the memory behavior of CFArray like (ie memory usage)?

The only thing it seems truly better at is deletes, and the only way i can imagine it could remain flat like that for random deletions is to not do the actual deletes, but just mark them to be deleted. This would explain the walk timings as well.

Hmmmm timing for the plain array seems a bit weird. A plain array that backward walks grows – isn’t it just a

avalue = anarray[offset] ?

so in real just an addition of addresses:

*(anarray+offset) ???

with offset just decreasing ? If you mean that walking through the array takes longer I do undrestand that, but how could that get faster with other data structures ?

Enligthen me – Patrick – aka Jolly

[...] it back. You can see a movie of it working here. Because of Ridiculous Fish’s great article on reverse engineering Apple’s NSArray implementation, I also think that [...]

[...] ex of -1 indicates the last element of the array, -2 is the next to last ridiculous_fish Blog Archive ArrayCopyOnWriteArraySet is a set that implements copy-on-write and is ba [...]

Just to inform you, classes:
o hash_set
o hash_map
o hash_multiset
o hash_multimap

are not part of the C++ standard but are CGI extensions.

Finally, i understood… thank you – very good article

dont understand your ‘number’ thing, but …. still your articles are very good!