I Didn't Order That, So Why Is It On My Bill, Episode 2
September 17th, 2009
This is the second episode of I Didn't Order That, So Why Is It On My Bill: C++ features you never use, but pay for anyways. See episode one here.

For those who like their punchlines first: std::string is designed to allow copy-on-write. But once operator[] is called, that string is ruined for COW forevermore. The reason is that it has to guard against a future write at any time, because you may have stashed away the internal pointer that operator[] returns - a feature that you surely know better than to use, but that costs you nevertheless.

The C++ standard string class is std::string. Here's a string containing 10 megabytes of the letter 'x':

#include <string>
using std::string;
int main(void) {
    string the_base(1024 * 1024 * 10, 'x');
}
Big string. Now let's make a hundred copies, one by one:
#include <string>
using std::string;
int main(void) {
    string the_base(1024 * 1024 * 10, 'x');
    for (int i = 0; i < 100; i++) {
        string the_copy = the_base;
    }
}
This runs in .02 seconds, which is very fast. Suspiciously fast! This old iMac can't schlep a gigabyte of data in two hundredths of a second. It's using copy-on-write.

So let's write and measure the copy:

#include <string>
using std::string;
int main(void) {
    string the_base(1024 * 1024 * 10, 'x');
    for (int i = 0; i < 100; i++) {
        string the_copy = the_base;
        the_copy[0] = 'y';
    }
}
The program now runs in 2.5 seconds. These scant 100 writes resulted in a 100x slowdown. Not surprising: it's exactly what we would expect from a COW string.

the_copy[0] returns a reference to a char. But maybe we aren't going to write into it - perhaps we only want to read. How can these copy-on-write strings know when the write occurs, so they can copy? The answer is that they cannot - they must be pessimistic and assume you will write, even if you never do. We can trick it: don't assign anything, and still see the performance hit:

#include <string>
using std::string;
int main(void) {
    string the_base(1024 * 1024 * 10, 'x');
    for (int i = 0; i < 100; i++) {
        string the_copy = the_base;
        the_copy[0];
    }
}
No writes! Just a read! But same performance as before. This isn't copy on write - it's copy on read!

Things get even worse. operator[] returns a reference: basically a sugary pointer. You can read from it or write to it. Or you can be a jerk and just hang onto it:

string original = "hello";
char & ref = original[0];
string clone = original;
ref = 'y';

In case that's not clear, we made a string, stashed away a reference to the first character, copied the string, and then wrote to the string's guts. The write is just a normal store to a memory location, so the poor string is caught completely unaware. If the copy-on-write were naive, then we would have modified both the original and the copy.

Oh man, is that evil. But the string's got to be paranoid and guard against it anyways - in fact, the C++ standard explicitly calls out this case as something that has to work! In other words, when you get a reference to a char inside a string, the string is tainted. It can never participate in copy-on-write again, because you just might have stashed away the internal pointer it handed back, and write into the string as a completely unexpected time.

Of course you know better than to hang onto internal pointers. You probably don't need this feature. But you pay for it anyways:

#include <string>
using std::string;
int main(void) {
    string the_base(1024 * 1024 * 10, 'x');
    the_base[0];
    for (int i = 0; i < 100; i++) {
        string the_copy = the_base;
    }
}

This takes the 2.5 seconds. Just one silly read wrecked all of the COW machinery for all the future copies. You're paying for writes that never come!

Could the C++ standard have fixed this? Sure - the C++ standard goes into great detail about when iterators and references into strings are invalidated, and all it would have taken would be to add the copy constructor and operator=() to the list of reference-invalidating functions (section 23.1.5, for those who are following along in the standard). A second option would have been to declare that operator=() invalidates existing references for writing, but not for reading. But the standards committee preferred simplicity, convenience and safety to performance - uncharacteristically, if you'll forgive a smidgen of editorializing.

If you call a reference-invalidating function, you can, in principle, get back in the COW fast lane. Although GNU STL does not take advantage of this in all cases, it does when resizing strings:

#include <string>
using std::string;
int main(void) {
    string the_base(1024 * 1024 * 10, 'x');
    the_base[0];
    the_base.push_back('x');
    for (int i = 0; i < 100; i++) {
        string the_copy = the_base;
    }
}

Allowing the string to modify itself allowed it to know that any previous references were invalidated. We're fast again! It's called a COW, and it's not about to blow it now.

The moral is that you should avoid using operator[], the at() function, iterators, etc. to read from a non-const string - especially a long string - if it may be copied in the future. If you do use these functions, you risk paying for a write that you don't make, and for hanging onto an internal pointer that you released. Unfortunately, there's no good function for just reading from a non-const string. You can do it by adding a crazy cast:

#include <string>
using std::string;
int main(void) {
	string the_base(1024 * 1024 * 10, 'x');
	const_cast<const string &>(the_base)[0];
	for (int i = 0; i < 100; i++) {
		string the_copy = the_base;
	}
}

This hits the fast path too.

Scene. I hope you enjoyed this post. As usual the point is not to rag on C++, but to explore aspects of its design that, for whatever reason, tilt a tradeoff away from performance.