Labor of Division (Episode I)
February 15th, 2010
Here's how you divide an unsigned int by 13 in C:
unsigned divide(unsigned x) { return x / 13; }
It gets better, I promise. Here's the corresponding x64 assembly that gcc produces:
        movl    $1321528399, %edx
        movl    %edi, %eax
        mull    %edx
        shrl    $2, %edx
        movl    %edx, %eax

Instead of division, there's multiplication by a bizarre number: 1321528399. What wickedness is this?

1321528399 is an example of a "magic number:" a number that lets you substitute speedy multiplication for pokey division, as if by magic. In this post, I become a big fat spoilsport and ruin the trick for everyone.

Dividing by 13 is different than multiplying by 1321528399, so there must be more here than meets the eye. Compiling it as PowerPC reveals a bit more:

        lis r0,0x4ec4
        ori r0,r0,60495
        mulhwu r3,r3,r0
        srwi r3,r3,2

mulhwu means "multiply high word unsigned." It multiplies by 1321528399, but takes the high 32 bits of the 64 bit result, not the low 32 bits like we are used to. x86-64 does the same thing (notice the shift right instruction operates on %edx, which contains the high bits, not %eax which is the low bits). After multiplying, it shifts the result right by two. In C:

unsigned divide(unsigned x) {
   return (unsigned)(x*1321528399ULL >> 34);

So dividing by 13 is the same as multiplying by 1321528399 and then dividing by 234. That means that 234 divided by 13 would be 1321528399, right? In fact, it's 1321528398.769 and change. Pretty close, but we're not optimizing horseshoes, so how can we be sure this works all the time?


The answer to the question right above this line

It's provable, and hopefully we can get some intuition about the whole thing. In the following proof, division is exact, not that integer-truncating stuff that C does; to round down I'll use floor. Also, this only covers unsigned division: every number that appears is non-negative. (Signed division is broadly similar, though does differ in some details.) Lastly, we're going to assume that the denominator d is not a power of 2, which is weirdly important. If d were a power of 2, we'd skip the multiplication and just use shifts, so we haven't lost any generality.

We have some fixed positive denominator d and some variable nonegative numerator n, and we want to compute the quotient $\frac n d$ - no, wait, $\lfloor \frac n d \rfloor$, since that's what C does. We'll multiply the top and bottom by 2k, where k is some positive integer power - it represents the precision in a sense, and we'll pick its value later:

$\lfloor \frac n d \rfloor = \lfloor \frac n d \times \frac {2^k} {2^k} \rfloor = \lfloor \frac {2^k} d \times \frac n {2^k} \rfloor$

We're going to call that $\frac {2^k} d$ term mexact, because it's our "magic number" that lets us compute division through multiplication. So we have:
$m_{exact} = \frac {2^k} d $

$ \lfloor \frac n d \rfloor = \lfloor m_{exact} \times \frac n {2^k} \rfloor$

This equality looks promising, because we've hammered our expression into the shape we want; but we haven't really done anything yet. The problem is that mexact is a fraction, which is hard for computers to multiply. (We know it's a fraction because d is not a power of 2). The tricky part is finding an integer approximation of mexact that gives us the same result for any dividend up to the largest possible one (UINT_MAX in our case). Call this approximation m. So we have $m \approx m_{exact} $, where m is an integer.

When it comes to approximation, closer is better, so let's start by just rounding m down: $m = \lfloor m_{exact} \rfloor$. Since mexact cannot be an integer, m must be strictly less than mexact. Does this m work, which is to say, does it behave the same as mexact over the range we care about?

We'll try it when the numerator equals the denominator: n = d. Of course then the quotient is supposed to be 1. This leads to:

$\frac d d = m_{exact} \times \frac d {2^k}$
$\implies \frac d d > m \times \frac d {2^k}$
$\implies 1 > \lfloor m \times \frac d {2^k} \rfloor$

That's bad, because that last expression is supposed to be 1. The approximation $m = \lfloor m_{exact} \rfloor$ is too small, no matter what k is.

Ok, let's round up instead: $m = \lceil m_{exact} \rceil = \lceil \frac {2^k} d \rceil$. Does this value for m work?

Now, $\lceil \frac {2^k} d \rceil$ just means dividing 2k by d and then rounding up. That's the same as rounding up 2k to the next multiple of d first, and then dividing exactly by d. (We know that 2k is not itself a multiple of d because d is not a power of 2!) So we can write:

$m = \lceil \frac {2^k} d \rceil = \frac {2^k + e} d$.

where e is the "thing to add to get to the next multiple of d," which means that 0 < e < d. (Formally, e = d - 2k mod d.) This value e is sort of a measure of how much "damage" the ceil does - that is, how bad the approximation m is.

So let's plug this new m into our division expression, and then apply some algebra:

$\lfloor m \times \frac n {2^k} \rfloor = \lfloor \frac {2^k + e} d \times \frac n {2^k} \rfloor = \lfloor \frac n d + \frac e d \times \frac n {2^k} \rfloor $

This last expression is really cool, because it shows the wholesome result we're after, $\lfloor \frac n d \rfloor$, and also an insidious "error term:" that $\frac e d \times \frac n {2^k}$, which represents how much our ceil is causing us to overestimate. It also shows that we can make the error term smaller by cranking up k: that's the sense in which k is a precision.


K's Choice

We could pick some huge precision k, get a very precise result, and be done. But we defined m to be $\lceil \frac {2^k} d \rceil$, so if k is too big, then m will be too big to fit in a machine register and we can't efficiently multiply by it. So instead let's pick the smallest k that's precise enough, and hope that fits into a register. Since there are several variables involved, our strategy will be to find upper bounds for them, replace the variables with their upper bounds, and simplify the expression until we can solve for k.

So we want the smallest k so that $\lfloor \frac n d \rfloor = \lfloor \frac n d + \frac e d \times \frac n {2^k} \rfloor $. This will be true if the error term $\frac e d \times \frac n {2^k}$ is less than 1, because then the floor operation will wipe it out, right? Oops, not quite, because there's also a fractional contribution from $\frac n d$. We need to be sure that the error term, plus the fractional contribution of $\frac n d$, is less than 1 for that equality to be true.

We'll start by putting an upper bound on the fractional contribution. How big can it be? If you divide any integer by d, the fractional part of the result is no more than $\frac {d - 1} d$. Therefore the fractional contribution of $\frac n d$ is at most $\frac {d - 1} d$, and so if our error term is less than $\frac 1 d$, it will be erased by the floor and our equality will be true.

So k will be big enough when   $\frac e d \times \frac n {2^k} < \frac 1 d$.

Next we'll tackle $\frac e d$. We know from up above that e is at most d-1, so we have $\frac e d < 1$. So we can ignore that factor: $\frac n {2^k} < \frac 1 d \implies \frac e d \times \frac n {2^k} < \frac 1 d$.

The term $\frac n {2^k}$ has a dependence on n, the numerator. How big can the numerator be? Let's say we're dividing 32 bit unsigned integers: n ≤ 232. (The result is easily extended to other widths). We can make $\frac n {2^k}$ less than 1 by picking k = 32. To make it even smaller, less than $\frac 1 d$, we must increase k by $\log_2 d$. That is,

$k > 32 + \log_2 d \implies \frac n {2^k} < \frac 1 d$

Since we want k to be an integer, we have to round this up: $k = 32 + \lceil \log_2 d \rceil$. We know that value for k works.

What does that precision do to m, our magic number? Maybe this k makes m too big to fit into a 32 bit register! We defined $m = \lceil \frac {2^k} d \rceil$; plugging in k we have $m = \lceil \frac {2^{32 + \lceil log_2 d \rceil} } d \rceil $.

This is a gross expression, but we can put an upper bound on it. Note that $d < {2 ^ { \lceil log_2 d \rceil } }$, so we can write the following:

$m <= \lceil \frac {2^{32 + \lceil log_2 d \rceil} } {2 ^ { \lfloor log_2 d \rfloor } } \rceil = \lceil {2^{32 + \lceil log_2 d \rceil - \lfloor log_2 d \rfloor} } \rceil $

$ = \lceil {2 ^ { 32 + 1}} \rceil = 2^{33}$

Nuts! Our approximation m just barely does not fit into a 32 bit word!

Since we rounded some stuff, you might think that we were too lazy and a more careful analysis would produce a tighter upper bound. But in fact our upper bound is exact: the magic number for an N-bit division really may need to be as large as N+1 bits. The good news is, there may be smaller numbers that fit in N bits for specific divisors. More on that later!


We are fruitful and multiply

In any case, we now have our magic number m. To perform the division, we need to compute $ \lfloor \frac {m n} {2^k} \rfloor$ as efficiently as possible. m is a 33 bit number, so how can a 32 bit processor efficiently multiply by that? One way would be to use a bignum library, but that would be too slow. A better approach is to multiply by the low 32 bits, and handle the 33rd bit specially, like this:

$$\begin{align} m n & = (m - 2^{32} + 2^{32})n \\ & = (m - 2^{32})n + 2^{32} n \end{align}$$

This looks very promising: m - 232 is definitely a 32 bit number. Furthermore, we can compute that 232n term very easily: we don't even need to shift, just add n to the high word of the product. Unfortunately, it's not right: a 32 bit number times a 33 bit number is 65 bits, so the addition overflows.

But we can prevent the overflow by distributing one of the 2s in the denominator, through the following trick: $$ \begin{align} \lfloor \frac {n + q} {2^k} \rfloor & = \lfloor \frac {n - q + 2q} {2^k} \rfloor \\ & = \lfloor ( \frac {n - q + 2q} 2 ) / {2^{k-1}} \rfloor \\ & = \lfloor ( \lfloor \frac {n - q} 2 \rfloor + q ) / {2^{k-1}} \rfloor \end{align} $$

Here q is the magic number (minus that 33rd bit) times n, and n is just n, the numerator. (Everything should be multiplied by 232, but that's sort of implicit in the fact that we are working in the register containing the high product, so we can ignore it.)

Can the subtraction n - q underflow? No, because q is just n times the 32 bit magic number, and then divided by 232. The magic number (now bereft of its 33rd bit) is less than 232, so we have q < n, so n - q cannot underflow.

What about that +q? Can that overflow? No, because we can just throw out the floor to get an upper bound of $\frac {n + q} 2$ which is a 32 bit number.

So we have a practical algorithm. Given a dividend n and a fixed divisor d, where 0 < d < 232 and 0 ≤ n < 232:

  1. Precompute $p = \lceil log_2 d \rceil$
  2. Precompute $m = \lceil \frac {2^{32 + p}} d \rceil$. This will be a 33 bit number, so keep only the low 32 bits.
  3. Compute $q = (m \times n) \gg 32$. Most processors can do this with one "high multiply" instruction.
  4. Compute $t = ((n - q) \gg 1) + q$, which is an overflow-safe way of computing (n + q) ≫ 1. This extra addition corrects for dropping the 33rd bit of m.
  5. Perform the remaining shift p: $t = t ≫ (p-1)$
That's it! We know how to efficiently replace division with multiplication. We can see all these steps in action. Here is the assembly that Clang generates for division by 7, along with my commentary:
 movl $613566757, %ecx  613566757 is the low 32 bits of (2**35)/7, rounded up
 movl %edi, %eax 
 mull %ecx              Multiply the dividend by the magic number
 subl %edx, %edi        The dividend minus the high 32 bits of the product in %edx (%eax has the low)
 shrl %edi              Initial shift by 1 to prevent overflow
 addl %edx, %edi        Add in the high 32 bits again, correcting for the 33rd bit of the magic number
 movl %edi, %eax        Move the result into the return register
 shrl $2, %eax          Final shift right of floor(log_2(7))

Improving the algorithm for particular divisors

This algorithm is the best we can do for the general case, but we may be able to improve on this for certain divisors. The key is that e term: the value we add to get to the next multiple of d. We reasoned that e was less than d, so we used $\frac e d < 1$ as an upper bound for e. But it may happen that a multiple of d is only slightly larger than a power of 2. In that case, e will be small, and if we're lucky, it will be small enough to push the entire "error term" under $\frac 1 d$.

For example, let's try it for the divisor 11. With the general algorithm, we would need $k = 32 + \lceil log_2 11 \rceil = 36$. But what if we choose k = 35 instead? 235 + 1 is a multiple of 11, so we have e = 1, and we compute:

$\frac e d \times \frac n {2^k} = \frac 1 d \times \frac n {2^{35}} < \frac 1 d$

So k = 35 forces the error term to be less than $\frac 1 d$, and therefore k = 35 is a good enough approximation. This is good news, because then $m = \lceil \frac {2^{35}} {11} \rceil = 3123612579 < 2^{32}$, so our magic number fits in 32 bits, and we don't need to worry about overflow! We can avoid the subtraction, addition, and extra shifting in the general algorithm. Indeed, clang outputs:

  movl  $-1171354717, %ecx
  movl  %edi, %eax
  mull  %ecx
  movl  %edx, %eax
  shrl  $3, %eax

The code for dividing by 11 is shorter than for dividing by 7, because a multiple of 11 happens to be very close to a power of 2. This shows the way to an improved algorithm: We can simply try successive powers of 2, from 32 up to $32 + \lfloor log_2 d \rfloor$, and see if any of them happen to be close enough to a multiple of d to force the error term below $\frac 1 d$. If so, the corresponding magic number fits in 32 bits and we can generate very efficient code. If not, we can always fall back to the general algorithm.

The power 32 will work if $\frac e d \leq \frac 1 d$, the power 33 will work if $\frac e d \leq \frac 2 d$, the power 34 will work if $\frac e d \leq \frac 4 d$, etc., up to $32 + \lfloor log_2 d \rfloor$, which will work if $\frac e d \leq \frac 1 2$. By "chance", we'd expect this last power to work half the time, the previous power to work 25% of the time, etc. Since we only need one, most divisors actually should have an efficient, 32 bit or fewer magic number. The infinite series $\frac 1 2 + \frac 1 4 + \frac 1 8$ sums to 1, so our chances get better with increasing divisors.

It's interesting to note that if a divisor d has one of these more efficient magic numbers for a power $k < 32 + \lceil log_2 d \rceil$, it also has one for all higher powers. This is easy to see: if 2k + e is a multiple of d, then 2k+1 + 2e is also a multiple of d.

$\frac e d \times \frac n {2^k} \leq \frac 1 d \implies \frac {2e} d \times \frac n {2^{k+1}} \leq \frac 1 d $

This is good news. It means that we only have to check one case, $ \lceil \frac {2^{32 + \lfloor log_2 d \rfloor}} d \rceil$ (a 32 bit value) to see if there is a more efficient magic number, because if any smaller power works, that one works too. If that that power fails, we can go to the 33 bit number, which we know must work. This is useful information in case we are computing magic numbers at runtime.

Still, gcc and LLVM don't settle for just any magic number - both try to find the smallest magic number. Why? There's a certain aesthetic appeal in not using bigger numbers than necessary, and most likely the resulting smaller multiplier and smaller shifts are a bit faster. In fact, in a very few cases, the resulting shift may be zero! For example, the code for diving by 641:

        movl    $6700417, %ecx
        movl    %edi, %eax
        mull    %ecx
        movl    %edx, %eax

No shifts at all! For this to happen, we must have $\frac e d \le \frac 1 d$, which of course means that e = 1, so 641 must evenly divide 232 + 1. Indeed it does.

This inspires a way to find the other "super-efficient" shiftless divisors: compute the factors of 232 + 1. Sadly, the only other factor is 6700417. I have yet to discover an occasion to divide by this factor, but if I do, I'll be ready.


So that's how it works

Did you skip ahead? It's OK. Here's the summary. Every divisor has a magic number, and most have more than one! A magic number for d is nothing more than a precomputed quotient: a power of 2 divided by d and then rounded up. At runtime, we do the same thing, except backwards: multiply by this magic number and then divide by the power of 2, rounding down. The tricky part is finding a power big enough that the "rounding up" part doesn't hurt anything. If we are lucky, a multiple of d will happen to be only slightly larger than a power of 2, so rounding up doesn't change much and our magic number will fit in 32 bits. If we are unlucky, well, we can always fall back to a 33 bit number, which is almost as efficient.

I humbly acknowledge the legendary Guy Steele Henry Warren (must have been a while) and his book Hacker's Delight, which introduced me to this line of proof. (Except his version has, you know, rigor.)