Labor of Division (Episode III): Faster Unsigned Division by Constants

This is a technique fish thought up for improving the performance of unsigned integer division by certain "uncooperative" constants. It does not seem to appear in the literature, nor is it implemented in gcc, llvm, or icc, so fish is optimistic that it is original.

As is well known (and seen in a previous post), compilers optimize unsigned division by constants into multiplication by a "magic number." But not all constants are created equal, and approximately 30% of divisors require magic numbers that are one bit too large, which necessitates special handling (read: are slower). Of these 30%, slightly less than half (46%) are even, which can be handled at a minimum of increased expense (see below); the remaining odd divisors (659 million, including well known celebrity divisors like 7) need a relatively expensive "fixup" after the multiplication. Or so we used to think. This post gives a variation on the usual algorithm that improves performance for these expensive divisors.

This post presents the algorithm, proves it is correct, proves that it applies in every case we care about, and demonstrates that it is faster. It also contains a reference implementation of the full "magic number" algorithm, incorporating this and all known techniques. In other words, it's so darn big that it requires a table of contents.

Background

Unsigned integer division is one of the slowest operations on a modern microprocessor. When the divisor is known at compile time, optimizing compilers do not emit division instructions, but instead either a bit shift (for a power of 2), or a multiplication by a sort of reciprocal (for non-powers of 2). This second case involves the identity:

$\lfloor \frac n d \rfloor = \lfloor \frac n d \times \frac {2^k} {2^k} \rfloor = \lfloor \frac {2^k} d \times \frac n {2^k} \rfloor$

As d is not a power of 2, $\frac {2^k} d$ is always a fraction. It is rounded up to an integer, which is called a "magic number" because multiplying by it performs division, as if by magic. The rounding-up introduces error into the calculation, but we can reduce that error by increasing k. If k is big enough, the error gets wiped out entirely by the floor, and so we always compute the correct result.

The dividend (numerator) is typically an N bit unsigned integer, where N is the size of a hardware register. For most divisors, k can be small enough that a valid magic number can also fit in N bits or fewer. But for many divisors, there is no such magic number. 7, 14, 19, 31, 42...these divisors require an N+1 bit magic number, which introduces inefficiences, as the magic number cannot fit in a hardware register.

Let us call such divisors "uncooperative." The algorithm presented here improves the performance of dividing by these uncooperative divisors by finding a new magic number which is no more than N bits. The existing algorithm that generates an N+1 bit magic number for uncooperative divisors will be referred to as the "round-up algorithm", because it rounds the true magic number up. The version presented here will be called the "round-down algorithm". We will say that an algorithm "fails" or "succeeds" according to whether it produces a magic number of N bits or fewer; we will show that either the round-up or round-down algorithm (or both) must succeed for all divisors.

A Shift In Time Saves Fourteen

For completeness, it is worth mentioning one additional technique for uncooperative divisors that are even. Consider dividing a 32 bit unsigned integer by 14. The smallest valid magic number for 14 is 33 bits, which is inefficient. However, instead of dividing by 14, we can first divide by 2, and then by 7. While 7 is also uncooperative, the divide by 2 ensures the dividend is only a 31 bit number. Therefore the magic number for the subsequent divide-by-7 only needs to be 32 bits, which can be handled efficiently.

This technique effectively optimizes division by even divisors, and is incorporated in the reference code provided later. Now we present a technique applicable for odd divisors.

Motivation (aka What Goes Up Can Also Go Down)

First, an appeal to intuition. A divisor is uncooperative in the round-up algorithm because the rounding-up produces a poor approximation. That is, $\frac {2^k} d$ is just slightly larger than some integer, so the approximation $\lceil \frac {2^k} d \rceil$ is off by nearly one, which is a lot. It stands to reason, then, that we could get a better approximation by floor instead of ceil: $m = \lfloor \frac {2^k} d \rfloor$.

A naïve attempt to apply this immediately runs into trouble. Let d be any non-power-of-2 divisor, and consider trying to divide d by itself by multiplying with this magic number:

$ \lfloor \frac {2^k} d \rfloor < \frac {2^k} d \implies $

$ \lfloor \frac {2^k} d \rfloor \times \frac d {2^k} < \frac {2^k} d \times \frac d {2^k} \implies $

$ \lfloor \lfloor \frac {2^k} d \rfloor \times \frac d {2^k} \rfloor < 1 $

(Could we replace the outer floor by a ceil? The floor is implemented by a right shift, which throws away the bits that are shifted off. We could conjure up a "rounding up" right shift, and that might work, though it would likely be more expensive than the instructions it replaces.)

So rounding down causes us to underestimate the result. What if we tried to counteract that by incrementing the numerator first?

$\lfloor \frac n d \rfloor \ \stackrel{?}{=} \ \lfloor \lfloor \frac {2^k} d \rfloor \times \frac {\color{#FF3030}{n+1}} {2^k} \rfloor $

Proof of Correctness

First we must show that the round-down algorithm actually works. We proceed much like the proof for the round-up algorithm. We have a known constant d and a runtime variable n, both N bit values. We want to find some k that ensures:

where: $$ \small \begin{align} \small & m = \lfloor \frac {2^k} d \rfloor \\ & 0 \le n < 2^{N} \\ & 0 < d < 2^{N} \\ & \text{d not a power of 2} \end{align} $$

$$\begin{align} \lfloor m \times \frac {n+1} {2^k} \rfloor & = \lfloor \frac {2^k - e} d \times \frac {n + 1} {2^k} \rfloor \\ & = \lfloor \frac {n + 1} d \times \frac {2^k - e} {2^k} \rfloor \\ & = \lfloor \frac {n + 1} d \times ( 1 - \frac e {2^k} ) \rfloor \\ & = \lfloor \frac {n+1} d - \frac {n+1} d \times \frac e {2^k} \rfloor \\ & = \lfloor \frac n d + \frac 1 d - \frac e d \times \frac {n+1} {2^k} \rfloor \end{align} $$

We hope that this equals $\lfloor \frac n d \rfloor$. Within the floor, we see the result, plus two terms of opposite signs. We want the combination of those terms to cancel out to something at least zero, but small enough to be wiped out by the floor. Let us compute the fractional contribution of each term, and show that it is at least zero but less than one.

The fractional contribution of the $\frac n d$ term can be as small as zero and as large as $\frac {d-1} d$. Therefore, in order to keep the whole fractional part at least zero but below one, we require:

The term $\frac e d \times \frac {n+1}{2^k}$ is always positive, so the $< \frac 1 d$ is easily satisfied. It remains to show it is at least zero. Rearranging:

$ 0 \le \frac 1 d - \frac e d \times \frac {n+1} {2^k} \implies \frac e d \times \frac {n+1} {2^k} \le \frac 1 d $

This is very similar to the condition required in the round-up algorithm! Let's continue to simplify, using the fact that n < 2^N.

$ \frac e d \times \frac {n+1} {2^k} \le \frac 1 d$

$ e \times \frac {n+1} {2^k} \le 1 $

$ \frac e {2^{k-N}} \le 1 $

$ e \le 2^{k-N} $

This is the condition that guarantees that our magic number m works. In summary, pick some k ≥ N, and compute $ \small e = 2^k \mod{d} $. If the resulting e ≤ 2^k-N, the algorithm is guaranteed to produce the correct result for all N-bit dividends.

Proof of Universality (aka Your Weakness Is My Strength)

When will this condition be met? Recall the hand-waving from before: the round-up algorithm failed because rounding up produced a poor approximation, so we would expect rounding down to produce a good approximation, which would make the round-down algorithm succeed. Optimistically, we'd hope that round-down will succeed any time round-up fails! Indeed that is the case, and we can formally prove it now.

Here e_up refers to the difference produced by rounding 2^k up to a multiple of d, as in the round-up algorithm, while e_down refers to the difference produced by rounding down to a multiple of d as in round-down. An immediate consequence is e_up + e_down = d.

Recall from the round-up algorithm that we try successive values for k, with the smallest k guaranteed to work equal to $ \small N + \lceil log_2 d \rceil $. Unfortunately, this k produces a magic number of N+1 bits, and so too large to fit in a hardware register. Let's consider the k just below it, which produces a magic number of N bits:

Assume that d is uncooperative, i.e. the magic number for this power was not valid in the round-up algorithm. It would have been valid if $ e_{up} < 2^{\lfloor log_2 d \rfloor} $; because it was not valid we must have $ e_{up} \ge 2^{\lfloor log_2 d \rfloor} $. Substituting in:

$$\begin{align} e_{up} & \ge 2^{\lfloor log_2 d \rfloor} \implies \\ d - e_{down} & \ge 2^{\lfloor log_2 d \rfloor} \implies \\ e_{down} & \le d - 2^{\lfloor log_2 d \rfloor} \implies \\ e_{down} & \le 2^{\lceil log_2 d \rceil} - 2^{\lfloor log_2 d \rfloor} \implies \\ e_{down} & \le 2 \times 2^{\lfloor log_2 d \rfloor} - 2^{\lfloor log_2 d \rfloor} \implies \\ e_{down} & \le 2^{\lfloor log_2 d \rfloor} \implies \\ e_{down} & \le 2^{k-N} \end{align} $$

Thus we've satisfied the condition determined in the proof of correctness. This is an important and remarkable result: the round-down algorithm is guaranteed to have an efficient magic number whenever round-up does not. If the implementation of round-down can be shown to be more efficient, the overflow case for the round-up algorithm can be discarded entirely.

Recap

Here's the practical algorithm. Given a dividend n and a fixed divisor d, where 0 ≤ n < 2^N and 0 < d < 2^N, and where the usual round-up algorithm failed to find an N-bit magic number:

Overflow Handling

This algorithm has a wrinkle. Because n is an N-bit number, it may be as large as 2^N - 1, in which event the n+1 term will be an N+1 bit number. If the value is simply incremented in an N-bit register, the dividend will wrap to zero, and the quotient will in turn be zero. Here we present two strategies for efficiently handling the possibility of modulo overflow.

Distributed Multiply Strategy

This is a 2N-bit quantity and so cannot overflow. For efficient implementation, this requires that the low half of the m x n product be available "for free," so that the sum can be performed and any carry transmitted to the high half. Many modern architectures produce both halves with one instruction, such as Intel x86 (the MUL instruction) or ARM (UMULL). It is also available if the register width is twice the bit size of the type, e.g. performing a 32 bit divide on a 64 bit processor.

Saturating Increment Strategy

However, other processors compute the low and high halves separately, such as PowerPC; in this case computing the lower half of the product would be prohibitively expensive, and so a different strategy is needed. A second, surprising approach is to simply elide the increment if n is already at its maximum value, i.e. replace the increment with a "saturating increment" defined by:

$$ \small \text{SaturInc}(x) = \begin{cases} x+1 & \text{ if } x < 2^N-1 \\ x & \text{ if } x = 2^N-1 \end{cases} $$

It is not obvious why this should work: we needed the increment in the first place, so how can we just skip it? We must show that replacing increment with SaturInc will compute the correct result for 2^N - 1. A proof of that is presented below.

Proof of Correctness when using Saturating Increment

Consider the practical algorithm presented above, with the +1 replaced by saturating increment. If $ \small n < 2^N-1 $, then saturating increment is the same as +1, so the proof from before holds. Therefore assume that $ \small n = 2^N-1 $, so that incrementing n would wrap to 0.

By inspection, $ \small \text{SaturInc}(2^N - 1) = \text{SaturInc}(2^N - 2) $. Because the algorithm has no other dependence on n, replacing the +1 with SaturInc effectively causes the algorithm to compute the quotient $ \lfloor \frac {2^N - 2} d \rfloor $ when n = 2^N-1.

Now d either is or is not a factor of 2^N-1. Let's start by assuming it is not a factor. It is easy to prove that, if x and y are positive integers and y is not a factor of x, then $ \lfloor \frac x y \rfloor = \lfloor \frac {x-1} y \rfloor $. Therefore it must be true that $ \lfloor \frac {2^N - 1} d \rfloor = \lfloor \frac {2^N - 2} d \rfloor $, so the algorithm computes the correct quotient.

Now let us consider the case where d is a factor of 2^N-1. We will prove that d is cooperative, i.e. the round-up algorithm produced an efficient N-bit result for d, and therefore the round-down algorithm is never employed. Because d is a factor of 2^N-1, we have $ \small 2^N\ \bmod{d} = 1 $. Consider once again the case of the "last N-bit magic number," i.e.:

Recall that the round-up algorithm computes $ \small e_{up} = d - (2^k\ \bmod{d}) $. This power is acceptable to the round-up algorithm if $ \small e_{up} \leq 2^{k - N} = 2^{\lfloor log_2 d \rfloor} $. Consider:

$$ \begin{align} 2^k\ \bmod{d} & = 2^{N + \lfloor log_2 d \rfloor}\ \bmod{d} \\ & = 2^N \times 2^{\lfloor log_2 d \rfloor}\ \bmod{d} \\ & = 1 \times 2^{\lfloor log_2 d \rfloor}\ \bmod{d} \\ & = 2^{\lfloor log_2 d \rfloor} \end{align} $$

$$ \begin{align} e_{up} & = d - 2^{\lfloor log_2 d \rfloor} \\ e_{up} & < 2^{\lceil log_2 d \rceil} - 2^{\lfloor log_2 d \rfloor} \\ & < 2 \times 2^{\lfloor log_2 d \rfloor} - 2^{\lfloor log_2 d \rfloor} \\ & < 2^{\lfloor log_2 d \rfloor} \end{align} $$

Thus the power k is acceptable to the round-up algorithm, so d is cooperative and the round-down algorithm is never employed. Thus a saturating increment is acceptable for all uncooperative divisors. Q.E.D.

(As an interesting aside, this last proof demonstrates that all factors of 2^N-1 "just barely" have efficient N-bit magic numbers. For example, the divisor 16,711,935 is a factor of 2³²-1, and its magic number, while N bits, requires a shift of 23, which is large; in fact it is the largest possible shift, as the floor of the base 2 log of that divisor. But increase the divisor by just one (16711936) and only a 16 bit shift is necessary.)

In summary, distributing the multiplication or using a saturating increment are both viable strategies for avoiding wrapping in the n+1 expression, ensuring that the algorithm works over the whole range of dividends. Implementations can use whichever technique is most efficient¹.

Practical Implementation

The discussion so far is only of theoretical interest; it becomes of practical interest if the round-down algorithm can be shown to outperform round-up on uncooperative divisors. This is what will be demonstrated below for x86 processors.

x86 processors admit an efficient saturating increment via the two-instruction sequence add 1; sbb 0; (i.e. "add; subtract 0 with borrow"). They also admit an efficient distributed multiply. The author implemented this optimization in the LLVM compiler using both strategies in turn, and then compiled the following C code which simply divides a value by 7, using clang -O3 -S -arch i386 -fomit-frame-pointer (this last flag for brevity):

The round-down algorithms not only avoid the three-instruction overflow handling, but also avoid needing to store the dividend past the multiply (notice the highlighted MOVL instruction in the round-up algorithm). The result is a net saving of two instructions. Also notice that the variants require fewer registers, which suggests there might be even more payoff (i.e. fewer register spills) when the divide is part of a longer code sequence.

(In the distributive variant the compiler has made the dubious choice to emit the same immediate twice instead of placing it in a register. This is especially deleterious in the loop microbenchmark shown below, because loading the immediate into the register could be hoisted out of the loop. To address this, the microbenchmark tests both the assembly as generated by LLVM, and a version tweaked by hand to address this suboptimal codegen.)

As illustrated, both strategies require only two instructions on x86, which is important because the overhead of the round-up algorithm is three to four instructions. Many processor architectures admit a two-instruction saturating increment through the carry flag².

Microbenchmark

To measure the performance, the author compiled a family of functions. Each function accepts an array of unsigned ints, divides them by a particular uncooperative divisor, and returns the sum; for example:

Each function in the family had very similar machine code; a representative sample is:

A simple test harness was constructed and the above functions were benchmarked to estimate the time per divide. The benchmark was compiled with clang on -O3, and run on a 2.93 GHz Core i7 iMac. Test runs were found to differ by less than .1%.

Nanoseconds Per Divide

Divisor

Round Up

Saturating Increment

Distribute (as generated)

Distribute (hand tweaked)

i386
7

1.632

1.484
9.1%

1.488
8.9%

1.433
12.2%

uint32
37

1.631

1.483
9.1%

1.486
8.9%

1.433
12.1%

123

1.633

1.484
9.1%

1.488
8.9%

1.432
12.3%

763

1.632

1.483
9.1%

1.487
8.9%

1.432
12.2%

1247

1.633

1.484
9.1%

1.491
8.7%

1.433
12.2%

9305

1.631

1.484
9.0%

1.491
8.6%

1.439
11.7%

13307

1.632

1.483
9.1%

1.489
8.7%

1.437
11.9%

52513

1.631

1.483
9.1%

1.490
8.7%

1.432
12.2%

60978747

1.631

1.484
9.0%

1.488
8.8%

1.434
12.1%

106956295

1.631

1.484
9.0%

1.489
8.7%

1.433
12.1%

x86_64
7

1.537

1.307
14.9%

1.548
-0.7%

1.362
11.4%

uint32
37

1.538

1.307
15.0%

1.548
-0.7%

1.362
11.4%

123

1.537

1.319
14.2%

1.547
-0.6%

1.361
11.5%

763

1.536

1.306
15.0%

1.547
-0.8%

1.356
11.7%

1247

1.538

1.322
14.1%

1.549
-0.7%

1.358
11.7%

9305

1.543

1.322
14.3%

1.550
-0.5%

1.361
11.8%

13307

1.545

1.322
14.4%

1.550
-0.3%

1.357
12.1%

52513

1.541

1.307
15.2%

1.550
-0.6%

1.361
11.7%

60978747

1.538

1.322
14.0%

1.549
-0.7%

1.358
11.7%

106956295

1.537

1.322
14.0%

1.551
-0.9%

1.360
11.5%

x86_64
7

1.823

1.588
12.9%

1.505
17.4%

n/a

uint64
39

1.821

1.589
12.7%

1.506
17.3%

n/a

123

1.821

1.592
12.6%

1.506
17.3%

n/a

763

1.822

1.592
12.6%

1.505
17.4%

n/a

1249

1.822

1.589
12.8%

1.506
17.4%

n/a

9311

1.822

1.587
12.9%

1.507
17.3%

n/a

11315

1.822

1.588
12.8%

1.506
17.4%

n/a

52513

1.823

1.591
12.7%

1.506
17.4%

n/a

60978749

1.822

1.590
12.7%

1.507
17.3%

n/a

106956297

1.821

1.588
12.8%

1.506
17.3%

n/a

	Divisor	Round Up	Saturating Increment		Distribute (as generated)		Distribute (hand tweaked)
i386	7	1.632	1.484	9.1%		1.488	8.9%	1.433	12.2%
uint32	37	1.631	1.483	9.1%		1.486	8.9%	1.433	12.1%
	123	1.633	1.484	9.1%		1.488	8.9%	1.432	12.3%
	763	1.632	1.483	9.1%		1.487	8.9%	1.432	12.2%
	1247	1.633	1.484	9.1%		1.491	8.7%	1.433	12.2%
	9305	1.631	1.484	9.0%		1.491	8.6%	1.439	11.7%
	13307	1.632	1.483	9.1%		1.489	8.7%	1.437	11.9%
	52513	1.631	1.483	9.1%		1.490	8.7%	1.432	12.2%
	60978747	1.631	1.484	9.0%		1.488	8.8%	1.434	12.1%
	106956295	1.631	1.484	9.0%		1.489	8.7%	1.433	12.1%

x86_64	7	1.537	1.307	14.9%		1.548	-0.7%	1.362	11.4%
uint32	37	1.538	1.307	15.0%		1.548	-0.7%	1.362	11.4%
	123	1.537	1.319	14.2%		1.547	-0.6%	1.361	11.5%
	763	1.536	1.306	15.0%		1.547	-0.8%	1.356	11.7%
	1247	1.538	1.322	14.1%		1.549	-0.7%	1.358	11.7%
	9305	1.543	1.322	14.3%		1.550	-0.5%	1.361	11.8%
	13307	1.545	1.322	14.4%		1.550	-0.3%	1.357	12.1%
	52513	1.541	1.307	15.2%		1.550	-0.6%	1.361	11.7%
	60978747	1.538	1.322	14.0%		1.549	-0.7%	1.358	11.7%
	106956295	1.537	1.322	14.0%		1.551	-0.9%	1.360	11.5%

x86_64	7	1.823	1.588	12.9%		1.505	17.4%	n/a
uint64	39	1.821	1.589	12.7%		1.506	17.3%	n/a
	123	1.821	1.592	12.6%		1.506	17.3%	n/a
	763	1.822	1.592	12.6%		1.505	17.4%	n/a
	1249	1.822	1.589	12.8%		1.506	17.4%	n/a
	9311	1.822	1.587	12.9%		1.507	17.3%	n/a
	11315	1.822	1.588	12.8%		1.506	17.4%	n/a
	52513	1.823	1.591	12.7%		1.506	17.4%	n/a
	60978749	1.822	1.590	12.7%		1.507	17.3%	n/a
	106956297	1.821	1.588	12.8%		1.506	17.3%	n/a

Microbenchmark results for tested division algorithms on a Core i7. The top group is for 32 bit division in a 32 bit binary, while the bottom two groups are 32 bit and 64 bit division (respectively) in a 64 bit binary.

Times are in nanoseconds per divide (lower is better). Percentages are percent improvement from the Round Up algorithm (higher is better).

These results indicate that the round-down algorithms are indeed faster by 9%-17% (excluding the crummy codegen, which should be fixed in the compiler). The benchmark source code is available at http://ridiculousfish.com/files/division_benchmarks.tar.gz.

Extension to Signed Division

A natural question is whether the same optimization could improve signed division; unfortunately it appears that it does not, for two reasons:

Thus it appears that the round-down algorithm could be made to work in signed division, but will underperform the standard round-up algorithm.

Reference Code

The reference implementation for computing the magic number due to Henry Warren ("Hacker's Delight") is rather dense, and it may not be obvious how to incorporate the improvements presented here. To ease adoption, we present a reference implementation written in C that incorporates all known optimizations, including the round-down algorithm.

Conclusion

The following algorithm is an alternative way to do division by "uncooperative" constants, which may outperform the standard algorithm that produces an N+1 bit magic number. Given a dividend n and a fixed divisor d, where 0 ≤ n < 2^N and 0 < d < 2^N, and where the standard algorithm failed to find a N-bit magic number:

On a Core i7 x86 processor, a microbenchmark showed that this variant "round down" algorithm outperformed the standard algorithm in both 32 bit and 64 bit modes by 9% to 17%, and in addition generated shorter code that used fewer registers. Furthermore, the variant algorithm is no more difficult to implement than is the standard algorithm. The author has provided a reference implementation and begun some preliminary work towards integrating this algorithm into LLVM, and hopes other compilers will adopt it.

Footnotes

1. Of course, if n can statically be shown to not equal 2^N-1, then the increment can be performed without concern for modulo overflow. This likely occurs frequently due to the special nature of the value 2^N-1.

2. Many processor architectures admit a straightforward saturating increment by use of the carry flag. PowerPC at first blush appears to be an exception: it has somewhat unusual carry flag semantics, and the obvious approach requires three instructions:

The result is in r3. r2 can be replaced by any temporary register; its value can be discarded.