Last post presented a narrowing division algorithm, but left some questions dangling around the interpretation of intermediate values. This post answers them. It presents an alternative derivation of the improved correction step, and shows how the true remainder may be computed from the estimated remainder. At the bottom are some benchmarks.

Extending credit for the essential observation to Colin Bartlett who spotted that divllu was effectively computing a (potentially negative) remainder, represented as a difference of two non-negative uints. Thanks Colin!

A quick recap: we wish to write a function that divides a 128-bit uint by a 64-bit uint, producing a 64-bit quotient and remainder. We conceive of this as dividing a 4 digit number by a 2 digit number to produce a 2 digit quotient and remainder, where a digit is base 2^{32}. Call that 4÷2 → 2. Grade school long division (see this post) reduces the problem to 3÷2 → 2. That’s still too big for our 64-bit hardware divider, so we estimate the quotient using a 2÷1 → 2 divide, and then “correct” the estimate, achieving 3÷2 → 2. The correction step is the focus; the idea here is we can also extract the remainder from the correction.

### Theory

Working base 2^{32}, we have a 3 digit dividend $ {n_2 n_1 n_0} $, and 2 digit divisor $ {d_1 d_0} $. We wish to divide them: $q = \lfloor \tfrac {n_2 n_1 n_0} {d_1 d_0} \rfloor $. We estimate $ \hat{q} = \lfloor \tfrac {n_2 n_1} {d_1} \rfloor $, which can be computed by hardware. We know that q̂ exceeds q by no more than 2 (having normalized).

Start by computing the corresponding remainder $ \hat{r} = {n_2 n_1} - \hat{q} \times d_1 $. (Hardware often provides this “for free” alongside the quotient.) Note this “remainder of the estimate” does NOT approximate the true remainder.

Now the true remainder will be *negative* if and only if the estimated quotient is too big. For example, if we estimate 13 ÷ 3 ≅ 5, we will compute a remainder of 13 - 3×5 = -2. So if $\hat{q}$ is bigger than the true quotient $q$, the computed remainder $ r = {n_2 n_1 n_0} - \hat{q} \times {d_1 d_0} $ will be negative. Sadly we cannot directly perform this subtraction, because tons of stuff will overflow, but we can avoid overflow by “factoring out” the initial estimate. Apply some algebra:

where $ b $ is our digit base 2^{32}. These two terms correspond to c2 and c1 from last post, only now we can provide a proper interpretation:

- $ \hat{r} \times b + n_0 $ is the remainder neglecting the low divisor digit, i.e. assuming that digit is 0.
- $ \hat{q} \times {d_0} $ is the remainder contribution from the low divisor digit.

As for names, perhaps `remd1`

and `remd0`

, as these two quantities reflect the contribution to the remainder from each divisor digit.

This leads immediately to the conclusion that $ remd1 - remd0 $ yields the true remainder, after correcting it similarly to how we correct the quotient. This reverses the second question and answers in the affirmative: we can compute the true remainder using the estimated remainder. This in turn reveals a potential optimization, of subtracting instead of multiplying. Is it faster?

### Code

Recall the correction step from last post, including remainder computation:

__libdivide multiply__

Now we know how to compute the remainder via subtraction, as long as we “correct” it:

__libdivide subtract__

Even though we have “saved” a multiply, this benchmarks slower! The reason is that the compiler now emits a branch, which is mispredicted for 5% of random inputs (according to `perf`).

We can partially rescue the idea through branchless tricks:

__libdivide subtract branchless__

This is competitive with the “multiply” version, yet harder to follow. Multiplies are cheap.

### Benchmarks

A little microbenchmark: create a sequence of 16k random dividend/divisor pairs, and then measure the time to compute and sum all quotients and remainders. This is repeated 1k times with the same sequence, and the best (fastest) result for each algorithm is kept.

x86-64 has a narrowing divide instruction `divq`

and we throw that into the mix. Times are nanoseconds per divide, lower is better.

CPU | Hackers Delight | libdivide mul | libdivide sub | libdivide sub branchless | hardware divq |
---|---|---|---|---|---|

Intel i7-8700 | 23.5 | 18.6 | 21.2 | 18.2 | 19.7 |

Intel Xeon 8375C | 17.2 | 10.0 | 14.2 | 11.5 | 2.9 |

Apple M1 Mac Mini | 10.4 | 4.7 | 9.6 | 5.4 | N/A |

Sadly no real speedup from this technique.

Amusingly the best software implementation narrowly beats `divq`

on Coffee Lake (and probably earlier). This instruction is much faster on the Ice Lake Xeon; here’s why!

If you wish to run this yourself, the (messy) code is here:

`clang++ -O3 -std=c++11 divlu_benchmark.cpp && ./a.out`

### Conclusions

You really can compute the remainder directly by subtracting the two correction terms! This is a neat insight. Unfortunately it did not produce a consistent speedup, so the reference algorithm (aka “libdivide mul”) won’t change. If you have a narrowing divide, consider upgrading from the version in Hacker’s Delight to the libdivide reference, which is both faster and clearer.

Thanks for reading!