Labor of Division (Episode V): Faster Narrowing Division
April 29th, 2021

Last post explored Algorithm D, and some improvements for the 3 ÷ 2 = 1 digit case. This post presents a narrowing division algorithm, improving upon the widely used “divlu” function from Hacker’s Delight. I am optimistic that these ideas are original.

`divlu` is a narrowing division: it divides a 4 digit number by a 2 digit number, producing a 2 digit quotient and remainder. This is achieved by specializing Algorithm D for 4-digit dividend and 2-digit divisor.

`q̂` is an estimated quotient digit. It is quite close to the quotient digit `q`; it may be exactly the same, or larger by 1 or 2, and that’s all! The tricky parts of `divlu` is “correcting” `q̂` to the true digit, by subtracting 0, 1, or 2.

Here is the correction step from Hacker’s Delight:

Note this is used twice, once for each quotient digit. There are three improvements possible here:

#### Unroll the loop, whose trip count is at most 2

`divlu` loops until the digit is correct, but we know that `q̂` may exceed `q` by no more than 2. So just correct `q̂` at most twice, saving some arithmetic and a branch.

The final Algorithm D is somewhat vague here: it says to merely “repeat the check”, and it is not obvious from the text if you need to repeat it once or infinite times. A conservative reading suggests that you need a loop, but we know that twice is enough.

#### Omit the checks for `q̂ >= b`

If `q̂ >= b`, then the estimated quotient requires more than one digit; that means our estimate is certainly too big, because the true quotient is single-digit.

This check is necessary in Knuth but an optimization at best in `divlu`. To understand why, consider dividing (base 10) something like `100000 / 10001`. Algorithm D will start by dividing a two digit prefix of the dividend by the first digit of the divisor: `10 / 1`, and then correct it with the next divisor digit (0, does nothing). This estimates the first quotient digit as 10 (obviously absurd, 10 is not a digit!). It’s only very much later, as the algorithm traverses the divisor, that it discovers the 1 at the end, and finally decrements the 10 to 9 (the “add back” phase). But in the interim, 10 was too large to store in our digit array, so we might as well decrement it right away, and save ourselves the trouble later.

But in `divlu` there is no “add back” phase needed, because each quotient digit is informed by the entire divisor (a mere two digits). Furthermore, as digits are half a machine word, there is no challenge with storing a two-digit quantity. Thus the check for `q̂ >= b` is redundant with the next check; in rare cases it saves a multiply but is surely a loss.

You may worry that the check is necessary to prevent overflow in the `q1*vn0` calculation, but analysis shows this cannot overflow as `q1 <= b + 1` (see the last section in the previous post).

#### Avoid overflow checks

We may decrement q̂ twice at most. Knuth gives the cryptic suggestion of “repeat this test if r̂ < b” and Hackers Delight implements this faithfully:

Why compare `r̂ < b`? In the next iteration we compute `b ⨯ r̂` and compare that to a (at most) two-digit number. So if `r̂ ≥ b` we can certainly skip the comparison: it is sufficient. But the test is only necessary in a practical sense, to avoid risking overflow in the `b ⨯ r̂` calculation. If we rearrange the computation as in the previous post, we no longer risk overflow, and so can elide this branch.

#### Put them together

Now we have improved the quotient correction by reducing the arithmetic and branching:

Lean and mean.

#### Open Questions

1. What is the proper interpretation of c1 and c2? I can’t name these well, as I lack intuition about what they are.

2. Can we correct `q̂` using the true remainder? We know that `q̂ - 2 ≤ q`. If we compute the remainder for `q̂ - 2`, multiply it by the divisor, and check how many times to add the divisor until it would exceed the dividend - this would also correct `q` while simultaneously calculating the remainder. The main difficulty here is that the dividend is three digits, so there is no obvious efficient way to perform this arithmetic. But if we could, it would save some intermediate calculations.

#### Final code

This presents an optimized 128 ÷ 64 = 64 unsigned division. It divides a 128 bit number by a 64 bit number to produce a 64 bit quotient.

Reference code is provided as part of libdivide. This particular function is released into the public domain where applicable, or the CC0 Creative Commons license at your option.

Fun fact: this function found a codegen bug in LLVM 11! (It had already been fixed in 12.)