Labor of Division (Episode V): Faster Narrowing Division
April 29th, 2021

Last post explored Algorithm D, and some improvements for the 3 ÷ 2 = 1 digit case. This post presents a narrowing division algorithm, improving upon the widely used “divlu” function from Hacker’s Delight. I am optimistic that these ideas are original.

divlu is a narrowing division: it divides a 4 digit number by a 2 digit number, producing a 2 digit quotient and remainder. This is achieved by specializing Algorithm D for 4-digit dividend and 2-digit divisor.

is an estimated quotient digit. It is quite close to the quotient digit q; it may be exactly the same, or larger by 1 or 2, and that’s all! The tricky parts of divlu is “correcting” to the true digit, by subtracting 0, 1, or 2.

Here is the correction step from Hacker’s Delight:

   if (q1 >= b || q1*vn0 > b*rhat + un1) {
     q1 = q1 - 1;
     rhat = rhat + vn1;
     if (rhat < b) goto again1;}

Note this is used twice, once for each quotient digit. There are three improvements possible here:

Unroll the loop, whose trip count is at most 2

divlu loops until the digit is correct, but we know that may exceed q by no more than 2. So just correct at most twice, saving some arithmetic and a branch.

The final Algorithm D is somewhat vague here: it says to merely “repeat the check”, and it is not obvious from the text if you need to repeat it once or infinite times. A conservative reading suggests that you need a loop, but we know that twice is enough.

Omit the checks for q̂ >= b

If q̂ >= b, then the estimated quotient requires more than one digit; that means our estimate is certainly too big, because the true quotient is single-digit.

This check is necessary in Knuth but an optimization at best in divlu. To understand why, consider dividing (base 10) something like 100000 / 10001. Algorithm D will start by dividing a two digit prefix of the dividend by the first digit of the divisor: 10 / 1, and then correct it with the next divisor digit (0, does nothing). This estimates the first quotient digit as 10 (obviously absurd, 10 is not a digit!). It’s only very much later, as the algorithm traverses the divisor, that it discovers the 1 at the end, and finally decrements the 10 to 9 (the “add back” phase). But in the interim, 10 was too large to store in our digit array, so we might as well decrement it right away, and save ourselves the trouble later.

But in divlu there is no “add back” phase needed, because each quotient digit is informed by the entire divisor (a mere two digits). Furthermore, as digits are half a machine word, there is no challenge with storing a two-digit quantity. Thus the check for q̂ >= b is redundant with the next check; in rare cases it saves a multiply but is surely a loss.

You may worry that the check is necessary to prevent overflow in the q1*vn0 calculation, but analysis shows this cannot overflow as q1 <= b + 1 (see the last section in the previous post).

Avoid overflow checks

We may decrement q̂ twice at most. Knuth gives the cryptic suggestion of “repeat this test if r̂ < b” and Hackers Delight implements this faithfully:

if (rhat < b) goto again;

Why compare r̂ < b? In the next iteration we compute b ⨯ r̂ and compare that to a (at most) two-digit number. So if r̂ ≥ b we can certainly skip the comparison: it is sufficient. But the test is only necessary in a practical sense, to avoid risking overflow in the b ⨯ r̂ calculation. If we rearrange the computation as in the previous post, we no longer risk overflow, and so can elide this branch.

Put them together

Now we have improved the quotient correction by reducing the arithmetic and branching:

    c1 = qhat * den0;
    c2 = rhat * b + num1;
    if (c1 > c2)
        qhat -= (c1 - c2 > den) ? 2 : 1;

Lean and mean.

Open Questions

  1. What is the proper interpretation of c1 and c2? I can’t name these well, as I lack intuition about what they are.

  2. Can we correct using the true remainder? We know that q̂ - 2 ≤ q. If we compute the remainder for q̂ - 2, multiply it by the divisor, and check how many times to add the divisor until it would exceed the dividend - this would also correct q while simultaneously calculating the remainder. The main difficulty here is that the dividend is three digits, so there is no obvious efficient way to perform this arithmetic. But if we could, it would save some intermediate calculations.

Final code

This presents an optimized 128 ÷ 64 = 64 unsigned division. It divides a 128 bit number by a 64 bit number to produce a 64 bit quotient.

Reference code is provided as part of libdivide. This particular function is released into the public domain where applicable, or the CC0 Creative Commons license at your option.

Fun fact: this function found a codegen bug in LLVM 11! (It had already been fixed in 12.)

 * Perform a narrowing division: 128 / 64 -> 64, and 64 / 32 -> 32.
 * The dividend's low and high words are given by \p numhi and \p numlo, respectively.
 * The divisor is given by \p den.
 * \return the quotient, and the remainder by reference in \p r, if not null.
 * If the quotient would require more than 64 bits, or if denom is 0, then return the max value
 * for both quotient and remainder.
 * These functions are released into the public domain, where applicable, or the CC0 license.
uint64_t divllu(uint64_t numhi, uint64_t numlo, uint64_t den, uint64_t *r)
    // We work in base 2**32.
    // A uint32 holds a single digit. A uint64 holds two digits.
    // Our numerator is conceptually [num3, num2, num1, num0].
    // Our denominator is [den1, den0].
    const uint64_t b = (1ull << 32);

    // The high and low digits of our computed quotient.
    uint32_t q1;
    uint32_t q0;

    // The normalization shift factor.
    int shift;

    // The high and low digits of our denominator (after normalizing).
    // Also the low 2 digits of our numerator (after normalizing).
    uint32_t den1;
    uint32_t den0;
    uint32_t num1;
    uint32_t num0;

    // A partial remainder.
    uint64_t rem;

    // The estimated quotient, and its corresponding remainder (unrelated to true remainder).
    uint64_t qhat;
    uint64_t rhat;

    // Variables used to correct the estimated quotient.
    uint64_t c1;
    uint64_t c2;

    // Check for overflow and divide by 0.
    if (numhi >= den) {
        if (r != NULL)
            *r = ~0ull;
        return ~0ull;

    // Determine the normalization factor. We multiply den by this, so that its leading digit is at
    // least half b. In binary this means just shifting left by the number of leading zeros, so that
    // there's a 1 in the MSB.
    // We also shift numer by the same amount. This cannot overflow because numhi < den.
    // The expression (-shift & 63) is the same as (64 - shift), except it avoids the UB of shifting
    // by 64. The funny bitwise 'and' ensures that numlo does not get shifted into numhi if shift is 0.
    // clang 11 has an x86 codegen bug here: see LLVM bug 50118. The sequence below avoids it.
    shift = __builtin_clzll(den);
    den <<= shift;
    numhi <<= shift;
    numhi |= (numlo >> (-shift & 63)) & (-(int64_t)shift >> 63);
    numlo <<= shift;

    // Extract the low digits of the numerator and both digits of the denominator.
    num1 = (uint32_t)(numlo >> 32);
    num0 = (uint32_t)(numlo & 0xFFFFFFFFu);
    den1 = (uint32_t)(den >> 32);
    den0 = (uint32_t)(den & 0xFFFFFFFFu);

    // We wish to compute q1 = [n3 n2 n1] / [d1 d0].
    // Estimate q1 as [n3 n2] / [d1], and then correct it.
    // Note while qhat may be 2 digits, q1 is always 1 digit.
    qhat = numhi / den1;
    rhat = numhi % den1;
    c1 = qhat * den0;
    c2 = rhat * b + num1;
    if (c1 > c2)
        qhat -= (c1 - c2 > den) ? 2 : 1;
    q1 = (uint32_t)qhat;

    // Compute the true (partial) remainder.
    rem = numhi * b + num1 - q1 * den;

    // We wish to compute q0 = [rem1 rem0 n0] / [d1 d0].
    // Estimate q0 as [rem1 rem0] / [d1] and correct it.
    qhat = rem / den1;
    rhat = rem % den1;
    c1 = qhat * den0;
    c2 = rhat * b + num0;
    if (c1 > c2)
        qhat -= (c1 - c2 > den) ? 2 : 1;
    q0 = (uint32_t)qhat;

    // Return remainder if requested.
    if (r != NULL)
        *r = (rem * b + num0 - q0 * den) >> shift;
    return ((uint64_t)q1 << 32) | q0;