• Benchmarking division and libdivide on Apple M1 and Intel AVX512 May 12th, 2021

    libdivide is fish’s library for speeding up integer division by runtime constants. It works by replacing divide instructions with cheaper multiplications and shifts.

    libdivide supports SIMD, and recently gained support for AVX-512 and ARM NEON. Let’s benchmark on an Intel Xeon and Apple’s M1.

    Test Setup

    The microbenchmark is a sum-of-quotients function, like:

    uint32_t sumq(const uint32_t *vals, size_t count, uint32_t d) {
        uint32_t sum = 0;
        for (size_t i=0; i < count; i++)
            sum += vals[i] / d; // this division is the bottleneck
        return sum;
    }

    The execution time of this function is dominated by the divide. It may be optimized with libdivide:

    uint32_t sumq(const uint32_t *vals, size_t count, uint32_t d) {
        libdivide::divider<uint32_t> div(d);
        uint32_t sum = 0;
        for (size_t i=0; i < count; i++)
            sum += vals[i] / div; // faster libdivide division
        return sum;
    }

    This is typically 2-10x faster, depending on d and count. It also unlocks vectorization for even more speedup - example.

    The times below measure the above loop with d=7 (a “slow case”) and count=524k. The loop is repeated 30 times, and the best (fastest) result is kept. 1

    Intel Xeon 3.0 GHz (8275CL)

    Times are nanoseconds per divide, and a percentage improvement compared to hardware (i.e. udiv instruction). Lower is better.

    width hardware libdivide scalar libdivide SSE2 libdivide AVX2 libdivide AVX512
    u32 2.298 0.872 (-62%) 0.355 (-84%) 0.219 (-90%) 0.210 (-91%)
    u64 6.998 0.891 (-87%) 1.058 (-85%) 0.574 (-92%) 0.492 (-93%)

    The three vector ISAs are similar except for width: 128, 256, 512 bits.

    AVX512 is twice the width of AVX2 but only marginally faster. Speculatively, AVX512 processes multiplies serially, one 256-bit lane at a time, losing half its parallelism.

    libdivide must use 32x32->64 muls (mul_epu32) for vectorized 64 bit, incurring the 4x penalty 2. We would expect to break even at AVX2, which has 4x parallelism; instead AVX2 is faster than scalar, perhaps because a 32 bit multiply is faster than 64 bit.

    Apple M1 3.2 GHz (Mac Mini)

    Times are nanoseconds per divide; lower is better.

    width hardware libdivide scalar libdivide NEON
    u32 0.624 0.351 (-43%) 0.158 (-75%)
    u64 0.623 0.315 (-49%) 0.555 (-11%)

    The M1 is 10x faster than the Xeon at 64 bit divides. It’s…just wow.

    The hardware division time of 0.624 nanoseconds is a clean multiple of 3.2 GHz, suggesting a throughput of 2 clock cycles per divide. That is remarkable: hardware dividers are typically variable-latency and hard to pipeline, so 30+ clock cycles per divide is common. Does the M1 have more than one integer division unit per core?

    Amusingly, scalar 64 bit is faster than scalar 32 bit. This is because AARCH64 has a dedicated instruction for 64 bit high multiplies (umulh), while 32 bit high multiplies are performed with a 64 bit low multiply (umull), followed by a right shift.

    NEON with u32 allows four 32-bit high-multiplies with only two umull instructions (godbolt). So NEON u32 should be nearly twice as fast as scalar; in fact it is even faster, probably because of amoritized fixed costs (branches, etc).

    NEON with u64 is slower because its multiplier is not wide enough, so it incurs the 4x penalty 2, and its skinny vector only offers 2x parallelism. (Nevertheless it still may be useful as part of a larger sequence, to keep data in vector registers.)

    Conclusions and unanswered questions

    libdivide is still quite effective at speeding up integer division, even with the M1’s fast divider. Vectorization greatly improves 32 bit, and is a mixed bag for 64 bit, but still faster than hardware.

    Some questions:

    1. Why is AVX512 only slightly faster than AVX2? Are the umuls are serialized per-lane, is it downclocking?

    2. How is Apple M1 able to achieve a throughput of .5 divides per cycle? Is the hardware divider pipelined, is there more than one per core?

    Thanks for reading!

    Notes

    The results were collected using libdivide’s benchmark tool. You can run it yourself:

    git clone https://github.com/ridiculousfish/libdivide.git
    mkdir libdivide/build && cd libdivide/build
    cmake .. && make benchmark
    ./benchmark u32 # or u64 or s32 or s64
    1. 7 is a “slow case” for libdivide, because 7’s magic number must be 33 or 65 bits, and so cannot fit in a register. This is as slow as libdivide gets. 

    2. To divide a 64 bit number, libdivide needs the high half of a 64 bit product, a so-called high multiply. Unfortunately SIMD hardware typically has a 53-bit multiplier: just big enough for double-precision floating point, not big enough for libdivide. So libdivide must cobble together a 64-bit high-multiply through four 32x32->64 multiplies. This incurs a factor-of-4 penalty for 64 bit vectorized division on SIMD.  2

  • Labor of Division (Episode V): Faster Narrowing Division April 29th, 2021

    Last post explored Algorithm D, and some improvements for the 3 ÷ 2 = 1 digit case. This post presents a narrowing division algorithm, improving upon the widely used “divlu” function from Hacker’s Delight. I am optimistic that these ideas are original.

    divlu is a narrowing division: it divides a 4 digit number by a 2 digit number, producing a 2 digit quotient and remainder. This is achieved by specializing Algorithm D for 4-digit dividend and 2-digit divisor.

    is an estimated quotient digit. It is quite close to the quotient digit q; it may be exactly the same, or larger by 1 or 2, and that’s all! The tricky parts of divlu is “correcting” to the true digit, by subtracting 0, 1, or 2.

    Here is the correction step from Hacker’s Delight:

    again1:
       if (q1 >= b || q1*vn0 > b*rhat + un1) {
         q1 = q1 - 1;
         rhat = rhat + vn1;
         if (rhat < b) goto again1;}

    Note this is used twice, once for each quotient digit. There are three improvements possible here:

    Unroll the loop, whose trip count is at most 2

    divlu loops until the digit is correct, but we know that may exceed q by no more than 2. So just correct at most twice, saving some arithmetic and a branch.

    The final Algorithm D is somewhat vague here: it says to merely “repeat the check”, and it is not obvious from the text if you need to repeat it once or infinite times. A conservative reading suggests that you need a loop, but we know that twice is enough.

    Omit the checks for q̂ >= b

    If q̂ >= b, then the estimated quotient requires more than one digit; that means our estimate is certainly too big, because the true quotient is single-digit.

    This check is necessary in Knuth but an optimization at best in divlu. To understand why, consider dividing (base 10) something like 100000 / 10001. Algorithm D will start by dividing a two digit prefix of the dividend by the first digit of the divisor: 10 / 1, and then correct it with the next divisor digit (0, does nothing). This estimates the first quotient digit as 10 (obviously absurd, 10 is not a digit!). It’s only very much later, as the algorithm traverses the divisor, that it discovers the 1 at the end, and finally decrements the 10 to 9 (the “add back” phase). But in the interim, 10 was too large to store in our digit array, so we might as well decrement it right away, and save ourselves the trouble later.

    But in divlu there is no “add back” phase needed, because each quotient digit is informed by the entire divisor (a mere two digits). Furthermore, as digits are half a machine word, there is no challenge with storing a two-digit quantity. Thus the check for q̂ >= b is redundant with the next check; in rare cases it saves a multiply but is surely a loss.

    You may worry that the check is necessary to prevent overflow in the q1*vn0 calculation, but analysis shows this cannot overflow as q1 <= b + 1 (see the last section in the previous post).

    Avoid overflow checks

    We may decrement q̂ twice at most. Knuth gives the cryptic suggestion of “repeat this test if r̂ < b” and Hackers Delight implements this faithfully:

    if (rhat < b) goto again;

    Why compare r̂ < b? In the next iteration we compute b ⨯ r̂ and compare that to a (at most) two-digit number. So if r̂ ≥ b we can certainly skip the comparison: it is sufficient. But the test is only necessary in a practical sense, to avoid risking overflow in the b ⨯ r̂ calculation. If we rearrange the computation as in the previous post, we no longer risk overflow, and so can elide this branch.

    Put them together

    Now we have improved the quotient correction by reducing the arithmetic and branching:

        c1 = qhat * den0;
        c2 = rhat * b + num1;
        if (c1 > c2)
            qhat -= (c1 - c2 > den) ? 2 : 1;

    Lean and mean.

    Open Questions

    1. What is the proper interpretation of c1 and c2? I can’t name these well, as I lack intuition about what they are.

    2. Can we correct using the true remainder? We know that q̂ - 2 ≤ q. If we compute the remainder for q̂ - 2, multiply it by the divisor, and check how many times to add the divisor until it would exceed the dividend - this would also correct q while simultaneously calculating the remainder. The main difficulty here is that the dividend is three digits, so there is no obvious efficient way to perform this arithmetic. But if we could, it would save some intermediate calculations.

    Final code

    This presents an optimized 128 ÷ 64 = 64 unsigned division. It divides a 128 bit number by a 64 bit number to produce a 64 bit quotient.

    Reference code is provided as part of libdivide. This particular function is released into the public domain where applicable, or the CC0 Creative Commons license at your option.

    Fun fact: this function found a codegen bug in LLVM 11! (It had already been fixed in 12.)

    /*
     * Perform a narrowing division: 128 / 64 -> 64, and 64 / 32 -> 32.
     * The dividend's low and high words are given by \p numhi and \p numlo, respectively.
     * The divisor is given by \p den.
     * \return the quotient, and the remainder by reference in \p r, if not null.
     * If the quotient would require more than 64 bits, or if denom is 0, then return the max value
     * for both quotient and remainder.
     *
     * These functions are released into the public domain, where applicable, or the CC0 license.
     */
    uint64_t divllu(uint64_t numhi, uint64_t numlo, uint64_t den, uint64_t *r)
    {
        // We work in base 2**32.
        // A uint32 holds a single digit. A uint64 holds two digits.
        // Our numerator is conceptually [num3, num2, num1, num0].
        // Our denominator is [den1, den0].
        const uint64_t b = (1ull << 32);
    
        // The high and low digits of our computed quotient.
        uint32_t q1;
        uint32_t q0;
    
        // The normalization shift factor.
        int shift;
    
        // The high and low digits of our denominator (after normalizing).
        // Also the low 2 digits of our numerator (after normalizing).
        uint32_t den1;
        uint32_t den0;
        uint32_t num1;
        uint32_t num0;
    
        // A partial remainder.
        uint64_t rem;
    
        // The estimated quotient, and its corresponding remainder (unrelated to true remainder).
        uint64_t qhat;
        uint64_t rhat;
    
        // Variables used to correct the estimated quotient.
        uint64_t c1;
        uint64_t c2;
    
        // Check for overflow and divide by 0.
        if (numhi >= den) {
            if (r != NULL)
                *r = ~0ull;
            return ~0ull;
        }
    
        // Determine the normalization factor. We multiply den by this, so that its leading digit is at
        // least half b. In binary this means just shifting left by the number of leading zeros, so that
        // there's a 1 in the MSB.
        // We also shift numer by the same amount. This cannot overflow because numhi < den.
        // The expression (-shift & 63) is the same as (64 - shift), except it avoids the UB of shifting
        // by 64. The funny bitwise 'and' ensures that numlo does not get shifted into numhi if shift is 0.
        // clang 11 has an x86 codegen bug here: see LLVM bug 50118. The sequence below avoids it.
        shift = __builtin_clzll(den);
        den <<= shift;
        numhi <<= shift;
        numhi |= (numlo >> (-shift & 63)) & (-(int64_t)shift >> 63);
        numlo <<= shift;
    
        // Extract the low digits of the numerator and both digits of the denominator.
        num1 = (uint32_t)(numlo >> 32);
        num0 = (uint32_t)(numlo & 0xFFFFFFFFu);
        den1 = (uint32_t)(den >> 32);
        den0 = (uint32_t)(den & 0xFFFFFFFFu);
    
        // We wish to compute q1 = [n3 n2 n1] / [d1 d0].
        // Estimate q1 as [n3 n2] / [d1], and then correct it.
        // Note while qhat may be 2 digits, q1 is always 1 digit.
        qhat = numhi / den1;
        rhat = numhi % den1;
        c1 = qhat * den0;
        c2 = rhat * b + num1;
        if (c1 > c2)
            qhat -= (c1 - c2 > den) ? 2 : 1;
        q1 = (uint32_t)qhat;
    
        // Compute the true (partial) remainder.
        rem = numhi * b + num1 - q1 * den;
    
        // We wish to compute q0 = [rem1 rem0 n0] / [d1 d0].
        // Estimate q0 as [rem1 rem0] / [d1] and correct it.
        qhat = rem / den1;
        rhat = rem % den1;
        c1 = qhat * den0;
        c2 = rhat * b + num0;
        if (c1 > c2)
            qhat -= (c1 - c2 > den) ? 2 : 1;
        q0 = (uint32_t)qhat;
    
        // Return remainder if requested.
        if (r != NULL)
            *r = (rem * b + num0 - q0 * den) >> shift;
        return ((uint64_t)q1 << 32) | q0;
    }
  • Labor of Division (Episode IV): Algorithm D April 28th, 2021

    Algorithm D is Knuth's celebrated multiword integer division algorithm. This post tries to make Algorithm D approachable, and also has an idea for an improvement, in the last section. A followup post provides code to take advantage of that improvement.

    To build intuition, we'll use concrete base-10 values, instead of a general base and digit count like Knuth does. It should be straightforward to generalize to arbitrary bases and digit counts.

    Background

    The problem is dividing one big number by another big number. The numbers are big enough to require multiple "digits", where a digit is a number in some comfortable base. The examples will be base 10, but in practical applications the base is the width of a hardware divider, or sometimes half that. For example, when dividing a 256-bit number by a 128-bit number, we may pick 64-bit digits, for a base of $2^{64}$, which is huge.

    It's hard to beat grade-school long division. In long division, we crank out the quotient digits from left to right by peeling off a prefix of the numerator, dividing that, and then using the remainder in the next step. For example:

    1. Compute 1793 ÷ 25. Start with the first three digits of the dividend, i.e. 179.
    2. 179 ÷ 25 = 7, remainder 4. Our first quotient digit is 7.
    3. Take the remainder and append ("bring down") the next digit 3 to form a new dividend 43.
    4. 43 ÷ 25 = 1, remainder of 18.
    5. Full quotient is 71 with remainder of 18.

    Our dividend was four digits, but through long division we only had to divide a three digit number. This is the general irreducible step in long division: if our divisor is N digits, we must work in N+1 digit chunks from the dividend. So our simplified problem is to compute $ q = \lfloor \tfrac n d \rfloor $, where n has exactly one more digit than d, and q is a single digit. (If q were not a single digit, then we peeled off too much of the dividend; back up one digit and try again.)

    Now we switch to a sub-algorithm for this simpler problem.

    Estimating the quotient digit

    We have an N digit denominator and an N+1 digit numerator; the quotient q is a single digit. How do we find it?

    The big idea is to "zero out" the low digits: keep top two from the numerator, and top one from the denominator. For example, instead of $\lfloor \tfrac {2356} {395} \rfloor$, compute $\lfloor \tfrac {2300} {300} \rfloor$, which is the same as $\lfloor \tfrac {23} {3} \rfloor$, which is easy. We hope that this new "estimate" quotient, q̂ will be close to q, and so help us find q:

    $$\require{enclose} q = \lfloor \tfrac {2356} {395} \rfloor \\ \hat{q} = \lfloor \tfrac {23\enclose{downdiagonalstrike}{56}} {3\enclose{downdiagonalstrike}{95}} \rfloor = \lfloor \tfrac {2300} {300} \rfloor = \lfloor \tfrac {23} {3} \rfloor = 7\\ q \stackrel{?}{\approx} 7 $$

    How close is q̂ to q? We'll try to bound the error: let's make q̂ as far off as possible, in both directions.

    Is q̂ too small?

    We'll try to make q̂ smaller than q. (Spoiler: we can't.)

    If we want q̂ < q, then this zeroing process must make the quotient smaller. Let's arrange for the numerator to be reduced as much as possible (lots of 9s) and the denominator not at all (lots of zeros). For example $ \lfloor \tfrac {2399} {300} \rfloor $ has a good chance of making q̂ too small. Apply some algebra:

    $ \lfloor \tfrac {2399} {300} \rfloor = \lfloor \tfrac {2300 + 99} {300} \rfloor = \lfloor \tfrac {2300} {300} + \tfrac {99} {300} \rfloor = \lfloor \tfrac {23} {3} + \tfrac {99} {300} \rfloor $

    Now what is the fractional contribution of the two terms in the last expression? The first term can contribute at most 2/3, and the second at most 99/300, which is less than 1/3. So their sum is always less than 1, so it is always wiped out by the floor, and we have:

    $ \lfloor \tfrac {2300} {300} \rfloor = \lfloor \tfrac {2399} {300} \rfloor $

    Which is kind of neat. So q̂ is never too small.

    (This is surprising because it's not true with ordinary division: obviously for example $\tfrac {9999} {100} \gt \tfrac {9900} {100} $. It's true in integer division because of the floor.)

    Is q̂ too big?

    Now let's reverse it: make this zeroing process increase the quotient by as much as possible. What if the numerator ends in 0s (so it doesn't shrink) and the denominator end in lots of 9s (so it gets much smaller)? Like:

    $ q = \lfloor \tfrac {10000} {1999} \rfloor = 5\\ \hat{q} = \lfloor \tfrac {10000} {1000} \rfloor = 10 \not\approx 5 $

    q̂ is a terrible estimate. When we zeroed out the denominator, we reduced it by nearly half, because its leading digit was 1. If the leading digit were larger, then zeroing out the rest wouldn't affect its value so much.

    Let's bound the error using that leading digit. We will just call the numerator $ n $, and the denominator has leading digit $ v $ (following Knuth), where $ 0 \lt v \lt 10 $. We hope to maximize the error, so arrange for the numerator to have trailing 0s and the denominator trailing 9s. Examine the difference:

    $ \hat{q} - q = \lfloor \tfrac n {v000} \rfloor - \lfloor \tfrac n {v999} \rfloor $

    If we eliminate the floors, the fractional difference is always less than 1 (perhaps less than 0), so we have:

    $ \hat{q} - q \lt \tfrac n {v000} - \tfrac n {v999} + 1 $

    Apply some algebra to the difference of fractions:

    $$\begin{align} \tfrac n {v000} - \tfrac n {v999} & = \tfrac {n \times {v999} - n \times {v000}} {v000 \times v999} \\[2ex] & = \tfrac { n \times ( v999 - v000 ) } {v000 \times v999} \\[2ex] & = \tfrac { n \times 999 } {v000 \times v999} \\[2ex] & = \tfrac {999} {v000} \times \tfrac n {v999} \\[2ex] & \lt \tfrac {1} {v} \times \tfrac n {v999} \\[2ex] & \lt \tfrac {1} {v} \times 10 \\[2ex] \end{align} $$ The last step used the fact that the quotient q is a single digit. Inserting this, we get the relation:

    $ \hat{q} - q \lt \tfrac n {v000} - \tfrac n {v999} + 1 \lt \tfrac {10} v + 1$

    So when our denominator's leading digit is 1, q̂ may be too big by as much as 11! But if the leading digit were, say, 8, then q̂ is at most 2 too big.

    Normalization

    Now for the trick: we can always arrange for the denominator to have a leading digit of 5 or more, by just multiplying it. If our denominator has a leading digit of, say, 4, just multiply the whole denominator by 2. And then multiply the numerator by 2 also, to keep the quotient the same.

    Let's work an example: $$ q = \lfloor \tfrac {192} {29} \rfloor = 6 \\[2ex] \hat{q} = \lfloor \tfrac {190} {20} \rfloor = 9 $$ q̂ is a bad estimate because the denominator has a low leading digit of 2. So multiply top and bottom by 3 first: $$q = \lfloor \tfrac {192} {29} \rfloor = \lfloor \tfrac {192 \times 3} {29 \times 3} \rfloor = \lfloor \tfrac {576} {87} \rfloor = 6 \\ \hat{q} = \lfloor \tfrac {570} {80} \rfloor = \lfloor \tfrac {57} {8} \rfloor = 7 $$ and our estimate q̂ is closer.

    Increasing the leading digit of the divisor is the "normalization" step. After normalizing, the leading digit of at least 5 implies that $ \hat{q} - q \lt \tfrac {10} 5 + 1 = 3 $. And as q̂ and q are integral, we get the awesome result that $ \hat{q} - q \le 2 $. This normalization step ensures our estimate is off by at most 2!

    Note that normalizing increases the leading digit of the divisor but does not add a new digit. Alas, it may add a new digit for the dividend, if you normalize before entering the long division algorithm (and you typically do). Also note that while normalizing does not affect the quotient, it does affect the remainder, which must be scaled down by the normalization factor.

    Multiply and subtract

    Now that we have an estimated quotient digit, we proceed prosaically as in long division. Multiply the divisor by the quotient digit and subtract to produce a partial remainder. Slap on the next digit of the dividend ("bring down the...") and that is your new dividend.

    If we over-estimated the quotient digit, we may get a negative remainder. For example, when the divisor is "just barely" larger than some factor of the dividend: $$ q = \lfloor \tfrac {8000} {4001} \rfloor = 1 \\ \hat{q} = \lfloor \tfrac {8} {4} \rfloor = 2 \\ r = 8000 - \hat{q} \times {4001} \lt 0 $$

    Whoops, but it's no problem: just decrement our quotient, and add back the divisor, until the remainder is no longer negative. But as we are performing multiword arithmetic, we should try to avoid these expensive extra computations. It would be better to correct the estimated quotient digit immediately.

    Now we turn to techniques to do that.

    Correcting the estimate

    Here we'll restrict ourselves to dividing a 3 digit dividend by a 2 digit divisor, producing a 1 digit quotient. We will see how to rapidly correct the error in q̂.

    In this section, think of a "digit" as half of a machine word. Working with a two digit number is fast, but three or more digits is slow. "Overflow" means "requires more than two digits."

    As proven above, we know that (after normalization) q̂ may exceed q by at most 2. Let's return to our previous example, and then extend it by computing a remainder r̂:

    $$ q = \lfloor \tfrac {576} {87} \rfloor = 6 \\[1ex] \hat{q} = \lfloor \tfrac {570} {80} \rfloor = 7 \\[1ex] \hat{r} = {570} \bmod {80} = 10 \\[1ex] \hat{q} \times 80 + \hat{r} = {570} $$

    r̂ is "the remainder of the estimated quotient". It is NOT an estimate of the actual remainder; it has nothing whatsoever to do with the actual remainder, because we ignored the low digit of the divisor. This is true even if the estimated quotient digit is precise. In this sense "r-hat" is a poor choice of symbol, but we'll keep it for continuity with Knuth.

    Now our quotient digit 7 is too high, but let's pretend we didn't know that and see what happens: $$\begin{array}{r}7\\ 87\enclose{longdiv}{576}\\ -\underline{609}\\ {\color{red}{?}} \end{array}$$

    The actual remainder would be negative. But to discover that, we had to perform an expensive full multiply (7 x 87) which produces 3 digits. We can avoid that full multiply with r̂:

    $$ \hat{q} \times 87 \gt 576 \implies \\ \hat{q} \times 80 + \hat{q} \times 7 \gt 576 \implies \\ \hat{q} \times 80 + \hat{r} + \hat{q} \times 7 \gt 576 + \hat{r} \implies \\ 570 + \hat{q} \times 7 \gt 576 + \hat{r} \implies \\ \hat{q} \times 7 \gt 6 + \hat{r} $$ So take our estimated quotient digit, multiply it by the last digit in the divisor, and compare that to the sum of the the last digit of the dividend and the "remainder of the estimated quotient". If it's larger, subtract one from the estimated quotient and repeat. But note we only have to test this two times.

    This explains the (otherwise baffling) comparison step in Algorithm D.

    One last question: now that we have the correct quotient digit, don't we have to perform that widening multiply anyways to compute the remainder? The answer is no: the remainder is guaranteed to fit in two digits, so we can just use ordinary wrapping arithmetic to find it. For example, the first quotient digit is 6, so the first remainder is $ 576 - 6 \times 87 = 54 $; let's wrap at two digits so mod 100:

    $$\begin{align} r &= (576 - 6 \times 87) \bmod 100 \\ &= 76 - (6 \times 87) \bmod 100 \\ &= 76 - 22 \\ &= 54 \end{align} $$

    Once the quotient digit is correct, we may just compute the remainder directly. Modulo overflow is expected and fine.

    Optimizing the estimate correction

    I am optimistic this idea is original; I have not seen it published. This section will be more rigorous for that reason.

    Let's restate what we know using variables. We work in some digit base $ b $. We have a three digit dividend $ u_2 u_1 u_0 $, a two digit divisor $ v_1 v_0 $, and we know their quotient $ q = \lfloor \frac {u_2 u_1 u_0} {v_1 v_0} \rfloor \lt b$ , i.e. fits in a single digit. We compute an estimate $ \hat{q} = \lfloor \tfrac {u_2 u_1 0} {v_1 0} \rfloor $, and the corresponding remainder $ \hat{r} $. We know that $\hat{q}$ is too big iff $ \hat{q} \times v_0 \gt u_0 + \hat{r} $. If so, decrement $ \hat{q} $, recompute $ \hat{r} $, and repeat the check exactly once: if it still too big decrement $ \hat{q} $ once more.

    Of course when we decrement the quotient, we may easily compute the new remainder by just adding back the divisor. With this idea we get the following two checks:

    1. $ \hat{q} \times v_0 \gt u_0 + \hat{r} \\ $
    2. $ (\hat{q} - 1) \times v_0 \gt u_0 + (\hat{r} + v_10) $

    If neither are true, then q̂ is precise. If only the first is true, q̂ is too big by 1. If both are true, q̂ is too big by 2.

    One non-obvious risk is that the last sum may overflow. This is mitigated by the $ \hat{r} \lt b $ check in both Knuth and Warren. However it is possible to avoid the overflow through some algebra:

    $$ (\hat{q} - 1) \times v_0 \gt u_0 + (\hat{r} + v_10) \implies \\ \hat{q} \times v_0 - v_0 - u_0 \gt \hat{r} + v_10 \implies \\ \hat{q} \times v_0 - u_0 - \hat{r} \gt v_0 + v_10 \implies \\ \hat{q} \times v_0 - (u_0 + \hat{r}) \gt v_1 v_0 \\ $$

    How nice is this last expression! $ v_1 v_0 $ is just the original divisor. And the difference $ \hat{q} \times v_0 - (u_0 + \hat{r}) $ may also be used for the first check: if it is not positive, then the inequality in check 1 was not satisifed.

    What about overflow? The product $ \hat{q} \times v_0 $ cannot overflow, because $ \hat{q} \le q + 2 \implies \hat{q} \le b + 1 $, therefore $ \hat{q} \times v_0 \le v_0 * b + v_0 $ which clearly does not overflow.

    Also the sum $ u_0 + \hat{r} $ does not overflow, because $ \hat{r} \lt v_1 0 $, therefore $ \hat{r} + u_0 \lt v_1 u_0 $.

    This idea more efficiently finds $ q $ in the 3 ÷ 2 case, avoiding the loop and some other checks from Knuth. See the next post for an updated narrowing division algorithm that exploits this idea.

  • My Least Favorite Rust Type September 20th, 2020

    My least favorite Rust type is std::ops::Range.

    Range is a foundational type, with magic syntax but a simple definition:

    struct Range<Idx> {
        /// The lower bound of the range (inclusive).
        pub start: Idx,
        /// The upper bound of the range (exclusive).
        pub end: Idx,
    }

    Idx is typically an integral type, but you can make a Range of anything, which will become important later. Here’s a range of Unit:

    ()..()

    (In case you were wondering, the Unit range does not contain Unit.)

    What’s wrong with Range? This will be a critique on Rust’s own terms.

    Range needs random-feeling borrows

    Any Rust tutorial will cover borrowing, lifetimes, ownership, etc. You think you understand, then this:

    (3..5).contains(&4) // why the &

    You need the borrow because you can’t like, OWN a number, man!

    Heh, no, it’s because, what if you wanted to make a Range<Vec> instead:

    let r = vec![1, 2, 3]..=vec![10, 11, 12];
    let v = r.contains(&vec![9, 9, 9, 9]);

    (try to guess what value v has above)

    Range<Vec> requires the borrow, so the vastly more common Range<usize> etc. forces it as well.

    Range needs random-feeling clones

    We’re not done picking on the borrow checker. Range makes you clone even when there’s no ownership transfer:

    let v = vec![1, 2, 3, 4];
    let r = 1..2;
    &v[r.clone()]; // required to avoid "use of moved value" in next line
    &v[r]

    Range could be Copy, but some Ranges are also iterators and you might copy one by mistake:

    let mut iter = 0..n;
    for i in iter { if i > 2 { break; } } // silent copy of iter
    iter.collect()

    so Ranges which are not used as iterators (many cannot be, like Range<f64> or Range<char>) pay the price, and so does any type which embeds a Range.

    This is abusing the borrow checker as a bad linter. Range undermines the story of lifetimes and ownership, making the borrow checker feel arbitrary.

    Range is unsure when it is valid

    Rust will happily construct a backwards Range, which ends before it starts:

    let x = 5..0;
    x.contains(&4); // false

    Is this backwards range valid? Its len() is 0, it contains() nothing, it yields nothing as an iterator. But if you try to use it to index into a slice, you get a panic! So it looks valid but is primed to explode!

    This is because - again - you can make a Range of anything. If you try to enforce that start <= end, you lose the ability to make a Range of non-comparable things. A Range of dyn Error can’t be invalid, so a Range of int never gets to be.

    A practical problem is writing correct bounds checks. For example, consider the get_unchecked function on slice - it says “an out-of-bounds index is undefined behavior” but never defines what out of bounds means. So how does one even call this function safely?

    Restated: a Range with start 0 and length 0 may be out of bounds. That’s wild.

    Range hides a footgun

    A regex engine (here’s one) often deals in ranges of char, for example /[a-z]/. Should it use Range<char>?

    No! The footgun is /[\u10FFFF]/, which is the largest char. Range<char> cannot represent this value, even though it has the bits to do so.

    Is RangeInclusive a better choice? This uses an additional bool field to mean…stuff, and it needs to be separately checked in several places. This is a silly expensive representation, pushing RangeInclusive<char> to 12 bytes even though it would fit in 8, with bits to spare. Not a good choice for a perf-sensitive regex algorithm.

    A Recipe for Rearranging Range

    The problem is Range is overloaded: it’s too flexible, it wants to be everything.

    • It’s sometimes a set
    • It’s sometimes an iterator
    • It’s sometimes a slice index
    • You can make silly Ranges out of anything

    These goals are in tension, and it meets none of them well.

    Perhaps it’s too late and we must live with this wart, but a recipe for a different approach:

    1. Limit Range to Copy + PartialOrd types. There may be occasional uses for Range<String> or Range<BigNum>, but it’s not worth forcing borrowing for the common case.

    2. Now that Range always knows how to compare its ends, enforce that start <= end at construction time. For Ranges constructed from constants, this will be free. This avoids the “looks valid, actually explodes” problem and will unlock further optimizations:

      • Range::is_empty() could be written naturally instead of a a bizarre way just for NaNs
      • [T]::get() would need to branch once, not twice
      • Range::len() could just be a sub and not require a branch
    3. Give Range an iter() method, like other collections have. Now Range is not an Iterator, just like Vec is not an Iterator, and Range can be easily made Copy. This does mean writing for n in (1..10).iter(), but Rust already requires that for collections, so it’s more consistent.

    4. Now that Range is not an Iterator, RangeInclusive can drop its extra bool. It would simply be a pair of indexes, and could not be empty (that’s what Swift does).

    Thanks for reading!

  • JavaScript's Tricky Rounding April 24th, 2018

    JavaScript rounds in a tricky way. It tricked all the engines, and even itself.

    Math.round() behaves the same as C’s familiar round with one key difference: it rounds halfways (“is biased”) towards positive infinity. Here is its spec in ES 5.1. It suggests an implementation too:

    The value of Math.round(x) is the same as the value of Math.floor(x+0.5)...

    And so every engine did that.

    Unfortunately, the implementation in the spec does not correctly implement the spec.

    One bug is that adding 0.5 can result in precision loss, so that a value less than .5 may round up to 1. Mac user? Try it yourself (Safari 11.0.3):

    > /System/Library/Frameworks/JavaScriptCore.framework/Versions/A/Resources/jsc
    >>> 0.499999999999999944 < .5
    true
    >>> Math.round(0.499999999999999944)
    1 
    One engine attempted to patch it by just checking for .5:
    double JSRound(double x) {
      if (0.5 > x && x <= -0.5) return copysign(0, x);
      return floor(x + 0.5);
    }
    However this fails on the other end: when x is large enough that fractional values can no longer be represented, x + 0.5 rounds up to x + 1, so JSRounding a large integer like Math.pow(2, 52) would actually increment it.

    What's a correct implementation? SpiderMonkey checks on the high end, and exploits the loss of precision on the low end:

    double math_round_impl(double x) {
       if (ExponentComponent(x) >= 52) return x;
       double delta = (x >= 0 ? GetBiggestNumberLessThan(0.5) : 0.5);
       return copysign(floor(x + delta));
    }
    fish's attempt just checks high and low:
    static const double kIntegerThreshold = 1LLU << 52;
    double jsround(double x) {
      double absx = fabs(x);
      if (absx > kIntegerThreshold) {
        // x is already integral
        return x;
      } else if (absx < 0.5) {
        // x may suffer precision loss when adding 0.5
        // round to +/- 0
        return copysign(0, x);
      } else {
        // normal rounding.
        // ensure negative values stay negative.
        return copysign(floor(x + 0.5), x);
      }
    }
    which produces surprisingly pleasant assembly, due to the compiler's fabs() and copysign() intrinsics.

    Update: Steve Canon finds a floor-based approach, which surely must be optimal or close to it:

    double roundjs(double x) {
      double provisional = floor(x);
      double fraction = x - provisional;
      return copysign(provisional + (fraction >= 0.5), x);
    }

    The ES6 spec sheepishly no longer suggests an implementation, it just disavows one:

    Math.round(x) may also differ from the value of Math.floor(x+0.5) because of internal rounding when computing x+0.5...

    JavaScript presumably rounds this odd way to match Java, and so the only engine to get it right out of the gate is Rhino, which simply calls back to Java's Math.round. Amusingly Oracle fell into the same trap with Rhino's successor Nashorn. Round and round we go!

  • More Posts