
Benchmarking division and libdivide on Apple M1 and Intel AVX512 May 12th, 2021
libdivide is fish’s library for speeding up integer division by runtime constants. It works by replacing divide instructions with cheaper multiplications and shifts.
libdivide supports SIMD, and recently gained support for AVX512 and ARM NEON. Let’s benchmark on an Intel Xeon and Apple’s M1.
Test Setup
The microbenchmark is a sumofquotients function, like:
The execution time of this function is dominated by the divide. It may be optimized with libdivide:
This is typically 210x faster, depending on
d
andcount
. It also unlocks vectorization for even more speedup  example.The times below measure the above loop with
d=7
(a “slow case”) andcount=524k
. The loop is repeated 30 times, and the best (fastest) result is kept. ^{1}Intel Xeon 3.0 GHz (8275CL)
Times are nanoseconds per divide, and a percentage improvement compared to hardware (i.e.
udiv
instruction). Lower is better.width hardware libdivide scalar libdivide SSE2 libdivide AVX2 libdivide AVX512 u32 2.298 0.872 (62%) 0.355 (84%) 0.219 (90%) 0.210 (91%) u64 6.998 0.891 (87%) 1.058 (85%) 0.574 (92%) 0.492 (93%) The three vector ISAs are similar except for width: 128, 256, 512 bits.
AVX512 is twice the width of AVX2 but only marginally faster. Speculatively, AVX512 processes multiplies serially, one 256bit lane at a time, losing half its parallelism.
libdivide must use 32x32>64 muls (
mul_epu32
) for vectorized 64 bit, incurring the 4x penalty ^{2}. We would expect to break even at AVX2, which has 4x parallelism; instead AVX2 is faster than scalar, perhaps because a 32 bit multiply is faster than 64 bit.Apple M1 3.2 GHz (Mac Mini)
Times are nanoseconds per divide; lower is better.
width hardware libdivide scalar libdivide NEON u32 0.624 0.351 (43%) 0.158 (75%) u64 0.623 0.315 (49%) 0.555 (11%) The M1 is 10x faster than the Xeon at 64 bit divides. It’s…just wow.
The hardware division time of 0.624 nanoseconds is a clean multiple of 3.2 GHz, suggesting a throughput of 2 clock cycles per divide. That is remarkable: hardware dividers are typically variablelatency and hard to pipeline, so 30+ clock cycles per divide is common. Does the M1 have more than one integer division unit per core?
Amusingly, scalar 64 bit is faster than scalar 32 bit. This is because AARCH64 has a dedicated instruction for 64 bit high multiplies (
umulh
), while 32 bit high multiplies are performed with a 64 bit low multiply (umull
), followed by a right shift.NEON with u32 allows four 32bit highmultiplies with only two
umull
instructions (godbolt). So NEON u32 should be nearly twice as fast as scalar; in fact it is even faster, probably because of amoritized fixed costs (branches, etc).NEON with u64 is slower because its multiplier is not wide enough, so it incurs the 4x penalty ^{2}, and its skinny vector only offers 2x parallelism. (Nevertheless it still may be useful as part of a larger sequence, to keep data in vector registers.)
Conclusions and unanswered questions
libdivide is still quite effective at speeding up integer division, even with the M1’s fast divider. Vectorization greatly improves 32 bit, and is a mixed bag for 64 bit, but still faster than hardware.
Some questions:

Why is AVX512 only slightly faster than AVX2? Are the umuls are serialized perlane, is it downclocking?

How is Apple M1 able to achieve a throughput of .5 divides per cycle? Is the hardware divider pipelined, is there more than one per core?
Thanks for reading!
Notes
The results were collected using libdivide’s benchmark tool. You can run it yourself:

7 is a “slow case” for libdivide, because 7’s magic number must be 33 or 65 bits, and so cannot fit in a register. This is as slow as libdivide gets. ↩

To divide a 64 bit number, libdivide needs the high half of a 64 bit product, a socalled high multiply. Unfortunately SIMD hardware typically has a 53bit multiplier: just big enough for doubleprecision floating point, not big enough for libdivide. So libdivide must cobble together a 64bit highmultiply through four 32x32>64 multiplies. This incurs a factorof4 penalty for 64 bit vectorized division on SIMD. ↩ ↩^{2}


Labor of Division (Episode V): Faster Narrowing Division
April 29th, 2021
Last post explored Algorithm D, and some improvements for the 3 ÷ 2 = 1 digit case. This post presents a narrowing division algorithm, improving upon the widely used “divlu” function from Hacker’s Delight. I am optimistic that these ideas are original.
divlu
is a narrowing division: it divides a 4 digit number by a 2 digit number, producing a 2 digit quotient and remainder. This is achieved by specializing Algorithm D for 4digit dividend and 2digit divisor.q̂
is an estimated quotient digit. It is quite close to the quotient digitq
; it may be exactly the same, or larger by 1 or 2, and that’s all! The tricky parts ofdivlu
is “correcting”q̂
to the true digit, by subtracting 0, 1, or 2.Here is the correction step from Hacker’s Delight:
Note this is used twice, once for each quotient digit. There are three improvements possible here:
Unroll the loop, whose trip count is at most 2
divlu
loops until the digit is correct, but we know thatq̂
may exceedq
by no more than 2. So just correctq̂
at most twice, saving some arithmetic and a branch.The final Algorithm D is somewhat vague here: it says to merely “repeat the check”, and it is not obvious from the text if you need to repeat it once or infinite times. A conservative reading suggests that you need a loop, but we know that twice is enough.
Omit the checks for
q̂ >= b
If
q̂ >= b
, then the estimated quotient requires more than one digit; that means our estimate is certainly too big, because the true quotient is singledigit.This check is necessary in Knuth but an optimization at best in
divlu
. To understand why, consider dividing (base 10) something like100000 / 10001
. Algorithm D will start by dividing a two digit prefix of the dividend by the first digit of the divisor:10 / 1
, and then correct it with the next divisor digit (0, does nothing). This estimates the first quotient digit as 10 (obviously absurd, 10 is not a digit!). It’s only very much later, as the algorithm traverses the divisor, that it discovers the 1 at the end, and finally decrements the 10 to 9 (the “add back” phase). But in the interim, 10 was too large to store in our digit array, so we might as well decrement it right away, and save ourselves the trouble later.But in
divlu
there is no “add back” phase needed, because each quotient digit is informed by the entire divisor (a mere two digits). Furthermore, as digits are half a machine word, there is no challenge with storing a twodigit quantity. Thus the check forq̂ >= b
is redundant with the next check; in rare cases it saves a multiply but is surely a loss.You may worry that the check is necessary to prevent overflow in the
q1*vn0
calculation, but analysis shows this cannot overflow asq1 <= b + 1
(see the last section in the previous post).Avoid overflow checks
We may decrement q̂ twice at most. Knuth gives the cryptic suggestion of “repeat this test if r̂ < b” and Hackers Delight implements this faithfully:
Why compare
r̂ < b
? In the next iteration we computeb ⨯ r̂
and compare that to a (at most) twodigit number. So ifr̂ ≥ b
we can certainly skip the comparison: it is sufficient. But the test is only necessary in a practical sense, to avoid risking overflow in theb ⨯ r̂
calculation. If we rearrange the computation as in the previous post, we no longer risk overflow, and so can elide this branch.Put them together
Now we have improved the quotient correction by reducing the arithmetic and branching:
Lean and mean.
Open Questions

What is the proper interpretation of c1 and c2? I can’t name these well, as I lack intuition about what they are.

Can we correct
q̂
using the true remainder? We know thatq̂  2 ≤ q
. If we compute the remainder forq̂  2
, multiply it by the divisor, and check how many times to add the divisor until it would exceed the dividend  this would also correctq
while simultaneously calculating the remainder. The main difficulty here is that the dividend is three digits, so there is no obvious efficient way to perform this arithmetic. But if we could, it would save some intermediate calculations.
Final code
This presents an optimized 128 ÷ 64 = 64 unsigned division. It divides a 128 bit number by a 64 bit number to produce a 64 bit quotient.
Reference code is provided as part of libdivide. This particular function is released into the public domain where applicable, or the CC0 Creative Commons license at your option.
Fun fact: this function found a codegen bug in LLVM 11! (It had already been fixed in 12.)


Labor of Division (Episode IV): Algorithm D
April 28th, 2021
Algorithm D is Knuth's celebrated multiword integer division algorithm. This post tries to make Algorithm D approachable, and also has an idea for an improvement, in the last section. A followup post provides code to take advantage of that improvement.
To build intuition, we'll use concrete base10 values, instead of a general base and digit count like Knuth does. It should be straightforward to generalize to arbitrary bases and digit counts.
Background
The problem is dividing one big number by another big number. The numbers are big enough to require multiple "digits", where a digit is a number in some comfortable base. The examples will be base 10, but in practical applications the base is the width of a hardware divider, or sometimes half that. For example, when dividing a 256bit number by a 128bit number, we may pick 64bit digits, for a base of $2^{64}$, which is huge.
It's hard to beat gradeschool long division. In long division, we crank out the quotient digits from left to right by peeling off a prefix of the numerator, dividing that, and then using the remainder in the next step. For example:
 Compute 1793 ÷ 25. Start with the first three digits of the dividend, i.e. 179.
 179 ÷ 25 = 7, remainder 4. Our first quotient digit is 7.
 Take the remainder and append ("bring down") the next digit 3 to form a new dividend 43.
 43 ÷ 25 = 1, remainder of 18.
 Full quotient is 71 with remainder of 18.
Our dividend was four digits, but through long division we only had to divide a three digit number. This is the general irreducible step in long division: if our divisor is N digits, we must work in N+1 digit chunks from the dividend. So our simplified problem is to compute $ q = \lfloor \tfrac n d \rfloor $, where n has exactly one more digit than d, and q is a single digit. (If q were not a single digit, then we peeled off too much of the dividend; back up one digit and try again.)
Now we switch to a subalgorithm for this simpler problem.
Estimating the quotient digit
We have an N digit denominator and an N+1 digit numerator; the quotient q is a single digit. How do we find it?The big idea is to "zero out" the low digits: keep top two from the numerator, and top one from the denominator. For example, instead of $\lfloor \tfrac {2356} {395} \rfloor$, compute $\lfloor \tfrac {2300} {300} \rfloor$, which is the same as $\lfloor \tfrac {23} {3} \rfloor$, which is easy. We hope that this new "estimate" quotient, q̂ will be close to q, and so help us find q:
$$\require{enclose} q = \lfloor \tfrac {2356} {395} \rfloor \\ \hat{q} = \lfloor \tfrac {23\enclose{downdiagonalstrike}{56}} {3\enclose{downdiagonalstrike}{95}} \rfloor = \lfloor \tfrac {2300} {300} \rfloor = \lfloor \tfrac {23} {3} \rfloor = 7\\ q \stackrel{?}{\approx} 7 $$
How close is q̂ to q? We'll try to bound the error: let's make q̂ as far off as possible, in both directions.
Is q̂ too small?
We'll try to make q̂ smaller than q. (Spoiler: we can't.)
If we want q̂ < q, then this zeroing process must make the quotient smaller. Let's arrange for the numerator to be reduced as much as possible (lots of 9s) and the denominator not at all (lots of zeros). For example $ \lfloor \tfrac {2399} {300} \rfloor $ has a good chance of making q̂ too small. Apply some algebra:
$ \lfloor \tfrac {2399} {300} \rfloor = \lfloor \tfrac {2300 + 99} {300} \rfloor = \lfloor \tfrac {2300} {300} + \tfrac {99} {300} \rfloor = \lfloor \tfrac {23} {3} + \tfrac {99} {300} \rfloor $
Now what is the fractional contribution of the two terms in the last expression? The first term can contribute at most 2/3, and the second at most 99/300, which is less than 1/3. So their sum is always less than 1, so it is always wiped out by the floor, and we have:
$ \lfloor \tfrac {2300} {300} \rfloor = \lfloor \tfrac {2399} {300} \rfloor $
Which is kind of neat. So q̂ is never too small.
(This is surprising because it's not true with ordinary division: obviously for example $\tfrac {9999} {100} \gt \tfrac {9900} {100} $. It's true in integer division because of the floor.)
Is q̂ too big?
Now let's reverse it: make this zeroing process increase the quotient by as much as possible. What if the numerator ends in 0s (so it doesn't shrink) and the denominator end in lots of 9s (so it gets much smaller)? Like:$ q = \lfloor \tfrac {10000} {1999} \rfloor = 5\\ \hat{q} = \lfloor \tfrac {10000} {1000} \rfloor = 10 \not\approx 5 $
q̂ is a terrible estimate. When we zeroed out the denominator, we reduced it by nearly half, because its leading digit was 1. If the leading digit were larger, then zeroing out the rest wouldn't affect its value so much.
Let's bound the error using that leading digit. We will just call the numerator $ n $, and the denominator has leading digit $ v $ (following Knuth), where $ 0 \lt v \lt 10 $. We hope to maximize the error, so arrange for the numerator to have trailing 0s and the denominator trailing 9s. Examine the difference:
$ \hat{q}  q = \lfloor \tfrac n {v000} \rfloor  \lfloor \tfrac n {v999} \rfloor $
If we eliminate the floors, the fractional difference is always less than 1 (perhaps less than 0), so we have:
$ \hat{q}  q \lt \tfrac n {v000}  \tfrac n {v999} + 1 $
Apply some algebra to the difference of fractions:
$$\begin{align} \tfrac n {v000}  \tfrac n {v999} & = \tfrac {n \times {v999}  n \times {v000}} {v000 \times v999} \\[2ex] & = \tfrac { n \times ( v999  v000 ) } {v000 \times v999} \\[2ex] & = \tfrac { n \times 999 } {v000 \times v999} \\[2ex] & = \tfrac {999} {v000} \times \tfrac n {v999} \\[2ex] & \lt \tfrac {1} {v} \times \tfrac n {v999} \\[2ex] & \lt \tfrac {1} {v} \times 10 \\[2ex] \end{align} $$ The last step used the fact that the quotient q is a single digit. Inserting this, we get the relation:
$ \hat{q}  q \lt \tfrac n {v000}  \tfrac n {v999} + 1 \lt \tfrac {10} v + 1$
So when our denominator's leading digit is 1, q̂ may be too big by as much as 11! But if the leading digit were, say, 8, then q̂ is at most 2 too big.
Normalization
Now for the trick: we can always arrange for the denominator to have a leading digit of 5 or more, by just multiplying it. If our denominator has a leading digit of, say, 4, just multiply the whole denominator by 2. And then multiply the numerator by 2 also, to keep the quotient the same.
Let's work an example: $$ q = \lfloor \tfrac {192} {29} \rfloor = 6 \\[2ex] \hat{q} = \lfloor \tfrac {190} {20} \rfloor = 9 $$ q̂ is a bad estimate because the denominator has a low leading digit of 2. So multiply top and bottom by 3 first: $$q = \lfloor \tfrac {192} {29} \rfloor = \lfloor \tfrac {192 \times 3} {29 \times 3} \rfloor = \lfloor \tfrac {576} {87} \rfloor = 6 \\ \hat{q} = \lfloor \tfrac {570} {80} \rfloor = \lfloor \tfrac {57} {8} \rfloor = 7 $$ and our estimate q̂ is closer.
Increasing the leading digit of the divisor is the "normalization" step. After normalizing, the leading digit of at least 5 implies that $ \hat{q}  q \lt \tfrac {10} 5 + 1 = 3 $. And as q̂ and q are integral, we get the awesome result that $ \hat{q}  q \le 2 $. This normalization step ensures our estimate is off by at most 2!
Note that normalizing increases the leading digit of the divisor but does not add a new digit. Alas, it may add a new digit for the dividend, if you normalize before entering the long division algorithm (and you typically do). Also note that while normalizing does not affect the quotient, it does affect the remainder, which must be scaled down by the normalization factor.
Multiply and subtract
Now that we have an estimated quotient digit, we proceed prosaically as in long division. Multiply the divisor by the quotient digit and subtract to produce a partial remainder. Slap on the next digit of the dividend ("bring down the...") and that is your new dividend.
If we overestimated the quotient digit, we may get a negative remainder. For example, when the divisor is "just barely" larger than some factor of the dividend: $$ q = \lfloor \tfrac {8000} {4001} \rfloor = 1 \\ \hat{q} = \lfloor \tfrac {8} {4} \rfloor = 2 \\ r = 8000  \hat{q} \times {4001} \lt 0 $$
Whoops, but it's no problem: just decrement our quotient, and add back the divisor, until the remainder is no longer negative. But as we are performing multiword arithmetic, we should try to avoid these expensive extra computations. It would be better to correct the estimated quotient digit immediately.
Now we turn to techniques to do that.
Correcting the estimate
Here we'll restrict ourselves to dividing a 3 digit dividend by a 2 digit divisor, producing a 1 digit quotient. We will see how to rapidly correct the error in q̂.
In this section, think of a "digit" as half of a machine word. Working with a two digit number is fast, but three or more digits is slow. "Overflow" means "requires more than two digits."
As proven above, we know that (after normalization) q̂ may exceed q by at most 2. Let's return to our previous example, and then extend it by computing a remainder r̂:
$$ q = \lfloor \tfrac {576} {87} \rfloor = 6 \\[1ex] \hat{q} = \lfloor \tfrac {570} {80} \rfloor = 7 \\[1ex] \hat{r} = {570} \bmod {80} = 10 \\[1ex] \hat{q} \times 80 + \hat{r} = {570} $$
r̂ is "the remainder of the estimated quotient". It is NOT an estimate of the actual remainder; it has nothing whatsoever to do with the actual remainder, because we ignored the low digit of the divisor. This is true even if the estimated quotient digit is precise. In this sense "rhat" is a poor choice of symbol, but we'll keep it for continuity with Knuth.
Now our quotient digit 7 is too high, but let's pretend we didn't know that and see what happens: $$\begin{array}{r}7\\ 87\enclose{longdiv}{576}\\ \underline{609}\\ {\color{red}{?}} \end{array}$$
The actual remainder would be negative. But to discover that, we had to perform an expensive full multiply (7 x 87) which produces 3 digits. We can avoid that full multiply with r̂:
$$ \hat{q} \times 87 \gt 576 \implies \\ \hat{q} \times 80 + \hat{q} \times 7 \gt 576 \implies \\ \hat{q} \times 80 + \hat{r} + \hat{q} \times 7 \gt 576 + \hat{r} \implies \\ 570 + \hat{q} \times 7 \gt 576 + \hat{r} \implies \\ \hat{q} \times 7 \gt 6 + \hat{r} $$ So take our estimated quotient digit, multiply it by the last digit in the divisor, and compare that to the sum of the the last digit of the dividend and the "remainder of the estimated quotient". If it's larger, subtract one from the estimated quotient and repeat. But note we only have to test this two times.
This explains the (otherwise baffling) comparison step in Algorithm D.
One last question: now that we have the correct quotient digit, don't we have to perform that widening multiply anyways to compute the remainder? The answer is no: the remainder is guaranteed to fit in two digits, so we can just use ordinary wrapping arithmetic to find it. For example, the first quotient digit is 6, so the first remainder is $ 576  6 \times 87 = 54 $; let's wrap at two digits so mod 100:
$$\begin{align} r &= (576  6 \times 87) \bmod 100 \\ &= 76  (6 \times 87) \bmod 100 \\ &= 76  22 \\ &= 54 \end{align} $$
Once the quotient digit is correct, we may just compute the remainder directly. Modulo overflow is expected and fine.
Optimizing the estimate correction
I am optimistic this idea is original; I have not seen it published. This section will be more rigorous for that reason.
Let's restate what we know using variables. We work in some digit base $ b $. We have a three digit dividend $ u_2 u_1 u_0 $, a two digit divisor $ v_1 v_0 $, and we know their quotient $ q = \lfloor \frac {u_2 u_1 u_0} {v_1 v_0} \rfloor \lt b$ , i.e. fits in a single digit. We compute an estimate $ \hat{q} = \lfloor \tfrac {u_2 u_1 0} {v_1 0} \rfloor $, and the corresponding remainder $ \hat{r} $. We know that $\hat{q}$ is too big iff $ \hat{q} \times v_0 \gt u_0 + \hat{r} $. If so, decrement $ \hat{q} $, recompute $ \hat{r} $, and repeat the check exactly once: if it still too big decrement $ \hat{q} $ once more.
Of course when we decrement the quotient, we may easily compute the new remainder by just adding back the divisor. With this idea we get the following two checks:
 $ \hat{q} \times v_0 \gt u_0 + \hat{r} \\ $
 $ (\hat{q}  1) \times v_0 \gt u_0 + (\hat{r} + v_10) $
If neither are true, then q̂ is precise. If only the first is true, q̂ is too big by 1. If both are true, q̂ is too big by 2.
One nonobvious risk is that the last sum may overflow. This is mitigated by the $ \hat{r} \lt b $ check in both Knuth and Warren. However it is possible to avoid the overflow through some algebra:
$$ (\hat{q}  1) \times v_0 \gt u_0 + (\hat{r} + v_10) \implies \\ \hat{q} \times v_0  v_0  u_0 \gt \hat{r} + v_10 \implies \\ \hat{q} \times v_0  u_0  \hat{r} \gt v_0 + v_10 \implies \\ \hat{q} \times v_0  (u_0 + \hat{r}) \gt v_1 v_0 \\ $$
How nice is this last expression! $ v_1 v_0 $ is just the original divisor. And the difference $ \hat{q} \times v_0  (u_0 + \hat{r}) $ may also be used for the first check: if it is not positive, then the inequality in check 1 was not satisifed.
What about overflow? The product $ \hat{q} \times v_0 $ cannot overflow, because $ \hat{q} \le q + 2 \implies \hat{q} \le b + 1 $, therefore $ \hat{q} \times v_0 \le v_0 * b + v_0 $ which clearly does not overflow.
Also the sum $ u_0 + \hat{r} $ does not overflow, because $ \hat{r} \lt v_1 0 $, therefore $ \hat{r} + u_0 \lt v_1 u_0 $.
This idea more efficiently finds $ q $ in the 3 ÷ 2 case, avoiding the loop and some other checks from Knuth. See the next post for an updated narrowing division algorithm that exploits this idea.

My Least Favorite Rust Type
September 20th, 2020
My least favorite Rust type is
std::ops::Range
.Range is a foundational type, with magic syntax but a simple definition:
Idx
is typically an integral type, but you can make a Range of anything, which will become important later. Here’s a range of Unit:(In case you were wondering, the Unit range does not contain Unit.)
What’s wrong with Range? This will be a critique on Rust’s own terms.
Range needs randomfeeling borrows
Any Rust tutorial will cover borrowing, lifetimes, ownership, etc. You think you understand, then this:
You need the borrow because you can’t like, OWN a number, man!
Heh, no, it’s because, what if you wanted to make a
Range<Vec>
instead:(try to guess what value
v
has above)Range<Vec>
requires the borrow, so the vastly more commonRange<usize>
etc. forces it as well.Range needs randomfeeling clones
We’re not done picking on the borrow checker. Range makes you clone even when there’s no ownership transfer:
Range
could beCopy
, but some Ranges are also iterators and you might copy one by mistake:so Ranges which are not used as iterators (many cannot be, like
Range<f64>
orRange<char>
) pay the price, and so does any type which embeds aRange
.This is abusing the borrow checker as a bad linter. Range undermines the story of lifetimes and ownership, making the borrow checker feel arbitrary.
Range is unsure when it is valid
Rust will happily construct a backwards Range, which ends before it starts:
Is this backwards range valid? Its
len()
is 0, itcontains()
nothing, it yields nothing as an iterator. But if you try to use it to index into a slice, you get a panic! So it looks valid but is primed to explode!This is because  again  you can make a Range of anything. If you try to enforce that start <= end, you lose the ability to make a Range of noncomparable things. A Range of
dyn Error
can’t be invalid, so a Range of int never gets to be.A practical problem is writing correct bounds checks. For example, consider the
get_unchecked
function on slice  it says “an outofbounds index is undefined behavior” but never defines what out of bounds means. So how does one even call this function safely?Restated: a Range with start 0 and length 0 may be out of bounds. That’s wild.
Range hides a footgun
A regex engine (here’s one) often deals in ranges of char, for example
/[az]/
. Should it useRange<char>
?No! The footgun is
/[\u10FFFF]/
, which is the largest char.Range<char>
cannot represent this value, even though it has the bits to do so.Is
RangeInclusive
a better choice? This uses an additional bool field to mean…stuff, and it needs to be separately checked in several places. This is a silly expensive representation, pushingRangeInclusive<char>
to 12 bytes even though it would fit in 8, with bits to spare. Not a good choice for a perfsensitive regex algorithm.A Recipe for Rearranging Range
The problem is Range is overloaded: it’s too flexible, it wants to be everything.
 It’s sometimes a set
 It’s sometimes an iterator
 It’s sometimes a slice index
 You can make silly Ranges out of anything
These goals are in tension, and it meets none of them well.
Perhaps it’s too late and we must live with this wart, but a recipe for a different approach:

Limit Range to
Copy + PartialOrd
types. There may be occasional uses forRange<String>
orRange<BigNum>
, but it’s not worth forcing borrowing for the common case. 
Now that Range always knows how to compare its ends, enforce that
start <= end
at construction time. For Ranges constructed from constants, this will be free. This avoids the “looks valid, actually explodes” problem and will unlock further optimizations:Range::is_empty()
could be written naturally instead of a a bizarre way just for NaNs[T]::get()
would need to branch once, not twiceRange::len()
could just be a sub and not require a branch

Give Range an
iter()
method, like other collections have. Now Range is not an Iterator, just like Vec is not an Iterator, and Range can be easily made Copy. This does mean writingfor n in (1..10).iter()
, but Rust already requires that for collections, so it’s more consistent. 
Now that Range is not an Iterator,
RangeInclusive
can drop its extra bool. It would simply be a pair of indexes, and could not be empty (that’s what Swift does).
Thanks for reading!

JavaScript's Tricky Rounding
April 24th, 2018
JavaScript rounds in a tricky way. It tricked all the engines, and even itself.
Math.round()
behaves the same as C’s familiarround
with one key difference: it rounds halfways (“is biased”) towards positive infinity. Here is its spec in ES 5.1. It suggests an implementation too:The value of Math.round(x) is the same as the value of Math.floor(x+0.5)...
Unfortunately, the implementation in the spec does not correctly implement the spec.
One bug is that adding 0.5 can result in precision loss, so that a value less than .5 may round up to 1. Mac user? Try it yourself (Safari 11.0.3):
One engine attempted to patch it by just checking for .5: However this fails on the other end: when x is large enough that fractional values can no longer be represented, x + 0.5 rounds up to x + 1, so JSRounding a large integer like Math.pow(2, 52) would actually increment it.What's a correct implementation? SpiderMonkey checks on the high end, and exploits the loss of precision on the low end:
fish's attempt just checks high and low: which produces surprisingly pleasant assembly, due to the compiler's fabs() and copysign() intrinsics.Update: Steve Canon finds a floorbased approach, which surely must be optimal or close to it:
The ES6 spec sheepishly no longer suggests an implementation, it just disavows one:
Math.round(x) may also differ from the value of Math.floor(x+0.5) because of internal rounding when computing x+0.5...
JavaScript presumably rounds this odd way to match Java, and so the only engine to get it right out of the gate is Rhino, which simply calls back to Java's Math.round. Amusingly Oracle fell into the same trap with Rhino's successor Nashorn. Round and round we go!
 More Posts