• Labor of Division (Episode VI): Narrowing Division Insight and Benchmarks September 15th, 2021

Last post presented a narrowing division algorithm, but left some questions dangling around the interpretation of intermediate values. This post answers them. It presents an alternative derivation of the improved correction step, and shows how the true remainder may be computed from the estimated remainder. At the bottom are some benchmarks.

Credit for the essential observation goes to Colin Bartlett, who spotted that divllu was effectively computing a (potentially negative) remainder, represented as a difference of two non-negative uints. Thanks Colin!

A quick recap: we wish to write a function that divides a 128-bit uint by a 64-bit uint, producing a 64-bit quotient and remainder. We conceive of this as dividing a 4 digit number by a 2 digit number to produce a 2 digit quotient and remainder, where a digit is base 2^32. Call that 4÷2 → 2. Grade school long division (see this post) reduces the problem to 3÷2 → 2. That’s still too big for our 64-bit hardware divider, so we estimate the quotient using a 2÷1 → 2 divide, and then “correct” the estimate, achieving 3÷2 → 2. The correction step is the focus; the idea here is that we can also extract the remainder from the correction.

### Theory

Working in base 2^32, we have a 3 digit dividend ${n_2 n_1 n_0}$ and a 2 digit divisor ${d_1 d_0}$. We wish to divide them: $q = \lfloor \tfrac {n_2 n_1 n_0} {d_1 d_0} \rfloor$. We estimate $\hat{q} = \lfloor \tfrac {n_2 n_1} {d_1} \rfloor$, which can be computed by hardware. We know that $\hat{q}$ exceeds $q$ by no more than 2 (having normalized).

Start by computing the corresponding remainder $\hat{r} = {n_2 n_1} - \hat{q} \times d_1$. (Hardware often provides this “for free” alongside the quotient.) Note this “remainder of the estimate” does NOT approximate the true remainder.

Now the true remainder will be negative if and only if the estimated quotient is too big. For example, if we estimate 13 ÷ 3 ≅ 5, we will compute a remainder of 13 - 3×5 = -2. So if $\hat{q}$ is bigger than the true quotient $q$, the computed remainder $r = {n_2 n_1 n_0} - \hat{q} \times {d_1 d_0}$ will be negative. Sadly we cannot directly perform this subtraction, because tons of stuff will overflow, but we can avoid overflow by “factoring out” the initial estimate. Apply some algebra:

\begin{align}
r &= {n_2 n_1 n_0} - \hat{q} \times {d_1 d_0} \\
  &= {n_2 n_1 0} + n_0 - \hat{q} \times {d_1 d_0} \\
  &= {n_2 n_1 0} + n_0 - \hat{q} \times {d_1 0} - \hat{q} \times {d_0} \\
  &= {n_2 n_1 0} - \hat{q} \times {d_1 0} + n_0 - \hat{q} \times {d_0} \\
  &= ({n_2 n_1} - \hat{q} \times d_1) \times b + n_0 - \hat{q} \times {d_0} \\
  &= \hat{r} \times b + n_0 - \hat{q} \times {d_0}
\end{align}

where $b$ is our digit base 2^32. These two terms correspond to c2 and c1 from last post, only now we can provide a proper interpretation:

• $\hat{r} \times b + n_0$ is the remainder neglecting the low divisor digit, i.e. assuming that digit is 0.
• $\hat{q} \times {d_0}$ is the remainder contribution from the low divisor digit.

As for names, perhaps remd1 and remd0, as these two quantities reflect the contribution to the remainder from each divisor digit.

This leads immediately to the conclusion that $remd1 - remd0$ yields the true remainder, after a correction analogous to the quotient’s. That answers the second dangling question in the affirmative: we can compute the true remainder from the estimated remainder. This in turn reveals a potential optimization: subtract instead of multiplying. Is it faster?

### Code

Recall the correction step from last post, including remainder computation:

(code figure: libdivide “multiply”)
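The post shows this code as an image. Here is a scaled-down sketch in the same spirit: digits are 16 bits (base $b = 2^{16}$) instead of 32, so every intermediate fits comfortably in uint64_t. The names follow the theory section; this is an illustration, not libdivide’s actual code.

```c
#include <stdint.h>

// Sketch of the 3÷2 correction step, remainder via multiply.
// Preconditions: divisor normalized (d1 >= b/2) and {n2 n1} < {d1 d0},
// so the quotient fits in a single digit.
static void divlu_mul(uint64_t n2, uint64_t n1, uint64_t n0,
                      uint64_t d1, uint64_t d0,
                      uint64_t *q, uint64_t *r) {
    const uint64_t b = 1u << 16;
    uint64_t d = d1 * b + d0;          // full divisor
    uint64_t n21 = n2 * b + n1;        // top two dividend digits
    uint64_t qhat = n21 / d1;          // 2÷1 estimate, too big by at most 2
    uint64_t rhat = n21 % d1;          // remainder of the estimate
    uint64_t c1 = qhat * d0;           // contribution of the low divisor digit
    uint64_t c2 = rhat * b + n0;       // remainder neglecting the low digit
    if (c1 > c2)                       // estimate too big: correct downwards
        qhat -= (c1 - c2 > d) ? 2 : 1;
    *q = qhat;
    *r = (n21 * b + n0) - qhat * d;    // remainder via multiply
}
```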

Now we know how to compute the remainder via subtraction, as long as we “correct” it:

(code figure: libdivide “subtract”)
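Again a scaled-down sketch (16-bit digits, not libdivide’s literal code): the remainder is formed as $remd1 - remd0$, which may come out “negative”, and is then corrected alongside the quotient by adding the divisor back.

```c
#include <stdint.h>

// Sketch: remainder via subtraction, r = (rhat*b + n0) - qhat*d0.
// Preconditions as before: d1 >= b/2 and {n2 n1} < {d1 d0}.
static void divlu_sub(uint64_t n2, uint64_t n1, uint64_t n0,
                      uint64_t d1, uint64_t d0,
                      uint64_t *q, uint64_t *rem) {
    const uint64_t b = 1u << 16;
    uint64_t d = d1 * b + d0;
    uint64_t n21 = n2 * b + n1;
    uint64_t qhat = n21 / d1;
    uint64_t rhat = n21 % d1;
    // remd1 - remd0: both fit easily in int64_t at this digit size.
    int64_t r = (int64_t)(rhat * b + n0) - (int64_t)(qhat * d0);
    while (r < 0) {       // at most two iterations; this is the branch
        qhat -= 1;        // the compiler emits, which mispredicts
        r += (int64_t)d;
    }
    *q = qhat;
    *rem = (uint64_t)r;
}
```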

Even though we have “saved” a multiply, this benchmarks slower! The reason is that the compiler now emits a branch, which is mispredicted for 5% of random inputs (according to perf).

We can partially rescue the idea through branchless tricks:

(code figure: libdivide “subtract branchless”)
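One way to express the branchless idea (a sketch, same 16-bit-digit scaling, not the exact code): turn the sign of the uncorrected remainder into a 0/-1 mask, and use it to conditionally add the divisor back and adjust the quotient, twice.

```c
#include <stdint.h>

// Sketch: branchless correction via sign masks. Assumes arithmetic right
// shift of signed values, as all mainstream compilers provide.
static void divlu_sub_branchless(uint64_t n2, uint64_t n1, uint64_t n0,
                                 uint64_t d1, uint64_t d0,
                                 uint64_t *q, uint64_t *rem) {
    const uint64_t b = 1u << 16;
    uint64_t d = d1 * b + d0;
    uint64_t n21 = n2 * b + n1;
    uint64_t qhat = n21 / d1;
    uint64_t rhat = n21 % d1;
    int64_t r = (int64_t)(rhat * b + n0) - (int64_t)(qhat * d0);
    int64_t m1 = r >> 63;         // all-ones iff r < 0
    r += (int64_t)d & m1;         // add d back once if negative
    qhat += (uint64_t)m1;         // i.e. subtract 1 (m1 is 0 or -1)
    int64_t m2 = r >> 63;         // still negative? estimate was off by 2
    r += (int64_t)d & m2;
    qhat += (uint64_t)m2;
    *q = qhat;
    *rem = (uint64_t)r;
}
```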

This is competitive with the “multiply” version, yet harder to follow. Multiplies are cheap.

### Benchmarks

A little microbenchmark: create a sequence of 16k random dividend/divisor pairs, and then measure the time to compute and sum all quotients and remainders. This is repeated 1k times with the same sequence, and the best (fastest) result for each algorithm is kept.

x86-64 has a narrowing divide instruction, divq, so we throw that into the mix. Times are nanoseconds per divide; lower is better.

| CPU | Hacker's Delight | libdivide mul | libdivide sub | libdivide sub branchless | hardware divq |
|---|---|---|---|---|---|
| Intel i7-8700 | 23.5 | 18.6 | 21.2 | 18.2 | 19.7 |
| Intel Xeon 8375C | 17.2 | 10.0 | 14.2 | 11.5 | 2.9 |
| Apple M1 Mac Mini | 10.4 | 4.7 | 9.6 | 5.4 | N/A |

Sadly no real speedup from this technique.

Amusingly the best software implementation narrowly beats divq on Coffee Lake (and probably earlier). This instruction is much faster on the Ice Lake Xeon; here’s why!

If you wish to run this yourself, the (messy) code is here:

clang++ -O3 -std=c++11 divlu_benchmark.cpp && ./a.out

### Conclusions

You really can compute the remainder directly by subtracting the two correction terms! This is a neat insight. Unfortunately it did not produce a consistent speedup, so the reference algorithm (aka “libdivide mul”) won’t change. If you have a narrowing divide, consider upgrading from the version in Hacker’s Delight to the libdivide reference, which is both faster and clearer.

• Angband in WASM August 24th, 2021

Announcing a port of Angband running entirely in-browser. Angband is a venerable roguelike dungeon-crawler set in the Tolkien universe.

There’s no sign-up or other nonsense. Some screenshots:

Click on Unleash the Borg and watch it go. Click on Turbo and watch it go faster!

This port compiles Angband to WASM using Emscripten. It uses a new front end not based on SDL or curses, but instead targeting vanilla JS and HTML. The WASM code runs in a web worker, and communicates with the main thread to receive events and issue rendering commands. Text and graphics are drawn via an HTML table element. JS logic is in TypeScript. Savefiles use IndexedDB.

This port is of the old Angband v3.4.1, chosen specifically to support the automated Angband player, aka the Borg. I hope to port it to the latest Angband and upstream this as a new front end.

If you want to get involved, here’s the GitHub repo; it’s easy to get started, really fun, and there’s lots to do!

Onwards to Morgoth!

• Benchmarking division and libdivide on Apple M1 and Intel AVX512 May 12th, 2021

libdivide is fish’s library for speeding up integer division by runtime constants. It works by replacing divide instructions with cheaper multiplications and shifts.

libdivide supports SIMD, and recently gained support for AVX-512 and ARM NEON. Let’s benchmark on an Intel Xeon and Apple’s M1.

### Test Setup

The microbenchmark is a sum-of-quotients function, like:
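(The post embeds the function as an image; it is presumably close to this sketch, with illustrative names.)

```c
#include <stddef.h>
#include <stdint.h>

// Sum-of-quotients microbenchmark kernel: the divide dominates the runtime.
uint64_t sum_quotients(const uint64_t *vals, size_t count, uint64_t d) {
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += vals[i] / d;
    return sum;
}
```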

The execution time of this function is dominated by the divide. It may be optimized with libdivide:
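(Also shown as an image; the libdivide version likely resembles the following sketch, which uses libdivide’s C API and requires libdivide.h to build.)

```c
#include <stddef.h>
#include <stdint.h>
#include "libdivide.h"

// Same loop, but the runtime-constant divisor is preprocessed once, so each
// divide becomes a multiply and shift.
uint64_t sum_quotients_libdivide(const uint64_t *vals, size_t count, uint64_t d) {
    struct libdivide_u64_t fast_d = libdivide_u64_gen(d);
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += libdivide_u64_do(vals[i], &fast_d);
    return sum;
}
```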

This is typically 2-10x faster, depending on d and count. It also unlocks vectorization for even more speedup - example.

The times below measure the above loop with d=7 (a “slow case”) and count=524k. The loop is repeated 30 times, and the best (fastest) result is kept.[1]

### Intel Xeon 3.0 GHz (8275CL)

Times are nanoseconds per divide, and a percentage improvement compared to hardware (i.e. udiv instruction). Lower is better.

| width | hardware | libdivide scalar | libdivide SSE2 | libdivide AVX2 | libdivide AVX512 |
|---|---|---|---|---|---|
| u32 | 2.298 | 0.872 (-62%) | 0.355 (-84%) | 0.219 (-90%) | 0.210 (-91%) |
| u64 | 6.998 | 0.891 (-87%) | 1.058 (-85%) | 0.574 (-92%) | 0.492 (-93%) |

The three vector ISAs are similar except for width: 128, 256, 512 bits.

AVX512 is twice the width of AVX2 but only marginally faster. Speculatively, AVX512 processes multiplies serially, one 256-bit lane at a time, losing half its parallelism.

libdivide must use 32x32->64 muls (mul_epu32) for vectorized 64 bit, incurring the 4x penalty.[2] We would expect AVX2, with its 4x parallelism, to merely break even; instead it is faster than scalar, perhaps because a 32 bit multiply is faster than a 64 bit one.
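The 4x penalty comes from synthesizing a 64×64 high multiply out of 32×32->64 pieces. A scalar sketch of the same decomposition the vector code must perform:

```c
#include <stdint.h>

// High 64 bits of a 64x64 multiply, built from four 32x32->64 multiplies --
// the decomposition vectorized 64-bit libdivide performs with mul_epu32.
uint64_t umulh64(uint64_t a, uint64_t b) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t lo   = a0 * b0;
    uint64_t mid1 = a1 * b0 + (lo >> 32);        // carries from the low half
    uint64_t mid2 = a0 * b1 + (uint32_t)mid1;
    return a1 * b1 + (mid1 >> 32) + (mid2 >> 32);
}
```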

### Apple M1 3.2 GHz (Mac Mini)

Times are nanoseconds per divide; lower is better.

| width | hardware | libdivide scalar | libdivide NEON |
|---|---|---|---|
| u32 | 0.624 | 0.351 (-43%) | 0.158 (-75%) |
| u64 | 0.623 | 0.315 (-49%) | 0.555 (-11%) |

The M1 is 10x faster than the Xeon at 64 bit divides. It’s…just wow.

The hardware division time of 0.624 nanoseconds is almost exactly two clock periods at 3.2 GHz (0.3125 ns each), suggesting a throughput of 2 clock cycles per divide. That is remarkable: hardware dividers are typically variable-latency and hard to pipeline, so 30+ clock cycles per divide is common. Does the M1 have more than one integer division unit per core?

Amusingly, scalar 64 bit is faster than scalar 32 bit. This is because AArch64 has a dedicated instruction for 64 bit high multiplies (umulh), while 32 bit high multiplies are performed with a 64 bit low multiply (umull) followed by a right shift.
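The 32-bit path described above is, in effect, this one-liner:

```c
#include <stdint.h>

// 32-bit high multiply on AArch64: no dedicated instruction, so the compiler
// emits a 64-bit low multiply (umull) followed by a right shift.
uint32_t umulh32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
```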

NEON with u32 allows four 32-bit high-multiplies with only two umull instructions (godbolt). So NEON u32 should be nearly twice as fast as scalar; in fact it is even faster, probably because of amortized fixed costs (branches, etc).

NEON with u64 is slower because its multiplier is not wide enough, so it incurs the same 4x penalty,[2] and its skinny vector only offers 2x parallelism. (Nevertheless it may still be useful as part of a larger sequence, to keep data in vector registers.)

libdivide is still quite effective at speeding up integer division, even with the M1’s fast divider. Vectorization greatly improves 32 bit, and is a mixed bag for 64 bit, but still faster than hardware.

Some questions:

1. Why is AVX512 only slightly faster than AVX2? Are the umuls serialized per lane, or is it downclocking?

2. How is the Apple M1 able to achieve a throughput of 0.5 divides per cycle? Is the hardware divider pipelined, or is there more than one per core?