libdivide is fish’s library for speeding up integer division by runtime constants. It works by replacing divide instructions with cheaper multiplications and shifts.
libdivide supports SIMD, and recently gained support for AVX512 and ARM NEON. Let’s benchmark on an Intel Xeon and Apple’s M1.
Test Setup
The microbenchmark is a sum-of-quotients function, like:
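In sketch form it is just a loop of divides by a runtime-constant divisor (illustrative names; the benchmark's actual source may differ slightly):

```cpp
#include <cstddef>
#include <cstdint>

// Sum the quotients numers[i] / d. The divisor d is constant across the
// loop but only known at runtime, so the compiler must emit real divide
// instructions.
uint64_t sum_of_quotients(const uint64_t *numers, size_t count, uint64_t d) {
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += numers[i] / d;
    return sum;
}
```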
The execution time of this function is dominated by the divide. It may be optimized with libdivide:
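A sketch of the libdivide version: the divisor is preprocessed once into a `libdivide::divider`, whose overloaded `operator/` applies the multiply-and-shift (requires `libdivide.h` from the library):

```cpp
#include <cstddef>
#include <cstdint>
#include "libdivide.h"

// Same loop, but d is preprocessed once into its magic-number form.
// Each iteration then divides via multiplies and shifts instead of a
// hardware divide instruction.
uint64_t sum_of_quotients(const uint64_t *numers, size_t count, uint64_t d) {
    uint64_t sum = 0;
    libdivide::divider<uint64_t> fast_d(d);  // expensive setup, done once
    for (size_t i = 0; i < count; i++)
        sum += numers[i] / fast_d;           // cheap multiply + shift
    return sum;
}
```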
This is typically 2-10x faster, depending on `d` and `count`. It also unlocks vectorization for even more speedup (example).
The times below measure the above loop with `d=7` (a “slow case”) and `count=524k`. The loop is repeated 30 times, and the best (fastest) result is kept. ^{1}
Intel Xeon 3.0 GHz (8275CL)
Times are nanoseconds per divide, and a percentage improvement compared to hardware (i.e. the `udiv` instruction). Lower is better.
| width | hardware | libdivide scalar | libdivide SSE2 | libdivide AVX2 | libdivide AVX512 |
|-------|----------|------------------|----------------|----------------|------------------|
| u32   | 2.298    | 0.872 (62%)      | 0.355 (84%)    | 0.219 (90%)    | 0.210 (91%)      |
| u64   | 6.998    | 0.891 (87%)      | 1.058 (85%)    | 0.574 (92%)    | 0.492 (93%)      |
The three vector ISAs are similar except for width: 128, 256, 512 bits.
AVX512 is twice the width of AVX2 but only marginally faster. Speculatively, AVX512 processes these multiplies serially, one 256-bit lane at a time, losing half its parallelism.
libdivide must use 32x32->64 muls (`mul_epu32`) for vectorized 64 bit, incurring the 4x penalty ^{2}. We would expect to break even at AVX2, which has 4x parallelism; instead AVX2 is faster than scalar, perhaps because a 32 bit multiply is faster than a 64 bit one.
Apple M1 3.2 GHz (Mac Mini)
Times are nanoseconds per divide; lower is better.
| width | hardware | libdivide scalar | libdivide NEON |
|-------|----------|------------------|----------------|
| u32   | 0.624    | 0.351 (43%)      | 0.158 (75%)    |
| u64   | 0.623    | 0.315 (49%)      | 0.555 (11%)    |
The M1 is 10x faster than the Xeon at 64 bit divides. It’s…just wow.
The hardware division time of 0.624 nanoseconds is a clean multiple of the 3.2 GHz clock (0.624 ns × 3.2 GHz ≈ 2.0), suggesting a throughput of 2 clock cycles per divide. That is remarkable: hardware dividers are typically variable-latency and hard to pipeline, so 30+ clock cycles per divide is common. Does the M1 have more than one integer division unit per core?
Amusingly, scalar 64 bit is faster than scalar 32 bit. This is because AArch64 has a dedicated instruction for 64 bit high multiplies (`umulh`), while 32 bit high multiplies are performed with a 64 bit low multiply (`umull`), followed by a right shift.
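The difference can be sketched in portable C++ (illustrative helpers; the 64 bit case relies on the GCC/Clang `__int128` extension):

```cpp
#include <cstdint>

// 64-bit high multiply: a single umulh instruction on AArch64.
// (Written portably here via the compilers' 128-bit integer extension.)
uint64_t high_mul_64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

// 32-bit high multiply: no dedicated instruction on AArch64, so it
// compiles to a full 64-bit low multiply (umull) plus a right shift.
uint32_t high_mul_32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
```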
NEON with u32 allows four 32-bit high-multiplies with only two `umull` instructions (godbolt). So NEON u32 should be nearly twice as fast as scalar; in fact it is even faster, probably because of amortized fixed costs (branches, etc.).
NEON with u64 is slower because its multiplier is not wide enough, so it incurs the 4x penalty ^{2}, and its skinny 128-bit vector only offers 2x parallelism. (Nevertheless it may still be useful as part of a larger sequence, to keep data in vector registers.)
Conclusions and unanswered questions
libdivide is still quite effective at speeding up integer division, even with the M1’s fast divider. Vectorization greatly improves 32 bit, and is a mixed bag for 64 bit, but still faster than hardware.
Some questions:

Why is AVX512 only slightly faster than AVX2? Are the multiplies serialized per-lane, or is it downclocking?

How is the Apple M1 able to achieve a throughput of 0.5 divides per cycle? Is the hardware divider pipelined, or is there more than one per core?
Thanks for reading!
Notes
The results were collected using libdivide’s benchmark tool. You can run it yourself:
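Something along these lines should work, though the exact CMake options and target names depend on the libdivide version:

```shell
git clone https://github.com/ridiculousfish/libdivide.git
cd libdivide
mkdir build && cd build
cmake ..        # benchmark targets may require an extra -D option
make
./benchmark u64 # or u32, s32, s64
```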

7 is a “slow case” for libdivide, because 7’s magic number must be 33 or 65 bits, and so cannot fit in a register. This is as slow as libdivide gets. ↩

To divide a 64 bit number, libdivide needs the high half of a 64 bit product: a so-called high multiply. Unfortunately SIMD hardware typically has a 53-bit multiplier: just big enough for double-precision floating point, not big enough for libdivide. So libdivide must cobble together a 64-bit high-multiply through four 32x32->64 multiplies. This incurs a factor-of-4 penalty for 64 bit vectorized division on SIMD. ↩ ↩^{2}
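A scalar sketch of the cobbling-together (illustrative, not libdivide's exact code); each of the four partial products corresponds to one 32x32->64 multiply in the SIMD path:

```cpp
#include <cstdint>

// High 64 bits of a 64x64 product, built from four 32x32->64 multiplies,
// mirroring what libdivide must do in SIMD registers where only
// 32x32->64 (mul_epu32) is available.
uint64_t high_mul_64_from_32(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;  // multiply 1
    uint64_t lo_hi = a_lo * b_hi;  // multiply 2
    uint64_t hi_lo = a_hi * b_lo;  // multiply 3
    uint64_t hi_hi = a_hi * b_hi;  // multiply 4

    // Carry out of the low 64 bits: sum lo_lo's high half with the low
    // halves of the cross terms, then keep whatever overflows 32 bits.
    uint64_t carry = ((lo_lo >> 32) + (uint32_t)lo_hi + (uint32_t)hi_lo) >> 32;
    return hi_hi + (lo_hi >> 32) + (hi_lo >> 32) + carry;
}
```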