Benchmarking division and libdivide on Apple M1 and Intel AVX512
May 12th, 2021

libdivide is fish’s library for speeding up integer division by runtime constants. It works by replacing divide instructions with cheaper multiplications and shifts.

libdivide supports SIMD, and recently gained support for AVX-512 and ARM NEON. Let’s benchmark on an Intel Xeon and Apple’s M1.

Test Setup

The microbenchmark is a sum-of-quotients function, like:

uint32_t sumq(const uint32_t *vals, size_t count, uint32_t d) {
    uint32_t sum = 0;
    for (size_t i=0; i < count; i++)
        sum += vals[i] / d; // this division is the bottleneck
    return sum;
}

The execution time of this function is dominated by the divide. It may be optimized with libdivide:

uint32_t sumq(const uint32_t *vals, size_t count, uint32_t d) {
    libdivide::divider<uint32_t> div(d);
    uint32_t sum = 0;
    for (size_t i=0; i < count; i++)
        sum += vals[i] / div; // faster libdivide division
    return sum;
}

This is typically 2-10x faster, depending on d and count. It also unlocks vectorization for even more speedup, as in the sketch below.
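For illustration, here is a hand-vectorized AVX2 variant of the loop (a sketch, not the code used for the measurements; it assumes libdivide's vector operator overloads, which are enabled by defining LIBDIVIDE_AVX2 before including libdivide.h and compiling with AVX2 enabled):

#define LIBDIVIDE_AVX2 // enable libdivide's __m256i overloads (define before the include)
#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include "libdivide.h"

uint32_t sumq_avx2(const uint32_t *vals, size_t count, uint32_t d) {
    libdivide::divider<uint32_t> div(d);
    __m256i vsum = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(vals + i));
        vsum = _mm256_add_epi32(vsum, v / div); // eight divisions per iteration
    }
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, vsum);
    uint32_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3] +
                   lanes[4] + lanes[5] + lanes[6] + lanes[7];
    for (; i < count; i++)
        sum += vals[i] / div; // scalar tail
    return sum;
}

The SSE2, AVX512 and NEON variants look the same apart from the vector width and intrinsic names.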

The times below measure the sum-of-quotients loop with d=7 (a “slow case”) and count=524k. The loop is repeated 30 times, and the best (fastest) result is kept.[1]
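For reference, a best-of-30 harness in this spirit might look like the following sketch (the actual numbers come from libdivide's benchmark tool, described in the Notes; bench_ns_per_divide is just an illustrative name):

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <vector>

// sumq(vals, count, d) as defined above.
double bench_ns_per_divide(uint32_t d = 7, size_t count = 524288, int reps = 30) {
    std::vector<uint32_t> vals(count, 123456789u); // arbitrary fixed data
    double best = 1e300;
    for (int r = 0; r < reps; r++) {
        auto t0 = std::chrono::steady_clock::now();
        volatile uint32_t sink = sumq(vals.data(), count, d); // keep the work observable
        (void)sink;
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        best = std::min(best, ns / double(count));
    }
    return best; // fastest observed nanoseconds per divide
}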

Intel Xeon 3.0 GHz (8275CL)

Times are nanoseconds per divide, and a percentage improvement compared to hardware (i.e. the div instruction). Lower is better.

width   hardware   libdivide scalar   libdivide SSE2   libdivide AVX2   libdivide AVX512
u32     2.298      0.872 (-62%)       0.355 (-84%)     0.219 (-90%)     0.210 (-91%)
u64     6.998      0.891 (-87%)       1.058 (-85%)     0.574 (-92%)     0.492 (-93%)

The three vector ISAs are similar except for width: 128, 256, 512 bits.

AVX512 is twice the width of AVX2 but only marginally faster. Speculatively, AVX512 processes multiplies serially, one 256-bit lane at a time, losing half its parallelism.

For vectorized 64 bit division, libdivide must build the high multiply out of 32x32->64 multiplies (mul_epu32), incurring the 4x penalty.[2] We would expect AVX2, whose four 64-bit lanes give 4x parallelism, to merely break even with scalar; instead it is faster, perhaps because a 32 bit multiply is cheaper than a 64 bit one.

Apple M1 3.2 GHz (Mac Mini)

Times are nanoseconds per divide, with the percentage improvement relative to hardware; lower is better.

width   hardware   libdivide scalar   libdivide NEON
u32     0.624      0.351 (-43%)       0.158 (-75%)
u64     0.623      0.315 (-49%)       0.555 (-11%)

The M1 is 10x faster than the Xeon at 64 bit divides. It’s…just wow.

The hardware division time of 0.624 nanoseconds is almost exactly two clock periods at 3.2 GHz (2 × 0.3125 ns = 0.625 ns), suggesting a throughput of one divide every 2 clock cycles. That is remarkable: hardware dividers are typically variable-latency and hard to pipeline, so 30+ clock cycles per divide is common. Does the M1 have more than one integer division unit per core?

Amusingly, scalar 64 bit is faster than scalar 32 bit. This is because AArch64 has a dedicated instruction for 64 bit high multiplies (umulh), while a 32 bit high multiply is performed with a full-width 64 bit multiply (umull) followed by a right shift.
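Concretely, the two scalar high multiplies can be written like this (a sketch; libdivide's internal helpers differ in detail):

#include <cstdint>

// 32-bit high multiply: one widening multiply (umull) plus a shift on AArch64.
static inline uint32_t mullhi_u32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

// 64-bit high multiply: a single umulh on AArch64, expressed here via the
// compiler's 128-bit integer type (a GCC/Clang extension).
static inline uint64_t mullhi_u64(uint64_t a, uint64_t b) {
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}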

NEON with u32 allows four 32-bit high-multiplies with only two umull instructions (godbolt). So NEON u32 should be nearly twice as fast as scalar; in fact it is even faster, probably because of amortized fixed costs (branches, etc.).
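A sketch of that trick with NEON intrinsics (not libdivide's exact code):

#include <arm_neon.h>

// Four 32-bit high multiplies from two widening multiplies (umull/umull2),
// keeping only the top 32 bits of each 64-bit product.
static inline uint32x4_t mullhi_u32x4(uint32x4_t a, uint32x4_t b) {
    uint64x2_t lo = vmull_u32(vget_low_u32(a), vget_low_u32(b));   // lanes 0,1
    uint64x2_t hi = vmull_u32(vget_high_u32(a), vget_high_u32(b)); // lanes 2,3
    return vcombine_u32(vshrn_n_u64(lo, 32), vshrn_n_u64(hi, 32));
}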

NEON with u64 is slower than scalar because its multiplier is not wide enough, so it incurs the 4x penalty,[2] while its skinny 128-bit vector offers only 2x parallelism. (Nevertheless it may still be useful as part of a larger sequence, to keep data in vector registers.)

Conclusions and unanswered questions

libdivide is still quite effective at speeding up integer division, even against the M1’s fast divider. Vectorization greatly improves 32 bit division; for 64 bit it is a mixed bag, but still faster than hardware.

Some questions:

  1. Why is AVX512 only slightly faster than AVX2? Are the multiplies serialized per lane, or is it downclocking?

  2. How is the Apple M1 able to achieve a throughput of 0.5 divides per cycle? Is the hardware divider pipelined, or is there more than one per core?

Thanks for reading!

Notes

The results were collected using libdivide’s benchmark tool. You can run it yourself:

git clone https://github.com/ridiculousfish/libdivide.git
mkdir libdivide/build && cd libdivide/build
cmake .. && make benchmark
./benchmark u32 # or u64 or s32 or s64
  1. 7 is a “slow case” for libdivide, because 7’s magic number must be 33 or 65 bits, and so cannot fit in a register. This is as slow as libdivide gets. 

  2. To divide a 64 bit number, libdivide needs the high half of a 64 bit product, a so-called high multiply. Unfortunately SIMD hardware typically has a 53-bit multiplier: just big enough for double-precision floating point, not big enough for libdivide. So libdivide must cobble together a 64-bit high-multiply through four 32x32->64 multiplies. This incurs a factor-of-4 penalty for 64 bit vectorized division on SIMD.
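To illustrate that construction, here is a two-lane 64-bit high multiply sketched with SSE2 intrinsics (not libdivide's exact code): split each operand into 32-bit halves, form the partial products with mul_epu32, and combine them with shifts and adds.

#include <emmintrin.h>

// 64-bit unsigned high multiply for two lanes, built from four 32x32->64
// multiplies (mul_epu32), as described in note 2.
static inline __m128i mullhi_u64x2(__m128i x, __m128i y) {
    const __m128i lo32 = _mm_set1_epi64x(0xFFFFFFFFULL);
    __m128i x0 = _mm_and_si128(x, lo32);   // low 32 bits of each lane
    __m128i x1 = _mm_srli_epi64(x, 32);    // high 32 bits of each lane
    __m128i y0 = _mm_and_si128(y, lo32);
    __m128i y1 = _mm_srli_epi64(y, 32);

    __m128i x0y0_hi = _mm_srli_epi64(_mm_mul_epu32(x0, y0), 32); // carry out of the low product
    __m128i x1y0    = _mm_mul_epu32(x1, y0);
    __m128i x0y1    = _mm_mul_epu32(x0, y1);
    __m128i x1y1    = _mm_mul_epu32(x1, y1);

    __m128i t    = _mm_add_epi64(x1y0, x0y0_hi);
    __m128i t_lo = _mm_and_si128(t, lo32);
    __m128i t_hi = _mm_srli_epi64(t, 32);
    __m128i mid  = _mm_srli_epi64(_mm_add_epi64(t_lo, x0y1), 32);
    return _mm_add_epi64(_mm_add_epi64(x1y1, t_hi), mid); // top 64 bits of each product
}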