golang/go

cmd/compile: suboptimal code generation for 32-bit unsigned division by constants with 33-bit magic

Open

Aperta il 2 giu 2026

Vedi su GitHub
 (2 commenti) (0 reazioni) (0 assegnatari)Go (133.883 star) (19.008 fork)batch import
FixPendingNeedsFixcompiler/runtimehelp wanted

Descrizione

Go version

go version go1.26.3 linux/amd64

Output of go env in your module/workspace:

AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/home/test/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/home/test/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build2569300233=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/dev/null'
GOMODCACHE='/home/test/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/test/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/test/sdk/go1.26.3'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/test/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/test/sdk/go1.26.3/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.26.3'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

Compiled uint32 division by constants whose magic multiplier requires 33 bits, on a 64-bit target (amd64).

func div7(x uint32) uint32  { return x / 7 }
func div14(x uint32) uint32 { return x / 14 }

What did you see happen?

go tool compile -S generates suboptimal code.

For odd divisors (d=7, magic is 33-bit):

MOVQ  AX, CX
SHLQ  $32, CX
IMUL3Q $613566757, AX, AX
ADDQ  CX, AX
RCRQ  $1, AX
SHRQ  $34, AX

For even divisors with 33-bit underlying magic (d=14=2×7):

SHRQ  $1, AX
MOVL  $2454267027, CX
IMULQ CX, AX
SHRQ  $34, AX

What did you expect to see?

On 64-bit targets with native 64x64->128-bit multiply (MULQ on amd64, UMULH on arm64), a pre-shifted magic constant C = (2^32+m) << (32-s) allows a single multiply to give the quotient directly. Clang already generate this form.

Expected for both d=7 and d=14:

MOVQ  $<pre-shifted constant>, AX
MULQ  CX
(result in DX)

Benchmark (Intel Xeon w9-3495X, 3 divisions/iteration):

d=7,19,107: 3.07 ns/op → 1.92 ns/op (~38% faster) d=14,38,214: 2.04 ns/op → 1.92 ns/op (~6% faster)

Reference:

Guida contributor