golang/go
Vedi su GitHubcmd/compile: suboptimal code generation for 32-bit unsigned division by constants with 33-bit magic
Open
Aperta il 2 giu 2026
FixPendingNeedsFixcompiler/runtimehelp wanted
Descrizione
Go version
go version go1.26.3 linux/amd64
Output of go env in your module/workspace:
AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/home/test/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/home/test/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build2569300233=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/dev/null'
GOMODCACHE='/home/test/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/test/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/test/sdk/go1.26.3'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/test/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/test/sdk/go1.26.3/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.26.3'
GOWORK=''
PKG_CONFIG='pkg-config'
What did you do?
Compiled uint32 division by constants whose magic multiplier requires 33 bits, on a 64-bit target (amd64).
func div7(x uint32) uint32 { return x / 7 }
func div14(x uint32) uint32 { return x / 14 }
What did you see happen?
go tool compile -S generates suboptimal code.
For odd divisors (d=7, magic is 33-bit):
MOVQ AX, CX
SHLQ $32, CX
IMUL3Q $613566757, AX, AX
ADDQ CX, AX
RCRQ $1, AX
SHRQ $34, AX
For even divisors with 33-bit underlying magic (d=14=2×7):
SHRQ $1, AX
MOVL $2454267027, CX
IMULQ CX, AX
SHRQ $34, AX
What did you expect to see?
On 64-bit targets with native 64x64->128-bit multiply (MULQ on amd64, UMULH on arm64),
a pre-shifted magic constant C = (2^32+m) << (32-s) allows a single multiply to give the quotient directly.
Clang already generate this form.
Expected for both d=7 and d=14:
MOVQ $<pre-shifted constant>, AX
MULQ CX
(result in DX)
Benchmark (Intel Xeon w9-3495X, 3 divisions/iteration):
d=7,19,107: 3.07 ns/op → 1.92 ns/op (~38% faster) d=14,38,214: 2.04 ns/op → 1.92 ns/op (~6% faster)
Reference: