cmd/compile: riscv instruction bloat in hot loops due to large stack frame offsets
#77541 opened on Feb 11, 2026
Description
Environment
- Go version: go1.25.6
- GOOS: linux
- GOARCH: riscv64
- GORISCV64: rva23u64
Background
I'm benchmarking Go programs across different architectures using golang.org/x/benchmarks (Bent suite). I've observed a consistent performance gap between RISC-V and ARM64, with RISC-V showing ~20-40% slower execution in many CPU-intensive workloads.
After profiling and analyzing the assembly output, I identified a systematic issue: RISC-V generates significantly more instructions in tight loops when accessing local variables with large stack offsets, while ARM64 handles the same code more efficiently.
Example: kanzi-go SBRT Transform
Go source (from github.com/flanglet/kanzi-go/transform/SBRT.go:207):
    func (this *SBRT) Inverse(src, dst []byte) (uint, uint, error) {
        // ... initialization ...
        p := [256]int{} // Stack offset: SP + 2352
        q := [256]int{} // Stack offset: SP + 304
        for i := 0; i < count; i++ { // Hot loop
            c := r2s[src[i]]
            qc := ((i & m1) + (p[c] & m2)) >> s // ← Bottleneck
            p[c] = i
            q[c] = qc
            // ...
        }
    }

RISC-V assembly (loop body for line 207):

    0x110906 ADDI $1176, X2, X19    # SP + 1176 → X19
    0x11090a ADDI $1176, X19, X19   # X19 + 1176 → X19 (total: SP + 2352)
    0x11090e SH3ADD X19, X18, X20   # p_base + (c << 3)
    0x110912 MOV (X20), X21         # Load p[c]

3 instructions to compute the address, repeated every iteration.

ARM64 assembly (same line):

    0x148968 ADD $2360, RSP, R13        # RSP + 2360 → R13
    0x14896c MOVD (R13)(R12<<3), R14    # Load p[c]

1 instruction for the address calculation.
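For anyone who wants to look at this without pulling in kanzi-go, the loop shape can be reduced to something like the sketch below. This is a hypothetical reduction, not the kanzi-go code: the name `inverseLike`, its parameters, and the final summation over `q` (there only to keep the array live) are my own, and I have not confirmed that this exact reduction produces the same instruction sequence.

```go
package main

import "fmt"

// inverseLike is a hypothetical reduction of the SBRT.Inverse hot loop.
// Two [256]int locals occupy 2 KiB each, so at least one of them sits
// beyond the 12-bit immediate range relative to SP.
func inverseLike(src []byte, m1, m2 int, s uint) int {
	var p [256]int
	var q [256]int
	for i := 0; i < len(src); i++ {
		c := src[i]
		qc := ((i & m1) + (p[c] & m2)) >> s
		p[c] = i
		q[c] = qc
	}
	// Read q back so the stores to it cannot be discarded.
	acc := 0
	for _, v := range q {
		acc += v
	}
	return acc
}

func main() {
	src := make([]byte, 1<<16)
	for i := range src {
		src[i] = byte(i * 31)
	}
	fmt.Println(inverseLike(src, -1, -1, 1))
}
```

Building with `GOARCH=riscv64 go build -gcflags=-S` prints the generated assembly, which should make it possible to check whether the address of `p` is recomputed inside the loop.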
Root Cause
RISC-V's 12-bit immediate range (-2048 to +2047) forces the compiler to split large offsets into multiple ADDI instructions. Since the SSA backend currently treats these multi-instruction sequences as "cheap to rematerialize", it recomputes them inside loops instead of hoisting the base address calculation out.
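To make the immediate-range constraint concrete, here is a small model of the split. It is only a sketch that mirrors the halving visible in the listing above (2352 = 1176 + 1176); `fitsImm12` and `splitADDI` are hypothetical helper names, not functions from the Go toolchain, and offsets too large for two ADDIs are deliberately not modelled.

```go
package main

import "fmt"

// fitsImm12 reports whether v fits in a signed 12-bit immediate.
func fitsImm12(v int64) bool {
	return v >= -2048 && v <= 2047
}

// splitADDI returns the immediates of an ADDI chain that adds off to a
// base register. Hypothetical model: a single ADDI when the offset fits,
// otherwise two ADDIs of roughly half the offset each. It returns nil
// when two ADDIs are not enough (a different sequence would be needed).
func splitADDI(off int64) []int64 {
	if fitsImm12(off) {
		return []int64{off}
	}
	lo := off / 2
	hi := off - lo
	if fitsImm12(lo) && fitsImm12(hi) {
		return []int64{lo, hi}
	}
	return nil
}

func main() {
	for _, off := range []int64{304, 2352, 5000} {
		fmt.Println(off, splitADDI(off))
	}
}
```

Under this model the offset of `q` (304) needs one ADDI while the offset of `p` (2352) needs two, which matches the difference seen in the loop body.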
Scope
This issue is not limited to kanzi-go. I've observed similar patterns across multiple Bent benchmarks:
- Invocation/interpreter/string_manipulation_size_50 (github.com/tetratelabs/wazero/internal/integration_test/bench)
- Invocation/interpreter/random_mat_mul_size_20 (github.com/tetratelabs/wazero/internal/integration_test/bench)
- Any code with large stack frames and tight loops
The cumulative effect across these workloads contributes significantly to RISC-V's performance gap.
Expected Behavior
The compiler should recognize that:
- Large stack offset calculations (requiring 2+ instructions) are expensive
- These calculations are loop-invariant
- They should be hoisted before the loop and stored in a register
Ideal RISC-V output:
Before loop:
    ADDI $1176, X2, X19
    ADDI $1176, X19, X19    # X19 = SP + 2352 (p_base)

    .Loop:
    SH3ADD X19, X18, X20    # Use pre-computed p_base
    MOV (X20), X21
    # ... loop body ...
    JMP .Loop
Question
Is this a known limitation? Are there plans to improve the cost model for RISC-V's limited immediate range, or to enhance LICM (Loop Invariant Code Motion) for architecture-specific multi-instruction sequences?
I'm happy to provide more detailed profiles or assist with testing potential fixes.