golang/go

cmd/compile: riscv instruction bloat in hot loops due to large stack frame offsets

Open

#77541 opened on Feb 11, 2026

Labels: BugReport, NeedsInvestigation, Performance, arch-riscv, compiler/runtime, help wanted

Description

Environment

  • Go version: go1.25.6
  • GOOS: linux
  • GOARCH: riscv64
  • GORISCV64: rva23u64

Background

I'm benchmarking Go programs across different architectures using golang.org/x/benchmarks (Bent suite). I've observed a consistent performance gap between RISC-V and ARM64, with RISC-V showing ~20-40% slower execution in many CPU-intensive workloads.

After profiling and analyzing the assembly output, I identified a systematic issue: RISC-V generates significantly more instructions in tight loops when accessing local variables with large stack offsets, while ARM64 handles the same code more efficiently.

Example: kanzi-go SBRT Transform

Go source (from github.com/flanglet/kanzi-go/transform/SBRT.go:207):

    func (this *SBRT) Inverse(src, dst []byte) (uint, uint, error) {
        // ... initialization ...
        p := [256]int{} // Stack offset: SP + 2352
        q := [256]int{} // Stack offset: SP + 304

        for i := 0; i < count; i++ { // Hot loop
            c := r2s[src[i]]
            qc := ((i & m1) + (p[c] & m2)) >> s // ← Bottleneck
            p[c] = i
            q[c] = qc
            // ...
        }
    }

RISC-V assembly (loop body for line 207):

    0x110906  ADDI    $1176, X2, X19    # SP + 1176 → X19
    0x11090a  ADDI    $1176, X19, X19   # X19 + 1176 → X19 (total: SP + 2352)
    0x11090e  SH3ADD  X19, X18, X20     # p_base + (c << 3)
    0x110912  MOV     (X20), X21        # Load p[c]

3 instructions to compute the address, repeated every iteration.

ARM64 assembly (same line):

    0x148968  ADD   $2360, RSP, R13       # RSP + 2360 → R13
    0x14896c  MOVD  (R13)(R12<<3), R14    # Load p[c]

1 instruction for the address calculation.
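
For anyone who wants to look at the pattern without pulling in kanzi-go, the loop shape can be approximated with a standalone program. This is a minimal sketch, not the kanzi-go code: `inverseLoop`, its parameters, and the returned checksum are hypothetical, and I have not verified that it produces the same frame layout or the same instruction sequence as the listing above.

```go
package main

import "fmt"

// inverseLoop is a hypothetical stand-in for the hot loop in SBRT.Inverse.
// It keeps two [256]int arrays on the stack so that, on a 64-bit target,
// at least one of them is likely to sit more than 2047 bytes above SP.
// The actual offsets depend on the compiler's frame layout.
//
//go:noinline
func inverseLoop(src []byte, m1, m2 int, s uint) int {
	var p, q [256]int
	sum := 0
	for i := 0; i < len(src); i++ {
		c := src[i]
		qc := ((i & m1) + (p[c] & m2)) >> s // indexed load from p on every iteration
		p[c] = i
		q[c] = qc
		sum += qc
	}
	// Read q back so that the stores to it are observable.
	return sum + q[0]
}

func main() {
	src := make([]byte, 1<<16)
	for i := range src {
		src[i] = byte(i * 31)
	}
	fmt.Println(inverseLoop(src, -1, -1, 1))
}
```

Building it with `GOOS=linux GOARCH=riscv64 go build -gcflags=-S` prints the generated assembly, which can then be checked for repeated ADDI pairs inside the loop body.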

Root Cause

RISC-V's 12-bit immediate range (-2048 to +2047) forces the compiler to split large offsets into multiple ADDI instructions. Since the SSA backend currently treats these multi-instruction sequences as "cheap to rematerialize", it recomputes them inside loops instead of hoisting the base address calculation out.
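
The arithmetic behind the split can be sketched in a few lines of Go. This is an illustration under stated assumptions: `fitsImm12` and `splitOffset` are hypothetical helpers, and the halving strategy simply mirrors the 1176 + 1176 pair seen in the listing above. It is not a copy of the toolchain's logic.

```go
package main

import "fmt"

// Range of a signed 12-bit immediate, as used by ADDI.
const (
	immMin = -2048
	immMax = 2047
)

// fitsImm12 reports whether v can be encoded in a single ADDI.
func fitsImm12(v int64) bool { return v >= immMin && v <= immMax }

// splitOffset returns the immediates of an ADDI chain that sums to off.
// Hypothetical helper: one immediate if it fits, two halves if the offset
// is within twice the 12-bit range, nil otherwise (a larger constant
// would need a different sequence, which this sketch does not model).
func splitOffset(off int64) []int64 {
	if fitsImm12(off) {
		return []int64{off}
	}
	if off >= 2*immMin && off <= 2*immMax {
		half := off / 2
		return []int64{half, off - half}
	}
	return nil
}

func main() {
	// 304 and 2352 are the stack offsets of q and p in the example above.
	for _, off := range []int64{304, 2352} {
		fmt.Println(off, splitOffset(off))
	}
}
```

With these inputs, the offset of q (304) fits in one immediate, while the offset of p (2352) needs two.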

Scope

This issue is not limited to kanzi-go. I've observed similar patterns across multiple Bent benchmarks:

  • Invocation/interpreter/string_manipulation_size_50 (github.com/tetratelabs/wazero/internal/integration_test/bench)
  • Invocation/interpreter/random_mat_mul_size_20 (github.com/tetratelabs/wazero/internal/integration_test/bench)
  • Any code with large stack frames and tight loops

The cumulative effect across these workloads contributes significantly to RISC-V's performance gap.

Expected Behavior

The compiler should recognize that:

  1. Large stack offset calculations (requiring 2+ instructions) are expensive
  2. These calculations are loop-invariant
  3. They should be hoisted before the loop and stored in a register

Ideal RISC-V output:

Before loop:

    ADDI    $1176, X2, X19
    ADDI    $1176, X19, X19   # X19 = SP + 2352 (p_base)

.Loop:

    SH3ADD  X19, X18, X20     # Use pre-computed p_base
    MOV     (X20), X21
    # ... loop body ...
    JMP     .Loop
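
To put a rough number on the difference, the address-computation instructions can be counted for both shapes. This is a back-of-the-envelope sketch: the per-iteration counts come from the listings in this report, the function names are hypothetical, and it ignores the load itself, the rest of the loop body, and any microarchitectural effects.

```go
package main

import "fmt"

// Instruction counts taken from the RISC-V listings in this report.
const (
	baseInstrs  = 2 // ADDI + ADDI to form SP + 2352
	indexInstrs = 1 // SH3ADD to add the scaled index
)

// addrInstrsInLoop counts address instructions when the base is rebuilt
// on every iteration (the current output).
func addrInstrsInLoop(n int) int { return n * (baseInstrs + indexInstrs) }

// addrInstrsHoisted counts address instructions when the base is built
// once before the loop (the ideal output).
func addrInstrsHoisted(n int) int { return baseInstrs + n*indexInstrs }

func main() {
	for _, n := range []int{1, 1000, 1000000} {
		fmt.Printf("n=%d in-loop=%d hoisted=%d\n",
			n, addrInstrsInLoop(n), addrInstrsHoisted(n))
	}
}
```

For long loops the address work approaches one third of the current count. The trade-off is that the hoisted base keeps one more register live across the loop, so a cost model would presumably have to weigh this against register pressure.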

Question

Is this a known limitation? Are there plans to improve the cost model for RISC-V's limited immediate range, or to enhance LICM (Loop Invariant Code Motion) for architecture-specific multi-instruction sequences?

I'm happy to provide more detailed profiles or assist with testing potential fixes.
