Repository metrics

Stars: (133,883 stars)
PR merge metrics: (No merged PRs in 30d)

Description

Environment

Go version: go1.25.6
GOOS: linux
GOARCH: riscv64
GORISCV64: rva23u64

Background

I'm benchmarking Go programs across different architectures using golang.org/x/benchmarks (Bent suite). I've observed a consistent performance gap between RISC-V and ARM64, with RISC-V showing ~20-40% slower execution in many CPU-intensive workloads.

After profiling and analyzing the assembly output, I identified a systematic issue: RISC-V generates significantly more instructions in tight loops when accessing local variables with large stack offsets, while ARM64 handles the same code more efficiently.

Example: kanzi-go SBRT Transform

Go source (from github.com/flanglet/kanzi-go/transform/SBRT.go:207):

    func (this *SBRT) Inverse(src, dst []byte) (uint, uint, error) {
        // ... initialization ...
        p := [256]int{} // Stack offset: SP + 2352
        q := [256]int{} // Stack offset: SP + 304

        for i := 0; i < count; i++ { // Hot loop
            c := r2s[src[i]]
            qc := ((i & m1) + (p[c] & m2)) >> s // ← Bottleneck
            p[c] = i
            q[c] = qc
            // ...
        }
    }

RISC-V assembly (loop body for line 207):

    0x110906 ADDI $1176, X2, X19     # SP + 1176 → X19
    0x11090a ADDI $1176, X19, X19    # X19 + 1176 → X19 (total: SP + 2352)
    0x11090e SH3ADD X19, X18, X20    # p_base + (c << 3)
    0x110912 MOV (X20), X21          # Load p[c]

3 instructions to compute the address, repeated every iteration.

ARM64 assembly (same line):

    0x148968 ADD $2360, RSP, R13         # RSP + 2360 → R13
    0x14896c MOVD (R13)(R12<<3), R14     # Load p[c]

1 instruction for the address calculation.

Root Cause

RISC-V's 12-bit immediate range (-2048 to +2047) forces the compiler to split large offsets into multiple ADDI instructions. Since the SSA backend currently treats these multi-instruction sequences as "cheap to rematerialize", it recomputes them inside loops instead of hoisting the base address calculation out.
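To make the immediate-range constraint concrete, here is a minimal sketch of how an offset might be split into ADDI immediates. The helper names `fitsImm12` and `splitOffset` are hypothetical, and the halving strategy is an assumption inferred from the disassembly above (2352 appears as 1176 + 1176); it is not taken from the Go assembler's source.

```go
package main

import "fmt"

// fitsImm12 reports whether v fits a signed 12-bit immediate,
// the range quoted above for ADDI (-2048 to +2047).
func fitsImm12(v int64) bool {
	return v >= -2048 && v <= 2047
}

// splitOffset is a hypothetical helper. It returns the ADDI immediates
// needed to add off to a base register, halving the offset when a single
// immediate is not enough. ok is false when two ADDIs cannot reach the
// offset and a longer sequence would be required.
func splitOffset(off int64) (imms []int64, ok bool) {
	if fitsImm12(off) {
		return []int64{off}, true
	}
	lo := off / 2
	hi := off - lo
	if fitsImm12(lo) && fitsImm12(hi) {
		return []int64{lo, hi}, true
	}
	return nil, false
}

func main() {
	// 304 and 2352 are the offsets of q and p in the example above;
	// 5000 is an arbitrary offset that two ADDIs cannot cover.
	for _, off := range []int64{304, 2352, 5000} {
		imms, ok := splitOffset(off)
		fmt.Println(off, imms, ok)
	}
}
```

Under this model the offset of q (304) needs one ADDI while the offset of p (2352) needs two, which matches the extra instruction seen in the loop body.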

Scope

This issue is not limited to kanzi-go. I've observed similar patterns across multiple Bent benchmarks:

Invocation/interpreter/string_manipulation_size_50 (github.com/tetratelabs/wazero/internal/integration_test/bench)
Invocation/interpreter/random_mat_mul_size_20 (github.com/tetratelabs/wazero/internal/integration_test/bench)
Any code with large stack frames and tight loops

The cumulative effect across these workloads contributes significantly to RISC-V's performance gap.
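A standalone reproduction could look roughly like the sketch below. The function `inverseLike` is a hypothetical stand-in for the kanzi-go loop, not code from that project, and whether it triggers the same instruction sequence on a given toolchain is an assumption that would need checking against the generated assembly.

```go
package main

import "fmt"

// inverseLike is a hypothetical stand-in for the hot loop in the issue.
// Two 256-element int arrays take roughly 4 KiB on 64-bit targets, so at
// least one array base should lie beyond a 12-bit stack offset.
//
//go:noinline
func inverseLike(src []byte) int {
	var p [256]int
	var q [256]int
	for i := 0; i < len(src); i++ {
		c := src[i]
		qc := (i + p[c]) >> 1 // indexed load from p inside the hot loop
		p[c] = i
		q[c] = qc
	}
	// Sum q so that both arrays stay live.
	total := 0
	for _, v := range q {
		total += v
	}
	return total
}

func main() {
	src := make([]byte, 1<<20)
	for i := range src {
		src[i] = byte(i * 31)
	}
	fmt.Println(inverseLike(src))
}
```

Building with `GOOS=linux GOARCH=riscv64 go build -gcflags=-S` prints the compiler's assembly, which can then be compared with a `GOARCH=arm64` build of the same file.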

Expected Behavior

The compiler should recognize that:

Large stack offset calculations (requiring 2+ instructions) are expensive
These calculations are loop-invariant
They should be hoisted before the loop and stored in a register

Ideal RISC-V output:

Before loop:

    ADDI $1176, X2, X19
    ADDI $1176, X19, X19     # X19 = SP + 2352 (p_base)

    .Loop:
    SH3ADD X19, X18, X20     # Use pre-computed p_base
    MOV (X20), X21
    # ... loop body ...
    JMP .Loop
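The saving from hoisting can be estimated with simple instruction counting. The sketch below uses a hypothetical helper, `loopInstrs`, and counts only the address setup and the indexed load for p[c]; it models instruction counts, not cycles, and ignores the rest of the loop body.

```go
package main

import "fmt"

// loopInstrs is a hypothetical counting model. setup is the number of
// instructions that form the loop-invariant base address (2 ADDIs here),
// perIter is the work that must stay in the loop (SH3ADD + MOV here).
func loopInstrs(iters, setup, perIter int, hoisted bool) int {
	if hoisted {
		// Base address computed once before the loop.
		return setup + iters*perIter
	}
	// Base address recomputed on every iteration.
	return iters * (setup + perIter)
}

func main() {
	const iters = 1000
	fmt.Println("rematerialized:", loopInstrs(iters, 2, 2, false))
	fmt.Println("hoisted:       ", loopInstrs(iters, 2, 2, true))
}
```

For 1000 iterations this gives 4000 instructions when the base is rematerialized and 2002 when it is hoisted, so the address-related work in the loop is roughly halved.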

Question

Is this a known limitation? Are there plans to improve the cost model for RISC-V's limited immediate range, or to enhance LICM (Loop Invariant Code Motion) for architecture-specific multi-instruction sequences?

I'm happy to provide more detailed profiles or assist with testing potential fixes.

Contributor guide

Research direction: Investigate the RISC-V cost model in the Go compiler's SSA backend, specifically how address computations for large stack offsets are rematerialized. Propose enabling loop-invariant code motion (LICM) for multi-instruction address calculations, or adjusting the rematerialization cost threshold for RISC-V. Analyze existing LICM passes and test with a minimal reproduction in the Go repository.
Tech stack: go
Domain: backend
Issue type: Bug
Difficulty: 3
Estimated time: 3-5 days
Activity status: Active
Clarity: Clear
Prerequisites: Go compiler basics
Newbie friendliness: 70