llvm/llvm-project

[AVX-512] `splat(u64 & 255) -| u64s_that_fit_in_bytes` should lower to `vpsubusb`

Open

#195462 opened on May 2, 2026

Labels: backend:X86, good first issue, llvm:SelectionDAG

Description

Zig Godbolt LLVM Godbolt

define internal <8 x i64> @example.foo(i64 %0) unnamed_addr align 1 {
Entry:
  %1 = and i64 %0, 255
  %2 = insertelement <1 x i64> poison, i64 %1, i64 0
  %3 = shufflevector <1 x i64> %2, <1 x i64> poison, <8 x i32> zeroinitializer
  %4 = tail call <8 x i64> @llvm.usub.sat.v8i64(<8 x i64> %3, <8 x i64> <i64 0, i64 64, i64 128, i64 192, i64 0, i64 64, i64 128, i64 192>)
  ret <8 x i64> %4
}

declare <8 x i64> @llvm.usub.sat.v8i64(<8 x i64>, <8 x i64>)

Currently becomes:

.LCPI1_0:
        .quad   0
        .quad   64
        .quad   128
        .quad   192
example.foo:
        vbroadcasti64x4 zmm1, ymmword ptr [rip + .LCPI1_0]
        movzx   eax, dil
        vpbroadcastq    zmm0, rax
        vpmaxuq zmm0, zmm0, zmm1
        vpsubq  zmm0, zmm0, zmm1
        ret
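
For context, the vpmaxuq / vpsubq pair above computes the unsigned saturating subtract as max(a, b) - b on each 64-bit lane. A minimal scalar sketch of that identity in Python (the function names are hypothetical, chosen only for illustration):

```python
MASK64 = (1 << 64) - 1  # model 64-bit unsigned lanes


def usub_sat_u64(a, b):
    """Reference semantics of llvm.usub.sat on one u64 lane."""
    return a - b if a > b else 0


def max_then_sub_u64(a, b):
    """What vpmaxuq followed by vpsubq computes on one u64 lane."""
    return (max(a, b) - b) & MASK64
```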

Should be:

.LCPI0_0:
        .byte   0
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   64
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   128
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   192
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   0
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   64
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   128
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   192
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
example.foo:
        vpbroadcastq    zmm0, rdi
        vpsubusb        zmm0, zmm0, zmmword ptr [rip + .LCPI0_0]
        ret

This is kinda two optimizations for the price of one.

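  1. We can lower llvm.usub.sat.v8i64 to llvm.usub.sat.v64i8 because both operands are u64s whose upper 7 bytes are zero: the low byte of each lane saturates exactly as the full 64-bit value would, and each upper byte computes 0 -| 0 = 0.
  2. We can then fold the u64 & 255 into llvm.usub.sat.v64i8 by placing 255s in the upper 7 bytes of each u64 lane of the constant. Any byte -| 255 is 0, so the upper bytes of the result are cleared without a separate mask.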
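
The two steps above can be checked with a small scalar model. This is only a sketch in Python, not LLVM code: the helper names (`usub_sat`, `lanes_u64`, `bytewise_mask`, `lanes_via_bytes`) are made up for illustration, and the model assumes little-endian byte order within each lane, as on x86.

```python
MASK64 = (1 << 64) - 1
CONSTS = [0, 64, 128, 192, 0, 64, 128, 192]  # the <8 x i64> operand from the IR


def usub_sat(a, b):
    """Unsigned saturating subtract on one element (any width)."""
    return a - b if a > b else 0


def lanes_u64(x, consts):
    """Original IR: splat(x & 255) -| consts, one result per u64 lane."""
    return [usub_sat(x & 255, c) for c in consts]


def bytewise_mask(consts):
    """Byte operand for vpsubusb: constant in the low byte, 255 in the upper 7."""
    out = []
    for c in consts:
        assert 0 <= c <= 255, "constant must fit in a byte"
        out += [c] + [255] * 7
    return out


def lanes_via_bytes(x, consts):
    """Proposed lowering: splat x (no & 255), then bytewise saturating subtract."""
    x_bytes = list((x & MASK64).to_bytes(8, "little")) * len(consts)
    res = [usub_sat(a, b) for a, b in zip(x_bytes, bytewise_mask(consts))]
    # Reassemble each group of 8 result bytes into a u64 lane.
    return [int.from_bytes(bytes(res[i:i + 8]), "little")
            for i in range(0, len(res), 8)]
```

Comparing `lanes_via_bytes(x, CONSTS)` against `lanes_u64(x, CONSTS)` for inputs with and without high bits set exercises both steps at once: the low byte carries the real subtraction, and the 255s zero everything above it.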
