llvm/llvm-project
[AVX-512] `splat(u64 & 255) -| u64s_that_fit_in_bytes` should lower to `vpsubusb`
Open
#195462 opened on May 2, 2026
backend:X86, good first issue, llvm:SelectionDAG
Description
define internal <8 x i64> @example.foo(i64 %0) unnamed_addr align 1 {
Entry:
%1 = and i64 %0, 255
%2 = insertelement <1 x i64> poison, i64 %1, i64 0
%3 = shufflevector <1 x i64> %2, <1 x i64> poison, <8 x i32> zeroinitializer
%4 = tail call <8 x i64> @llvm.usub.sat.v8i64(<8 x i64> %3, <8 x i64> <i64 0, i64 64, i64 128, i64 192, i64 0, i64 64, i64 128, i64 192>)
ret <8 x i64> %4
}
declare <8 x i64> @llvm.usub.sat.v8i64(<8 x i64>, <8 x i64>)
Currently becomes:
.LCPI1_0:
.quad 0
.quad 64
.quad 128
.quad 192
example.foo:
vbroadcasti64x4 zmm1, ymmword ptr [rip + .LCPI1_0]
movzx eax, dil
vpbroadcastq zmm0, rax
vpmaxuq zmm0, zmm0, zmm1
vpsubq zmm0, zmm0, zmm1
ret
Should be:
.LCPI0_0:
.byte 0
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 64
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 128
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 192
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 0
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 64
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 128
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 192
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
.byte 255
example.foo:
vpbroadcastq zmm0, rdi
vpsubusb zmm0, zmm0, zmmword ptr [rip + .LCPI0_0]
ret
This is kinda two optimizations for the price of one.
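- We can lower `llvm.usub.sat.v8i64` to `llvm.usub.sat.v64i8` because we are operating on u64's that definitely fit within a byte, so only the low byte of each lane can be nonzero on either side.
- We can then fold the `u64 & 255` into `llvm.usub.sat.v64i8` by placing 255's in the upper 7 bytes of each u64 lane. Any byte minus 255 saturates to 0, so the upper bytes of the input are cleared without a separate mask.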
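
To sanity-check the two steps above, here is a minimal scalar sketch in Python that models one 64-bit lane. It is not LLVM code; the helper names (`usub_sat_u64`, `usub_sat_bytes`, `widen_constant`, `check`) are made up for illustration, and the byte layout assumes little-endian lanes as on x86.

```python
MASK64 = (1 << 64) - 1

def usub_sat_u64(a, b):
    """Scalar model of llvm.usub.sat.i64: a - b, clamped at 0."""
    return a - b if a > b else 0

def usub_sat_bytes(a, b):
    """Model of one 64-bit lane of vpsubusb: eight independent u8 saturating subtracts."""
    out = 0
    for i in range(8):
        x = (a >> (8 * i)) & 0xFF
        y = (b >> (8 * i)) & 0xFF
        out |= max(x - y, 0) << (8 * i)
    return out

def widen_constant(c):
    """Keep the low byte and set the upper seven bytes to 255 (the .LCPI0_0 layout)."""
    assert 0 <= c <= 0xFF
    return c | 0xFFFFFFFFFFFFFF00

# The per-lane constants from the issue's IR.
CONSTANTS = [0, 64, 128, 192]

def check(x):
    """True if the byte-wise form matches usub.sat(x & 255, c) for every constant."""
    x &= MASK64
    return all(
        usub_sat_u64(x & 255, c) == usub_sat_bytes(x, widen_constant(c))
        for c in CONSTANTS
    )
```

Note that `check` feeds the unmasked `x` to the byte-wise form, mirroring the proposed `vpbroadcastq zmm0, rdi` without the `movzx`. The 255 bytes in the constant do the masking.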