[AVX-512] `splat(u64 & 255) -| u64s_that_fit_in_bytes` should lower to `vpsubusb` · llvm/llvm-project#195462

2026-05-02T15:29:31.000Z

[Zig Godbolt](https://zig.godbo.lt/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:16,fontUsePx:'0',j:1,lang:zig,selection:(endColumn:24,endLineNumber:5,positionColumn:9,positionLineNumber:5,selectionStartColumn:24,selectionStartLineNumber:5,startColumn:9,startLineNumber:5),source:'const+std+%3D+@import(%22std%22)%3B%0A%0Aexport+fn+foo(len:+u64)+@Vector(8,+u64)+%7B%0A++++return+@as(@Vector(8,+u64),+@splat(len+%26+0xFF))+-%7C%0A++++++++@Vector(8,+u64)%7B+0,+64,+128,+192,+0,+64,+128,+192+%7D%3B%0A%7D%0A%0Aexport+fn+bar(len:+u64)+@Vector(8,+u64)+%7B%0A++++const+C+%3D+~@as(u64,+0)+%3C%3C+8%3B%0A++++return+@bitCast(@as(@Vector(64,+u8),+@bitCast(@as(@Vector(8,+u64),+@splat(len))))+-%7C%0A++++++++@as(@Vector(64,+u8),+@bitCast(@as(@Vector(8,+u64),+@splat(C))+%7C+@Vector(8,+u64)%7B+0,+64,+128,+192,+0,+64,+128,+192+%7D)))%3B%0A%7D'),l:'5',n:'0',o:'Zig+source+%231',t:'0')),k:51.88636954553198,l:'4',m:100,n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:ztrunk,filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',debugCalls:'1',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1',verboseDemangling:'0'),flagsViewOpen:'1',fontScale:16,fontUsePx:'0',j:2,lang:zig,libs:!(),options:'-O+ReleaseFast+-target+x86_64-linux+-mcpu%3Dznver5+-fomit-frame-pointer',overrides:!(),selection:(endColumn:17,endLineNumber:68,positionColumn:17,positionLineNumber:68,selectionStartColumn:9,selectionStartLineNumber:68,startColumn:9,startLineNumber:68),source:1),l:'5',n:'0',o:'+zig+trunk+(Editor+%231)',t:'0')),header:(),k:48.11363045446803,l:'4',m:50,n:'0',o:'',s:0,t:'0'),(g:!((h:ir,i:('-fno-discard-value-names':'0',compilerName:'zig+trunk',demangle-symbols:'0',editorid:1,filter-attributes:'0',filter-comments:'0',filter-debug-info:'0',filter-declarations:'1',filter-instruction-metadata:'0',filter-library-functions:'1',fontScale:14,fontUsePx:'0',j:2,selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLine
Number:1,startColumn:1,startLineNumber:1),show-optimized:'0',treeid:0,wrap:'1'),l:'5',n:'0',o:'LLVM+IR+Viewer+zig+trunk+(Editor+%231,+Compiler+%232)',t:'0')),header:(),l:'4',m:50,n:'0',o:'',s:0,t:'0')),k:48.11363045446803,l:'3',n:'0',o:'',t:'0')),l:'2',m:100,n:'0',o:'',t:'0')),version:4) [LLVM Godbolt](https://llvm.godbo.lt/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1C1aANxakl9ZATwDKjdAGFUtAK4sGISQHZSjgBk8BkwAOQ8AI0xiEAAmUgAHVAVCOwYXd09vPySU2wEgkPCWKJj4q0wbNKECJmICDI8vX0tMa3yGGrqCQrDI6LjLWvrGrJaFYZ7gvpKB2IBKS1Q3YmR2DgABflQAagBSAGYAER2xPCYFfYOnAA4d1R28ADZJQ%2BwdiGfJRZ2EgmIdhtMKpWAl6AA6bZ7DQAQQ2ETqVxOZwuV1u90eLzeHy%2BPz%2BAKBIJYYMw4IRxGhMMpWBoIUegmizFoaLuDy%2B2MJoIh20%2BL32sQArBp5js3AxmGx0AB9JjodAAs7ABg7Lj7HwAIUp2EExAAniBKTt%2BQLVYdkQx0JjJMaNKQdoKBYbjbEkfSlPU2pg2II0aq2ViDu8kngFAI7V9jVxw3yNE69oKDq6FAg3FQqPRTJUiADDk4/VbsfGBfFfRj2YHfqgQ2GWWWDrFsQAvaJVhipM7Nimwo1F61mna1Og7UQmWv%2B14VjYmczgtwKNwRcEKJgEcGmG5fCC51kFitFg527dlgPvXMR21WnYvaPWrixG43lUAThL58f18vd4fn5fb3mTuITACDHXdT0FV5YT2HwjkpalKgMQCQPLd4pzMFhZ3nRdl1XddNyPcc3kPa4d2QkV4wOLhYNhGlpnpAhGTEJCT0BYEuVJcleT7QVhVFcVWEwaVZXlU5aDwJUVTVTVYW1f59TjQVTWON1ono%2BhvWA3N82QytqwYR8iwvc95OLJMUzTDMsxIUsCL3BSiLzY8JyDKtQz0pD6ybFtgnbUTO2MxN%2BwiQhRAmJinOdAddlzPk2RuN5jL7JTB2ZEdmWi61Yo5ad0LnBclxXNcXjwG4t2uGLHji2yBQPNFyuKwtrmKnZDLuABaKMKp2drwzajqmu6zqBv6vrep6q9pEGkauqmobRsmsbZumhav2W1b5vW4a1s2waP22va5v2paNoOu4WqOw7FsumbrrG3aTvO%2B6rq2x6VuOh73qet7Lruj6bq%2Bv7hr/YyBVdIKCBCjSyoyirC3AyKwvi7sdkAyH0RssCBUdSDoKomEsGQBDMFq6H6snbKMLy7DCskYrSqcOrKuweyGb/fkKI4RZaE4AVeC8DgtFIVBOACAIADUAFkdgASQAJR2UMVjWNmeFIAhNA5xYAGtvC4cFJA0J8NA0LgnwFJ4n18Lgnn0ThJF59XBc4XgFBAW01f5jnSDgWAkDQYk6GichKD9hIA5iEwbGIMVNb4Oh6OIF2IAiB2guYPVOBV1O6l1AB5CJtCzDPeD99Sc4YWh9Q90gsAiNxgCcMRaBd7heCwFhDGAcQq/wQCqkzZuBeBSo3HoovyAZLmq9EiJiGzlwsAd/48BYMfM2ICJkkwI4vQ70SjHVxZ0yYYAFFFvBMAAdxzhJGDH/hBBEMR2CkGRBEUFR1Cr3QowMffsv0PAEQXaQEWKgP4aRm6tRzomVqLBkAJDcGaRsDA1480FmvYgeAsDAIgIsCoVR7AQEcKMLwUZAjTGKKUPQuRUgCBIdQ5ItCGC9EoQMKM%2BC
OhdBGK4JoegOHVEmCw/oMR2GTHoaI7oQjZgiLwcsVYz9Obc3tlXIWHAdgRw%2BP8aOIoIC4EIFZciXB5i8HdloeYWsQAChuOCAUhsnxPhuBoSQBwNDFhuAcAUNsOB21ICvLgRtSB8wFqo52rtVYH0URwWIyjglO3CR7cxpA14pHsJIIAA%3D%3D)

9 comments · 1 reaction · 1 assignee · C++ · 10,782 forks

Labels: backend:X86, good first issue, llvm:SelectionDAG

Repository metrics

Stars: 26,378
PR merge metrics: average merge time 1d 2h; 1,000 merged PRs in 30 days

Description

Zig Godbolt LLVM Godbolt

define internal <8 x i64> @example.foo(i64 %0) unnamed_addr align 1 {
Entry:
  %1 = and i64 %0, 255
  %2 = insertelement <1 x i64> poison, i64 %1, i64 0
  %3 = shufflevector <1 x i64> %2, <1 x i64> poison, <8 x i32> zeroinitializer
  %4 = tail call <8 x i64> @llvm.usub.sat.v8i64(<8 x i64> %3, <8 x i64> <i64 0, i64 64, i64 128, i64 192, i64 0, i64 64, i64 128, i64 192>)
  ret <8 x i64> %4
}

declare <8 x i64> @llvm.usub.sat.v8i64(<8 x i64>, <8 x i64>) #1

Currently becomes:

.LCPI1_0:
        .quad   0
        .quad   64
        .quad   128
        .quad   192
example.foo:
        vbroadcasti64x4 zmm1, ymmword ptr [rip + .LCPI1_0]
        movzx   eax, dil
        vpbroadcastq    zmm0, rax
        vpmaxuq zmm0, zmm0, zmm1
        vpsubq  zmm0, zmm0, zmm1
        ret
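
For context, the `vpmaxuq` + `vpsubq` pair above is the usual way to express an unsigned saturating subtract when the target has no native instruction for the element width (x86 only provides saturating subtracts for byte and word elements): it computes `max(a, b) - b`. The following is a minimal Python model of one 64-bit lane, meant only as a sketch. The helper names are hypothetical and do not come from LLVM.

```python
MASK64 = (1 << 64) - 1  # all-ones u64, used to keep results in 64 bits


def usub_sat_u64(a: int, b: int) -> int:
    """Reference semantics of llvm.usub.sat on one u64 lane: a - b, clamped at 0."""
    return a - b if a > b else 0


def usub_sat_via_max(a: int, b: int) -> int:
    """What the vpmaxuq / vpsubq pair computes per lane: max(a, b) - b."""
    return (max(a, b) - b) & MASK64


if __name__ == "__main__":
    # Lane values as they appear in the issue: a masked length against each constant.
    for c in (0, 64, 128, 192):
        print(c, usub_sat_u64(200, c), usub_sat_via_max(200, c))
```

The two functions agree for every pair of u64 inputs, which is why the existing lowering is correct; the issue is only that it costs a mask, a max, and a subtract where a single `vpsubusb` would do.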

Should be:

.LCPI0_0:
        .byte   0
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   64
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   128
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   192
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   0
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   64
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   128
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   192
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
        .byte   255
example.foo:
        vpbroadcastq    zmm0, rdi
        vpsubusb        zmm0, zmm0, zmmword ptr [rip + .LCPI0_0]
        ret
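
Each 8-byte group in the `.LCPI0_0` table above is one of the original `.quad` constants in little-endian order, with its upper seven bytes set to 255. As a rough sketch of how such a table could be derived, the Python below rebuilds it from the eight u64 constants in the IR. `build_byte_constant` is a hypothetical helper for illustration, not an LLVM API, and it assumes the rewrite only applies when every constant fits in a byte.

```python
def build_byte_constant(lanes):
    """Return the 64-byte operand for the byte-wise subtract, or None if it does not apply.

    Assumption: the rewrite is only attempted when every u64 constant fits in a byte.
    """
    if any(c < 0 or c > 0xFF for c in lanes):
        return None
    table = bytearray()
    for c in lanes:
        # Low byte keeps the constant; the upper 7 bytes become 0xFF.
        table += (0xFFFFFFFFFFFFFF00 | c).to_bytes(8, "little")
    return bytes(table)


# The eight lane constants from the llvm.usub.sat.v8i64 call in the issue.
LANES = [0, 64, 128, 192, 0, 64, 128, 192]
TABLE = build_byte_constant(LANES)

if __name__ == "__main__":
    for offset in range(0, len(TABLE), 8):
        print(list(TABLE[offset:offset + 8]))
```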

This is kinda two optimizations for the price of one.

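1. We can lower `llvm.usub.sat.v8i64` to `llvm.usub.sat.v64i8` because both operands are u64s that definitely fit within a byte: the left side is masked with 255, and the constants on the right are all at most 192. The upper 7 bytes of every lane are therefore zero on both sides, so a byte-wise saturating subtract produces the same bits.
2. We can then fold the `u64 & 255` into `llvm.usub.sat.v64i8` by placing 255s in the upper 7 bytes of each u64 lane of the constant. Any byte saturating-subtracted by 255 is 0, so the upper bytes of the unmasked input are cleared without a separate `and`.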
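
To sanity-check that the two steps together preserve the result, here is a small Python model of a single 64-bit lane. It is a sketch under stated assumptions: `psubusb_lane` models `vpsubusb` on the 8 bytes of one lane, and `widen_constant`, `original` and `rewritten` are hypothetical names introduced for this example.

```python
MASK64 = (1 << 64) - 1


def psubusb_lane(a: int, b: int) -> int:
    """Byte-wise unsigned saturating subtract over the 8 bytes of one u64 lane."""
    out = 0
    for i in range(8):
        x = (a >> (8 * i)) & 0xFF
        y = (b >> (8 * i)) & 0xFF
        out |= max(x - y, 0) << (8 * i)
    return out


def widen_constant(c: int) -> int:
    """Keep c in the low byte and put 0xFF in the upper 7 bytes (c must fit in a byte)."""
    assert 0 <= c <= 0xFF
    return 0xFFFFFFFFFFFFFF00 | c


def original(length: int, c: int) -> int:
    """splat(len & 255) -| c on one u64 lane, as in the input IR."""
    a = length & 0xFF
    return a - c if a > c else 0


def rewritten(length: int, c: int) -> int:
    """The proposed form: no mask, byte-wise subtract against the widened constant."""
    return psubusb_lane(length & MASK64, widen_constant(c))


if __name__ == "__main__":
    sample = 0xDEADBEEF_000000C8  # arbitrary upper bits, low byte 200
    print(hex(original(sample, 64)), hex(rewritten(sample, 64)))
```

The low byte sees the same saturating subtract as before, and every upper byte is subtracted by 255 and so saturates to 0, which is exactly what the `and` with 255 guaranteed in the original form.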

Contributor guide

Research direction: Implement two optimizations: 1) Lower `llvm.usub.sat.v8i64` to `llvm.usub.sat.v64i8` when the u64 operands are known to fit within a byte. 2) Fold the `u64 & 255` mask into the saturated subtraction by placing 255 in the upper bytes of each u64 lane. Study the existing lowering of `usub.sat` in LLVM's SelectionDAG or GlobalISel, and the pattern matching for vector operations. The goal is to generate a `vpsubusb` instruction instead of the current sequence.
Tech stack: cpp
Domain: performance
Issue type: Performance
Difficulty: 4
Estimated time: Over 1 week
Activity status: Active
Clarity: Clear
Prerequisites: C++, LLVM, x86 assembly
Newbie friendliness: 25
