[X86] Missed Optimization: Vector 8-bit `rotl(x, 1)` should be lowered as `(x + x) - (x < 0)` · llvm/llvm-project#198059

8 comments · 0 reactions · 1 assignee · C++ · 26,378 stars · 10,782 forks

Labels: backend:X86, good first issue, missed-optimization

Description

Because x86 has no vector shift instructions for 8-bit elements, most 8-bit vector shifts are lowered to a 16-bit shift plus an AND mask. For `rotl(x, 1)` this currently produces:

rotl1_src:
        movdqa  xmm1, xmm0
        paddb   xmm1, xmm0
        psrlw   xmm0, 7
        pand    xmm0, xmmword ptr [rip + .LCPI2_0]
        por     xmm0, xmm1
        ret

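The right shift, AND, and OR can be replaced by subtracting a less-than-zero mask. `pcmpgtb` against zero yields -1 (0xFF) in every lane whose top bit is set, so the subtraction adds 1 to exactly those lanes. Because `x + x` always has a zero low bit, that add never carries and is equivalent to the OR (a disjoint add). This shortens the dependency chain and avoids the shift, which has worse throughput on some architectures.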

rotl1_tgt:
        pxor    xmm1, xmm1
        pcmpgtb xmm1, xmm0
        paddb   xmm0, xmm0
        psubb   xmm0, xmm1
        ret

https://godbolt.org/z/199KoWhs8
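To make the equivalence easier to check, here is a minimal scalar sketch in C++ that models one byte lane of each sequence and compares both against a plain rotate for all 256 inputs. The function names (`rotl1_ref`, `rotl1_src_lane`, `rotl1_tgt_lane`, `all_bytes_match`) are hypothetical and used only for illustration. The sketch also assumes the `.LCPI2_0` constant is a per-byte mask of 0x01; the assembly above does not show its value.

```cpp
#include <cassert>
#include <cstdint>

// Reference: rotate an 8-bit value left by one bit.
constexpr std::uint8_t rotl1_ref(std::uint8_t x) {
    return static_cast<std::uint8_t>((x << 1) | (x >> 7));
}

// One lane of rotl1_src. The 16-bit psrlw by 7 followed by pand is modeled
// as a per-byte logical shift right by 7 (assumes the mask is 0x01 per byte).
constexpr std::uint8_t rotl1_src_lane(std::uint8_t x) {
    const std::uint8_t doubled = static_cast<std::uint8_t>(x + x);  // paddb
    const std::uint8_t top_bit = static_cast<std::uint8_t>(x >> 7); // psrlw + pand
    return static_cast<std::uint8_t>(doubled | top_bit);            // por
}

// One lane of rotl1_tgt: (x + x) - (x < 0 ? -1 : 0).
constexpr std::uint8_t rotl1_tgt_lane(std::uint8_t x) {
    // pcmpgtb(0, x): 0xFF when x is negative as a signed byte, i.e. when
    // its top bit is set; 0x00 otherwise.
    const std::uint8_t mask = (x & 0x80) ? 0xFF : 0x00;
    const std::uint8_t doubled = static_cast<std::uint8_t>(x + x);  // paddb
    // psubb: subtracting 0xFF (-1) adds 1 modulo 256. The low bit of
    // `doubled` is always 0, so this never carries into higher bits.
    return static_cast<std::uint8_t>(doubled - mask);
}

// Exhaustively compare both lane models against the reference rotate.
inline bool all_bytes_match() {
    for (int i = 0; i < 256; ++i) {
        const std::uint8_t x = static_cast<std::uint8_t>(i);
        if (rotl1_src_lane(x) != rotl1_ref(x)) return false;
        if (rotl1_tgt_lane(x) != rotl1_ref(x)) return false;
    }
    return true;
}
```

Calling `all_bytes_match()` from a small test driver covers every byte value, which is enough to establish the per-lane identity. It says nothing about instruction selection or performance, which still need to be checked in the backend.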

Contributor guide

Tech stack: cpp
Domain: backend, performance
Issue type: performance
Difficulty: 5
Estimated time: over 1 week
Activity status: blocked
Clarity: clear
Prerequisites: LLVM internals, X86 assembly, DAG legalization, pattern matching
Newbie friendliness: 15
Research direction: Investigate how the X86 target lowers 8-bit vector rotate-left in LLVM's X86ISelLowering.cpp or X86InstrInfo.td. The current lowering emits an add + shift + and + or sequence, but a shorter sequence using pcmpgtb and paddb/psubb exists. Look for existing rotate handling in the X86 DAG combiner first, and only then consider whether a new SDNode is needed. The godbolt link (https://godbolt.org/z/199KoWhs8) shows the target assembly to aim for. Check the comments for any ongoing patches or discussions.