vllm-project/vllm-ascend

[RFC]: support sequence parallelism by pass

Open

#5712 opened on Jan 8, 2026

Labels: RFC, help wanted

Description

Motivation.

Flash Comm V1 (FC1) is a feature similar to sequence parallelism. In vllm-ascend, FC1 is implemented as a custom op. However, it does not yet support VL (vision-language) models. Extending FC1 to VL models raises two problems:

1: The VL model lacks a reduce-scatter operation at the embedding layer, which results in a redundant all-gather during the first step.

2: In Qwen3-VL, deepstack_input_embeds is added after the computation at each layer, but the shapes do not match. A chunk operation must be added before the layernorm.
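
The shape mismatch in the second problem can be illustrated with a small, framework-free sketch. It assumes that the hidden states are already sharded along the sequence dimension (as after a reduce-scatter) while deepstack_input_embeds still covers the full sequence. Everything except the name deepstack_input_embeds (the sizes, `chunk`, `add`, `run_all_ranks`) is hypothetical and only models the idea with plain Python lists; it is not the vllm-ascend implementation.

```python
# Toy model of sequence-parallel sharding using plain Python lists.
# Assumption: hidden states are sharded by token, embeddings are not.

TP_SIZE = 2   # assumed tensor-parallel world size
SEQ_LEN = 4   # assumed number of tokens (divisible by TP_SIZE)
HIDDEN = 3    # assumed hidden size


def chunk(rows, tp_size, rank):
    """Return the contiguous block of token rows owned by `rank`."""
    per_rank = len(rows) // tp_size
    return rows[rank * per_rank:(rank + 1) * per_rank]


def add(a, b):
    """Element-wise add of two [tokens][hidden] matrices of equal shape."""
    if len(a) != len(b):
        raise ValueError(f"shape mismatch: {len(a)} rows vs {len(b)} rows")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def run_all_ranks():
    # Full-sequence tensors, as seen without sequence parallelism.
    hidden_states = [[float(t)] * HIDDEN for t in range(SEQ_LEN)]
    deepstack_input_embeds = [[10.0] * HIDDEN for _ in range(SEQ_LEN)]

    gathered = []
    for rank in range(TP_SIZE):
        # Each rank holds only SEQ_LEN / TP_SIZE tokens of the hidden states.
        local_hidden = chunk(hidden_states, TP_SIZE, rank)

        # Adding the unsharded embeddings fails: local rows != full rows.
        try:
            add(local_hidden, deepstack_input_embeds)
            raise AssertionError("expected a shape mismatch")
        except ValueError:
            pass

        # Chunking the embeddings the same way makes the shapes agree.
        local_embeds = chunk(deepstack_input_embeds, TP_SIZE, rank)
        gathered.extend(add(local_hidden, local_embeds))

    # Concatenating the per-rank results models the later all-gather.
    return gathered, add(hidden_states, deepstack_input_embeds)


sharded_result, reference_result = run_all_ranks()
```

In this sketch, concatenating the per-rank results gives the same values as the unsharded addition, which is the property a chunk before the layernorm is meant to preserve.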

Proposed Change.

Implement sequence parallelism via a graph compilation pass:
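
The RFC does not spell out the pass itself. As a rough illustration of what a rewrite of this kind generally does, the sketch below runs over a toy list of op names and replaces an all-reduce followed by a per-token normalization with reduce-scatter, normalization, and all-gather. This relies on two general facts: an all-reduce is equivalent to a reduce-scatter followed by an all-gather, and a per-token norm can run on a sequence shard. The op names, the list-based graph, and `sequence_parallel_pass` are all hypothetical and do not correspond to any vllm or vllm-ascend API.

```python
# Toy graph rewrite: the "graph" is just an ordered list of op names.
# All op names here are illustrative placeholders, not real kernels.

def sequence_parallel_pass(ops):
    """Rewrite `all_reduce, rms_norm` into
    `reduce_scatter, rms_norm, all_gather`, leaving other ops untouched."""
    rewritten = []
    i = 0
    while i < len(ops):
        is_pattern = (
            ops[i] == "all_reduce"
            and i + 1 < len(ops)
            and ops[i + 1] == "rms_norm"
        )
        if is_pattern:
            # The norm now runs on 1/tp_size of the tokens on each rank.
            rewritten.extend(["reduce_scatter", "rms_norm", "all_gather"])
            i += 2  # skip both matched ops
        else:
            rewritten.append(ops[i])
            i += 1
    return rewritten


graph = ["row_parallel_linear", "all_reduce", "rms_norm", "column_parallel_linear"]
rewritten_graph = sequence_parallel_pass(graph)
print(rewritten_graph)
```

Doing the rewrite as a pass over the graph, rather than inside each model's forward code, is what would let the same transformation apply to models that a hand-written custom op does not cover.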

Feedback Period.

No response

CC List.

@wxsIcey @ApsarasX

Any Other Things.

No response
