vllm-project/vllm-ascend
[RFC]: support sequence parallelism by pass
Open
#5712 opened on Jan 8, 2026
RFC, help wanted
Description
Motivation.
Flash Comm V1 (FC1) is a feature similar to sequence parallelism. In vllm-ascend it is implemented as a custom op, and it is not supported for VL models. Extending FC1 to VL models runs into two problems:
1. VL models lack a reduce-scatter operation at the embedding layer, which results in a redundant all-gather during the first step.
2. In Qwen3-VL, deepstack_input_embeds is added after the computation at each layer, but its shape does not match. A chunk must be added before the layernorm.
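To make the second problem concrete, here is a minimal single-process sketch in plain Python. It is an illustration under stated assumptions, not the vllm-ascend implementation: the "ranks" are list entries, the sizes `TP`, `NUM_TOKENS` and `HIDDEN` are made up, and it assumes the shape mismatch arises because the hidden states are sharded along the token dimension while the deepstack embeddings still cover all tokens.

```python
# Toy simulation: each "rank" is one entry in a Python list (no real devices).
TP = 2          # hypothetical tensor-parallel world size
NUM_TOKENS = 4  # must be divisible by TP in this sketch
HIDDEN = 3      # hypothetical hidden size


def all_reduce(partials):
    """Every rank receives the element-wise sum over all ranks."""
    total = [[sum(p[t][h] for p in partials) for h in range(HIDDEN)]
             for t in range(NUM_TOKENS)]
    return [total for _ in partials]


def reduce_scatter(partials):
    """Rank r receives only its token chunk of the summed tensor."""
    total = all_reduce(partials)[0]
    chunk = NUM_TOKENS // TP
    return [total[r * chunk:(r + 1) * chunk] for r in range(TP)]


def all_gather(shards):
    """Every rank receives the concatenation of all token chunks."""
    full = [row for shard in shards for row in shard]
    return [full for _ in shards]


def chunk_tokens(tensor, rank):
    """Slice a full-length tensor down to the token chunk owned by `rank`."""
    chunk = NUM_TOKENS // TP
    return tensor[rank * chunk:(rank + 1) * chunk]


def add(a, b):
    """Element-wise add that fails loudly on a token-dimension mismatch."""
    assert len(a) == len(b), "shape mismatch along the token dimension"
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Partial outputs of a row-parallel linear layer, one per rank.
partials = [[[float(r + t + h) for h in range(HIDDEN)]
             for t in range(NUM_TOKENS)] for r in range(TP)]
# Stand-in for deepstack_input_embeds: full length, identical on every rank.
deepstack = [[10.0 * t + h for h in range(HIDDEN)] for t in range(NUM_TOKENS)]

# Baseline: all-reduce, then add the full-length embeddings.
baseline = [add(x, deepstack) for x in all_reduce(partials)]

# Sequence-parallel path: reduce-scatter, chunk the embeddings, add, all-gather.
shards = reduce_scatter(partials)
shards = [add(shards[r], chunk_tokens(deepstack, r)) for r in range(TP)]
sp_result = all_gather(shards)

assert sp_result == baseline
```

Without the `chunk_tokens` call, `add` would receive a shard of `NUM_TOKENS // TP` rows and an embedding tensor of `NUM_TOKENS` rows, which is the kind of mismatch the chunk is meant to resolve.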
Proposed Change.
Implement sequence parallelism via a compilation pass:
- Support VL Dense model https://github.com/vllm-project/vllm-ascend/pull/5632
- Support INT8 dynamic quant
- Support VL MoE models https://github.com/vllm-project/vllm-ascend/pull/7044
- Support LLM dense and MoE models
- Matmul reducescatter pass and allgather matmul pass
- Support model runner v2
- Compatible with FC2 and FC3
- Make FC1 only applicable in eager mode
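
As a rough illustration of what a pass-based approach means, the toy sketch below rewrites a linear list of op names. Everything in it is hypothetical: the op names, the two function names, and the list-of-strings "graph" are stand-ins chosen for brevity. A real pass would presumably match patterns in a compiled graph rather than in a list of strings.

```python
def sequence_parallel_pass(ops):
    """Rewrite `all_reduce -> rmsnorm` into
    `reduce_scatter -> rmsnorm -> all_gather` (toy pattern match)."""
    out = []
    i = 0
    while i < len(ops):
        if ops[i] == "all_reduce" and i + 1 < len(ops) and ops[i + 1] == "rmsnorm":
            out += ["reduce_scatter", "rmsnorm", "all_gather"]
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out


# Hypothetical fused-op names for adjacent communication/matmul pairs.
FUSIONS = {
    ("matmul", "reduce_scatter"): "matmul_reduce_scatter",
    ("all_gather", "matmul"): "all_gather_matmul",
}


def fuse_comm_matmul_pass(ops):
    """Greedily fuse adjacent pairs listed in FUSIONS, left to right."""
    out = []
    i = 0
    while i < len(ops):
        pair = tuple(ops[i:i + 2])
        if pair in FUSIONS:
            out.append(FUSIONS[pair])
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out


# One simplified block: row-parallel matmul, all-reduce, norm, next matmul.
graph = ["matmul", "all_reduce", "rmsnorm", "matmul"]
after_sp = sequence_parallel_pass(graph)
after_fusion = fuse_comm_matmul_pass(after_sp)

print(after_sp)      # ['matmul', 'reduce_scatter', 'rmsnorm', 'all_gather', 'matmul']
print(after_fusion)  # ['matmul_reduce_scatter', 'rmsnorm', 'all_gather_matmul']
```

The point of the sketch is the ordering: the sequence-parallel rewrite runs first and exposes `reduce_scatter` and `all_gather` ops next to matmuls, which a separate fusion pass can then combine.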
Feedback Period.
No response
CC List.
@wxsIcey @ApsarasX
Any Other Things.
No response