[RFC]: support sequence parallelism by pass
#5712 opened on Jan 8, 2026
Description
Motivation.
Flash Comm V1 (FC1) is a feature similar to sequence parallelism. In vllm-ascend, FC1 is implemented as a custom op, but it does not support VL models. Extending FC1 to VL models raises two problems:
1. The VL model lacks a reduce-scatter operation at the embedding layer, which results in a redundant all-gather during the first step.
2. In Qwen3-VL, deepstack_input_embeds is added after the computation at each layer, but the shapes do not match, so a chunk operation must be added before layernorm.
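To make the second problem concrete, here is a minimal sketch of the shape mismatch, using plain Python lists in place of tensors. It assumes the sequence (token) dimension is split evenly across ranks and that the deepstack embeddings arrive with the full sequence length. The helper names (`chunk`, `add`, `all_gather`) are hypothetical and are not vllm-ascend APIs.

```python
# Toy simulation with plain Python lists; no real collectives are used.
# All helper names are hypothetical, not vllm-ascend APIs.

def chunk(rows, world_size, rank):
    """Return this rank's contiguous slice along the sequence (token) dimension."""
    per_rank = len(rows) // world_size
    return rows[rank * per_rank:(rank + 1) * per_rank]

def add(a, b):
    """Elementwise add of two [tokens][hidden] nested lists with equal token counts."""
    assert len(a) == len(b), f"token dim mismatch: {len(a)} vs {len(b)}"
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def all_gather(shards):
    """Concatenate per-rank shards back along the sequence dimension."""
    return [row for shard in shards for row in shard]

world_size = 2
num_tokens, hidden = 4, 3  # assumes num_tokens is divisible by world_size

# Full-sequence hidden states and deepstack embeddings (what a non-SP run sees).
hidden_states = [[float(t * hidden + h) for h in range(hidden)] for t in range(num_tokens)]
deepstack = [[0.5] * hidden for _ in range(num_tokens)]

# Reference result without sequence parallelism.
expected = add(hidden_states, deepstack)

# With sequence parallelism each rank holds num_tokens / world_size rows,
# so the full-length deepstack embeddings must be chunked before the add.
local_outputs = []
for rank in range(world_size):
    local_hidden = chunk(hidden_states, world_size, rank)
    local_deepstack = chunk(deepstack, world_size, rank)
    local_outputs.append(add(local_hidden, local_deepstack))

result = all_gather(local_outputs)
assert result == expected
```

Adding the unchunked `deepstack` to a rank's local shard trips the shape assertion in `add`, which mirrors the mismatch described above. Real inputs whose token count is not divisible by the parallel size would need padding, which this sketch leaves out.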
Proposed Change.
Implement sequence parallelism as a pass:
- Support VL Dense model https://github.com/vllm-project/vllm-ascend/pull/5632
- Support INT8 dynamic quantization
- Support VL MoE models https://github.com/vllm-project/vllm-ascend/pull/7044
- Support LLM dense and MoE models
- Matmul reduce-scatter pass and all-gather matmul pass
- Support model runner v2
- Compatible with FC2 and FC3
- Make FC1 only applicable in eager mode
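As a rough illustration of what "by pass" could mean here, the sketch below rewrites a toy op sequence in two steps. The first step relies on the fact that an all-reduce is equivalent to a reduce-scatter followed by an all-gather, and that layernorm works per token, so it can run on the sequence shard between the two. The second step fuses the adjacent matmul and communication ops. The flat op list, the op names, and both function names are assumptions made for illustration only. A real pass would operate on a compiler graph, and this is not the implementation in the linked PRs.

```python
# Toy graph: a flat list of op names. Real passes rewrite a compiler graph IR;
# the list form and every op name below are illustrative assumptions.

def sequence_parallel_pass(ops):
    """Rewrite all_reduce -> layernorm into reduce_scatter -> layernorm -> all_gather."""
    out, i = [], 0
    while i < len(ops):
        if ops[i:i + 2] == ["all_reduce", "layernorm"]:
            out += ["reduce_scatter", "layernorm", "all_gather"]
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out

def fuse_comm_matmul_pass(ops):
    """Fuse matmul+reduce_scatter and all_gather+matmul pairs into single ops."""
    fusions = {
        ("matmul", "reduce_scatter"): "matmul_reduce_scatter",
        ("all_gather", "matmul"): "all_gather_matmul",
    }
    out, i = [], 0
    while i < len(ops):
        pair = tuple(ops[i:i + 2])
        if pair in fusions:
            out.append(fusions[pair])
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out

layer = ["matmul", "all_reduce", "layernorm", "matmul"]
after_sp = sequence_parallel_pass(layer)
after_fusion = fuse_comm_matmul_pass(after_sp)
print(after_sp)
print(after_fusion)
```

Keeping the two rewrites as separate passes means the sequence-parallel rewrite can be enabled without the fusion, which may help when a fused kernel is unavailable for a given dtype or quantization mode.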
Feedback Period.
No response
CC List.
@wxsIcey @ApsarasX
Any Other Things.
No response