RFC, help wanted
Description
Motivation.
Flash Comm V1 (FC1) is a feature similar to sequence parallelism. FC1 is implemented as a custom op in vllm-ascend, but it does not yet support VL models. When extending FC1 to VL models, we ran into two problems:
1. The VL model lacks an embedding-layer reduce-scatter operation, which results in a redundant all-gather during the first step.
2. In Qwen3-VL, deepstack_input_embeds is added after the computation at each layer, but its shape does not match. We must add a chunk operation before layernorm.
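The shape mismatch in the second problem can be illustrated with a small sketch in plain Python. It assumes that, under sequence parallelism, each rank holds a contiguous `num_tokens / tp_size` slice of the hidden states while the deepstack embeddings still cover all tokens. The helper names `chunk_for_rank` and `add_rows` are hypothetical and are not part of vllm-ascend; the sketch also assumes the token count divides evenly by the TP size, whereas a real implementation would likely need padding.

```python
def chunk_for_rank(rows, tp_size, rank):
    """Return the contiguous slice of token rows owned by `rank`.

    Hypothetical helper: assumes len(rows) is divisible by tp_size.
    """
    if len(rows) % tp_size != 0:
        raise ValueError("token count must be divisible by tp_size in this sketch")
    per_rank = len(rows) // tp_size
    return rows[rank * per_rank:(rank + 1) * per_rank]


def add_rows(a, b):
    """Elementwise add of two [tokens, hidden] matrices; shapes must match."""
    if len(a) != len(b):
        raise ValueError(f"shape mismatch: {len(a)} rows vs {len(b)} rows")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


tp_size = 2
# 8 tokens with hidden size 4; values are arbitrary illustration data.
hidden_full = [[float(t)] * 4 for t in range(8)]
deepstack_full = [[0.5] * 4 for _ in range(8)]

outputs = []
for rank in range(tp_size):
    # After reduce-scatter, each rank only holds its own token slice.
    local_hidden = chunk_for_rank(hidden_full, tp_size, rank)
    # Adding deepstack_full directly would fail (4 rows vs 8 rows),
    # so the embeddings are chunked the same way before the add.
    local_deepstack = chunk_for_rank(deepstack_full, tp_size, rank)
    outputs.append(add_rows(local_hidden, local_deepstack))
```

Chunking the embeddings with the same split as the hidden states keeps the per-layer add local to each rank, so no extra communication is needed for it.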
Proposed Change.
Implement sequence parallelism as a pass:
- Support VL Dense model https://github.com/vllm-project/vllm-ascend/pull/5632
- Support INT8 dynamic quantization
- Support VL MoE models https://github.com/vllm-project/vllm-ascend/pull/7044
- Support LLM dense and MoE models
- Add a matmul + reduce-scatter pass and an all-gather + matmul pass
- Support model runner v2
- Be compatible with FC2 and FC3
- Make FC1 only applicable in eager mode
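As background for the matmul + reduce-scatter and all-gather + matmul items, the following single-process sketch simulates the identity that sequence parallelism relies on: an all-reduce over the partial outputs of a row-parallel matmul gives the same result as a reduce-scatter followed by an all-gather. The collectives here are plain Python stand-ins written for illustration, not the vllm-ascend or HCCL APIs, and the tensor values are made up.

```python
def matmul(a, b):
    """Naive [m, k] x [k, n] matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def all_reduce(partials):
    """Elementwise sum of the per-rank matrices (every rank would get this)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def reduce_scatter(partials):
    """Sum across ranks, then hand rank r the r-th block of token rows."""
    total = all_reduce(partials)
    tp_size = len(partials)
    per_rank = len(total) // tp_size
    return [total[r * per_rank:(r + 1) * per_rank] for r in range(tp_size)]


def all_gather(shards):
    """Concatenate the per-rank token blocks back into the full matrix."""
    return [row for shard in shards for row in shard]


tp_size = 2
x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]  # 4 tokens
w = [[1, 0], [0, 1], [1, 1], [2, 0]]                                  # [4, 2]

# Row-parallel linear: rank r holds 2 input features of x and 2 rows of w.
x_shards = [[row[2 * r:2 * r + 2] for row in x] for r in range(tp_size)]
w_shards = [w[2 * r:2 * r + 2] for r in range(tp_size)]
partials = [matmul(x_shards[r], w_shards[r]) for r in range(tp_size)]

full = matmul(x, w)
scattered = reduce_scatter(partials)   # each rank keeps 2 of the 4 token rows
assert all_reduce(partials) == full
assert all_gather(scattered) == full
```

Between the reduce-scatter and the next all-gather, each rank only holds its own token rows, which is where per-token work such as layernorm can run on a fraction of the sequence. Fusing the collective with the adjacent matmul is an optimization on top of this identity.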
Feedback Period.
No response
CC List.
@wxsIcey @ApsarasX
Any Other Things.
No response