[RFC]: support sequence parallelism by pass · vllm-project/vllm-ascend#5712


Labels: RFC, help wanted


Description

Motivation.

Flash Comm V1 (FC1) is a feature similar to sequence parallelism. In vllm-ascend, FC1 is implemented as a custom op. However, it does not support VL models. Extending FC1 to VL models raises two problems:

1. The VL model lacks an embedding-layer reduce-scatter operation, which results in a redundant all-gather during the first step.
2. In Qwen3-VL, deepstack_input_embeds is added after the computation at each layer, but its shape does not match. We must add a chunk operation before the layernorm.
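To make the two problems concrete, here is a minimal sketch in plain Python that simulates tensor-parallel ranks with nested lists. It is not vllm-ascend code: the helper names (`all_reduce`, `reduce_scatter`, `all_gather`, `chunk_for_rank`, `add`) and the toy data are assumptions for illustration only. It shows that an all-reduce equals a reduce-scatter followed by an all-gather, and that a full-sequence tensor (a stand-in for deepstack_input_embeds) has to be chunked along the token dimension before it can be added to a rank's scattered hidden states.

```python
# Toy simulation of tensor-parallel collectives with plain Python lists.
# Each rank holds a partial result shaped [num_tokens][hidden].

def all_reduce(partials):
    """Every rank receives the element-wise sum of all partial results."""
    num_tokens, hidden = len(partials[0]), len(partials[0][0])
    total = [[sum(p[t][h] for p in partials) for h in range(hidden)]
             for t in range(num_tokens)]
    return [[row[:] for row in total] for _ in partials]

def reduce_scatter(partials):
    """Rank r receives only its token slice of the summed result."""
    world = len(partials)
    summed = all_reduce(partials)[0]
    chunk = len(summed) // world  # assumes num_tokens is divisible by world
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world)]

def all_gather(shards):
    """Every rank receives the concatenation of all token shards."""
    full = [row[:] for shard in shards for row in shard]
    return [[row[:] for row in full] for _ in shards]

def chunk_for_rank(full, rank, world):
    """Slice a full-sequence tensor down to one rank's tokens."""
    chunk = len(full) // world
    return full[rank * chunk:(rank + 1) * chunk]

def add(a, b):
    """Element-wise add that rejects mismatched token dimensions."""
    if len(a) != len(b):
        raise ValueError("token dimension mismatch")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

world = 2
partials = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],          # rank 0
    [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0], [70.0, 80.0]],  # rank 1
]

reduced = all_reduce(partials)
shards = reduce_scatter(partials)
regathered = all_gather(shards)
assert regathered == reduced  # reduce-scatter + all-gather == all-reduce

# Hypothetical stand-in for deepstack_input_embeds: covers the full sequence.
extra = [[0.5, 0.5] for _ in range(4)]

# Adding the full-sequence tensor to a 2-token shard would fail,
# so each rank first takes its own chunk of `extra`.
fixed = [add(shards[r], chunk_for_rank(extra, r, world)) for r in range(world)]
```

If the hidden states stay scattered between layers, the all-gather is needed only where a full sequence is actually consumed, which is why a missing reduce-scatter at the embedding layer forces an extra all-gather in the first step.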

Proposed Change.

Implement sequence parallelism via a compilation pass:

Support VL Dense model https://github.com/vllm-project/vllm-ascend/pull/5632
Support INT8 dynamic quantization
Support VL MoE models https://github.com/vllm-project/vllm-ascend/pull/7044
Support LLM dense and MoE models
Matmul reduce-scatter pass and all-gather matmul pass
Support model runner v2
Compatible with FC2 and FC3
Make FC1 applicable only in eager mode
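As a rough illustration of what a pass-based approach means, the sketch below rewrites a toy graph represented as a flat list of op names. It is an assumption-laden simplification, not the pass in the linked PRs: the op names, the list-based graph, and both functions (`sequence_parallel_pass`, `fuse_comm_matmul_pass`) are hypothetical. The first pass turns `all_reduce` followed by `rms_norm` into `reduce_scatter`, `rms_norm`, `all_gather`; the second fuses adjacent matmul and communication ops, in the spirit of the matmul reduce-scatter and all-gather matmul items above.

```python
def sequence_parallel_pass(ops):
    """Rewrite all_reduce -> rms_norm into reduce_scatter -> rms_norm -> all_gather."""
    out = []
    i = 0
    while i < len(ops):
        if ops[i] == "all_reduce" and i + 1 < len(ops) and ops[i + 1] == "rms_norm":
            # The norm is token-wise, so it can run on each rank's token shard.
            out += ["reduce_scatter", "rms_norm", "all_gather"]
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out

# Hypothetical fused op names, keyed by the adjacent pair they replace.
FUSIONS = {
    ("matmul", "reduce_scatter"): "matmul_reduce_scatter",
    ("all_gather", "matmul"): "all_gather_matmul",
}

def fuse_comm_matmul_pass(ops):
    """Fuse adjacent matmul/communication pairs into single ops."""
    out = []
    i = 0
    while i < len(ops):
        pair = tuple(ops[i:i + 2])
        if pair in FUSIONS:
            out.append(FUSIONS[pair])
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out

graph = ["matmul", "all_reduce", "rms_norm", "matmul", "all_reduce"]
rewritten = sequence_parallel_pass(graph)
fused = fuse_comm_matmul_pass(rewritten)
print(rewritten)
print(fused)
```

The trailing `all_reduce` is left untouched because no `rms_norm` follows it; a real pass would match on graph nodes and their users rather than on list adjacency.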

Feedback Period.

No response

CC List.

@wxsIcey @ApsarasX

Any Other Things.