vllm-project/vllm-ascend

[RFC]: support sequence parallelism by pass

Open

#5712 opened on Jan 8, 2026

RFC, help wanted

Description

Motivation.

Flash Comm V1 (FC1) is a feature similar to sequence parallelism, implemented with custom ops in vllm-ascend. However, it does not support VL models. Extending FC1 to VL models runs into two problems:

1. VL models lack a reduce-scatter after the embedding layer, which results in a redundant all-gather during the first step.
2. In Qwen3-VL, deepstack_input_embeds is added to the hidden states after each layer's computation, but its shape does not match that of the hidden states, which FC1 splits along the sequence dimension. A chunk operation must therefore be added before the layernorm.
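To make the two problems concrete, here is a minimal single-process sketch that simulates two tensor-parallel ranks with plain Python lists (one row per token). Everything in it is hypothetical: the helpers `chunk`, `reduce_scatter`, `all_gather` and `add` are illustrative stand-ins, not vllm-ascend or torch.distributed APIs, and the data is made up. It shows that a reduce-scatter right after the embedding leaves each rank holding only its sequence shard, and that a full-length deepstack_input_embeds must be chunked the same way before it can be added to that shard.

```python
# Hypothetical single-process simulation of sequence parallelism across
# WORLD_SIZE tensor-parallel ranks. Tensors are lists of rows (one row per
# token); no real collectives or vllm-ascend APIs are used.

WORLD_SIZE = 2


def chunk(rows, world_size):
    """Split rows along the sequence dim into equal contiguous chunks."""
    # Assumption: the sequence length is padded to a multiple of world_size.
    assert len(rows) % world_size == 0
    size = len(rows) // world_size
    return [rows[r * size:(r + 1) * size] for r in range(world_size)]


def reduce_scatter(partials, world_size):
    """Sum the per-rank partial tensors, then give rank r the r-th sequence chunk."""
    summed = [
        [sum(vals) for vals in zip(*token_rows)]  # sum over ranks, per hidden dim
        for token_rows in zip(*partials)          # iterate over tokens
    ]
    return chunk(summed, world_size)


def all_gather(shards):
    """Concatenate per-rank sequence shards back into the full sequence."""
    return [row for shard in shards for row in shard]


def add(a, b):
    """Element-wise add of two [tokens][hidden] tensors with equal token counts."""
    assert len(a) == len(b), "shape mismatch along the sequence dim"
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Made-up partial embedding outputs: each rank holds half of the vocabulary,
# so it contributes zeros for tokens outside its vocab shard.
partials = [
    [[1, 2], [0, 0], [5, 6], [0, 0]],  # rank 0
    [[0, 0], [3, 4], [0, 0], [7, 8]],  # rank 1
]

# Problem 1: reduce-scatter after the embedding, so each rank keeps only its
# sequence shard instead of the full all-reduced sequence.
shards = reduce_scatter(partials, WORLD_SIZE)

# Problem 2: a full-length deepstack embedding (4 tokens) cannot be added to a
# 2-token shard, so chunk it along the sequence dim first.
deepstack_input_embeds = [[10, 10], [10, 10], [20, 20], [20, 20]]
ds_chunks = chunk(deepstack_input_embeds, WORLD_SIZE)
shards = [add(s, d) for s, d in zip(shards, ds_chunks)]

full = all_gather(shards)
print(full)  # [[11, 12], [13, 14], [25, 26], [27, 28]]
```

Adding the unchunked 4-token deepstack_input_embeds directly to a 2-token shard would trip the assertion in `add`, which is the toy analogue of the shape mismatch described in problem 2.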

Proposed Change.

Implement sequence parallelism via a graph compilation pass:
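One way to picture a pass-based approach is a graph rewrite similar in spirit to the sequence-parallelism pass in upstream vLLM's compilation module, which replaces an all-reduce followed by RMSNorm with a reduce-scatter, RMSNorm on the local shard, and an all-gather. The sketch below is a hypothetical illustration only: the graph is a flat list of op-name strings standing in for a real FX graph, and `sequence_parallel_pass` is not an existing vllm or vllm-ascend function.

```python
# Hypothetical toy pass: the "graph" is a flat list of op names, standing in
# for a real FX graph. Not a vllm / vllm-ascend API.


def sequence_parallel_pass(ops):
    """Rewrite each all_reduce -> rms_norm pair into
    reduce_scatter -> rms_norm -> all_gather, so that in a real graph the
    norm would run on each rank's 1/N shard of the tokens."""
    out = []
    i = 0
    while i < len(ops):
        if ops[i] == "all_reduce" and i + 1 < len(ops) and ops[i + 1] == "rms_norm":
            out.extend(["reduce_scatter", "rms_norm", "all_gather"])
            i += 2  # consume the matched pair
        else:
            out.append(ops[i])
            i += 1
    return out


# Toy decoder fragment. In this toy, the embedding's all_reduce is matched as
# well, which is where a reduce-scatter after the embedding layer would come from.
graph = ["embedding", "all_reduce", "rms_norm", "attention",
         "o_proj", "all_reduce", "rms_norm", "mlp"]
print(sequence_parallel_pass(graph))
```

Doing the rewrite in a pass rather than in per-model custom ops means a model such as a VL model only needs to expose the matching pattern; this is the general motivation for pass-based approaches, not a claim about the final design of this RFC.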

Feedback Period.

No response

CC List.

@wxsIcey @ApsarasX

Any Other Things.

No response
