vllm-project/vllm-ascend

[RFC]: support sequence parallelism by pass

Open

#5,712 opened on January 8, 2026

Labels: RFC, help wanted

Description

Motivation.

Flash Comm V1 (FC1) is a feature similar to sequence parallelism. In vllm-ascend, FC1 is implemented as a custom op, but it does not yet support VL (vision-language) models. Extending FC1 to VL models raises two problems:

1. The VL model lacks a reduce-scatter operation at the embedding layer, resulting in a redundant all-gather during the first step.

2. In Qwen3-VL, deepstack_input_embeds is added after the computation at each layer, but the shapes do not match. We must add a chunk before layernorm.
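
To make the shape mismatch in problem 2 concrete, here is a minimal pure-Python sketch. It assumes that, under sequence parallelism, each rank holds only its slice of the token dimension after a reduce-scatter, while deepstack_input_embeds still covers every token. The helpers `pad_tokens`, `chunk`, `reduce_scatter`, and `add_rows` are hypothetical stand-ins for tensor and collective operations, not vllm-ascend or vLLM APIs, and nested lists stand in for `[tokens, hidden]` tensors.

```python
# Pure-Python simulation; nested lists stand in for [tokens, hidden] tensors.
# All helper names are hypothetical illustrations, not vllm-ascend APIs.


def pad_tokens(rows, world_size):
    """Zero-pad the token dimension so it divides evenly across ranks."""
    hidden = len(rows[0])
    pad = (-len(rows)) % world_size
    return [list(r) for r in rows] + [[0.0] * hidden for _ in range(pad)]


def chunk(rows, world_size, rank):
    """Return the contiguous token slice owned by `rank`."""
    per_rank = len(rows) // world_size
    return rows[rank * per_rank:(rank + 1) * per_rank]


def reduce_scatter(partials):
    """Sum the per-rank partial outputs, then hand each rank one token slice."""
    world_size = len(partials)
    summed = [
        [sum(vals) for vals in zip(*token_rows)]
        for token_rows in zip(*partials)
    ]
    return [chunk(summed, world_size, r) for r in range(world_size)]


def add_rows(a, b):
    """Elementwise add that refuses mismatched token counts."""
    if len(a) != len(b):
        raise ValueError(f"token dim mismatch: {len(a)} vs {len(b)}")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


world_size = 2

# Each rank holds a partial result for all 4 (padded) tokens.
partials = [
    [[1.0, 1.0] for _ in range(4)],  # rank 0
    [[2.0, 2.0] for _ in range(4)],  # rank 1
]
local_hidden = reduce_scatter(partials)  # 2 tokens per rank

# The deepstack embeddings cover all 3 real tokens, padded to 4.
deepstack = pad_tokens([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]], world_size)

# Adding the full embeddings to a rank-local slice fails on shape.
try:
    add_rows(local_hidden[0], deepstack)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Chunking the embeddings per rank first makes the shapes line up.
outputs = [
    add_rows(local_hidden[r], chunk(deepstack, world_size, r))
    for r in range(world_size)
]
```

The sketch only illustrates why the full-length embeddings need to be chunked per rank. Where the chunk sits relative to layernorm, and how padding is handled in the real model, are left to the actual implementation.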

Proposed Change.

Implement sequence parallelism by pass:

Feedback Period.

No response

CC List.

@wxsIcey @ApsarasX

Any Other Things.

No response
