vllm-project/vllm-ascend

[RFC]: support sequence parallelism by pass

Open

#5,712 created on January 8, 2026

RFC, help wanted


Description

Motivation.

Flash Comm V1 (FC1) is a feature similar to sequence parallelism. In vllm-ascend, FC1 is implemented as a custom op, but it does not yet support VL models. Extending FC1 to VL models raises two problems:

1. The VL model lacks an embedding-layer reduce-scatter operation, resulting in a redundant all-gather during the first step.

2. In Qwen3-VL, deepstack_input_embeds is added after the computation at each layer, but its shape does not match that of the hidden states. A chunk operation must be added before the layernorm.
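
To make the two problems above concrete, here is a minimal pure-Python simulation. It is a sketch, not vllm-ascend code: it assumes hidden states are sharded along the token dimension across ranks (as in sequence parallelism), that the token count is divisible by the world size, and every function name (`chunk`, `reduce_scatter`, `all_gather`, `add_residual`) is hypothetical. It shows why a full-length tensor such as deepstack_input_embeds cannot be added to a per-rank shard until it is chunked the same way.

```python
from typing import List

Vector = List[float]       # hidden state of one token
Sequence = List[Vector]    # shape: [num_tokens][hidden_size]


def chunk(seq: Sequence, world_size: int, rank: int) -> Sequence:
    """Return the contiguous block of tokens owned by `rank`."""
    if len(seq) % world_size != 0:
        raise ValueError("token count must be divisible by world_size")
    per_rank = len(seq) // world_size
    return seq[rank * per_rank:(rank + 1) * per_rank]


def reduce_scatter(partials: List[Sequence]) -> List[Sequence]:
    """Sum the per-rank partial results, then hand each rank its token shard."""
    world_size = len(partials)
    summed = [
        [sum(values) for values in zip(*token_rows)]
        for token_rows in zip(*partials)
    ]
    return [chunk(summed, world_size, rank) for rank in range(world_size)]


def all_gather(shards: List[Sequence]) -> Sequence:
    """Concatenate the per-rank token shards back into the full sequence."""
    return [token for shard in shards for token in shard]


def add_residual(hidden: Sequence, extra: Sequence) -> Sequence:
    """Element-wise add; fails if the token counts differ."""
    if len(hidden) != len(extra):
        raise ValueError(
            f"shape mismatch: {len(hidden)} tokens vs {len(extra)} tokens"
        )
    return [[h + e for h, e in zip(h_row, e_row)]
            for h_row, e_row in zip(hidden, extra)]


if __name__ == "__main__":
    world_size = 2
    # Partial layer outputs on two ranks: 4 tokens, hidden size 2.
    partials = [
        [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
        [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
    ]
    # Stand-in for a full-length (unsharded) extra embedding.
    extra_embeds = [[0.5, 0.5]] * 4

    shards = reduce_scatter(partials)  # each rank now holds 2 tokens
    try:
        add_residual(shards[0], extra_embeds)  # 2 tokens vs 4 tokens
    except ValueError as err:
        print("without chunk:", err)

    # Chunking the extra embedding per rank makes the shapes agree.
    fixed = [
        add_residual(shards[rank], chunk(extra_embeds, world_size, rank))
        for rank in range(world_size)
    ]
    print("with chunk:", all_gather(fixed))
```

In this toy model, a layer that receives unsharded input has to gather before it can shard again, which is the kind of extra communication the first problem describes. The failing `add_residual` call mirrors the shape mismatch in the second.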

Proposed Change.

Implement sequence parallelism by pass:

Feedback Period.

No response

CC List.

@wxsIcey @ApsarasX

Any Other Things.

No response
