[Feature]: Tracking Whisper feature requests · vllm-project/vllm#25750

2025-09-26T07:54:28.000Z

### 🚀 The feature, motivation and pitch

This issue tracks recurring Whisper feature requests along with any linked ongoing efforts to support them. When a feature request has no linked PR, feel free to claim the work here if you want to help!

- [ ] Support different `response_formats` https://platform.openai.com/docs/api-reference/audio/createTranscription
  - Related issues: https://github.com/vllm-project/vllm/issues/19556, https://github.com/vllm-project/vllm/issues/14818, https://github.com/vllm-project/vllm/issues/24302
  - PR(s): https://github.com/vllm-project/vllm/pull/24209 (`verbose_json`, **help needed with other formats**)
- [ ] Support timestamp granularities:
  - Context: Very much related to the above. Unfortunately, outputting by `word` requires aligning encoder latents (usually extrapolated from the cross-attention layers) with decoder ones. I feel a lot of these Whisper-specific techniques add complexity to vLLM. However, I think we're open to exploring this direction if we can come up with a less invasive solution. Some references to get started: https://github.com/m-bain/whisperX https://github.com/openai/whisper/discussions/684 .
  - https://github.com/vllm-project/vllm/pull/24209 *partially* addresses this issue by letting Whisper predict the segments through timestamp tokens.
- [x] Automatic language detection:
  - Context: one should be able to let the decoder predict part of its "preamble" prompt, **including the language and task token, conditioned on the encoder output**. This effectively uses Whisper's built-in automatic language detection feature. Note that it would be ideal to _guide_ the output tokens among valid languages. Accuracy still needs to be evaluated.
  - Related issues: https://github.com/vllm-project/vllm/issues/14174
  - PR: https://github.com/vllm-project/vllm/pull/34342
  - [ ] Follow-up: Optimize performance to avoid encoding the audio twice, e.g. by leveraging the MM cache and/or submitting the request in a single `.generate` call
- [ ] Use MM encoder cache for encoder-decoder models
  - [ ] @alex-jw-brooks to propose an approach
- [x] Beam search:
  - PR(s): https://github.com/vllm-project/vllm/pull/13758 , this one needs reviving. Feel free to claim work here.
  - PR(s): https://github.com/vllm-project/vllm/pull/36153 https://github.com/vllm-project/vllm/pull/36160
- [ ] Feed previous chunk context to improve accuracy
  - PR(s): https://github.com/vllm-project/vllm/pull/20249 (first attempt, to be redone)
- [x] Audio chunking for Offline Whisper
  - Context: We already have this for online serving https://github.com/vllm-project/vllm/blob/72506c98349d6bcd32b4e33eec7b5513453c1502/vllm/entrypoints/openai/speech_to_text.py#L149, we just need to make this logic available to the engine. cc @sangbumlikeagod
  - PR(s): https://github.com/vllm-project/vllm/pull/34628
- [ ] Voice Activity Detection @ekagra-ranjan
  - Context: Whisper/ASR models currently served with vLLM in production need better pre-processing to avoid hallucinations caused by running ASR on chunks of background noise. While building a pipeline around vLLM is a popular option, adding a lightweight VAD model should make the out-of-the-box experience of using ASR endpoints in vLLM much better.

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
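For anyone picking up the remaining `response_formats` work, here is a minimal sketch of how timestamped segments could be rendered as `srt`, one of the formats listed in the OpenAI transcription API. It assumes segments are already available as (start, end, text) records, for example from timestamp tokens. The `Segment` dataclass and both helper functions are hypothetical names for illustration, not existing vLLM code.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical segment record: start/end in seconds plus decoded text."""
    start: float
    end: float
    text: str


def format_srt_timestamp(seconds: float) -> str:
    """Render a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[Segment]) -> str:
    """Join segments into an SRT document with 1-based cue numbers."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_srt_timestamp(seg.start)} --> {format_srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

A `vtt` renderer would likely follow the same shape, with a `WEBVTT` header and `.` instead of `,` as the millisecond separator.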
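On the idea of _guiding_ language detection among valid languages: one possible approach, sketched below, is to look only at the logits of the language tokens at the language position of the decoder prompt and renormalise over them, so a non-language token can never be selected. This is a sketch under stated assumptions and not how the linked PR implements it. The function name `pick_language` and the token ids in the usage example are made up for illustration.

```python
import math


def pick_language(
    logits: dict[int, float],
    language_token_ids: dict[str, int],
) -> tuple[str, float]:
    """Pick the most likely language, considering only valid language tokens.

    `logits` maps token id -> raw logit at the language position.
    `language_token_ids` maps language code -> token id (hypothetical ids).
    Returns (language code, probability renormalised over the valid languages).
    """
    scores = {
        lang: logits[token_id]
        for lang, token_id in language_token_ids.items()
        if token_id in logits
    }
    if not scores:
        raise ValueError("no language token found in logits")

    # Softmax restricted to language tokens; subtract the max for stability.
    max_score = max(scores.values())
    weights = {lang: math.exp(s - max_score) for lang, s in scores.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total


# Usage: token 99 has the highest logit overall but is not a language token,
# so it is ignored and "en" is chosen.
language, probability = pick_language(
    {10: 2.0, 11: 0.5, 99: 9.0},
    {"en": 10, "fr": 11},
)
```

The returned probability could also feed the accuracy evaluation mentioned above, for instance by flagging low-confidence detections.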

(13 comments) (17 reactions) (0 assignees) Python (16,816 forks)

feature request, good first issue, help wanted, keep-open, multi-modality

Repository metrics

Stars: (80,034 stars)
PR merge metrics: (Avg merge 3d 17h) (993 merged PRs in 30d)

Contributor guide

Research direction: Review the list of feature requests and identify one that has no linked PR and seems manageable. Start by reading the linked issues and understanding the context.
Tech stack: python
Domain: AI, machine learning
Issue type: Feature
Difficulty: 4
Estimated time: Over 1 week
Activity status: Active
Clarity: Needs investigation
Prerequisites: Python, Whisper
Newbie friendliness: 20
