Labels: feature request, good first issue, help wanted, keep-open, multi-modality
Description
🚀 The feature, motivation and pitch
This issue keeps track of recurring Whisper feature requests and of the ongoing efforts to support them, if any. When a feature request has no linked PR, feel free to claim the work here if you want to help!
- Support different `response_format`s: https://platform.openai.com/docs/api-reference/audio/createTranscription
- Related issues: https://github.com/vllm-project/vllm/issues/19556, https://github.com/vllm-project/vllm/issues/14818, https://github.com/vllm-project/vllm/issues/24302
- PR(s): https://github.com/vllm-project/vllm/pull/24209 (`verbose_json`, help needed with other formats)
- Support timestamp granularities:
- Context: Very much related to the above. Unfortunately, outputting by `word` requires aligning encoder latents (usually extrapolated from the cross-attention layers) with decoder ones. I feel a lot of these Whisper-specific techniques add complexity to vLLM. However, I think we're open to exploring this direction if we can come up with a less invasive solution. Some references to get started: https://github.com/m-bain/whisperX and https://github.com/openai/whisper/discussions/684
- https://github.com/vllm-project/vllm/pull/24209 partially addresses this issue by letting Whisper predict the segments through timestamp tokens.
- Automatic language detection:
- Context: One should be able to let the decoder predict part of its "preamble" prompt, including the language and task tokens, conditioned on the encoder output. This effectively uses Whisper's built-in automatic language detection. Ideally, the predicted tokens should be constrained to valid language tokens. Accuracy still needs to be evaluated.
- Related issues: https://github.com/vllm-project/vllm/issues/14174
- PR: https://github.com/vllm-project/vllm/pull/34342
- Follow-up: Optimize performance to avoid encoding the audio twice, e.g. by leveraging the MM cache and/or submitting the request in a single `.generate` call
- Use MM encoder cache for encoder-decoder models
- @alex-jw-brooks to propose an approach
- Beam search:
- PR(s): https://github.com/vllm-project/vllm/pull/13758 , this one needs reviving. Feel free to claim work here.
- PR(s): https://github.com/vllm-project/vllm/pull/36153 and https://github.com/vllm-project/vllm/pull/36160
- Feed previous chunk context to improve accuracy
- PR(s): https://github.com/vllm-project/vllm/pull/20249 (first attempt, to be redone)
- Audio chunking for Offline Whisper
- Context: We already have this for online serving https://github.com/vllm-project/vllm/blob/72506c98349d6bcd32b4e33eec7b5513453c1502/vllm/entrypoints/openai/speech_to_text.py#L149, just need to make this logic available to the engine cc @sangbumlikeagod
- PR(s): https://github.com/vllm-project/vllm/pull/34628
- Voice Activity Detection @ekagra-ranjan
- Context: Whisper/ASR models currently served with vLLM in production need better pre-processing to avoid hallucinations caused by running ASR on chunks of background noise. While building a pipeline around vLLM is a popular option, adding a lightweight VAD model should make the out-of-the-box experience of using ASR endpoints in vLLM much better.
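
To make the VAD item above more concrete, here is a minimal sketch of the kind of pre-filtering it describes. It uses a plain RMS energy threshold as a stand-in for a real VAD model, which is an assumption made for illustration and not what the item proposes to ship. `rms`, `speech_spans`, `frame_s` and `threshold` are hypothetical names, not vLLM APIs.

```python
import math


def rms(samples):
    """Root-mean-square energy of a list of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speech_spans(samples, sample_rate, frame_s=0.03, threshold=0.01):
    """Return (start_s, end_s) spans whose frames exceed the energy threshold.

    Hypothetical helper, not a vLLM API. A real VAD model would replace the
    energy test; the span-merging logic would stay roughly the same.
    """
    frame_len = max(1, int(round(sample_rate * frame_s)))
    spans = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if rms(frame) < threshold:
            continue  # treat low-energy frames as non-speech and drop them
        start = i / sample_rate
        end = min(i + frame_len, len(samples)) / sample_rate
        if spans and abs(spans[-1][1] - start) < 1e-9:
            # Contiguous with the previous active frame: extend that span.
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
    return spans


if __name__ == "__main__":
    # 0.2 s of silence, 0.3 s of "signal", 0.1 s of silence at 1 kHz.
    audio = [0.0] * 200 + [0.5] * 300 + [0.0] * 100
    print(speech_spans(audio, sample_rate=1000, frame_s=0.1))
```

Only the returned spans would then be sent to the ASR model, so chunks containing nothing but background noise never reach the decoder.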
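
For the `response_format` and timestamp items above, a rough sketch of how segment-level timestamps could be turned into one of the other formats (`srt` here). It assumes the decoded text still contains Whisper-style timestamp tokens such as `<|0.00|>`. `parse_segments` and `to_srt` are hypothetical helpers for illustration, not the code from the linked PR.

```python
import re

# Whisper-style timestamp tokens, e.g. <|0.00|> or <|12.34|>.
_TS = re.compile(r"<\|(\d+\.\d+)\|>")


def parse_segments(decoded):
    """Split decoded text with timestamp tokens into start/end/text segments."""
    # re.split with one capture group yields: text, ts, text, ts, ..., text
    parts = _TS.split(decoded)
    segments = []
    for i in range(1, len(parts) - 2, 2):
        text = parts[i + 1].strip()
        if text:  # skip the empty gap between back-to-back timestamp tokens
            segments.append(
                {"start": float(parts[i]), "end": float(parts[i + 2]), "text": text}
            )
    return segments


def _srt_time(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as SRT cues separated by blank lines."""
    blocks = []
    for n, seg in enumerate(segments, start=1):
        blocks.append(
            f"{n}\n{_srt_time(seg['start'])} --> {_srt_time(seg['end'])}\n{seg['text']}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    decoded = "<|0.00|> Hello there.<|2.40|><|2.40|> How are you?<|5.10|>"
    print(to_srt(parse_segments(decoded)))
```

Word-level granularity is not covered by this sketch, since it needs the encoder/decoder alignment discussed in the timestamp item.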
Alternatives
No response
Additional context
No response