vllm-project/vllm

[Bug]: Wrong timestamps if audio > 30s

Open

#32588 opened on Jan 19, 2026

Labels: bug, good first issue, help wanted

Description

Your current environment

NA

🐛 Describe the bug

Segment-level timestamps are incrementally offset by ~0.5s per segment.

When transcribing long audio with vLLM Whisper, segment-level timestamps become increasingly inaccurate. By default, chunking searches a 1s window for a low-energy (quiet) region to split on, so the next chunk may start up to 1s earlier than the nominal chunk length implies. This offset is not compensated for when segment timestamps are generated, so each segment adds an average delay of ~0.5s, and the error accumulates (e.g., ~5s after 10 segments).
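To make the accumulation concrete, here is a minimal sketch in plain Python. It is not vLLM code: the 30s nominal chunk length, the function names, and the example durations are all assumptions chosen for illustration. It compares offsets computed from the nominal chunk length with offsets computed from the real chunk durations.

```python
NOMINAL_CHUNK_S = 30.0  # assumed nominal chunk length, for illustration only


def nominal_offsets(chunk_durations):
    """Offsets when every chunk is assumed to last NOMINAL_CHUNK_S."""
    return [i * NOMINAL_CHUNK_S for i in range(len(chunk_durations))]


def actual_offsets(chunk_durations):
    """Offsets derived from the real duration of each preceding chunk."""
    offsets, start = [], 0.0
    for duration in chunk_durations:
        offsets.append(start)
        start += duration
    return offsets


# Hypothetical case: every split lands 0.5 s before the nominal boundary.
durations = [29.5] * 4
drift = [
    nominal - actual
    for nominal, actual in zip(nominal_offsets(durations), actual_offsets(durations))
]
print(drift)  # [0.0, 0.5, 1.0, 1.5]
```

Under these assumptions, a fix would amount to carrying the actual split positions forward instead of multiplying the chunk index by the nominal length.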

Clients can work around this by subtracting 0.5s per segment, but that is only an approximation, not an accurate correction.
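A rough sketch of that client-side workaround follows. It assumes segments arrive as dictionaries with `start` and `end` fields in seconds; the helper name `shift_segments` and the `drift_per_segment` parameter are hypothetical. Because 0.5s is only the average drift, the result remains approximate.

```python
def shift_segments(segments, drift_per_segment=0.5):
    """Subtract an estimated cumulative drift from each segment's timestamps.

    `segments` is assumed to be a list of dicts with "start" and "end"
    in seconds. The first segment (index 0) is left unshifted.
    """
    corrected = []
    for index, segment in enumerate(segments):
        shift = index * drift_per_segment
        corrected.append(
            {
                **segment,
                "start": max(0.0, segment["start"] - shift),
                "end": max(0.0, segment["end"] - shift),
            }
        )
    return corrected


# Hypothetical input: two segments reported at nominal 30 s boundaries.
example = [
    {"start": 0.0, "end": 30.0, "text": "first"},
    {"start": 30.0, "end": 60.0, "text": "second"},
]
fixed = shift_segments(example)
```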

Also, the related documentation is incorrect: chunks do not overlap; there is only a window used to search for the best split point. This confusion may be the cause of the issue.

