Description
Your current environment
NA
🐛 Describe the bug
Segment-level timestamps drift by ~0.5s per segment, and the drift accumulates.
When transcribing long audio with vLLM Whisper, segment-level timestamps become increasingly inaccurate. By default, chunking searches a 1s window for a low-energy (quiet) region to split at, so each chunk may end, and the next chunk may start, up to 1s earlier than the nominal chunk length. This offset is not compensated for when generating segment timestamps, so timestamps are late by ~0.5s per segment on average, and the error accumulates (e.g., ~5s after 10 segments).
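The accumulation can be illustrated with a small sketch. This is not vLLM's code: the 30s nominal chunk length, the example chunk durations, and the helper names `nominal_offsets` and `actual_offsets` are all assumptions made for illustration. It compares timestamps offset by `i * nominal_length` with offsets taken from the real chunk boundaries.

```python
from itertools import accumulate

def nominal_offsets(num_chunks, chunk_s):
    """Offsets if every chunk were exactly chunk_s long (the assumed buggy behaviour)."""
    return [i * chunk_s for i in range(num_chunks)]

def actual_offsets(chunk_durations_s):
    """Offsets derived from where each chunk really starts in the audio."""
    # The first chunk starts at 0; each later chunk starts where the previous one ended.
    return [0.0] + list(accumulate(chunk_durations_s))[:-1]

# Hypothetical values: 30s nominal chunks, each cut 0.25-0.75s early at a quiet point.
chunk_s = 30.0
durations = [29.5, 29.25, 29.75, 29.5]

nominal = nominal_offsets(len(durations), chunk_s)
actual = actual_offsets(durations)
drift = [n - a for n, a in zip(nominal, actual)]

print(actual)  # [0.0, 29.5, 58.75, 88.5]
print(drift)   # [0.0, 0.5, 1.25, 1.5]
```

Under these assumptions the fix amounts to adding each chunk's true start time to its segment timestamps instead of a multiple of the nominal chunk length.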
Clients can work around this by subtracting 0.5s per segment, but that is only an approximation, since the actual offset of each chunk can be anywhere between 0 and 1s.
Also, the related documentation is incorrect: chunks do not overlap. The window is only used to search for the best split point. This confusion may be the cause of the issue.
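To make the distinction concrete, here is a simplified sketch of split-point search with no overlap. It is an assumption-laden illustration, not vLLM's implementation: `find_split` and `split_chunks` are hypothetical names, "energy" is reduced to the absolute value of a single sample, and `window` is assumed to be smaller than `chunk_len`.

```python
def find_split(samples, nominal_end, window):
    """Pick the quietest sample index in [nominal_end - window, nominal_end)."""
    start = max(0, nominal_end - window)
    end = min(nominal_end, len(samples))
    return min(range(start, end), key=lambda i: abs(samples[i]))

def split_chunks(samples, chunk_len, window):
    """Split into contiguous, non-overlapping (start, end) index ranges."""
    chunks = []
    pos = 0
    while len(samples) - pos > chunk_len:
        cut = find_split(samples, pos + chunk_len, window)
        chunks.append((pos, cut))
        pos = cut  # the next chunk starts exactly at the cut, so nothing overlaps
    chunks.append((pos, len(samples)))
    return chunks

# Toy signal: zeros mark quiet points.
samples = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
print(split_chunks(samples, chunk_len=5, window=2))
# [(0, 3), (3, 7), (7, 12)]
```

Each chunk ends where the next begins, and the chunk starts (0, 3, 7) differ from the nominal multiples of `chunk_len` (0, 5, 10), which is the offset that the timestamps would need to account for.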
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.