[Bug]: Wrong timestamps if audio > 30s · vllm-project/vllm#32588

(11 comments) (1 reaction) (2 assignees) Python (16,816 forks)

bug, good first issue, help wanted

Repository metrics

Stars: (80,034 stars)
PR merge metrics: (Avg merge 3d 17h) (993 merged PRs in 30d)

Description

Your current environment

🐛 Describe the bug

Segment-level timestamps drift by an additional ~0.5s per segment.

When transcribing long audio with vLLM Whisper, segment-level timestamps become increasingly inaccurate. By default, chunking searches a 1s window for a low-energy (quiet) region to split at, so a chunk can end, and the next one start, up to 1s earlier than the nominal chunk length implies. This offset is not compensated for when segment timestamps are generated, so each segment is delayed by ~0.5s on average, and the delay accumulates (e.g., ~5s after 10 segments).
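To make the arithmetic concrete, here is a minimal sketch of the drift, not vLLM's actual code. The function names, the 30s nominal chunk length, and the example durations are assumptions for illustration only.

```python
from itertools import accumulate

NOMINAL_CHUNK_S = 30.0  # assumed nominal chunk length, for illustration


def nominal_offsets(n_chunks, chunk_s=NOMINAL_CHUNK_S):
    """Offsets if every chunk is assumed to have the nominal length."""
    return [i * chunk_s for i in range(n_chunks)]


def true_offsets(chunk_durations_s):
    """Offsets derived from the real duration of each preceding chunk."""
    return [0.0, *accumulate(chunk_durations_s)][: len(chunk_durations_s)]


# Hypothetical durations: every split lands 0.5s before the nominal boundary.
durations = [29.5] * 4
drift = [n - t for n, t in zip(nominal_offsets(4), true_offsets(durations))]
print(drift)  # [0.0, 0.5, 1.0, 1.5]
```

If timestamps were offset by the cumulative real chunk durations instead of by a multiple of the nominal length, the drift in this sketch would be zero.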

Clients can work around this by subtracting 0.5s per segment, but that is only an approximation, since the actual offset of each split can be anywhere between 0 and 1s.
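A rough sketch of that client-side workaround follows. It assumes each segment is a dict with `start` and `end` fields in seconds; `shift_segments` is a hypothetical helper, not part of any vLLM or client API.

```python
def shift_segments(segments, per_segment_s=0.5):
    """Subtract an estimated cumulative delay from each segment.

    Segment i is shifted by i * per_segment_s. This uses the average delay,
    not the real split offsets, so the result is still approximate.
    """
    corrected = []
    for i, seg in enumerate(segments):
        shift = i * per_segment_s
        corrected.append({
            **seg,
            "start": max(0.0, seg["start"] - shift),
            "end": max(0.0, seg["end"] - shift),
        })
    return corrected


# Hypothetical input with the drift described above.
example = [
    {"start": 0.0, "end": 29.0, "text": "first"},
    {"start": 30.5, "end": 59.0, "text": "second"},
]
fixed = shift_segments(example)
```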

Also, the related documentation is incorrect: chunks do not overlap; there is only a window that is searched for the best split point. This confusion may be the cause of the issue.
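The distinction between overlapping chunks and a split-point search window can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions (pure Python lists, hypothetical function and parameter names, `search_len` smaller than `chunk_len`), not the code in vLLM.

```python
def quietest_index(samples, lo, hi, frame):
    """Start index of the lowest-energy `frame`-sample slice within [lo, hi)."""
    candidates = range(lo, max(lo + 1, hi - frame + 1))
    return min(candidates, key=lambda i: sum(x * x for x in samples[i:i + frame]))


def split_without_overlap(samples, chunk_len, search_len, frame):
    """Cut `samples` into consecutive chunks of at most `chunk_len` samples.

    Assumes search_len < chunk_len so that every cut advances.
    """
    chunks, start = [], 0
    while len(samples) - start > chunk_len:
        nominal_end = start + chunk_len
        # Search only the last `search_len` samples of the nominal chunk.
        cut = quietest_index(samples, nominal_end - search_len, nominal_end, frame)
        chunks.append(samples[start:cut])
        start = cut  # the next chunk begins exactly at the cut: no overlap
    chunks.append(samples[start:])
    return chunks


# Toy signal with a quiet stretch at indices 6 and 7.
signal = [1.0] * 10
signal[6] = signal[7] = 0.0
pieces = split_without_overlap(signal, chunk_len=8, search_len=4, frame=2)
```

Each chunk here is shorter than the nominal length by however far the cut moved back, which is exactly the per-chunk offset that has to be carried into the timestamps.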

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Contributor guide

Research direction: Examine the audio chunking logic in vLLM Whisper to understand how timestamps are generated after splitting. The offset accumulates because the split point window is not compensated. Look at where segment timestamps are computed and add correction for the split offset.
Tech stack: python
Domain: backend, api
Issue type: Bug
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Active
Clarity: Mostly clear
Prerequisites: Git, Python
Newbie friendliness: 70