[Bug]: FlashInfer Incompatible with Sleep Mode

Repository metrics

Stars: (80,034 stars)
PR merge metrics: (Avg merge 3d 17h) (993 merged PRs in 30d)

Description

Your current environment

Your output of `python collect_env.py` here

🐛 Describe the bug

Here is a script to reproduce the bug: I use vllm=v0.10.1 and flashinfer-python=v0.5.3.

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model_pth = "xxx/Qwen3-1.7B"  
    tp_size = 1
    llm = LLM(
        model=model_pth, 
        enable_sleep_mode=True,
        tensor_parallel_size=tp_size,
        gpu_memory_utilization=0.7, 
    )

    llm.sleep(level=1)
    llm.wake_up()

    prompts = [
        "What is AI?", 
        "Where is the Machu Picchu located?", 
        "What is the capital of France?",
        "Who painted the Mona Lisa?",
    ]

    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.9,
        max_tokens=64,
    )

    outputs = llm.generate(prompts, sampling_params)

    for i, out in enumerate(outputs):
        prompt = prompts[i]
        generated = out.outputs[0].text
        print(f"Prompt {i}: {prompt!r}")
        print(f"Generation: {generated}\n")

Root Cause

The bug occurs because the FlashInfer backend’s attn_metadata is stateful. It holds a block_table_arange tensor that is initialized once and then reused across subsequent calls to build:

self.block_table_arange = torch.arange(
    max_num_pages_per_req,
    dtype=torch.int32,
    device=self.device,
)

This block_table_arange tensor is allocated in the mempool with the "kv_cache" tag. It gets discarded after calling llm.sleep, but is not recreated when the engine wakes up, which leads to incorrect values and thus wrong outputs.

Specifically, this will cause bad rollout outputs in VERL using vllm + flashinfer.

Temporary Fix

Here is a patch as a temporary workaround. It’s not an ideal solution, but it works:

from vllm.v1.attention.backends.flashinfer import FlashInferMetadataBuilder
import torch

def patch_flashinfer_build():
    old_build = FlashInferMetadataBuilder.build

    def new_build(*args, **kwargs):
        self = args[0]
        max_num_pages_per_req = self.block_table_arange.numel()
        self.block_table_arange.copy_(
            torch.arange(
                max_num_pages_per_req,
                device=self.block_table_arange.device,
                dtype=self.block_table_arange.dtype,
            )
        )
        return old_build(*args, **kwargs)

    FlashInferMetadataBuilder.build = new_build

patch_flashinfer_build()

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Contributor guide

Research direction: Inspect the lifecycle of block table arange tensor in FlashInferMetadataBuilder, particularly during sleep and wake operations, to ensure it is reinitialized correctly.
Tech stack: python
Domain: backend
Issue type: Bug
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: PythonPyTorchvllm
Newbie friendliness: 60