[Feature Request] fast inference for LFM (and Mamba models) · unslothai/unsloth#4073


Labels: good first issue, help wanted


Description

Bug Description

When using FastLanguageModel.from_pretrained() with fast_inference=True on an LFM2.5 model (LiquidAI/LFM2.5-1.2B-Thinking, architecture Lfm2ForCausalLM), the model loads into vLLM successfully but crashes during state dict extraction.

Error

File "unsloth_zoo/vllm_utils.py", line 1122, in _get_vllm_state_dict
    get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)
                       ^^^^^^
UnboundLocalError: cannot access local variable 'prefix' where it is not associated with a value

Root Cause

In _get_vllm_state_dict, the layer iteration loop assigns prefix only inside the if hasattr(layer, "self_attn") and elif hasattr(layer, "cross_attn") branches. The get_state_dict(f"{prefix}.o_proj", ...) call, however, sits at loop-body level, outside both branches, so it runs for every layer regardless of which branch (if any) was taken.

LFM2/Mamba layers use mixer (or similar) instead of self_attn/cross_attn, so neither branch executes and prefix is never assigned.

for kk in range(len(vllm_text_model.layers)):
    layer = vllm_text_model.layers[kk]
    if hasattr(layer, "self_attn"):
        prefix = f"..."  # set here
        # ...
    elif hasattr(layer, "cross_attn"):
        prefix = f"..."  # set here
        # ...
    # Mamba layers fall through — prefix never set
    get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)  # CRASH
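
The failure mode can be reproduced without Unsloth or vLLM installed. The sketch below is a simplified loop with the same shape as the one above; AttentionLayer, MixerLayer, collect_keys, and the "layers.{kk}..." prefixes are hypothetical stand-ins for illustration, not the real vLLM classes or the real key names.

```python
class AttentionLayer:
    """Stand-in for a decoder layer that exposes self_attn."""
    self_attn = object()


class MixerLayer:
    """Stand-in for a layer that has neither self_attn nor cross_attn."""
    mixer = object()


def collect_keys(layers):
    """Simplified loop: prefix is assigned only inside the two branches."""
    keys = []
    for kk, layer in enumerate(layers):
        if hasattr(layer, "self_attn"):
            prefix = f"layers.{kk}.self_attn"   # illustrative prefix
        elif hasattr(layer, "cross_attn"):
            prefix = f"layers.{kk}.cross_attn"  # illustrative prefix
        # No else branch: a MixerLayer reaches the next line without
        # prefix having been assigned in this iteration.
        keys.append(f"{prefix}.o_proj")
    return keys


# An attention-only model works.
print(collect_keys([AttentionLayer()]))

# A model whose first layer is a mixer layer raises UnboundLocalError.
try:
    collect_keys([MixerLayer()])
except UnboundLocalError as exc:
    print(type(exc).__name__)
```

In this simplified loop, a mixer layer that comes after an attention layer does not raise at all: prefix still holds the previous layer's value, so the key for the earlier layer is emitted a second time. Whether the real function can hit that silent case depends on code not shown in this report.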

Environment

Unsloth: 2026.2.1
vLLM: 0.15.1
PyTorch: 2.9.1+cu128
CUDA: 12.8
GPU: NVIDIA GeForce RTX 5080 (Blackwell, sm_120a)
Model: LiquidAI/LFM2.5-1.2B-Thinking (Lfm2ForCausalLM)

Steps to Reproduce

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="LiquidAI/LFM2.5-1.2B-Thinking",
    max_seq_length=4096,
    load_in_4bit=False,
    fast_inference=True,
)

Notes

vLLM itself handles LFM2 fine — model loads as Lfm2ForCausalLM, CUDA graphs are captured, KV cache is allocated. The crash is only in Unsloth's _get_vllm_state_dict wrapper.
fast_inference=False works as expected (bypasses vLLM entirely).
There is no FastLfm2Model class in Unsloth — LFM2 falls through to the generic FastModel/FastBaseModel path, which does attempt vLLM initialization.
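
Until the loop is fixed, a caller could fall back to fast_inference=False when this specific exception appears. The following is only a sketch under stated assumptions: load_with_fallback and fake_loader are hypothetical helpers, and the loader is passed in as a callable so the example runs without Unsloth installed. With Unsloth, the loader would be FastLanguageModel.from_pretrained with the arguments from the reproduction steps.

```python
def load_with_fallback(loader, **kwargs):
    """Try fast_inference=True, then retry with it disabled.

    Returns (loader_result, used_fast_inference). Only UnboundLocalError
    is caught, since that is the exception reported in this issue.
    """
    try:
        return loader(fast_inference=True, **kwargs), True
    except UnboundLocalError:
        return loader(fast_inference=False, **kwargs), False


def fake_loader(fast_inference, **kwargs):
    """Hypothetical loader that mimics the reported crash."""
    if fast_inference:
        raise UnboundLocalError("prefix")
    return ("model", "tokenizer")


(model, tokenizer), used_fast = load_with_fallback(
    fake_loader,
    model_name="LiquidAI/LFM2.5-1.2B-Thinking",
    max_seq_length=4096,
    load_in_4bit=False,
)
print(used_fast)
```

One caveat: by the time the real crash occurs, vLLM has already loaded the model and allocated its KV cache, and this sketch makes no attempt to release that memory before retrying. Passing fast_inference=False from the start is likely the safer workaround on a memory-constrained GPU.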

Suggested Fix

Add handling for Mamba/SSM layers in the loop — either skip them with continue or add an elif hasattr(layer, "mixer") branch that extracts the correct state dict for Mamba layers.
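
Both options can be sketched on a simplified loop. This is not the actual patch: collect_keys_fixed, the handle_mixer flag, and the "layers.{kk}..." key names are hypothetical, and the attribute name "mixer" is taken from this report and should be checked against vLLM's Lfm2ForCausalLM layers. The real fix would also need to extract the correct weights for those layers, which the placeholder below does not attempt.

```python
from types import SimpleNamespace


def collect_keys_fixed(layers, handle_mixer=False):
    """Simplified loop that never reaches the o_proj line without a prefix."""
    keys = []
    skipped = []
    for kk, layer in enumerate(layers):
        if hasattr(layer, "self_attn"):
            prefix = f"layers.{kk}.self_attn"
        elif hasattr(layer, "cross_attn"):
            prefix = f"layers.{kk}.cross_attn"
        elif handle_mixer and hasattr(layer, "mixer"):
            # Option 2: dedicated branch. The key below is a placeholder
            # for whatever weights a mixer layer actually carries.
            keys.append(f"layers.{kk}.mixer")
            continue
        else:
            # Option 1: skip layers with no known attention attribute.
            skipped.append(kk)
            continue
        keys.append(f"{prefix}.o_proj")
    return keys, skipped


# Stand-in layers: one mixer layer followed by one attention layer.
layers = [SimpleNamespace(mixer=object()), SimpleNamespace(self_attn=object())]

print(collect_keys_fixed(layers))
print(collect_keys_fixed(layers, handle_mixer=True))
```

Skipping is the smaller change, but the skipped layers' weights are then absent from the extracted state dict, so any downstream consumer that expects a complete state dict would still need the dedicated branch.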

Contributor guide

Research direction: Inspect the _get_vllm_state_dict function in unsloth_zoo/vllm_utils.py around line 1122. Identify the loop where prefix is set only in the self_attn/cross_attn branches. Add support for Mamba/SSM layers (e.g., those using a 'mixer' attribute) by adding an elif branch that sets prefix appropriately and extracts the state dict. Look at vLLM's handling of Lfm2ForCausalLM for reference. Test with an LFM2.5 model.
Tech stack: Python, PyTorch
Domain: backend, machine learning
Issue type: Bug
Difficulty: 3
Estimated time: 1-3 hours
Activity status: Active
Clarity: Clear
Prerequisites: Python, PyTorch, vLLM
Newbie friendliness: 40