unslothai/unsloth

[Feature Request] fast inference for LFM (and Mamba models)

Open

#4073 opened on Feb 17, 2026

Labels: good first issue, help wanted

Bug Description

When using FastLanguageModel.from_pretrained() with fast_inference=True on an LFM2.5 model (LiquidAI/LFM2.5-1.2B-Thinking, architecture Lfm2ForCausalLM), the model loads into vLLM successfully but crashes during state dict extraction.

Error

File "unsloth_zoo/vllm_utils.py", line 1122, in _get_vllm_state_dict
    get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)
                       ^^^^^^
UnboundLocalError: cannot access local variable 'prefix' where it is not associated with a value

Root Cause

In _get_vllm_state_dict, the layer iteration loop assigns prefix only inside the if hasattr(layer, "self_attn") and elif hasattr(layer, "cross_attn") branches. The get_state_dict(f"{prefix}.o_proj", ...) call sits at loop-body level, outside both branches, so it runs for every layer regardless of its type.

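LFM2/Mamba layers use mixer (or similar) instead of self_attn/cross_attn, so neither branch executes for them. When such a layer is reached before any attention layer, prefix has never been assigned by the time get_state_dict reads it, which raises the UnboundLocalError above.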

for kk in range(len(vllm_text_model.layers)):
    layer = vllm_text_model.layers[kk]
    if hasattr(layer, "self_attn"):
        prefix = f"..."  # set here
        # ...
    elif hasattr(layer, "cross_attn"):
        prefix = f"..."  # set here
        # ...
    # Mamba layers fall through — prefix never set
    get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)  # CRASH
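
The control flow above can be reproduced in isolation, without Unsloth or vLLM. The following is a minimal sketch using stand-in layer objects; collect_prefixes, the prefix format strings, and the mixer attribute name are hypothetical and only mimic the shape of the loop, not the actual code in unsloth_zoo/vllm_utils.py.

```python
from types import SimpleNamespace


def collect_prefixes(layers):
    """Mimic the control flow only: prefix is set in two branches, then used after them."""
    seen = []
    for kk, layer in enumerate(layers):
        if hasattr(layer, "self_attn"):
            prefix = f"layers.{kk}.self_attn"   # hypothetical prefix format
        elif hasattr(layer, "cross_attn"):
            prefix = f"layers.{kk}.cross_attn"  # hypothetical prefix format
        # No else branch: a layer with neither attribute leaves prefix untouched.
        seen.append(f"{prefix}.o_proj")
    return seen


attn = SimpleNamespace(self_attn=object())
other = SimpleNamespace(mixer=object())  # stand-in for a non-attention layer

# Non-attention layer first: prefix is read before it was ever assigned.
try:
    collect_prefixes([other, attn])
except UnboundLocalError as exc:
    print("raised:", type(exc).__name__)

# Attention layer first: no exception, but the second entry silently
# reuses the prefix left over from the previous layer.
print(collect_prefixes([attn, other]))
```

The second call shows why a fix should not rely on the exception alone: in a hybrid model whose first layer is an attention layer, the same loop shape would run without error while using a stale prefix for the non-attention layers.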

Environment

  • Unsloth: 2026.2.1
  • vLLM: 0.15.1
  • PyTorch: 2.9.1+cu128
  • CUDA: 12.8
  • GPU: NVIDIA GeForce RTX 5080 (Blackwell, sm_120a)
  • Model: LiquidAI/LFM2.5-1.2B-Thinking (Lfm2ForCausalLM)

Steps to Reproduce

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="LiquidAI/LFM2.5-1.2B-Thinking",
    max_seq_length=4096,
    load_in_4bit=False,
    fast_inference=True,
)

Notes

  • vLLM itself handles LFM2 fine — model loads as Lfm2ForCausalLM, CUDA graphs are captured, KV cache is allocated. The crash is only in Unsloth's _get_vllm_state_dict wrapper.
  • fast_inference=False works as expected (bypasses vLLM entirely).
  • There is no FastLfm2Model class in Unsloth — LFM2 falls through to the generic FastModel/FastBaseModel path, which does attempt vLLM initialization.

Suggested Fix

Add handling for Mamba/SSM layers in the loop — either skip them with continue or add an elif hasattr(layer, "mixer") branch that extracts the correct state dict for Mamba layers.
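
One possible shape for the skip variant, sketched on stand-in objects rather than on the real function. export_attention_keys, the model.layers base prefix, and the mixer attribute are assumptions made for illustration; the actual patch would live in _get_vllm_state_dict and may need a different structure.

```python
from types import SimpleNamespace


def export_attention_keys(layers, base="model.layers"):
    """Return (exported o_proj keys, indices of skipped layers).

    Hypothetical helper: it only demonstrates guarding the attention-specific
    work so that it never runs for a layer without an attention module.
    """
    exported, skipped = [], []
    for kk, layer in enumerate(layers):
        if hasattr(layer, "self_attn"):
            prefix = f"{base}.{kk}.self_attn"
        elif hasattr(layer, "cross_attn"):
            prefix = f"{base}.{kk}.cross_attn"
        else:
            # Explicitly skip non-attention layers (e.g. Mamba/SSM-style blocks).
            # A complete fix would extract these layers' own weights here instead.
            skipped.append(kk)
            continue
        # Reached only when prefix was assigned for this layer.
        exported.append(f"{prefix}.o_proj")
    return exported, skipped


layers = [
    SimpleNamespace(mixer=object()),       # stand-in for a non-attention layer
    SimpleNamespace(self_attn=object()),
    SimpleNamespace(mixer=object()),
    SimpleNamespace(cross_attn=object()),
]
exported, skipped = export_attention_keys(layers)
print(exported)
print(skipped)
```

Skipping alone avoids the crash but presumably leaves the non-attention layers' weights out of the extracted state dict, so the elif hasattr(layer, "mixer") route is likely what full support would require.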
