unslothai/unsloth

[Feature Request] fast inference for LFM (and Mamba models)

Open

#4073 opened on Feb 17, 2026

Labels: good first issue, help wanted

Description

Bug Description

When using FastLanguageModel.from_pretrained() with fast_inference=True on an LFM2.5 model (LiquidAI/LFM2.5-1.2B-Thinking, architecture Lfm2ForCausalLM), the model loads into vLLM successfully but crashes during state dict extraction.

Error

File "unsloth_zoo/vllm_utils.py", line 1122, in _get_vllm_state_dict
    get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)
                       ^^^^^^
UnboundLocalError: cannot access local variable 'prefix' where it is not associated with a value

Root Cause

In _get_vllm_state_dict, the layer iteration loop assigns prefix only inside the if hasattr(layer, "self_attn") and elif hasattr(layer, "cross_attn") branches. The get_state_dict(f"{prefix}.o_proj", ...) call sits at the loop body level, outside both branches, so it runs for every layer regardless of which branch (if any) was taken.

LFM2/Mamba layers use mixer (or similar) instead of self_attn/cross_attn, so neither branch executes and prefix is never assigned.

for kk in range(len(vllm_text_model.layers)):
    layer = vllm_text_model.layers[kk]
    if hasattr(layer, "self_attn"):
        prefix = f"..."  # set here
        # ...
    elif hasattr(layer, "cross_attn"):
        prefix = f"..."  # set here
        # ...
    # Mamba layers fall through — prefix never set
    get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)  # CRASH
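The control flow above can be reproduced without vLLM or a GPU. The following is a minimal sketch using stand-in objects: `collect_prefixes`, the `toy.` prefix strings, and the `SimpleNamespace` layers are hypothetical placeholders, not the actual contents of `unsloth_zoo/vllm_utils.py`, whose prefix values are elided in the snippet above.

```python
from types import SimpleNamespace


def collect_prefixes(layers):
    """Mimic the loop shape described above: prefix is bound in only two branches."""
    seen = []
    for kk, layer in enumerate(layers):
        if hasattr(layer, "self_attn"):
            prefix = f"toy.{kk}.self_attn"  # placeholder, not the real prefix
        elif hasattr(layer, "cross_attn"):
            prefix = f"toy.{kk}.cross_attn"  # placeholder, not the real prefix
        # A layer with neither attribute reaches this line without a fresh prefix.
        seen.append(f"{prefix}.o_proj")
    return seen


attention_layer = SimpleNamespace(self_attn=object())
mixer_layer = SimpleNamespace(mixer=object())  # stand-in for a non-attention layer

# Non-attention layer first: prefix has never been assigned.
try:
    collect_prefixes([mixer_layer, attention_layer])
    error = None
except UnboundLocalError as exc:
    error = exc

# Attention layer first: no exception, but the second entry reuses a stale prefix.
stale = collect_prefixes([attention_layer, mixer_layer])
```

The second call suggests a quieter variant of the same problem: when a non-attention layer follows an attention layer, the loop would not crash but would reuse the previous layer's prefix. Whether that case arises in practice depends on the model's layer order.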

Environment

  • Unsloth: 2026.2.1
  • vLLM: 0.15.1
  • PyTorch: 2.9.1+cu128
  • CUDA: 12.8
  • GPU: NVIDIA GeForce RTX 5080 (Blackwell, sm_120a)
  • Model: LiquidAI/LFM2.5-1.2B-Thinking (Lfm2ForCausalLM)

Steps to Reproduce

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="LiquidAI/LFM2.5-1.2B-Thinking",
    max_seq_length=4096,
    load_in_4bit=False,
    fast_inference=True,
)

Notes

  • vLLM itself handles LFM2 fine: the model loads as Lfm2ForCausalLM, CUDA graphs are captured, and the KV cache is allocated. The crash occurs only in Unsloth's _get_vllm_state_dict wrapper.
  • fast_inference=False works as expected (bypasses vLLM entirely).
  • There is no FastLfm2Model class in Unsloth. LFM2 falls through to the generic FastModel/FastBaseModel path, which does attempt vLLM initialization.
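Since fast_inference=False is reported to work, a temporary workaround could try the vLLM path first and retry without it. This is only a sketch: `load_with_fallback` is a hypothetical helper, and `loader` is assumed to be a callable with the same keyword arguments as `FastLanguageModel.from_pretrained`.

```python
def load_with_fallback(loader, **kwargs):
    """Try fast_inference=True; retry without vLLM if the UnboundLocalError occurs.

    Returns (loader_result, used_fast_inference).
    """
    try:
        return loader(fast_inference=True, **kwargs), True
    except UnboundLocalError:
        # The crash reported in this issue; fall back to the non-vLLM path.
        return loader(fast_inference=False, **kwargs), False


# Intended usage (assumes Unsloth is installed and a GPU is available):
# from unsloth import FastLanguageModel
# (model, tokenizer), used_fast = load_with_fallback(
#     FastLanguageModel.from_pretrained,
#     model_name="LiquidAI/LFM2.5-1.2B-Thinking",
#     max_seq_length=4096,
#     load_in_4bit=False,
# )
```

One caveat: the failed attempt has already loaded the model into vLLM, so GPU memory may still be held when the retry starts. Passing fast_inference=False from the outset, or retrying in a fresh process, may be safer on smaller cards.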

Suggested Fix

Add handling for Mamba/SSM layers in the loop — either skip them with continue or add an elif hasattr(layer, "mixer") branch that extracts the correct state dict for Mamba layers.
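The "skip with continue" option could look roughly like the sketch below. It uses the same kind of stand-in layers as a toy model of the loop: `extract_attention_prefixes` and the `toy.` prefix strings are hypothetical, and the real function would call get_state_dict where this sketch only records a key.

```python
from types import SimpleNamespace


def extract_attention_prefixes(layers):
    """Record an o_proj key for attention layers only; skip everything else."""
    prefixes = {}
    skipped = []
    for kk, layer in enumerate(layers):
        if hasattr(layer, "self_attn"):
            prefix = f"toy.{kk}.self_attn"  # placeholder prefix
        elif hasattr(layer, "cross_attn"):
            prefix = f"toy.{kk}.cross_attn"  # placeholder prefix
        else:
            # Non-attention layer (e.g. one exposing `mixer`): nothing to extract here.
            skipped.append(kk)
            continue
        prefixes[kk] = f"{prefix}.o_proj"
    return prefixes, skipped


layers = [
    SimpleNamespace(mixer=object()),
    SimpleNamespace(self_attn=object()),
    SimpleNamespace(cross_attn=object()),
]
prefixes, skipped = extract_attention_prefixes(layers)
```

Skipping avoids the crash but presumably leaves the skipped layers' weights out of the extracted state dict, so a dedicated elif branch for those layers is likely needed if downstream code expects a complete state dict.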
