[Feature Request] fast inference for LFM (and Mamba models) · unslothai/unsloth#4073

Repository metrics

Stars: 64,271
PR merge metrics: average merge time 3d 15h; 525 PRs merged in 30d

Description

Bug Description

When using FastLanguageModel.from_pretrained() with fast_inference=True on an LFM2.5 model (LiquidAI/LFM2.5-1.2B-Thinking, architecture Lfm2ForCausalLM), the model loads into vLLM successfully but crashes during state dict extraction.

Error

File "unsloth_zoo/vllm_utils.py", line 1122, in _get_vllm_state_dict
    get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)
                       ^^^^^^
UnboundLocalError: cannot access local variable 'prefix' where it is not associated with a value

Root Cause

In _get_vllm_state_dict, the layer iteration loop only sets prefix inside the if hasattr(layer, "self_attn") and elif hasattr(layer, "cross_attn") branches. The get_state_dict(f"{prefix}.o_proj", ...) call sits at the loop body level (outside both branches), so it runs for every layer regardless of which branch, if any, was taken.

LFM2/Mamba layers use mixer (or similar) instead of self_attn/cross_attn, so neither branch executes and prefix is never assigned.

for kk in range(len(vllm_text_model.layers)):
    layer = vllm_text_model.layers[kk]
    if hasattr(layer, "self_attn"):
        prefix = f"..."  # set here
        # ...
    elif hasattr(layer, "cross_attn"):
        prefix = f"..."  # set here
        # ...
    # Mamba layers fall through — prefix never set
    get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)  # CRASH
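The failure mode can be reproduced without Unsloth or vLLM. The following is a minimal sketch that only mirrors the loop shape described above; `collect_prefixes`, the `SimpleNamespace` stand-in layers, and the `layers.{kk}...` key format are hypothetical and are not taken from `unsloth_zoo`.

```python
from types import SimpleNamespace


def collect_prefixes(layers):
    """Mimic the loop shape from the issue: prefix is bound only in the attention branches."""
    seen = []
    for kk, layer in enumerate(layers):
        if hasattr(layer, "self_attn"):
            prefix = f"layers.{kk}.self_attn"   # hypothetical key format
        elif hasattr(layer, "cross_attn"):
            prefix = f"layers.{kk}.cross_attn"  # hypothetical key format
        # No else branch: a layer with neither attribute leaves prefix untouched.
        seen.append(f"{prefix}.o_proj")
    return seen


attention_layer = SimpleNamespace(self_attn=object())
mixer_layer = SimpleNamespace(mixer=object())  # stand-in for a Mamba/SSM-style layer

# Case 1: the non-attention layer comes first, so prefix was never bound.
try:
    collect_prefixes([mixer_layer, attention_layer])
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

# Case 2: the non-attention layer comes second and silently reuses the stale prefix.
stale = collect_prefixes([attention_layer, mixer_layer])
```

In this toy version, the second case matters as much as the first: when an attention layer precedes the non-attention layer, no exception is raised and the previous layer's prefix is reused. Whether the real function can hit that path depends on code not shown in this issue.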

Environment

Unsloth: 2026.2.1
vLLM: 0.15.1
PyTorch: 2.9.1+cu128
CUDA: 12.8
GPU: NVIDIA GeForce RTX 5080 (Blackwell, sm_120a)
Model: LiquidAI/LFM2.5-1.2B-Thinking (Lfm2ForCausalLM)

Steps to Reproduce

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="LiquidAI/LFM2.5-1.2B-Thinking",
    max_seq_length=4096,
    load_in_4bit=False,
    fast_inference=True,
)

Notes

vLLM itself handles LFM2 fine — model loads as Lfm2ForCausalLM, CUDA graphs are captured, KV cache is allocated. The crash is only in Unsloth's _get_vllm_state_dict wrapper.
fast_inference=False works as expected (bypasses vLLM entirely).
There is no FastLfm2Model class in Unsloth — LFM2 falls through to the generic FastModel/FastBaseModel path, which does attempt vLLM initialization.

Suggested Fix

Add handling for Mamba/SSM layers in the loop — either skip them with continue or add an elif hasattr(layer, "mixer") branch that extracts the correct state dict for Mamba layers.
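The `continue` variant of this fix could look roughly like the sketch below. It is a standalone illustration under stated assumptions, not a patch for `unsloth_zoo`: `attention_prefixes`, the `base` argument, and the key format are hypothetical, and the real function extracts weights rather than returning strings.

```python
from types import SimpleNamespace


def attention_prefixes(layers, base="model.layers"):
    """Return o_proj keys for attention layers and the indices of skipped layers."""
    prefixes = {}
    skipped = []
    for kk, layer in enumerate(layers):
        if hasattr(layer, "self_attn"):
            prefix = f"{base}.{kk}.self_attn"   # hypothetical key format
        elif hasattr(layer, "cross_attn"):
            prefix = f"{base}.{kk}.cross_attn"  # hypothetical key format
        else:
            # Neither attention attribute is present (e.g. a layer exposing `mixer`):
            # record it and skip, so prefix is never read while unbound or stale.
            skipped.append(kk)
            continue
        prefixes[kk] = f"{prefix}.o_proj"
    return prefixes, skipped


layers = [
    SimpleNamespace(mixer=object()),
    SimpleNamespace(self_attn=object()),
    SimpleNamespace(mixer=object()),
    SimpleNamespace(cross_attn=object()),
]
prefixes, skipped = attention_prefixes(layers)
```

Skipping only avoids the crash. The weights of the skipped layers are then missing from the extracted state dict, so the `elif hasattr(layer, "mixer")` variant is likely needed wherever those weights are used downstream, and it requires knowing the actual submodule names of the Mamba/SSM layers.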

Contributor guide

Research direction: Inspect the _get_vllm_state_dict function in unsloth_zoo/vllm_utils.py around line 1122. Find the loop where prefix is set only in the self_attn/cross_attn branches. Add support for Mamba/SSM layers (for example, those that use the mixer attribute) by adding an elif branch that sets the appropriate prefix and extracts the state dict. Use vLLM's handling of Lfm2ForCausalLM as a reference. Test with the LFM2.5 model.
Tech stack: Python, PyTorch
Domain: backend, machine learning
Issue type: Bug
Difficulty: 3
Estimated time: 1-3 hours
Activity status: Active
Clarity: Clear
Prerequisites: Python, PyTorch, vLLM
Beginner friendliness: 40