Misc. bug: llama-finetune won't work even with 17M parameters arch:llama
#18499 opened on Dec 30, 2025
Description
Name and Version
build: 7462 (6ce3d8579) with Clang 21.1.5 for Android aarch64
Running in Termux on a Samsung M33 5G, model file ggml-model-f32.gguf.
Device details:
CPU: 8× ARM Cortex-A53 @ 2.00 GHz (64-bit, ARMv8)
RAM: 8 GB
Storage: 128 GB
GPU: Mali-G52 (integrated)
OS: Android (Termux)
ABI: arm64-v8a
Other: supports NEON, DOTPROD, FP16 vector ops
Operating systems
Other? (Please let us know in description), Linux
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Command line
./build/bin/llama-finetune --model /storage/emulated/0/ysf/files/models/ggml-model-f32.gguf --file /storage/emulated/0/ysf/files/models/dataset.txt -c 512 -b 4 -ub 4 --epochs 1
Problem description & steps to reproduce
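Running the command above on a 17.43 M parameter model (arch: llama, all F32) aborts while the backward graph is being built. The assertion fails in ggml_build_backward_expand (ggml/src/ggml.c:6871), reached through ggml_opt_alloc:
GGML_ASSERT(!node->view_src || node->op == GGML_OP_CPY || node->op == GGML_OP_VIEW || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_TRANSPOSE) failed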
First Bad Commit
Unknown. For me it has not worked properly since the very first version of llama-finetune. I believe llama.cpp should include a well-structured section for creating, training, and fine-tuning models. The project is already in a great place, and I see a lot of potential in what it could achieve here.
Relevant log output
./build/bin/llama-finetune --model /storage/emulated/0/ysf/files/models/ggml-model-f32.gguf --file /storage/emulated/0/ysf/files/models/dataset.txt -c 512 -b 4 -ub 4 --epochs 1
main: force disabling memory mapping because it would result in-read-only pointers to the weights
main: force changing k cache type to f32 due to a lack of f16 support for OUT_PROD
main: force changing v cache type to f32 due to a lack of f16 support for OUT_PROD
build: 7462 (6ce3d8579) with Clang 21.1.5 for Android aarch64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: no devices with dedicated memory found
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.06 seconds
llama_model_loader: loaded meta data with 21 key-value pairs and 12 tensors from /storage/emulated/0/ysf/files/models/ggml-model-f32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llama2-from-scratch
llama_model_loader: - kv 2: llama.context_length u32 = 512
llama_model_loader: - kv 3: llama.embedding_length u32 = 256
llama_model_loader: - kv 4: llama.block_count u32 = 1
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 1024
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 256
llama_model_loader: - kv 7: llama.attention.head_count u32 = 1
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 1
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 0
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 12 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = all F32
print_info: file size = 66.50 MiB (32.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('</s>')
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 512
print_info: n_embd = 256
print_info: n_embd_inp = 256
print_info: n_layer = 1
print_info: n_head = 1
print_info: n_head_kv = 1
print_info: n_rot = 256
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 1024
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 512
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 17.43 M
print_info: general.name = llama2-from-scratch
print_info: vocab type = SPM
print_info: n_vocab = 32000
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: CPU model buffer size = 66.50 MiB
.......
common_init_result: added </s> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 512
llama_context: n_ctx_seq = 512
llama_context: n_batch = 4
llama_context: n_ubatch = 4
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: CPU output buffer size = 0.12 MiB
llama_kv_cache: CPU KV buffer size = 1.00 MiB
llama_kv_cache: size = 1.00 MiB ( 512 cells, 1 layers, 1/1 seqs), K (f32): 0.50 MiB, V (f32): 0.50 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CPU compute buffer size = 0.52 MiB
llama_context: graph nodes = 40
llama_context: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
-optimizer adamw -lr0 1e-05 -wd 0 -lr-min -1 -min-epochs -1 -epochs 1 -period 1 -val 0.05
/data/data/com.termux/files/home/llama.cpp/ggml/src/ggml.c:6871: GGML_ASSERT(!node->view_src || node->op == GGML_OP_CPY || node->op == GGML_OP_VIEW || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_TRANSPOSE) failed
0: 0x7d977104a4
1: 0x7d97710464 ggml_print_backtrace
2: 0x7d977217d4 ggml_abort
3: 0x7d9771cc8c ggml_build_backward_expand
4: 0x7d9772c594
5: 0x7d9772cd24 ggml_opt_alloc
6: 0x7d9d0d9664 _ZN13llama_context14opt_epoch_iterEP16ggml_opt_datasetP15ggml_opt_resultRKNSt6__ndk16vectorIiNS4_9allocatorIiEEEESA_R11llama_batchPFvbP16ggml_opt_contextS1_S3_lllEblll
7: 0x7d9d0d9ab8 _ZN13llama_context9opt_epochEP16ggml_opt_datasetP15ggml_opt_resultS3_lPFvbP16ggml_opt_contextS1_S3_lllES7_
8: 0x5b2828eda8
9: 0x7d9e361280 __libc_init
Aborted
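
As a quick sanity check that the model file itself is well-formed, the hyperparameters in the log can be turned back into a parameter count. This is only a sketch: the log does not list the individual tensors, so the breakdown below assumes the usual llama layout with a separate (untied) output projection, and the function name llama_param_count is made up for illustration. Under that assumption the numbers match the reported 12 tensors, 17.43 M parameters, and 66.50 MiB.

```python
# Hyperparameters copied from the print_info lines in the log.
N_VOCAB, N_EMBD, N_FF, N_LAYER = 32000, 256, 1024, 1


def llama_param_count(n_vocab, n_embd, n_ff, n_layer):
    """Return (parameter count, tensor count) for an assumed llama layout.

    Assumption (not confirmed by the log): untied output projection,
    and n_head == n_head_kv so K and V are full width.
    """
    embed = n_vocab * n_embd        # token embedding
    output = n_vocab * n_embd       # output projection (assumed untied)
    final_norm = n_embd             # final RMS norm
    attn = 4 * n_embd * n_embd      # Q, K, V, O projections
    ffn = 3 * n_embd * n_ff         # gate, up, down projections
    norms = 2 * n_embd              # attention norm + FFN norm
    params = embed + output + final_norm + n_layer * (attn + ffn + norms)
    tensors = 3 + n_layer * 9       # 3 global tensors + 9 per block
    return params, tensors


params, tensors = llama_param_count(N_VOCAB, N_EMBD, N_FF, N_LAYER)
size_mib = params * 4 / 2**20       # F32 = 4 bytes per parameter

print(f"{tensors} tensors, {params / 1e6:.2f} M parameters, {size_mib:.2f} MiB as F32")
```

Since these figures agree with what llama_model_loader and print_info report, the GGUF file looks consistent, which suggests the assertion comes from building the backward graph rather than from a malformed model.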