Misc. bug: llama-finetune won't work even with 17M parameters arch:llama · ggml-org/llama.cpp#18499

(11 comments) (0 reactions) (0 assignees) · C++ · (18,202 forks)

Labels: Review Complexity : High, bug, help wanted

Repository metrics

Stars: (110,169 stars)
PR merge metrics: (Avg merge 6d 8h) (389 merged PRs in 30d)

Description

Name and Version

build: 7462 (6ce3d8579) with Clang 21.1.5 for Android aarch64

Running in Termux on a Samsung M33 5G, with the model ggml-model-f32.gguf.

Device: Samsung M33 5G
CPU: 8× ARM Cortex-A53 @ 2.00 GHz (64-bit, ARMv8)
RAM: 8 GB
Storage: 128 GB
GPU: Mali-G52 (integrated)
OS: Android (running Termux)
ABI: arm64-v8a
Other: Supports NEON, DOTPROD, FP16 vector ops

Operating systems

Other? (Please let us know in description), Linux

Which llama.cpp modules do you know to be affected?

Other (Please specify in the next section)

Command line

./build/bin/llama-finetune --model /storage/emulated/0/ysf/files/models/ggml-model-f32.gguf  --file /storage/emulated/0/ysf/files/models/dataset.txt -c 512 -b 4 -ub 4 --epochs 1

Problem description & steps to reproduce

GGML_ASSERT(!node->view_src || node->op == GGML_OP_CPY || node->op == GGML_OP_VIEW || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_TRANSPOSE) failed
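The condition in the assertion above can be read as a simple predicate: a node passes if it is not a view, or if it is a view produced by one of the five listed ops. The following is only a minimal sketch of that logic. The `Op` enum, the `Node` struct, and `passes_view_check` are hypothetical stand-ins introduced for illustration; they are not ggml's real types or functions, and `Op::Other` is not a claim about which op triggered the failure in this report.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for ggml's op and tensor types.
// Only the two fields the assertion reads are modeled here.
enum class Op { None, Cpy, View, Reshape, Permute, Transpose, Other };

struct Node {
    Op          op;        // operation that produced this node
    const Node *view_src;  // non-null when the node aliases another node's data
};

// Mirrors the asserted condition: either the node is not a view,
// or it was produced by one of the five listed ops.
constexpr bool passes_view_check(const Node &node) {
    return node.view_src == nullptr
        || node.op == Op::Cpy
        || node.op == Op::View
        || node.op == Op::Reshape
        || node.op == Op::Permute
        || node.op == Op::Transpose;
}

// Small self-check covering the three cases.
inline void example_view_check() {
    Node base{Op::None, nullptr};      // ordinary node, not a view
    Node reshaped{Op::Reshape, &base}; // view produced by a listed op
    Node other{Op::Other, &base};      // view produced by an op outside the list
    assert(passes_view_check(base));
    assert(passes_view_check(reshaped));
    assert(!passes_view_check(other)); // this is the case that aborts in the report
}
```

Under this reading, the abort means some node in the training graph has `view_src` set while its op is outside the listed set. The assertion message alone does not say which node that is.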

First Bad Commit

It hasn’t worked properly since the very first version of llama-finetune. I believe llama.cpp should include a well-structured section for creating, training, and fine-tuning models. The project is already in a great place, and I’ve always had a big vision for what it could achieve.

Relevant log output

/build/bin/llama-finetune --model /storage/emulated/0/ysf/files/models/ggml-model-f32.gguf  --file /storage/emulated/0/ysf/files/models/dataset.txt -c 512 -b 4 -ub 4 --epochs 1
main: force disabling memory mapping because it would result in-read-only pointers to the weights
main: force changing k cache type to f32 due to a lack of f16 support for OUT_PROD
main: force changing v cache type to f32 due to a lack of f16 support for OUT_PROD
build: 7462 (6ce3d8579) with Clang 21.1.5 for Android aarch64
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: no devices with dedicated memory found
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.06 seconds
llama_model_loader: loaded meta data with 21 key-value pairs and 12 tensors from /storage/emulated/0/ysf/files/models/ggml-model-f32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = llama2-from-scratch
llama_model_loader: - kv   2:                       llama.context_length u32              = 512
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 256
llama_model_loader: - kv   4:                          llama.block_count u32              = 1
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 1024
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 256
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 1
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 1
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 0
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:   12 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = all F32
print_info: file size   = 66.50 MiB (32.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 512
print_info: n_embd           = 256
print_info: n_embd_inp       = 256
print_info: n_layer          = 1
print_info: n_head           = 1
print_info: n_head_kv        = 1
print_info: n_rot            = 256
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 1024
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 512
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = ?B
print_info: model params     = 17.43 M
print_info: general.name     = llama2-from-scratch
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =    66.50 MiB
.......
common_init_result: added </s> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_seq     = 512
llama_context: n_batch       = 4
llama_context: n_ubatch      = 4
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context:        CPU  output buffer size =     0.12 MiB
llama_kv_cache:        CPU KV buffer size =     1.00 MiB
llama_kv_cache: size =    1.00 MiB (   512 cells,   1 layers,  1/1 seqs), K (f32):    0.50 MiB, V (f32):    0.50 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:        CPU compute buffer size =     0.52 MiB
llama_context: graph nodes  = 40
llama_context: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
-optimizer adamw -lr0 1e-05 -wd 0 -lr-min -1 -min-epochs -1 -epochs 1 -period 1 -val 0.05
/data/data/com.termux/files/home/llama.cpp/ggml/src/ggml.c:6871: GGML_ASSERT(!node->view_src || node->op == GGML_OP_CPY || node->op == GGML_OP_VIEW || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_TRANSPOSE) failed
0: 0x7d977104a4
1: 0x7d97710464 ggml_print_backtrace
2: 0x7d977217d4 ggml_abort
3: 0x7d9771cc8c ggml_build_backward_expand
4: 0x7d9772c594
5: 0x7d9772cd24 ggml_opt_alloc
6: 0x7d9d0d9664 _ZN13llama_context14opt_epoch_iterEP16ggml_opt_datasetP15ggml_opt_resultRKNSt6__ndk16vectorIiNS4_9allocatorIiEEEESA_R11llama_batchPFvbP16ggml_opt_contextS1_S3_lllEblll
7: 0x7d9d0d9ab8 _ZN13llama_context9opt_epochEP16ggml_opt_datasetP15ggml_opt_resultS3_lPFvbP16ggml_opt_contextS1_S3_lllES7_
8: 0x5b2828eda8
9: 0x7d9e361280 __libc_init
Aborted                    ./build/bin/llama-finetune --model /storage/emulated/0/ysf/files/models/ggml-model-f32.gguf --file /storage/emulated/0/ysf/files/models/dataset.txt -c 512 -b 4 -ub 4 --epochs 1
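
As a sanity check on the model described in the log, the reported sizes can be reproduced from the logged hyperparameters. The sketch below assumes a tensor layout the log does not list: separate input and output embedding matrices, four square attention projections, three feed-forward matrices, two norm vectors per layer, and one final norm. That layout gives 12 tensors, matching the loader line, but it remains an assumption. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hyperparameters copied from the log output.
constexpr std::int64_t n_vocab = 32000;
constexpr std::int64_t n_embd  = 256;
constexpr std::int64_t n_ff    = 1024;
constexpr std::int64_t n_layer = 1;
constexpr std::int64_t n_ctx   = 512;

// Assumed layout (not printed in the log): untied token and output embeddings.
constexpr std::int64_t embedding_params() { return 2 * n_vocab * n_embd; }
// Assumed: Q, K, V and output projections, each n_embd x n_embd (one head, n_gqa = 1).
constexpr std::int64_t attention_params() { return n_layer * 4 * n_embd * n_embd; }
// Assumed: gate, up and down matrices, each n_embd x n_ff.
constexpr std::int64_t ffn_params() { return n_layer * 3 * n_embd * n_ff; }
// Assumed: attention norm and FFN norm per layer, plus one final output norm.
constexpr std::int64_t norm_params() { return n_layer * 2 * n_embd + n_embd; }

constexpr std::int64_t total_params() {
    return embedding_params() + attention_params() + ffn_params() + norm_params();
}

// F32 storage: 4 bytes per parameter (the log reports 32.00 BPW).
constexpr std::int64_t total_bytes() { return total_params() * 4; }

// K cache in F32 for all layers: n_ctx cells, n_embd values per cell, 4 bytes each.
constexpr std::int64_t k_cache_bytes() { return n_layer * n_ctx * n_embd * 4; }

// Compares the computed values with the figures printed in the log.
inline void check_against_log() {
    assert(total_params() == 17433344);       // log: model params = 17.43 M
    assert(total_bytes() == 69733376);        // about 66.50 MiB, as in "file size"
    assert(k_cache_bytes() == 512 * 1024);    // log: K (f32): 0.50 MiB
}
```

Under these assumptions the totals agree with the log: 17,433,344 parameters, roughly 66.50 MiB in F32, and 0.50 MiB each for the K and V caches. This suggests the model file itself loads as described and the failure is confined to the backward-graph construction step.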

Contributor guide

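Research direction: Investigate the GGML_ASSERT failure raised in ggml_build_backward_expand (ggml.c:6871), which the backtrace shows is reached from ggml_opt_alloc during llama_context::opt_epoch_iter. The assertion fails when a node's view_src is set but its op is not one of the allowed ops (GGML_OP_CPY, GGML_OP_VIEW, GGML_OP_RESHAPE, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE). Check how the training computation graph is constructed to identify which node is a view produced by a different op.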
Tech stack: cpp
Domain: backend, machine learning
Issue type: Bug
Difficulty: 3
Estimated time: 1-3 hours
Activity status: Active
Clarity: Needs investigation
Prerequisites: C++
Newbie friendliness: 60