Open
Description
What happened?
If I increase the context size and therefore have to lower -ngl so that some of the layers are offloaded to RAM, the server crashes when it receives the first request from a client. It works fine if all layers are loaded into VRAM with -rtr (at smaller context sizes), or if layers are offloaded without -rtr.
-ngl 99 -rtr → OK
-ngl 16 → OK
-ngl 16 -rtr → Crash
The full log is in the log output section below; here is the relevant segment:
ggml_compute_forward_dup_q: cache_k_l0 (view) -> cache_k_l0 (view) (copy) is of type f16
D:\LOG\ik_llama.cpp\ggml\src\ggml.c:11754: fatal error
ggml_compute_forward_dup_q: cache_k_l0 (view) -> cache_k_l0 (view) (copy) is of type f16
Name and Version
llama-server.exe --version
version: 4015 (912c98f)
built with MSVC 19.44.35219.0 for
What operating system are you seeing the problem on?
Windows
Relevant log output
d:\ik_llamacpp>llama-server.exe -m D:\llamacpp_models\UD-Q4_K_XL\Kimi-K2-Thinking-UD-Q4_K_XL-00001-of-00014.gguf --port 11434 --host 0.0.0.0 --ctx-size 262144 --temp 1.0 --min-p 0.01 --jinja --numa distribute --threads 96 -ctk q8_0 -ctv q8_0 -amb 512 -mla 3 -ot exps=CPU --parallel 1 --timeout 3600 -cram -1 -gr --chat-template-file d:\ik_llamacpp\Kimi-K2-Thinking_p1.jinja -ts 30,70 -ngl 16 -rtr
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
INFO [ main] build info | tid="3652" timestamp=1763728443 build=4015 commit="912c98f6"
INFO [ main] system info | tid="3652" timestamp=1763728443 n_threads=96 n_threads_batch=-1 total_threads=48 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
CUDA0: using device CUDA0 - 13587 MiB free
CUDA1: using device CUDA1 - 13587 MiB free
llama_model_loader: additional 13 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 57 key-value pairs and 1096 tensors from D:\llamacpp_models\UD-Q4_K_XL\Kimi-K2-Thinking-UD-Q4_K_XL-00001-of-00014.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Kimi-K2-Thinking
llama_model_loader: - kv 3: general.basename str = Kimi-K2-Thinking
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 384x14B
llama_model_loader: - kv 6: general.license str = other
llama_model_loader: - kv 7: general.license.name str = modified-mit
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: deepseek2.block_count u32 = 61
llama_model_loader: - kv 10: deepseek2.context_length u32 = 262144
llama_model_loader: - kv 11: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 12: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 13: deepseek2.attention.head_count u32 = 64
llama_model_loader: - kv 14: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 15: deepseek2.rope.freq_base f32 = 50000.000000
llama_model_loader: - kv 16: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 18: deepseek2.expert_group_count u32 = 1
llama_model_loader: - kv 19: deepseek2.expert_group_used_count u32 = 1
llama_model_loader: - kv 20: deepseek2.leading_dense_block_count u32 = 1
llama_model_loader: - kv 21: deepseek2.vocab_size u32 = 163840
llama_model_loader: - kv 22: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 23: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 24: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 25: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 26: deepseek2.attention.key_length_mla u32 = 192
llama_model_loader: - kv 27: deepseek2.attention.value_length_mla u32 = 128
llama_model_loader: - kv 28: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: deepseek2.expert_count u32 = 384
llama_model_loader: - kv 30: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 31: deepseek2.expert_weights_scale f32 = 2.827000
llama_model_loader: - kv 32: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 33: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 34: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 35: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 36: deepseek2.rope.scaling.factor f32 = 64.000000
llama_model_loader: - kv 37: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 38: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 39: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 40: tokenizer.ggml.pre str = kimi-k2
llama_model_loader: - kv 41: tokenizer.ggml.tokens arr[str,163840] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 42: tokenizer.ggml.token_type arr[i32,163840] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 43: tokenizer.ggml.merges arr[str,163328] = ["Ġ Ġ", "ĠĠ ĠĠ", "Ġ t", "i n",...
llama_model_loader: - kv 44: tokenizer.ggml.bos_token_id u32 = 163584
llama_model_loader: - kv 45: tokenizer.ggml.eos_token_id u32 = 163586
llama_model_loader: - kv 46: tokenizer.ggml.padding_token_id u32 = 163839
llama_model_loader: - kv 47: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- macr...
llama_model_loader: - kv 48: general.quantization_version u32 = 2
llama_model_loader: - kv 49: general.file_type u32 = 15
llama_model_loader: - kv 50: quantize.imatrix.file str = Kimi-K2-Thinking-GGUF/imatrix_unsloth...
llama_model_loader: - kv 51: quantize.imatrix.dataset str = unsloth_calibration_Kimi-K2-Thinking.txt
llama_model_loader: - kv 52: quantize.imatrix.entries_count u32 = 789
llama_model_loader: - kv 53: quantize.imatrix.chunks_count u32 = 50
llama_model_loader: - kv 54: split.no u16 = 0
llama_model_loader: - kv 55: split.tensors.count i32 = 1096
llama_model_loader: - kv 56: split.count u16 = 14
llama_model_loader: - type f32: 365 tensors
llama_model_loader: - type q4_1: 169 tensors
llama_model_loader: - type q8_0: 192 tensors
llama_model_loader: - type q4_K: 289 tensors
llama_model_loader: - type q5_K: 29 tensors
llama_model_loader: - type q6_K: 52 tensors
================= Adjusted mainline llama.cpp MLA tensors to ik_llama.cpp
load: printing all EOG tokens:
load: - 163586 ('<|im_end|>')
load: special tokens cache size = 256
load: token to piece cache size = 1.0606 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: n_ctx_train = 262144
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_layer = 61
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 64
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 12288
llm_load_print_meta: n_embd_v_gqa = 8192
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18432
llm_load_print_meta: n_expert = 384
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 50000.0
llm_load_print_meta: freq_scale_train = 0.015625
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 1.026 T
llm_load_print_meta: model size = 601.848 GiB (5.037 BPW)
llm_load_print_meta: repeating layers = 600.336 GiB (5.036 BPW, 1024.059 B parameters)
llm_load_print_meta: general.name = Kimi-K2-Thinking
llm_load_print_meta: n_layer_dense_lead = 1
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.8
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 163840
print_info: n_merges = 163328
print_info: BOS token = 163584 '[BOS]'
print_info: EOS token = 163586 '<|im_end|>'
print_info: EOT token = 163586 '<|im_end|>'
print_info: PAD token = 163839 '[PAD]'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 163586 '<|im_end|>'
print_info: max token length = 512
llm_load_tensors: ggml ctx size = 1.35 MiB
Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CPU
...
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/62 layers to GPU
llm_load_tensors: CPU buffer size = 608496.00 MiB
llm_load_tensors: CUDA_Host buffer size = 6196.85 MiB
llm_load_tensors: CUDA0 buffer size = 493.46 MiB
llm_load_tensors: CUDA1 buffer size = 1106.06 MiB
....................................................................................................
============ llm_prepare_mla: need to compute 61 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.1.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.2.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.3.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.4.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.5.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.6.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.7.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.8.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.9.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.10.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.11.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.12.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.13.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.14.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.15.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.16.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.17.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.18.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.19.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.20.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.21.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.22.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.23.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.24.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.25.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.26.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.27.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.28.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.29.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.30.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.31.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.32.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.33.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.34.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.35.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.36.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.37.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.38.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.39.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.40.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.41.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.42.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.43.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.44.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA_Host
Computed blk.45.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
Computed blk.46.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
Computed blk.47.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
Computed blk.48.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
Computed blk.49.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
Computed blk.50.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA1
Computed blk.51.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA1
Computed blk.52.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA1
Computed blk.53.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA1
Computed blk.54.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA1
Computed blk.55.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA1
Computed blk.56.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA1
Computed blk.57.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA1
Computed blk.58.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA1
Computed blk.59.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA1
Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA1
============ Repacked 462 tensors
llama_new_context_with_model: n_ctx = 262144
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA_Host KV buffer size = 6885.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 765.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 1683.01 MiB
llama_new_context_with_model: KV self size = 9333.00 MiB, c^KV (q8_0): 9333.00 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.63 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11428.00 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 11162.80 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 11074.13 MiB
llama_new_context_with_model: graph nodes = 24144
llama_new_context_with_model: graph splits = 619
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
======================================= HAVE_FANCY_SIMD is defined
INFO [ init] initializing slots | tid="3652" timestamp=1763728930 n_slots=1
INFO [ init] new slot | tid="3652" timestamp=1763728930 id_slot=0 n_ctx_slot=262144
prompt cache is enabled, size limit: no limit
use `--cache-ram 0` to disable the prompt cache
INFO [ main] model loaded | tid="3652" timestamp=1763728930
INFO [ main] chat template | tid="3652" timestamp=1763728930
...
INFO [ main] HTTP server listening | tid="3652" timestamp=1763728930 hostname="0.0.0.0" port="11434" n_threads_http="47"
INFO [ update_slots] all slots are idle | tid="3652" timestamp=1763728930
prompt cache: cache size: 0, n_keep: 0, n_discarded: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="3652" timestamp=1763728977 id_slot=0 id_task=0
INFO [ update_slots] kv cache rm [p0, end) | tid="3652" timestamp=1763728977 id_slot=0 id_task=0 p0=0
ggml_compute_forward_dup_q: cache_k_l0 (view) -> cache_k_l0 (view) (copy) is of type f16
D:\LOG\ik_llama.cpp\ggml\src\ggml.c:11754: fatal error
ggml_compute_forward_dup_q: cache_k_l0 (view) -> cache_k_l0 (view) (copy) is of type f16
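For anyone triaging this, the KV cache sizes reported in the log can be cross-checked with a little arithmetic. The sketch below is not taken from ik_llama.cpp. It assumes that with -mla 3 each layer caches one row of kv_lora_rank + rope.dimension_count = 576 values per token (which matches the reported key_length), that q8_0 uses the standard ggml block of 32 int8 values plus an fp16 scale (34 bytes), and that the KV cache of each layer lives in the same buffer as that layer's attn_kv_b tensor. The function name and the layer counts per buffer are my own, the latter counted from the "Computed blk.N.attn_kv_b.weight" lines.

```python
# Values copied from the log above.
N_CTX = 262144        # llama_new_context_with_model: n_ctx
N_LAYER = 61          # llm_load_print_meta: n_layer
KV_LORA_RANK = 512    # deepseek2.attention.kv_lora_rank
ROPE_DIM = 64         # deepseek2.rope.dimension_count

# Standard ggml q8_0 block layout: 32 int8 values plus one fp16 scale.
Q8_0_BLOCK_ELEMS = 32
Q8_0_BLOCK_BYTES = 34

MIB = 1024 * 1024


def kv_mib_per_layer(n_ctx=N_CTX):
    """Size in MiB of one layer's q8_0 cache (hypothetical helper)."""
    # Assumption: one 576-value row per token and layer.
    row_elems = KV_LORA_RANK + ROPE_DIM
    row_bytes = row_elems // Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES
    return row_bytes * n_ctx / MIB


# Assumption: layer counts per buffer, read off the attn_kv_b lines
# (blk.0-44 on CUDA_Host, blk.45-49 on CUDA0, blk.50-60 on CUDA1).
layers_per_buffer = {"CUDA_Host": 45, "CUDA0": 5, "CUDA1": 11}

per_layer = kv_mib_per_layer()
sizes = {name: n * per_layer for name, n in layers_per_buffer.items()}
total = N_LAYER * per_layer

for name, size in sizes.items():
    print(f"{name}: {size:.2f} MiB")
print(f"total: {total:.2f} MiB")
```

Under these assumptions each layer needs 153 MiB, which gives 6885 MiB on CUDA_Host, 765 MiB on CUDA0, 1683 MiB on CUDA1 and 9333 MiB in total. That agrees with the llama_kv_cache_init lines to within 0.01 MiB, so the cache itself appears to be allocated as q8_0 as requested, and the f16 type in the fatal error refers to the destination of the copy rather than to the cache allocation.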