
Bug: Inference hangs when kv cache gets full #982

@fernandaspets

Description


What happened?

I was running the Aider Polyglot benchmark against Kimi K2 Thinking (ubergarm smol_iq3_ks). Excellent results, by the way! Anyway, the KV cache got full every now and then, and when that happened I had to manually restart ik_llama.cpp, otherwise it wouldn't accept any new requests. Probably the Aider test sent an overly long prompt full of compile errors, which filled the context window and thus the KV cache, and then ik_llama.cpp seems to have frozen up. I had to exit the inference server and reload the model from scratch to get it working again.
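
Rough reproduction sketch from the command line (host, port, and prompt length are assumptions; 8080 is llama-server's default port): send one request that overflows the context, then a normal-sized one to check whether the server still responds.

```bash
# Assumed host/port and prompt length; adjust to the running server.
LONG=$(yes "word" | head -n 60000 | tr '\n' ' ')
# First request: overflows the configured context (n_ctx); expect an HTTP 500
# with "Input prompt is too big compared to KV size."
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"$LONG\"}]}"
# Second request: a short prompt that should succeed, but on this build the
# server appears to stop responding until it is restarted.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"hello"}]}'
```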

Name and Version

./build/bin/llama-server --version
version: 4006 (da5de88)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

ERR [            update_slots] failed to decode the batch: KV cache is full - try increasing it via the context size | tid="134899983839232" timestamp=1763505182 i=0 n_batch=1 ret=1
 ERR [              send_error] task error | tid="134899983839232" timestamp=1763505182 id_multi=-1 id_task=166865 error="Input prompt is too big compared to KV size. Please try increasing KV size."
INFO [            update_slots] slot released | tid="134899983839232" timestamp=1763505182 id_slot=0 id_task=166865 n_ctx=50176 n_past=1047 n_system_tokens=0 n_cache_tokens=1047 truncated=false
INFO [            update_slots] all slots are idle | tid="134899983839232" timestamp=1763505182
INFO [      log_server_request] request | tid="134866460991488" timestamp=1763505182 remote_addr="127.0.0.1" remote_port=36114 status=500 method="POST" path="/v1/chat/completions" params={}
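
As a stopgap (not a fix for the hang itself), the error message's suggestion to increase the context size can be followed at relaunch; a minimal sketch, with a placeholder model path and a value sized above the n_ctx=50176 visible in the log:

```bash
# Workaround sketch only: relaunch llama-server with a larger context window
# so Aider's longest prompts still fit; the model path is a placeholder.
./build/bin/llama-server -m /path/to/model.gguf -c 65536
```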
