Description
Name and Version
version: 716 (10e9780)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server -m ./model/with/4096context -c 16384 --rope-scaling yarn --rope-scale 4
Problem description & steps to reproduce
llama-server does not allow the extended context to be used. The server output announces that the slot context is being capped:
the slot context (%d) exceeds the training context of the model (%d) - capping\n
This check is not aware of RoPE settings or other user configuration that would allow a longer context. Disabling the check introduced in cd5e3b5 allowed me to use the longer context via the RoPE settings.
This "capping" forces the model to load with 4,096 tokens of context and causes my long-context queries to fail.
Please allow us to override this cap. For users who don't know what they are doing it is probably a helpful safeguard, but there should be a way for advanced users to disable it.
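One possible shape for such an override, purely as a hypothetical sketch (no --no-ctx-cap flag or RoPE-aware cap exists in llama.cpp today): either skip the cap entirely when the user asks for it, or raise the cap by the RoPE scale factor so a command line like the one above is accepted:

```cpp
// Hypothetical sketch of an override - neither the flag nor the helper
// exists in llama.cpp; this only illustrates the requested behaviour.
#include <climits>

struct ctx_cap_params {
    bool  no_ctx_cap = false; // hypothetical opt-out, e.g. a --no-ctx-cap flag
    float rope_scale = 1.0f;  // value of --rope-scale
};

static int effective_ctx_limit(int n_ctx_train, const ctx_cap_params & p) {
    if (p.no_ctx_cap) {
        return INT_MAX; // advanced users take full responsibility
    }
    // Otherwise scale the cap by the RoPE factor, so that a 4096-token model
    // started with --rope-scale 4 accepts the requested 16384-token context.
    return (int) ((float) n_ctx_train * p.rope_scale);
}

int main() {
    ctx_cap_params p;
    p.rope_scale = 4.0f;
    return effective_ctx_limit(4096, p) >= 16384 ? 0 : 1; // 16384 is accepted
}
```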
First Bad Commit
Relevant log output
the slot context (%d) exceeds the training context of the model (%d) - capping\n