docs(server): clarify that --ctx-size is total context divided among parallel slots
When using `--parallel N`, the `--ctx-size` value is the total context, which is
divided equally among all slots; it is not the per-slot context. This is a common
source of confusion.
For example:
- `--ctx-size 4096 --parallel 4` → each slot gets 1024 tokens
- To get 4096 tokens per slot with 4 parallel slots, use `--ctx-size 16384`
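As a concrete sketch of the two cases above (the `llama-server` binary name and the `-m` model flag are assumptions for illustration, not part of this change):

```sh
# 4096 total context split across 4 slots → each slot gets 4096 / 4 = 1024 tokens
llama-server -m model.gguf --ctx-size 4096 --parallel 4

# to give each of 4 slots a full 4096 tokens, request 4 * 4096 = 16384 total
llama-server -m model.gguf --ctx-size 16384 --parallel 4
```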
Fixes #11681
|`--poll-batch <0\|1>`| use polling to wait for work (default: same as --poll) |
-|`-c, --ctx-size N`| size of the prompt context (default: 4096, 0 = loaded from model)<br/>(env: LLAMA_ARG_CTX_SIZE) |
+|`-c, --ctx-size N`| size of the prompt context (default: 4096, 0 = loaded from model). When using `--parallel N`, this is the **total** context divided among all slots (each slot gets `ctx-size / parallel` tokens). To allocate X tokens per slot with N parallel slots, set `--ctx-size` to `X * N`.<br/>(env: LLAMA_ARG_CTX_SIZE) |
|`-n, --predict, --n-predict N`| number of tokens to predict (default: -1, -1 = infinity)<br/>(env: LLAMA_ARG_N_PREDICT) |
-|`-np, --parallel N`| number of parallel sequences to decode (default: 1)<br/>(env: LLAMA_ARG_N_PARALLEL) |
+|`-np, --parallel N`| number of parallel sequences to decode (default: 1). The total context (`--ctx-size`) is divided equally among these slots.<br/>(env: LLAMA_ARG_N_PARALLEL) |
|`--mlock`| force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
|`--no-mmap`| do not memory-map model (slower load but may reduce pageouts if not using mlock)<br/>(env: LLAMA_ARG_NO_MMAP) |
|`--numa TYPE`| attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggml-org/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
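The table also lists environment-variable equivalents for these flags; a minimal sketch of the same per-slot setup expressed that way (again assuming the `llama-server` binary and `-m` flag for illustration):

```sh
# equivalent configuration via the documented environment variables:
# 16384 total context / 4 parallel slots = 4096 tokens per slot
LLAMA_ARG_CTX_SIZE=16384 LLAMA_ARG_N_PARALLEL=4 llama-server -m model.gguf
```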