Commit b43b61d
docs(server): clarify that --ctx-size is total context divided among parallel slots
When using `--parallel N`, the `--ctx-size` value is the total context divided among all slots, not the per-slot context. This is a common source of confusion. For example:

- `--ctx-size 4096 --parallel 4` → each slot gets 1024 tokens
- To get 4096 tokens per slot with 4 parallel slots, use `--ctx-size 16384`

Fixes #11681
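A minimal sketch of the two configurations described in the commit message, assuming the `llama-server` binary and a hypothetical `model.gguf`:

```sh
# 4096 total tokens shared by 4 slots → only 1024 tokens per slot
llama-server -m model.gguf --ctx-size 4096 --parallel 4

# to give each of the 4 slots 4096 tokens, request 4 * 4096 = 16384 in total
llama-server -m model.gguf --ctx-size 16384 --parallel 4
```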
1 parent bde188d commit b43b61d

File tree: 1 file changed, +2 −2 lines


tools/server/README.md

Lines changed: 2 additions & 2 deletions
@@ -46,7 +46,7 @@ The project is under active development, and we are [looking for feedback and co
 | `--cpu-strict-batch <0\|1>` | use strict CPU placement (default: same as --cpu-strict) |
 | `--prio-batch N` | set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime (default: 0)<br/> |
 | `--poll-batch <0\|1>` | use polling to wait for work (default: same as --poll) |
-| `-c, --ctx-size N` | size of the prompt context (default: 4096, 0 = loaded from model)<br/>(env: LLAMA_ARG_CTX_SIZE) |
+| `-c, --ctx-size N` | size of the prompt context (default: 4096, 0 = loaded from model). When using `--parallel N`, this is the **total** context divided among all slots (each slot gets `ctx-size / parallel` tokens). To allocate X tokens per slot with N parallel slots, set `--ctx-size` to `X * N`.<br/>(env: LLAMA_ARG_CTX_SIZE) |
 | `-n, --predict, --n-predict N` | number of tokens to predict (default: -1, -1 = infinity)<br/>(env: LLAMA_ARG_N_PREDICT) |
 | `-b, --batch-size N` | logical maximum batch size (default: 2048)<br/>(env: LLAMA_ARG_BATCH) |
 | `-ub, --ubatch-size N` | physical maximum batch size (default: 512)<br/>(env: LLAMA_ARG_UBATCH) |
@@ -72,7 +72,7 @@ The project is under active development, and we are [looking for feedback and co
 | `-ctk, --cache-type-k TYPE` | KV cache data type for K<br/>allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1<br/>(default: f16)<br/>(env: LLAMA_ARG_CACHE_TYPE_K) |
 | `-ctv, --cache-type-v TYPE` | KV cache data type for V<br/>allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1<br/>(default: f16)<br/>(env: LLAMA_ARG_CACHE_TYPE_V) |
 | `-dt, --defrag-thold N` | KV cache defragmentation threshold (DEPRECATED)<br/>(env: LLAMA_ARG_DEFRAG_THOLD) |
-| `-np, --parallel N` | number of parallel sequences to decode (default: 1)<br/>(env: LLAMA_ARG_N_PARALLEL) |
+| `-np, --parallel N` | number of parallel sequences to decode (default: 1). The total context (`--ctx-size`) is divided equally among these slots.<br/>(env: LLAMA_ARG_N_PARALLEL) |
 | `--mlock` | force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
 | `--no-mmap` | do not memory-map model (slower load but may reduce pageouts if not using mlock)<br/>(env: LLAMA_ARG_NO_MMAP) |
 | `--numa TYPE` | attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggml-org/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
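The same relationship applies to the environment variables listed in the table above; a hedged sketch, again assuming the `llama-server` binary and a hypothetical `model.gguf`:

```sh
# LLAMA_ARG_CTX_SIZE is the total context shared by all LLAMA_ARG_N_PARALLEL
# slots, so 16384 / 4 = 4096 tokens per slot
LLAMA_ARG_CTX_SIZE=16384 LLAMA_ARG_N_PARALLEL=4 llama-server -m model.gguf
```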
