
Conversation

@kitaekatt

Summary

When using --parallel N, the --ctx-size value is the total context divided among all slots, not the per-slot context. This is a common source of confusion (see #11681, #5732).

Changes

Added clarification to two flags in tools/server/README.md:

--ctx-size: Added note explaining that when using --parallel N, this is the total context divided among all slots. Each slot gets ctx-size / parallel tokens. To allocate X tokens per slot with N parallel slots, set --ctx-size to X * N.

--parallel: Added note that the total context is divided equally among these slots.

Example

  • --ctx-size 4096 --parallel 4 → each slot gets 1024 tokens
  • To get 4096 tokens per slot with 4 parallel slots, use --ctx-size 16384 --parallel 4 (the sketch below spells out the arithmetic)
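
For readers who want the arithmetic spelled out, here is a minimal C++ sketch of the per-slot division described above. It is illustrative only: the variable names are hypothetical and are not claimed to match the ones used in server.cpp.

```cpp
#include <cstdio>

int main() {
    // Hypothetical names mirroring the two flags, not actual server code.
    const int n_ctx      = 4096; // --ctx-size (total context)
    const int n_parallel = 4;    // --parallel (number of slots)

    // The total context is split equally, so each slot gets:
    const int n_ctx_per_slot = n_ctx / n_parallel; // 1024 tokens

    printf("each of %d slots gets %d tokens\n", n_parallel, n_ctx_per_slot);

    // To guarantee X tokens per slot, size the total accordingly:
    const int x            = 4096;
    const int n_ctx_needed = x * n_parallel; // 16384
    printf("for %d tokens per slot, pass --ctx-size %d\n", x, n_ctx_needed);
    return 0;
}
```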

Related Issues

#11681, #5732

…parallel slots

When using `--parallel N`, the `--ctx-size` value is the total context
divided among all slots, not the per-slot context. This is a common source
of confusion.

For example:
- `--ctx-size 4096 --parallel 4` → each slot gets 1024 tokens
- To get 4096 tokens per slot with 4 parallel slots, use `--ctx-size 16384`

Fixes ggml-org#11681
@ngxson
Collaborator

ngxson commented Dec 4, 2025

This documentation is auto-generated; modify its source in arg.cpp instead.
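
For context, the server README is regenerated from the option definitions in arg.cpp. A minimal sketch of what the change might look like there, assuming the add_opt/common_arg pattern that file uses; the help-string wording and surrounding code here are illustrative, not the actual patch:

```cpp
// Hypothetical fragment for arg.cpp, not the merged change.
add_opt(common_arg(
    {"-c", "--ctx-size"}, "N",
    string_format(
        "size of the prompt context (default: %d, 0 = loaded from model); "
        "with --parallel N this is the total context, divided equally among all slots",
        params.n_ctx),
    [](common_params & params, int value) {
        params.n_ctx = value;
    }
).set_env("LLAMA_ARG_CTX_SIZE"));
```

Regenerating the docs from this definition would then carry the note into tools/server/README.md automatically.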

@taronaeo
Collaborator

taronaeo commented Dec 5, 2025

> This is a common source of confusion (see #11681, #5732).

#17671 as well


Linked issue: Misc. bug: llama-server --ctx-size is divided by --parallel and cannot be increased?