server : add development documentation #17760
Open

ngxson wants to merge 3 commits into ggml-org:master from ngxson:xsn/server_dev_docs
+170 −126
# llama-server Development Documentation

This document provides an in-depth technical overview of `llama-server`, intended for maintainers and contributors.

If you are an end user consuming `llama-server` as a product, please refer to the main [README](./README.md) instead.

## Backend

### Overview

The server supports two primary operating modes:

- **Inference mode**: The default mode for performing inference with a single loaded GGUF model.
- **Router mode**: Enables management of multiple inference server instances behind a single API endpoint. Requests are automatically routed to the appropriate backend instance based on the requested model.

The core architecture consists of the following components:

- `server_context`: Holds the primary inference state, including the main `llama_context` and all active slots.
- `server_slot`: An abstraction over a single “sequence” in llama.cpp, responsible for managing individual parallel inference requests.
- `server_routes`: Middleware layer between `server_context` and the HTTP interface; handles JSON parsing/formatting and request routing logic.
- `server_http_context`: Implements the HTTP server using `cpp-httplib`.
- `server_queue`: Thread-safe queue used by HTTP workers to submit new tasks to `server_context`.
- `server_response`: Thread-safe queue used by `server_context` to return results to HTTP workers.
- `server_response_reader`: Higher-level wrapper around the two queues above for cleaner code.
- `server_task`: Unit of work pushed into `server_queue`.
- `server_task_result`: Unit of result pushed into `server_response`.
- `server_tokens`: Unified representation of token sequences (supports both text and multimodal tokens); used by `server_task` and `server_slot`.
- `server_prompt_checkpoint`: For recurrent (e.g., RWKV) and SWA models, stores snapshots of KV cache state. Enables reuse when subsequent requests share the same prompt prefix, saving redundant computation (see the sketch after this list).
- `server_models`: Standalone component for managing multiple backend instances (used in router mode). It is completely independent of `server_context`.

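The prompt-prefix reuse enabled by `server_prompt_checkpoint` boils down to finding how many leading tokens a new request shares with an already-processed prompt. The following is a minimal, illustrative sketch of that idea only; the helper name is made up and the real checkpoint code also manages KV cache and recurrent-state snapshots:

```cpp
// Illustrative only: the saving comes from not re-evaluating tokens that
// match an already-processed prefix whose state was checkpointed.
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Number of leading tokens shared between a cached prompt and a new prompt.
static size_t common_prefix_len(const std::vector<llama_token> & cached,
                                const std::vector<llama_token> & incoming) {
    size_t n = 0;
    while (n < cached.size() && n < incoming.size() && cached[n] == incoming[n]) {
        n++;
    }
    return n;
}

// Only tokens past the shared prefix need to be evaluated again; the state for
// the first common_prefix_len(...) tokens can be restored from a checkpoint.
```
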
```mermaid
graph TD
    API_User <--> server_http_context
    server_http_context <-- router mode --> server_models
    server_http_context <-- inference mode --> server_routes
    server_routes -- server_task --> server_queue
    subgraph server_context
        server_queue --> server_slot
        server_slot -- server_task_result --> server_response
        server_slot[multiple server_slot]
    end
    server_response --> server_routes
```
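
The task/result hand-off shown above is a classic producer/consumer pattern: HTTP workers push work into a thread-safe queue, the single `server_context` thread consumes it and publishes results. The sketch below is a simplified, self-contained illustration of that pattern; the types are hypothetical stand-ins, not the actual `server_queue`/`server_response` implementation:

```cpp
// Simplified illustration of the queue-based hand-off between HTTP workers and
// the single server_context thread. All names are stand-ins for server_task,
// server_task_result, server_queue and server_response.
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

struct task   { int id; std::string prompt; };  // stand-in for server_task
struct result { int id; std::string text;   };  // stand-in for server_task_result

template <typename T>
class blocking_queue {  // stand-in for the thread-safe server_queue / server_response
public:
    void push(T v) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            items.push_back(std::move(v));
        }
        cv.notify_one();
    }
    T pop() {  // blocks until an item is available
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this] { return !items.empty(); });
        T v = std::move(items.front());
        items.pop_front();
        return v;
    }
private:
    std::mutex mtx;
    std::condition_variable cv;
    std::deque<T> items;
};

// HTTP worker threads push into the task queue and then wait for a result with
// a matching id (server_response_reader wraps this submit-and-wait pattern);
// the processing thread pops tasks, assigns them to slots, and pushes results back.
```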

TODO: describe how batching is handled by `server_slot`

### Thread Management

`server_context` runs on a dedicated single thread. Because it is single-threaded, heavy post-processing (especially after token generation) should be avoided, as it directly impacts multi-sequence throughput.

Each incoming HTTP request is handled by its own thread, managed by the HTTP library. The following operations are performed in HTTP worker threads:

- JSON request parsing
- Chat template application
- Tokenization
- Conversion of `server_task_result` into the final JSON response
- Error formatting into JSON
- Tracking of partial/incremental responses (e.g., streaming tool calls or reasoning steps)

**Best practices to follow:**

- All JSON formatting and chat template logic must stay in the HTTP layer.
- Avoid passing raw JSON between the HTTP layer and `server_slot`. Instead, parse everything into native C++ types as early as possible (see the sketch below).

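To illustrate the second point, a handler-side helper could look like the following. This is a simplified sketch with made-up struct and field names, not the server's actual request types; it only shows JSON being converted to plain C++ data before anything is handed to the processing thread:

```cpp
// Illustrative sketch of the "parse early" guideline: the HTTP layer converts
// incoming JSON into a plain C++ struct, and only that struct (plus tokenized
// input) crosses over to server_context / server_slot.
#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

// Hypothetical, simplified request parameters in native C++ types.
struct completion_params {
    std::string prompt;
    float       temperature = 0.8f;
    int         n_predict   = -1;
    bool        stream      = false;
};

// Runs on an HTTP worker thread: JSON parsing and validation stay here.
static completion_params parse_completion_request(const std::string & body) {
    const json j = json::parse(body);

    completion_params params;
    params.prompt      = j.at("prompt").get<std::string>();
    params.temperature = j.value("temperature", params.temperature);
    params.n_predict   = j.value("n_predict",   params.n_predict);
    params.stream      = j.value("stream",      params.stream);
    return params;
}

// The task pushed into the queue would carry completion_params, never the raw
// JSON object itself.
```
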
### Testing

`llama-server` includes an automated test suite based on `pytest`.

The framework automatically starts a `llama-server` instance, sends requests, and validates responses.

For detailed instructions, see the [test documentation](./tests/README.md).

### Notable Related PRs

- Initial server implementation: https://github.com/ggml-org/llama.cpp/pull/1443
- Parallel decoding support: https://github.com/ggml-org/llama.cpp/pull/3228
- Refactor introducing `server_queue` and `server_response`: https://github.com/ggml-org/llama.cpp/pull/5065
- Reranking endpoint: https://github.com/ggml-org/llama.cpp/pull/9510
- Multimodal model support (`libmtmd`): https://github.com/ggml-org/llama.cpp/pull/12898
- Unified KV cache handling: https://github.com/ggml-org/llama.cpp/pull/16736
- Separation of HTTP logic into dedicated files: https://github.com/ggml-org/llama.cpp/pull/17216
- Large-scale code base split into smaller files: https://github.com/ggml-org/llama.cpp/pull/17362
- Introduction of router mode: https://github.com/ggml-org/llama.cpp/pull/17470

## Web UI

The project includes a web-based user interface for interacting with `llama-server`. It supports both single-model (`MODEL` mode) and multi-model (`ROUTER` mode) operation.

The SvelteKit-based Web UI was introduced in this PR: https://github.com/ggml-org/llama.cpp/pull/14839

### Features

- **Chat interface** with streaming responses
- **Multi-model support** (ROUTER mode) - switch between models, auto-load on selection
- **Modality validation** - ensures the selected model supports the conversation's attachments (images, audio)
- **Conversation management** - branching, regeneration, editing with history preservation
- **Attachment support** - images, audio, PDFs (with vision/text fallback)
- **Configurable parameters** - temperature, top_p, etc., synced with server defaults
- **Dark/light theme**

### Tech Stack

- **SvelteKit** - frontend framework with Svelte 5 runes for reactive state
- **TailwindCSS** + **shadcn-svelte** - styling and UI components
- **Vite** - build tooling
- **IndexedDB** (Dexie) - local storage for conversations
- **LocalStorage** - user settings persistence

### Architecture

The Web UI follows a layered architecture:

```
Routes → Components → Hooks → Stores → Services → Storage/API
```

- **Stores** - reactive state management (`chatStore`, `conversationsStore`, `modelsStore`, `serverStore`, `settingsStore`)
- **Services** - stateless API/database communication (`ChatService`, `ModelsService`, `PropsService`, `DatabaseService`)
- **Hooks** - reusable logic (`useModelChangeValidation`, `useProcessingState`)

For detailed architecture diagrams, see [`tools/server/webui/docs/`](webui/docs/):

- `high-level-architecture.mmd` - full architecture with all modules
- `high-level-architecture-simplified.mmd` - simplified overview
- `data-flow-simplified-model-mode.mmd` - data flow for single-model mode
- `data-flow-simplified-router-mode.mmd` - data flow for multi-model mode
- `flows/*.mmd` - detailed per-domain flows (chat, conversations, models, etc.)

### Development

```sh
# make sure you have Node.js installed
cd tools/server/webui
npm i

# run dev server (with hot reload)
npm run dev

# run tests
npm run test

# build production bundle
npm run build
```

After `public/index.html.gz` has been generated, rebuild `llama-server` as described in the [build](#build) section to include the updated UI.

**Note:** The Vite dev server automatically proxies API requests to `http://localhost:8080`. Make sure `llama-server` is running on that port during development.

Review comment: removed this section as it's duplicated with the example under "More examples"