From 711e29ec8865334679cfac06f649bf74be34463e Mon Sep 17 00:00:00 2001
From: Xuan Son Nguyen
Date: Thu, 4 Dec 2025 15:30:28 +0100
Subject: [PATCH 1/3] first draft

---
 tools/server/README-dev.md | 149 +++++++++++++++++++++++++++++++++++++
 tools/server/README.md     |  69 -----------------
 2 files changed, 149 insertions(+), 69 deletions(-)
 create mode 100644 tools/server/README-dev.md

diff --git a/tools/server/README-dev.md b/tools/server/README-dev.md
new file mode 100644
index 00000000000..8cd64f3baec
--- /dev/null
+++ b/tools/server/README-dev.md
@@ -0,0 +1,149 @@
+# llama-server development documentation
+
+this doc provides an in-depth overview of the llama-server tool, helping maintainers and contributors.
+
+if you are an user using llama-server as a product, please refer to the [main documentation](./README.md) instead
+
+## Backend
+
+### Overview
+
+Server has 2 modes of operation:
+- Inference mode: used for main inference with a model. This requires loading a GGUF model
+- Router mode: used for managing multiple instances of the server, each instance is in inference mode. This allow user to use multiple models from the same API endpoint, since they are routed via a router server.
+
+The project consists of these main components:
+
+- `server_context`: hold the main inference context, including the main `llama_context` and slots
+- `server_slot`: An abstraction layer of "sequence" in libllama, used for managing parallel sequences
+- `server_routes`: the intermediate layer between `server_context` and HTTP layer. it contains logic to parse and format JSON for HTTP requests and responses
+- `server_http_context`: hold the implementation of the HTTP server layer. currently, we use `cpp-httplib` as the implementation of the HTTP layer
+- `server_queue`: A concurrent queue that allow HTTP threads to post new tasks to `server_context`
+- `server_response`: A concurrent queue that allow `server_context` to send back response to HTTP threads
+- `server_response_reader`: A high-level abstraction of `server_queue` and `server_response`, making the code easier to read and to maintain
+- `server_task`: An unit of task, that can be pushed into `server_queue`
+- `server_task_result`: An unit of response, that can be pushed into `server_response`
+- `server_tokens`: An abstraction of token list, supporting both text and multimodal tokens; it is used by `server_task` and `server_slot`
+- `server_prompt_checkpoint`: For recurrence and SWA models, we use this class to store a "snapshot" of the state of the model's memory. This allow re-using them when the following requests has the same prompt prefix, saving some computations.
+- `server_models`: Component that allows managing multiple instances of llama-server, allow using multiple models. Please note that this is a standalone component, it independent from `server_context`
+
+```mermaid
+graph TD
+    API_User <--> server_http_context
+    server_http_context <-- router mode --> server_models
+    server_http_context <-- inference mode --> server_routes
+    server_routes -- server_task --> server_queue
+
+    subgraph server_context
+        server_queue --> server_slot
+        server_slot -- server_task_result --> server_response
+        server_slot[multiple server_slot]
+    end
+
+    server_response --> server_routes
+```
+
+TODO: metion about how batching is handled by `server_slot`
+
+### Thread management
+
+`server_context` run on its own thread. Because is single-threaded, you should not add too many processing (especially post-token generation logic) to avoid negatively impact multi-sequence performance.
+
+Each request have its own thread, managed by HTTP layer. These tasks are run inside HTTP thread:
+- JSON request parsing
+- Applying chat template
+- Tokenizing
+- Convert `server_task_result` into final JSON response
+- Error handling (formatting error into JSON response)
+- Partial response tracking (for example, tracking incremental tool calls or reasoning response)
+
+Some rules practices to follow:
+- Any JSON formatting and chat template handling must be done at HTTP level
+- Prevent passing JSON back and forth between HTTP layer and `server_slot`. Instead, parse them at HTTP layer into native C++ data types
+
+### Testing
+
+llama-server has a testing system based on `pytest`
+
+In a nutshell, this testing system automatically spawn an instance of `llama-server` and send test requests, then wait and check for the response.
+
+For more info, please refer to the (test documentation)[./tests/README.md]
+
+### Related PRs
+
+- Initial server implementation: https://github.com/ggml-org/llama.cpp/pull/1443
+- Support parallel decoding: https://github.com/ggml-org/llama.cpp/pull/3228
+- Refactor, adding `server_queue` and `server_response`: https://github.com/ggml-org/llama.cpp/pull/5065
+- Reranking support: https://github.com/ggml-org/llama.cpp/pull/9510
+- Multimodel support (`libmtmd`): https://github.com/ggml-org/llama.cpp/pull/12898
+- Unified KV support: https://github.com/ggml-org/llama.cpp/pull/16736
+- Refactor, separate HTTP logic into its own cpp/h interface: https://github.com/ggml-org/llama.cpp/pull/17216
+- Refactor, break the code base into smaller cpp/h files: https://github.com/ggml-org/llama.cpp/pull/17362
+- Adding "router mode" to server: https://github.com/ggml-org/llama.cpp/pull/17470
+
+
+## Web UI
+
+
+The project includes a web-based user interface for interacting with `llama-server`. It supports both single-model (`MODEL` mode) and multi-model (`ROUTER` mode) operation.
+
+The SvelteKit-based Web UI is introduced in this PR: https://github.com/ggml-org/llama.cpp/pull/14839
+
+### Features
+
+- **Chat interface** with streaming responses
+- **Multi-model support** (ROUTER mode) - switch between models, auto-load on selection
+- **Modality validation** - ensures selected model supports conversation's attachments (images, audio)
+- **Conversation management** - branching, regeneration, editing with history preservation
+- **Attachment support** - images, audio, PDFs (with vision/text fallback)
+- **Configurable parameters** - temperature, top_p, etc.
synced with server defaults +- **Dark/light theme** + +### Tech Stack + +- **SvelteKit** - frontend framework with Svelte 5 runes for reactive state +- **TailwindCSS** + **shadcn-svelte** - styling and UI components +- **Vite** - build tooling +- **IndexedDB** (Dexie) - local storage for conversations +- **LocalStorage** - user settings persistence + +### Architecture + +The WebUI follows a layered architecture: + +``` +Routes → Components → Hooks → Stores → Services → Storage/API +``` + +- **Stores** - reactive state management (`chatStore`, `conversationsStore`, `modelsStore`, `serverStore`, `settingsStore`) +- **Services** - stateless API/database communication (`ChatService`, `ModelsService`, `PropsService`, `DatabaseService`) +- **Hooks** - reusable logic (`useModelChangeValidation`, `useProcessingState`) + +For detailed architecture diagrams, see [`tools/server/webui/docs/`](webui/docs/): + +- `high-level-architecture.mmd` - full architecture with all modules +- `high-level-architecture-simplified.mmd` - simplified overview +- `data-flow-simplified-model-mode.mmd` - data flow for single-model mode +- `data-flow-simplified-router-mode.mmd` - data flow for multi-model mode +- `flows/*.mmd` - detailed per-domain flows (chat, conversations, models, etc.) + +### Development + +```sh +# make sure you have Node.js installed +cd tools/server/webui +npm i + +# run dev server (with hot reload) +npm run dev + +# run tests +npm run test + +# build production bundle +npm run build +``` + +After `public/index.html.gz` has been generated, rebuild `llama-server` as described in the [build](#build) section to include the updated UI. + +**Note:** The Vite dev server automatically proxies API requests to `http://localhost:8080`. Make sure `llama-server` is running on that port during development. diff --git a/tools/server/README.md b/tools/server/README.md index cb2fbcf8eb7..210468bab3d 100644 --- a/tools/server/README.md +++ b/tools/server/README.md @@ -289,69 +289,6 @@ For more details, please refer to [multimodal documentation](../../docs/multimod cmake --build build --config Release -t llama-server ``` -## Web UI - -The project includes a web-based user interface for interacting with `llama-server`. It supports both single-model (`MODEL` mode) and multi-model (`ROUTER` mode) operation. - -### Features - -- **Chat interface** with streaming responses -- **Multi-model support** (ROUTER mode) - switch between models, auto-load on selection -- **Modality validation** - ensures selected model supports conversation's attachments (images, audio) -- **Conversation management** - branching, regeneration, editing with history preservation -- **Attachment support** - images, audio, PDFs (with vision/text fallback) -- **Configurable parameters** - temperature, top_p, etc. 
synced with server defaults -- **Dark/light theme** - -### Tech Stack - -- **SvelteKit** - frontend framework with Svelte 5 runes for reactive state -- **TailwindCSS** + **shadcn-svelte** - styling and UI components -- **Vite** - build tooling -- **IndexedDB** (Dexie) - local storage for conversations -- **LocalStorage** - user settings persistence - -### Architecture - -The WebUI follows a layered architecture: - -``` -Routes → Components → Hooks → Stores → Services → Storage/API -``` - -- **Stores** - reactive state management (`chatStore`, `conversationsStore`, `modelsStore`, `serverStore`, `settingsStore`) -- **Services** - stateless API/database communication (`ChatService`, `ModelsService`, `PropsService`, `DatabaseService`) -- **Hooks** - reusable logic (`useModelChangeValidation`, `useProcessingState`) - -For detailed architecture diagrams, see [`tools/server/webui/docs/`](webui/docs/): - -- `high-level-architecture.mmd` - full architecture with all modules -- `high-level-architecture-simplified.mmd` - simplified overview -- `data-flow-simplified-model-mode.mmd` - data flow for single-model mode -- `data-flow-simplified-router-mode.mmd` - data flow for multi-model mode -- `flows/*.mmd` - detailed per-domain flows (chat, conversations, models, etc.) - -### Development - -```sh -# make sure you have Node.js installed -cd tools/server/webui -npm i - -# run dev server (with hot reload) -npm run dev - -# run tests -npm run test - -# build production bundle -npm run build -``` - -After `public/index.html.gz` has been generated, rebuild `llama-server` as described in the [build](#build) section to include the updated UI. - -**Note:** The Vite dev server automatically proxies API requests to `http://localhost:8080`. Make sure `llama-server` is running on that port during development. - ## Quick Start To get started right away, run the following command, making sure to use the correct path for the model you have: @@ -391,12 +328,6 @@ curl --request POST \ --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}' ``` -## Advanced testing - -We implemented a [server test framework](./tests/README.md) using human-readable scenario. - -*Before submitting an issue, please try to reproduce it with this format.* - ## Node JS Test You need to have [Node.js](https://nodejs.org/en) installed. From 28670b7c325f75e558dc378c9a925223574e7724 Mon Sep 17 00:00:00 2001 From: Xuan Son Nguyen Date: Thu, 4 Dec 2025 15:32:02 +0100 Subject: [PATCH 2/3] rewrite --- tools/server/README-dev.md | 98 +++++++++++++++++++------------------- 1 file changed, 50 insertions(+), 48 deletions(-) diff --git a/tools/server/README-dev.md b/tools/server/README-dev.md index 8cd64f3baec..67ebe1aafee 100644 --- a/tools/server/README-dev.md +++ b/tools/server/README-dev.md @@ -1,31 +1,32 @@ -# llama-server development documentation +# llama-server Development Documentation -this doc provides an in-depth overview of the llama-server tool, helping maintainers and contributors. +This document provides an in-depth technical overview of `llama-server`, intended for maintainers and contributors. -if you are an user using llama-server as a product, please refer to the [main documentation](./README.md) instead +If you are an end user consuming `llama-server` as a product, please refer to the main [README](./README.md) instead. ## Backend ### Overview -Server has 2 modes of operation: -- Inference mode: used for main inference with a model. 
This requires loading a GGUF model -- Router mode: used for managing multiple instances of the server, each instance is in inference mode. This allow user to use multiple models from the same API endpoint, since they are routed via a router server. - -The project consists of these main components: - -- `server_context`: hold the main inference context, including the main `llama_context` and slots -- `server_slot`: An abstraction layer of "sequence" in libllama, used for managing parallel sequences -- `server_routes`: the intermediate layer between `server_context` and HTTP layer. it contains logic to parse and format JSON for HTTP requests and responses -- `server_http_context`: hold the implementation of the HTTP server layer. currently, we use `cpp-httplib` as the implementation of the HTTP layer -- `server_queue`: A concurrent queue that allow HTTP threads to post new tasks to `server_context` -- `server_response`: A concurrent queue that allow `server_context` to send back response to HTTP threads -- `server_response_reader`: A high-level abstraction of `server_queue` and `server_response`, making the code easier to read and to maintain -- `server_task`: An unit of task, that can be pushed into `server_queue` -- `server_task_result`: An unit of response, that can be pushed into `server_response` -- `server_tokens`: An abstraction of token list, supporting both text and multimodal tokens; it is used by `server_task` and `server_slot` -- `server_prompt_checkpoint`: For recurrence and SWA models, we use this class to store a "snapshot" of the state of the model's memory. This allow re-using them when the following requests has the same prompt prefix, saving some computations. -- `server_models`: Component that allows managing multiple instances of llama-server, allow using multiple models. Please note that this is a standalone component, it independent from `server_context` +The server supports two primary operating modes: + +- **Inference mode**: The default mode for performing inference with a single loaded GGUF model. +- **Router mode**: Enables management of multiple inference server instances behind a single API endpoint. Requests are automatically routed to the appropriate backend instance based on the requested model. + +The core architecture consists of the following components: + +- `server_context`: Holds the primary inference state, including the main `llama_context` and all active slots. +- `server_slot`: An abstraction over a single “sequence” in llama.cpp, responsible for managing individual parallel inference requests. +- `server_routes`: Middleware layer between `server_context` and the HTTP interface; handles JSON parsing/formatting and request routing logic. +- `server_http_context`: Implements the HTTP server using `cpp-httplib`. +- `server_queue`: Thread-safe queue used by HTTP workers to submit new tasks to `server_context`. +- `server_response`: Thread-safe queue used by `server_context` to return results to HTTP workers. +- `server_response_reader`: Higher-level wrapper around the two queues above for cleaner code. +- `server_task`: Unit of work pushed into `server_queue`. +- `server_task_result`: Unit of result pushed into `server_response`. +- `server_tokens`: Unified representation of token sequences (supports both text and multimodal tokens); used by `server_task` and `server_slot`. +- `server_prompt_checkpoint`: For recurrent (e.g., RWKV) and SWA models, stores snapshots of KV cache state. 
Enables reuse when subsequent requests share the same prompt prefix, saving redundant computation. +- `server_models`: Standalone component for managing multiple backend instances (used in router mode). It is completely independent of `server_context`. ```mermaid graph TD @@ -33,58 +34,59 @@ graph TD server_http_context <-- router mode --> server_models server_http_context <-- inference mode --> server_routes server_routes -- server_task --> server_queue - subgraph server_context server_queue --> server_slot server_slot -- server_task_result --> server_response server_slot[multiple server_slot] end - server_response --> server_routes ``` -TODO: metion about how batching is handled by `server_slot` +TODO: mention about how batching is handled by `server_slot` -### Thread management +### Thread Management -`server_context` run on its own thread. Because is single-threaded, you should not add too many processing (especially post-token generation logic) to avoid negatively impact multi-sequence performance. +`server_context` runs on a dedicated single thread. Because it is single-threaded, heavy post-processing (especially after token generation) should be avoided, as it directly impacts multi-sequence throughput. + +Each incoming HTTP request is handled by its own thread managed by the HTTP library. The following operations are performed in HTTP worker threads: -Each request have its own thread, managed by HTTP layer. These tasks are run inside HTTP thread: - JSON request parsing -- Applying chat template -- Tokenizing -- Convert `server_task_result` into final JSON response -- Error handling (formatting error into JSON response) -- Partial response tracking (for example, tracking incremental tool calls or reasoning response) +- Chat template application +- Tokenization +- Conversion of `server_task_result` into final JSON response +- Error formatting into JSON +- Tracking of partial/incremental responses (e.g., streaming tool calls or reasoning steps) + +**Best practices to follow:** -Some rules practices to follow: -- Any JSON formatting and chat template handling must be done at HTTP level -- Prevent passing JSON back and forth between HTTP layer and `server_slot`. Instead, parse them at HTTP layer into native C++ data types +- All JSON formatting and chat template logic must stay in the HTTP layer. +- Avoid passing raw JSON between the HTTP layer and `server_slot`. Instead, parse everything into native C++ types as early as possible. ### Testing -llama-server has a testing system based on `pytest` +`llama-server` includes an automated test suite based on `pytest`. -In a nutshell, this testing system automatically spawn an instance of `llama-server` and send test requests, then wait and check for the response. +The framework automatically starts a `llama-server` instance, sends requests, and validates responses. -For more info, please refer to the (test documentation)[./tests/README.md] +For detailed instructions, see the [test documentation](./tests/README.md). 
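To illustrate the idea only (this is not the project's actual test fixtures), here is a minimal `pytest`-style sketch written against an already-running server on `http://localhost:8080`, using the `/health` and `/completion` endpoints and the `content` response field shown elsewhere in the README; the real suite under `./tests` spawns and manages the `llama-server` process itself and should be used for actual contributions:

```python
# Minimal standalone sketch (assumption: a llama-server instance is already
# running at http://localhost:8080 with a model loaded; the real pytest suite
# in tools/server/tests starts the server by itself).
import requests

BASE_URL = "http://localhost:8080"


def test_health():
    # /health returns HTTP 200 once the model is loaded and the server is ready
    res = requests.get(f"{BASE_URL}/health")
    assert res.status_code == 200


def test_completion():
    # POST /completion with a prompt and a small n_predict, then verify that
    # the response carries generated text in its "content" field
    payload = {
        "prompt": "Building a website can be done in 10 simple steps:",
        "n_predict": 16,
    }
    res = requests.post(f"{BASE_URL}/completion", json=payload)
    assert res.status_code == 200
    body = res.json()
    assert isinstance(body["content"], str)
    assert len(body["content"]) > 0
```

After starting `llama-server` with any small model, such a file can be run with `pytest`; the project's own tests follow the same request/assert pattern but additionally handle server startup, presets, and teardown.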
-### Related PRs +### Notable Related PRs - Initial server implementation: https://github.com/ggml-org/llama.cpp/pull/1443 -- Support parallel decoding: https://github.com/ggml-org/llama.cpp/pull/3228 -- Refactor, adding `server_queue` and `server_response`: https://github.com/ggml-org/llama.cpp/pull/5065 -- Reranking support: https://github.com/ggml-org/llama.cpp/pull/9510 -- Multimodel support (`libmtmd`): https://github.com/ggml-org/llama.cpp/pull/12898 -- Unified KV support: https://github.com/ggml-org/llama.cpp/pull/16736 -- Refactor, separate HTTP logic into its own cpp/h interface: https://github.com/ggml-org/llama.cpp/pull/17216 -- Refactor, break the code base into smaller cpp/h files: https://github.com/ggml-org/llama.cpp/pull/17362 -- Adding "router mode" to server: https://github.com/ggml-org/llama.cpp/pull/17470 +- Parallel decoding support: https://github.com/ggml-org/llama.cpp/pull/3228 +- Refactor introducing `server_queue` and `server_response`: https://github.com/ggml-org/llama.cpp/pull/5065 +- Reranking endpoint: https://github.com/ggml-org/llama.cpp/pull/9510 +- Multimodal model support (`libmtmd`): https://github.com/ggml-org/llama.cpp/pull/12898 +- Unified KV cache handling: https://github.com/ggml-org/llama.cpp/pull/16736 +- Separation of HTTP logic into dedicated files: https://github.com/ggml-org/llama.cpp/pull/17216 +- Large-scale code base split into smaller files: https://github.com/ggml-org/llama.cpp/pull/17362 +- Introduction of router mode: https://github.com/ggml-org/llama.cpp/pull/17470 -## Web UI +## Web UI + The project includes a web-based user interface for interacting with `llama-server`. It supports both single-model (`MODEL` mode) and multi-model (`ROUTER` mode) operation. The SvelteKit-based Web UI is introduced in this PR: https://github.com/ggml-org/llama.cpp/pull/14839 From 00d46b46504cb5543895b9bc63005e6abbd5085b Mon Sep 17 00:00:00 2001 From: Xuan Son Nguyen Date: Thu, 4 Dec 2025 15:44:17 +0100 Subject: [PATCH 3/3] update & remove duplicated sections --- tools/server/README.md | 76 +++++++++++------------------------------- 1 file changed, 19 insertions(+), 57 deletions(-) diff --git a/tools/server/README.md b/tools/server/README.md index 210468bab3d..48fc7056b69 100644 --- a/tools/server/README.md +++ b/tools/server/README.md @@ -2,7 +2,7 @@ Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**. -Set of LLM REST APIs and a simple web front end to interact with llama.cpp. +Set of LLM REST APIs and a web UI to interact with llama.cpp. **Features:** * LLM inference of F16 and quantized models on GPU and CPU @@ -19,7 +19,7 @@ Set of LLM REST APIs and a simple web front end to interact with llama.cpp. * Speculative decoding * Easy-to-use web UI -The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggml-org/llama.cpp/issues/4216). +For the ful list of features, please refer to [server's changelog](https://github.com/ggml-org/llama.cpp/issues/9291) ## Usage @@ -317,7 +317,7 @@ docker run -p 8080:8080 -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:se docker run -p 8080:8080 -v /path/to/models:/models --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99 ``` -## Testing with CURL +## Using with CURL Using [curl](https://curl.se/). On Windows, `curl.exe` should be available in the base OS. 
@@ -328,40 +328,6 @@ curl --request POST \ --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}' ``` -## Node JS Test - -You need to have [Node.js](https://nodejs.org/en) installed. - -```bash -mkdir llama-client -cd llama-client -``` - -Create an index.js file and put this inside: - -```javascript -const prompt = "Building a website can be done in 10 simple steps:" - -async function test() { - let response = await fetch("http://127.0.0.1:8080/completion", { - method: "POST", - body: JSON.stringify({ - prompt, - n_predict: 64, - }) - }) - console.log((await response.json()).content) -} - -test() -``` - -And run it: - -```bash -node index.js -``` - ## API Endpoints ### GET `/health`: Returns health check result @@ -1565,6 +1531,22 @@ Response: } ``` +## API errors + +`llama-server` returns errors in the same format as OAI: https://github.com/openai/openai-openapi + +Example of an error: + +```json +{ + "error": { + "code": 401, + "message": "Invalid API Key", + "type": "authentication_error" + } +} +``` + ## More examples ### Interactive mode @@ -1584,26 +1566,6 @@ Run with bash: bash chat.sh ``` -### OAI-like API - -The HTTP `llama-server` supports an OAI-like API: https://github.com/openai/openai-openapi - -### API errors - -`llama-server` returns errors in the same format as OAI: https://github.com/openai/openai-openapi - -Example of an error: - -```json -{ - "error": { - "code": 401, - "message": "Invalid API Key", - "type": "authentication_error" - } -} -``` - Apart from error types supported by OAI, we also have custom types that are specific to functionalities of llama.cpp: **When /metrics or /slots endpoint is disabled**