151 changes: 151 additions & 0 deletions tools/server/README-dev.md
@@ -0,0 +1,151 @@
# llama-server Development Documentation

This document provides an in-depth technical overview of `llama-server`, intended for maintainers and contributors.

If you are an end user consuming `llama-server` as a product, please refer to the main [README](./README.md) instead.

## Backend

### Overview

The server supports two primary operating modes:

- **Inference mode**: The default mode for performing inference with a single loaded GGUF model.
- **Router mode**: Enables management of multiple inference server instances behind a single API endpoint. Requests are automatically routed to the appropriate backend instance based on the requested model.

The core architecture consists of the following components:

- `server_context`: Holds the primary inference state, including the main `llama_context` and all active slots.
- `server_slot`: An abstraction over a single “sequence” in llama.cpp, responsible for managing individual parallel inference requests.
- `server_routes`: Middleware layer between `server_context` and the HTTP interface; handles JSON parsing/formatting and request routing logic.
- `server_http_context`: Implements the HTTP server using `cpp-httplib`.
- `server_queue`: Thread-safe queue used by HTTP workers to submit new tasks to `server_context`.
- `server_response`: Thread-safe queue used by `server_context` to return results to HTTP workers.
- `server_response_reader`: Higher-level wrapper around the two queues above for cleaner code.
- `server_task`: Unit of work pushed into `server_queue`.
- `server_task_result`: Unit of result pushed into `server_response`.
- `server_tokens`: Unified representation of token sequences (supports both text and multimodal tokens); used by `server_task` and `server_slot`.
- `server_prompt_checkpoint`: For recurrent (e.g., RWKV) and SWA models, stores snapshots of the KV cache state so that they can be reused when subsequent requests share the same prompt prefix, avoiding redundant computation.
- `server_models`: Standalone component for managing multiple backend instances (used in router mode). It is completely independent of `server_context`.

```mermaid
graph TD
API_User <--> server_http_context
server_http_context <-- router mode --> server_models
server_http_context <-- inference mode --> server_routes
server_routes -- server_task --> server_queue
subgraph server_context
server_queue --> server_slot
server_slot -- server_task_result --> server_response
server_slot[multiple server_slot]
end
server_response --> server_routes
```
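
The diagram above boils down to a pair of thread-safe queues connecting the HTTP workers to the single inference thread. The following self-contained C++ sketch models that hand-off; the `toy_*` types and the `post`/`wait_and_pop` methods are illustrative stand-ins for this document, not the actual `server_queue`/`server_response` API.

```cpp
// Toy model of the server_queue / server_response hand-off.
// Names and structure are illustrative only; the real implementation lives in tools/server.
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>
#include <thread>

struct toy_task   { int id; std::string prompt; };   // stands in for server_task
struct toy_result { int id; std::string text;   };   // stands in for server_task_result

template <typename T>
class toy_queue {                                     // stands in for server_queue / server_response
    std::deque<T>           items;
    std::mutex              mtx;
    std::condition_variable cv;
public:
    void post(T item) {
        { std::lock_guard<std::mutex> lk(mtx); items.push_back(std::move(item)); }
        cv.notify_one();
    }
    T wait_and_pop() {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] { return !items.empty(); });
        T item = std::move(items.front());
        items.pop_front();
        return item;
    }
};

int main() {
    toy_queue<toy_task>   task_queue;    // HTTP workers -> inference thread
    toy_queue<toy_result> result_queue;  // inference thread -> HTTP workers

    // Single inference thread, analogous to the server_context loop.
    std::thread inference([&] {
        toy_task task = task_queue.wait_and_pop();
        // ... the real code would assign the task to a server_slot and run decoding ...
        result_queue.post({task.id, "generated text for: " + task.prompt});
    });

    // An HTTP worker submits a task and blocks until its result arrives.
    task_queue.post({1, "Hello"});
    toy_result res = result_queue.wait_and_pop();
    std::printf("result %d: %s\n", res.id, res.text.c_str());

    inference.join();
    return 0;
}
```

The real `server_task` and `server_task_result` carry much richer state, and, as noted in the component list above, `server_response_reader` wraps the two queues behind a simpler interface.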

TODO: describe how batching is handled by `server_slot`.

### Thread Management

`server_context` runs on a dedicated single thread. Because it is single-threaded, heavy post-processing (especially after token generation) should be avoided, as it directly impacts multi-sequence throughput.

Each incoming HTTP request is handled by its own thread managed by the HTTP library. The following operations are performed in HTTP worker threads:

- JSON request parsing
- Chat template application
- Tokenization
- Conversion of `server_task_result` into final JSON response
- Error formatting into JSON
- Tracking of partial/incremental responses (e.g., streaming tool calls or reasoning steps)

**Best practices to follow:**

- All JSON formatting and chat template logic must stay in the HTTP layer.
- Avoid passing raw JSON between the HTTP layer and `server_slot`. Instead, parse everything into native C++ types as early as possible, as illustrated in the sketch below.
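
A minimal sketch of this "parse early" practice, assuming the server's existing `nlohmann::json` dependency; `toy_completion_params` and its fields are hypothetical stand-ins, not the actual server structs:

```cpp
// Hypothetical sketch: the HTTP worker converts the request body into a plain
// C++ struct before anything is handed to server_context.
#include <nlohmann/json.hpp>
#include <string>

// Illustrative stand-in for the native task parameters; not the real server types.
struct toy_completion_params {
    std::string prompt;
    int         n_predict   = 128;
    float       temperature = 0.8f;
};

static toy_completion_params parse_request(const std::string & body) {
    nlohmann::json j = nlohmann::json::parse(body);   // may throw on malformed JSON
    toy_completion_params p;
    p.prompt      = j.value("prompt", std::string());
    p.n_predict   = j.value("n_predict", p.n_predict);
    p.temperature = j.value("temperature", p.temperature);
    return p;                                          // only native types cross this boundary
}

int main() {
    toy_completion_params p = parse_request(R"({"prompt":"Hello","temperature":0.7})");
    // p can now be packed into a task without carrying any JSON along.
    return p.prompt.empty() ? 1 : 0;
}
```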

### Testing

`llama-server` includes an automated test suite based on `pytest`.

The framework automatically starts a `llama-server` instance, sends requests, and validates responses.

For detailed instructions, see the [test documentation](./tests/README.md).

### Notable Related PRs

- Initial server implementation: https://github.com/ggml-org/llama.cpp/pull/1443
- Parallel decoding support: https://github.com/ggml-org/llama.cpp/pull/3228
- Refactor introducing `server_queue` and `server_response`: https://github.com/ggml-org/llama.cpp/pull/5065
- Reranking endpoint: https://github.com/ggml-org/llama.cpp/pull/9510
- Multimodal model support (`libmtmd`): https://github.com/ggml-org/llama.cpp/pull/12898
- Unified KV cache handling: https://github.com/ggml-org/llama.cpp/pull/16736
- Separation of HTTP logic into dedicated files: https://github.com/ggml-org/llama.cpp/pull/17216
- Large-scale code base split into smaller files: https://github.com/ggml-org/llama.cpp/pull/17362
- Introduction of router mode: https://github.com/ggml-org/llama.cpp/pull/17470




## Web UI

The project includes a web-based user interface for interacting with `llama-server`. It supports both single-model (`MODEL` mode) and multi-model (`ROUTER` mode) operation.

The SvelteKit-based Web UI was introduced in this PR: https://github.com/ggml-org/llama.cpp/pull/14839

### Features

- **Chat interface** with streaming responses
- **Multi-model support** (ROUTER mode) - switch between models, auto-load on selection
- **Modality validation** - ensures selected model supports conversation's attachments (images, audio)
- **Conversation management** - branching, regeneration, editing with history preservation
- **Attachment support** - images, audio, PDFs (with vision/text fallback)
- **Configurable parameters** - temperature, top_p, etc. synced with server defaults
- **Dark/light theme**

### Tech Stack

- **SvelteKit** - frontend framework with Svelte 5 runes for reactive state
- **TailwindCSS** + **shadcn-svelte** - styling and UI components
- **Vite** - build tooling
- **IndexedDB** (Dexie) - local storage for conversations
- **LocalStorage** - user settings persistence

### Architecture

The WebUI follows a layered architecture:

```
Routes → Components → Hooks → Stores → Services → Storage/API
```

- **Stores** - reactive state management (`chatStore`, `conversationsStore`, `modelsStore`, `serverStore`, `settingsStore`)
- **Services** - stateless API/database communication (`ChatService`, `ModelsService`, `PropsService`, `DatabaseService`)
- **Hooks** - reusable logic (`useModelChangeValidation`, `useProcessingState`)

For detailed architecture diagrams, see [`tools/server/webui/docs/`](webui/docs/):

- `high-level-architecture.mmd` - full architecture with all modules
- `high-level-architecture-simplified.mmd` - simplified overview
- `data-flow-simplified-model-mode.mmd` - data flow for single-model mode
- `data-flow-simplified-router-mode.mmd` - data flow for multi-model mode
- `flows/*.mmd` - detailed per-domain flows (chat, conversations, models, etc.)

### Development

```sh
# make sure you have Node.js installed
cd tools/server/webui
npm i

# run dev server (with hot reload)
npm run dev

# run tests
npm run test

# build production bundle
npm run build
```

After `public/index.html.gz` has been generated, rebuild `llama-server` as described in the [build section of the main README](./README.md#build) to include the updated UI.

**Note:** The Vite dev server automatically proxies API requests to `http://localhost:8080`. Make sure `llama-server` is running on that port during development.
145 changes: 19 additions & 126 deletions tools/server/README.md
@@ -2,7 +2,7 @@

Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.

Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
Set of LLM REST APIs and a web UI to interact with llama.cpp.

**Features:**
* LLM inference of F16 and quantized models on GPU and CPU
@@ -19,7 +19,7 @@ Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
* Speculative decoding
* Easy-to-use web UI

The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggml-org/llama.cpp/issues/4216).
For the full list of features, please refer to the [server's changelog](https://github.com/ggml-org/llama.cpp/issues/9291).

## Usage

@@ -289,69 +289,6 @@ For more details, please refer to [multimodal documentation](../../docs/multimod
cmake --build build --config Release -t llama-server
```

## Web UI

The project includes a web-based user interface for interacting with `llama-server`. It supports both single-model (`MODEL` mode) and multi-model (`ROUTER` mode) operation.

### Features

- **Chat interface** with streaming responses
- **Multi-model support** (ROUTER mode) - switch between models, auto-load on selection
- **Modality validation** - ensures selected model supports conversation's attachments (images, audio)
- **Conversation management** - branching, regeneration, editing with history preservation
- **Attachment support** - images, audio, PDFs (with vision/text fallback)
- **Configurable parameters** - temperature, top_p, etc. synced with server defaults
- **Dark/light theme**

### Tech Stack

- **SvelteKit** - frontend framework with Svelte 5 runes for reactive state
- **TailwindCSS** + **shadcn-svelte** - styling and UI components
- **Vite** - build tooling
- **IndexedDB** (Dexie) - local storage for conversations
- **LocalStorage** - user settings persistence

### Architecture

The WebUI follows a layered architecture:

```
Routes → Components → Hooks → Stores → Services → Storage/API
```

- **Stores** - reactive state management (`chatStore`, `conversationsStore`, `modelsStore`, `serverStore`, `settingsStore`)
- **Services** - stateless API/database communication (`ChatService`, `ModelsService`, `PropsService`, `DatabaseService`)
- **Hooks** - reusable logic (`useModelChangeValidation`, `useProcessingState`)

For detailed architecture diagrams, see [`tools/server/webui/docs/`](webui/docs/):

- `high-level-architecture.mmd` - full architecture with all modules
- `high-level-architecture-simplified.mmd` - simplified overview
- `data-flow-simplified-model-mode.mmd` - data flow for single-model mode
- `data-flow-simplified-router-mode.mmd` - data flow for multi-model mode
- `flows/*.mmd` - detailed per-domain flows (chat, conversations, models, etc.)

### Development

```sh
# make sure you have Node.js installed
cd tools/server/webui
npm i
# run dev server (with hot reload)
npm run dev
# run tests
npm run test
# build production bundle
npm run build
```

After `public/index.html.gz` has been generated, rebuild `llama-server` as described in the [build](#build) section to include the updated UI.

**Note:** The Vite dev server automatically proxies API requests to `http://localhost:8080`. Make sure `llama-server` is running on that port during development.

## Quick Start

To get started right away, run the following command, making sure to use the correct path for the model you have:
@@ -380,7 +317,7 @@ docker run -p 8080:8080 -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:server-cuda -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99
docker run -p 8080:8080 -v /path/to/models:/models --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99
```

## Testing with CURL
## Using with CURL

Using [curl](https://curl.se/). On Windows, `curl.exe` should be available in the base OS.

@@ -391,46 +328,6 @@ curl --request POST \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
```

## Advanced testing

We implemented a [server test framework](./tests/README.md) using human-readable scenario.

*Before submitting an issue, please try to reproduce it with this format.*

## Node JS Test

> _Author's note:_ removed this section as it's duplicated with the example under "More examples".


You need to have [Node.js](https://nodejs.org/en) installed.

```bash
mkdir llama-client
cd llama-client
```

Create an index.js file and put this inside:

```javascript
const prompt = "Building a website can be done in 10 simple steps:"
async function test() {
let response = await fetch("http://127.0.0.1:8080/completion", {
method: "POST",
body: JSON.stringify({
prompt,
n_predict: 64,
})
})
console.log((await response.json()).content)
}
test()
```

And run it:

```bash
node index.js
```

## API Endpoints

### GET `/health`: Returns health check result
@@ -1634,6 +1531,22 @@ Response:
}
```

## API errors

`llama-server` returns errors in the same format as OAI: https://github.com/openai/openai-openapi

Example of an error:

```json
{
"error": {
"code": 401,
"message": "Invalid API Key",
"type": "authentication_error"
}
}
```

## More examples

### Interactive mode
@@ -1653,26 +1566,6 @@ Run with bash:
bash chat.sh
```

### OAI-like API

The HTTP `llama-server` supports an OAI-like API: https://github.com/openai/openai-openapi

### API errors

`llama-server` returns errors in the same format as OAI: https://github.com/openai/openai-openapi

Example of an error:

```json
{
"error": {
"code": 401,
"message": "Invalid API Key",
"type": "authentication_error"
}
}
```

Apart from error types supported by OAI, we also have custom types that are specific to functionalities of llama.cpp:

**When /metrics or /slots endpoint is disabled**