@ngxson ngxson commented Nov 24, 2025

Close #16487

This PR introduces the ability to use multiple models in llama-server and to load/unload them on the fly.

The API is designed to take advantage of the OAI-compatible /v1/models endpoint, as well as the "model" field in the body payload of POST requests like /v1/chat/completions. By default, if the requested model is not yet loaded, it will be loaded automatically on demand.

This is the first version of the feature and is intended to be experimental. Here is the list of capabilities:

  • API for listing, loading, and unloading models
  • Routing requests based on the "model" field
  • Limiting the maximum number of models loaded at the same time
  • Loading models from a local directory
  • (Advanced) specifying custom per-model config via API

Other features, such as downloading new models, deleting cached models, and real-time events, are planned for the next iteration.
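
The routing logic described above can be sketched as a small decision function. This is a hypothetical illustration (the names `route_action` and `decide_route` are not from the actual implementation), showing how a router might combine on-demand loading with a cap on concurrently loaded models:

```cpp
#include <map>
#include <string>

enum class route_action { ROUTE, LOAD_THEN_ROUTE, REJECT };

// loaded maps a model name to the HTTP port of the child serving it.
route_action decide_route(const std::map<std::string, int> & loaded,
                          const std::string & model,
                          size_t max_loaded) {
    if (loaded.count(model)) {
        return route_action::ROUTE;           // already running: forward the request
    }
    if (loaded.size() < max_loaded) {
        return route_action::LOAD_THEN_ROUTE; // load on demand, then forward
    }
    return route_action::REJECT;              // at capacity; a real router might
                                              // instead evict an idle model
}
```

The REJECT branch is the simplest policy for the "limit maximum number of models" capability; an eviction policy would be a natural extension.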

Example commands:

# start the server as a router (using models in cache)
llama-server

# use GGUFs from a local directory - see directory structure in README.md
llama-server --models-dir ./my_models

# specify default arguments to be passed to models
llama-server -n 128 -c 8192 -ngl 4

# allow setting the arguments per-model via API (warning: only use in a trusted network)
llama-server --models-allow-extra-args

For the full documentation, please refer to the "Using multiple models" section of the new documentation

Note: waiting for further webui changes from @allozaur

Screen.Recording.2025-11-24.at.15.20.05.mp4

Implementation

The feature is implemented using a multi-process approach. The reason for this choice is to be more resilient in case a model crashes.

Most of the implementation is confined to tools/server/server-models.cpp.

There is one main "router" server whose job is to create "child" processes that actually run the inference.

This system was designed and tested against these unexpected cases:

  • A child process exits suddenly due to an error (for example, a GGML_ASSERT)
  • A child process fails to load (for example, the system cannot launch the process)
  • The router process exits suddenly due to an error; in this case, child processes automatically stop themselves

These steps happen when the user requests the router to launch a model instance:

  1. Check if the model already has a process; if yes, skip
  2. Construct argv and envp to launch the child process; a random HTTP port is selected for each child process
  3. Start the child process
  4. Create a thread to read the child's stdout/stderr and forward it to the main process, with a [port_number] prefix
  5. Inside the child process, notify the router server of the "ready" status, then spawn a thread to monitor stdin
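
Step 2 can be sketched as follows. This is an illustrative helper, not the actual code; `build_child_args`, the binary name, and the flag layout are assumptions, with `--port` being the existing llama-server flag for selecting the HTTP port:

```cpp
#include <string>
#include <vector>

// Build the argv for one child process. The router would pick a free
// random port and pass the model assigned to this child via -m.
std::vector<std::string> build_child_args(const std::string & server_bin,
                                          const std::string & model_path,
                                          int port) {
    return {
        server_bin,
        "-m", model_path,               // model this child will serve
        "--port", std::to_string(port), // per-child HTTP port chosen by the router
    };
}
```

An envp built the same way would let the router pass per-model environment overrides alongside the arguments.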

If the child process exits, the router server knows as soon as stdout/stderr is closed.

In the reverse direction, from the router server:

  • If the router server sends a special command via stdin, the child process detects this command, calls its cleanup function, and exits gracefully
  • If the router server crashes, stdin will be closed. This triggers exit(1), which causes the child process to exit immediately
sequenceDiagram
    router->>child: spawn with args
    child->>child: load model
    child->>router: POST ready status via API
    Note over child,router: Routing HTTP requests
    alt request shutdown
        router->>child: exit command (via stdin)
        child->>child: clean up & exit
    else router dead
        router-->child: stdin close
        child->>child: force exit
    end

Other changes included in the PR:

  • Added subprocess.h as a new vendored dependency
  • Removed DEFAULT_MODEL_PATH
  • If -m, --model is not specified, common_params_parse_ex will return an error (except for the server)
