@ngxson ngxson commented Nov 24, 2025

Close #16487

This PR introduces the ability to use multiple models in llama-server and to load/unload them on the fly.

The API is designed to take advantage of the OAI-compatible /v1/models endpoint, as well as the "model" field in the body payload of POST requests like /v1/chat/completions. By default, if the requested model is not yet loaded, it will be loaded automatically on demand.

This is the first version of the feature and is intended to be experimental. Here is the list of capabilities:

  • API for listing, loading, and unloading models
  • Routing requests based on the "model" field
  • Limiting the maximum number of models loaded at the same time
  • Loading models from a local directory
  • (Advanced) specifying custom per-model config via API

Other features, such as downloading new models, deleting cached models, and real-time events, are planned for the next iteration.
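
The routing logic described above can be sketched as a small decision function. This is a hypothetical illustration (the names `route_action` and `decide_route` are not from the actual implementation), showing how a router might combine on-demand loading with a cap on concurrently loaded models:

```cpp
#include <map>
#include <string>

enum class route_action { ROUTE, LOAD_THEN_ROUTE, REJECT };

// loaded maps a model name to the HTTP port of the child serving it.
route_action decide_route(const std::map<std::string, int> & loaded,
                          const std::string & model,
                          size_t max_loaded) {
    if (loaded.count(model)) {
        return route_action::ROUTE;           // already running: forward the request
    }
    if (loaded.size() < max_loaded) {
        return route_action::LOAD_THEN_ROUTE; // load on demand, then forward
    }
    return route_action::REJECT;              // at capacity; a real router might
                                              // instead evict an idle model
}
```

The REJECT branch is the simplest policy for the "limit maximum number of models" capability; an eviction policy would be a natural extension.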

Example commands:

# start the server as a router (using models in cache)
llama-server

# use GGUFs from a local directory - see directory structure in README.md
llama-server --models-dir ./my_models

# specify default arguments to be passed to models
llama-server -n 128 -c 8192 -ngl 4

# allow setting the arguments per-model via API (warning: only use in a trusted network)
llama-server --models-allow-extra-args

For the full documentation, please refer to the "Using multiple models" section of the new documentation

Note: waiting for further webui changes from @allozaur

Screen.Recording.2025-11-24.at.15.20.05.mp4

Implementation

The feature is implemented using a multi-process approach. The reason for this choice is to be more resilient in case a model crashes.

Most of the implementation is confined to tools/server/server-models.cpp.

There is one main "router" server whose job is to create "child" processes that actually run the inference.

This system was designed and tested against these unexpected cases:

  • A child process exits suddenly due to an error (for example, a GGML_ASSERT)
  • A child process fails to load (for example, the system cannot launch the process)
  • The router process exits suddenly due to an error; in this case, child processes automatically stop themselves

These steps happen when the user requests the router to launch a model instance:

  1. Check if the model already has a process; if yes, skip
  2. Construct argv and envp to launch the child process; a random HTTP port is selected for each child process
  3. Start the child process
  4. Create a thread to read the child's stdout/stderr and forward it to the main process, with a [port_number] prefix
  5. Inside the child process, notify the router server of the "ready" status, then spawn a thread to monitor stdin
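
Step 2 can be sketched as follows. This is an illustrative helper, not the actual code; `build_child_args`, the binary name, and the flag layout are assumptions, with `--port` being the existing llama-server flag for selecting the HTTP port:

```cpp
#include <string>
#include <vector>

// Build the argv for one child process. The router would pick a free
// random port and pass the model assigned to this child via -m.
std::vector<std::string> build_child_args(const std::string & server_bin,
                                          const std::string & model_path,
                                          int port) {
    return {
        server_bin,
        "-m", model_path,               // model this child will serve
        "--port", std::to_string(port), // per-child HTTP port chosen by the router
    };
}
```

An envp built the same way would let the router pass per-model environment overrides alongside the arguments.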

If the child process exits, the router server knows as soon as stdout/stderr is closed.

In the reverse direction, from the router server:

  • If the router server sends a special command via stdin, the child process detects this command, calls its cleanup function, and exits gracefully
  • If the router server crashes, stdin will be closed. This triggers exit(1), which causes the child process to exit immediately
sequenceDiagram
    router->>child: spawn with args
    child->>child: load model
    child->>router: POST ready status via API
    Note over child,router: Routing HTTP requests
    alt request shutdown
        router->>child: exit command (via stdin)
        child->>child: clean up & exit
    else router dead
        router-->child: stdin close
        child->>child: force exit
    end

Other changes included in the PR:

  • Added subprocess.h as a new vendored dependency
  • Removed DEFAULT_MODEL_PATH
  • If -m, --model is not specified, common_params_parse_ex will return an error (except for the server)
