server: introduce API for serving / loading / unloading multiple models #17470
Open
ngxson wants to merge 158 commits into ggml-org:master from ngxson:xsn/server_model_management_v1_2
+9,747 −4,171
Conversation
Close #16487
This PR introduces the ability to serve multiple models and load/unload them on the fly in `llama-server`.

The API was designed to take advantage of the OAI-compat `/v1/models` endpoint, as well as the `"model"` field in the body payload of POST requests like `/v1/chat/completions`. By default, if the requested model is not yet loaded, it will be loaded automatically on demand.

This is the first version of the feature and is meant to be experimental. Among its capabilities is selecting the target model per request via the `"model"` field. Other features like downloading new models, deleting cached models, real-time events, etc. are planned for the next iteration.
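The on-demand loading behaviour described above can be sketched as routing logic. This is a hypothetical illustration only: the `Router` class and model names are invented here, and the actual implementation is C++ inside `llama-server`, not Python.

```python
# Illustrative sketch of the router's on-demand loading decision.
# The real implementation lives in tools/server/server-models.cpp.

class Router:
    def __init__(self, available):
        self.available = set(available)  # models the server knows about
        self.loaded = {}                 # model name -> running instance (placeholder here)

    def handle_request(self, model):
        """Route a request; load the model first if it is not running yet."""
        if model not in self.available:
            raise KeyError(f"unknown model: {model}")
        if model not in self.loaded:
            # In the PR this spawns a child server process; here we just record it.
            self.loaded[model] = object()
        return f"forwarded to instance of {model}"

router = Router(["qwen", "gemma"])
print(router.handle_request("qwen"))   # first use triggers the load
print("qwen" in router.loaded)
```

The key design point mirrored here is that the caller never issues an explicit "load" step: naming a known model in a request is enough.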
Example commands:
For the full documentation, please refer to the "Using multiple models" section of the new documentation.
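As a hedged illustration of the request shape (the host, port, and model name below are assumptions, not taken from the PR), an OAI-compat chat completion request that selects a model via the `"model"` field can be built like this:

```python
import json
import urllib.request

# Build (but do not send) a chat completion request that selects a model
# via the OAI-compat "model" field. Host/port and model name are invented.
payload = {
    "model": "qwen",  # hypothetical model name
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.get_method(), req.full_url)
```

Listing the available models would be a plain GET to `/v1/models` on the same server.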
Note: waiting for further webui changes from @allozaur
Screen.Recording.2025-11-24.at.15.20.05.mp4
Implementation
The feature was implemented using a multi-process approach. The reason for this choice is to be more resilient in case a model crashes.
Most of the implementation is confined to `tools/server/server-models.cpp`. There is one main "router" server whose job is to create "child" processes that actually run the inference.
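A minimal sketch of the router/child split, using Python's `subprocess` as a stand-in for the C++ implementation (the command lines, model names, and ports here are invented; a real child would be a `llama-server` instance loading a model):

```python
import subprocess
import sys

# The router spawns one child process per model instance, each with its
# own arguments, and keeps a registry of running children.
def spawn_child(model, port):
    # Stand-in child: just echoes its configuration instead of serving.
    code = f"print('serving {model} on {port}', flush=True)"
    return subprocess.Popen([sys.executable, "-c", code],
                            stdout=subprocess.PIPE, text=True)

children = {m: spawn_child(m, 8081 + i)
            for i, m in enumerate(["qwen", "gemma"])}
for model, proc in children.items():
    print(proc.stdout.readline().strip())
    proc.wait(timeout=10)
```

Keeping each model in its own process means a crash takes down only that child, and the router can report the failure or respawn it.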
This system was designed and tested against unexpected cases, such as the child process crashing (e.g. hitting a `GGML_ASSERT`).

These steps happen when the user requests the router to launch a model instance:

- The router spawns a child process with the appropriate arguments, including a `[port_number]` for it to listen on.
- The child process loads the model, then reports its ready status to the router via an API call.

If the child process exits, the router server knows as soon as the child's stdout/stderr is closed.
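The launch handshake and exit detection can be sketched as follows. This is a Python stand-in with invented details: in the PR the child is a server process that reports readiness to the router over HTTP, whereas here stdout stands in for that channel.

```python
import subprocess
import sys

# Child announces the port it is serving on, then exits; the router
# parses the announcement, and EOF on stdout tells it the child is gone.
child_src = "print('ready 8081', flush=True)"
child = subprocess.Popen([sys.executable, "-c", child_src],
                         stdout=subprocess.PIPE, text=True)
announce = child.stdout.readline().split()   # e.g. ['ready', '8081']
port = int(announce[1])
eof = child.stdout.read()                    # '' once the child has exited
child.wait(timeout=10)
print("child ready on port", port, "| pipe closed:", eof == "")
```

Relying on pipe closure for exit detection is robust even when the child dies abruptly, since the OS closes the pipe regardless of how the process ended.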
In reverse, from the router server:
- If the router process dies, the child detects that its stdin has been closed and calls `exit(1)`, which causes the child process to exit immediately.

```mermaid
sequenceDiagram
    router->>child: spawn with args
    child->>child: load model
    child->>router: POST ready status via API
    Note over child,router: Routing HTTP requests
    alt request shutdown
        router->>child: exit command (via stdin)
        child->>child: clean up & exit
    else router dead
        router-->>child: stdin close
        child->>child: force exit
    end
```

Other changes included in the PR:
- `DEFAULT_MODEL_PATH` is removed; if `-m, --model` is not specified, `common_params_parse_ex` will return an error (except for the server).
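The child-side stdin watchdog described above (stdin closes, child force-exits) can be sketched like this. Again a Python stand-in, not the PR's C++ code:

```python
import subprocess
import sys

# Child blocks reading stdin; when the router dies or closes the pipe,
# read() returns EOF and the child force-exits with status 1.
child_src = (
    "import sys\n"
    "sys.stdin.read()   # returns '' when the router closes the pipe\n"
    "sys.exit(1)        # force exit, mirroring the child's exit(1)\n"
)
child = subprocess.Popen([sys.executable, "-c", child_src],
                         stdin=subprocess.PIPE)
child.stdin.close()        # simulate the router going away
child.wait(timeout=10)
print("child exit code:", child.returncode)
```

Using stdin as a liveness channel needs no polling: the blocked read wakes up exactly when the parent disappears, so orphaned children cannot linger.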