@danbev danbev commented Nov 4, 2025

This is a work in progress to add support for backend (like GPU) sampling.

The motivation for this feature is to enable some or all of the sampling to be performed directly on the backend, as part of the computation graph being executed.

For example, the backend sampler chain might select/sample a token directly, in which case only the sampled token needs to be transferred from device memory to host memory.

It is also possible for the backend samplers to perform filtering of the logits, or to compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers.

Currently, backend sampling works in a similar manner to pooling: it is a function that is called by build_graph, and the sampler operations become part of the model's computation graph.

Backend samplers can be configured by creating sampler chains, where each sampler chain is associated with a specific sequence id:

    struct llama_sampler_chain_params params = llama_sampler_chain_default_params();
    struct llama_sampler * chain = llama_sampler_chain_init(params);
    llama_sampler_chain_add(chain, llama_sampler_backend_init_greedy());
    std::vector<llama_sampler_seq_config> sampler_configs = {
        { 0, chain }
    };

The struct is defined as:

    struct llama_sampler_seq_config {
        llama_seq_id           seq_id;
        struct llama_sampler * sampler;
    };

These sampler configs are then passed as context params:

    llama_context_params cparams = llama_context_default_params();
    cparams.samplers = sampler_configs.data();
    cparams.n_samplers = sampler_configs.size();

When the model graph is built, the backend samplers are called so that they can add their operations to the graph:

    ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
        std::unique_ptr<llm_graph_context> llm;
        ...

        // add backend sampling layers (if any)
        llm->build_sampling(*this, params);

The llama_sampler_i interface has been extended with 4 new methods in the API, all currently named with a _ggml suffix to indicate that they are for backend sampling:

    void (*init_ggml)     (struct llama_sampler * smpl,
                           ggml_backend_buffer_type_t buft);

    void (*set_input_ggml)(struct llama_sampler * smpl,
                           ggml_context * ctx,
                           ggml_cgraph * gf);

    void (*apply_ggml)    (struct llama_sampler * smpl,
                           ggml_context * ctx,
                           ggml_cgraph * gf,
                           llama_sampler_ggml_data * ggml_data);

    void (*accept_ggml)   (struct llama_sampler * smpl,
                           ggml_context * ctx,
                           ggml_cgraph * gf,
                           struct ggml_tensor * selected_token);

The init_ggml function allows backend samplers to create any input tensors they might need. The ggml_backend_buffer_type_t argument should be used when creating these tensors so that they are allocated with the same backend buffer type as the output logits. This avoids splits in the computation graph that would require data transfer between different backends.

The set_input_ggml function is called after the computation graph has been scheduled but before it is computed. This allows the backend sampler to set any input for the tensors it created in init_ggml.

The apply_ggml function is where a backend sampler adds its operations to the graph: when the graph is built, each configured sampler's apply_ggml function is called, allowing it to add operations/nodes to the computation graph.

The accept_ggml function allows backend samplers to update their tensor state if needed.
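For illustration (this is not llama.cpp API code), the computation that a greedy backend sampler encodes into the graph is just an argmax over the logits; a host-side sketch of the equivalent operation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side sketch of what a greedy backend sampler's apply_ggml
// expresses as graph operations: an argmax over the logits, so that
// only a single token id needs to be copied back to host memory.
static int greedy_sample(const std::vector<float> & logits) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return (int) best;
}
```

On the backend this is expressed as graph nodes rather than a loop, but the data transferred back to the host is the same: one token id instead of the full logits vector.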

This enables sampling to happen fully or partially on the backend. The samplers could sample a single token, in which case only that token is transferred from device memory to host memory after llama_decode has been called. The sampled token can then be retrieved using:

    llama_token id = llama_get_backend_sampled_token_ith(ctx, index);

It is also possible to run a backend sampler that only filters the logits; then only the filtered logits are transferred back to the host, and sampling can proceed on the CPU with the normal (CPU) sampler chain. In this case the CPU samplers are configured as usual but now operate on already filtered logits.

Similar to the above handling of logits, it is possible for a backend sampler to compute the full probability distribution and transfer that to the host. The CPU samplers can then operate on those probabilities.
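As a host-side sketch (illustrative only, not the actual graph code), the probability distribution that would be transferred back is a softmax over the (possibly already filtered) logits:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative host-side equivalent of a backend sampler computing the
// probability distribution on the device: a numerically stable softmax.
static std::vector<float> softmax(const std::vector<float> & logits) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (float & p : probs) {
        p /= sum;
    }
    return probs;
}
```

Transferring probabilities instead of raw logits lets CPU samplers that operate on the distribution (e.g. dist) proceed without recomputing the softmax on the host.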

Configuration

Backend sampling is enabled using --backend-sampling, and the sampler chain, either explicitly specified using --samplers or the default, is automatically analyzed to determine which samplers can run on the backend. The system finds the longest contiguous chain of backend-supported samplers from the start of the sampler sequence.

For example:

  • If the chain is top-k -> temperature -> top-p, and both top-k and temperature are backend-supported but top-p is not, then top-k and temperature will run on the backend, while top-p and subsequent samplers run on the CPU.

  • If all configured samplers are supported, the final distribution sampling will also happen on the backend, transferring only the sampled token IDs back to the host.

  • If the sampler chain starts with an unsupported sampler that is active, all sampling runs on the CPU. Note that this is currently the case with the default sampler chain, so to use backend sampling an explicit sampler chain must be specified. See below for an example.
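The split rule described above can be sketched as follows (an illustrative C++ sketch, not the actual llama.cpp implementation; the function name is made up):

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Sketch of the chain-splitting rule: given the configured sampler names
// in order, and the set of samplers that have a backend implementation,
// return how many samplers from the start of the chain run on the
// backend. The remaining samplers run on the CPU.
static std::size_t backend_prefix_len(const std::vector<std::string> & chain,
                                      const std::set<std::string>    & backend_supported) {
    std::size_t n = 0;
    while (n < chain.size() && backend_supported.count(chain[n]) > 0) {
        ++n;
    }
    return n;
}
```

With the chain top_k -> temperature -> top_p and backend support for only top_k and temperature, this returns 2: the first two samplers run on the backend, while top_p and any subsequent samplers run on the CPU.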

llama-cli

Initial support for llama-cli has been added and can be used as follows:

    $ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
        --prompt 'What is the capital of Sweden?' \
        -n 20 \
        -no-cnv \
        --verbose-prompt \
        -ngl 40 \
        --backend-sampling \
        --samplers 'top_k;temperature'

To enable partial backend sampling (hybrid sampling), for example running top_k and temperature on the backend and top_p on the CPU, the following sampler chain could be specified:

    $ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
        --prompt 'What is the capital of Sweden?' \
        -n 20 \
        -no-cnv \
        --verbose-prompt \
        -ngl 40 \
        --backend-sampling \
        --samplers 'top_k;temperature;top_p'

llama-server

Backend sampling can be enabled for llama-server similar to how it was done above for llama-cli:

    $ llama-server \
        -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
        --backend-sampling \
        --samplers 'top_k;temperature' \
        --temp 0.8 \
        --top-k 40 \
        -ngl 50 \
        -v

It is then possible to send request parameters as follows:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "What is the capital of Sweden?","n_predict": 20, "top_k": 40, "backend_sampling": true}'

Building and running the tests

Download a model for testing:

$ cd models && wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf

Building the test:

$ cmake --build build --target test-backend-sampler -j8

Running all tests:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R '^test-backend-sampler$' -V

The following individual tests are available:

$ ctest --test-dir build-gpu-sampler/ -N -R test-backend-sampler-
Internal ctest changing into directory: /home/danbev/work/ai/llama.cpp-debug/build-gpu-sampler
Test project /home/danbev/work/ai/llama.cpp-debug/build-gpu-sampler
  Test #36: test-backend-sampler-greedy
  Test #37: test-backend-sampler-temp
  Test #38: test-backend-sampler-top_k
  Test #39: test-backend-sampler-dist
  Test #40: test-backend-sampler-dist-and-cpu
  Test #41: test-backend-sampler-logit-bias
  Test #42: test-backend-sampler-mul_seq
  Test #43: test-backend-sampler-set-sampler

Total Tests: 8

These can be run individually, for example:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R 'test-backend-sampler-temp' -V

TODO

  • Allocate backend sampler tensors on the same backend as the logits (dev_output.dev)
  • Allow backend samplers to pre-allocate state tensors
  • Integrate backend samplers with llama-cli
  • Set/unset backend samplers
  • Integrate backend samplers with llama-server
  • Add more tests/assertions for the backend samplers to check more cases
  • Rename from sampling to sampler.
  • Consistent and clearer naming of backend (backend sampling) functions and data types.
  • penalties samplers (to figure out/verify how accept_ggml should work) Will be done in a follow up PR.
  • Add ggml_cumsum operation to CUDA backend. This operation exists for Metal and CPU already.

Implemented backend samplers

  • temp
  • logit_bias
  • top_k
  • greedy
  • dist sampler
  • min_p

Remaining backend samplers

The list below contains the CPU samplers that currently exist. Not all of these may be appropriate as backend samplers. They will be implemented in separate follow-up PRs.

  • top_p
  • typical
  • temp_ext
  • xtc
  • top_n_sigma
  • mirostat/mirostat_v2
  • penalties
  • dry
  • infill

@github-actions github-actions bot added the testing Everything test related label Nov 4, 2025
@am17an commented Nov 5, 2025

One place this would be useful immediately is the diffusion-cli. I'm happy to test this when it's ready

@danbev force-pushed the gpu-sampling branch 2 times, most recently from 71b0e3d to c82b67b on November 6, 2025
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Nov 6, 2025
@danbev force-pushed the gpu-sampling branch 2 times, most recently from 56bca5e to 5d18032 on November 6, 2025
@danbev force-pushed the gpu-sampling branch 7 times, most recently from f49a857 to 7c6dc02 on November 11, 2025
@danbev force-pushed the gpu-sampling branch 4 times, most recently from 1168c22 to 9609e7e on November 12, 2025
@ORippler left a comment:

Not sure if I have a strong opinion on this, but removing hybrid sampling would reduce the complexity a bit I think (basically if we always set --gpu-dist we only have two states: either full GPU sampling or full CPU sampling, and no in-between).

@danbev commented Nov 13, 2025

> Not sure if I have a strong opinion on this but removing hybrid sampling would reduce the complexity a bit I think (basically if we always set --gpu-dist we only have two states (either full gpu sampling or full cpu sampling, and no in-between).

My thoughts are that we should keep the hybrid approach even though it does come with some additional complexity, as you say. There could be use cases where one might want to perform some sampling, like temp/logit_bias/top-k, on the device, and then only have a smaller set of logits copied to host memory, while still enabling other CPU samplers, including grammars, to process the logits.

This might turn out to be an incorrect assumption and not something anyone wants to use, but it feels safer to keep the ability to do hybrid sampling.

@ggerganov commented:

@danbev Let's rebase on latest master to pick up the recent changes.

@danbev force-pushed the gpu-sampling branch 2 times, most recently from 0730c19 to b2370c7 on November 16, 2025
Review comment on common/arg.cpp (outdated), lines 1516 to 1530:
    add_opt(common_arg(
        {"--backend-sampling"},
        "enable backend sampling (default: disabled)",
        [](common_params & params) {
            params.sampling.backend_sampling = true;
        }
    ).set_sparam());
    add_opt(common_arg(
        {"--backend-dist"},
        "perform final (distribution) sampling on backend (default: disabled)",
        [](common_params & params) {
            params.sampling.backend_dist = true;
            params.sampling.backend_sampling = true;
        }
    ).set_sparam());
@ggerganov commented Nov 25, 2025:

This separation between "backend sampling" and "backend dist" is not really necessary.

I think a more generic approach that does not require to use a separate --backend-dist argument is like this:

  • Define a sampler sequence as we normally do on master: A->B->C->D->E. For example, the default chain that we have is:
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
  • Note that we don't treat the dist sampler specially - it's just a regular sampler
  • If the --backend-sampling parameter is not passed, we do normal CPU-based sampling as usual
  • If the --backend-sampling parameter is passed, then we iterate over the sampler sequence from the start and find the longest chain of samplers that are supported by the backend. For example, if A, B, C, E are supported by the backend but D is not, then the backend sampling chain would be A->B->C and the remaining CPU chain will be D->E.
  • By "supported by the backend" we mean that there is a corresponding llama_sampler_backend_ function to create the backend sampler. It does not mean to necessarily have all the ggml operators implemented by the backend.
  • The struct common_sampler can be extended to maintain references to both the backend and the CPU sampling chains. The logic for the construction of the 2 chains will be implemented within common_sampler_init as described in the previous point. It's allowed for the CPU sampling chain to be empty - this means that all the sampling is done on the GPU using the backend chain.

@danbev replied:

I've added 9e5e09d to address this. The commit contains some notes about this, and I've updated this PR's main comment with a configuration section and examples using llama-cli and llama-server.

danbev and others added 4 commits November 25, 2025 14:01
This commit removes the `--backend-dist` option and instead uses the
configured --samplers chain to determine which samplers run on the
backend.

Backend sampling is still enabled with `--backend-sampling`, and the
sampler chain, either explicitly specified using `--samplers` or the
default, is automatically analyzed to determine which samplers can run
on the backend. The system finds the longest contiguous chain of
backend-supported samplers from the start of the sampler sequence.
For example:

* If the chain is `top-k -> temperature -> top-p`, and both `top-k` and
  `temperature` are backend-supported but `top-p` is not, then `top-k`
  and `temperature` will run on the backend, while `top-p` and
  subsequent samplers run on the CPU.

* If all configured samplers are supported, the final distribution
  sampling will also happen on the backend, transferring only the
  sampled token IDs back to the host.

* If the sampler chain starts with an unsupported sampler (e.g.,
  `penalties`), all sampling runs on the CPU. Note that this is
  currently the case with the default sampler so to use backend sampling
  it is required to specify a sampler chain. See below for an example.

The following shows how llama-cli can be run with backend sampling:
```console
$ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --prompt 'What is the capital of Sweden?' \
    -n 20 \
    -no-cnv \
    --verbose-prompt \
    -ngl 40 \
    --backend-sampling \
    --samplers 'top_k;temperature'
```
In this case all sampling will happen on the backend since both
`top_k` and `temperature` are supported backend samplers.

To enable partial backend sampling (hybrid sampling), for example
running `top_k` and `temperature` on the backend and `top_p` on the CPU,
the following sampler chain could be specified:
```console
$ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --prompt 'What is the capital of Sweden?' \
    -n 20 \
    -no-cnv \
    --verbose-prompt \
    -ngl 40 \
    --backend-sampling \
    --samplers 'top_k;temperature;top_p'
```

If this looks good then I'll follow up with updates to the llama-cli and
llama-server documentation to reflect these changes.
@github-actions github-actions bot added the build Compilation issues label Nov 26, 2025
ORippler and others added 2 commits November 26, 2025 15:30
This commit adds a function to check whether a sampler is actually
enabled, meaning that it does not have values that disable its effect.
This is then used by the backend sampler initialization to avoid
considering samplers that are not enabled when determining the split
point between backend and CPU samplers.

The motivation for this is that this allows the default sampler chain
for `--samplers` to be used and any sampler that is not enabled will not
cause the backend samplers to be skipped.
For example, before this change if the penalties sampler was included in
the samplers list but had default values that disable it, it would cause
the backend samplers to be skipped entirely.

This commit also contains some refactoring to remove some code
duplication.
Review comment on lines 551 to 561:
// Use ggml_scale_bias (output = (a * s) + b) which in this case becomes:
// min_p_bias = (mask * 1e9f) - 1e9f.
// So entries in the mask that we want to discard will become -1e9f, and
// others will be 0 (meaning they will not affect the logits).
const float large_val = 1e9f;
struct ggml_tensor * min_p_bias = ggml_scale_bias(ctx, mask, large_val, -large_val);
ggml_set_name(min_p_bias, "min_p_bias");

// Add the min_p bias to the logits.
ggml_data->logits = ggml_add(ctx, ggml_data->logits, min_p_bias);
ggml_set_name(ggml_data->logits, "min_p_logits");
A contributor commented:

Why can't we use get_rows to return only the values where mask == 1?

danbev and others added 12 commits November 26, 2025 17:44
This commit modifies the temperature sampling check to allow a
temperature value of zero. Previously, the check only allowed
positive temperature values, which excluded the valid case of
zero temperature.

The motivation for this is to enable a zero temperature setting, which
was also causing the following test to fail:
```console
(venv) $ cd tools/server/tests
(venv) $ ./tests.sh unit/test_basic.py::test_load_split_model
```
This commit adds a macro guard around argsort_f32_i32_cuda_cub usage
in the top-k fallback path, falling back to bitonic sort when
GGML_CUDA_USE_CUB is not defined.

The motivation for this is that some environments like AMD HIP
do not have CUB available, causing compilation failure.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/19728226426/job/56523606840#step:6:208
This commit adds a comment to llama_context's constructor explaining why
backend samplers are initialized early in the process.
This commit removes the backend sampling chain from the common_sampler
structure and related functions.

The motivation for this change is that the backend samplers are not
currently set on the context, and if they were they would cause a
graph reallocation to occur. Instead, the initialization is handled
as it currently is by llama_context's constructor.
Some changes were made in 5ea3be2 which were incomplete. In the non-CUB
case, bitonic sort and its limitation of ncols < 1024 have to apply,
similar to argsort.cu.
This commit updates the backend sampling implementation to support
intermixed usage of backend and CPU samplers within the same batch.

The initial implementation was developed as an all-or-nothing solution:
either perform backend sampling for the entire batch, or perform CPU
sampling for the entire batch.

The motivation for this change is to support batches with mixed
sequences. For example, we may have a backend sampler configured for
sequence 0, while sequence 1 in the same batch uses CPU sampling. This
was not supported in the initial implementation.

This issue manifested in llama-server with the webui: decoding with
backend samplers would work initially, but after changing to CPU
sampling, a slot (sequence) could still be using a backend sampler.
This meant that logits in output_reserve would not be allocated,
resulting in an error.

The solution in this commit inspects the batch to determine which
sampling modes are needed and allocates buffers accordingly. However,
there is a known inefficiency: when we have intermixed backend/CPU
samplers in the same batch, we currently copy all logits to the host,
even for sequences using backend samplers.

Added test_backend_cpu_mixed_batch to verify correct behavior with
mixed backend/CPU samplers in a single batch, including dynamic
sampler switching between decode calls.
Add a check that logits is not null, which can happen for embeddings.
Fix llama-save-load-state, which currently fails, by handling the case
when batch.logits is nullptr (like when loading state) and allocating
space for all outputs as CPU logits.
As we only support static graphs for the time being, and we don't know
the size of the output of top-p, we have to do value-scaling the same
as for the min-p operator.

Further improvements can be applied to the unit test (i.e. check for
equivalence of top_p happening on the backend with top_p happening on
the CPU) and also by constructing candidates and sorting those, as
opposed to reversing the sort of the logits (this would be arange +
get_rows instead of argsort + get_rows).
Review comment on lines 145 to 153:
// top_k is a view of argsort - check if backend supports the underlying argsort operation
// by checking the source tensor (which is the argsort result)
if (ctx_data->device && top_k->src[0] && !ggml_backend_dev_supports_op(ctx_data->device, top_k->src[0])) {
fprintf(stderr, "Warning: backend does not support argsort operation required for top-k sampling\n");
fprintf(stderr, "CPU backend will be used instead which defeats the purpose of having backend samplers\n");
}

// TODO: temporary cont until https://github.com/ggml-org/llama.cpp/pull/17365 is merged
ggml_data->candidates = ggml_cont(ctx, top_k);
A contributor commented:

I feel these are outdated as top_k has been implemented now?
