sampling : add support for backend sampling #17004
Open · danbev wants to merge 68 commits into ggml-org:master from danbev:gpu-sampling (+3,486 −396)
Commits (68)
- 7884b0e (danbev) sampling : add support for backend sampling
- 9fe9a00 (danbev) llama-cli : add backend sampler configuration
- f1f3e68 (danbev) server : add backend sampling options/configuration
- a3eb847 (danbev) webui : add backend sampling options
- 67d3b8e (danbev) ggml : add initial cumsum implementation for CUDA
- 71574f9 (danbev) sampling : enable all backend sampler tests
- 4b52e59 (ggerganov) graph : do not include llama-model.h
- 82957a9 (danbev) sampling : always expose sampled_ids
- 311c1a3 (danbev) sampling : ensure at most one output token per seq
- 26be108 (ORippler) CUDA: Optimize argsort for gpu-based token sampling
- 0da7e7d (danbev) sampling : remove version from sampler chain
- 51fee29 (danbev) sampling : always populate logits for sampled probs
- 7e98ebc (danbev) sampling : simplify backend sampling logic decode
- d74eb61 (danbev) squash! sampling : simplify backend sampling logic decode
- 38f408c (ggerganov) common : fix regression caused by extra memory allocations during sam…
- 18ed4d8 (danbev) squash! sampling : simplify backend sampling logic decode
- 0c660e7 (danbev) Merge remote-tracking branch 'upstream/master' into backend-sampling
- ed4345b (danbev) squash! common : fix regression caused by extra memory allocations du…
- 0d28b16 (danbev) sampling : introduce sampling_info struct
- c162562 (danbev) sampling : return early if backend sampling is disabled
- 61ffe41 (danbev) sampling : use pinned memory for backend sampling buffers
- 9b24393 (danbev) common, tools : refactor model loading to support backend samplers
- 79b8cf2 (danbev) Merge remote-tracking branch 'upstream/master' into backend-sampling
- 65500d0 (danbev) sampling : add stride variable for clarity
- ae23d2d (danbev) sampling : clarify candidate ids usage in comments
- 9e273f7 (danbev) sampling : fix copying both sampled tokens and logits/probs from backend
- 50d21aa (danbev) tests : cleanup test-backend-sampler.cpp
- 7816f0b (danbev) Merge remote-tracking branch 'upstream/master' into backend-sampling
- d88ba18 (danbev) common : remove build-info.cpp from commit [no ci]
- 4a90583 (danbev) sampling : cleanup and clarify output_reserve
- 8eb9b47 (danbev) sampling : remove redundant checks for stride and size [no ci]
- 25f3380 (danbev) sampling : add debug log when backend sampler selects token
- d0bea21 (danbev) examples : update batched to use backend sampling
- e2d4f08 (ggerganov) llama-cli : fix dangling reference to sampler config
- b26c706 (ggerganov) common : initialize backend samplers
- 883a870 (ggerganov) samplers : add missing cont
- a02adf4 (danbev) sampling : add assertions for contiguous tensors in async copy functions
- 2b4c792 (danbev) Merge remote-tracking branch 'upstream/master' into backend-sampling
- 0f17ccd (danbev) examples : add info about hybrid sampling in batched [no ci]
- 53dca56 (danbev) Merge remote-tracking branch 'upstream/master' into gpu-sampling
- 9e5e09d (danbev) sampling : remove backend-dist option (wip)
- ec047e1 (danbev) Merge remote-tracking branch 'upstream/master' into backend-sampling
- f23b306 (ORippler) CUDA: Add top-k implementation
- b45d504 (danbev) sampling : add min-p backend sampler
- 4fea191 (ORippler) Use `FetchContent` over CPM as it's bundled with CMake
- 0f7805f (danbev) common : add get_active_samplers function to check enabled samplers
- 90a3aff (danbev) cuda : fix editorconfig-checker warning
- 7c2bfb3 (danbev) Merge remote-tracking branch 'upstream/master' into backend-sampling
- d9d7361 (danbev) sampling : use argmax for min-p sampling
- 51107a0 (danbev) sampling : fix temperature check to allow zero temperature
- 5ea3be2 (danbev) cuda : fix top-k compilation when CUB is unavailable
- 172208a (danbev) sampling : add comments about backend sampler [no ci]
- e9d0709 (danbev) sampling : remove backend sampling chain from common_sampler
- f9889cf (ORippler) Fix top-k comp & behavior for non-CUB path
- 74be332 (danbev) sampling : support intermixed backend/cpu samplers
- 9ad6522 (danbev) squash! sampling : support intermixed backend/cpu samplers
- 459b7ae (danbev) squash! sampling : support intermixed backend/cpu samplers
- 117e207 (ggerganov) refactor : simplify and improve memory management
- 333da80 (ORippler) Add initial version for top-p sampling
- 8cac9de (danbev) sampling : use logits directly for min-p filtering
- 2464d1b (ggerganov) sampling : simplify
- fbc8f49 (ggerganov) llama : simplify
- 9028ebf (ggerganov) llama : cleanup + naming
- d8d98bb (ggerganov) Merge branch 'master' into HEAD
- ff7b0bf (ggerganov) llama : call backend_init once
- 467746e (ggerganov) Merge branch 'master' into HEAD
- 1760bd6 (ggerganov) llama : reserve graphs with samplers
- c187003 (ggerganov) llama : naming
@@ -488,3 +488,118 @@ struct llama_sampler * llama_sampler_backend_init_logit_bias(int32_t n_vocab,

```cpp
    return sampler;
}

struct llama_sampler_backend_min_p_ctx {
    float p;

    // Only required for checking operation support and can be removed later.
    ggml_backend_dev_t device;
};

static void llama_sampler_backend_min_p_init_ggml(
        struct llama_sampler * smpl,
        ggml_backend_buffer_type_t buft) {
    auto * sctx = (llama_sampler_backend_min_p_ctx *) smpl->ctx;
    sctx->device = ggml_backend_buft_get_device(buft);
}

static void llama_sampler_backend_min_p_apply_ggml(
        struct llama_sampler * smpl,
        struct ggml_context * ctx,
        struct ggml_cgraph * gf,
        struct llama_sampler_ggml_data * ggml_data) {
    auto * sctx = (llama_sampler_backend_min_p_ctx *) smpl->ctx;

    struct ggml_tensor * softmax = ggml_soft_max(ctx, ggml_data->logits);
    ggml_set_name(softmax, "softmax");

    // Get the sorted indices of the softmax probabilities in descending order.
    struct ggml_tensor * sorted_idx = ggml_argsort(ctx, softmax, GGML_SORT_ORDER_DESC);
    ggml_set_name(sorted_idx, "sorted_idx");

    // Reshape into a row vector.
    struct ggml_tensor * softmax_rows = ggml_reshape_2d(ctx, softmax, 1, softmax->ne[0]);
    ggml_set_name(softmax_rows, "softmax_rows");

    // Get the sorted probabilities using the sorted indices so that we can get
    // the max probability value, which will be the first entry in sorted_probs.
    struct ggml_tensor * sorted_probs = ggml_get_rows(ctx, softmax_rows, sorted_idx);
    ggml_set_name(sorted_probs, "sorted_probs");

    // Get the max probability value from sorted_probs.
    struct ggml_tensor * p_max = ggml_view_1d(ctx, sorted_probs, 1, 0);
    ggml_set_name(p_max, "p_max");

    // Calculate the threshold value.
    struct ggml_tensor * threshold = ggml_scale(ctx, p_max, sctx->p);
    ggml_set_name(threshold, "min_p_threshold");

    // Broadcast the threshold to match the shape of softmax.
    struct ggml_tensor * threshold_b = ggml_repeat(ctx, threshold, softmax);
    ggml_set_name(threshold_b, "min_p_threshold_b");

    // Subtract the threshold from the softmax probabilities.
    struct ggml_tensor * sub = ggml_sub(ctx, softmax, threshold_b);

    // Create a mask where probabilities below the threshold are 0 (discard),
    // and others are 1 (keep).
    struct ggml_tensor * mask = ggml_step(ctx, sub);
    ggml_set_name(mask, "min_p_mask");

    // Use ggml_scale_bias (output = (a * s) + b), which in this case becomes:
    // min_p_bias = (mask * 1e9f) - 1e9f.
    // So entries in the mask that we want to discard become -1e9f, and the
    // others become 0 (meaning they do not affect the logits).
    const float large_val = 1e9f;
    struct ggml_tensor * min_p_bias = ggml_scale_bias(ctx, mask, large_val, -large_val);
    ggml_set_name(min_p_bias, "min_p_bias");

    // Add the min_p bias to the logits.
    ggml_data->logits = ggml_add(ctx, ggml_data->logits, min_p_bias);
    ggml_set_name(ggml_data->logits, "min_p_logits");

    ggml_build_forward_expand(gf, ggml_data->logits);
}

static const char * llama_sampler_backend_min_p_name(const struct llama_sampler *) {
    return "backend-min-p";
}

static void llama_sampler_backend_min_p_free(struct llama_sampler * smpl) {
    auto * sctx = (llama_sampler_backend_min_p_ctx *) smpl->ctx;
    delete sctx;
}

static struct llama_sampler * llama_sampler_backend_min_p_clone(const struct llama_sampler * smpl) {
    auto * sctx = (llama_sampler_backend_min_p_ctx *) smpl->ctx;
    return llama_sampler_backend_init_min_p(sctx->p);
}

struct llama_sampler * llama_sampler_backend_init_min_p(float p) {
    static const llama_sampler_i iface = {
        /*.name           =*/ llama_sampler_backend_min_p_name,
        /*.accept         =*/ nullptr,
        /*.apply          =*/ nullptr,
        /*.reset          =*/ nullptr,
        /*.clone          =*/ llama_sampler_backend_min_p_clone,
        /*.free           =*/ llama_sampler_backend_min_p_free,
        /*.apply_ggml     =*/ llama_sampler_backend_min_p_apply_ggml,
        /*.accept_ggml    =*/ nullptr,
        /*.set_input_ggml =*/ nullptr,
        /*.init_ggml      =*/ llama_sampler_backend_min_p_init_ggml,
    };

    auto * sctx = new llama_sampler_backend_min_p_ctx {
        /*.p      =*/ p,
        /*.device =*/ nullptr,
    };

    auto * sampler = new llama_sampler {
        /*.iface =*/ &iface,
        /*.ctx   =*/ sctx,
    };

    return sampler;
}
```
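For readers who want to sanity-check the graph above, here is a small CPU-only sketch of the same min-p math applied to a plain vector of logits: softmax the logits, take a fraction `p` of the top probability as the threshold, and push every below-threshold logit down by a large negative bias. The function name and interface are illustrative only and not part of the PR; the actual change expresses these steps as ggml operations so they run on the backend.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// CPU reference for the min-p masking trick built with ggml_step and
// ggml_scale_bias in the graph above. Illustrative sketch, not PR code.
std::vector<float> min_p_mask_reference(std::vector<float> logits, float p) {
    // Softmax, stabilized by subtracting the max logit before exponentiating.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (float & pr : probs) {
        pr /= sum;
    }

    // The threshold is a fraction p of the highest probability.
    const float p_max     = *std::max_element(probs.begin(), probs.end());
    const float threshold = p * p_max;

    // Mask: logits whose probability falls below the threshold receive a
    // -1e9 bias, so they can never be sampled; the rest are left unchanged.
    const float large_val = 1e9f;
    for (size_t i = 0; i < logits.size(); ++i) {
        if (probs[i] < threshold) {
            logits[i] -= large_val;
        }
    }
    return logits;
}
```

With `p = 0.5`, for example, any token whose probability is less than half that of the most likely token ends up with an effectively minus-infinite logit, which is exactly the effect the `ggml_step` mask plus `ggml_scale_bias` bias produces on the backend.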