Add PagedAttention support (experimental, CUDA only) #17579

ericcurtin · 2025-11-28T19:36:52Z

Implement PagedAttention algorithm for memory-efficient KV cache management. This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging) and enables efficient memory sharing between sequences through copy-on-write semantics.

The implementation is experimental and disabled by default. Enable it with the --pagedattention flag.
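To make the block and copy-on-write idea in the description concrete, here is a minimal host-side sketch. It is not the PR's code: `BlockPool`, `Sequence`, `fork`, `append_token`, and the `BLOCK_SIZE` value of 16 are all hypothetical names and values chosen for illustration, and the actual K/V data copy is left as a comment.

```cpp
#include <cassert>
#include <vector>

// Hypothetical names throughout; this is not the PR's actual data structure.
constexpr int BLOCK_SIZE = 16; // tokens per KV block (assumed value)

struct BlockPool {
    std::vector<int> ref_count; // one reference counter per physical block
    std::vector<int> free_list; // ids of unused physical blocks

    explicit BlockPool(int n_blocks) : ref_count(n_blocks, 0) {
        // Push in reverse so that blocks are handed out as 0, 1, 2, ...
        for (int b = n_blocks - 1; b >= 0; --b) free_list.push_back(b);
    }

    // Take a free physical block; returns -1 when the pool is exhausted.
    int alloc() {
        if (free_list.empty()) return -1;
        const int b = free_list.back();
        free_list.pop_back();
        ref_count[b] = 1;
        return b;
    }

    // Drop one reference; the block returns to the free list at zero.
    void release(int b) {
        if (--ref_count[b] == 0) free_list.push_back(b);
    }
};

struct Sequence {
    std::vector<int> block_table; // logical block index -> physical block id
    int n_tokens = 0;
};

// Fork a sequence: the child shares every block of the parent, so only
// the reference counts change and no KV data is copied.
Sequence fork(BlockPool & pool, const Sequence & parent) {
    for (int b : parent.block_table) pool.ref_count[b]++;
    return parent;
}

// Reserve the slot for one new token and return its physical slot index
// (physical_block * BLOCK_SIZE + offset), or -1 if no block is available.
int append_token(BlockPool & pool, Sequence & seq) {
    const int logical = seq.n_tokens / BLOCK_SIZE;
    const int offset  = seq.n_tokens % BLOCK_SIZE;

    if (offset == 0) {
        // The last block is full (or the sequence is empty): map a new one.
        const int b = pool.alloc();
        if (b < 0) return -1;
        seq.block_table.push_back(b);
    } else if (pool.ref_count[seq.block_table[logical]] > 1) {
        // Copy-on-write: the partially filled block is shared, so this
        // sequence gets a private block before writing into it.
        const int b = pool.alloc();
        if (b < 0) return -1;
        // A real implementation would copy the block's K/V data here.
        pool.release(seq.block_table[logical]);
        seq.block_table[logical] = b;
    }

    seq.n_tokens++;
    return seq.block_table[logical] * BLOCK_SIZE + offset;
}
```

In this sketch, forking a 20-token sequence leaves both copies pointing at the same two physical blocks. The first append to either copy then triggers a private copy of the partially filled second block only, while the full first block stays shared.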

Signed-off-by: Eric Curtin <eric.curtin@docker.com>

ngxson · 2025-11-28T21:31:47Z

ggml/src/ggml-cuda/paged-attention-v1.cu

+            const int token_idx = block_idx * BLOCK_SIZE + i;
+            if (token_idx >= seq_len) break;
+
+            // TODO: Vectorized K loading and Q·K computation


Some of the TODOs look quite sus. I'm wondering whether the code is AI-generated and whether this function actually works.

Besides, it would probably be good to credit the original kernel: https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cuh

I mark it experimental for good reason 🙂

I think it's important to state explicitly whether or not you're using AI to generate this PR. The numerous TODOs throughout the PR do make it look sus. After all, a human will spend real time and effort reviewing this PR.

> I mark it experimental for good reason 🙂

I think this PR should be marked as a draft, until it is no longer experimental

ericcurtin requested review from CISC and ggerganov as code owners November 28, 2025 19:36

ericcurtin force-pushed the add-pagedattention branch 3 times, most recently from 2a33486 to 14ad291 November 28, 2025 19:58

github-actions bot added labels Nov 28, 2025: Nvidia GPU (issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning)

ericcurtin force-pushed the add-pagedattention branch from 14ad291 to 06254d1 November 28, 2025 20:17

loci-dev mentioned this pull request Nov 28, 2025

UPSTREAM PR #17579: Add PagedAttention support (experimental, CUDA only) auroralabs-loci/llama.cpp#352

Open

ericcurtin force-pushed the add-pagedattention branch from 06254d1 to 1745418 November 28, 2025 20:37

ngxson reviewed Nov 28, 2025

ngxson Nov 28, 2025

ericcurtin Nov 28, 2025

ngxson Nov 28, 2025 (edited)

ddh0 Nov 29, 2025

3 participants
