
Conversation

@AlexCuadron
Collaborator

During long decoding, errors compound and explode. A straightforward fix is to regenerate the embeddings at a set interval, which restores the correctness of the KV cache while leveraging the parallelism inherent in prefill.

This was implemented directly inside the Hugging Face class because it needs to be active for every model that uses sparse attention; error compounding is inevitable, since sparse attention is an approximation mechanism. It is also currently tied to how models are deployed inside HF, which is why creating a separate module for it wouldn't make much sense.

By default, the regeneration mechanism is not enabled, on the assumption that most research will use short generations, and because KV-cache regeneration, if it went unnoticed, could invalidate experiments.
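
For illustration, here is a minimal sketch of what periodic regeneration can look like in a greedy decode loop. The function name, the greedy loop, and `regen_interval` are assumptions for this sketch, not the code in this PR:

```python
import torch

@torch.no_grad()
def decode_with_regeneration(model, input_ids, max_new_tokens, regen_interval=64):
    """Greedy decoding that periodically rebuilds the KV cache via a full prefill."""
    generated = input_ids
    past_key_values = None
    for step in range(max_new_tokens):
        if past_key_values is None or (step > 0 and step % regen_interval == 0):
            # Full (dense) prefill over everything generated so far: rebuilds an
            # exact cache and discards error accumulated by the sparse-attention
            # approximation during decode.
            outputs = model(generated, use_cache=True)
        else:
            # Normal incremental decode step reusing the existing cache.
            outputs = model(generated[:, -1:], past_key_values=past_key_values,
                            use_cache=True)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
    return generated
```

The regeneration step is just a prefill, so it benefits from the parallelism mentioned above while resetting the accumulated approximation error.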

apd10 added 30 commits July 16, 2025 22:53
	dynamic GPU allocation
	interrupt handling
	Tool: Cursor
	1. generation_kwargs, request_kwargs passing
	2. dumping configs
	3. refactoring utils
	Tool: Cursor
	1. kv group handling
	2. Adaptive sampling ignore base-sample
@AlexCuadron AlexCuadron requested a review from apd10 July 20, 2025 23:35
@AlexCuadron AlexCuadron changed the base branch from main to feature/benchmark July 20, 2025 23:35
@AlexCuadron
Collaborator Author

I set the base branch to feature/benchmark so that it's easier to evaluate the changes.

- Fixed line length violations (E501) by splitting long lines
- Removed trailing whitespace (W291, W293) throughout codebase
- Fixed visual indentation issues (E129)
- Added missing newline at end of file (W292)
- Updated test regex patterns to match actual error messages
- Improved code readability with better line breaks
- Fixed function parameter formatting

Remaining C901 complexity warnings are acceptable for now.
@@ -0,0 +1,177 @@
"""Adapter classes to bridge sparse attention implementations with HuggingFace attention interface."""

Collaborator

What is the purpose of this file?

model_kwargs: Optional[Dict[str, Any]] = None,
tokenizer_kwargs: Optional[Dict[str, Any]] = None,
device: Optional[str] = None,
recovery_enabled: bool = False,
Collaborator

move to recovery_args
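
A hedged sketch of what that grouping might look like; the dataclass and field names are assumptions for illustration, not an agreed interface:

```python
from dataclasses import dataclass

@dataclass
class RecoveryArgs:
    enabled: bool = False        # replaces the flat recovery_enabled flag
    regen_interval: int = 64     # decode steps between KV-cache regenerations

# The constructor would then take a single optional recovery_args parameter
# instead of separate recovery-related flags, e.g.:
#   recovery_args: Optional[RecoveryArgs] = None,
```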

)

# Remove the generated tokens from the cache to restore original context state
current_cache_length = context_outputs.past_key_values.get_seq_length()
Collaborator
@apd10 apd10 Jul 21, 2025

Is this trying to fix the context for multiple questions?

I intentionally chose not to reuse the context (see the main loop over questions: the model is called on the context again for each question), even for decent-sized contexts, for two reasons:

  1. Running flash attention is actually quite fast.
  2. We would also have to do the same cleanup for sparse_attention_meta_data, which would require knowing its structure. I want sparse_attention_meta_data to be as general as possible so that maskers can add anything to it as they please, and this class should stay agnostic of what happens inside it.
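
For reference, the cache half of that cleanup is small; this is a hedged sketch assuming a transformers DynamicCache that supports crop(), with a hypothetical function name. The part with no general structure is sparse_attention_meta_data, which is what makes the cleanup approach unattractive:

```python
def trim_cache_to_context(cache, context_length: int) -> None:
    """Drop the KV entries of generated tokens, restoring the original context state."""
    if cache.get_seq_length() > context_length:
        cache.crop(context_length)
```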


return answer

def _reset_kv_cache_sliding_window(self, cache: Any, reset_interval: int) -> None:
Collaborator

What is the purpose of this function? I don't see it being used anywhere.

Collaborator

I think this code is duplicated in the regenerate_embeddings function.

@apd10
Collaborator

apd10 commented Jul 21, 2025

I think there is dead code in the pull request -- look at the comments.
Also,

  1. Move the recovery logic to a separate class in huggingface_recovery.py -- keep huggingface.py lean (see the sketch below).
  2. Add tests for the new recovery functions introduced.
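
A hedged sketch of how that separation might look; the class and method names are assumptions, not an agreed design:

```python
# huggingface_recovery.py -- hypothetical skeleton of the suggested refactor.
from typing import Any


class KVCacheRecovery:
    """Encapsulates periodic KV-cache regeneration so huggingface.py stays lean."""

    def __init__(self, enabled: bool = False, regen_interval: int = 64) -> None:
        self.enabled = enabled
        self.regen_interval = regen_interval

    def should_regenerate(self, decode_step: int) -> bool:
        # Regenerate every regen_interval decode steps once enabled.
        return self.enabled and decode_step > 0 and decode_step % self.regen_interval == 0

    def regenerate(self, model: Any, token_ids: Any) -> Any:
        # A full dense prefill over everything generated so far rebuilds an exact
        # cache, discarding error accumulated by sparse attention during decode.
        outputs = model(token_ids, use_cache=True)
        return outputs.past_key_values
```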

@apd10 apd10 force-pushed the feature/benchmark branch from 8b60751 to a87d271 Compare July 25, 2025 10:21
@AlexCuadron AlexCuadron changed the base branch from feature/benchmark to main July 27, 2025 06:40