
Conversation

@AlexCuadron
Collaborator

During long decoding, errors compound and explode. A straightforward fix is to regenerate the embeddings at a set interval, which restores the correctness of the KV cache while leveraging the parallelism inherent in prefill.

This was implemented directly inside the Hugging Face class because it needs to be active for every model that uses sparse attention; error compounding is inevitable, since sparse attention is an approximation mechanism. It is also currently tied to how models are deployed inside HF, which is why creating a separate module for it wouldn't make much sense.

By default, the regeneration mechanism is not enabled, on the assumption that most research will use short generations, and because KV-cache regeneration, if it went unnoticed, could invalidate experiments.
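
For illustration, here is a minimal sketch of what periodic regeneration can look like in a greedy decode loop. The function name, the greedy loop, and `regen_interval` are assumptions for this sketch, not the code in this PR:

```python
import torch

@torch.no_grad()
def decode_with_regeneration(model, input_ids, max_new_tokens, regen_interval=64):
    """Greedy decoding that periodically rebuilds the KV cache via a full prefill."""
    generated = input_ids
    past_key_values = None
    for step in range(max_new_tokens):
        if past_key_values is None or (step > 0 and step % regen_interval == 0):
            # Full (dense) prefill over everything generated so far: rebuilds an
            # exact cache and discards error accumulated by the sparse-attention
            # approximation during decode.
            outputs = model(generated, use_cache=True)
        else:
            # Normal incremental decode step reusing the existing cache.
            outputs = model(generated[:, -1:], past_key_values=past_key_values,
                            use_cache=True)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
    return generated
```

The regeneration step is just a prefill, so it benefits from the parallelism mentioned above while resetting the accumulated approximation error.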

apd10 added 30 commits July 16, 2025 22:53
	dynamic GPU allocation
	interrupt handling
	Tool: Cursor
	1. generation_kwargs, request_kwargs passing
	2. dumping configs
	3. refactoring utils
	Tool: Cursor
	1. kv group handling
	2. Adaptive sampling ignore base-sample
@AlexCuadron AlexCuadron requested a review from apd10 July 20, 2025 23:35
@AlexCuadron AlexCuadron changed the base branch from main to feature/benchmark July 20, 2025 23:35
@AlexCuadron
Collaborator Author

I set the base branch to feature/benchmark so that it's easier to evaluate the changes.

- Fixed line length violations (E501) by splitting long lines
- Removed trailing whitespace (W291, W293) throughout codebase
- Fixed visual indentation issues (E129)
- Added missing newline at end of file (W292)
- Updated test regex patterns to match actual error messages
- Improved code readability with better line breaks
- Fixed function parameter formatting

Remaining C901 complexity warnings are acceptable for now.
@@ -0,0 +1,177 @@
"""Adapter classes to bridge sparse attention implementations with HuggingFace attention interface."""

Collaborator

What is the purpose of this file?

model_kwargs: Optional[Dict[str, Any]] = None,
tokenizer_kwargs: Optional[Dict[str, Any]] = None,
device: Optional[str] = None,
recovery_enabled: bool = False,
Collaborator

move to recovery_args
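
A hedged sketch of what that grouping might look like; the dataclass and field names are assumptions for illustration, not an agreed interface:

```python
from dataclasses import dataclass

@dataclass
class RecoveryArgs:
    enabled: bool = False        # replaces the flat recovery_enabled flag
    regen_interval: int = 64     # decode steps between KV-cache regenerations

# The constructor would then take a single optional recovery_args parameter
# instead of separate recovery-related flags, e.g.:
#   recovery_args: Optional[RecoveryArgs] = None,
```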

)

# Remove the generated tokens from the cache to restore original context state
current_cache_length = context_outputs.past_key_values.get_seq_length()
Collaborator
@apd10 apd10 Jul 21, 2025

Is this trying to fix the context for multiple questions?

I intentionally chose not to reuse the context (see the main loop over questions: the model is called on the context again for each question), even for decent-sized contexts, for two reasons:

  1. Running flash attention is actually quite fast.
  2. We would also have to do the same cleanup for sparse_attention_meta_data, which would require knowing its structure. I want sparse_attention_meta_data to be as general as possible so that maskers can add anything to it as they please, and this class should stay agnostic of what happens inside it.
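
For reference, the cache half of that cleanup is small; this is a hedged sketch assuming a transformers DynamicCache that supports crop(), with a hypothetical function name. The part with no general structure is sparse_attention_meta_data, which is what makes the cleanup approach unattractive:

```python
def trim_cache_to_context(cache, context_length: int) -> None:
    """Drop the KV entries of generated tokens, restoring the original context state."""
    if cache.get_seq_length() > context_length:
        cache.crop(context_length)
```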


return answer

def _reset_kv_cache_sliding_window(self, cache: Any, reset_interval: int) -> None:
Collaborator

What is the purpose of this function? I don't see it being used anywhere.

Collaborator

I think this code is duplicated in the regenerate_embeddings function.

@apd10
Collaborator

apd10 commented Jul 21, 2025

I think there is dead code in the pull request -- look at the comments.
Also,

  1. Move the recovery logic to a separate class in huggingface_recovery.py -- keep huggingface.py lean (see the sketch below).
  2. Add tests for the new recovery functions introduced.
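
A hedged sketch of how that separation might look; the class and method names are assumptions, not an agreed design:

```python
# huggingface_recovery.py -- hypothetical skeleton of the suggested refactor.
from typing import Any


class KVCacheRecovery:
    """Encapsulates periodic KV-cache regeneration so huggingface.py stays lean."""

    def __init__(self, enabled: bool = False, regen_interval: int = 64) -> None:
        self.enabled = enabled
        self.regen_interval = regen_interval

    def should_regenerate(self, decode_step: int) -> bool:
        # Regenerate every regen_interval decode steps once enabled.
        return self.enabled and decode_step > 0 and decode_step % self.regen_interval == 0

    def regenerate(self, model: Any, token_ids: Any) -> Any:
        # A full dense prefill over everything generated so far rebuilds an exact
        # cache, discarding error accumulated by sparse attention during decode.
        outputs = model(token_ids, use_cache=True)
        return outputs.past_key_values
```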

@apd10 apd10 force-pushed the feature/benchmark branch from 8b60751 to a87d271 Compare July 25, 2025 10:21
@AlexCuadron AlexCuadron changed the base branch from feature/benchmark to main July 27, 2025 06:40