Feature/recovery Added dense regeneration mechanism #8
base: main
Conversation
Tool: Cursor
Tool: Cursor
dynamic GPU allocation interrupt handling Tool: Cursor
1. generation_kargs, request_kwargs passing 2. dumping configs 3. refactoring utils
Tool: Human
Tool: Cursor
Tool: Cursor
Tool: Cursor
Tool: Cursor
1. kv group handling 2. Adaptive sampling ignore base-sample
I set the base branch to benchmark so that it's easier to evaluate the changes
- Fixed line length violations (E501) by splitting long lines
- Removed trailing whitespace (W291, W293) throughout codebase
- Fixed visual indentation issues (E129)
- Added missing newline at end of file (W292)
- Updated test regex patterns to match actual error messages
- Improved code readability with better line breaks
- Fixed function parameter formatting

Remaining C901 complexity warnings are acceptable for now.
@@ -0,0 +1,177 @@
"""Adapter classes to bridge sparse attention implementations with HuggingFace attention interface."""
What is the purpose of this file?
model_kwargs: Optional[Dict[str, Any]] = None,
tokenizer_kwargs: Optional[Dict[str, Any]] = None,
device: Optional[str] = None,
recovery_enabled: bool = False,
move to recovery_args
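A hypothetical sketch of this suggestion: group the recovery-related options into a single `recovery_args` container instead of separate top-level flags. The names `RecoveryArgs`, `regeneration_interval`, and `SparseModelAdapter` below are illustrative assumptions, not the PR's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryArgs:
    """Container for the recovery/regeneration options (illustrative names)."""
    enabled: bool = False
    # Number of decode steps between dense regenerations (assumed parameter).
    regeneration_interval: Optional[int] = None


# The constructor would then take a single argument, e.g.:
#   model = SparseModelAdapter(..., recovery_args=RecoveryArgs(enabled=True))
```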
)

# Remove the generated tokens from the cache to restore original context state
current_cache_length = context_outputs.past_key_values.get_seq_length()
Is this trying to fix the context for multiple questions?
I intentionally chose not to reuse the context (see the main loop over questions: the model is called again on the context for each question), even for decent-sized contexts. The choice is due to two reasons:
- running flash attention is actually quite fast
- we would also have to do the same cleaning for sparse_attention_meta_data, which would require knowing its structure. I want sparse_attention_meta_data to be as general as possible so that maskers can add anything to it as they please, and this class should be agnostic of what happens in sparse_attention_meta_data.
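For reference, a minimal sketch of the cache trimming the quoted diff appears to perform, assuming a recent transformers DynamicCache (which exposes get_seq_length() and crop()); the helper name and the context_length argument are illustrative, not the PR's code.

```python
def trim_generated_tokens(context_outputs, context_length: int) -> None:
    """Drop KV entries for generated answer tokens so the cache again covers
    only the original context (illustrative helper, not the PR's code)."""
    cache = context_outputs.past_key_values
    current_cache_length = cache.get_seq_length()
    if current_cache_length > context_length:
        # DynamicCache.crop keeps only the first `context_length` positions.
        cache.crop(context_length)
```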
return answer

def _reset_kv_cache_sliding_window(self, cache: Any, reset_interval: int) -> None:
What is the purpose of this function? I don't see it being used anywhere.
I think this code is repeated in the regenerate_embeddings function.
I think there is dead code in the PR; look at the comments.
During long decoding, errors compound and explode. A quick and easy way to fix this is to regenerate the embeddings at a set interval, which maintains the correctness of the KV cache while leveraging the parallelism inherent in prefill.
This is implemented directly inside the HuggingFace class, since it needs to be active for every model that uses sparse attention: error compounding is inevitable because sparse attention is an approximation mechanism. It is also currently tied to how models are deployed inside HF, which is why creating a separate module for it wouldn't make much sense.
By default, the regeneration mechanism is not enabled, on the assumption that most research will be conducted with short generations, where this KV cache regeneration (if unnoticed) could invalidate experiments.
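A minimal sketch of the regeneration idea under stated assumptions (greedy decoding, a HuggingFace causal LM, and an illustrative regeneration_interval parameter); this is not the PR's actual implementation, which lives inside the HuggingFace model class.

```python
import torch


@torch.no_grad()
def generate_with_dense_regeneration(model, input_ids, max_new_tokens, regeneration_interval=256):
    """Greedy decoding that periodically rebuilds the KV cache with a dense prefill,
    so errors accumulated under sparse attention do not compound indefinitely.
    Illustrative sketch only; names are assumptions."""
    generated = input_ids
    past_key_values = None

    for step in range(max_new_tokens):
        if past_key_values is None or (step > 0 and step % regeneration_interval == 0):
            # Dense prefill over the prompt plus everything generated so far:
            # the exact KV cache replaces the approximate one built during decoding.
            outputs = model(generated, use_cache=True)
        else:
            # Normal incremental decoding step (sparse attention, if enabled).
            outputs = model(generated[:, -1:], past_key_values=past_key_values, use_cache=True)

        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

    return generated
```

The interval trades the extra prefill cost against how much approximation error is allowed to accumulate between regenerations.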