[multimodal][test] Reduce memory utilization for test_siglip to avoid OOM #29504
base: main
Conversation
Code Review
This pull request addresses an Out-Of-Memory (OOM) issue in test_siglip on the PyTorch nightly CI by reducing the gpu_memory_utilization to 0.7. This is a direct and effective fix for the problem. The change is well-contained and correctly applies the parameter to the vllm_runner. Overall, this is a good, practical solution to improve CI stability.
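The change boils down to passing a lower `gpu_memory_utilization` (0.7, down from vLLM's 0.9 default) to the `vllm_runner` fixture in test_siglip. As a rough illustration of why this adds headroom on the 24 GiB L4 GPUs both CIs use, here is a minimal sketch of the budget arithmetic; the helper function and the GiB figures are illustrative, not code from the PR:

```python
def memory_budget_gib(total_gib: float, utilization: float) -> float:
    """Return how many GiB vLLM may claim for weights + KV cache at a
    given gpu_memory_utilization fraction.

    Illustrative helper, not part of the vLLM codebase.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_gib * utilization


# L4 GPUs expose roughly 24 GiB of device memory.
default_budget = memory_budget_gib(24.0, 0.9)  # ~21.6 GiB
reduced_budget = memory_budget_gib(24.0, 0.7)  # ~16.8 GiB

# Dropping utilization from 0.9 to 0.7 leaves ~4.8 GiB of headroom
# for allocations outside vLLM's budget (e.g. a newer torch runtime).
print(f"headroom gained: {default_budget - reduced_budget:.1f} GiB")
```

Note that `gpu_memory_utilization` only caps vLLM's own allocation; it does not account for memory claimed by other libraries on the same device, which is why a creeping torch-nightly footprint can push the combined usage over the limit.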
cc @zou3519
QQ: Are these tests using nightly pytorch? What GPUs are being used?
I think we should try to figure out why they are failing there but not in vLLM CI rather than directly changing the vLLM tests.
@DarkLight1337 I think we're using linux.g6.4xlarge.experimental.nvidia.gpu in pytorch CI, which seems to be L4 machines.
We use L4s by default in vLLM CI.
@DarkLight1337 Thanks for the response. From what I can see, vllm and pytorch are using the same type of machine, so hardware does not seem to be the main reason here. As @huydhn pointed out to me in another discussion, memory usage seems to be creeping up slowly when moving to newer versions of PyTorch, and pytorch CI always runs on torch trunk (instead of the stable version we use in vllm). As a prior example, we had to tune down the gpu utilization for some tests when we updated the torch version for vllm (e.g. 2.9): #24994. Agreed that this doesn't seem sustainable, but given the limited bandwidth we spend on debugging these issues, I'm not sure what other, better ways there are to resolve this and restore pytorch trunk CI to green with the latest vllm version. Happy to learn if there's a better way here. Thanks.
Alright then, that's understandable. Let's wait for the pooling test to be green on main first, though (#29578). Then we can unblock this test and see if it can still pass on L4s.
… OOM.

Summary:

Context: we want to update the vllm pin on pytorch nightly CI but ran into the following issue. We observed a bunch of cuda OOMs on pytorch CI:
- https://github.com/pytorch/pytorch/actions/runs/19645318969/job/56265977994?pr=166494#step:27:6473

Reducing the utilization requirement of these models decreases the chance of OOMing on pytorch nightly CI.

Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:

Signed-off-by: zhxchen17 <zhxchen17@fb.com>
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.