
Conversation

@zhxchen17
Contributor

@zhxchen17 zhxchen17 commented Nov 26, 2025

Summary:
Context: we want to update the vLLM pin on PyTorch nightly CI but ran into the following issue:

We observed a bunch of CUDA OOMs on PyTorch CI:
- https://github.com/pytorch/pytorch/actions/runs/19645318969/job/56265977994?pr=166494#step:27:6473

Reducing the GPU memory utilization requirement of these models decreases the chance of OOMing on PyTorch nightly CI.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:


@zhxchen17 zhxchen17 requested a review from noooop as a code owner November 26, 2025 16:17

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses an Out-Of-Memory (OOM) issue in test_siglip on the PyTorch nightly CI by reducing the gpu_memory_utilization to 0.7. This is a direct and effective fix for the problem. The change is well-contained and correctly applies the parameter to the vllm_runner. Overall, this is a good, practical solution to improve CI stability.
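The change the review describes is small; here is a minimal, self-contained sketch of its shape. The context manager below is a hypothetical stand-in for vLLM's real `vllm_runner` test fixture (which builds an actual LLM engine), and the model id is illustrative, not necessarily the one used in `test_siglip`:

```python
from contextlib import contextmanager

@contextmanager
def vllm_runner(model: str, **kwargs):
    # Hypothetical stand-in for vLLM's vllm_runner fixture: the real one
    # constructs an engine; here we just surface the kwargs so the shape
    # of the fix is visible.
    yield {"model": model, **kwargs}

# Cap vLLM at 70% of GPU memory (down from the 0.9 default) so the test
# leaves headroom on PyTorch nightly CI's L4 runners.
with vllm_runner("google/siglip-base-patch16-224",
                 gpu_memory_utilization=0.7) as cfg:
    print(cfg["gpu_memory_utilization"])
```

The key point is that the cap is threaded through as a keyword argument to the runner, leaving the rest of the test untouched.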

@zhxchen17
Contributor Author

cc @zou3519

@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Nov 26, 2025
@zhxchen17 zhxchen17 force-pushed the zhxchen17/torch_ci_oom_2 branch from c91d75f to ecee3fb Compare November 26, 2025 18:53
@DarkLight1337
Member

QQ: Are these tests using nightly pytorch? What GPUs are being used?

@DarkLight1337
Member

I think we should try to figure out why they are failing there but not in vLLM CI rather than directly changing the vLLM tests.

@zhxchen17
Contributor Author

zhxchen17 commented Nov 27, 2025

QQ: Are these tests using nightly pytorch? What GPUs are being used?

@DarkLight1337 I think we're using linux.g6.4xlarge.experimental.nvidia.gpu in pytorch CI, which seems to be L4 machines.

@DarkLight1337
Member

We use L4s by default in vLLM CI

@zhxchen17
Contributor Author

@DarkLight1337 Thanks for the response. From what I can see, vLLM and PyTorch are using the same type of machine, so it seems hardware is not the main reason here.

As @huydhn pointed out to me in another discussion, memory usage seems to be creeping up slowly when moving to newer versions of PyTorch, and PyTorch CI always runs on torch trunk (instead of the stable version we use in vLLM).

As a prior example, we also had to tune down the GPU utilization for some tests when we updated the torch version for vLLM (e.g. 2.9): #24994

Agreed that this doesn't seem sustainable, but given the limited bandwidth we have for debugging these issues, I'm not sure of a better way to resolve this and restore PyTorch trunk CI to green with the latest vLLM version. Happy to learn if there's a better way here. Thanks.
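To put numbers on the mitigation: vLLM budgets roughly `total VRAM × gpu_memory_utilization` for its engine, so on a nominal 24 GiB L4 the drop from the 0.9 default to 0.7 frees several GiB of headroom for the slowly growing baseline usage of nightly torch. A back-of-envelope sketch (exact reservations depend on vLLM's memory profiling, so treat these as upper bounds):

```python
# GPU memory budget vLLM grants itself: total VRAM * gpu_memory_utilization.
# An NVIDIA L4 has 24 GiB of VRAM (nominal figure).
total_gib = 24.0
default_budget = total_gib * 0.9   # vLLM's default utilization
reduced_budget = total_gib * 0.7   # value used by this PR
headroom_gained = default_budget - reduced_budget
print(f"default: {default_budget:.1f} GiB, reduced: {reduced_budget:.1f} GiB, "
      f"extra headroom: {headroom_gained:.1f} GiB")
```

So the change trades some KV-cache capacity for roughly 4.8 GiB of slack on these runners.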

@DarkLight1337
Member

DarkLight1337 commented Nov 27, 2025

Alright then, that's understandable.

Let's wait for the pooling test to be green on main first though (#29578). Then we can unblock this test and see if it can still pass on L4s.

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 28, 2025
@zhxchen17 zhxchen17 changed the title [multimodal][test] Reduce memory utilization for test_siglip to avoid… [multimodal][test] Reduce memory utilization for test_siglip to avoid OOM Nov 29, 2025

Signed-off-by: zhxchen17 <zhxchen17@fb.com>
@zhxchen17 zhxchen17 force-pushed the zhxchen17/torch_ci_oom_2 branch from b144fa9 to f59b406 Compare November 29, 2025 21:18
