
Conversation

@zhxchen17
Contributor

@zhxchen17 zhxchen17 commented Nov 26, 2025

Summary:
Context: we want to update the vLLM pin on PyTorch nightly CI but ran into the following issue:

We observed a bunch of CUDA OOMs on PyTorch CI:
- https://github.com/pytorch/pytorch/actions/runs/19645318969/job/56265977994?pr=166494#step:27:6473

Reducing the GPU memory utilization requirement of these models decreases the chance of OOMing on PyTorch nightly CI.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:


@zhxchen17 zhxchen17 requested a review from noooop as a code owner November 26, 2025 16:17

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses an Out-Of-Memory (OOM) issue in test_siglip on the PyTorch nightly CI by reducing the gpu_memory_utilization to 0.7. This is a direct and effective fix for the problem. The change is well-contained and correctly applies the parameter to the vllm_runner. Overall, this is a good, practical solution to improve CI stability.
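The change the review describes is small; here is a minimal, self-contained sketch of its shape. The context manager below is a hypothetical stand-in for vLLM's real `vllm_runner` test fixture (which builds an actual LLM engine), and the model id is illustrative, not necessarily the one used in `test_siglip`:

```python
from contextlib import contextmanager

@contextmanager
def vllm_runner(model: str, **kwargs):
    # Hypothetical stand-in for vLLM's vllm_runner fixture: the real one
    # constructs an engine; here we just surface the kwargs so the shape
    # of the fix is visible.
    yield {"model": model, **kwargs}

# Cap vLLM at 70% of GPU memory (down from the 0.9 default) so the test
# leaves headroom on PyTorch nightly CI's L4 runners.
with vllm_runner("google/siglip-base-patch16-224",
                 gpu_memory_utilization=0.7) as cfg:
    print(cfg["gpu_memory_utilization"])
```

The key point is that the cap is threaded through as a keyword argument to the runner, leaving the rest of the test untouched.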

@zhxchen17
Contributor Author

cc @zou3519

@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Nov 26, 2025
@zhxchen17 zhxchen17 force-pushed the zhxchen17/torch_ci_oom_2 branch from c91d75f to ecee3fb Compare November 26, 2025 18:53
@DarkLight1337
Member

QQ: Are these tests using nightly pytorch? What GPUs are being used?

@DarkLight1337
Member

I think we should try to figure out why they are failing there but not in vLLM CI rather than directly changing the vLLM tests.

@zhxchen17
Contributor Author

zhxchen17 commented Nov 27, 2025

QQ: Are these tests using nightly pytorch? What GPUs are being used?

@DarkLight1337 I think we're using linux.g6.4xlarge.experimental.nvidia.gpu in pytorch CI, which seems to be L4 machines.

@DarkLight1337
Member

We use L4s by default in vLLM CI

@zhxchen17
Contributor Author

@DarkLight1337 Thanks for the response. From what I can see, vLLM and PyTorch are using the same type of machine, so it seems hardware is not the main reason here.

As @huydhn pointed out to me in another discussion, memory usage seems to be creeping up slowly when moving to newer versions of PyTorch, and PyTorch CI always runs on torch trunk (instead of the stable version we use in vLLM).

As a prior example, we also had to tune down the GPU utilization for some tests when we updated the torch version for vLLM (e.g. 2.9): #24994

Agreed that this doesn't seem sustainable, but given the limited bandwidth we have for debugging these issues, I'm not sure of a better way to resolve this and restore PyTorch trunk CI to green with the latest vLLM version. Happy to learn if there's a better way here. Thanks.
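To put numbers on the mitigation: vLLM budgets roughly `total VRAM × gpu_memory_utilization` for its engine, so on a nominal 24 GiB L4 the drop from the 0.9 default to 0.7 frees several GiB of headroom for the slowly growing baseline usage of nightly torch. A back-of-envelope sketch (exact reservations depend on vLLM's memory profiling, so treat these as upper bounds):

```python
# GPU memory budget vLLM grants itself: total VRAM * gpu_memory_utilization.
# An NVIDIA L4 has 24 GiB of VRAM (nominal figure).
total_gib = 24.0
default_budget = total_gib * 0.9   # vLLM's default utilization
reduced_budget = total_gib * 0.7   # value used by this PR
headroom_gained = default_budget - reduced_budget
print(f"default: {default_budget:.1f} GiB, reduced: {reduced_budget:.1f} GiB, "
      f"extra headroom: {headroom_gained:.1f} GiB")
```

So the change trades some KV-cache capacity for roughly 4.8 GiB of slack on these runners.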

@DarkLight1337
Member

DarkLight1337 commented Nov 27, 2025

Alright then, that's understandable.

Let's wait for the pooling test to be green on main first though (#29578). Then we can unblock this test and see if it can still pass on L4s.

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 28, 2025
@zhxchen17 zhxchen17 changed the title [multimodal][test] Reduce memory utilization for test_siglip to avoid… [multimodal][test] Reduce memory utilization for test_siglip to avoid OOM Nov 29, 2025

Signed-off-by: zhxchen17 <zhxchen17@fb.com>
@zhxchen17 zhxchen17 force-pushed the zhxchen17/torch_ci_oom_2 branch from b144fa9 to f59b406 Compare November 29, 2025 21:18
