
Conversation


@HydrogenSulfate HydrogenSulfate commented Nov 11, 2025

This pull request updates the PaddlePaddle dependencies to nightly builds, refactors device handling throughout the codebase to support more flexible device selection (including XPU), improves compatibility, and adds better flag management for tests. The most significant changes are grouped below.

Dependency Updates:

  • Updated PaddlePaddle and PaddlePaddle-GPU installation in .github/workflows/test_python.yml and .github/workflows/test_cuda.yml to use nightly builds (3.3.0.dev20251204) from new package URLs, improving access to the latest features and fixes. [1] [2]

Device Handling Refactor:

  • Replaced usage of paddle.device.cuda.device_count() and related CUDA-specific APIs with more general paddle.device.device_count() and paddle.device.empty_cache() in multiple locations, enabling support for devices beyond CUDA (e.g., XPU). [1] [2] [3] [4] [5]
  • Updated logic for setting and retrieving device information in deepmd/pd/utils/env.py to use paddle.device.get_device(), ensuring correct device assignment for both CPU and GPU/XPU scenarios.
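
A minimal sketch of the resulting selection logic, assuming only the generic paddle.device.* APIs named above (variable names are illustrative, not the literal env.py code):

```python
import paddle

# Prefer an accelerator (GPU/XPU) when one is visible; otherwise use CPU.
if paddle.device.device_count() > 0:
    DEVICE = paddle.device.get_device()  # e.g. "gpu:0" or "xpu:0"
else:
    DEVICE = "cpu"
paddle.device.set_device(DEVICE)

# Cache clearing is likewise device-agnostic:
if DEVICE != "cpu":
    paddle.device.empty_cache()
```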

Device Compatibility Improvements:

  • Enhanced get_generator in deepmd/pd/utils/utils.py to support XPU devices and added a warning for unsupported device types, improving compatibility and error messaging.
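
A self-contained sketch of the warning path described above (the helper below is hypothetical and only mirrors the device dispatch; the real get_generator constructs Paddle generators rather than returning a boolean):

```python
import warnings

SUPPORTED_DEVICE_PREFIXES = ("cpu", "gpu", "xpu")

def is_generator_device_supported(device: str) -> bool:
    """Mirror get_generator's device check: warn and signal fallback."""
    if device.split(":")[0] in SUPPORTED_DEVICE_PREFIXES:
        return True
    warnings.warn(
        f"Device {device!r} is not supported for random generators; "
        "a None generator will be used instead.",
        stacklevel=2,  # attribute the warning to the caller (Ruff B028)
    )
    return False
```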

Test Flag Management:

  • Added explicit management of the FLAGS_use_stride_kernel Paddle flag in source/tests/pd/test_multitask.py to ensure proper test isolation, restoring flag values after tests (see the sketch after this list). [1] [2]
  • Set FLAGS_use_stride_compute_kernel environment variable to 0 in workflow files to control kernel usage during tests. [1] [2]
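
A sketch of the capture/restore pattern using Paddle's paddle.get_flags/paddle.set_flags API (the test class below is illustrative, not copied from test_multitask.py):

```python
import unittest

import paddle

FLAG = "FLAGS_use_stride_kernel"

class StrideFlagIsolationTest(unittest.TestCase):
    def setUp(self):
        # Capture the current flag value so other tests are unaffected.
        self._saved = paddle.get_flags([FLAG])[FLAG]
        paddle.set_flags({FLAG: False})  # temporarily disable for this test

    def tearDown(self):
        # Restore the original value regardless of the test outcome.
        paddle.set_flags({FLAG: self._saved})
```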

Distributed Training Check:

  • Improved NCCL initialization check in distributed training setup to handle cases where NCCL is not compiled, preventing assertion errors.
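
A hedged sketch of the relaxed check; the two introspection helpers are assumptions standing in for whatever entrypoints/main.py actually calls, so getattr keeps the sketch runnable even when they are absent:

```python
import paddle

def assert_nccl_if_compiled() -> None:
    """Skip the NCCL version assertion when Paddle was not built with NCCL."""
    compiled_with_nccl = getattr(
        paddle.device, "is_compiled_with_nccl", lambda: False
    )()
    if not compiled_with_nccl:
        return  # e.g. XPU builds use other communication libraries
    nccl_version = getattr(paddle.device, "get_nccl_version", lambda: 0)()
    assert nccl_version > 0, "Paddle was compiled with NCCL but none was found"
```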

Copilot AI review requested due to automatic review settings November 11, 2025 12:55
Copilot finished reviewing on behalf of HydrogenSulfate November 11, 2025 12:59

Copilot AI left a comment


Pull Request Overview

This pull request updates the PaddlePaddle backend to use generic device management APIs instead of CUDA-specific ones, improving compatibility with different hardware backends like XPU. The changes replace paddle.device.cuda.* calls with paddle.device.* equivalents and enhance device selection logic.

  • Replaced CUDA-specific API calls with generic device APIs across multiple files
  • Enhanced get_generator function to support XPU devices and handle unsupported device types gracefully
  • Modified device selection in env.py to use paddle.device.get_device() instead of explicit GPU-LOCAL_RANK construction

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

  • source/tests/pd/conftest.py: Updated cache clearing to use the generic paddle.device.empty_cache() instead of the CUDA-specific version
  • deepmd/pd/utils/utils.py: Added XPU generator support and warning handling for unsupported devices; added a warnings import
  • deepmd/pd/utils/env.py: Changed device selection to use paddle.device.get_device() and paddle.device.device_count() for multi-backend support
  • deepmd/pd/utils/auto_batch_size.py: Updated the device availability check and cache clearing to generic APIs
  • deepmd/pd/entrypoints/main.py: Modified the NCCL check to support non-NCCL backends and updated the device count API



coderabbitai bot commented Nov 11, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Replaced CUDA-specific Paddle calls with device-agnostic APIs, added XPU generator support, relaxed the NCCL assertion when Paddle isn't built with NCCL, conditionalized device cache clearing in test teardown, and updated CI workflows to use nightly/pre-release Paddle wheels and a new env flag.

Changes

  • Device API abstraction (deepmd/pd/entrypoints/main.py, deepmd/pd/utils/auto_batch_size.py, deepmd/pd/utils/env.py, source/tests/pd/conftest.py): Replaced CUDA-only calls with device-agnostic APIs (paddle.device.cuda.device_count() → paddle.device.device_count(), paddle.device.cuda.empty_cache() → paddle.device.empty_cache()); env.py now selects the device via paddle.device.get_device()/paddle.device.set_device(DEVICE) when devices are present.
  • Multi-GPU / NCCL check (deepmd/pd/entrypoints/main.py): Relaxed the NCCL assertion: only require nccl_version > 0 when Paddle reports it was compiled with NCCL; skip the check when Paddle is not compiled with NCCL.
  • Device support & RNG (deepmd/pd/utils/utils.py): Added XPU handling (xpu / xpu:*) for generator creation; imported warnings; unsupported devices now emit a UserWarning and return None instead of raising.
  • Tests teardown update (source/tests/pd/conftest.py): Teardown clears the device cache only when the active device is not CPU (checked via paddle.device.get_device()), using paddle.device.empty_cache().
  • CI workflow updates (.github/workflows/test_cuda.yml, .github/workflows/test_python.yml): Switched CI to nightly/pre-release Paddle wheel URLs and added FLAGS_use_stride_compute_kernel: 0 to the workflow envs.
  • Tests: flags handling (source/tests/pd/test_multitask.py): Tests capture and restore FLAGS_use_stride_kernel, temporarily disable it during tests, and set learning_rate.start_lr in some setups.
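
As a hedged illustration, the conditional teardown described for source/tests/pd/conftest.py could look like this pytest sketch (the fixture name is illustrative):

```python
import paddle
import pytest

@pytest.fixture(autouse=True)
def clear_device_cache_after_test():
    yield
    # Clear the allocator cache only when a non-CPU device is active;
    # paddle.device.empty_cache() is the device-agnostic replacement
    # for the old paddle.device.cuda.empty_cache().
    if paddle.device.get_device() != "cpu":
        paddle.device.empty_cache()
```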

Sequence Diagram(s)

sequenceDiagram
  participant Runner as App / Test runner
  participant Env as env.py
  participant Main as entrypoints/main.py
  participant Utils as utils.get_generator
  participant Paddle as Paddle API

  Runner->>Env: init
  Env->>Paddle: paddle.device.device_count()
  alt device count > 0
    Paddle-->>Env: n>0
    Env->>Paddle: paddle.device.get_device()
    Paddle-->>Env: DEVICE (e.g., "gpu:0"/"xpu:0")
    Env->>Paddle: paddle.device.set_device(DEVICE)
  else no devices
    Paddle-->>Env: 0
    Env->>Paddle: paddle.device.set_device("cpu")
  end

  Runner->>Main: init multi-GPU
  Main->>Paddle: compiled_with_nccl?
  alt compiled_with_nccl == true
    Main->>Paddle: get_nccl_version()
    Paddle-->>Main: nccl_version
    alt nccl_version > 0
      Main-->>Runner: proceed
    else
      Main-->>Runner: assert/fail
    end
  else
    Main-->>Runner: skip NCCL check
  end

  Runner->>Utils: request RNG for DEVICE
  Utils->>Paddle: inspect DEVICE
  alt DEVICE startswith "xpu"
    Utils->>Paddle: create XPU generator
    Paddle-->>Utils: generator
  else DEVICE startswith "gpu" or DEVICE == "cpu"
    Utils->>Paddle: create GPU/CPU generator
    Paddle-->>Utils: generator
  else unsupported
    Utils-->>Runner: warn + return None
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Files needing extra attention:
    • deepmd/pd/entrypoints/main.py — verify compiled_with_nccl detection and get_nccl_version() semantics.
    • deepmd/pd/utils/utils.py — XPU generator APIs and device-string handling.
    • deepmd/pd/utils/env.py — device selection logic and interaction with local-rank conventions.
    • CI workflow files — validate wheel URLs and the effect of FLAGS_use_stride_compute_kernel.

Suggested reviewers

  • njzjz
  • wanghan-iapcm
  • iProzd

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 33.33%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title accurately summarizes the main objective of the pull request: adapting code for hardware compatibility by replacing CUDA-specific APIs with generic device APIs.
  • Description check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f25e16 and fa7bb22.

📒 Files selected for processing (5)
  • deepmd/pd/entrypoints/main.py (2 hunks)
  • deepmd/pd/utils/auto_batch_size.py (2 hunks)
  • deepmd/pd/utils/env.py (1 hunks)
  • deepmd/pd/utils/utils.py (2 hunks)
  • source/tests/pd/conftest.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/pd/utils/auto_batch_size.py
  • source/tests/pd/conftest.py
  • deepmd/pd/entrypoints/main.py
  • deepmd/pd/utils/env.py
  • deepmd/pd/utils/utils.py
🪛 Ruff (0.14.4)
deepmd/pd/utils/utils.py

357-357: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (31)
  • GitHub Check: Agent
  • GitHub Check: CodeQL analysis (python)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Analyze (python)
🔇 Additional comments (7)
source/tests/pd/conftest.py (1)

9-9: LGTM! Device-agnostic API correctly applied.

The change from paddle.device.cuda.empty_cache() to paddle.device.empty_cache() aligns with the PR objective to support multiple hardware backends.

deepmd/pd/entrypoints/main.py (2)

98-98: LGTM! Relaxed NCCL assertion for multi-backend support.

The assertion now allows multi-GPU initialization when PaddlePaddle is not compiled with NCCL, which is appropriate for non-CUDA backends like XPU that may use different communication libraries.


217-217: LGTM! Device-agnostic API correctly applied.

The change from paddle.device.cuda.device_count() to paddle.device.device_count() enables support for multiple hardware backends. Note that the method name get_ngpus() now returns the count of any available compute devices (GPU, XPU, etc.), which is appropriate for the broader hardware support.

deepmd/pd/utils/auto_batch_size.py (2)

31-39: LGTM! Device-agnostic API correctly applied.

The change from paddle.device.cuda.device_count() to paddle.device.device_count() enables support for multiple hardware backends. Note that the method name is_gpu_available() now effectively checks for any compute device availability (GPU, XPU, etc.), not just GPU, which aligns with the broader hardware support goal.


54-54: LGTM! Device-agnostic API correctly applied.

The change from paddle.device.cuda.empty_cache() to paddle.device.empty_cache() enables proper memory cleanup across different hardware backends.

deepmd/pd/utils/utils.py (1)

349-354: LGTM! XPU device support correctly implemented.

The XPU device branches follow the same pattern as GPU device handling and correctly initialize XPU generators for both single-device ("xpu") and indexed-device ("xpu:N") configurations.

deepmd/pd/utils/env.py (1)

32-37: The review comment is based on an incorrect assumption about the DEVICE environment variable's purpose.

The DEVICE environment variable is designed as a CPU-override flag, not a general device selector. This is confirmed by:

  1. Identical implementation pattern across frameworks: Both the PyTorch (deepmd/pt/utils/env.py line 40-43) and PaddlePaddle implementations use the same logic—only checking if DEVICE == "cpu", never using its value to select non-CPU devices like "xpu:0".

  2. Architectural design: For GPU/accelerator selection in distributed training, both frameworks use LOCAL_RANK (PyTorch) or rely on get_device() with device assignment logic (PaddlePaddle). The DEVICE variable is separate and specifically for the CPU-forcing use case.

  3. Consistency: The comment "Make sure DDP uses correct device if applicable" and the pattern across both backends indicate this is intentional—DEVICE forces CPU when needed, while the framework handles proper GPU selection via other mechanisms.

The current code correctly respects the DEVICE environment variable for its intended purpose: forcing CPU execution when DEVICE == "cpu" is set. The suggested fix to support arbitrary device strings like "xpu:0" would require XPU-specific handling logic, which is a feature addition unrelated to the stated PR goal of respecting the DEVICE environment variable.

Likely an incorrect or invalid review comment.
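
A compact sketch of the CPU-override convention described above, using the same generic APIs (illustrative, not the literal env.py code):

```python
import os

import paddle

# DEVICE is a CPU-override flag: it can force CPU but never selects accelerators.
if os.environ.get("DEVICE") == "cpu" or paddle.device.device_count() == 0:
    DEVICE = "cpu"
else:
    DEVICE = paddle.device.get_device()  # framework-assigned, e.g. "gpu:0"
paddle.device.set_device(DEVICE)
```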

@HydrogenSulfate HydrogenSulfate marked this pull request as draft November 12, 2025 08:37

codecov bot commented Dec 8, 2025

Codecov Report

❌ Patch coverage is 23.07692% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.28%. Comparing base (0ad7cbf) to head (6bb94db).

Files with missing lines (patch coverage):
  • deepmd/pd/utils/utils.py: 14.28% (6 lines missing ⚠️)
  • deepmd/pd/entrypoints/main.py: 0.00% (2 lines missing ⚠️)
  • deepmd/pd/utils/auto_batch_size.py: 50.00% (1 line missing ⚠️)
  • deepmd/pd/utils/env.py: 50.00% (1 line missing ⚠️)
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #5047      +/-   ##
==========================================
- Coverage   84.28%   84.28%   -0.01%     
==========================================
  Files         709      709              
  Lines       70561    70567       +6     
  Branches     3618     3618              
==========================================
  Hits        59474    59474              
- Misses       9922     9927       +5     
- Partials     1165     1166       +1     


@HydrogenSulfate HydrogenSulfate marked this pull request as ready for review December 8, 2025 08:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/workflows/test_python.yml (1)

35-35: Clarify CI stability strategy for nightly PaddlePaddle wheels.

Switching to nightly/pre-release PaddlePaddle wheels (3.3.0.dev20251204) can introduce instability and make test failures harder to attribute. Consider:

  • Whether this is a temporary measure for testing device-agnostic compatibility or a long-term strategy
  • Adding fallback/retry logic if the specific nightly build becomes unavailable
  • Documenting the rationale and expected timeline for returning to stable releases

The --no-index --find-links pattern is strict; if the nightly wheel URL becomes stale or the specific version is missing, CI will fail. Consider wrapping this in a retry mechanism or providing a fallback stable version.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb6ccf4 and acab160.

📒 Files selected for processing (2)
  • .github/workflows/test_cuda.yml (2 hunks)
  • .github/workflows/test_python.yml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/test_cuda.yml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-24T22:42:23.128Z
Learnt from: CR
Repo: deepmodeling/deepmd-kit PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T22:42:23.128Z
Learning: Install backend dependencies (TensorFlow, PyTorch, JAX, or Paddle) before building C++ components, and set TENSORFLOW_ROOT and PYTORCH_ROOT environment variables

Applied to files:

  • .github/workflows/test_python.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build wheels for cp311-macosx_arm64
🔇 Additional comments (1)
.github/workflows/test_python.yml (1)

60-70: Inconsistent environment variable scope across pytest invocations.

FLAGS_use_stride_compute_kernel: 0 is set for the main pytest run (line 63) but not for the subsequent "Test TF2 eager mode" step (line 65). Verify whether this omission is intentional or if the flag should also be applied there.

For consistency and clarity, consider either:

  • Applying the flag to both pytest invocations if both need stride kernel handling
  • Documenting why the TF2 test intentionally omits this flag

@HydrogenSulfate
Collaborator Author

Hello @njzjz, this PR is ready for review.

@njzjz njzjz added this pull request to the merge queue Dec 8, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 8, 2025