fix(pd): adapting code for hardware compatibility #5047
base: devel
Conversation
Pull Request Overview
This pull request updates the PaddlePaddle backend to use generic device management APIs instead of CUDA-specific ones, improving compatibility with different hardware backends like XPU. The changes replace paddle.device.cuda.* calls with paddle.device.* equivalents and enhance device selection logic.
- Replaced CUDA-specific API calls with generic device APIs across multiple files
- Enhanced `get_generator` function to support XPU devices and handle unsupported device types gracefully
- Modified device selection in `env.py` to use `paddle.device.get_device()` instead of explicit GPU-LOCAL_RANK construction
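The replacement pattern can be sketched as a small guarded helper (a hypothetical function, assuming a Paddle build where the generic `paddle.device.device_count()` and `paddle.device.empty_cache()` APIs exist; it degrades to a no-op when Paddle is not installed):

```python
import importlib.util


def clear_device_cache() -> bool:
    """Clear accelerator memory via the backend-agnostic Paddle API.

    Returns True when a cache clear was attempted, False when Paddle is
    unavailable or no accelerator device is present.
    """
    if importlib.util.find_spec("paddle") is None:
        return False  # Paddle not installed; nothing to clear
    import paddle

    # Generic replacements for the CUDA-only paddle.device.cuda.* calls:
    # device_count() and empty_cache() cover GPU, XPU, and other backends.
    if paddle.device.device_count() == 0:
        return False
    paddle.device.empty_cache()
    return True


if __name__ == "__main__":
    print(clear_device_cache())
```

The same guard-then-call shape applies wherever the PR swaps a `paddle.device.cuda.*` call for its `paddle.device.*` equivalent.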
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| source/tests/pd/conftest.py | Updated cache clearing to use generic paddle.device.empty_cache() instead of CUDA-specific version |
| deepmd/pd/utils/utils.py | Added XPU generator support and warning handling for unsupported devices; added warnings import |
| deepmd/pd/utils/env.py | Changed device selection to use paddle.device.get_device() and paddle.device.device_count() for multi-backend support |
| deepmd/pd/utils/auto_batch_size.py | Updated device availability check and cache clearing to generic APIs |
| deepmd/pd/entrypoints/main.py | Modified NCCL check to support non-NCCL backends and updated device count API |
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Replaced CUDA-specific Paddle calls with device-agnostic APIs, added XPU generator support, relaxed the NCCL assertion when Paddle isn't built with NCCL, conditionalized device cache clearing in test teardown, and updated CI workflows to use nightly/pre-release Paddle wheels and a new env flag.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as App / Test runner
    participant Env as env.py
    participant Main as entrypoints/main.py
    participant Utils as utils.get_generator
    participant Paddle as Paddle API
    Runner->>Env: init
    Env->>Paddle: paddle.device.device_count()
    alt device count > 0
        Paddle-->>Env: n>0
        Env->>Paddle: paddle.device.get_device()
        Paddle-->>Env: DEVICE (e.g., "gpu:0"/"xpu:0")
        Env->>Paddle: paddle.device.set_device(DEVICE)
    else no devices
        Paddle-->>Env: 0
        Env->>Paddle: paddle.device.set_device("cpu")
    end
    Runner->>Main: init multi-GPU
    Main->>Paddle: compiled_with_nccl?
    alt compiled_with_nccl == true
        Main->>Paddle: get_nccl_version()
        Paddle-->>Main: nccl_version
        alt nccl_version > 0
            Main-->>Runner: proceed
        else
            Main-->>Runner: assert/fail
        end
    else
        Main-->>Runner: skip NCCL check
    end
    Runner->>Utils: request RNG for DEVICE
    Utils->>Paddle: inspect DEVICE
    alt DEVICE startswith "xpu"
        Utils->>Paddle: create XPU generator
        Paddle-->>Utils: generator
    else DEVICE startswith "gpu" or DEVICE == "cpu"
        Utils->>Paddle: create GPU/CPU generator
        Paddle-->>Utils: generator
    else unsupported
        Utils-->>Runner: warn + return None
    end
```
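The generator-selection branch at the bottom of the diagram can be sketched as a pure-Python dispatcher (hypothetical helper name; the real `get_generator` in `deepmd/pd/utils/utils.py` constructs Paddle generator objects for the resolved place):

```python
import warnings


def generator_family(device: str):
    """Map a Paddle device string to the generator family to instantiate.

    Mirrors the branching described in the review: "xpu" and "xpu:N" get
    an XPU generator, "gpu"/"gpu:N" and "cpu" keep their existing paths,
    and anything else warns and yields None.
    """
    if device == "cpu":
        return "cpu"
    if device.startswith("gpu"):
        return "gpu"
    if device.startswith("xpu"):
        return "xpu"
    # stacklevel=2 points the warning at the caller, not at this line
    warnings.warn(f"No generator support for device {device!r}", stacklevel=2)
    return None


if __name__ == "__main__":
    for dev in ("cpu", "gpu:1", "xpu", "xpu:0"):
        print(dev, "->", generator_family(dev))
```

The dispatch is intentionally prefix-based so that both bare (`"xpu"`) and indexed (`"xpu:0"`) device strings take the same branch.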
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- deepmd/pd/entrypoints/main.py (2 hunks)
- deepmd/pd/utils/auto_batch_size.py (2 hunks)
- deepmd/pd/utils/env.py (1 hunks)
- deepmd/pd/utils/utils.py (2 hunks)
- source/tests/pd/conftest.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code

Files:
- deepmd/pd/utils/auto_batch_size.py
- source/tests/pd/conftest.py
- deepmd/pd/entrypoints/main.py
- deepmd/pd/utils/env.py
- deepmd/pd/utils/utils.py
🪛 Ruff (0.14.4)
deepmd/pd/utils/utils.py
357-357: No explicit stacklevel keyword argument found. Set `stacklevel=2` (B028)
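The B028 finding above is a one-line fix; a minimal illustration of why `stacklevel=2` matters (hypothetical function name, not the actual deepmd code):

```python
import warnings


def warn_unsupported(device: str):
    """Stand-in for a helper that warns about an unsupported device."""
    # With stacklevel=2 the emitted warning is attributed to the *caller*
    # of this function rather than to this line, so users see where in
    # their own code the unsupported device came from. Ruff B028 flags
    # warnings.warn calls that omit an explicit stacklevel.
    warnings.warn(f"Unsupported device {device!r}", stacklevel=2)


if __name__ == "__main__":
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        warn_unsupported("npu:0")
    print(len(caught))
```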
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (31)
- GitHub Check: Agent
- GitHub Check: CodeQL analysis (python)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test C++ (false)
- GitHub Check: Test C++ (true)
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Analyze (python)
🔇 Additional comments (7)
source/tests/pd/conftest.py (1)
9-9: LGTM! Device-agnostic API correctly applied. The change from `paddle.device.cuda.empty_cache()` to `paddle.device.empty_cache()` aligns with the PR objective to support multiple hardware backends.

deepmd/pd/entrypoints/main.py (2)
98-98: LGTM! Relaxed NCCL assertion for multi-backend support. The assertion now allows multi-GPU initialization when PaddlePaddle is not compiled with NCCL, which is appropriate for non-CUDA backends like XPU that may use different communication libraries.
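The relaxed check can be sketched as a pure-Python predicate (hypothetical names; the actual logic lives in `deepmd/pd/entrypoints/main.py` and queries Paddle directly):

```python
def nccl_check_passes(compiled_with_nccl: bool, nccl_version: int = 0) -> bool:
    """Return True when multi-device initialization may proceed.

    Before the change, a Paddle build without NCCL failed the assertion
    outright; now the NCCL version is only validated when Paddle was
    compiled with NCCL, so non-CUDA backends with their own communication
    libraries are allowed through.
    """
    if not compiled_with_nccl:
        return True  # skip the NCCL check entirely
    return nccl_version > 0


if __name__ == "__main__":
    print(nccl_check_passes(False))        # no-NCCL build: proceed
    print(nccl_check_passes(True, 21205))  # NCCL present and versioned: proceed
    print(nccl_check_passes(True, 0))      # NCCL build with no version: fail
```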
217-217: LGTM! Device-agnostic API correctly applied. The change from `paddle.device.cuda.device_count()` to `paddle.device.device_count()` enables support for multiple hardware backends. Note that the method name `get_ngpus()` now returns the count of any available compute devices (GPU, XPU, etc.), which is appropriate for the broader hardware support.

deepmd/pd/utils/auto_batch_size.py (2)
31-39: LGTM! Device-agnostic API correctly applied. The change from `paddle.device.cuda.device_count()` to `paddle.device.device_count()` enables support for multiple hardware backends. Note that the method name `is_gpu_available()` now effectively checks for any compute device availability (GPU, XPU, etc.), not just GPU, which aligns with the broader hardware support goal.
54-54: LGTM! Device-agnostic API correctly applied. The change from `paddle.device.cuda.empty_cache()` to `paddle.device.empty_cache()` enables proper memory cleanup across different hardware backends.

deepmd/pd/utils/utils.py (1)
349-354: LGTM! XPU device support correctly implemented. The XPU device branches follow the same pattern as GPU device handling and correctly initialize XPU generators for both single-device (`"xpu"`) and indexed-device (`"xpu:N"`) configurations.

deepmd/pd/utils/env.py (1)
32-37: The review comment is based on an incorrect assumption about the DEVICE environment variable's purpose. The DEVICE environment variable is designed as a CPU-override flag, not a general device selector. This is confirmed by:

- Identical implementation pattern across frameworks: both the PyTorch (`deepmd/pt/utils/env.py`, lines 40-43) and PaddlePaddle implementations use the same logic, only checking whether `DEVICE == "cpu"`, never using its value to select non-CPU devices like `"xpu:0"`.
- Architectural design: for GPU/accelerator selection in distributed training, the frameworks use `LOCAL_RANK` (PyTorch) or rely on `get_device()` with device-assignment logic (PaddlePaddle). The `DEVICE` variable is separate and exists specifically for the CPU-forcing use case.
- Consistency: the comment "Make sure DDP uses correct device if applicable" and the identical pattern across both backends indicate this is intentional: `DEVICE` forces CPU when needed, while the framework handles proper device selection via other mechanisms.

The current code correctly respects the `DEVICE` environment variable for its intended purpose: forcing CPU execution when `DEVICE == "cpu"` is set. The suggested fix to support arbitrary device strings like `"xpu:0"` would require XPU-specific handling logic, which is a feature addition unrelated to the stated PR goal of respecting the `DEVICE` environment variable.

Likely an incorrect or invalid review comment.
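The precedence described above can be sketched as a hypothetical pure function, where `detected` stands in for the string returned by `paddle.device.get_device()`:

```python
import os


def resolve_device(detected, env=None):
    """Pick the runtime device string.

    DEVICE acts only as a CPU-override flag: when DEVICE=cpu is present in
    the environment, force CPU; otherwise trust the device Paddle detected
    ("gpu:0", "xpu:0", or "cpu" when no accelerator is available).
    """
    env = os.environ if env is None else env
    if env.get("DEVICE") == "cpu":
        return "cpu"
    return detected


if __name__ == "__main__":
    print(resolve_device("gpu:0", env={}))                  # → gpu:0
    print(resolve_device("xpu:0", env={"DEVICE": "cpu"}))   # → cpu
```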
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##            devel    #5047      +/-   ##
==========================================
- Coverage   84.28%   84.28%    -0.01%
==========================================
  Files         709      709
  Lines       70561    70567        +6
  Branches     3618     3618
==========================================
  Hits        59474    59474
- Misses       9922     9927        +5
- Partials     1165     1166        +1
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 0
🧹 Nitpick comments (1)
.github/workflows/test_python.yml (1)
35-35: Clarify CI stability strategy for nightly PaddlePaddle wheels. Switching to nightly/pre-release PaddlePaddle wheels (3.3.0.dev20251204) can introduce instability and make test failures harder to attribute. Consider:
- Whether this is a temporary measure for testing device-agnostic compatibility or a long-term strategy
- Adding fallback/retry logic if the specific nightly build becomes unavailable
- Documenting the rationale and expected timeline for returning to stable releases
The `--no-index --find-links` pattern is strict; if the nightly wheel URL becomes stale or the specific version is missing, CI will fail. Consider wrapping this in a retry mechanism or providing a fallback stable version.
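One way to implement the suggested fallback, sketched as a hypothetical workflow step (the `PADDLE_NIGHTLY_INDEX` variable and the version pin are illustrative, not the repository's actual values):

```yaml
- name: Install PaddlePaddle (nightly, with stable fallback)
  run: |
    # Try the pinned nightly first; fall back to the latest stable wheel
    # if the nightly index is stale or the pinned build was removed.
    python -m pip install --no-index --find-links "$PADDLE_NIGHTLY_INDEX" \
      "paddlepaddle==3.3.0.dev20251204" \
      || python -m pip install paddlepaddle
```

The `||` fallback trades strict reproducibility for CI availability, so it is best paired with a log line or annotation marking which branch was taken.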
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/test_cuda.yml (2 hunks)
- .github/workflows/test_python.yml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/test_cuda.yml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-24T22:42:23.128Z
Learnt from: CR
Repo: deepmodeling/deepmd-kit PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T22:42:23.128Z
Learning: Install backend dependencies (TensorFlow, PyTorch, JAX, or Paddle) before building C++ components, and set TENSORFLOW_ROOT and PYTORCH_ROOT environment variables
Applied to files:
.github/workflows/test_python.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Test C++ (false)
- GitHub Check: Test C++ (true)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Analyze (python)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Build wheels for cp311-macosx_arm64
🔇 Additional comments (1)
.github/workflows/test_python.yml (1)
60-70: Inconsistent environment variable scope across pytest invocations.
`FLAGS_use_stride_compute_kernel: 0` is set for the main pytest run (line 63) but not for the subsequent "Test TF2 eager mode" step (line 65). Verify whether this omission is intentional or if the flag should also be applied there. For consistency and clarity, consider either:
- Applying the flag to both pytest invocations if both need stride kernel handling
- Documenting why the TF2 test intentionally omits this flag
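For the in-test counterpart of this flag handling, a save/restore discipline can be sketched as a small context manager (hypothetical; the actual tests manage Paddle's `FLAGS_use_stride_kernel` flag and restore it after each test):

```python
import os
from contextlib import contextmanager


@contextmanager
def env_flag(name, value):
    """Temporarily set an environment flag, restoring the prior state on exit."""
    sentinel = object()
    old = os.environ.get(name, sentinel)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is sentinel:
            del os.environ[name]  # flag was unset before; unset it again
        else:
            os.environ[name] = old


if __name__ == "__main__":
    with env_flag("FLAGS_use_stride_compute_kernel", "0"):
        print(os.environ["FLAGS_use_stride_compute_kernel"])  # → 0
```

Using a `try/finally` block guarantees restoration even when the wrapped test raises, which is what test isolation requires.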
Hello @njzjz, this PR is ready for review.
This pull request updates PaddlePaddle dependencies to nightly builds and refactors device handling throughout the codebase to support more flexible device selection (including XPU), improves compatibility, and adds better flag management for tests. The most significant changes are grouped below.
Dependency Updates:
- Updated `.github/workflows/test_python.yml` and `.github/workflows/test_cuda.yml` to use nightly builds (3.3.0.dev20251204) from new package URLs, improving access to the latest features and fixes. [1] [2]

Device Handling Refactor:
- Replaced `paddle.device.cuda.device_count()` and related CUDA-specific APIs with the more general `paddle.device.device_count()` and `paddle.device.empty_cache()` in multiple locations, enabling support for devices beyond CUDA (e.g., XPU). [1] [2] [3] [4] [5]
- Changed device selection in `deepmd/pd/utils/env.py` to use `paddle.device.get_device()`, ensuring correct device assignment for both CPU and GPU/XPU scenarios.

Device Compatibility Improvements:
- Updated `get_generator` in `deepmd/pd/utils/utils.py` to support XPU devices and added a warning for unsupported device types, improving compatibility and error messaging.

Test Flag Management:
- Handled the `FLAGS_use_stride_kernel` Paddle flag in `source/tests/pd/test_multitask.py` to ensure proper test isolation and restore flag values after tests. [1] [2]
- Set the `FLAGS_use_stride_compute_kernel` environment variable to `0` in workflow files to control kernel usage during tests. [1] [2]

Distributed Training Check: