Fix CUDA GPU detection crash when running with device=cpu on machines without GPU #6759
- Add runtime check for PARAM.inp.device before calling GPU detection functions
- Prevent cudaErrcheck from exiting when no GPU is available but device is set to "cpu"
- Apply same logic as non-MPI version to MPI version in output_device_info

Co-authored-by: dzzz2001 <153698752+dzzz2001@users.noreply.github.com>
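The guard described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's code: `device_info_line` and the `query_gpu_name` callback are hypothetical names, and the callback stands in for a real GPU query that may terminate the program when no GPU is present.

```cpp
#include <functional>
#include <string>

// Build a one-line device description for the startup output.
// ASSUMPTION: `query_gpu_name` is a hypothetical stand-in for the real GPU
// detection call, which is only safe to invoke when a GPU actually exists.
std::string device_info_line(const std::string& device,
                             const std::function<std::string()>& query_gpu_name)
{
    // Check the configured device first, so the GPU runtime is never
    // touched in CPU mode.
    if (device != "gpu")
    {
        return "CPU";
    }
    return "GPU (" + query_gpu_name() + ")";
}
```

With `device` set to `"cpu"`, the callback is never invoked, which is the property the fix relies on for machines without a GPU.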
LGTM
Flying-dragon-boxing
left a comment
LGTM
@copilot Please help me refactor this further: ### Core idea
Step 1: In
- Add probe_gpu_availability() function that safely checks GPU without exiting
- Refactor get_device_flag() to support device=auto with clear user feedback
- Implement MPI broadcast for consistent device selection across all ranks
- Add safety check in get_device_kpar() to prevent GPU calls when unavailable
- Add informative messages for device selection (INFO/WARNING)

Co-authored-by: dzzz2001 <153698752+dzzz2001@users.noreply.github.com>
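The selection logic listed in this commit might look something like the sketch below. The names `Device` and `resolve_device` are hypothetical, and raising an error for `device=gpu` without a GPU is an assumption; the thread does not show the actual behavior for that case.

```cpp
#include <stdexcept>
#include <string>

// ASSUMPTION: hypothetical result type for this sketch only.
enum class Device { Cpu, Gpu };

// Resolve the requested device string given the result of a non-exiting probe.
// `gpu_available` would come from a probe that inspects an error code (for
// example the return value of cudaGetDeviceCount) instead of aborting.
Device resolve_device(const std::string& requested, bool gpu_available)
{
    if (requested == "cpu")
    {
        return Device::Cpu; // never depends on the probe
    }
    if (requested == "auto")
    {
        // Use the GPU when one is present, otherwise fall back to the CPU.
        return gpu_available ? Device::Gpu : Device::Cpu;
    }
    if (requested == "gpu")
    {
        // ASSUMPTION: an explicit GPU request without a GPU is an error.
        if (!gpu_available)
        {
            throw std::runtime_error("device=gpu requested but no GPU was detected");
        }
        return Device::Gpu;
    }
    throw std::invalid_argument("unknown device: " + requested);
}
```

In an MPI run, one plausible arrangement is for a single rank to make this decision and share the result with `MPI_Bcast`, so that every rank takes the same code path afterwards.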
- Remove redundant probe_gpu_availability check that could cause MPI inconsistency
- Function is only called when device=gpu is already confirmed
- Add comment clarifying the function should only be called in GPU mode

Co-authored-by: dzzz2001 <153698752+dzzz2001@users.noreply.github.com>
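To illustrate the idea of a helper that trusts its caller instead of re-probing, here is a hedged sketch. `pick_device_index` is a hypothetical function, not the project's `get_device_kpar()`, and the round-robin mapping of ranks to devices is an assumption made only for this example.

```cpp
#include <stdexcept>

// Map a node-local MPI rank to a GPU index.
// PRECONDITION: call only after GPU mode has been confirmed for all ranks.
// The helper does no probing of its own, so every rank that enters it
// follows the same path.
// ASSUMPTION: round-robin assignment is illustrative, not the project's rule.
int pick_device_index(int local_rank, int device_count)
{
    if (local_rank < 0 || device_count <= 0)
    {
        // A violated precondition is a caller bug, not a reason to fall back.
        throw std::logic_error("pick_device_index called outside GPU mode");
    }
    return local_rank % device_count;
}
```

For example, with four devices on a node, local rank 5 would be assigned device 1 in this sketch.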
I've implemented the refactoring as requested. Here's what was done: Commits:
Key Changes:
The implementation follows your suggested approach with proper MPI synchronization and clear user messaging.
mohanchen
left a comment
LGTM
Completed Implementation
This PR addresses the original bug and implements the requested `device=auto` refactoring.

Changes Summary
1. Original Bug Fix (commit 996199b)
- `device=cpu` on GPU-less machines
- `output_device_info()` to skip GPU detection unless `device=gpu`

2. Device Auto-Selection Feature (commits f4653cb, e12b1cd)
- `probe_gpu_availability()` - safe GPU detection without program exit
- `get_device_flag()` - supports cpu/gpu/auto modes with MPI broadcast
- `get_device_kpar()` - simplified to avoid MPI synchronization issues

Behavior Matrix
Technical Details