Copilot AI commented Dec 4, 2025

  • Fix CUDA GPU detection when device is set to CPU (original issue)
  • Implement safe GPU probe function without cudaErrcheck
  • Refactor get_device_flag to support device=auto with MPI broadcast
  • Add clear user messaging for device selection
  • Update get_device_kpar - simplified to avoid MPI issues
  • Address MPI synchronization concerns from code review
  • Final code review - only minor style nitpicks remain
  • Security scan - no vulnerabilities detected

Completed Implementation

This PR addresses the original bug and implements the requested device=auto refactoring.

Changes Summary

1. Original Bug Fix (commit 996199b)

  • Fixed crash when running CUDA-compiled ABACUS with device=cpu on GPU-less machines
  • Added runtime guards in output_device_info() to skip GPU detection unless device=gpu

2. Device Auto-Selection Feature (commits f4653cb, e12b1cd)

  • New: probe_gpu_availability() - safe GPU detection without program exit
  • Enhanced: get_device_flag() - supports cpu/gpu/auto modes with MPI broadcast
  • Improved: Clear INFO/WARNING messages for device selection
  • Simplified: get_device_kpar() - removed the redundant GPU probe to avoid MPI synchronization issues

Behavior Matrix

  device | GPU Present | Result
  -------|-------------|----------------
  cpu    | Any         | CPU (INFO)
  gpu    | Yes         | GPU (INFO)
  gpu    | No          | Exit with error
  auto   | Yes         | GPU (INFO)
  auto   | No          | CPU (WARNING)

Technical Details

  • MPI-safe: Rank 0 makes decision, broadcasts to all ranks
  • No false GPU probes: Functions only check GPU when appropriate
  • Backward compatible: Existing cpu/gpu behavior unchanged
  • User-friendly: Clear messages explain device selection
Original prompt

This section details the original issue you should resolve

<issue_title>ABACUS compiled with CUDA doesn't run when there is no GPU</issue_title>
<issue_description>### Describe the bug

When running ABACUS compiled with CUDA with no GPU, ABACUS shows this message and exits.
Unexpected Device Error /home/mhchen_pkuhpc/mhchen_coe/lustre3/8_Hongxu/tmp/abacus-develop/source/source_base/module_device/output_device.cpp:72: cudaErrorNoDevice, no CUDA-capable device is detected

Expected behavior

Should run normally

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).</issue_description>

<agent_instructions>The bug is caused by get_device_num("gpu") inside output_device_info being executed even in the CPU-only version. Please help fix it.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@Flying-dragon-boxing > Generally, the `no CUDA-capable device is detected` error means the program is running on a host without a graphics card; it is not a compilation problem.

I've found a clue.
When device is set to cpu, ABACUS should not probe for a GPU device or execute source/source_base/module_device/output_device.cpp:72.



- Add runtime check for PARAM.inp.device before calling GPU detection functions
- Prevent cudaErrcheck from exiting when no GPU is available but device is set to "cpu"
- Apply same logic as non-MPI version to MPI version in output_device_info

Co-authored-by: dzzz2001 <153698752+dzzz2001@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix ABACUS running issue when compiled with CUDA and no GPU Fix CUDA GPU detection crash when running with device=cpu on machines without GPU Dec 4, 2025
Copilot AI requested a review from dzzz2001 December 4, 2025 07:08

dzzz2001 commented Dec 4, 2025

LGTM

@dzzz2001 dzzz2001 marked this pull request as ready for review December 4, 2025 07:37

@Flying-dragon-boxing Flying-dragon-boxing left a comment


LGTM


dzzz2001 commented Dec 4, 2025

@copilot Please help me refactor further:

Core idea

We will rework the get_device_flag function in device.cpp to make it the central decision point for device selection.

  1. Create a safe GPU probe function: to prevent a crash when cudaGetDeviceCount is called in a GPU-less environment, we need a probe function that does not go through cudaErrcheck.
  2. Implement the auto logic in get_device_flag
    • The function handles three inputs: cpu, gpu, and auto.
    • In an MPI run, only Rank 0 performs the GPU probe and makes the decision.
    • Rank 0 broadcasts the final decision (GPU or CPU) to all other ranks.
    • Based on the broadcast result, every rank consistently returns "gpu" or "cpu".
    • Rank 0 is responsible for printing clear messages or warnings.
  3. Keep the other functions safe: functions such as get_device_kpar must call GPU-related APIs only after GPU availability has been confirmed.

Step 1: Add the new function declaration in device.h

First, we need to declare the safe GPU probe function.
Modify device.h
Add the new declaration inside namespace base_device::information:

// device.h
namespace base_device
{
namespace information
{
// ... existing declarations ...
/**
 * @brief Safely probes for GPU availability without exiting on error.
 * @return True if at least one GPU is found and usable, false otherwise.
 */
bool probe_gpu_availability();
// ... existing declarations ...
} // end of namespace information
} // end of namespace base_device

Step 2: Implement the new function in device.cpp and refactor get_device_flag

This is the core of the change.
Modify device.cpp

  1. Implement the probe_gpu_availability function
    Place it before the get_device_flag function.
    // device.cpp
    namespace base_device
    {
    namespace information
    {
    bool probe_gpu_availability() {
    #if defined(__CUDA)
        int device_count = 0;
        // Directly call cudaGetDeviceCount without cudaErrcheck to prevent program exit
        cudaError_t error_id = cudaGetDeviceCount(&device_count);
        if (error_id == cudaSuccess && device_count > 0) {
            return true;
        }
        return false;
    #elif defined(__ROCM)
        int device_count = 0;
        hipError_t error_id = hipGetDeviceCount(&device_count);
        if (error_id == hipSuccess && device_count > 0) {
            return true;
        }
        return false;
    #else
        // If not compiled with GPU support, GPU is not available
        return false;
    #endif
    }
    // ... get_device_flag function will be refactored below ...
  2. Refactor the get_device_flag function
    This is where the auto logic is implemented. Replace the existing get_device_flag with the more robust version below.
    // device.cpp
    std::string get_device_flag(const std::string &device,
                                const std::string &basis_type) {
        // 1. Validate input string
        if (device != "cpu" && device != "gpu" && device != "auto") {
            ModuleBase::WARNING_QUIT("device", "Parameter \"device\" can only be set to \"cpu\", \"gpu\", or \"auto\"!");
        }
        int decision = 0; // 0 for CPU, 1 for GPU
    #ifdef __MPI
        int world_rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        if (world_rank == 0) {
            // Rank 0 makes the decision
            if (device == "gpu") {
                if (probe_gpu_availability()) {
                    decision = 1;
                    std::cout << " INFO: 'device=gpu' specified. GPU will be used." << std::endl;
                } else {
                    ModuleBase::WARNING_QUIT("device", "Device is set to 'gpu', but no available GPU was found. Please check your hardware/drivers or set 'device=cpu'.");
                }
            } else if (device == "auto") {
                if (probe_gpu_availability()) {
                    decision = 1;
                    std::cout << " INFO: 'device=auto' specified. GPU detected and will be used." << std::endl;
                } else {
                    decision = 0;
                    std::cout << " WARNING: 'device=auto' specified, but no GPU was found. Falling back to CPU." << std::endl;
                    std::cout << "          To suppress this warning, please explicitly set 'device=cpu' in your input." << std::endl;
                }
            } else { // device == "cpu"
                decision = 0;
                std::cout << " INFO: 'device=cpu' specified. CPU will be used." << std::endl;
            }
        }
        // Rank 0 broadcasts the final decision to all other ranks
        MPI_Bcast(&decision, 1, MPI_INT, 0, MPI_COMM_WORLD);
    #else
        // Non-MPI case: single process makes the decision
        if (device == "gpu") {
            if (probe_gpu_availability()) {
                decision = 1;
                std::cout << " INFO: 'device=gpu' specified. GPU will be used." << std::endl;
            } else {
                ModuleBase::WARNING_QUIT("device", "Device is set to 'gpu', but no available GPU was found. Please check your hardware/drivers or set 'device=cpu'.");
            }
        } else if (device == "auto") {
            if (probe_gpu_availability()) {
                decision = 1;
                std::cout << " INFO: 'device=auto' specified. GPU detected and will be used." << std::endl;
            } else {
                decision = 0;
                std::cout << " WARNING: 'device=auto' specified, but no GPU was found. Falling back to CPU." << std::endl;
                std::cout << "          To suppress this warning, please explicitly set 'device=cpu' in your input." << std::endl;
            }
        } else { // device == "cpu"
            decision = 0;
            std::cout << " INFO: 'device=cpu' specified. CPU will be used." << std::endl;
        }
    #endif
        // 2. Final check for incompatible basis type
        if (decision == 1 && basis_type == "lcao_in_pw") {
            ModuleBase::WARNING_QUIT("device", "The GPU currently does not support the basis type \"lcao_in_pw\"!");
        }
        // 3. Return the final decision
        return (decision == 1) ? "gpu" : "cpu";
    }
    // ... rest of the file ...

Step 3: Harden get_device_kpar for extra safety

The get_device_kpar function also calls cudaGetDeviceCount. Although get_device_flag has already made the decision, adding a guard here as well protects against accidental calls introduced by future code changes.
Modify get_device_kpar in device.cpp

// device.cpp
int get_device_kpar(const int& kpar, const int& bndpar)
{
#if __MPI && (__CUDA || __ROCM)
    // This function should only be called after the device has been decided.
    // We check the global device setting to ensure we only bind GPUs if GPU mode is active.
    // Assuming GlobalV::DEVICE holds the final decision from get_device_flag.
    extern std::string DEVICE; // Or use the correct global variable path, e.g., GlobalV::DEVICE
    if (DEVICE == "gpu") 
    {
        int temp_nproc = 0;
        int new_kpar = kpar;
        MPI_Comm_size(MPI_COMM_WORLD, &temp_nproc);
        if (temp_nproc != kpar * bndpar)
        {
            new_kpar = temp_nproc / bndpar;
            ModuleBase::WARNING("Input_conv", "kpar is not compatible with the number of processors, auto set kpar value.");
        }

        int node_rank = base_device::information::get_node_rank();
        int device_num = -1;
    #if defined(__CUDA)
        // It's now safe to call cudaGetDeviceCount because we are in GPU mode
        // and probe_gpu_availability has already succeeded.
        cudaErrcheck(cudaGetDeviceCount(&device_num));
        cudaSetDevice(node_rank % device_num);
    #elif defined(__ROCM)
        hipErrcheck(hipGetDeviceCount(&device_num));
        hipSetDevice(node_rank % device_num);
    #endif
        return new_kpar;
    }
#endif
    return kpar;
}

Note: this assumes GlobalV::DEVICE (or a similar global variable) has already been set from the return value of get_device_flag before this function is called. If the DEVICE variable is not accessible, a safer but slightly redundant alternative is to call probe_gpu_availability() again, though that is usually unnecessary.

Step 4: output_device.cpp needs no changes

Your earlier modification to output_device_info() is already exactly right.

// output_device.cpp
void output_device_info(std::ostream &output)
{
    // ...
    // Get local hardware info
    int local_gpu_count = 0;
    #if defined(__CUDA) || defined(__ROCM)
    // This guard is now crucial and works perfectly with the new logic.
    // PARAM.inp.device will be the final choice ("cpu" or "gpu").
    if (PARAM.inp.device == "gpu" && local_rank == 0) {
        local_gpu_count = get_device_num("gpu");
    }
    #endif
    // ...
}

Because get_device_flag now returns a definite "cpu" or "gpu", and that value is used to set PARAM.inp.device, the guard if (PARAM.inp.device == "gpu") in output_device_info works as intended: get_device_num("gpu") and get_device_name("gpu") are called only when the GPU is actually in use, which avoids the crash.

Summary

With the changes above, the device=auto feature is complete:

  • User experience: users simply set device=auto and the program picks the device automatically.
  • Transparency: the program clearly reports which device was ultimately chosen and, when it falls back to CPU, why.
  • MPI safety: the Rank-0 decision plus a global broadcast guarantees that all MPI ranks make exactly the same device choice.
  • Robustness: probe_gpu_availability fully resolves the crash in GPU-less environments.

This design integrates the auto feature cleanly into the existing device-management framework, with a focused diff and clear logic.

Copilot AI and others added 2 commits December 4, 2025 13:03
- Add probe_gpu_availability() function that safely checks GPU without exiting
- Refactor get_device_flag() to support device=auto with clear user feedback
- Implement MPI broadcast for consistent device selection across all ranks
- Add safety check in get_device_kpar() to prevent GPU calls when unavailable
- Add informative messages for device selection (INFO/WARNING)

Co-authored-by: dzzz2001 <153698752+dzzz2001@users.noreply.github.com>
- Remove redundant probe_gpu_availability check that could cause MPI inconsistency
- Function is only called when device=gpu is already confirmed
- Add comment clarifying the function should only be called in GPU mode

Co-authored-by: dzzz2001 <153698752+dzzz2001@users.noreply.github.com>

Copilot AI commented Dec 4, 2025

@copilot Please help me refactor further: Core idea
We will rework the get_device_flag function in device.cpp to make it the central decision point for device selection.

  1. Create a safe GPU probe function...

I've implemented the refactoring as requested. Here's what was done:

Commits:

  • f4653cb: Initial implementation of device=auto with safe GPU probe and MPI broadcast
  • e12b1cd: Fixed MPI synchronization issue in get_device_kpar

Key Changes:

  1. ✅ Added probe_gpu_availability() - safely probes GPU without cudaErrcheck
  2. ✅ Refactored get_device_flag() - now supports cpu/gpu/auto with MPI broadcast
  3. ✅ Enhanced user feedback - INFO/WARNING messages for device selection
  4. ✅ MPI-safe implementation - Rank 0 decides, broadcasts to all ranks
  5. ✅ Simplified get_device_kpar() - removed redundant probe to avoid MPI issues

The implementation follows your suggested approach with proper MPI synchronization and clear user messaging.

Copilot AI requested a review from dzzz2001 December 4, 2025 13:08

@mohanchen mohanchen left a comment


LGTM

@mohanchen mohanchen merged commit 778d420 into develop Dec 6, 2025
25 checks passed
@dzzz2001 dzzz2001 deleted the copilot/fix-cuda-no-gpu-error branch December 7, 2025 15:19

Development

Successfully merging this pull request may close these issues.

ABACUS compiled with CUDA doesn't run when there is no GPU

4 participants