Copilot AI commented Dec 4, 2025

  • Fix CUDA GPU detection when device is set to CPU (original issue)
  • Implement safe GPU probe function without cudaErrcheck
  • Refactor get_device_flag to support device=auto with MPI broadcast
  • Add clear user messaging for device selection
  • Update get_device_kpar - simplified to avoid MPI issues
  • Address MPI synchronization concerns from code review
  • Final code review - only minor style nitpicks remain
  • Security scan - no vulnerabilities detected

Completed Implementation

This PR addresses the original bug and implements the requested device=auto refactoring.

Changes Summary

1. Original Bug Fix (commit 996199b)

  • Fixed crash when running CUDA-compiled ABACUS with device=cpu on GPU-less machines
  • Added runtime guards in output_device_info() to skip GPU detection unless device=gpu

2. Device Auto-Selection Feature (commits f4653cb, e12b1cd)

  • New: probe_gpu_availability() - safe GPU detection without program exit
  • Enhanced: get_device_flag() - supports cpu/gpu/auto modes with MPI broadcast
  • Improved: Clear INFO/WARNING messages for device selection
  • Simplified: get_device_kpar() - removed the redundant GPU probe to avoid MPI synchronization issues

Behavior Matrix

  device | GPU Present | Result
  -------|-------------|----------------
  cpu    | Any         | CPU (INFO)
  gpu    | Yes         | GPU (INFO)
  gpu    | No          | Exit with error
  auto   | Yes         | GPU (INFO)
  auto   | No          | CPU (WARNING)

Technical Details

  • MPI-safe: Rank 0 makes decision, broadcasts to all ranks
  • No false GPU probes: Functions only check GPU when appropriate
  • Backward compatible: Existing cpu/gpu behavior unchanged
  • User-friendly: Clear messages explain device selection
Original prompt

This section details the original issue you should resolve

<issue_title>ABACUS compiled with CUDA doesn't run when there is no GPU</issue_title>
<issue_description>### Describe the bug

When running ABACUS compiled with CUDA with no GPU, ABACUS shows this message and exits.
Unexpected Device Error /home/mhchen_pkuhpc/mhchen_coe/lustre3/8_Hongxu/tmp/abacus-develop/source/source_base/module_device/output_device.cpp:72: cudaErrorNoDevice, no CUDA-capable device is detected

Expected behavior

Should run normally

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).</issue_description>

<agent_instructions>The bug is caused by get_device_num("gpu") inside output_device_info being executed even in the CPU-only version. Please help fix it.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@Flying-dragon-boxing > Generally, the `no CUDA-capable device is detected` error means the program is running on a host without a graphics card; it is not a compilation problem.

I've found a clue.
When device is set to cpu, ABACUS should not probe for a GPU device or execute source/source_base/module_device/output_device.cpp:72.



- Add runtime check for PARAM.inp.device before calling GPU detection functions
- Prevent cudaErrcheck from exiting when no GPU is available but device is set to "cpu"
- Apply same logic as non-MPI version to MPI version in output_device_info

Co-authored-by: dzzz2001 <153698752+dzzz2001@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix ABACUS running issue when compiled with CUDA and no GPU Fix CUDA GPU detection crash when running with device=cpu on machines without GPU Dec 4, 2025
Copilot AI requested a review from dzzz2001 December 4, 2025 07:08

dzzz2001 commented Dec 4, 2025

LGTM

@dzzz2001 dzzz2001 marked this pull request as ready for review December 4, 2025 07:37

@Flying-dragon-boxing Flying-dragon-boxing left a comment


LGTM


dzzz2001 commented Dec 4, 2025

@copilot Please help me refactor further:

Core idea

We will rework the get_device_flag function in device.cpp to make it the central decision point for device selection.

  1. Create a safe GPU probe function: to prevent a crash when cudaGetDeviceCount is called in a GPU-less environment, we need a probe function that does not go through cudaErrcheck.
  2. Implement the auto logic in get_device_flag
    • The function handles three inputs: cpu, gpu, and auto.
    • In an MPI run, only Rank 0 performs the GPU probe and makes the decision.
    • Rank 0 broadcasts the final decision (GPU or CPU) to all other ranks.
    • Based on the broadcast result, every rank consistently returns "gpu" or "cpu".
    • Rank 0 is responsible for printing clear messages or warnings.
  3. Keep the other functions safe: functions such as get_device_kpar must call GPU-related APIs only after GPU availability has been confirmed.

Step 1: Add the new function declaration in device.h

First, we need to declare the safe GPU probe function.
Modify device.h
Add the new declaration inside namespace base_device::information:

// device.h
namespace base_device
{
namespace information
{
// ... existing declarations ...
/**
 * @brief Safely probes for GPU availability without exiting on error.
 * @return True if at least one GPU is found and usable, false otherwise.
 */
bool probe_gpu_availability();
// ... existing declarations ...
} // end of namespace information
} // end of namespace base_device

Step 2: Implement the new function in device.cpp and refactor get_device_flag

This is the core of the change.
Modify device.cpp

  1. Implement the probe_gpu_availability function
    Place it before the get_device_flag function.
    // device.cpp
    namespace base_device
    {
    namespace information
    {
    bool probe_gpu_availability() {
    #if defined(__CUDA)
        int device_count = 0;
        // Directly call cudaGetDeviceCount without cudaErrcheck to prevent program exit
        cudaError_t error_id = cudaGetDeviceCount(&device_count);
        if (error_id == cudaSuccess && device_count > 0) {
            return true;
        }
        return false;
    #elif defined(__ROCM)
        int device_count = 0;
        hipError_t error_id = hipGetDeviceCount(&device_count);
        if (error_id == hipSuccess && device_count > 0) {
            return true;
        }
        return false;
    #else
        // If not compiled with GPU support, GPU is not available
        return false;
    #endif
    }
    // ... get_device_flag function will be refactored below ...
  2. Refactor the get_device_flag function
    This is where the auto logic is implemented. Replace the existing get_device_flag with the more robust version below.
    // device.cpp
    std::string get_device_flag(const std::string &device,
                                const std::string &basis_type) {
        // 1. Validate input string
        if (device != "cpu" && device != "gpu" && device != "auto") {
            ModuleBase::WARNING_QUIT("device", "Parameter \"device\" can only be set to \"cpu\", \"gpu\", or \"auto\"!");
        }
        int decision = 0; // 0 for CPU, 1 for GPU
    #ifdef __MPI
        int world_rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        if (world_rank == 0) {
            // Rank 0 makes the decision
            if (device == "gpu") {
                if (probe_gpu_availability()) {
                    decision = 1;
                    std::cout << " INFO: 'device=gpu' specified. GPU will be used." << std::endl;
                } else {
                    ModuleBase::WARNING_QUIT("device", "Device is set to 'gpu', but no available GPU was found. Please check your hardware/drivers or set 'device=cpu'.");
                }
            } else if (device == "auto") {
                if (probe_gpu_availability()) {
                    decision = 1;
                    std::cout << " INFO: 'device=auto' specified. GPU detected and will be used." << std::endl;
                } else {
                    decision = 0;
                    std::cout << " WARNING: 'device=auto' specified, but no GPU was found. Falling back to CPU." << std::endl;
                    std::cout << "          To suppress this warning, please explicitly set 'device=cpu' in your input." << std::endl;
                }
            } else { // device == "cpu"
                decision = 0;
                std::cout << " INFO: 'device=cpu' specified. CPU will be used." << std::endl;
            }
        }
        // Rank 0 broadcasts the final decision to all other ranks
        MPI_Bcast(&decision, 1, MPI_INT, 0, MPI_COMM_WORLD);
    #else
        // Non-MPI case: single process makes the decision
        if (device == "gpu") {
            if (probe_gpu_availability()) {
                decision = 1;
                std::cout << " INFO: 'device=gpu' specified. GPU will be used." << std::endl;
            } else {
                ModuleBase::WARNING_QUIT("device", "Device is set to 'gpu', but no available GPU was found. Please check your hardware/drivers or set 'device=cpu'.");
            }
        } else if (device == "auto") {
            if (probe_gpu_availability()) {
                decision = 1;
                std::cout << " INFO: 'device=auto' specified. GPU detected and will be used." << std::endl;
            } else {
                decision = 0;
                std::cout << " WARNING: 'device=auto' specified, but no GPU was found. Falling back to CPU." << std::endl;
                std::cout << "          To suppress this warning, please explicitly set 'device=cpu' in your input." << std::endl;
            }
        } else { // device == "cpu"
            decision = 0;
            std::cout << " INFO: 'device=cpu' specified. CPU will be used." << std::endl;
        }
    #endif
        // 2. Final check for incompatible basis type
        if (decision == 1 && basis_type == "lcao_in_pw") {
            ModuleBase::WARNING_QUIT("device", "The GPU currently does not support the basis type \"lcao_in_pw\"!");
        }
        // 3. Return the final decision
        return (decision == 1) ? "gpu" : "cpu";
    }
    // ... rest of the file ...

Step 3: Harden get_device_kpar for extra safety

The get_device_kpar function also calls cudaGetDeviceCount. Although get_device_flag has already made the decision, adding a guard here as well protects against accidental calls introduced by future code changes.
Modify get_device_kpar in device.cpp

// device.cpp
int get_device_kpar(const int& kpar, const int& bndpar)
{
#if __MPI && (__CUDA || __ROCM)
    // This function should only be called after the device has been decided.
    // We check the global device setting to ensure we only bind GPUs if GPU mode is active.
    // Assuming GlobalV::DEVICE holds the final decision from get_device_flag.
    extern std::string DEVICE; // Or use the correct global variable path, e.g., GlobalV::DEVICE
    if (DEVICE == "gpu") 
    {
        int temp_nproc = 0;
        int new_kpar = kpar;
        MPI_Comm_size(MPI_COMM_WORLD, &temp_nproc);
        if (temp_nproc != kpar * bndpar)
        {
            new_kpar = temp_nproc / bndpar;
            ModuleBase::WARNING("Input_conv", "kpar is not compatible with the number of processors, auto set kpar value.");
        }

        int node_rank = base_device::information::get_node_rank();
        int device_num = -1;
    #if defined(__CUDA)
        // It's now safe to call cudaGetDeviceCount because we are in GPU mode
        // and probe_gpu_availability has already succeeded.
        cudaErrcheck(cudaGetDeviceCount(&device_num));
        cudaSetDevice(node_rank % device_num);
    #elif defined(__ROCM)
        hipErrcheck(hipGetDeviceCount(&device_num));
        hipSetDevice(node_rank % device_num);
    #endif
        return new_kpar;
    }
#endif
    return kpar;
}

Note: this assumes GlobalV::DEVICE (or a similar global variable) has already been set from the return value of get_device_flag before this function is called. If the DEVICE variable is not accessible, a safer but slightly redundant alternative is to call probe_gpu_availability() again, though that is usually unnecessary.

Step 4: output_device.cpp needs no changes

Your earlier modification to output_device_info() is already exactly right.

// output_device.cpp
void output_device_info(std::ostream &output)
{
    // ...
    // Get local hardware info
    int local_gpu_count = 0;
    #if defined(__CUDA) || defined(__ROCM)
    // This guard is now crucial and works perfectly with the new logic.
    // PARAM.inp.device will be the final choice ("cpu" or "gpu").
    if (PARAM.inp.device == "gpu" && local_rank == 0) {
        local_gpu_count = get_device_num("gpu");
    }
    #endif
    // ...
}

Because get_device_flag now returns a definite "cpu" or "gpu", and that value is used to set PARAM.inp.device, the guard if (PARAM.inp.device == "gpu") in output_device_info works as intended: get_device_num("gpu") and get_device_name("gpu") are called only when the GPU is actually in use, which avoids the crash.

Summary

With the changes above, the device=auto feature is complete:

  • User experience: users simply set device=auto and the program picks the device automatically.
  • Transparency: the program clearly reports which device was ultimately chosen and, when it falls back to CPU, why.
  • MPI safety: the Rank-0 decision plus a global broadcast guarantees that all MPI ranks make exactly the same device choice.
  • Robustness: probe_gpu_availability fully resolves the crash in GPU-less environments.

This design integrates the auto feature cleanly into the existing device-management framework, with a focused diff and clear logic.

Copilot AI and others added 2 commits December 4, 2025 13:03
- Add probe_gpu_availability() function that safely checks GPU without exiting
- Refactor get_device_flag() to support device=auto with clear user feedback
- Implement MPI broadcast for consistent device selection across all ranks
- Add safety check in get_device_kpar() to prevent GPU calls when unavailable
- Add informative messages for device selection (INFO/WARNING)

Co-authored-by: dzzz2001 <153698752+dzzz2001@users.noreply.github.com>
- Remove redundant probe_gpu_availability check that could cause MPI inconsistency
- Function is only called when device=gpu is already confirmed
- Add comment clarifying the function should only be called in GPU mode

Co-authored-by: dzzz2001 <153698752+dzzz2001@users.noreply.github.com>

Copilot AI commented Dec 4, 2025

@copilot Please help me refactor further: Core idea
We will rework the get_device_flag function in device.cpp to make it the central decision point for device selection.

  1. Create a safe GPU probe function...

I've implemented the refactoring as requested. Here's what was done:

Commits:

  • f4653cb: Initial implementation of device=auto with safe GPU probe and MPI broadcast
  • e12b1cd: Fixed MPI synchronization issue in get_device_kpar

Key Changes:

  1. ✅ Added probe_gpu_availability() - safely probes GPU without cudaErrcheck
  2. ✅ Refactored get_device_flag() - now supports cpu/gpu/auto with MPI broadcast
  3. ✅ Enhanced user feedback - INFO/WARNING messages for device selection
  4. ✅ MPI-safe implementation - Rank 0 decides, broadcasts to all ranks
  5. ✅ Simplified get_device_kpar() - removed redundant probe to avoid MPI issues

The implementation follows your suggested approach with proper MPI synchronization and clear user messaging.

Copilot AI requested a review from dzzz2001 December 4, 2025 13:08

@mohanchen mohanchen left a comment


LGTM

@mohanchen mohanchen merged commit 778d420 into develop Dec 6, 2025
25 checks passed
@dzzz2001 dzzz2001 deleted the copilot/fix-cuda-no-gpu-error branch December 7, 2025 15:19

Development

Successfully merging this pull request may close these issues.

ABACUS compiled with CUDA doesn't run when there is no GPU

4 participants