
[Feature]: Introduce GPU Direct Storage (GDS) Feature #276

@qyh111

Description


🚀 The feature, motivation and pitch

GPU Direct Storage (GDS) Feature Documentation

Overview

This feature introduces GPU Direct Storage (GDS) support to enable direct data transfer between GPU memory and storage, bypassing CPU memory as an intermediate buffer. This significantly reduces memory bandwidth bottlenecks and improves KV cache loading/offloading performance.

Motivation

Currently, the KV cache transfer follows this path:

Device Memory <-> Host Memory <-> Storage

With GDS support, we can achieve:

Device Memory <-> Storage (direct)

This eliminates the need for:

  • Host memory staging buffers
  • CPU-GPU memory copies (H2D/D2H)
  • Additional memory bandwidth consumption
  • Two-stage transfer operations

Technical Design

Basic GDS call flow

  1. cuFileDriverOpen()
    Initialize the GDS runtime; must be called first, can be called repeatedly.
  2. open(path, O_DIRECT | …)
    Open the target file with POSIX; ensure it goes through the Direct-I/O path.
  3. cuFileHandleRegister(&fh, &desc)
    Wrap the regular fd in a cuFile handle and register it with GDS.
    The handle needs to be cached to improve performance.
  4. cuFileBufRegister(d_buf, size, 0)
    Register a GPU buffer with GDS for zero-copy transfers. We will not use this API.

    If cuFileBufRegister has not previously been called on the buffer pointer, cuFileRead/cuFileWrite will fall back to internally registered buffers when required (per_buffer_cache_size_kb = 1 MB, max_device_cache_size_kb = 128 MB).

  5. cuFileRead(fh, d_buf, size, offset, 0)
    or cuFileWrite(fh, d_buf, size, offset, 0)
    Transfer data between file and GPU memory with no CPU copies.
  6. cuFileBufDeregister(d_buf)
    Unregister the buffer; then call
    cuFileHandleDeregister(fh) to release the file-handle resource.
  7. close(fd)
    Close the underlying file descriptor.
  8. cuFileDriverClose()
    Shut down GDS and release global resources.
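
For reference, here is a minimal sketch of the flow above using the standard cuFile C API (error handling is abbreviated; the path, device buffer, size, and offset are placeholders supplied by the caller):

#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cufile.h>

// Minimal end-to-end GDS read: storage -> device memory (compile with g++ on Linux).
bool GdsReadExample(const char* path, void* d_buf, size_t size, off_t offset)
{
    if (cuFileDriverOpen().err != CU_FILE_SUCCESS) { return false; }   // step 1

    int fd = open(path, O_RDONLY | O_DIRECT);                          // step 2
    if (fd < 0) { return false; }

    CUfileDescr_t desc = {};
    desc.handle.fd = fd;
    desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    if (cuFileHandleRegister(&fh, &desc).err != CU_FILE_SUCCESS) {     // step 3
        close(fd);
        return false;
    }

    // Step 4 (cuFileBufRegister) is skipped; cuFile falls back to its internal
    // registered buffers when the pointer was not registered explicitly.
    ssize_t n = cuFileRead(fh, d_buf, size, offset, 0);                // step 5

    cuFileHandleDeregister(fh);                                        // step 6
    close(fd);                                                         // step 7
    cuFileDriverClose();                                               // step 8
    return n == static_cast<ssize_t>(size);
}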

Architecture Overview

The design follows a clean separation of concerns with polymorphic queue types:

  1. Configuration Layer: NFSStore::Config adds a transferUseDirect flag (sketched after this list)
  2. Queue Abstraction: New ITsfTaskQueue interface for polymorphic queue types
  3. Implementation Layer:
    • TsfTaskQueue: Traditional path (Device <-> Host <-> Storage)
    • DirectTsfTaskQueue: GDS path (Device <-> Storage)
  4. Device Layer: CUDA device implements S2D/D2S methods using cuFile APIs
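
For context, a minimal sketch of the configuration-layer change; only the transferUseDirect flag comes from this proposal, the surrounding struct shape is illustrative:

// Illustrative only: the other NFSStore::Config fields are omitted.
struct Config {
    // Selects the KV-cache transfer path:
    //   false -> Device <-> Host <-> Storage (TsfTaskQueue)
    //   true  -> Device <-> Storage via GDS (DirectTsfTaskQueue)
    bool transferUseDirect = false;
};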

Key Components

1. Device Interface Extension

// ucm/store/device/idevice.h
class IDevice {
    virtual Status Setup(bool transferUseDirect) = 0;


    // GDS sync: storage -> device and device -> storage
    virtual Status S2D(const std::string& path, size_t offset, size_t length,
                       std::byte* devicePtr) {
        return Status::Unsupported();
    }
    virtual Status D2S(const std::string& path, size_t offset, size_t length,
                       const std::byte* devicePtr) {
        return Status::Unsupported();
    }

    // GDS Async
    virtual Status S2DAsync(const std::string& path, size_t offset, size_t length, 
                            std::byte* devicePtr, std::function<void(bool)> callback) { 
        return Status::Unsupported(); 
    }
    virtual Status D2SAsync(const std::string& path, size_t offset, size_t length, 
                            const std::byte* devicePtr, std::function<void(bool)> callback) { 
        return Status::Unsupported(); 
    }
};

// ucm/store/device/ibuffered_device.h
class IBufferedDevice : public IDevice {
    Status Setup(bool transferUseDirect) override
    {
        if (transferUseDirect) {
            // GDS path: host staging buffers are not needed.
            return Status::OK();
        }
        ...  // existing host-buffer setup (unchanged)
    }
};

2. CUDA Implementation

// ucm/store/device/cuda/cuda_device.cc
class CudaDevice : public IBufferedDevice {
    // cuFileDriverOpen()
    Status InitGdsOnce();

    // Sync impl
    Status S2D(...) override;  // Sync storage to device
    Status D2S(...) override;  // Sync device to storage
};

3. Queue Polymorphism

// ucm/store/nfsstore/cc/domain/tsf_task/itsf_task_queue.h
class ITsfTaskQueue {
    virtual Status Setup(...) = 0;
    virtual void Push(std::list<TsfTask>& tasks) = 0;
};

class TsfTaskQueue : public ITsfTaskQueue { /* Traditional path */ };
class DirectTsfTaskQueue : public ITsfTaskQueue { /* GDS path */ };

4. Factory Pattern in Manager

// TsfTaskManager creates appropriate queue type based on configuration
Status TsfTaskManager::Setup(..., const bool transferUseDirect) {
    for (size_t i = 0; i < streamNumber; ++i) {
        std::unique_ptr<ITsfTaskQueue> queue;
        if (transferUseDirect) {
            queue = std::make_unique<DirectTsfTaskQueue>();
        } else {
            queue = std::make_unique<TsfTaskQueue>();
        }
        // Setup and store in unified _queues vector
    }
}

Implementation Details

GDS Requirements

DirectTsfTaskQueue

class DirectTsfTaskQueue : public ITsfTaskQueue {
public:
    Status Setup(...) override;
    void Push(...) override;

private:
    // Worker thread processing load/dump tasks
    void DirectOper(TsfTask& task);
    // calls CudaDevice::S2DAsync(...)
    Status S2D(const TsfTask& task);
    // calls CudaDevice::D2SAsync(...)
    Status D2S(const TsfTask& task);
    void Done(const TsfTask& task, bool success);
    ...
};
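
A rough sketch of how DirectOper could dispatch a task. The TsfTask fields (isDump, path, offset, length, devicePtr), the device_ member, and Status::IsOK() are illustrative placeholders, not the actual layout:

// Hypothetical sketch; field and member names are placeholders.
void DirectTsfTaskQueue::DirectOper(TsfTask& task)
{
    auto status = task.isDump ? D2S(task) : S2D(task);
    if (!status.IsOK()) { Done(task, false); }  // synchronous submission failed
    // otherwise completion is signalled from the async device-layer callback
}

Status DirectTsfTaskQueue::S2D(const TsfTask& task)
{
    // The device layer performs Storage -> Device via cuFile and invokes the
    // callback when the transfer finishes, which closes the loop through Done().
    return device_->S2DAsync(task.path, task.offset, task.length, task.devicePtr,
                             [this, task](bool ok) { Done(task, ok); });
}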

FileHandleCache

FileHandleCache is a lightweight class for caching and reusing cuFile handles (CUfileHandle_t), designed for use with NVIDIA GPUDirect Storage (GDS).
Its main goals are:

  • Avoid redundant cuFileHandleRegister() calls to reduce overhead
  • Automatically manage reference counts (refCount)
  • Evict handles with refCount == 0 when the cache exceeds its capacity

Core API

CUfileHandle_t acquire(const std::string& path);
void release(const std::string& path);
void cleanup();

acquire(path)
  • If the handle already exists in the cache, increments its reference count and returns it.
  • Otherwise:
    1. Opens the file using open() with the O_DIRECT flag.
    2. Registers it with cuFileHandleRegister().
    3. Inserts it into the cache with refCount = 1.
    4. Calls cleanupIfNeeded() to evict unused handles if the cache exceeds maxSize.
release(path)
  • Decrements the reference count of the specified file handle.
  • When the reference count reaches zero, the handle becomes eligible for cleanup.
cleanup()
  • Manually removes and deregisters all handles with refCount == 0.
  • Typically invoked before program shutdown or when releasing GPU resources.

Usage

  • Use auto fileHandleCache = Singleton<FileHandleCache>::Instance();
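
A condensed sketch of what FileHandleCache could look like. The member names, locking strategy, error handling, and maxSize_ default are assumptions, not the final implementation:

#include <fcntl.h>
#include <unistd.h>
#include <mutex>
#include <string>
#include <unordered_map>
#include <cufile.h>

// Sketch only: caches cuFile handles keyed by path, with simple refcounting.
class FileHandleCache {
    struct Entry {
        int fd{-1};
        CUfileHandle_t handle{};
        size_t refCount{0};
    };

public:
    CUfileHandle_t acquire(const std::string& path)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(path);
        if (it != cache_.end()) {                       // hit: bump refcount and reuse
            ++it->second.refCount;
            return it->second.handle;
        }
        Entry e;
        e.fd = open(path.c_str(), O_RDWR | O_DIRECT);   // Direct-I/O path
        if (e.fd < 0) { return nullptr; }               // sketch: report error upstream
        CUfileDescr_t desc{};
        desc.handle.fd = e.fd;
        desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        if (cuFileHandleRegister(&e.handle, &desc).err != CU_FILE_SUCCESS) {
            close(e.fd);
            return nullptr;
        }
        e.refCount = 1;
        cache_[path] = e;
        cleanupIfNeeded();                              // evict refCount == 0 entries if over maxSize_
        return cache_[path].handle;
    }

    void release(const std::string& path)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(path);
        if (it != cache_.end() && it->second.refCount > 0) { --it->second.refCount; }
    }

    void cleanup()                                      // drop every unused handle
    {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto it = cache_.begin(); it != cache_.end();) {
            if (it->second.refCount == 0) {
                cuFileHandleDeregister(it->second.handle);
                close(it->second.fd);
                it = cache_.erase(it);
            } else {
                ++it;
            }
        }
    }

private:
    void cleanupIfNeeded() { if (cache_.size() > maxSize_) { /* evict refCount == 0 entries */ } }

    std::unordered_map<std::string, Entry> cache_;
    std::mutex mutex_;
    size_t maxSize_{128};                               // assumed capacity
};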

CudaDevice

Setup

Status Setup(bool transferUseDirect) override
{
    ...
    if (transferUseDirect) {
        InitGdsOnce();
    }
    ...
}
...
Status CudaDevice::InitGdsOnce()
{
    auto ret = cuFileDriverOpen();
    if (ret.err == CU_FILE_SUCCESS) {
        UC_INFO("GDS driver initialized successfully");
        return Status::OK();
    }
    UC_WARN("Failed to initialize GDS driver, ret={}", ret.err);
    return Status::Unsupported();   // map the cuFile error to the store's Status
}

CUfileError_t cuFileDriverClose() is not called explicitly; the driver is shut down implicitly at process exit.

Sync Implementation Details (implemented first)

  • ssize_t cuFileRead(CUfileHandle_t fh, void *bufPtr_base, size_t size, off_t file_offset, off_t bufPtr_offset);
  • Key parameters:
    • CUfileHandle_t fh: Registered file handle
    • void *bufPtr_base: Device memory pointer
    • size_t size: Number of bytes to read
    • off_t file_offset: Offset in the file to read from
    • off_t bufPtr_offset: Offset into the device buffer; set to 0 to start at the beginning
  • Return:
    • Number of bytes read on success
    • -1 on an IO error, with errno set to indicate the filesystem error
    • All other errors return the negative value of a CUfileOpError enum entry
  • Resource cleanup (deregister the handle and close the fd)
  • Learn more at https://docs.nvidia.com/gpudirect-storage/api-reference-guide/index.html
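
As a sketch, CudaDevice::S2D could combine the FileHandleCache with cuFileRead roughly as below. Status::Error() and the pointer-returning Singleton<FileHandleCache>::Instance() are placeholders following the earlier sketches, not the final API:

#include <cufile.h>

// Sketch: synchronous Storage -> Device transfer for one chunk.
Status CudaDevice::S2D(const std::string& path, size_t offset, size_t length,
                       std::byte* devicePtr)
{
    auto cache = Singleton<FileHandleCache>::Instance();
    CUfileHandle_t fh = cache->acquire(path);            // cached cuFile handle
    if (fh == nullptr) { return Status::Error(); }       // placeholder error status

    ssize_t n = cuFileRead(fh, devicePtr, length,
                           static_cast<off_t>(offset), /*bufPtr_offset=*/0);
    cache->release(path);                                // drop our reference

    if (n < 0 || static_cast<size_t>(n) != length) {
        UC_WARN("cuFileRead failed or short read: got {}", n);
        return Status::Error();                          // placeholder error status
    }
    return Status::OK();
}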

Async Implementation Details

  • CUfileError_t cuFileReadAsync(CUfileHandle_t fh, void *bufPtr_base, size_t *size_p, off_t *file_offset_p, off_t *bufPtr_offset_p, ssize_t *bytes_read_p, CUstream stream);
  • Key parameters:
    • CUfileHandle_t fh: Registered file handle
    • void *bufPtr_base: Device memory pointer
    • size_t *size_p: Pointer to the transfer size (evaluated at execution time)
    • off_t *file_offset_p: Pointer to the file offset (evaluated at execution time)
    • ssize_t *bytes_read_p: Host-allocated memory that receives the returned byte count
    • CUstream stream: CUDA stream used for operation ordering
  • Return:
    • CU_FILE_SUCCESS indicates the operation was successfully enqueued
  • Use cudaStreamAddCallback to monitor async completion and notify task completion
  • Resources (file handles, buffer registrations, streams) are released automatically in the callbacks
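
A sketch of the async path, assuming a per-device CUDA stream member stream_ and the same placeholder Status/Singleton conventions as above. Because cuFileReadAsync reads its size/offset arguments through pointers at execution time, they live in a heap-allocated context until the stream callback runs:

#include <cuda_runtime.h>
#include <cufile.h>
#include <functional>
#include <string>
#include <sys/types.h>

// Per-request context; must outlive the enqueued cuFileReadAsync operation.
struct AsyncCtx {
    std::string path;
    size_t size;
    off_t fileOffset;
    off_t bufOffset;
    ssize_t bytesDone;
    std::function<void(bool)> callback;
};

// Sketch: asynchronous Storage -> Device transfer ordered on stream_.
Status CudaDevice::S2DAsync(const std::string& path, size_t offset, size_t length,
                            std::byte* devicePtr, std::function<void(bool)> callback)
{
    auto cache = Singleton<FileHandleCache>::Instance();
    CUfileHandle_t fh = cache->acquire(path);
    if (fh == nullptr) { return Status::Error(); }       // placeholder error status

    auto* ctx = new AsyncCtx{path, length, static_cast<off_t>(offset), 0, 0,
                             std::move(callback)};
    auto err = cuFileReadAsync(fh, devicePtr, &ctx->size, &ctx->fileOffset,
                               &ctx->bufOffset, &ctx->bytesDone, stream_);
    if (err.err != CU_FILE_SUCCESS) {
        cache->release(path);
        delete ctx;
        return Status::Error();                          // placeholder error status
    }

    // Completion hook: verify the byte count, notify the queue layer, clean up.
    cudaStreamAddCallback(stream_, [](cudaStream_t, cudaError_t st, void* data) {
        auto* c = static_cast<AsyncCtx*>(data);
        bool ok = (st == cudaSuccess) && (c->bytesDone == static_cast<ssize_t>(c->size));
        Singleton<FileHandleCache>::Instance()->release(c->path);
        c->callback(ok);
        delete c;
    }, ctx, 0);
    return Status::OK();
}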

Alignment Requirements

GDS performs best with strict 4 KB alignment:

  • File offset: Must be multiple of 4KB
  • Transfer length: Must be multiple of 4KB
  • Device pointer: Must be 4KB aligned in device memory

When alignment requirements are not met:

  • The API still works correctly for unaligned offsets and arbitrary data sizes, although performance may not match that of aligned reads
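
For illustration, a small helper (hypothetical, not part of the proposal) that checks whether a request qualifies for the fully aligned fast path:

#include <cstddef>
#include <cstdint>

constexpr size_t kGdsAlignment = 4096;  // 4 KB

// Returns true when the file offset, transfer length, and device pointer are all
// 4 KB aligned, i.e. the request can take the fastest GDS path.
inline bool IsGdsAligned(size_t fileOffset, size_t length, const std::byte* devicePtr)
{
    return fileOffset % kGdsAlignment == 0 &&
           length % kGdsAlignment == 0 &&
           reinterpret_cast<std::uintptr_t>(devicePtr) % kGdsAlignment == 0;
}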

Error Handling & Fallback

  • The cuFile API automatically probes for GDS support at runtime; if the capability is missing, it silently falls back to compatibility mode.
  • Alignment validation with automatic fallback for misaligned requests
  • Use cufile.log for troubleshooting
  • Even though the cuFile API performs automatic capability detection, registration can still fail; when it does, the job is aborted immediately rather than attempting any further fallback.

Build Configuration

CMake Configuration

# ucm/store/device/cuda/CMakeLists.txt
...
set_target_properties(Cuda::cudart PROPERTIES
    INTERFACE_INCLUDE_DIRECTORIES "${CUDA_ROOT}/include"
    IMPORTED_LOCATION "${CUDA_ROOT}/lib64/libcudart.so"
)
add_library(Cuda::cufile SHARED IMPORTED)
set_target_properties(Cuda::cufile PROPERTIES
    INTERFACE_INCLUDE_DIRECTORIES "${CUDA_ROOT}/include"
    IMPORTED_LOCATION "${CUDA_ROOT}/lib64/libcufile.so"
)

Summary

This GDS feature provides a high-performance alternative to traditional CPU-mediated storage transfers, with automatic fallback mechanisms and comprehensive error handling. The polymorphic design ensures clean separation of concerns while maintaining full backward compatibility.

Alternatives

No response

Additional context

No response
