@ascend-direct-dev

Description

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist

Summary of Changes

Hello @ascend-direct-dev, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for 'fabric memory' within the Ascend direct transfer mechanism. By checking an environment variable, the system can dynamically switch to a more optimized memory management strategy that leverages physical memory allocation and mapping, enhancing performance for Ascend-based data transfers. This change also ensures the underlying ADXL engine is properly configured to utilize this new memory type.

Highlights

  • Fabric Memory Activation: Introduces a mechanism to enable 'fabric memory' for Ascend transfers by checking the ASCEND_ENABLE_USE_FABRIC_MEM environment variable during transfer engine initialization.
  • Advanced Memory Management: Implements a new memory allocation and deallocation path using aclrtMallocPhysical, aclrtReserveMemAddress, aclrtMapMem, aclrtUnmapMem, and aclrtReleaseMemAddress when fabric memory is enabled, providing more direct control over memory resources.
  • ADXL Engine Configuration: Configures the underlying ADXL engine to utilize fabric memory by setting the EnableUseFabricMem option and disabling the buffer pool when the feature is active.
  • Global Configuration Flag: Adds a new ascend_use_fabric_mem boolean flag to the GlobalConfig struct, allowing the application to globally track and control the state of this feature.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for Ascend fabric memory, which is a significant performance enhancement. However, the current implementation has several critical issues that need to be addressed. There is a major memory leak in the fabric memory deallocation logic, as well as resource leaks in the error handling paths of the allocation function. The environment variable check for enabling the feature is not robust and could lead to unexpected behavior. Additionally, the removal of a resource cleanup safeguard in the transport destructor could introduce other leaks. I've provided specific comments and suggestions to resolve these problems.

I am having trouble creating individual review comments, so my feedback is collected below.

mooncake-store/src/utils.cpp (130)

critical

This TODO comment points to a critical memory leak. The physical memory allocated by aclrtMallocPhysical is never freed because its handle is lost when allocate_buffer_allocator_memory returns. To fix this, you need to implement a mechanism to track the aclrtDrvMemHandle associated with each allocated virtual address pointer. A global map (std::unordered_map<void*, aclrtDrvMemHandle>) protected by a mutex would be a suitable approach. The handle should be stored in this map upon successful allocation in allocate_buffer_allocator_memory and then retrieved and used here to call aclrtFreePhysical.
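
The tracking mechanism described above could look roughly like the following. This is a minimal sketch, not the project's implementation: `FabricHandleRegistry` and `DrvMemHandle` are hypothetical names, and `DrvMemHandle` is a stand-in for `aclrtDrvMemHandle` so that the example compiles without the CANN headers.

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Stand-in for aclrtDrvMemHandle (assumption: the real type is an opaque handle).
using DrvMemHandle = void*;

// Hypothetical registry that maps a mapped virtual address to the physical
// memory handle backing it, so the handle can be recovered at free time.
class FabricHandleRegistry {
public:
    // Record the handle after a successful map.
    // Returns false if ptr is already tracked.
    bool insert(void* ptr, DrvMemHandle handle) {
        std::lock_guard<std::mutex> lock(mutex_);
        return handles_.emplace(ptr, handle).second;
    }

    // Remove the entry for ptr and write its handle to *out.
    // Returns false if ptr was never tracked (or was already taken).
    bool take(void* ptr, DrvMemHandle* out) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = handles_.find(ptr);
        if (it == handles_.end()) return false;
        *out = it->second;
        handles_.erase(it);
        return true;
    }

    // Number of allocations currently tracked.
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return handles_.size();
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<void*, DrvMemHandle> handles_;
};
```

In the deallocation path, the caller would `take` the handle for the pointer being freed, unmap and release the virtual address, and then pass the recovered handle to aclrtFreePhysical. Removing the entry in the same step as the lookup keeps a second free of the same pointer from reusing a stale handle.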

mooncake-store/src/client_service.cpp (267-269)

high

The current implementation enables fabric memory if the ASCEND_ENABLE_USE_FABRIC_MEM environment variable is set to any value. This is misleading, because setting it to 0 or false would still enable the feature. Checking for a specific value such as 1 removes the ambiguity and matches common practice in this codebase.

        if (ascend_use_fabric_mem && std::string(ascend_use_fabric_mem) == "1") {
            globalConfig().ascend_use_fabric_mem = true;
        }
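
If the same strict check is needed in more than one place, it could be factored into a small helper. The sketch below is illustrative only: `flag_value_enabled` and `env_flag_enabled` are hypothetical names, not functions from this codebase.

```cpp
#include <cstdlib>
#include <cstring>

// Returns true only when the raw value is exactly "1".
// A null pointer (variable unset), "0", "false", and "" all count as disabled.
inline bool flag_value_enabled(const char* value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Reads the named environment variable and applies the strict check above.
inline bool env_flag_enabled(const char* name) {
    return flag_value_enabled(std::getenv(name));
}
```

With such a helper, the assignment becomes `globalConfig().ascend_use_fabric_mem = env_flag_enabled("ASCEND_ENABLE_USE_FABRIC_MEM");`, which also makes the disabled case explicit.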

mooncake-store/src/utils.cpp (99-107)

high

There are resource leaks in these error handling paths. If aclrtReserveMemAddress fails, the physical memory allocated by aclrtMallocPhysical is leaked. Similarly, if aclrtMapMem fails, both the physical memory and the reserved virtual address space are leaked. You should add cleanup calls in the error handling blocks to prevent these leaks.

            if (ret != ACL_ERROR_NONE) {
                LOG(ERROR) << "Failed to reserve memory: " << ret;
                aclrtFreePhysical(handle);
                return nullptr;
            }
            ret = aclrtMapMem(va, total_size, 0, handle, 0);
            if (ret != ACL_ERROR_NONE) {
                LOG(ERROR) << "Failed to map memory: " << ret;
                aclrtReleaseMemAddress(va);
                aclrtFreePhysical(handle);
                return nullptr;
            }
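
An alternative to repeating the cleanup calls in every error branch is a scope guard that undoes each completed step automatically. The sketch below shows the idea under stated assumptions: `ScopeGuard`, `FakeDriver`, and `allocate_fabric` are hypothetical, and `FakeDriver` only imitates the allocate, reserve, and map steps so that the example compiles without the ACL runtime.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Runs the stored callback on destruction unless dismiss() was called.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
    ~ScopeGuard() { if (fn_) fn_(); }
    void dismiss() { fn_ = nullptr; }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;
private:
    std::function<void()> fn_;
};

// Stand-in for the ACL runtime calls; counts live resources so leaks are visible.
struct FakeDriver {
    bool fail_reserve = false;
    bool fail_map = false;
    int live_physical = 0;  // physical allocations not yet freed
    int live_reserved = 0;  // virtual address ranges not yet released
    int live_mapped = 0;    // successful mappings

    int malloc_physical(void** handle, std::size_t) { *handle = this; ++live_physical; return 0; }
    int free_physical(void*) { --live_physical; return 0; }
    int reserve(void** va, std::size_t) {
        if (fail_reserve) return 1;
        *va = &live_reserved;
        ++live_reserved;
        return 0;
    }
    int release(void*) { --live_reserved; return 0; }
    int map(void*, std::size_t, void*) {
        if (fail_map) return 1;
        ++live_mapped;
        return 0;
    }
};

// Allocate, reserve, and map. On any failure the guards undo the completed
// steps in reverse order (release the address range, then free the memory).
inline void* allocate_fabric(FakeDriver& drv, std::size_t size) {
    void* handle = nullptr;
    if (drv.malloc_physical(&handle, size) != 0) return nullptr;
    ScopeGuard free_phys([&] { drv.free_physical(handle); });

    void* va = nullptr;
    if (drv.reserve(&va, size) != 0) return nullptr;
    ScopeGuard release_va([&] { drv.release(va); });

    if (drv.map(va, size, handle) != 0) return nullptr;

    // Success: keep both resources alive for the caller.
    release_va.dismiss();
    free_phys.dismiss();
    return va;
}
```

Because guards are destroyed in reverse order of construction, the cleanup order matches the explicit calls suggested above, and any step added later only needs one more guard.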

mooncake-transfer-engine/src/transport/ascend_transport/ascend_direct_transport/ascend_direct_transport.cpp (71-81)

high

Removing this memory deregistration loop from the destructor is risky. This loop served as a safeguard to clean up any memory regions that were registered but not explicitly unregistered. Unless adxl_->Finalize() is guaranteed to handle this cleanup for all registered memory handles, this change could introduce a resource leak. This would affect both fabric and non-fabric memory paths. Please confirm that adxl_->Finalize() makes this loop redundant, or consider restoring this cleanup logic.
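
For reference, the kind of safeguard being discussed can be sketched as follows. This is an illustration of the pattern only, not the removed code: `FakeEngine` and `TransportSketch` are hypothetical, the integer handle type is an assumption, and whether the real adxl_->Finalize() already performs this cleanup is exactly the open question.

```cpp
#include <mutex>
#include <unordered_map>
#include <vector>

// Stand-in for the engine; it records calls so the behavior can be observed.
struct FakeEngine {
    std::vector<int> deregistered;
    bool finalized = false;
    void DeregisterMem(int handle) { deregistered.push_back(handle); }
    void Finalize() { finalized = true; }
};

// Transport-like owner that tracks registered regions and, as a safeguard,
// deregisters any leftovers in its destructor before finalizing the engine.
class TransportSketch {
public:
    explicit TransportSketch(FakeEngine& engine) : engine_(engine) {}

    ~TransportSketch() {
        // Safeguard: clean up regions the caller never unregistered.
        for (const auto& entry : addr_to_handle_) {
            engine_.DeregisterMem(entry.second);
        }
        addr_to_handle_.clear();
        engine_.Finalize();
    }

    void registerMem(void* addr, int handle) {
        std::lock_guard<std::mutex> lock(mutex_);
        addr_to_handle_[addr] = handle;
    }

    void unregisterMem(void* addr) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = addr_to_handle_.find(addr);
        if (it == addr_to_handle_.end()) return;
        engine_.DeregisterMem(it->second);
        addr_to_handle_.erase(it);
    }

private:
    FakeEngine& engine_;
    std::mutex mutex_;
    std::unordered_map<void*, int> addr_to_handle_;
};
```

If Finalize() is documented to release every registered handle, the loop in the destructor is redundant. Otherwise it is the only thing that releases regions left registered at shutdown.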

@ascend-direct-dev force-pushed the dev_store branch 3 times, most recently from f5780b4 to ee0c3f0 (December 9, 2025 08:01)