ascend enable fabric mem #1170

ascend-direct-dev · 2025-12-05T06:08:06Z

Description

Type of Change

Types
- Bug fix
- New feature
  - Transfer Engine
  - Mooncake Store
  - Mooncake EP
  - Integration
  - P2P Store
  - Python Wheel
- Breaking change
- CI/CD
- Documentation update
- Other

How Has This Been Tested?

Checklist

I have performed a self-review of my own code.
I have updated the documentation.
I have added tests to prove my changes are effective.

gemini-code-assist · 2025-12-05T06:08:25Z

Summary of Changes

Hello @ascend-direct-dev, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for 'fabric memory' within the Ascend direct transfer mechanism. By checking an environment variable, the system can dynamically switch to a more optimized memory management strategy that leverages physical memory allocation and mapping, enhancing performance for Ascend-based data transfers. This change also ensures the underlying ADXL engine is properly configured to utilize this new memory type.

Highlights

- Fabric Memory Activation: Introduces a mechanism to enable 'fabric memory' for Ascend transfers by checking the ASCEND_ENABLE_USE_FABRIC_MEM environment variable during transfer engine initialization.
- Advanced Memory Management: Implements a new memory allocation and deallocation path using aclrtMallocPhysical, aclrtReserveMemAddress, aclrtMapMem, aclrtUnmapMem, and aclrtReleaseMemAddress when fabric memory is enabled, providing more direct control over memory resources.
- ADXL Engine Configuration: Configures the underlying ADXL engine to utilize fabric memory by setting the EnableUseFabricMem option and disabling the buffer pool when the feature is active.
- Global Configuration Flag: Adds a new ascend_use_fabric_mem boolean flag to the GlobalConfig struct, allowing the application to globally track and control the state of this feature.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces support for Ascend fabric memory, which is a significant performance enhancement. However, the current implementation has several critical issues that need to be addressed. There is a major memory leak in the fabric memory deallocation logic, as well as resource leaks in the error handling paths of the allocation function. The environment variable check for enabling the feature is not robust and could lead to unexpected behavior. Additionally, the removal of a resource cleanup safeguard in the transport destructor could introduce other leaks. I've provided specific comments and suggestions to resolve these problems.

I am having trouble creating individual review comments, so my feedback is listed here by file.

mooncake-store/src/utils.cpp (130)

This TODO comment points to a critical memory leak. The physical memory allocated by aclrtMallocPhysical is never freed because its handle is lost when allocate_buffer_allocator_memory returns. To fix this, you need to implement a mechanism to track the aclrtDrvMemHandle associated with each allocated virtual address pointer. A global map (std::unordered_map<void*, aclrtDrvMemHandle>) protected by a mutex would be a suitable approach. The handle should be stored in this map upon successful allocation in allocate_buffer_allocator_memory and then retrieved and used here to call aclrtFreePhysical.
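A minimal sketch of the tracking map described above, not the code from this PR. `DrvMemHandle` is a stand-in for aclrtDrvMemHandle so the example compiles without the ACL headers, and `FabricHandleRegistry`, `track`, and `take` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Stand-in for aclrtDrvMemHandle (assumption: the real type is an opaque handle).
using DrvMemHandle = void*;

// Hypothetical registry that maps each mapped virtual address to the
// physical-memory handle backing it, guarded by a mutex.
class FabricHandleRegistry {
public:
    // Record the handle after a successful map.
    // Returns false if the address is already tracked.
    bool track(void* va, DrvMemHandle handle) {
        assert(va != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        return handles_.emplace(va, handle).second;
    }

    // Remove the entry for `va` and hand its handle back through `out`.
    // Returns false if the address was never tracked.
    bool take(void* va, DrvMemHandle* out) {
        assert(out != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = handles_.find(va);
        if (it == handles_.end()) return false;
        *out = it->second;
        handles_.erase(it);
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return handles_.size();
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<void*, DrvMemHandle> handles_;
};
```

Under these assumptions, the allocation path would call `track(va, handle)` once the mapping succeeds, and the deallocation path would call `take(ptr, &handle)` after unmapping and pass the recovered handle to aclrtFreePhysical.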

mooncake-store/src/client_service.cpp (267-269)

The current implementation enables fabric memory if the ASCEND_ENABLE_USE_FABRIC_MEM environment variable is set to any value. This can be misleading, as setting it to 0 or false would still enable the feature. Checking for a specific value such as 1 avoids this ambiguity and is common practice in this codebase.

    // Enable fabric memory only when the variable is set to exactly "1".
    if (ascend_use_fabric_mem && std::string(ascend_use_fabric_mem) == "1") {
        globalConfig().ascend_use_fabric_mem = true;
    }
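For reference, the same check can be written as a small standalone helper. This is a sketch only: `is_enabled_value` and `env_flag_enabled` are hypothetical names, not functions from the Mooncake codebase.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: only the exact string "1" counts as enabled.
// A null pointer (variable unset), "", "0", "false", or "true" all disable.
inline bool is_enabled_value(const char* value) {
    return value != nullptr && std::string(value) == "1";
}

// Hypothetical wrapper that reads the named environment variable.
inline bool env_flag_enabled(const char* name) {
    assert(name != nullptr);
    return is_enabled_value(std::getenv(name));
}
```

A call such as `env_flag_enabled("ASCEND_ENABLE_USE_FABRIC_MEM")` would then feed the ascend_use_fabric_mem flag. Accepting other spellings such as "true" is a separate design choice that this sketch deliberately leaves out.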

mooncake-store/src/utils.cpp (99-107)

There are resource leaks in these error handling paths. If aclrtReserveMemAddress fails, the physical memory allocated by aclrtMallocPhysical is leaked. Similarly, if aclrtMapMem fails, both the physical memory and the reserved virtual address space are leaked. You should add cleanup calls in the error handling blocks to prevent these leaks.

    if (ret != ACL_ERROR_NONE) {
        LOG(ERROR) << "Failed to reserve memory: " << ret;
        aclrtFreePhysical(handle);  // release the physical allocation
        return nullptr;
    }
    ret = aclrtMapMem(va, total_size, 0, handle, 0);
    if (ret != ACL_ERROR_NONE) {
        LOG(ERROR) << "Failed to map memory: " << ret;
        aclrtReleaseMemAddress(va);  // undo the address reservation first
        aclrtFreePhysical(handle);   // then release the physical allocation
        return nullptr;
    }
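An alternative to repeating cleanup calls in every error branch is a scope guard that unwinds the steps in reverse order. The sketch below shows the pattern with injected callbacks so it runs without a device. `ScopeGuard`, `AllocSteps`, and `allocate_mapped` are hypothetical names, and the callbacks stand in for the aclrt* calls rather than reproduce their signatures.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Runs the stored callback on scope exit unless dismiss() was called.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
    ~ScopeGuard() {
        if (fn_) fn_();
    }
    void dismiss() { fn_ = nullptr; }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> fn_;
};

// Hypothetical bundle of the three allocation steps and their undo actions.
struct AllocSteps {
    std::function<bool()> alloc_physical;  // stands in for the physical allocation
    std::function<bool()> reserve_va;      // stands in for the address reservation
    std::function<bool()> map;             // stands in for the mapping step
    std::function<void()> free_physical;   // undo for alloc_physical
    std::function<void()> release_va;      // undo for reserve_va
};

// Returns true only if all three steps succeed. On failure, every step that
// already succeeded is undone in reverse order by the guards.
inline bool allocate_mapped(const AllocSteps& s) {
    assert(s.alloc_physical && s.reserve_va && s.map);
    if (!s.alloc_physical()) return false;
    ScopeGuard free_guard(s.free_physical);
    if (!s.reserve_va()) return false;  // free_guard frees the physical memory
    ScopeGuard release_guard(s.release_va);
    if (!s.map()) return false;  // release_guard runs first, then free_guard
    release_guard.dismiss();
    free_guard.dismiss();
    return true;
}
```

Because guards are destroyed in reverse declaration order, a mapping failure releases the reserved address before freeing the physical memory, which matches the order in the suggestion above.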

mooncake-transfer-engine/src/transport/ascend_transport/ascend_direct_transport/ascend_direct_transport.cpp (71-81)

Removing this memory deregistration loop from the destructor is risky. This loop served as a safeguard to clean up any memory regions that were registered but not explicitly unregistered. Unless adxl_->Finalize() is guaranteed to handle this cleanup for all registered memory handles, this change could introduce a resource leak. This would affect both fabric and non-fabric memory paths. Please confirm that adxl_->Finalize() makes this loop redundant, or consider restoring this cleanup logic.
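To illustrate the kind of safeguard meant here, the following sketch shows an owner object that deregisters any leftover regions in its destructor. It is not the removed code, which is not shown in this thread. `RegionOwner` and its members are hypothetical, and the `Deregister` callback stands in for whatever call the real engine exposes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical owner of registered memory regions. The deregistration call is
// injected so the sketch runs without the ADXL library.
class RegionOwner {
public:
    using Deregister = std::function<void(void* addr)>;

    explicit RegionOwner(Deregister dereg) : dereg_(std::move(dereg)) {
        assert(dereg_);
    }

    ~RegionOwner() {
        // Safeguard: release every region the caller never unregistered.
        for (const auto& entry : regions_) dereg_(entry.first);
        regions_.clear();
    }

    RegionOwner(const RegionOwner&) = delete;
    RegionOwner& operator=(const RegionOwner&) = delete;

    // Track a region after it has been registered with the engine.
    void add(void* addr, std::size_t len) { regions_[addr] = len; }

    // Explicitly unregister one region. Returns false if it is not tracked.
    bool remove(void* addr) {
        auto it = regions_.find(addr);
        if (it == regions_.end()) return false;
        dereg_(addr);
        regions_.erase(it);
        return true;
    }

private:
    Deregister dereg_;
    std::unordered_map<void*, std::size_t> regions_;
};
```

If the engine's finalize step already releases all registered handles, a loop like this would be redundant, which is the point the question above asks the author to confirm.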

ascend-direct-dev requested review from XucSh, alogfans, chestnut-Q, doujiang24, stmatengss and ykwd as code owners December 5, 2025 06:08

ascend-direct-dev marked this pull request as draft December 5, 2025 06:08

github-actions bot added the run-ci label Dec 5, 2025

gemini-code-assist bot reviewed Dec 5, 2025

ascend-direct-dev force-pushed the dev_store branch 3 times, most recently from f5780b4 to ee0c3f0 Compare December 9, 2025 08:01

ascend enable fabric mem

1df91c1

ascend-direct-dev force-pushed the dev_store branch from ee0c3f0 to 1df91c1 Compare December 9, 2025 09:24
