ascend enable fabric mem #1170
Summary of Changes
Hello @ascend-direct-dev, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces support for 'fabric memory' within the Ascend direct transfer mechanism. By checking an environment variable, the system can dynamically switch to a more optimized memory management strategy that leverages physical memory allocation and mapping, enhancing performance for Ascend-based data transfers. This change also ensures the underlying ADXL engine is properly configured to utilize this new memory type.
Code Review
This pull request introduces support for Ascend fabric memory, which is a significant performance enhancement. However, the current implementation has several critical issues that need to be addressed. There is a major memory leak in the fabric memory deallocation logic, as well as resource leaks in the error handling paths of the allocation function. The environment variable check for enabling the feature is not robust and could lead to unexpected behavior. Additionally, the removal of a resource cleanup safeguard in the transport destructor could introduce other leaks. I've provided specific comments and suggestions to resolve these problems.
I am having trouble creating individual review comments, so my feedback is listed below.
mooncake-store/src/utils.cpp (130)
This TODO comment points to a critical memory leak. The physical memory allocated by aclrtMallocPhysical is never freed because its handle is lost when allocate_buffer_allocator_memory returns. To fix this, you need to implement a mechanism to track the aclrtDrvMemHandle associated with each allocated virtual address pointer. A global map (std::unordered_map<void*, aclrtDrvMemHandle>) protected by a mutex would be a suitable approach. The handle should be stored in this map upon successful allocation in allocate_buffer_allocator_memory and then retrieved and used here to call aclrtFreePhysical.
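The handle-tracking approach suggested above could look roughly like the following. This is a minimal sketch, not the project's code: `FabricHandleRegistry`, `track`, and `release` are hypothetical names, and `DrvMemHandle` is a stand-in alias for `aclrtDrvMemHandle` so the example compiles without the Ascend SDK.

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <unordered_map>

// Stand-in for aclrtDrvMemHandle (assumption: the real type comes from the Ascend SDK).
using DrvMemHandle = void*;

// Hypothetical registry that maps each mapped virtual address to its physical handle.
class FabricHandleRegistry {
public:
    // Record the handle after a successful allocation.
    // Returns false if ptr is already tracked.
    bool track(void* ptr, DrvMemHandle handle) {
        std::lock_guard<std::mutex> lock(mutex_);
        return handles_.emplace(ptr, handle).second;
    }

    // Remove and return the handle for ptr, or std::nullopt if ptr is unknown.
    std::optional<DrvMemHandle> release(void* ptr) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = handles_.find(ptr);
        if (it == handles_.end()) return std::nullopt;
        DrvMemHandle handle = it->second;
        handles_.erase(it);
        return handle;
    }

    // Number of allocations currently tracked.
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return handles_.size();
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<void*, DrvMemHandle> handles_;
};
```

In the deallocation path, the handle returned by `release(ptr)` would be the one passed to `aclrtFreePhysical` once the mapping has been torn down. A missing entry indicates that the pointer was not allocated through the fabric memory path.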
mooncake-store/src/client_service.cpp (267-269)
The current implementation enables fabric memory if the ASCEND_ENABLE_USE_FABRIC_MEM environment variable is set to any value. This can be misleading, as setting it to 0 or false would still enable the feature. It's better to check for a specific value like 1 to avoid ambiguity, which is a common practice in this codebase.
if (ascend_use_fabric_mem && std::string(ascend_use_fabric_mem) == "1") {
globalConfig().ascend_use_fabric_mem = true;
}
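The same check can be factored into a small helper so that every opt-in flag is parsed the same way. The sketch below assumes a hypothetical function name, `env_flag_enabled`, which is not part of the codebase. It treats only the exact string "1" as enabled.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns true only when the variable is set to exactly "1".
// An unset variable, an empty value, "0", "false", or "true" all leave the flag off.
inline bool env_flag_enabled(const char* name) {
    const char* value = std::getenv(name);
    return value != nullptr && std::strcmp(value, "1") == 0;
}
```

With such a helper, the call site would reduce to assigning `env_flag_enabled("ASCEND_ENABLE_USE_FABRIC_MEM")` to `globalConfig().ascend_use_fabric_mem`.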
mooncake-store/src/utils.cpp (99-107)
There are resource leaks in these error handling paths. If aclrtReserveMemAddress fails, the physical memory allocated by aclrtMallocPhysical is leaked. Similarly, if aclrtMapMem fails, both the physical memory and the reserved virtual address space are leaked. You should add cleanup calls in the error handling blocks to prevent these leaks.
if (ret != ACL_ERROR_NONE) {
LOG(ERROR) << "Failed to reserve memory: " << ret;
aclrtFreePhysical(handle);
return nullptr;
}
ret = aclrtMapMem(va, total_size, 0, handle, 0);
if (ret != ACL_ERROR_NONE) {
LOG(ERROR) << "Failed to map memory: " << ret;
aclrtReleaseMemAddress(va);
aclrtFreePhysical(handle);
return nullptr;
}
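An alternative to repeating the cleanup calls in every error branch is a scope guard that releases a resource automatically unless the allocation succeeds. The sketch below is illustrative only: `ScopeGuard` is a hypothetical helper, not something the codebase is known to provide.

```cpp
#include <functional>
#include <utility>

// Hypothetical scope guard: runs the cleanup on scope exit unless dismissed.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> cleanup)
        : cleanup_(std::move(cleanup)) {}

    // Run the cleanup if the guard is still armed.
    ~ScopeGuard() {
        if (cleanup_) cleanup_();
    }

    // Disarm the guard once ownership has been handed off successfully.
    void dismiss() { cleanup_ = nullptr; }

    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> cleanup_;
};
```

In the allocation function, one guard would wrap `aclrtFreePhysical(handle)` right after the physical allocation, and a second would wrap `aclrtReleaseMemAddress(va)` right after the reservation. Both would be dismissed only after `aclrtMapMem` succeeds. Guards are destroyed in reverse order of construction, so the address reservation is released before the physical memory, matching the order in the suggestion above.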
mooncake-transfer-engine/src/transport/ascend_transport/ascend_direct_transport/ascend_direct_transport.cpp (71-81)
Removing this memory deregistration loop from the destructor is risky. This loop served as a safeguard to clean up any memory regions that were registered but not explicitly unregistered. Unless adxl_->Finalize() is guaranteed to handle this cleanup for all registered memory handles, this change could introduce a resource leak. This would affect both fabric and non-fabric memory paths. Please confirm that adxl_->Finalize() makes this loop redundant, or consider restoring this cleanup logic.
f5780b4 to ee0c3f0
ee0c3f0 to 1df91c1