revert pinned hostmem for pod comm#292
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adjusts pod-communication (fabric-handle) memory allocation to better support exporting extended (HOST_NUMA) memory across pods, and avoids a failure mode where NUMA page validation caused one rank to error while other ranks hung in collectives.
Changes:
- Add a
GetMemLocation()helper to correctly selectDEVICEvsHOST_NUMAlocations for pod-comm VMM allocations and access descriptors. - Remove
CheckPages()/move_pages()validation for fabric-exportable host allocations (keep zeroing), with an explanatory comment. - Extend CUDA-compat macro aliases/undefs to include
hipMemLocationandhipMemLocationTypeHostNuma.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (IsGpuMemType(memDevice.memType)) { | ||
| exportErr = hipSetDevice(memDevice.memIndex); | ||
| } |
There was a problem hiding this comment.
Why is this hipSetDevice even necessary? The runtime should already have the ability to determine where the memory handle is. It also seems strange that you would not need it for CPU memory types but GPU memory types.
Based on the documentation, cuMemExportToSharableHandle doesn't require the current device context either, and won't ever return CUDA_ERROR_INVALID_CONTEXT
Motivation
This is to add back the option of extended GPU memory in cross pod transfer.
Technical Details
Removed
CheckPages()insideAllocateMemory()for pinned host memory. This would cause a silent fail on the node where it's allocated, and other nodes will hang at broadcast insideExchangeMemory.It's also not tested yet on either Nvidia or AMD platforms.
Test Plan
Test Result
Submission Checklist