revert pinned hostmem for pod comm by AtlantaPepsi · Pull Request #292 · ROCm/TransferBench

AtlantaPepsi · 2026-05-08T20:50:15Z

Motivation

This adds back the option of using extended GPU memory in cross-pod transfers.

Technical Details

Removed the CheckPages() call inside AllocateMemory() for pinned host memory. When that check fails, it fails silently on the node where the memory is allocated, and the other nodes then hang at the broadcast inside ExchangeMemory.

It's also not tested yet on either Nvidia or AMD platforms.
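
The hang described under Technical Details follows from a general property of collectives: a broadcast completes only if every rank enters it. The sketch below models that in plain C++ with no MPI or HIP dependency. The names `RankResult`, `BroadcastCompletes`, and `AgreeOnStatus` are hypothetical and are not taken from TransferBench; the agreement step is one possible alternative to removing the check, not what this PR does.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-rank outcome of a local allocation step.
struct RankResult {
  bool localCheckPassed;  // e.g. the result of a page-placement validation
};

// Models a collective: it completes only when every rank enters it.
// If a rank returns early on a local error, the remaining ranks wait forever.
bool BroadcastCompletes(const std::vector<RankResult>& ranks, bool earlyReturnOnError) {
  for (const RankResult& r : ranks) {
    if (earlyReturnOnError && !r.localCheckPassed) {
      return false;  // this rank never reaches the broadcast
    }
  }
  return true;
}

// Alternative pattern: every rank always takes part in a status exchange
// (modelled here as a logical AND), so all ranks fail together instead of hanging.
bool AgreeOnStatus(const std::vector<RankResult>& ranks) {
  return std::all_of(ranks.begin(), ranks.end(),
                     [](const RankResult& r) { return r.localCheckPassed; });
}
```

With three ranks where only the second fails its local check, `BroadcastCompletes(ranks, true)` is false (the hang), while `BroadcastCompletes(ranks, false)` is true and `AgreeOnStatus(ranks)` reports the failure to every rank.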

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR adjusts pod-communication (fabric-handle) memory allocation to better support exporting extended (HOST_NUMA) memory across pods, and avoids a failure mode where NUMA page validation caused one rank to error while other ranks hung in collectives.

Changes:

Add a GetMemLocation() helper to correctly select DEVICE vs HOST_NUMA locations for pod-comm VMM allocations and access descriptors.
Remove CheckPages()/move_pages() validation for fabric-exportable host allocations (keep zeroing), with an explanatory comment.
Extend CUDA-compat macro aliases/undefs to include hipMemLocation and hipMemLocationTypeHostNuma.
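
As a rough illustration of the first change, a location-selection helper could look like the sketch below. It uses stand-in types so it compiles without HIP or CUDA headers: `LocType` and `MemLocation` mimic the `type` and `id` fields of `hipMemLocation`, while `MemKind`, `IsGpuMemKind`, and `GetMemLocationSketch` are hypothetical names and do not reproduce the actual code in this PR. It also assumes the memory index is a GPU ordinal for GPU memory and a NUMA node id for host memory.

```cpp
// Stand-in for hipMemLocationType (only the two values discussed here).
enum class LocType { Device, HostNuma };

// Stand-in for hipMemLocation, which carries a location type and an id.
struct MemLocation {
  LocType type;
  int id;  // device ordinal for Device, NUMA node id for HostNuma
};

// Hypothetical memory kinds; TransferBench's real memory-type enum differs.
enum class MemKind { GpuFine, GpuCoarse, CpuPinned };

bool IsGpuMemKind(MemKind kind) {
  return kind == MemKind::GpuFine || kind == MemKind::GpuCoarse;
}

// Select the location used for both the allocation properties and the
// access descriptor, so the two cannot disagree.
// Assumption: memIndex is a GPU ordinal for GPU kinds, a NUMA node otherwise.
MemLocation GetMemLocationSketch(MemKind kind, int memIndex) {
  MemLocation loc{};
  loc.type = IsGpuMemKind(kind) ? LocType::Device : LocType::HostNuma;
  loc.id = memIndex;
  return loc;
}
```

Routing both the allocation and the access descriptor through a single helper keeps the DEVICE versus HOST_NUMA decision in one place.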

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

gilbertlee-amd · 2026-05-16T04:48:17Z

+        if (IsGpuMemType(memDevice.memType)) {
+          exportErr = hipSetDevice(memDevice.memIndex);
+        }


Why is this hipSetDevice even necessary? The runtime should already be able to determine where the memory handle is. It also seems strange that you would need it for GPU memory types but not for CPU memory types.

Based on the documentation, cuMemExportToShareableHandle doesn't require a current device context either, and won't ever return CUDA_ERROR_INVALID_CONTEXT.

Copilot AI review requested due to automatic review settings May 8, 2026 20:50

AtlantaPepsi requested a review from a team as a code owner May 8, 2026 20:50

Copilot started reviewing on behalf of AtlantaPepsi May 8, 2026 20:51

Copilot AI reviewed May 8, 2026

Comment thread src/header/TransferBench.hpp

AtlantaPepsi added 2 commits May 14, 2026 18:43

revert pinned hostmem for pod comm

19b0ae9

minor fix

634e46e

Copilot AI review requested due to automatic review settings May 14, 2026 23:43

AtlantaPepsi force-pushed the EGM branch from 9b3c1d7 to 634e46e Compare May 14, 2026 23:43

Copilot started reviewing on behalf of AtlantaPepsi May 14, 2026 23:45

Copilot AI reviewed May 14, 2026

gilbertlee-amd requested changes May 16, 2026

AtlantaPepsi wants to merge 2 commits into ROCm:candidate from AtlantaPepsi:EGM

3 participants
