Skip to content

Fix memory allocation bug to use the correct hipDevice#308

Open
nileshnegi wants to merge 3 commits into
candidatefrom
users/nileshnegi/fix/mem-alloc-fabric
Open

Fix memory allocation bug to use the correct hipDevice#308
nileshnegi wants to merge 3 commits into
candidatefrom
users/nileshnegi/fix/mem-alloc-fabric

Conversation

@nileshnegi
Copy link
Copy Markdown
Collaborator

Motivation

Not using hipSetDevice before allocating memory can use an unintended deviceIdx when executing fabric-handle-based transfers

Technical Details

Test Plan

Test Result

Submission Checklist

Not using hipSetDevice before allocating memory can use unintended
deviceIdx when executing fabric-handle based transfers
Copilot AI review requested due to automatic review settings May 19, 2026 23:06
@nileshnegi nileshnegi requested a review from a team as a code owner May 19, 2026 23:06
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses incorrect device selection during memory allocation for fabric-handle-based transfers by ensuring the appropriate HIP device (or NUMA policy) is selected before allocation occurs.

Changes:

  • Set NUMA preference before CPU allocations and call hipSetDevice() before GPU allocations inside AllocateMemory().
  • Reset NUMA preference after shared (fabric-handle) CPU memory initialization/checking.
  • Remove now-redundant per-branch numa_set_preferred / hipSetDevice calls later in the function.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/header/TransferBench.hpp
nileshnegi and others added 2 commits May 19, 2026 22:34
- Reset numa_set_preferred(-1) before ERR_FATAL early return in the
  non-POD_COMM_ENABLED path; without this the NUMA policy stays dirty
  for subsequent CPU allocations in the same process
- Use memDevice.memIndex directly in the top-level hipSetDevice call
  instead of deviceIdx, which is NUMA-remapped for CPU types only;
  documents that the MEM_CPU_CLOSEST remapping does not apply to GPU
- Remove now-redundant hipSetDevice inside the POD_COMM GPU memHandle
  branch; device was already set at the top of AllocateMemory
- Guard CollectTopology GPU agent probe loop with hipSetDevice(i) so
  each AllocateMemory call targets the correct device

Co-authored-by: Claude <claude@anthropic.com>
Copilot AI review requested due to automatic review settings May 20, 2026 06:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment on lines +1498 to +1505
if (IsCpuMemType(memType)) {
// Set NUMA policy prior to call to hipHostMalloc
numa_set_preferred(deviceIdx);
} else if (IsGpuMemType(memType)) {
// Switch to the appropriate GPU
// IMP: if the remapping above changes, remember to modify this!
ERR_CHECK(hipSetDevice(deviceIdx));
}
Comment on lines +1498 to +1501
if (IsCpuMemType(memType)) {
// Set NUMA policy prior to call to hipHostMalloc
numa_set_preferred(deviceIdx);
} else if (IsGpuMemType(memType)) {
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants