Add graph-capturable one-shot all-reduce by mawad-amd · Pull Request #531 · ROCm/iris

mawad-amd · 2026-05-05T20:04:44Z

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Graph-capturable single-kernel all-reduce with barrier-compute-barrier semantics for vLLM integration. 1D layout, unmasked fast path, capped at 16 CTAs with BLOCK_SIZE=2048 for small-message optimization.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
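As a rough illustration of the launch shape described above, the sketch below computes a grid capped at 16 CTAs with BLOCK_SIZE=2048 and checks whether an unmasked path could apply. The helper names (`cdiv`, `launch_shape`) and the divisibility condition for the fast path are assumptions made for illustration; the PR text does not state how the kernel decides either.

```python
BLOCK_SIZE = 2048  # elements handled per block, as stated in the commit message
MAX_CTAS = 16      # grid cap, as stated in the commit message


def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def launch_shape(n_elements: int) -> tuple:
    """Return (grid_size, unmasked) for a 1D launch.

    Assumption: when there are more blocks than CTAs, each CTA loops over
    several blocks, and the unmasked path is only valid when the element
    count is an exact multiple of BLOCK_SIZE.
    """
    num_blocks = cdiv(n_elements, BLOCK_SIZE)
    grid_size = max(1, min(num_blocks, MAX_CTAS))
    unmasked = n_elements > 0 and n_elements % BLOCK_SIZE == 0
    return grid_size, unmasked
```

Under these assumptions, a 4096-element tensor would use 2 CTAs without masking, while a 1M-element tensor would hit the 16-CTA cap.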

The per-block barrier variant (formerly one_shot_vllm) is the canonical one_shot implementation going forward. The old one_shot that required host-side zero+barrier is preserved as one_shot_legacy.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

- Add iris/ccl/gluon/all_reduce.py: one-shot all-reduce using gluon atomics with 2-fence barrier (vs Triton's ~15). 26x faster than CustomAllreduce on 8x MI300X decode (19.8us vs 525us per call).
- Graph-capture safe by design:
  - Eager mode: zero barrier flags + end barrier (SINGLE_BARRIER=False)
  - Graph capture: skip zero + elide end barrier (SINGLE_BARRIER=True), flags accumulate across replays via relative barrier protocol
- Reject in-place aliasing (output is input) with ValueError to prevent XGMI read/write races across ranks
- Route one_shot_gluon variant and use_gluon=True through gluon backend in ccl/all_reduce.py, add one_shot_gluon to config validation

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
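The relative barrier protocol mentioned above can be pictured with a small host-side simulation. This is a sketch of the general idea only: the class `RelativeBarrier`, the `replay` helper, and the "wait until every flag reaches my own new value" rule are assumptions for illustration, not the kernel's actual code. The point it shows is that flags which only ever increase need no zeroing between graph replays.

```python
class RelativeBarrier:
    """Toy model of a barrier whose flags are never reset (hypothetical)."""

    def __init__(self, world_size: int):
        self.world_size = world_size
        self.flags = [0] * world_size  # zeroed once, then only incremented

    def arrive(self, rank: int) -> int:
        """Bump this rank's flag and return the value peers must reach."""
        self.flags[rank] += 1
        return self.flags[rank]

    def released(self, target: int) -> bool:
        """True once every rank's flag has reached the target value."""
        return all(flag >= target for flag in self.flags)


def replay(barrier: RelativeBarrier) -> list:
    """Simulate one launch with ranks arriving in order.

    Returns, for each arrival, whether the barrier was released right
    after that rank arrived.
    """
    observed = []
    for rank in range(barrier.world_size):
        target = barrier.arrive(rank)
        observed.append(barrier.released(target))
    return observed
```

In this model every replay behaves identically even though the flags keep growing, which is why a captured graph would not need a host-side zeroing step between replays.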

When no config is passed, default to one_shot_gluon with use_gluon=True. Gluon one-shot is 26x faster than CustomAllreduce for decode-phase tensors on MI300X (19.8us vs 525us per call).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

all_reduce_preamble defaulted to Config() (two_shot, Triton), while all_reduce defaulted to the one_shot_gluon variant on the gluon backend. As a result, the preamble created a Triton workspace without start_flags, and the gluon launch then tried to allocate a new workspace during graph capture. That heap allocation failed and surfaced as NoneType errors.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
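One common way to avoid this kind of mismatch is to route both entry points through a single default-resolution helper. The sketch below illustrates that pattern only; `SketchConfig`, `resolve_config`, `preamble`, and `all_reduce` here are hypothetical stand-ins and do not reproduce the iris API or the actual change in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for the library's config object."""
    variant: str = "two_shot"
    use_gluon: bool = False


def resolve_config(config: Optional[SketchConfig]) -> SketchConfig:
    """Single source of truth for the default used by every entry point."""
    if config is None:
        return SketchConfig(variant="one_shot_gluon", use_gluon=True)
    return config


def preamble(config: Optional[SketchConfig] = None) -> dict:
    """Build a toy workspace description matching the resolved backend."""
    cfg = resolve_config(config)
    backend = "gluon" if cfg.use_gluon else "triton"
    # Assumption: only the gluon workspace carries start_flags.
    return {"backend": backend, "has_start_flags": cfg.use_gluon}


def all_reduce(workspace: dict, config: Optional[SketchConfig] = None) -> str:
    """Refuse to run if the workspace was built for a different backend."""
    cfg = resolve_config(config)
    backend = "gluon" if cfg.use_gluon else "triton"
    if workspace["backend"] != backend:
        raise RuntimeError("workspace was prepared for a different backend")
    return backend
```

With a shared helper, calling `preamble()` and `all_reduce(ws)` without a config resolves to the same backend, so no new workspace has to be allocated at launch time.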

mawad-amd and others added 6 commits May 4, 2026 22:25

Apply Ruff auto-fixes

7816775

Add graph capture test for one_shot_vllm, rename shmem to ctx

4919e67

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Apply Ruff auto-fixes

dffa3b8

Remove benchmark script from branch

6309087

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

github-actions bot added the in-progress ("We are working on it") and iris ("Iris project issue") labels on May 5, 2026

mawad-amd and others added 5 commits May 5, 2026 13:13

Cache get_num_xcc with lru_cache to avoid repeated FFI calls

e1ccc33

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
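The caching change in commit e1ccc33 follows a standard pattern: wrap a pure, expensive query in `functools.lru_cache` so the underlying call runs once per process. The sketch below shows that pattern with a hypothetical stand-in for the FFI call; `_query_num_xcc_from_driver`, the call counter, and the returned value are placeholders, and the real `get_num_xcc` may take arguments not shown here.

```python
from functools import lru_cache

# Counts how often the (simulated) FFI call actually runs.
ffi_calls = {"count": 0}


def _query_num_xcc_from_driver() -> int:
    """Hypothetical stand-in for the real FFI query."""
    ffi_calls["count"] += 1
    return 8  # placeholder value for illustration only


@lru_cache(maxsize=None)
def get_num_xcc() -> int:
    """Return the cached XCC count; the driver is queried on first use only."""
    return _query_num_xcc_from_driver()
```

Because the hardware topology does not change during a run, an unbounded cache is safe in this sketch; `get_num_xcc.cache_clear()` is available if a test needs to force a fresh query.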

Apply Ruff auto-fixes

a7810a3

Default all-reduce to gluon one-shot

ee3ec96


mawad-amd wants to merge 11 commits into main from muhaawad/one-shot-vllm
