Add graph-capturable one-shot all-reduce #531

Draft
mawad-amd wants to merge 11 commits into main from muhaawad/one-shot-vllm
@mawad-amd
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

mawad-amd and others added 6 commits May 4, 2026 22:25
Graph-capturable single-kernel all-reduce with barrier-compute-barrier
semantics for vLLM integration. 1D layout, unmasked fast path, capped
at 16 CTAs with BLOCK_SIZE=2048 for small-message optimization.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
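For reviewers, the launch cap from the first commit message above can be sketched in plain Python. Only `BLOCK_SIZE=2048` and the 16-CTA cap come from the commit message; the helper names and the "divisible by `BLOCK_SIZE`" condition for the unmasked fast path are assumptions made for illustration, not the kernel's actual logic.

```python
BLOCK_SIZE = 2048  # elements handled per CTA per step (from the commit message)
MAX_CTAS = 16      # launch cap (from the commit message)


def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative integers."""
    return (a + b - 1) // b


def launch_grid(n_elements: int) -> int:
    """Hypothetical helper: one CTA per block, never more than MAX_CTAS."""
    return max(1, min(cdiv(n_elements, BLOCK_SIZE), MAX_CTAS))


def can_use_unmasked_path(n_elements: int) -> bool:
    """Assumed condition: no tail block, so loads and stores need no mask."""
    return n_elements % BLOCK_SIZE == 0
```

With this cap, any message larger than 16 * 2048 elements would have each CTA loop over several blocks, which fits the stated focus on small messages.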
The per-block barrier variant (formerly one_shot_vllm) is the canonical
one_shot implementation going forward. The old one_shot that required
host-side zero+barrier is preserved as one_shot_legacy.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@github-actions (bot) added the in-progress and iris labels May 5, 2026
mawad-amd and others added 5 commits May 5, 2026 13:13
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- Add iris/ccl/gluon/all_reduce.py: one-shot all-reduce using gluon
  atomics with 2-fence barrier (vs Triton's ~15). 26x faster than
  CustomAllreduce on 8x MI300X decode (19.8us vs 525us per call).

- Graph-capture safe by design:
  - Eager mode: zero barrier flags + end barrier (SINGLE_BARRIER=False)
  - Graph capture: skip zero + elide end barrier (SINGLE_BARRIER=True),
    flags accumulate across replays via relative barrier protocol

- Reject in-place aliasing (output is input) with ValueError to prevent
  XGMI read/write races across ranks

- Route one_shot_gluon variant and use_gluon=True through gluon backend
  in ccl/all_reduce.py, add one_shot_gluon to config validation

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
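One possible reading of the "relative barrier protocol" and the aliasing guard in the commit above, as a host-side simulation in plain Python. This is a sketch under assumptions: the class, its fields, and the use of a per-rank epoch counter are hypothetical and are not taken from `iris/ccl/gluon/all_reduce.py`. The point it illustrates is that flags which are never zeroed can still act as a barrier if each rank waits for a target relative to how many barriers it has entered.

```python
class RelativeBarrier:
    """Simulated cross-rank barrier whose flags accumulate across calls."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size
        # flags[dst][src]: how many times rank `src` has signalled rank `dst`.
        self.flags = [[0] * world_size for _ in range(world_size)]
        # epoch[rank]: how many barriers `rank` has entered so far (assumed
        # to live somewhere that persists across graph replays).
        self.epoch = [0] * world_size

    def arrive(self, rank: int) -> None:
        """Signal every rank; each increment stands in for a remote atomic add."""
        self.epoch[rank] += 1
        for dst in range(self.world_size):
            self.flags[dst][rank] += 1

    def passed(self, rank: int) -> bool:
        """True once every peer has signalled at least `epoch[rank]` times."""
        target = self.epoch[rank]
        return all(count >= target for count in self.flags[rank])


def reject_in_place(input_buf: object, output_buf: object) -> None:
    """Mirror of the commit's guard: refuse output that is the input object."""
    if output_buf is input_buf:
        raise ValueError("in-place all-reduce is not supported: output is input")
```

Because the wait uses `>=` rather than equality, a fast rank that has already signalled the next replay does not block a slower rank still waiting on the current one.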
When no config is passed, default to one_shot_gluon with use_gluon=True.
Gluon one-shot is 26x faster than CustomAllreduce for decode-phase
tensors on MI300X (19.8us vs 525us per call).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
all_reduce_preamble defaulted to Config() (two_shot/Triton) while
all_reduce defaulted to Config(one_shot_gluon/gluon). The preamble
created a Triton workspace without start_flags, then gluon launch
tried to allocate a new workspace during graph capture, causing
NoneType errors from failed heap allocation.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
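The mismatch described in the last commit above is the kind of bug that a single shared default can rule out. A minimal sketch, assuming a simplified stand-in for the real config object: `SketchConfig`, `resolve_config`, and the dictionary-based workspace are hypothetical names used only for illustration, and only the variant names `two_shot` and `one_shot_gluon` and the `use_gluon` flag come from the commit messages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for the library's Config."""
    variant: str = "two_shot"
    use_gluon: bool = False


def resolve_config(config: Optional[SketchConfig]) -> SketchConfig:
    """Single source of truth for the default used by every entry point."""
    if config is not None:
        return config
    return SketchConfig(variant="one_shot_gluon", use_gluon=True)


def backend_of(config: SketchConfig) -> str:
    return "gluon" if config.use_gluon else "triton"


def all_reduce_preamble(config: Optional[SketchConfig] = None) -> dict:
    """Build the workspace up front, before any graph capture starts."""
    cfg = resolve_config(config)
    backend = backend_of(cfg)
    # Only the gluon path is assumed to need start_flags in this sketch.
    start_flags = [] if backend == "gluon" else None
    return {"backend": backend, "start_flags": start_flags}


def all_reduce(workspace: dict, config: Optional[SketchConfig] = None) -> str:
    """Fail loudly on a backend mismatch instead of allocating mid-capture."""
    cfg = resolve_config(config)
    if workspace["backend"] != backend_of(cfg):
        raise RuntimeError("workspace was prepared for a different backend")
    return workspace["backend"]
```

Routing both functions through `resolve_config` means a call with no arguments always prepares and consumes the same kind of workspace, and an explicit mismatch surfaces as a clear error rather than a `NoneType` failure during capture.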