Add graph-capturable one-shot all-reduce #531
Draft
mawad-amd wants to merge 11 commits into
Graph-capturable single-kernel all-reduce with barrier-compute-barrier semantics for vLLM integration. 1D layout, unmasked fast path, capped at 16 CTAs with BLOCK_SIZE=2048 for small-message optimization.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
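The launch geometry described in this commit can be sketched in plain Python. This is only an illustration, not the kernel's host code: `launch_shape` and `MAX_CTAS` are hypothetical names, and both the grid-stride assumption (each CTA loops over several tiles once the 16-CTA cap is reached) and the rule that the unmasked fast path applies when the element count is a multiple of `BLOCK_SIZE` are assumptions rather than details confirmed by the PR.

```python
BLOCK_SIZE = 2048  # elements per tile, as stated in the commit message
MAX_CTAS = 16      # hypothetical name for the 16-CTA cap


def launch_shape(numel):
    """Return (num_ctas, tiles_per_cta, unmasked) for a 1D all-reduce launch.

    Assumption: CTAs beyond the cap are replaced by a per-CTA loop over tiles,
    and the unmasked path is taken only when numel divides evenly into tiles.
    """
    if numel <= 0:
        raise ValueError("numel must be positive")
    num_tiles = -(-numel // BLOCK_SIZE)           # ceiling division
    num_ctas = min(num_tiles, MAX_CTAS)           # cap the grid at 16 CTAs
    tiles_per_cta = -(-num_tiles // num_ctas)     # tiles each CTA must cover
    unmasked = numel % BLOCK_SIZE == 0            # no partial tail tile
    return num_ctas, tiles_per_cta, unmasked
```

Under these assumptions, a small decode-sized tensor uses only a few CTAs, while larger inputs stay at 16 CTAs and each CTA covers more tiles.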
The per-block barrier variant (formerly one_shot_vllm) is the canonical one_shot implementation going forward. The old one_shot that required host-side zero+barrier is preserved as one_shot_legacy.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- Add iris/ccl/gluon/all_reduce.py: one-shot all-reduce using gluon atomics with 2-fence barrier (vs Triton's ~15). 26x faster than CustomAllreduce on 8x MI300X decode (19.8us vs 525us per call).
- Graph-capture safe by design:
  - Eager mode: zero barrier flags + end barrier (SINGLE_BARRIER=False)
  - Graph capture: skip zero + elide end barrier (SINGLE_BARRIER=True), flags accumulate across replays via relative barrier protocol
- Reject in-place aliasing (output is input) with ValueError to prevent XGMI read/write races across ranks
- Route one_shot_gluon variant and use_gluon=True through gluon backend in ccl/all_reduce.py, add one_shot_gluon to config validation

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
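The idea of barrier flags that accumulate across replays instead of being zeroed can be illustrated with a small host-side simulation. This is a sketch under stated assumptions, not the gluon kernel: threads stand in for ranks, a `threading.Condition` stands in for device atomics and fences, and the exact protocol (each rank adds 1 to every peer's flag, then waits until its own flag reaches `epoch * world_size`) is an assumed form of a relative barrier. `RelativeBarrier` and `run_replays` are hypothetical names. The sketch keeps both the start and end barriers; it does not model the elided end barrier of the SINGLE_BARRIER=True path.

```python
import threading


class RelativeBarrier:
    """Monotonic-counter barrier: flags only grow, so no reset is needed."""

    def __init__(self, world_size):
        self.world_size = world_size
        self.flags = [0] * world_size  # per-rank arrival counters, never zeroed
        self.epoch = [0] * world_size  # barriers each rank has entered so far
        self._cond = threading.Condition()

    def arrive_and_wait(self, rank):
        with self._cond:
            self.epoch[rank] += 1
            target = self.epoch[rank] * self.world_size
            for peer in range(self.world_size):
                self.flags[peer] += 1  # stands in for a remote atomic add
            self._cond.notify_all()
            # Wait relative to the accumulated count, not an absolute zero.
            self._cond.wait_for(lambda: self.flags[rank] >= target)


def run_replays(world_size=4, replays=3):
    barrier = RelativeBarrier(world_size)
    inputs = [[0.0] for _ in range(world_size)]  # one input slot per rank
    results = [[] for _ in range(world_size)]

    def worker(rank):
        for step in range(replays):
            inputs[rank][0] = float(rank + step)   # publish this replay's input
            barrier.arrive_and_wait(rank)          # start: all inputs written
            results[rank].append(sum(buf[0] for buf in inputs))
            barrier.arrive_and_wait(rank)          # end: all peers done reading

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return barrier.flags, results
```

Because the wait target is derived from the running count, the same code can be replayed any number of times without a host-side step that clears the flags, which is the property a captured graph needs.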
When no config is passed, default to one_shot_gluon with use_gluon=True. Gluon one-shot is 26x faster than CustomAllreduce for decode-phase tensors on MI300X (19.8us vs 525us per call).
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
all_reduce_preamble defaulted to Config() (two_shot/Triton) while all_reduce defaulted to Config(one_shot_gluon/gluon). As a result, the preamble created a Triton workspace without start_flags, and the gluon launch then tried to allocate a new workspace during graph capture, which caused NoneType errors from the failed heap allocation.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist