Optimize fp8 block scaling Allgather for FSDP2 by vthumbe1503 · Pull Request #2789 · NVIDIA/TransformerEngine

vthumbe1503 · 2026-03-23T00:45:10Z

Description

Eliminate Columnwise allgather for fp8_model_init with fsdp2. For weights when FP8 blockscaling is used, we typically use 2d. And in such a case, columnwise data and scale inv is just the transpose of the rowwise data and scale inverse. And so allgathering the rowwise data/scales are enough

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

for more information, see https://pre-commit.ci

vthumbe1503 · 2026-03-23T00:47:45Z

/te-ci L1 pytorch

greptile-apps · 2026-03-23T00:50:04Z

Greptile Summary

This PR introduces a communication-halving optimization for FP8 block-scaled weights under PyTorch FSDP2: instead of all-gathering both rowwise and columnwise tensors, only rowwise data and scales are all-gathered, and the columnwise view is derived locally via fp8_transpose (a cheap on-device kernel) in fsdp_post_all_gather. This is valid because 2D block scaling (128×128 tiles) makes the columnwise tensor a mathematical transpose of the rowwise tensor, so all_gather(shards) → transpose gives the same result as transpose(shards) → all_gather. The PR also applies a consistent defensive improvement across all three tensor types (Float8BlockwiseQTensor, Float8Tensor, MXFP8Tensor): caching fsdp_state._fsdp_param_group into a local variable and raising a descriptive RuntimeError when it is None, instead of letting an AttributeError propagate.

Key changes:

float8_blockwise_tensor.py: fsdp_pre_all_gather now reads _training_state to determine whether the call is for a forward or backward pass (when reshard_after_forward=True), encodes rowwise_usage/columnwise_usage in metadata, and always ships only (rowwise_data, rowwise_scale_inv). fsdp_post_all_gather unconditionally extracts these two tensors, then calls _create_columnwise() only when columnwise_usage=True, followed by update_usage() to release whichever form is not needed.
float8_tensor.py / mxfp8_tensor.py: Guard added for _fsdp_param_group is None; param_group._training_state access deduped through the local variable.
No test coverage was added for the new Float8BlockwiseQTensor FSDP2 path (checklist unchecked).

Confidence Score: 4/5

PR is safe to merge; optimization is mathematically correct and prior review concerns have been addressed.
The core optimization — all-gathering only rowwise tensors and deriving columnwise via local fp8_transpose — is mathematically sound for 2D block scaling. The forward/backward pass detection via TrainingState.PRE_BACKWARD mirrors the existing pattern in float8_tensor.py and mxfp8_tensor.py. The buffer-reuse path in _create_columnwise (passing out=self._columnwise_data) correctly avoids per-iteration GPU allocations on subsequent iterations. Prior concerns from the review thread (param_group None-guard, stale columnwise fields, allocation reuse) have all been addressed. The only remaining items are two minor P2 style notes (redundant _create_columnwise call path and a trailing semicolon in a comment) plus the absence of a dedicated test for the new blockwise FSDP2 optimization.
transformer_engine/pytorch/tensor/float8_blockwise_tensor.py — contains the main optimization; fsdp_pre_all_gather and fsdp_post_all_gather logic should be validated with an integration test covering both reshard_after_forward=True and reshard_after_forward=False.

Important Files Changed

Filename	Overview
transformer_engine/pytorch/tensor/float8_blockwise_tensor.py	Core optimization: eliminates columnwise all-gather for 2D block-scaled weights by all-gathering only rowwise data and deriving columnwise locally via fp8_transpose in post_all_gather. Logic is mathematically sound (transpose distributes over all-gather for dim-0 sharding). Minor: explicit _create_columnwise() call before update_usage is slightly redundant with update_usage's own fallback; trailing semicolon in a comment.
transformer_engine/pytorch/tensor/float8_tensor.py	Defensive improvement only: caches fsdp_state._fsdp_param_group in a local variable and adds a None-guard with a descriptive RuntimeError, matching the new pattern in float8_blockwise_tensor.py. No functional logic change.
transformer_engine/pytorch/tensor/mxfp8_tensor.py	Same defensive improvement as float8_tensor.py: None-guard added for _fsdp_param_group with a clear RuntimeError. No algorithmic changes.