[PyT][Test] Add xfailing FSDP2 memory leak detection tests#2803
Open
pstjohn wants to merge 2 commits into NVIDIA:main
Conversation
Add tests that demonstrate two known memory issues with FSDP2 + FP8:

- Issue NVIDIA#2681: FP8 weight copies created during the te.autocast() forward pass accumulate across layers instead of being freed between layers, defeating FSDP2's memory efficiency. Detected by comparing per-layer forward memory increments against a bf16 baseline using layer hooks.
- Issue NVIDIA#2717: Transpose cache tensors (_create_transpose) allocated during backward persist until the next forward pass instead of being freed after backward completes. Detected by comparing the backward memory delta (post_bwd - post_fwd) against a bf16 baseline.

New tests:

- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects NVIDIA#2681
- test_transpose_cache_retained_after_backward: xfail, detects NVIDIA#2717

All parametrized over 5 FP8 recipes × {no_quant_init, quant_init}.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
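The detection approach described above (per-layer forward memory increments compared against a bf16 baseline) reduces to simple arithmetic on memory snapshots. A minimal sketch follows; the helper names and byte values are hypothetical, and the actual tests record torch.cuda.memory_allocated() from per-layer forward hooks rather than using hard-coded numbers.

```python
# Sketch of the per-layer increment comparison the tests perform.
# The real tests read torch.cuda.memory_allocated() in forward hooks;
# the snapshots here are hypothetical byte counts for illustration.

KIB = 1024

def per_layer_increments(snapshots):
    """Memory growth contributed by each layer's forward pass."""
    return [b - a for a, b in zip(snapshots, snapshots[1:])]

def excess_over_baseline(fp8_snapshots, bf16_snapshots):
    """Average per-layer FP8 increment minus the bf16 baseline."""
    fp8 = per_layer_increments(fp8_snapshots)
    bf16 = per_layer_increments(bf16_snapshots)
    return sum(fp8) / len(fp8) - sum(bf16) / len(bf16)

# Hypothetical 4-layer run: bf16 frees temporaries between layers,
# while FP8 weight copies add roughly 680 KiB per layer (issue #2681).
bf16 = [0, 100 * KIB, 200 * KIB, 300 * KIB, 400 * KIB]
fp8 = [0, 780 * KIB, 1560 * KIB, 2340 * KIB, 3120 * KIB]

excess = excess_over_baseline(fp8, bf16)
assert excess > 50 * KIB  # 50 KiB is the threshold used by the xfail test
```

The comparison against a bf16 control (rather than an absolute byte budget) keeps the check robust to allocator and model-size variation: both runs pay the same activation costs, so any per-layer difference isolates the FP8-specific temporaries.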
Greptile Summary: This PR adds a new test module (run_fsdp2_mem_leak.py) with xfailing FSDP2 memory leak detection tests.
Confidence Score: 5/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant PT as pytest (outer)
    participant TR as torchrun
    participant PF as pytest (inner)
    participant TF as run_fsdp2_mem_leak.py
    PT->>TR: subprocess.run(torchrun -m pytest run_fsdp2_mem_leak.py)
    TR->>PF: launch 2 processes
    PF->>TF: collect tests (recipe_name × quantized_model_init fixtures)
    Note over TF: Control tests (PASS)
    TF->>TF: test_bf16_no_excess_forward_memory()
    TF->>TF: build + shard bf16 model
    TF->>TF: warmup (WARMUP_STEPS=2)
    TF->>TF: attach _LayerMemoryTracker hooks
    TF->>TF: forward → record post-layer memory
    TF->>TF: assert per-layer increments are uniform
    TF->>TF: test_bf16_no_excess_backward_memory()
    TF->>TF: measure (post_bwd - post_fwd) for model_a
    TF->>TF: measure (post_bwd - post_fwd) for model_b
    TF->>TF: assert |delta_b - delta_a| ≤ 256 KiB
    Note over TF: FP8 tests (XFAIL — known bugs)
    TF->>TF: test_fp8_temp_accumulation_across_layers(recipe, quant_init)
    TF->>TF: bf16 baseline: avg per-layer forward increment
    TF->>TF: FP8 model: avg per-layer forward increment
    TF->>TF: assert (fp8_avg - bf16_avg) ≤ 50 KiB [xfail: ~680 KiB excess]
    TF->>TF: test_transpose_cache_retained_after_backward(recipe, quant_init)
    TF->>TF: bf16 baseline: post_bwd - post_fwd delta
    TF->>TF: FP8 model: post_bwd - post_fwd delta
    TF->>TF: assert (fp8_delta - bf16_delta) ≤ 256 KiB [xfail: ~3 MiB excess]
    PF-->>TR: exit 0 (xfails counted as pass) or 5 (all skipped)
    TR-->>PT: returncode in {0, 5}
    PT->>PT: assert returncode in (0, 5)
```
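The outer-to-inner launch pattern in the diagram can be sketched as below. The command shape and helper names are assumptions based on the diagram; exit code 5 is pytest's standard "no tests were collected" code, which the outer test accepts so that fully-skipped inner runs do not fail CI.

```python
# Sketch of the outer test's launch-and-check pattern: run the inner
# pytest session under torchrun and accept exit code 0 (tests ran,
# xfails count as pass) or 5 (pytest collected no tests, e.g. all
# skipped on unsupported hardware). Names here are illustrative.
import subprocess

ACCEPTED_EXIT_CODES = (0, 5)

def inner_run_ok(returncode):
    """True if the inner pytest session should count as success."""
    return returncode in ACCEPTED_EXIT_CODES

def launch_inner(script="run_fsdp2_mem_leak.py", nproc=2):
    # Hypothetical command shape, reconstructed from the diagram above.
    cmd = ["torchrun", f"--nproc_per_node={nproc}", "-m", "pytest", script]
    result = subprocess.run(cmd)
    assert inner_run_ok(result.returncode), f"inner pytest exited {result.returncode}"
```

Any other exit code (1 for failures, 2 for interrupted, 4 for usage errors) propagates as an outer-test failure.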
Reviews (2): Last reviewed commit: "Address review comments: fix standalone ..."
… constant

- Fix standalone runner to not pass recipe/quantized_model_init args to bf16 control tests (which take no arguments)
- Fix stale comment referencing 4-layer model (now 8 layers)
- Remove unused MEASURED_STEPS constant

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Issue #2681: FP8 weight copy accumulation during forward

FP8 weight copies created by te.autocast() accumulate across layers (~0.68 MiB/layer excess over the bf16 baseline). Detected for all 5 recipes with no_quant_init.

Issue #2717: Transpose cache retained after backward

_create_transpose tensors persist after backward until the next forward frees them (~3 MiB excess over bf16). Detected for DelayedScaling and Float8CurrentScaling with quant_init.

New tests (in run_fsdp2_mem_leak.py)

- test_bf16_no_excess_forward_memory
- test_bf16_no_excess_backward_memory
- test_fp8_temp_accumulation_across_layers
- test_transpose_cache_retained_after_backward

All FP8 tests parametrized over 5 recipes × {no_quant_init, quant_init}.
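The issue #2717 check compares the memory delta across backward (post_bwd - post_fwd) between an FP8 model and a bf16 control. A minimal sketch of that comparison, with hypothetical byte values in place of the torch.cuda.memory_allocated() readings the real tests take:

```python
# Sketch of the backward-delta comparison (issue #2717). The real
# tests read torch.cuda.memory_allocated() after forward and after
# backward; the values below are hypothetical bytes for illustration.
MIB = 1024 * 1024

def backward_delta(post_fwd, post_bwd):
    """Net memory change across the backward pass."""
    return post_bwd - post_fwd

# bf16 control: backward frees activations, so the delta is negative.
bf16_delta = backward_delta(post_fwd=40 * MIB, post_bwd=30 * MIB)

# FP8 model: transpose-cache tensors allocated during backward are
# retained until the next forward, leaving ~3 MiB extra allocated.
fp8_delta = backward_delta(post_fwd=40 * MIB, post_bwd=33 * MIB)

excess = fp8_delta - bf16_delta
assert excess > 256 * 1024  # 256 KiB is the threshold the xfail test uses
```

Differencing against the bf16 control cancels the activation memory both models free during backward, so the remaining gap isolates FP8-specific tensors that outlive the backward pass.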
Test plan
pytest tests/pytorch/distributed/test_torch_fsdp2.py — all 4 outer tests pass (including the existing model and fused_adam tests).

🤖 Generated with Claude Code