
[PyT][Test] Add xfailing FSDP2 memory leak detection tests#2803

Open
pstjohn wants to merge 2 commits into NVIDIA:main from pstjohn:pstjohn/fsdp2-mem-leak-tests

Conversation


@pstjohn pstjohn commented Mar 25, 2026

Summary

Issue #2681: FP8 weight copy accumulation during forward

FP8 weight copies created by te.autocast() accumulate across layers (~0.68 MiB/layer excess over bf16 baseline). Detected for all 5 recipes with no_quant_init.
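The comparison logic behind this check can be sketched in plain Python (a simplified illustration of the methodology described above, not the actual test code; helper names and the 50 KiB threshold are taken from the numbers quoted in this PR):

```python
# Illustrative sketch: compare average per-layer forward memory increments
# between an FP8 run and a bf16 baseline. Snapshot values are bytes of
# allocated memory recorded after each layer's forward (via layer hooks).

def per_layer_increments(snapshots):
    """Memory deltas between consecutive post-layer snapshots (bytes)."""
    return [b - a for a, b in zip(snapshots, snapshots[1:])]

def excess_per_layer(fp8_snapshots, bf16_snapshots):
    """Average FP8 per-layer increment minus the bf16 baseline average."""
    fp8_avg = sum(per_layer_increments(fp8_snapshots)) / (len(fp8_snapshots) - 1)
    bf16_avg = sum(per_layer_increments(bf16_snapshots)) / (len(bf16_snapshots) - 1)
    return fp8_avg - bf16_avg

# With ~0.68 MiB of extra allocation per layer, the excess comfortably
# exceeds a 50 KiB tolerance (hypothetical example values).
THRESHOLD = 50 * 1024
bf16 = [0, 1_000_000, 2_000_000, 3_000_000]  # uniform bf16 increments
fp8 = [0, 1_713_000, 3_426_000, 5_139_000]   # ~0.68 MiB/layer excess
assert excess_per_layer(fp8, bf16) > THRESHOLD
```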

Issue #2717: Transpose cache retained after backward

_create_transpose tensors persist after backward until the next forward frees them (~3 MiB excess over bf16). Detected for DelayedScaling and Float8CurrentScaling with quant_init.
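The backward-delta comparison can likewise be sketched in a few lines (again an illustrative simplification under assumed example values, not the test code itself; the 256 KiB tolerance comes from the test descriptions below):

```python
# Illustrative sketch: the backward "delta" is allocated memory after
# backward minus allocated memory after forward. Transpose-cache tensors
# retained after backward inflate the FP8 delta relative to bf16.

def backward_delta(post_fwd_bytes, post_bwd_bytes):
    """Memory growth across the backward pass (bytes)."""
    return post_bwd_bytes - post_fwd_bytes

def backward_excess(fp8_fwd, fp8_bwd, bf16_fwd, bf16_bwd):
    """FP8 backward delta minus the bf16 baseline delta."""
    return backward_delta(fp8_fwd, fp8_bwd) - backward_delta(bf16_fwd, bf16_bwd)

THRESHOLD = 256 * 1024  # 256 KiB tolerance
# ~3 MiB of retained transpose-cache tensors shows up in the FP8 delta
# (hypothetical example values).
excess = backward_excess(10_000_000, 14_145_728, 10_000_000, 11_000_000)
assert excess > THRESHOLD
```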

New tests (in run_fsdp2_mem_leak.py)

| Test | Type | What it checks |
| --- | --- | --- |
| test_bf16_no_excess_forward_memory | control (PASS) | bf16 per-layer increments are uniform |
| test_bf16_no_excess_backward_memory | control (PASS) | bf16 vs bf16 backward delta shows zero excess |
| test_fp8_temp_accumulation_across_layers | xfail | FP8 per-layer forward increment exceeds bf16 |
| test_transpose_cache_retained_after_backward | xfail | FP8 backward delta exceeds bf16 baseline |

All FP8 tests parametrized over 5 recipes × {no_quant_init, quant_init}.

Test plan

  • pytest tests/pytorch/distributed/test_torch_fsdp2.py — all 4 outer tests pass (including existing model and fused_adam tests)
  • bf16 control tests PASS
  • FP8 accumulation tests XFAIL for affected configurations
  • Pre-commit hooks pass

🤖 Generated with Claude Code

Add tests that demonstrate two known memory issues with FSDP2 + FP8:

- Issue NVIDIA#2681: FP8 weight copies created during te.autocast() forward pass
  accumulate across layers instead of being freed between layers, defeating
  FSDP2's memory efficiency. Detected by comparing per-layer forward memory
  increments against a bf16 baseline using layer hooks.

- Issue NVIDIA#2717: Transpose cache tensors (_create_transpose) allocated during
  backward persist until the next forward pass instead of being freed after
  backward completes. Detected by comparing the backward memory delta
  (post_bwd - post_fwd) against a bf16 baseline.

New tests:
- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects NVIDIA#2681
- test_transpose_cache_retained_after_backward: xfail, detects NVIDIA#2717

All parametrized over 5 FP8 recipes x {no_quant_init, quant_init}.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR adds a new run_fsdp2_mem_leak.py test module and a corresponding outer pytest wrapper (test_fsdp2_mem_leak_tests) that document two known FSDP2 + FP8 memory issues (#2681, #2717) as xfail tests, alongside two bf16 control tests that validate the measurement methodology.

Key changes:

  • run_fsdp2_mem_leak.py (new): Four tests using forward hooks (_LayerMemoryTracker) and backward memory deltas to detect FP8 temporary tensor accumulation. Two control tests (bf16_no_excess_forward_memory, bf16_no_excess_backward_memory) are expected to pass; two FP8 tests are marked xfail(strict=False) matching the open issues.
  • test_torch_fsdp2.py: Adds test_fsdp2_mem_leak_tests following the exact same subprocess.run(torchrun -m pytest ...) pattern as the existing test_fsdp2_fused_adam_tests, capping at 2 GPUs.
  • The module docstring's "Available --test values" list omits bf16_no_excess_backward_memory, which is runnable standalone.
  • _maybe_skip calls pytest.skip() in a path reachable from the standalone __main__ runner; for NVFP4 + quantized_model_init, this surfaces as an unhandled _pytest.outcomes.Skipped traceback instead of a clean exit.
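The outer-wrapper pattern mentioned above can be sketched as follows (flags and the test-file path are taken from this summary; helper names are illustrative, and this is a sketch of the pattern rather than the actual `test_torch_fsdp2.py` code):

```python
# Sketch of the outer pytest wrapper pattern: launch the inner test module
# under torchrun and accept exit code 0 (tests ran; xfails count as pass)
# or 5 (pytest collected no tests, e.g. every case skipped).
import subprocess

def build_torchrun_cmd(num_gpus: int) -> list:
    """Command line for the inner pytest run (flags per the summary above)."""
    return [
        "torchrun",
        f"--nproc_per_node={min(num_gpus, 2)}",  # capped at 2 GPUs
        "--local-ranks-filter=0",                # only rank 0's output
        "-m", "pytest",
        "tests/pytorch/distributed/fsdp2_tests/run_fsdp2_mem_leak.py",
    ]

def run_inner_tests(num_gpus: int) -> None:
    result = subprocess.run(build_torchrun_cmd(num_gpus))
    # 0 = tests executed; 5 = "no tests collected" (all skipped)
    assert result.returncode in (0, 5)
```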

Confidence Score: 5/5

  • Safe to merge — test-only addition with no production code changes and all prior review concerns addressed.
  • All three issues raised in the prior review round (standalone runner TypeError for control tests, stale "4-layer" comment, unused MEASURED_STEPS constant) have been resolved in the current revision. The remaining findings are non-blocking P2 items: a one-line docstring omission and an edge-case unclean exit in standalone mode for an unsupported recipe combination. The test methodology is sound, the xfail annotations correctly document known open issues, and the new outer test wrapper follows the established pattern exactly.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/pytorch/distributed/fsdp2_tests/run_fsdp2_mem_leak.py | New file adding four FSDP2 memory-leak detection tests (two bf16 control tests that pass, two FP8 xfail tests documenting known issues #2681 and #2717). Logic is sound; minor issues: module docstring omits bf16_no_excess_backward_memory from the available test list, and _maybe_skip raises pytest.skip.Exception uncaught in standalone mode for NVFP4 + quant_init. |
| tests/pytorch/distributed/test_torch_fsdp2.py | Adds test_fsdp2_mem_leak_tests outer wrapper following the exact same pattern as the existing test_fsdp2_fused_adam_tests. Correctly caps at 2 GPUs, passes --local-ranks-filter=0, and accepts exit codes 0 or 5. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant PT as pytest (outer)
    participant TR as torchrun
    participant PF as pytest (inner)
    participant TF as run_fsdp2_mem_leak.py

    PT->>TR: subprocess.run(torchrun -m pytest run_fsdp2_mem_leak.py)
    TR->>PF: launch 2 processes
    PF->>TF: collect tests (recipe_name × quantized_model_init fixtures)

    Note over TF: Control tests (PASS)
    TF->>TF: test_bf16_no_excess_forward_memory()
    TF->>TF: build + shard bf16 model
    TF->>TF: warmup (WARMUP_STEPS=2)
    TF->>TF: attach _LayerMemoryTracker hooks
    TF->>TF: forward → record post-layer memory
    TF->>TF: assert per-layer increments are uniform

    TF->>TF: test_bf16_no_excess_backward_memory()
    TF->>TF: measure (post_bwd - post_fwd) for model_a
    TF->>TF: measure (post_bwd - post_fwd) for model_b
    TF->>TF: assert |delta_b - delta_a| ≤ 256 KiB

    Note over TF: FP8 tests (XFAIL — known bugs)
    TF->>TF: test_fp8_temp_accumulation_across_layers(recipe, quant_init)
    TF->>TF: bf16 baseline: avg per-layer forward increment
    TF->>TF: FP8 model: avg per-layer forward increment
    TF->>TF: assert (fp8_avg - bf16_avg) ≤ 50 KiB  [xfail: ~680 KiB excess]

    TF->>TF: test_transpose_cache_retained_after_backward(recipe, quant_init)
    TF->>TF: bf16 baseline: post_bwd - post_fwd delta
    TF->>TF: FP8 model: post_bwd - post_fwd delta
    TF->>TF: assert (fp8_delta - bf16_delta) ≤ 256 KiB  [xfail: ~3 MiB excess]

    PF-->>TR: exit 0 (xfails counted as pass) or 5 (all skipped)
    TR-->>PT: returncode in {0, 5}
    PT->>PT: assert returncode in (0, 5)
```

Second commit: "Address review comments: fix standalone … constant"

- Fix standalone runner to not pass recipe/quantized_model_init args to
  bf16 control tests (which take no arguments)
- Fix stale comment referencing 4-layer model (now 8 layers)
- Remove unused MEASURED_STEPS constant

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>