
[PyT][Test] Add xfailing FSDP2 memory leak detection tests#2803

Open
pstjohn wants to merge 2 commits into NVIDIA:main from pstjohn:pstjohn/fsdp2-mem-leak-tests

Conversation


@pstjohn pstjohn commented Mar 25, 2026

Summary

Issue #2681: FP8 weight copy accumulation during forward

FP8 weight copies created by te.autocast() accumulate across layers (~0.68 MiB/layer excess over bf16 baseline). Detected for all 5 recipes with no_quant_init.
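The comparison logic behind this check can be sketched in plain Python (a simplified illustration of the methodology described above, not the actual test code; helper names and the 50 KiB threshold are taken from the numbers quoted in this PR):

```python
# Illustrative sketch: compare average per-layer forward memory increments
# between an FP8 run and a bf16 baseline. Snapshot values are bytes of
# allocated memory recorded after each layer's forward (via layer hooks).

def per_layer_increments(snapshots):
    """Memory deltas between consecutive post-layer snapshots (bytes)."""
    return [b - a for a, b in zip(snapshots, snapshots[1:])]

def excess_per_layer(fp8_snapshots, bf16_snapshots):
    """Average FP8 per-layer increment minus the bf16 baseline average."""
    fp8_avg = sum(per_layer_increments(fp8_snapshots)) / (len(fp8_snapshots) - 1)
    bf16_avg = sum(per_layer_increments(bf16_snapshots)) / (len(bf16_snapshots) - 1)
    return fp8_avg - bf16_avg

# With ~0.68 MiB of extra allocation per layer, the excess comfortably
# exceeds a 50 KiB tolerance (hypothetical example values).
THRESHOLD = 50 * 1024
bf16 = [0, 1_000_000, 2_000_000, 3_000_000]  # uniform bf16 increments
fp8 = [0, 1_713_000, 3_426_000, 5_139_000]   # ~0.68 MiB/layer excess
assert excess_per_layer(fp8, bf16) > THRESHOLD
```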

Issue #2717: Transpose cache retained after backward

_create_transpose tensors persist after backward until the next forward frees them (~3 MiB excess over bf16). Detected for DelayedScaling and Float8CurrentScaling with quant_init.
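The backward-delta comparison can likewise be sketched in a few lines (again an illustrative simplification under assumed example values, not the test code itself; the 256 KiB tolerance comes from the test descriptions below):

```python
# Illustrative sketch: the backward "delta" is allocated memory after
# backward minus allocated memory after forward. Transpose-cache tensors
# retained after backward inflate the FP8 delta relative to bf16.

def backward_delta(post_fwd_bytes, post_bwd_bytes):
    """Memory growth across the backward pass (bytes)."""
    return post_bwd_bytes - post_fwd_bytes

def backward_excess(fp8_fwd, fp8_bwd, bf16_fwd, bf16_bwd):
    """FP8 backward delta minus the bf16 baseline delta."""
    return backward_delta(fp8_fwd, fp8_bwd) - backward_delta(bf16_fwd, bf16_bwd)

THRESHOLD = 256 * 1024  # 256 KiB tolerance
# ~3 MiB of retained transpose-cache tensors shows up in the FP8 delta
# (hypothetical example values).
excess = backward_excess(10_000_000, 14_145_728, 10_000_000, 11_000_000)
assert excess > THRESHOLD
```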

New tests (in run_fsdp2_mem_leak.py)

| Test | Type | What it checks |
| --- | --- | --- |
| test_bf16_no_excess_forward_memory | control (PASS) | bf16 per-layer increments are uniform |
| test_bf16_no_excess_backward_memory | control (PASS) | bf16 vs bf16 backward delta shows zero excess |
| test_fp8_temp_accumulation_across_layers | xfail | FP8 per-layer forward increment exceeds bf16 |
| test_transpose_cache_retained_after_backward | xfail | FP8 backward delta exceeds bf16 baseline |

All FP8 tests parametrized over 5 recipes × {no_quant_init, quant_init}.

Test plan

  • pytest tests/pytorch/distributed/test_torch_fsdp2.py — all 4 outer tests pass (including existing model and fused_adam tests)
  • bf16 control tests PASS
  • FP8 accumulation tests XFAIL for affected configurations
  • Pre-commit hooks pass

🤖 Generated with Claude Code

Add tests that demonstrate two known memory issues with FSDP2 + FP8:

- Issue NVIDIA#2681: FP8 weight copies created during te.autocast() forward pass
  accumulate across layers instead of being freed between layers, defeating
  FSDP2's memory efficiency. Detected by comparing per-layer forward memory
  increments against a bf16 baseline using layer hooks.

- Issue NVIDIA#2717: Transpose cache tensors (_create_transpose) allocated during
  backward persist until the next forward pass instead of being freed after
  backward completes. Detected by comparing the backward memory delta
  (post_bwd - post_fwd) against a bf16 baseline.

New tests:
- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects NVIDIA#2681
- test_transpose_cache_retained_after_backward: xfail, detects NVIDIA#2717

All parametrized over 5 FP8 recipes x {no_quant_init, quant_init}.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR adds a new run_fsdp2_mem_leak.py test module and a corresponding outer pytest wrapper (test_fsdp2_mem_leak_tests) that document two known FSDP2 + FP8 memory issues (#2681, #2717) as xfail tests, alongside two bf16 control tests that validate the measurement methodology.

Key changes:

  • run_fsdp2_mem_leak.py (new): Four tests using forward hooks (_LayerMemoryTracker) and backward memory deltas to detect FP8 temporary tensor accumulation. Two control tests (bf16_no_excess_forward_memory, bf16_no_excess_backward_memory) are expected to pass; two FP8 tests are marked xfail(strict=False) matching the open issues.
  • test_torch_fsdp2.py: Adds test_fsdp2_mem_leak_tests following the exact same subprocess.run(torchrun -m pytest ...) pattern as the existing test_fsdp2_fused_adam_tests, capping at 2 GPUs.
  • The module docstring's "Available --test values" list omits bf16_no_excess_backward_memory, which is runnable standalone.
  • _maybe_skip calls pytest.skip() in a path reachable from the standalone __main__ runner; for NVFP4 + quantized_model_init, this surfaces as an unhandled _pytest.outcomes.Skipped traceback instead of a clean exit.
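The outer-wrapper pattern mentioned above can be sketched as follows (flags and the test-file path are taken from this summary; helper names are illustrative, and this is a sketch of the pattern rather than the actual `test_torch_fsdp2.py` code):

```python
# Sketch of the outer pytest wrapper pattern: launch the inner test module
# under torchrun and accept exit code 0 (tests ran; xfails count as pass)
# or 5 (pytest collected no tests, e.g. every case skipped).
import subprocess

def build_torchrun_cmd(num_gpus: int) -> list:
    """Command line for the inner pytest run (flags per the summary above)."""
    return [
        "torchrun",
        f"--nproc_per_node={min(num_gpus, 2)}",  # capped at 2 GPUs
        "--local-ranks-filter=0",                # only rank 0's output
        "-m", "pytest",
        "tests/pytorch/distributed/fsdp2_tests/run_fsdp2_mem_leak.py",
    ]

def run_inner_tests(num_gpus: int) -> None:
    result = subprocess.run(build_torchrun_cmd(num_gpus))
    # 0 = tests executed; 5 = "no tests collected" (all skipped)
    assert result.returncode in (0, 5)
```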

Confidence Score: 5/5

  • Safe to merge — test-only addition with no production code changes and all prior review concerns addressed.
  • All three issues raised in the prior review round (standalone runner TypeError for control tests, stale "4-layer" comment, unused MEASURED_STEPS constant) have been resolved in the current revision. The remaining findings are non-blocking P2 items: a one-line docstring omission and an edge-case unclean exit in standalone mode for an unsupported recipe combination. The test methodology is sound, the xfail annotations correctly document known open issues, and the new outer test wrapper follows the established pattern exactly.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/pytorch/distributed/fsdp2_tests/run_fsdp2_mem_leak.py | New file adding four FSDP2 memory-leak detection tests (two bf16 control tests that pass, two FP8 xfail tests documenting known issues #2681 and #2717). Logic is sound; minor issues: module docstring omits bf16_no_excess_backward_memory from the available test list, and _maybe_skip raises pytest.skip.Exception uncaught in standalone mode for NVFP4 + quant_init. |
| tests/pytorch/distributed/test_torch_fsdp2.py | Adds test_fsdp2_mem_leak_tests outer wrapper following the exact same pattern as the existing test_fsdp2_fused_adam_tests. Correctly caps at 2 GPUs, passes --local-ranks-filter=0, and accepts exit codes 0 or 5. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant PT as pytest (outer)
    participant TR as torchrun
    participant PF as pytest (inner)
    participant TF as run_fsdp2_mem_leak.py

    PT->>TR: subprocess.run(torchrun -m pytest run_fsdp2_mem_leak.py)
    TR->>PF: launch 2 processes
    PF->>TF: collect tests (recipe_name × quantized_model_init fixtures)

    Note over TF: Control tests (PASS)
    TF->>TF: test_bf16_no_excess_forward_memory()
    TF->>TF: build + shard bf16 model
    TF->>TF: warmup (WARMUP_STEPS=2)
    TF->>TF: attach _LayerMemoryTracker hooks
    TF->>TF: forward → record post-layer memory
    TF->>TF: assert per-layer increments are uniform

    TF->>TF: test_bf16_no_excess_backward_memory()
    TF->>TF: measure (post_bwd - post_fwd) for model_a
    TF->>TF: measure (post_bwd - post_fwd) for model_b
    TF->>TF: assert |delta_b - delta_a| ≤ 256 KiB

    Note over TF: FP8 tests (XFAIL — known bugs)
    TF->>TF: test_fp8_temp_accumulation_across_layers(recipe, quant_init)
    TF->>TF: bf16 baseline: avg per-layer forward increment
    TF->>TF: FP8 model: avg per-layer forward increment
    TF->>TF: assert (fp8_avg - bf16_avg) ≤ 50 KiB  [xfail: ~680 KiB excess]

    TF->>TF: test_transpose_cache_retained_after_backward(recipe, quant_init)
    TF->>TF: bf16 baseline: post_bwd - post_fwd delta
    TF->>TF: FP8 model: post_bwd - post_fwd delta
    TF->>TF: assert (fp8_delta - bf16_delta) ≤ 256 KiB  [xfail: ~3 MiB excess]

    PF-->>TR: exit 0 (xfails counted as pass) or 5 (all skipped)
    TR-->>PT: returncode in {0, 5}
    PT->>PT: assert returncode in (0, 5)
```

Second commit: "Address review comments: fix standalone … constant"

- Fix standalone runner to not pass recipe/quantized_model_init args to
  bf16 control tests (which take no arguments)
- Fix stale comment referencing 4-layer model (now 8 layers)
- Remove unused MEASURED_STEPS constant

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>