
[DRAFT] add example script for depth importance estimation #1016

Open
chochowski wants to merge 1 commit into main from depth_importance_example

Conversation

@chochowski

@chochowski chochowski commented Mar 10, 2026

What does this PR do?

Type of change: new feature

The script walks through the GPTModel transformer, patching each block into a no-op one at a time and estimating how much removing that block affects the final output representation.
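The patch-and-measure loop described above can be sketched on a toy stand-in for the GPT model. ToyTransformer, noop_forward, and estimate_block_importance are illustrative names for this sketch; the actual script operates on a Megatron GPTModel and its decoder layers.

```python
import torch

class ToyBlock(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.proj(x)

class ToyTransformer(torch.nn.Module):
    def __init__(self, dim: int, num_blocks: int):
        super().__init__()
        self.decoder = torch.nn.Module()
        self.decoder.layers = torch.nn.ModuleList(ToyBlock(dim) for _ in range(num_blocks))

    def forward(self, x):
        for block in self.decoder.layers:
            x = block(x)
        return x

def noop_forward(hidden_states, *args, **kwargs):
    # Identity patch: the block contributes nothing.
    return hidden_states

@torch.no_grad()
def estimate_block_importance(model, batch):
    reference = model(batch)  # unpatched baseline output
    scores = []
    for block in model.decoder.layers:
        original = block.forward
        block.forward = noop_forward   # patch this one block as a no-op
        perturbed = model(batch)
        scores.append(torch.nn.functional.mse_loss(perturbed, reference).item())
        block.forward = original       # restore before moving to the next block
    return scores

model = ToyTransformer(dim=8, num_blocks=4)
scores = estimate_block_importance(model, torch.randn(2, 8))
print(len(scores))  # 4
```

A higher MSE for a block means removing it perturbs the final representation more, i.e. the block is more important to keep.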

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features
    • Added example script demonstrating depth-importance estimation for GPT-like model pruning with configurable ranking metrics (MSE-based) and aggregation strategies (mean/median) to identify and selectively optimize layers based on importance scores.

@chochowski chochowski requested a review from a team as a code owner March 10, 2026 15:17
@chochowski chochowski requested a review from jenchen13 March 10, 2026 15:17
@copy-pr-bot

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

Introduces examples/pruning/depth_ranking.py, a new script that implements depth-pruning and layer importance estimation for GPT-like Megatron models using PyTorch hooks, selective layer patching, and distributed data-parallel utilities.

Changes

Cohort / File(s) Summary
Depth Pruning Workflow
examples/pruning/depth_ranking.py
New 554-line module implementing importance estimation via hooks (LastHiddenImportanceHook), no-op forward patches for MLP/attention/Mamba/Transformer/GPT blocks, MSE-based scoring, distributed gathering across DP ranks, and pruning orchestration (estimate_depth_importance, collect_scores). Includes utilities for rank detection, CUDA tensor handling, and CLI argument extension.
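The hook mechanism summarized above (a LastHiddenImportanceHook that stores a baseline activation and then scores later passes by MSE against it) can be sketched with a plain PyTorch forward hook. The class below is an illustrative reconstruction, not the script's actual implementation:

```python
import torch

class LastHiddenImportanceHook:
    def __init__(self):
        self.reference = None
        self.mse_per_pass = []

    def load_reference(self):
        # The next forward pass will be treated as the unpatched baseline.
        self.reference = None

    def __call__(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if self.reference is None:
            self.reference = hidden.detach()       # store the baseline
        else:
            # Score this (patched) pass against the stored baseline.
            self.mse_per_pass.append(
                torch.nn.functional.mse_loss(hidden.detach(), self.reference).item()
            )

hook = LastHiddenImportanceHook()
layer = torch.nn.Linear(8, 8)
handle = layer.register_forward_hook(hook)

x = torch.randn(2, 8)
layer(x)            # baseline pass: stores the reference activation
layer(x + 0.1)      # perturbed pass: records an MSE score
handle.remove()
print(len(hook.mse_per_pass))  # 1
```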

Sequence Diagram(s)

sequenceDiagram
    participant Main as Main Process
    participant Model as GPT Model
    participant Hooks as Importance Hooks
    participant Scoring as Scoring Module
    participant DP as DP Ranks

    Main->>Model: setup_gates(model)
    activate Model
    Model->>Hooks: Attach LastHiddenImportanceHook to layers
    deactivate Model
    
    Main->>Model: Reference forward pass
    activate Model
    Model->>Hooks: hook_fn captures activations
    Hooks->>Hooks: load_reference() stores baseline
    deactivate Model
    
    Main->>Model: Patch layer (noop_*_forward_patch)
    activate Model
    Model->>Hooks: Inference forward pass
    Hooks->>Hooks: Compute MSE vs reference
    deactivate Model
    
    Main->>Scoring: collect_scores(model)
    activate Scoring
    Scoring->>Hooks: Extract per-layer statistics
    Scoring->>DP: gather_across_dp(rankings)
    activate DP
    DP-->>Scoring: Aggregated rankings
    deactivate DP
    Scoring-->>Main: Importance scores
    deactivate Scoring
    
    Main->>Main: Dump scores, generate drop list

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 15.63%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately describes the main change: adding an example script for depth importance estimation, reflected in the new file examples/pruning/depth_ranking.py.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/pruning/depth_ranking.py`:
- Around line 303-320: In collect_scores(), avoid hardcoding a length-10 CUDA
tensor and metric assumptions: use the provided use_metric argument when reading
stats, infer device and per-sample vector lengths dynamically, and replace
torch.zeros((10,)).cuda() with a padding strategy (e.g., pad_sequence or
creating zeros using the same device and shape as stat) so torch.stack(res)
won't fail for different DP/batch sizes; ensure aggregation (median/mean)
computes over the correct dim and that drop_group/drop_blocks are applied to the
sorted_indices from the selected metric; apply the same fixes in the related
block around the other occurrence (lines ~510-514) so both places respect
variable sample counts, device, and the caller's metric/drop settings.
- Line 1: Add the standard NVIDIA Apache 2.0 license header to the top of
depth_ranking.py (replace the solitary copyright line), using the full Apache
2.0 block required by the repository policy so the file begins with the complete
license header comment instead of just the copyright notice.
- Around line 16-17: The example imports test-only utilities
(get_mcore_gpt_model, set_seed and the inference helpers) from _test_utils which
couples the example to the test tree; replace those imports with supported
runtime equivalents or relocate the helper implementations into the examples
package. Concretely, remove references to
_test_utils.torch.megatron.models.get_mcore_gpt_model and
_test_utils.torch.misc.set_seed and either import model-building and seed
utilities from the public API (or a stable examples.helpers module), and copy
any inference helper functions used on lines ~93-96 into
examples/pruning/helpers.py and import them from there so the example no longer
depends on tests.
- Around line 78-80: Wrap the transformer_engine imports in a try/except and set
a flag (e.g., HAS_TE) so importing the example won't fail when
transformer_engine is not installed, and import/alias torch.nn.LayerNorm as the
fallback norm; then update setup_gates() to detect and handle both TE norm
classes (RMSNorm/LayerNorm from transformer_engine when HAS_TE) and standard
PyTorch norms (torch.nn.LayerNorm and any other supported norm types) by
creating the same logits_gate_list entries for either case; additionally, add a
defensive check before using model.logits_gate_list[0] to raise a clear error or
construct a default gate when logits_gate_list is empty, and optionally ensure
main() pins transformer_impl to the TE backend when intended (or document that
transformer_engine must be installed).
- Around line 353-384: The layer-to-rank mapping is wrong because
layer_id_in_this_rank() assumes equal split via num_layers_per_pp and offset;
update it to use Megatron's true pipeline offsets (e.g., call the utility used
in the distill plugin like TransformerLayer._get_layer_offset() or compute
start/end using config.num_layers_in_first_pipeline_stage and
config.num_layers_in_last_pipeline_stage together with
get_pipeline_model_parallel_rank()/get_pipeline_model_parallel_world_size()) so
that global layer IDs map to local indices correctly; adjust
layer_id_in_this_rank() to return the local index or -1 based on that computed
start/end range, and ensure patch_model/unpatch_model still use the returned
local index consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f0060447-cc9c-4f31-b6c7-4a5eebc9255f

📥 Commits

Reviewing files that changed from the base of the PR and between cbab377 and 7a29122.

📒 Files selected for processing (1)
  • examples/pruning/depth_ranking.py

@@ -0,0 +1,554 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Contributor


⚠️ Potential issue | 🟠 Major

Add the repository Apache 2.0 header.

This new example file only has a copyright notice, so it is missing the required Apache 2.0 license block for new Python files. As per coding guidelines, "NVIDIA Apache 2.0 license header required on all new Python/C++/CUDA files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/depth_ranking.py` at line 1, Add the standard NVIDIA Apache
2.0 license header to the top of depth_ranking.py (replace the solitary
copyright line), using the full Apache 2.0 block required by the repository
policy so the file begins with the complete license header comment instead of
just the copyright notice.

Comment on lines +16 to +17
from _test_utils.torch.megatron.models import get_mcore_gpt_model
from _test_utils.torch.misc import set_seed
Contributor


⚠️ Potential issue | 🟠 Major

Keep the example off the test-only import path.

get_mcore_gpt_model(), set_seed(), and the inference helpers are pulled from _test_utils, which lives under tests/_test_utils/.... That couples this example to the test tree instead of supported runtime code, so it will not run in a normal install or when the example is copied into user workflows.

Also applies to: 93-96
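One way to address this finding is to inline small helpers locally in the examples package rather than importing from the test tree. A sketch of a local set_seed replacement (the _test_utils version may seed differently; this is a common pattern, not a copy of it):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 1234) -> None:
    # Seed every RNG the example touches so runs are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```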

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/depth_ranking.py` around lines 16 - 17, The example imports
test-only utilities (get_mcore_gpt_model, set_seed and the inference helpers)
from _test_utils which couples the example to the test tree; replace those
imports with supported runtime equivalents or relocate the helper
implementations into the examples package. Concretely, remove references to
_test_utils.torch.megatron.models.get_mcore_gpt_model and
_test_utils.torch.misc.set_seed and either import model-building and seed
utilities from the public API (or a stable examples.helpers module), and copy
any inference helper functions used on lines ~93-96 into
examples/pruning/helpers.py and import them from there so the example no longer
depends on tests.

Comment on lines +78 to +80
from transformer_engine.pytorch.module.rmsnorm import RMSNorm
from transformer_engine.pytorch.module.layernorm import LayerNorm
import debugpy
Contributor


⚠️ Potential issue | 🟠 Major


Guard conditional imports of transformer_engine and support non-TE norm implementations.

Lines 78-79 import RMSNorm and LayerNorm from transformer_engine unconditionally at module level, but transformer_engine is not a required dependency. This will fail immediately if the package is not installed. Additionally, setup_gates() (lines 271-278) only matches TE norm implementations—if a model uses standard PyTorch LayerNorm or other norm types, logits_gate_list remains empty, causing an IndexError at line 361 when accessing model.logits_gate_list[0].

The main() function (lines 532-544) does not pin transformer_impl, so there's no guarantee TE norms are used. Either make transformer_engine an explicit requirement, or wrap the imports in a try-except block and add support for non-TE norm implementations in setup_gates().
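A guarded-import sketch along the lines suggested. HAS_TE, NORM_TYPES, and is_supported_norm are illustrative names; the fallback to torch.nn norms is an assumption about what setup_gates() would accept:

```python
import torch

try:
    from transformer_engine.pytorch.module.rmsnorm import RMSNorm
    from transformer_engine.pytorch.module.layernorm import LayerNorm
    HAS_TE = True
except ImportError:
    # Fall back to PyTorch norms when transformer_engine is absent.
    # torch.nn.RMSNorm exists only in newer PyTorch releases.
    RMSNorm = getattr(torch.nn, "RMSNorm", torch.nn.LayerNorm)
    LayerNorm = torch.nn.LayerNorm
    HAS_TE = False

# setup_gates() can match against this tuple instead of TE-only classes.
NORM_TYPES = (RMSNorm, LayerNorm, torch.nn.LayerNorm)

def is_supported_norm(module: torch.nn.Module) -> bool:
    return isinstance(module, NORM_TYPES)
```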

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/depth_ranking.py` around lines 78 - 80, Wrap the
transformer_engine imports in a try/except and set a flag (e.g., HAS_TE) so
importing the example won't fail when transformer_engine is not installed, and
import/alias torch.nn.LayerNorm as the fallback norm; then update setup_gates()
to detect and handle both TE norm classes (RMSNorm/LayerNorm from
transformer_engine when HAS_TE) and standard PyTorch norms (torch.nn.LayerNorm
and any other supported norm types) by creating the same logits_gate_list
entries for either case; additionally, add a defensive check before using
model.logits_gate_list[0] to raise a clear error or construct a default gate
when logits_gate_list is empty, and optionally ensure main() pins
transformer_impl to the TE backend when intended (or document that
transformer_engine must be installed).

Comment on lines +303 to +320
def collect_scores(model, use_metric: str = "mse", aggregation: str = "mean", drop_blocks: List[int] = [], drop_group: int = 1):
    stats = model.logits_gate_list[0].activations_stats
    print(f'{stats=}')
    res = []
    for i in range(len(stats[use_metric])):
        stat = stats[use_metric][i]
        res.append(stat) if stat.numel() > 0 else res.append(torch.zeros((10,)).cuda())

    res = torch.stack(res).float()
    print(f'{res.median(dim=1)[0].sort()=}')
    print(f'{res.mean(dim=1)[0].sort()=}')
    already_dropped = len(drop_blocks)
    if aggregation == 'median':
        sorted_indices = res.median(dim=1)[0].sort()[1]
    else:
        sorted_indices = res.mean(dim=1).sort()[1]

    drop = sorted_indices[already_dropped:already_dropped + drop_group]
Contributor


⚠️ Potential issue | 🟠 Major

Scoring is hardwired to one metric and a fixed sample count.

collect_scores() substitutes empty entries with torch.zeros((10,)).cuda(), and the final call site still uses collect_scores(model) plus a hardcoded scores["mse_drop"]. So the ranking silently ignores the caller's metric/drop settings, and torch.stack(res) will break once the gathered tensor length is anything other than 10, such as DP>1 or a different number of evaluation samples.

Also applies to: 510-514
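A device- and length-agnostic padding sketch for this fix; stack_with_padding is an illustrative name for what collect_scores() could use in place of the hardcoded torch.zeros((10,)).cuda() filler:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def stack_with_padding(per_layer_stats):
    # Take length, dtype, and device from a real (non-empty) entry
    # instead of assuming 10 samples on CUDA.
    ref = next((s for s in per_layer_stats if s.numel() > 0), None)
    if ref is None:
        raise ValueError("no importance statistics were collected")
    filled = [s if s.numel() > 0 else torch.zeros_like(ref) for s in per_layer_stats]
    # pad_sequence also tolerates entries of differing lengths (DP > 1,
    # different evaluation sample counts).
    return pad_sequence(filled, batch_first=True).float()

stats = [torch.tensor([1.0, 2.0]), torch.tensor([]), torch.tensor([3.0, 4.0, 5.0])]
print(stack_with_padding(stats).shape)  # torch.Size([3, 3])
```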

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/depth_ranking.py` around lines 303 - 320, In
collect_scores(), avoid hardcoding a length-10 CUDA tensor and metric
assumptions: use the provided use_metric argument when reading stats, infer
device and per-sample vector lengths dynamically, and replace
torch.zeros((10,)).cuda() with a padding strategy (e.g., pad_sequence or
creating zeros using the same device and shape as stat) so torch.stack(res)
won't fail for different DP/batch sizes; ensure aggregation (median/mean)
computes over the correct dim and that drop_group/drop_blocks are applied to the
sorted_indices from the selected metric; apply the same fixes in the related
block around the other occurrence (lines ~510-514) so both places respect
variable sample counts, device, and the caller's metric/drop settings.

Comment on lines +353 to +384
    pp_size = get_pipeline_model_parallel_world_size()
    pp_rank = get_pipeline_model_parallel_rank()
    num_layers = config.num_layers
    num_layers_per_pp = num_layers // pp_size
    offset = pp_rank * num_layers_per_pp

    setup_gates(model)
    # set lm head in the last hidden hook
    model.logits_gate_list[0].set_lm_head(model.output_layer)

    # Prepare model
    def patch_model(layer_id, block='transformer'):
        if layer_id == -1:
            return None
        patch_register = model.decoder.layers[layer_id].forward
        model.decoder.layers[layer_id].forward = noop_gpt_block_forward_patch
        print_rank_0(f'Patched gpt block {layer_id} to noop_gpt_block_forward')

        return patch_register

    def unpatch_model(layer_id, patch_register, block='transformer'):
        if layer_id == -1:
            return None
        print_rank_0(f'Unpatching gpt block {layer_id} ')
        model.decoder.layers[layer_id].forward = patch_register

    def layer_id_in_this_rank(layer_id):
        if layer_id >= offset and layer_id < offset + num_layers_per_pp:
            return layer_id - offset
        else:
            return -1
Contributor


⚠️ Potential issue | 🟠 Major


🌐 Web query:

Megatron-Core pipeline parallel layer distribution first_pipeline_stage last_pipeline_stage

💡 Result:

In Megatron-Core, “first/last pipeline stage” refers to the pipeline-parallel rank at the ends of the model graph:

  • First pipeline stage: the rank that owns the “front” of the model (typically embeddings / pre_process). In the single-module case this is generally PP rank = 0 (and parallel_state.is_pipeline_first_stage() is used to gate that logic). [1]
  • Last pipeline stage: the rank that owns the “tail” of the model (typically final norm / logits / loss / post_process). In the single-module case this is generally PP rank = PP_size − 1 (gated by parallel_state.is_pipeline_last_stage()). [4]

How transformer layers are distributed across PP stages

  1. Even split (default)
    If you don’t request a custom/uneven layout, Megatron-Core assumes an equal split across pipeline stages (and, for interleaved/virtual pipeline, it further splits into PP_size * VPP_size “stages”). Example shown in the docs: with PP=4, VPP=2, num_layers=16 ⇒ 8 stages of 2 layers each, interleaved across GPUs. [1]

  2. Uneven split with “first/last stage layer counts”
    Megatron-Core exposes config fields:

  • num_layers_in_first_pipeline_stage
  • num_layers_in_last_pipeline_stage
    If set, the remaining layers are evenly distributed over the middle PP ranks, and Megatron-Core requires the remainder to be divisible by the number of remaining stages; also, this mode is not compatible with interleaved (virtual) pipeline. [2][3]

(These correspond to CLI options in Megatron-LM like --decoder-first-pipeline-num-layers / --decoder-last-pipeline-num-layers.) [3]

  1. Fully custom per-stage layout
    For non-uniform/handcrafted partitioning beyond “first/last heavier”, use pipeline_model_parallel_layout, which is parsed into a PipelineParallelLayerLayout and can specify exactly which layers live on each PP (and VPP) stage. [2][5]

Sources: [1] https://docs.nvidia.com/megatron-core/developer-guide/latest/apidocs/core/core.parallel_state.html, [2] https://docs.nvidia.com/megatron-core/developer-guide/latest/apidocs/core/core.transformer.transformer_config.html, [3] NVIDIA/Megatron-LM#1303, [4] https://docs.nvidia.com/megatron-core/developer-guide/0.16.0/apidocs/core/core.pipeline_parallel.multimodule_communicator.html, [5] https://docs.nvidia.com/megatron-core/developer-guide/0.15.0/apidocs/core/core.transformer.pipeline_parallel_layer_layout.html



Uneven pipeline parallelism breaks layer offset calculation.

The current code assumes evenly distributed layers across pipeline stages using num_layers // pp_size, but Megatron-Core supports uneven distribution where the first and last pipeline stages can have different layer counts (configured via num_layers_in_first_pipeline_stage and num_layers_in_last_pipeline_stage). This causes layer_id_in_this_rank() to patch or skip incorrect layers.

Use Megatron's layer offset utilities (like TransformerLayer._get_layer_offset() used in the distill plugin) or compute layer ranges directly from the config's first/last stage properties to correctly map global layer IDs to local ones.
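The uneven-split arithmetic can be sketched as follows. This is an illustrative helper built from the first/last-stage convention described above (remaining layers split evenly over the middle ranks); in the real script, Megatron's own offset utility such as TransformerLayer._get_layer_offset() should be preferred:

```python
def pp_layer_range(num_layers, pp_rank, pp_size, first=None, last=None):
    """Return (start, end) of the global layer indices owned by this PP rank."""
    # Default to an even split when no first/last-stage override is set.
    first = first if first is not None else num_layers // pp_size
    last = last if last is not None else num_layers // pp_size
    if pp_size == 1:
        return 0, num_layers
    middle = num_layers - first - last
    per_middle = middle // (pp_size - 2) if pp_size > 2 else 0
    if pp_rank == 0:
        return 0, first
    if pp_rank == pp_size - 1:
        return num_layers - last, num_layers
    start = first + (pp_rank - 1) * per_middle
    return start, start + per_middle

def layer_id_in_this_rank(layer_id, start, end):
    # Map a global layer ID to a local index, or -1 if it lives on another rank.
    return layer_id - start if start <= layer_id < end else -1
```

For example, with 8 layers, PP=4, first=3, last=1, the per-rank ranges come out as (0,3), (3,5), (5,7), (7,8), i.e. the uneven [3, 2, 2, 1] layout the even-split code would mis-handle.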

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/depth_ranking.py` around lines 353 - 384, The layer-to-rank
mapping is wrong because layer_id_in_this_rank() assumes equal split via
num_layers_per_pp and offset; update it to use Megatron's true pipeline offsets
(e.g., call the utility used in the distill plugin like
TransformerLayer._get_layer_offset() or compute start/end using
config.num_layers_in_first_pipeline_stage and
config.num_layers_in_last_pipeline_stage together with
get_pipeline_model_parallel_rank()/get_pipeline_model_parallel_world_size()) so
that global layer IDs map to local indices correctly; adjust
layer_id_in_this_rank() to return the local index or -1 based on that computed
start/end range, and ensure patch_model/unpatch_model still use the returned
local index consistently.

