
[DRAFT] add example script for depth importance estimation #1016

Open
chochowski wants to merge 1 commit into main from depth_importance_example

Conversation

@chochowski

@chochowski chochowski commented Mar 10, 2026

What does this PR do?

Type of change: new feature

The script walks through the GPTModel transformer, patching each block into a no-op one at a time and estimating how much removing that block affects the final output representation.
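The patch-and-measure loop described above can be sketched on a toy stand-in for the GPT model. ToyTransformer, noop_forward, and estimate_block_importance are illustrative names for this sketch; the actual script operates on a Megatron GPTModel and its decoder layers.

```python
import torch

class ToyBlock(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.proj(x)

class ToyTransformer(torch.nn.Module):
    def __init__(self, dim: int, num_blocks: int):
        super().__init__()
        self.decoder = torch.nn.Module()
        self.decoder.layers = torch.nn.ModuleList(ToyBlock(dim) for _ in range(num_blocks))

    def forward(self, x):
        for block in self.decoder.layers:
            x = block(x)
        return x

def noop_forward(hidden_states, *args, **kwargs):
    # Identity patch: the block contributes nothing.
    return hidden_states

@torch.no_grad()
def estimate_block_importance(model, batch):
    reference = model(batch)  # unpatched baseline output
    scores = []
    for block in model.decoder.layers:
        original = block.forward
        block.forward = noop_forward   # patch this one block as a no-op
        perturbed = model(batch)
        scores.append(torch.nn.functional.mse_loss(perturbed, reference).item())
        block.forward = original       # restore before moving to the next block
    return scores

model = ToyTransformer(dim=8, num_blocks=4)
scores = estimate_block_importance(model, torch.randn(2, 8))
print(len(scores))  # 4
```

A higher MSE for a block means removing it perturbs the final representation more, i.e. the block is more important to keep.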

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features
    • Added example script demonstrating depth-importance estimation for GPT-like model pruning with configurable ranking metrics (MSE-based) and aggregation strategies (mean/median) to identify and selectively optimize layers based on importance scores.

@chochowski chochowski requested a review from a team as a code owner March 10, 2026 15:17
@chochowski chochowski requested a review from jenchen13 March 10, 2026 15:17
@copy-pr-bot

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

Introduces examples/pruning/depth_ranking.py, a new script that implements depth-pruning and layer importance estimation for GPT-like Megatron models using PyTorch hooks, selective layer patching, and distributed data-parallel utilities.

Changes

Cohort / File(s) Summary
Depth Pruning Workflow
examples/pruning/depth_ranking.py
New 554-line module implementing importance estimation via hooks (LastHiddenImportanceHook), no-op forward patches for MLP/attention/Mamba/Transformer/GPT blocks, MSE-based scoring, distributed gathering across DP ranks, and pruning orchestration (estimate_depth_importance, collect_scores). Includes utilities for rank detection, CUDA tensor handling, and CLI argument extension.
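The hook mechanism summarized above (a LastHiddenImportanceHook that stores a baseline activation and then scores later passes by MSE against it) can be sketched with a plain PyTorch forward hook. The class below is an illustrative reconstruction, not the script's actual implementation:

```python
import torch

class LastHiddenImportanceHook:
    def __init__(self):
        self.reference = None
        self.mse_per_pass = []

    def load_reference(self):
        # The next forward pass will be treated as the unpatched baseline.
        self.reference = None

    def __call__(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if self.reference is None:
            self.reference = hidden.detach()       # store the baseline
        else:
            # Score this (patched) pass against the stored baseline.
            self.mse_per_pass.append(
                torch.nn.functional.mse_loss(hidden.detach(), self.reference).item()
            )

hook = LastHiddenImportanceHook()
layer = torch.nn.Linear(8, 8)
handle = layer.register_forward_hook(hook)

x = torch.randn(2, 8)
layer(x)            # baseline pass: stores the reference activation
layer(x + 0.1)      # perturbed pass: records an MSE score
handle.remove()
print(len(hook.mse_per_pass))  # 1
```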

Sequence Diagram(s)

sequenceDiagram
    participant Main as Main Process
    participant Model as GPT Model
    participant Hooks as Importance Hooks
    participant Scoring as Scoring Module
    participant DP as DP Ranks

    Main->>Model: setup_gates(model)
    activate Model
    Model->>Hooks: Attach LastHiddenImportanceHook to layers
    deactivate Model
    
    Main->>Model: Reference forward pass
    activate Model
    Model->>Hooks: hook_fn captures activations
    Hooks->>Hooks: load_reference() stores baseline
    deactivate Model
    
    Main->>Model: Patch layer (noop_*_forward_patch)
    activate Model
    Model->>Hooks: Inference forward pass
    Hooks->>Hooks: Compute MSE vs reference
    deactivate Model
    
    Main->>Scoring: collect_scores(model)
    activate Scoring
    Scoring->>Hooks: Extract per-layer statistics
    Scoring->>DP: gather_across_dp(rankings)
    activate DP
    DP-->>Scoring: Aggregated rankings
    deactivate DP
    Scoring-->>Main: Importance scores
    deactivate Scoring
    
    Main->>Main: Dump scores, generate drop list

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 15.63%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately describes the main change: adding an example script for depth importance estimation, reflected in the new file examples/pruning/depth_ranking.py.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/pruning/depth_ranking.py`:
- Around line 303-320: In collect_scores(), avoid hardcoding a length-10 CUDA
tensor and metric assumptions: use the provided use_metric argument when reading
stats, infer device and per-sample vector lengths dynamically, and replace
torch.zeros((10,)).cuda() with a padding strategy (e.g., pad_sequence or
creating zeros using the same device and shape as stat) so torch.stack(res)
won't fail for different DP/batch sizes; ensure aggregation (median/mean)
computes over the correct dim and that drop_group/drop_blocks are applied to the
sorted_indices from the selected metric; apply the same fixes in the related
block around the other occurrence (lines ~510-514) so both places respect
variable sample counts, device, and the caller's metric/drop settings.
- Line 1: Add the standard NVIDIA Apache 2.0 license header to the top of
depth_ranking.py (replace the solitary copyright line), using the full Apache
2.0 block required by the repository policy so the file begins with the complete
license header comment instead of just the copyright notice.
- Around line 16-17: The example imports test-only utilities
(get_mcore_gpt_model, set_seed and the inference helpers) from _test_utils which
couples the example to the test tree; replace those imports with supported
runtime equivalents or relocate the helper implementations into the examples
package. Concretely, remove references to
_test_utils.torch.megatron.models.get_mcore_gpt_model and
_test_utils.torch.misc.set_seed and either import model-building and seed
utilities from the public API (or a stable examples.helpers module), and copy
any inference helper functions used on lines ~93-96 into
examples/pruning/helpers.py and import them from there so the example no longer
depends on tests.
- Around line 78-80: Wrap the transformer_engine imports in a try/except and set
a flag (e.g., HAS_TE) so importing the example won't fail when
transformer_engine is not installed, and import/alias torch.nn.LayerNorm as the
fallback norm; then update setup_gates() to detect and handle both TE norm
classes (RMSNorm/LayerNorm from transformer_engine when HAS_TE) and standard
PyTorch norms (torch.nn.LayerNorm and any other supported norm types) by
creating the same logits_gate_list entries for either case; additionally, add a
defensive check before using model.logits_gate_list[0] to raise a clear error or
construct a default gate when logits_gate_list is empty, and optionally ensure
main() pins transformer_impl to the TE backend when intended (or document that
transformer_engine must be installed).
- Around line 353-384: The layer-to-rank mapping is wrong because
layer_id_in_this_rank() assumes equal split via num_layers_per_pp and offset;
update it to use Megatron's true pipeline offsets (e.g., call the utility used
in the distill plugin like TransformerLayer._get_layer_offset() or compute
start/end using config.num_layers_in_first_pipeline_stage and
config.num_layers_in_last_pipeline_stage together with
get_pipeline_model_parallel_rank()/get_pipeline_model_parallel_world_size()) so
that global layer IDs map to local indices correctly; adjust
layer_id_in_this_rank() to return the local index or -1 based on that computed
start/end range, and ensure patch_model/unpatch_model still use the returned
local index consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f0060447-cc9c-4f31-b6c7-4a5eebc9255f

📥 Commits

Reviewing files that changed from the base of the PR and between cbab377 and 7a29122.

📒 Files selected for processing (1)
  • examples/pruning/depth_ranking.py

@@ -0,0 +1,554 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Contributor


⚠️ Potential issue | 🟠 Major

Add the repository Apache 2.0 header.

This new example file only has a copyright notice, so it is missing the required Apache 2.0 license block for new Python files. As per coding guidelines, "NVIDIA Apache 2.0 license header required on all new Python/C++/CUDA files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/depth_ranking.py` at line 1, Add the standard NVIDIA Apache
2.0 license header to the top of depth_ranking.py (replace the solitary
copyright line), using the full Apache 2.0 block required by the repository
policy so the file begins with the complete license header comment instead of
just the copyright notice.

Comment on lines +16 to +17
from _test_utils.torch.megatron.models import get_mcore_gpt_model
from _test_utils.torch.misc import set_seed
Contributor


⚠️ Potential issue | 🟠 Major

Keep the example off the test-only import path.

get_mcore_gpt_model(), set_seed(), and the inference helpers are pulled from _test_utils, which lives under tests/_test_utils/.... That couples this example to the test tree instead of supported runtime code, so it will not run in a normal install or when the example is copied into user workflows.

Also applies to: 93-96
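One way to address this finding is to inline small helpers locally in the examples package rather than importing from the test tree. A sketch of a local set_seed replacement (the _test_utils version may seed differently; this is a common pattern, not a copy of it):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 1234) -> None:
    # Seed every RNG the example touches so runs are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```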

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/depth_ranking.py` around lines 16 - 17, The example imports
test-only utilities (get_mcore_gpt_model, set_seed and the inference helpers)
from _test_utils which couples the example to the test tree; replace those
imports with supported runtime equivalents or relocate the helper
implementations into the examples package. Concretely, remove references to
_test_utils.torch.megatron.models.get_mcore_gpt_model and
_test_utils.torch.misc.set_seed and either import model-building and seed
utilities from the public API (or a stable examples.helpers module), and copy
any inference helper functions used on lines ~93-96 into
examples/pruning/helpers.py and import them from there so the example no longer
depends on tests.

Comment on lines +78 to +80
from transformer_engine.pytorch.module.rmsnorm import RMSNorm
from transformer_engine.pytorch.module.layernorm import LayerNorm
import debugpy
Contributor


⚠️ Potential issue | 🟠 Major


Guard conditional imports of transformer_engine and support non-TE norm implementations.

Lines 78-79 import RMSNorm and LayerNorm from transformer_engine unconditionally at module level, but transformer_engine is not a required dependency. This will fail immediately if the package is not installed. Additionally, setup_gates() (lines 271-278) only matches TE norm implementations—if a model uses standard PyTorch LayerNorm or other norm types, logits_gate_list remains empty, causing an IndexError at line 361 when accessing model.logits_gate_list[0].

The main() function (lines 532-544) does not pin transformer_impl, so there's no guarantee TE norms are used. Either make transformer_engine an explicit requirement, or wrap the imports in a try-except block and add support for non-TE norm implementations in setup_gates().
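A guarded-import sketch along the lines suggested. HAS_TE, NORM_TYPES, and is_supported_norm are illustrative names; the fallback to torch.nn norms is an assumption about what setup_gates() would accept:

```python
import torch

try:
    from transformer_engine.pytorch.module.rmsnorm import RMSNorm
    from transformer_engine.pytorch.module.layernorm import LayerNorm
    HAS_TE = True
except ImportError:
    # Fall back to PyTorch norms when transformer_engine is absent.
    # torch.nn.RMSNorm exists only in newer PyTorch releases.
    RMSNorm = getattr(torch.nn, "RMSNorm", torch.nn.LayerNorm)
    LayerNorm = torch.nn.LayerNorm
    HAS_TE = False

# setup_gates() can match against this tuple instead of TE-only classes.
NORM_TYPES = (RMSNorm, LayerNorm, torch.nn.LayerNorm)

def is_supported_norm(module: torch.nn.Module) -> bool:
    return isinstance(module, NORM_TYPES)
```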

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/depth_ranking.py` around lines 78 - 80, Wrap the
transformer_engine imports in a try/except and set a flag (e.g., HAS_TE) so
importing the example won't fail when transformer_engine is not installed, and
import/alias torch.nn.LayerNorm as the fallback norm; then update setup_gates()
to detect and handle both TE norm classes (RMSNorm/LayerNorm from
transformer_engine when HAS_TE) and standard PyTorch norms (torch.nn.LayerNorm
and any other supported norm types) by creating the same logits_gate_list
entries for either case; additionally, add a defensive check before using
model.logits_gate_list[0] to raise a clear error or construct a default gate
when logits_gate_list is empty, and optionally ensure main() pins
transformer_impl to the TE backend when intended (or document that
transformer_engine must be installed).

Comment on lines +303 to +320
def collect_scores(model, use_metric: str = "mse", aggregation: str = "mean", drop_blocks: List[int] = [], drop_group: int = 1):
    stats = model.logits_gate_list[0].activations_stats
    print(f'{stats=}')
    res = []
    for i in range(len(stats[use_metric])):
        stat = stats[use_metric][i]
        res.append(stat) if stat.numel() > 0 else res.append(torch.zeros((10,)).cuda())

    res = torch.stack(res).float()
    print(f'{res.median(dim=1)[0].sort()=}')
    print(f'{res.mean(dim=1)[0].sort()=}')
    already_dropped = len(drop_blocks)
    if aggregation == 'median':
        sorted_indices = res.median(dim=1)[0].sort()[1]
    else:
        sorted_indices = res.mean(dim=1).sort()[1]

    drop = sorted_indices[already_dropped:already_dropped + drop_group]
Contributor


⚠️ Potential issue | 🟠 Major

Scoring is hardwired to one metric and a fixed sample count.

collect_scores() substitutes empty entries with torch.zeros((10,)).cuda(), and the final call site still uses collect_scores(model) plus a hardcoded scores["mse_drop"]. So the ranking silently ignores the caller's metric/drop settings, and torch.stack(res) will break once the gathered tensor length is anything other than 10, such as DP>1 or a different number of evaluation samples.

Also applies to: 510-514
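A device- and length-agnostic padding sketch for this fix; stack_with_padding is an illustrative name for what collect_scores() could use in place of the hardcoded torch.zeros((10,)).cuda() filler:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def stack_with_padding(per_layer_stats):
    # Take length, dtype, and device from a real (non-empty) entry
    # instead of assuming 10 samples on CUDA.
    ref = next((s for s in per_layer_stats if s.numel() > 0), None)
    if ref is None:
        raise ValueError("no importance statistics were collected")
    filled = [s if s.numel() > 0 else torch.zeros_like(ref) for s in per_layer_stats]
    # pad_sequence also tolerates entries of differing lengths (DP > 1,
    # different evaluation sample counts).
    return pad_sequence(filled, batch_first=True).float()

stats = [torch.tensor([1.0, 2.0]), torch.tensor([]), torch.tensor([3.0, 4.0, 5.0])]
print(stack_with_padding(stats).shape)  # torch.Size([3, 3])
```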

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/depth_ranking.py` around lines 303 - 320, In
collect_scores(), avoid hardcoding a length-10 CUDA tensor and metric
assumptions: use the provided use_metric argument when reading stats, infer
device and per-sample vector lengths dynamically, and replace
torch.zeros((10,)).cuda() with a padding strategy (e.g., pad_sequence or
creating zeros using the same device and shape as stat) so torch.stack(res)
won't fail for different DP/batch sizes; ensure aggregation (median/mean)
computes over the correct dim and that drop_group/drop_blocks are applied to the
sorted_indices from the selected metric; apply the same fixes in the related
block around the other occurrence (lines ~510-514) so both places respect
variable sample counts, device, and the caller's metric/drop settings.

Comment on lines +353 to +384
    pp_size = get_pipeline_model_parallel_world_size()
    pp_rank = get_pipeline_model_parallel_rank()
    num_layers = config.num_layers
    num_layers_per_pp = num_layers // pp_size
    offset = pp_rank * num_layers_per_pp

    setup_gates(model)
    # set lm head in the last hidden hook
    model.logits_gate_list[0].set_lm_head(model.output_layer)

    # Prepare model
    def patch_model(layer_id, block='transformer'):
        if layer_id == -1:
            return None
        patch_register = model.decoder.layers[layer_id].forward
        model.decoder.layers[layer_id].forward = noop_gpt_block_forward_patch
        print_rank_0(f'Patched gpt block {layer_id} to noop_gpt_block_forward')

        return patch_register

    def unpatch_model(layer_id, patch_register, block='transformer'):
        if layer_id == -1:
            return None
        print_rank_0(f'Unpatching gpt block {layer_id} ')
        model.decoder.layers[layer_id].forward = patch_register

    def layer_id_in_this_rank(layer_id):
        if layer_id >= offset and layer_id < offset + num_layers_per_pp:
            return layer_id - offset
        else:
            return -1
Contributor


⚠️ Potential issue | 🟠 Major


🌐 Web query:

Megatron-Core pipeline parallel layer distribution first_pipeline_stage last_pipeline_stage

💡 Result:

In Megatron-Core, “first/last pipeline stage” refers to the pipeline-parallel rank at the ends of the model graph:

  • First pipeline stage: the rank that owns the “front” of the model (typically embeddings / pre_process). In the single-module case this is generally PP rank = 0 (and parallel_state.is_pipeline_first_stage() is used to gate that logic). [1]
  • Last pipeline stage: the rank that owns the “tail” of the model (typically final norm / logits / loss / post_process). In the single-module case this is generally PP rank = PP_size − 1 (gated by parallel_state.is_pipeline_last_stage()). [4]

How transformer layers are distributed across PP stages

  1. Even split (default)
    If you don’t request a custom/uneven layout, Megatron-Core assumes an equal split across pipeline stages (and, for interleaved/virtual pipeline, it further splits into PP_size * VPP_size “stages”). Example shown in the docs: with PP=4, VPP=2, num_layers=16 ⇒ 8 stages of 2 layers each, interleaved across GPUs. [1]

  2. Uneven split with “first/last stage layer counts”
    Megatron-Core exposes config fields:

  • num_layers_in_first_pipeline_stage
  • num_layers_in_last_pipeline_stage
    If set, the remaining layers are evenly distributed over the middle PP ranks, and Megatron-Core requires the remainder to be divisible by the number of remaining stages; also, this mode is not compatible with interleaved (virtual) pipeline. [2][3]

(These correspond to CLI options in Megatron-LM like --decoder-first-pipeline-num-layers / --decoder-last-pipeline-num-layers.) [3]

  1. Fully custom per-stage layout
    For non-uniform/handcrafted partitioning beyond “first/last heavier”, use pipeline_model_parallel_layout, which is parsed into a PipelineParallelLayerLayout and can specify exactly which layers live on each PP (and VPP) stage. [2][5]

Sources: [1] https://docs.nvidia.com/megatron-core/developer-guide/latest/apidocs/core/core.parallel_state.html, [2] https://docs.nvidia.com/megatron-core/developer-guide/latest/apidocs/core/core.transformer.transformer_config.html, [3] NVIDIA/Megatron-LM#1303, [4] https://docs.nvidia.com/megatron-core/developer-guide/0.16.0/apidocs/core/core.pipeline_parallel.multimodule_communicator.html, [5] https://docs.nvidia.com/megatron-core/developer-guide/0.15.0/apidocs/core/core.transformer.pipeline_parallel_layer_layout.html



Uneven pipeline parallelism breaks layer offset calculation.

The current code assumes evenly distributed layers across pipeline stages using num_layers // pp_size, but Megatron-Core supports uneven distribution where the first and last pipeline stages can have different layer counts (configured via num_layers_in_first_pipeline_stage and num_layers_in_last_pipeline_stage). This causes layer_id_in_this_rank() to patch or skip incorrect layers.

Use Megatron's layer offset utilities (like TransformerLayer._get_layer_offset() used in the distill plugin) or compute layer ranges directly from the config's first/last stage properties to correctly map global layer IDs to local ones.
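The uneven-split arithmetic can be sketched as follows. This is an illustrative helper built from the first/last-stage convention described above (remaining layers split evenly over the middle ranks); in the real script, Megatron's own offset utility such as TransformerLayer._get_layer_offset() should be preferred:

```python
def pp_layer_range(num_layers, pp_rank, pp_size, first=None, last=None):
    """Return (start, end) of the global layer indices owned by this PP rank."""
    # Default to an even split when no first/last-stage override is set.
    first = first if first is not None else num_layers // pp_size
    last = last if last is not None else num_layers // pp_size
    if pp_size == 1:
        return 0, num_layers
    middle = num_layers - first - last
    per_middle = middle // (pp_size - 2) if pp_size > 2 else 0
    if pp_rank == 0:
        return 0, first
    if pp_rank == pp_size - 1:
        return num_layers - last, num_layers
    start = first + (pp_rank - 1) * per_middle
    return start, start + per_middle

def layer_id_in_this_rank(layer_id, start, end):
    # Map a global layer ID to a local index, or -1 if it lives on another rank.
    return layer_id - start if start <= layer_id < end else -1
```

For example, with 8 layers, PP=4, first=3, last=1, the per-rank ranges come out as (0,3), (3,5), (5,7), (7,8), i.e. the uneven [3, 2, 2, 1] layout the even-split code would mis-handle.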

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/depth_ranking.py` around lines 353 - 384, The layer-to-rank
mapping is wrong because layer_id_in_this_rank() assumes equal split via
num_layers_per_pp and offset; update it to use Megatron's true pipeline offsets
(e.g., call the utility used in the distill plugin like
TransformerLayer._get_layer_offset() or compute start/end using
config.num_layers_in_first_pipeline_stage and
config.num_layers_in_last_pipeline_stage together with
get_pipeline_model_parallel_rank()/get_pipeline_model_parallel_world_size()) so
that global layer IDs map to local indices correctly; adjust
layer_id_in_this_rank() to return the local index or -1 based on that computed
start/end range, and ensure patch_model/unpatch_model still use the returned
local index consistently.

