[DRAFT] add example script for depth importance estimation #1016
chochowski wants to merge 1 commit into main from
Conversation
📝 Walkthrough
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Main as Main Process
    participant Model as GPT Model
    participant Hooks as Importance Hooks
    participant Scoring as Scoring Module
    participant DP as DP Ranks
    Main->>Model: setup_gates(model)
    activate Model
    Model->>Hooks: Attach LastHiddenImportanceHook to layers
    deactivate Model
    Main->>Model: Reference forward pass
    activate Model
    Model->>Hooks: hook_fn captures activations
    Hooks->>Hooks: load_reference() stores baseline
    deactivate Model
    Main->>Model: Patch layer (noop_*_forward_patch)
    activate Model
    Model->>Hooks: Inference forward pass
    Hooks->>Hooks: Compute MSE vs reference
    deactivate Model
    Main->>Scoring: collect_scores(model)
    activate Scoring
    Scoring->>Hooks: Extract per-layer statistics
    Scoring->>DP: gather_across_dp(rankings)
    activate DP
    DP-->>Scoring: Aggregated rankings
    deactivate DP
    Scoring-->>Main: Importance scores
    deactivate Scoring
    Main->>Main: Dump scores, generate drop list
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/pruning/depth_ranking.py`:
- Around line 303-320: In collect_scores(), avoid hardcoding a length-10 CUDA
tensor and metric assumptions: use the provided use_metric argument when reading
stats, infer device and per-sample vector lengths dynamically, and replace
torch.zeros((10,)).cuda() with a padding strategy (e.g., pad_sequence or
creating zeros using the same device and shape as stat) so torch.stack(res)
won't fail for different DP/batch sizes; ensure aggregation (median/mean)
computes over the correct dim and that drop_group/drop_blocks are applied to the
sorted_indices from the selected metric; apply the same fixes in the related
block around the other occurrence (lines ~510-514) so both places respect
variable sample counts, device, and the caller's metric/drop settings.
- Line 1: Add the standard NVIDIA Apache 2.0 license header to the top of
depth_ranking.py (replace the solitary copyright line), using the full Apache
2.0 block required by the repository policy so the file begins with the complete
license header comment instead of just the copyright notice.
- Around line 16-17: The example imports test-only utilities
(get_mcore_gpt_model, set_seed and the inference helpers) from _test_utils which
couples the example to the test tree; replace those imports with supported
runtime equivalents or relocate the helper implementations into the examples
package. Concretely, remove references to
_test_utils.torch.megatron.models.get_mcore_gpt_model and
_test_utils.torch.misc.set_seed and either import model-building and seed
utilities from the public API (or a stable examples.helpers module), and copy
any inference helper functions used on lines ~93-96 into
examples/pruning/helpers.py and import them from there so the example no longer
depends on tests.
- Around line 78-80: Wrap the transformer_engine imports in a try/except and set
a flag (e.g., HAS_TE) so importing the example won't fail when
transformer_engine is not installed, and import/alias torch.nn.LayerNorm as the
fallback norm; then update setup_gates() to detect and handle both TE norm
classes (RMSNorm/LayerNorm from transformer_engine when HAS_TE) and standard
PyTorch norms (torch.nn.LayerNorm and any other supported norm types) by
creating the same logits_gate_list entries for either case; additionally, add a
defensive check before using model.logits_gate_list[0] to raise a clear error or
construct a default gate when logits_gate_list is empty, and optionally ensure
main() pins transformer_impl to the TE backend when intended (or document that
transformer_engine must be installed).
- Around line 353-384: The layer-to-rank mapping is wrong because
layer_id_in_this_rank() assumes equal split via num_layers_per_pp and offset;
update it to use Megatron's true pipeline offsets (e.g., call the utility used
in the distill plugin like TransformerLayer._get_layer_offset() or compute
start/end using config.num_layers_in_first_pipeline_stage and
config.num_layers_in_last_pipeline_stage together with
get_pipeline_model_parallel_rank()/get_pipeline_model_parallel_world_size()) so
that global layer IDs map to local indices correctly; adjust
layer_id_in_this_rank() to return the local index or -1 based on that computed
start/end range, and ensure patch_model/unpatch_model still use the returned
local index consistently.
📒 Files selected for processing (1)
examples/pruning/depth_ranking.py
```diff
@@ -0,0 +1,554 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
```
Add the repository Apache 2.0 header.
This new example file only has a copyright notice, so it is missing the required Apache 2.0 license block for new Python files. As per coding guidelines, "NVIDIA Apache 2.0 license header required on all new Python/C++/CUDA files".
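For reference, the standard Apache 2.0 header block looks like the sketch below; the exact wording should be copied from an existing file in the repository rather than from here:

```python
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```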
```python
from _test_utils.torch.megatron.models import get_mcore_gpt_model
from _test_utils.torch.misc import set_seed
```
Keep the example off the test-only import path.
get_mcore_gpt_model(), set_seed(), and the inference helpers are pulled from _test_utils, which lives under tests/_test_utils/.... That couples this example to the test tree instead of supported runtime code, so it will not run in a normal install or when the example is copied into user workflows.
Also applies to: 93-96
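A hypothetical `examples/pruning/helpers.py` could carry runtime replacements for the test-only utilities; the sketch below covers `set_seed` (the model-building helper would need a similar relocation):

```python
# Hypothetical examples/pruning/helpers.py: runtime replacement for the
# test-only set_seed, so the example no longer imports from tests/_test_utils.
import random

import numpy as np
import torch


def set_seed(seed: int = 1234) -> None:
    """Seed every RNG the example touches, for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```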
```python
from transformer_engine.pytorch.module.rmsnorm import RMSNorm
from transformer_engine.pytorch.module.layernorm import LayerNorm
import debugpy
```
Guard conditional imports of transformer_engine and support non-TE norm implementations.
Lines 78-79 import RMSNorm and LayerNorm from transformer_engine unconditionally at module level, but transformer_engine is not a required dependency. This will fail immediately if the package is not installed. Additionally, setup_gates() (lines 271-278) only matches TE norm implementations—if a model uses standard PyTorch LayerNorm or other norm types, logits_gate_list remains empty, causing an IndexError at line 361 when accessing model.logits_gate_list[0].
The main() function (lines 532-544) does not pin transformer_impl, so there's no guarantee TE norms are used. Either make transformer_engine an explicit requirement, or wrap the imports in a try-except block and add support for non-TE norm implementations in setup_gates().
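One way to implement the suggested guard (a sketch; `setup_gates()` would still need the matching branch for the fallback norm class):

```python
# Guarded transformer_engine import: record availability in HAS_TE and fall
# back to torch.nn.LayerNorm when TE is absent, so importing the example
# never fails outright.
try:
    from transformer_engine.pytorch.module.layernorm import LayerNorm
    from transformer_engine.pytorch.module.rmsnorm import RMSNorm
    HAS_TE = True
except ImportError:
    HAS_TE = False
    RMSNorm = None  # no TE RMSNorm available; setup_gates() must skip it
    try:
        from torch.nn import LayerNorm
    except ImportError:  # keep the module importable even without torch
        LayerNorm = None
```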
```python
def collect_scores(model, use_metric: str="mse", aggregation: str="mean", drop_blocks: List[int]=[], drop_group: int=1):
    stats=model.logits_gate_list[0].activations_stats
    print(f'{stats=}')
    res=[]
    for i in range(len(stats[use_metric])):
        stat = stats[use_metric][i]
        res.append(stat) if stat.numel()>0 else res.append(torch.zeros((10,)).cuda())

    res=torch.stack(res).float()
    print(f'{res.median(dim=1)[0].sort()=}')
    print(f'{res.mean(dim=1)[0].sort()=}')
    already_dropped = len(drop_blocks)
    if aggregation == 'median':
        sorted_indices = res.median(dim=1)[0].sort()[1]
    else:
        sorted_indices = res.mean(dim=1).sort()[1]

    drop = sorted_indices[already_dropped:already_dropped+drop_group]
```
Scoring is hardwired to one metric and a fixed sample count.
collect_scores() substitutes empty entries with torch.zeros((10,)).cuda(), and the final call site still uses collect_scores(model) plus a hardcoded scores["mse_drop"]. So the ranking silently ignores the caller's metric/drop settings, and torch.stack(res) will break once the gathered tensor length is anything other than 10, such as DP>1 or a different number of evaluation samples.
Also applies to: 510-514
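A sketch of the padding and metric handling the comment asks for (hypothetical helper; zero padding still biases the mean, so a masked aggregation may be preferable in the real fix):

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def rank_layers_by_metric(stats, use_metric="mse", aggregation="mean"):
    """Rank layers given per-layer score vectors of varying length."""
    per_layer = [s.float().flatten() for s in stats[use_metric]]
    # Infer the device from real data instead of hardcoding .cuda().
    device = next((s.device for s in per_layer if s.numel() > 0),
                  torch.device("cpu"))
    per_layer = [s.to(device) if s.numel() > 0
                 else torch.zeros(1, device=device) for s in per_layer]
    # Pad to the longest sample vector so stacking works for any DP/batch size.
    res = pad_sequence(per_layer, batch_first=True)  # (num_layers, max_samples)
    agg = res.median(dim=1).values if aggregation == "median" else res.mean(dim=1)
    return agg.argsort()  # ascending: least important layers first
```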
```python
pp_size = get_pipeline_model_parallel_world_size()
pp_rank = get_pipeline_model_parallel_rank()
num_layers = config.num_layers
num_layers_per_pp = num_layers // pp_size
offset = pp_rank*num_layers_per_pp

setup_gates(model)
# set lm head in the last hidden hook
model.logits_gate_list[0].set_lm_head(model.output_layer)

# Prepare model
def patch_model(layer_id, block='transformer'):
    if layer_id == -1:
        return None
    patch_register = model.decoder.layers[layer_id].forward
    model.decoder.layers[layer_id].forward = noop_gpt_block_forward_patch
    print_rank_0(f'Patched gpt block {layer_id} to noop_gpt_block_forward')

    return patch_register

def unpatch_model(layer_id, patch_register, block='transformer'):
    if layer_id == -1:
        return None
    print_rank_0(f'Unpatching gpt block {layer_id} ')
    model.decoder.layers[layer_id].forward = patch_register


def layer_id_in_this_rank(layer_id):
    if layer_id >= offset and layer_id < offset+num_layers_per_pp:
        return layer_id-offset
    else:
        return -1
```
Uneven pipeline parallelism breaks layer offset calculation.
The current code assumes evenly distributed layers across pipeline stages using num_layers // pp_size, but Megatron-Core supports uneven distribution where the first and last pipeline stages can have different layer counts (configured via num_layers_in_first_pipeline_stage and num_layers_in_last_pipeline_stage). This causes layer_id_in_this_rank() to patch or skip incorrect layers.
Use Megatron's layer offset utilities (like TransformerLayer._get_layer_offset() used in the distill plugin) or compute layer ranges directly from the config's first/last stage properties to correctly map global layer IDs to local ones.
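A pure-Python sketch of the corrected mapping (hypothetical helper names; it mirrors Megatron-Core's rule that when first/last stage counts are set, the remaining layers are split evenly across the middle stages):

```python
def stage_layer_counts(num_layers, pp_size, first=None, last=None):
    """Per-stage layer counts, honoring uneven first/last pipeline stages."""
    if pp_size == 1 or (first is None and last is None):
        assert num_layers % pp_size == 0, "even split requires divisibility"
        return [num_layers // pp_size] * pp_size
    f = first if first is not None else num_layers // pp_size
    l = last if last is not None else num_layers // pp_size
    middle, mid_stages = num_layers - f - l, pp_size - 2
    assert mid_stages > 0 and middle % mid_stages == 0
    return [f] + [middle // mid_stages] * mid_stages + [l]


def layer_id_in_rank(layer_id, pp_rank, counts):
    """Map a global layer id to a local index on pp_rank, or -1 if remote."""
    start = sum(counts[:pp_rank])
    end = start + counts[pp_rank]
    return layer_id - start if start <= layer_id < end else -1
```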
What does this PR do?
Type of change: new-feature
The script walks through the GPTModel transformer, patching each block as a no-op one-by-one and estimating how much removing that block affects the final output representation.
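The patch-and-score loop described above can be sketched as follows (hypothetical names; the real script also captures a reference forward pass first and gathers scores across ranks):

```python
def noop_block_forward(hidden_states, *args, **kwargs):
    # Identity forward: a stand-in for the script's noop_gpt_block_forward_patch.
    return hidden_states


def rank_blocks(layers, score_fn):
    """Patch each block to a no-op in turn; rank blocks by the damage done."""
    scores = {}
    for i, layer in enumerate(layers):
        saved = layer.forward
        layer.forward = noop_block_forward      # disable this block only
        scores[i] = score_fn()                  # e.g. MSE vs. unpatched output
        layer.forward = saved                   # restore before the next block
    # Ascending damage: the first entries are the safest blocks to drop.
    return sorted(scores, key=scores.get)
```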
Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit