Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
📝 Walkthrough

LayerActivationCollector was moved into a new utils module and implemented as a stateful patching helper. sequential_calibrate was updated to use it for per-layer capture/replay. The HuggingFace plugin added decoder-discovery helpers and registered them. Tests were expanded to cover collector behavior and sequential calibration scenarios.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Model
    participant LAC as LayerActivationCollector
    participant ForwardLoop
    participant CalibFunc as Calibration Function
    Note over Model,LAC: Initialization
    ForwardLoop->>LAC: provide model & decoder_layers
    LAC->>Model: _patch_all_layers(decoder_layers)
    Model-->>LAC: patched forwards installed
    loop for each decoder layer i
        Note over ForwardLoop,LAC: Capture phase
        ForwardLoop->>Model: run forward (patched)
        Model->>LAC: patched_forward(mode=capture) -> record inputs, raise EarlyStop
        LAC->>ForwardLoop: return captured inputs for layer i
        Note over ForwardLoop,CalibFunc: Calibration
        ForwardLoop->>CalibFunc: call with per-layer forward that replays captured inputs
        CalibFunc->>Model: apply weight updates to layer i
        Note over ForwardLoop,LAC: Replay phase
        ForwardLoop->>Model: run forward (patched, mode=run)
        Model->>LAC: patched_forward(mode=run) -> replay using cached inputs
        LAC->>LAC: _set_layer_states(i+1)
    end
    Note over Model,LAC: Cleanup
    LAC->>Model: _unpatch_all_layers()
    Model-->>LAC: original forwards restored
```
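The per-layer mode transitions in the diagram can be sketched as a minimal, framework-free stand-in. The class and function names below are illustrative only; the real collector stores this state on each patched `nn.Module` and drives real forward passes:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class _SeqCalibState:
    """Hypothetical per-layer state; field names mirror the diagram."""
    mode: str = "skip"  # one of: skip, run, capture, original
    cached_inputs: deque = field(default_factory=deque)
    collected_inputs: list = field(default_factory=list)

def set_layer_states(states, layer_idx):
    """Transition modes before calibrating layer `layer_idx` (sketch)."""
    if layer_idx > 1:
        done = states[layer_idx - 2]
        done.mode = "skip"              # fully done: free its cached inputs
        done.cached_inputs = deque()
    if layer_idx > 0:
        prev = states[layer_idx - 1]
        prev.mode = "run"               # replay captured inputs once
        prev.cached_inputs = deque(prev.collected_inputs)
        prev.collected_inputs = []
    cur = states[layer_idx]
    cur.mode = "capture"                # record inputs, then early-stop
    cur.collected_inputs = []

states = [_SeqCalibState() for _ in range(4)]
for i in range(3):
    set_layer_states(states, i)
print([s.mode for s in states])  # → ['skip', 'run', 'capture', 'skip']
```

This is why only two layers' activations are ever held at once, giving the O(N) behavior described below.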
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Force-pushed from 13b9033 to 7f72422.
modelopt/torch/quantization/utils.py (Outdated)
```python
class LayerActivationCollector:
    """Helper class for collecting layer activations during forward passes.

@dataclass
```
Can we move the patching & capture related logic to a new file?
Pull request overview
Refactors sequential calibration to avoid O(N²) activation collection by introducing a stateful, model-forward-based activation collection approach (skip/run/capture) that preserves parent-module forward logic.
Changes:
- Replaces the previous decoder-layer detection + per-layer patching with a state-aware `LayerActivationCollector` that patches all decoder layers and uses early-stop to capture inputs efficiently.
- Updates `sequential_calibrate()` to use the new collector and removes the legacy `get_decoder_layers()` heuristic from `modelopt/torch/utils/network.py`.
- Adds/extends unit tests to validate skip/run/capture behavior, decoder-layer discoverer registration, and HuggingFace integration.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/torch/quantization/test_utils.py | Adds unit tests for the new LayerActivationCollector registration/discovery API and skip/run/capture semantics. |
| tests/unit/torch/quantization/test_sequential_calibrate.py | Adds deeper behavioral tests for tuple outputs, inter-layer ops, mode transitions, and cleanup for sequential calibration patching. |
| tests/unit/torch/quantization/test_calib.py | Adds tests for the new sequential calibration support gate and activation propagation behavior. |
| tests/unit/torch/quantization/plugins/test_huggingface.py | Adds tests validating HuggingFace decoder-layer discoverer registration and homogeneous model detection. |
| modelopt/torch/utils/network.py | Removes the old get_decoder_layers() heuristic helper. |
| modelopt/torch/quantization/utils.py | Introduces the new stateful LayerActivationCollector with patched forward and output-metadata-based skipping. |
| modelopt/torch/quantization/plugins/huggingface.py | Registers HuggingFace decoder-layer discoverers with LayerActivationCollector. |
| modelopt/torch/quantization/model_calib.py | Updates sequential_calibrate() to use the new collector lifecycle (patch once, calibrate layer-by-layer, unpatch). |
```python
def is_homogenous_hf_model(model: nn.Module) -> bool:
    if is_nemotron_h_model(model):
        return False
    decoder_layers = get_homogeneous_hf_decoder_layers(model)
    if decoder_layers is None or len(decoder_layers) == 0:
        return False
    layer_classes = {type(layer) for layer in decoder_layers}
    return len(layer_classes) == 1
```
The helper is named is_homogenous_hf_model, but the correct term is “homogeneous”. Since this is a newly introduced API surface (and is referenced by tests/registration), consider renaming it (and its uses) to is_homogeneous_hf_model for clarity and to avoid baking in a misspelling.
```python
def _register_test_discoverer(monkeypatch):
    """Register a simple discoverer that finds model.layers on any model."""
    monkeypatch.setattr(
        LayerActivationCollector,
        "_decoder_layer_support",
        [(lambda m: hasattr(m, "layers"), lambda m: m.layers)],
    )
```
This file now introduces _register_test_discoverer() for decoder-layer discovery, but earlier tests in the same module still call LayerActivationCollector.get_input_activations() / sequential_calibrate() without registering any decoder-layer support (and without patching collector layers). With the new LayerActivationCollector implementation, those tests will fail unless they register a discoverer (like this helper) and patch/unpatch appropriately.
modelopt/torch/quantization/utils.py (Outdated)
```diff
 _decoder_layer_support: list[tuple[Any, Any]] = []
 _LAYER_ATTR = "_seq_calib"

 def __init__(self, model: nn.Module):
     self.model = model
     self._decoder_layers: nn.ModuleList | None = None
     self._layer_to_idx: dict[nn.Module, int] = {}
     self._patched = False

 # ------------------------------------------------------------------
 # Decoder-layer discovery
 # ------------------------------------------------------------------

-@staticmethod
-def _patch_and_initialize_layer(layer: torch.nn.Module, stop_after_collection: bool = False):
-    """Patch a layer to collect inputs during forward passes."""
+def get_decoder_layers(model: nn.Module) -> nn.ModuleList | None:
+    """Return decoder layers supported by sequential calibration."""
+    for is_supported, discoverer in LayerActivationCollector._decoder_layer_support:
+        if not is_supported(model):
+            continue
+        decoder_layers = discoverer(model)
+        if decoder_layers is not None:
+            return decoder_layers
+    return None
```
LayerActivationCollector._decoder_layer_support defaults to an empty list, so LayerActivationCollector.is_supported() (and thus sequential_calibrate()) will always fail unless some plugin/module has been imported that registers discoverers. If sequential_calibrate/collector are intended to be usable when imported directly (without modelopt.torch.quantization which imports plugins), consider registering baseline discoverers for common patterns (e.g. model.layers, model.model.layers, model.decoder.layers, model.transformer.h) in utils.py, similar to the removed get_decoder_layers heuristic.
modelopt/torch/quantization/utils.py (Outdated)
```python
    Layers before the target are skipped or re-run (if just calibrated), the
    target layer captures its inputs, and an early-stop prevents unnecessary
    computation beyond the target.
    """
```
LayerActivationCollector.get_input_activations() assumes _patch_all_layers() has already been called (it indexes into _layer_to_idx and relies on patched forwards). As written, calling this public method without patching will raise KeyError / behave incorrectly. Either make patching lazy inside get_input_activations() (patch on first use) or add a clear assertion/error telling callers to call _patch_all_layers() first.
```diff
     """
+    # Ensure layers have been patched before attempting to collect activations.
+    if not hasattr(self, "_layer_to_idx") or layer not in self._layer_to_idx:
+        raise RuntimeError(
+            "LayerActivationCollector.get_input_activations() was called before layers were patched. "
+            "Make sure to call LayerActivationCollector._patch_all_layers() (or the appropriate "
+            "patching method) before requesting input activations."
+        )
```
Logic Correctness Review

Overall, this is a well-designed refactor that correctly implements the state machine for sequential calibration. However, I found several logic issues that should be addressed:

🔴 Critical Issues

1. State transitions for layers 0 and 1

In `_set_layer_states`:

```python
if layer_idx > 1:
    done = self._decoder_layers[layer_idx - 2]._seq_calib
    done.mode = "skip"
    done.cached_inputs = []  # Free memory
if layer_idx > 0:
    prev = self._decoder_layers[layer_idx - 1]._seq_calib
    prev.mode = "run"
    prev.cached_inputs = prev.collected_inputs  # Move collected -> cached
    prev.collected_inputs = []
```

On closer inspection, the transition logic is correct for the intended use case: after calibration, layer 0 moves to "original", and when calibrating layer 1, `_set_layer_states` replays layer 0 from its captured inputs, so the suspected race does not materialize. However, there is a different issue:

2. Missing validation: What if
Actionable comments posted: 2
🧹 Nitpick comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
1199-1206: Consider making the generic HF discoverer structural instead of homogeneous-only.

`LayerActivationCollector` now tracks `output_meta` per layer, and the new regression cases cover heterogeneous layer stacks. Requiring a single layer class here seems stricter than the runtime needs and will exclude mixed-block decoders that still expose a clean `model.model.layers` list.
LayerActivationCollectornow tracksoutput_metaper layer, and the new regression cases cover heterogeneous layer stacks. Requiring a single layer class here seems stricter than the runtime needs and will exclude mixed-block decoders that still expose a cleanmodel.model.layerslist.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 1199 - 1206, The current is_homogenous_hf_model function rejects decoders with mixed layer classes; instead relax it to a structural discoverer: in is_homogenous_hf_model (and still respect is_nemotron_h_model) return True whenever get_homogeneous_hf_decoder_layers(model) yields a non-empty decoder layer list and the model exposes a layers sequence (e.g. model.model.layers) whose elements provide the runtime metadata LayerActivationCollector expects (check for the presence of output_meta or another per-layer attribute/ability that LayerActivationCollector uses), and remove the strict len(layer_classes) == 1 check; keep using get_homogeneous_hf_decoder_layers and is_nemotron_h_model as anchors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 913af19a-8050-4673-b980-42453284b8d7
📒 Files selected for processing (9)
- modelopt/torch/quantization/activation_collector.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/utils.py
- modelopt/torch/utils/network.py
- tests/unit/torch/quantization/plugins/test_huggingface.py
- tests/unit/torch/quantization/test_calib.py
- tests/unit/torch/quantization/test_sequential_calibrate.py
- tests/unit/torch/quantization/test_utils.py
💤 Files with no reviewable changes (1)
- modelopt/torch/utils/network.py
```python
cached_inputs: deque = field(default_factory=deque)
collected_inputs: list = field(default_factory=list)
output_meta: tuple | None = None
```
output_meta only tracks the last batch.
`run` overwrites a single `output_meta` on every replayed batch, and `skip` reuses that one shape for every later batch. With a normal `drop_last=False` loader or variable sequence lengths, a later pass can synthesize dummy outputs with the wrong dimensions for earlier batches. This needs per-batch metadata that can be replayed in order on each future pass, not one shared slot per layer.
Also applies to: 163-179
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/activation_collector.py` around lines 50 - 52,
The current implementation stores a single output_meta that gets overwritten in
ActivationCollector.run and reused in skip, causing replayed batches to get
incorrect shapes; change output_meta from a single tuple to a per-batch sequence
(e.g., a deque or list) named output_meta_list or output_meta: deque, append the
metadata for each captured batch inside ActivationCollector.run (use the same
indexing order as cached_inputs/collected_inputs), update
ActivationCollector.skip and any replay logic to consume/peek the corresponding
per-batch metadata in FIFO order (maintaining alignment with cached_inputs
entries) and remove or rotate entries consistently after replays so each
replayed batch uses its original per-batch output_meta rather than a shared slot
(also apply the same change to the other occurrence referenced around lines
163-179).
```python
def _set_layer_states(self, layer_idx: int):
    """Transition layer modes for the next calibration step.

    When calibrating layer *i*, three transitions happen:

    * Layer ``i - 2`` → **skip** (fully done, free its cached inputs).
    * Layer ``i - 1`` → **run** (replay captured inputs with calibrated weights).
    * Layer ``i`` → **capture** (record inputs, then early-stop).
    """
    assert self._decoder_layers is not None

    if layer_idx > 1:
        done = self._decoder_layers[layer_idx - 2]._seq_calib
        done.mode = "skip"
        done.cached_inputs = deque()

    if layer_idx > 0:
        prev = self._decoder_layers[layer_idx - 1]._seq_calib
        prev.mode = "run"
        prev.cached_inputs = deque(prev.collected_inputs)
        prev.collected_inputs = []

    cur = self._decoder_layers[layer_idx]._seq_calib
    cur.mode = "capture"
    cur.collected_inputs = []
```
Fail fast when the sequential state machine falls out of sync.
These transitions assume the previous pass both captured inputs and replayed once. If a caller skips a decoder layer, or forward_loop() never reaches the target, you only discover it later when a skipped layer hits the RuntimeError on Line 165. Please validate done.output_meta / prev.collected_inputs before switching modes, and reject an empty capture right after forward_loop() so the failure is reported at the source.
Suggested guardrails

```diff
 def _set_layer_states(self, layer_idx: int):
     assert self._decoder_layers is not None
     if layer_idx > 1:
         done = self._decoder_layers[layer_idx - 2]._seq_calib
+        if done.output_meta is None:
+            raise RuntimeError(
+                f"Layer {done.name} cannot enter 'skip' before a successful replay."
+            )
         done.mode = "skip"
         done.cached_inputs = deque()
     if layer_idx > 0:
         prev = self._decoder_layers[layer_idx - 1]._seq_calib
+        if not prev.collected_inputs:
+            raise RuntimeError(
+                f"Layer {prev.name} has no captured inputs to replay."
+            )
         prev.mode = "run"
         prev.cached_inputs = deque(prev.collected_inputs)
         prev.collected_inputs = []
```

```diff
     forward_loop(self.model)
     info = layer._seq_calib
     inputs = list(info.collected_inputs)
     # After capture, set to original so calib_func can call the layer's
     # real forward directly. The layer will transition to run → skip
     # in subsequent iterations via _set_layer_states.
     info.mode = "original"
+    if not inputs:
+        raise RuntimeError(
+            f"Layer {info.name} did not capture any inputs during forward_loop()."
+        )
     return inputs
```

Also applies to: 293-307
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/activation_collector.py` around lines 241 - 265,
The state-transition code in _set_layer_states assumes prior stages
captured/replayed successfully; add fail-fast checks before changing modes:
verify done.output_meta is present (non-empty) before setting done.mode="skip"
and verify prev.collected_inputs is non-empty before setting prev.mode="run" and
copying to prev.cached_inputs; if either check fails, raise a clear RuntimeError
indicating the specific layer index and missing data (use layer_idx and
references to self._decoder_layers[...] ._seq_calib). After forward_loop()
completes, also validate that the current layer's collected_inputs is non-empty
and raise a descriptive error immediately if the capture is empty. Apply the
same guardrails to the analogous transition block later in the file (the other
spot that manipulates ._seq_calib, .collected_inputs and .cached_inputs).
cjluo-nv left a comment:
Review Summary
The core refactoring from O(N²) to O(N) sequential calibration via skip/run/capture state machine is well-designed. The architecture, cleanup handling, and test coverage are all strong. However, there are a few issues that should be addressed before merge.
Critical: Missing model pattern registrations (regression)
The old get_decoder_layers() in network.py supported 5 model patterns. The PR only registers 2 (Nemotron-H and homogeneous HF). Missing support for:
- Megatron/MCore: `model.decoder.layers` — no registration in `plugins/megatron.py`
- GPT-style: `model.transformer.h`
- Direct layers: `model.layers` (when it's an `nn.ModuleList`)
- Nemotron Super/Nano without `block_type`: old code matched `model.backbone.layers` unconditionally; new `get_nemotron_h_decoder_layers` requires `block_type` on `layers[0]`
Models that previously worked with sequential calibration will now raise "Could not find transformer layers".
Medium Issues
- Typo in public API: `is_homogenous_hf_model` should be `is_homogeneous_hf_model` (missing 'e'). It appears in the function name, test name, and registration. Worth fixing now before it becomes a stable API.
- `forward_loop is None` guard removed: The old code explicitly raised `ValueError("forward_loop must not be None ...")`. The PR removes this check (and the corresponding test `test_seq_calib_raises_on_none_forward_loop`) without adding an equivalent guard. A `None` forward_loop will now produce an unhelpful `TypeError` deep in the stack.
- `_decoder_layer_support` as a mutable class variable (activation_collector.py:83): This list is shared across all instances and grows monotonically. Tests work around this with `monkeypatch`, but in production, if plugins are re-imported or the module is reloaded, entries could accumulate. Consider documenting this as intentional or adding a guard.
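A guard for the registry issue could dedupe registrations by the discoverer's identity. This is a sketch with hypothetical names; the real registry is `LayerActivationCollector._decoder_layer_support`, and keying on module + qualified name is one possible choice:

```python
class CollectorSketch:
    """Sketch: class-level registry with a dedupe guard, so re-importing a
    plugin module does not grow the shared list."""
    _decoder_layer_support = []  # shared across instances, by design
    _registered_keys = set()

    @classmethod
    def register_decoder_layer_support(cls, is_supported, discoverer):
        # Key on module + qualified name so a reloaded module's fresh
        # function objects still match their earlier registration.
        key = (is_supported.__module__, is_supported.__qualname__)
        if key in cls._registered_keys:
            return  # already registered
        cls._registered_keys.add(key)
        cls._decoder_layer_support.append((is_supported, discoverer))

def is_demo_model(model):
    return hasattr(model, "layers")

def get_demo_layers(model):
    return model.layers

CollectorSketch.register_decoder_layer_support(is_demo_model, get_demo_layers)
CollectorSketch.register_decoder_layer_support(is_demo_model, get_demo_layers)
print(len(CollectorSketch._decoder_layer_support))  # → 1
```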
Minor
- License header on the new file says `2024`; it should likely be `2025` for a new file.
- `copy.deepcopy(meta[1])` in `_zeros_from_meta` for the `"other"` case could be expensive for complex non-tensor outputs. A comment noting this is expected to be lightweight values (e.g. `None`) would help.
CI
linux and unit-pr-required-check are failing — should be investigated before merge.
What's Good
- Clean state machine design with skip/run/capture/original modes
- Proper `try/finally` cleanup in both `sequential_calibrate` and `_patch_all_layers`
- Output metadata approach (`_extract_output_meta` / `_zeros_from_meta`) correctly handles tuple/list/tensor outputs for skip layers
- Excellent test coverage (~470 new test lines) covering mode transitions, heterogeneous layers, tuple unpacking, inter-layer norm, error paths, and cleanup
- Extensible plugin-based decoder discovery via `register_decoder_layer_support`
- Backward-compatible re-export of `LayerActivationCollector` from `utils.py`
```python
LayerActivationCollector.register_decoder_layer_support(
    is_nemotron_h_model, get_nemotron_h_decoder_layers
)

LayerActivationCollector.register_decoder_layer_support(
    is_homogenous_hf_model, get_homogeneous_hf_decoder_layers
)
```
There was a problem hiding this comment.
```python
output_meta: tuple | None = None


class LayerActivationCollector:
```
qq, will this work in TP/EP/PP scenarios, which we usually use for mcore?
It might need additional sync logic to work for these scenarios.
🧹 Nitpick comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
1209-1216: Consider extending pattern coverage for other model architectures.

The current implementation only checks `model.model.layers`. As noted in a previous review, this doesn't cover patterns like `model.decoder.layers` (Megatron) or other architectures. While not blocking, consider whether additional patterns should be checked here or in a follow-up PR to improve model coverage.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 1209 - 1216, get_homogeneous_hf_decoder_layers currently only returns model.model.layers; extend its checks to cover other common HF decoder patterns (e.g., model.decoder.layers for Megatron-style models, model.model.decoder.layers, and model.transformer.layers) so more architectures are detected. Inside get_homogeneous_hf_decoder_layers, add checks in order (e.g., hasattr(model, "decoder") and hasattr(model.decoder, "layers"), then hasattr(model, "model") and hasattr(model.model, "decoder") and hasattr(model.model.decoder, "layers"), then hasattr(model, "transformer") and hasattr(model.transformer, "layers")) and return the first matching nn.ModuleList; keep the early return None behavior if none match. Ensure you reference this exact function name when implementing the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 221748c7-9696-4123-9104-58da5f79313e
📒 Files selected for processing (6)
- modelopt/torch/quantization/activation_collector.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/utils.py
- tests/unit/torch/quantization/plugins/test_huggingface.py
- tests/unit/torch/quantization/test_sequential_calibrate.py
🚧 Files skipped from review as they are similar to previous changes (2)
- modelopt/torch/quantization/activation_collector.py
- tests/unit/torch/quantization/plugins/test_huggingface.py
@sugunav14 can we make modelopt/torch/quantization/utils a folder and move this file into it? We should preserve backward compatibility (so we should set up `__init__.py` properly).
Okay, I can do that!
```python
# ------------------------------------------------------------------
# Patched forward
# ------------------------------------------------------------------
```

Obvious comment, can we remove it?

Suggested change:

```diff
-# ------------------------------------------------------------------
-# Patched forward
-# ------------------------------------------------------------------
```
```python
        return tuple(LayerActivationCollector._zeros_from_meta(m) for m in meta[1])
    if tag == "list":
        return [LayerActivationCollector._zeros_from_meta(m) for m in meta[1]]
    return copy.deepcopy(meta[1])
```

nit:

```diff
-    return copy.deepcopy(meta[1])
+    return meta[1]
```
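For context, the `_zeros_from_meta` recursion being discussed can be sketched dependency-free. A plain list stands in for `torch.zeros`, and the `"other"` branch follows the nit by returning the value directly; the `(tag, payload)` tuples are assumed from the snippet above:

```python
def zeros_from_meta(meta):
    """Rebuild a dummy output from recorded (tag, payload) metadata.

    Tags follow the snippet above: a real implementation would return
    torch.zeros(*payload) for the "tensor" tag; a flat list of zeros
    stands in here to keep the sketch dependency-free.
    """
    tag, payload = meta
    if tag == "tensor":
        return [0.0] * payload[-1]  # stand-in for torch.zeros(*payload)
    if tag == "tuple":
        return tuple(zeros_from_meta(m) for m in payload)
    if tag == "list":
        return [zeros_from_meta(m) for m in payload]
    # "other": per the review nit, lightweight values (e.g. None) can be
    # returned directly instead of deep-copied.
    return payload

out = zeros_from_meta(("tuple", [("tensor", (2, 4)), ("other", None)]))
print(out)  # → ([0.0, 0.0, 0.0, 0.0], None)
```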
```python
# ------------------------------------------------------------------

@staticmethod
def _patched_forward(self, *args, **kwargs):
```

nit: we could define this in `_patch_all_layers`.
@sugunav14 have you tested an end-to-end flow with it? As a follow-up later, we can add a registry-based plugin for saving the model so far (to enable resume during calibration).
Yeah, I did a sanity check on the e2e generation with max sequential calibrate on Qwen3-8B and Nemotron-Nano-v3. The generated outputs look okay. I was thinking of doing the accuracy checks along with the GPTQ PR!
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Force-pushed from 604f2e1 to c18d109.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/torch/quantization/test_sequential_calibrate.py (1)
372-398: ⚠️ Potential issue | 🟠 Major
These tests lock in the wrong missing-inputs behavior.
One test treats an empty `forward_loop` as success, and the other codifies the late `AssertionError` after a layer reaches `"run"` with no cache. That makes it harder to move to the clearer fail-fast behavior when no replay inputs were captured.
Also applies to: 599-613
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/quantization/test_sequential_calibrate.py` around lines 372 - 398, The tests (notably test_seq_calib_empty_forward_loop) currently accept an empty forward_loop as success which prevents moving to a fail-fast behavior; update this and the related test (around the other occurrence) so they assert that sequential_calibrate(...) raises an AssertionError when forward_loop supplies no replay inputs. Concretely, change test_seq_calib_empty_forward_loop and the test at the other location to call sequential_calibrate with forward_loop=lambda m: None inside a pytest.raises(AssertionError) context (or equivalent) and remove the existing logic that treats zero-count replays as success; keep references to sequential_calibrate, calib_func, forward_loop, and the test function names to locate the code.
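The fail-fast behavior the review asks for could be expressed roughly as below. This is a self-contained sketch: `sequential_calibrate` here is a simplified stand-in, not the ModelOpt function, and the real signature in the PR may differ.

```python
# Sketch of the fail-fast behavior: an empty forward_loop should raise
# instead of being treated as a successful calibration.
def sequential_calibrate(model, calib_func, forward_loop):
    captured = []  # stands in for the collector's captured per-layer inputs
    forward_loop(model)  # an empty forward_loop captures nothing
    if not captured:
        raise RuntimeError("No calibration inputs were captured")
    calib_func(model, lambda m: None)


def test_seq_calib_empty_forward_loop():
    try:
        sequential_calibrate(object(), lambda *a: None, forward_loop=lambda m: None)
    except RuntimeError:
        return True  # the empty loop is now rejected explicitly
    return False
```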
♻️ Duplicate comments (2)
modelopt/torch/quantization/utils/activation_collector.py (2)
271-283: ⚠️ Potential issue | 🔴 Critical
Fail fast on invalid state transitions and empty capture results.
`_set_layer_states()` currently transitions into `run`/`skip` without verifying prerequisites, and `get_input_activations()` returns even when capture produced no inputs. This delays failure to later execution paths and makes debugging harder.
🔧 Proposed guardrails

diff --git a/modelopt/torch/quantization/utils/activation_collector.py b/modelopt/torch/quantization/utils/activation_collector.py
@@
         if layer_idx > 1:
             done = self._decoder_layers[layer_idx - 2]._seq_calib
+            if done.output_meta is None:
+                raise RuntimeError(
+                    f"Layer {layer_idx - 2} ({done.name}) cannot enter 'skip' without output metadata."
+                )
             done.mode = "skip"
@@
         if layer_idx > 0:
             prev = self._decoder_layers[layer_idx - 1]._seq_calib
+            if not prev.collected_inputs:
+                raise RuntimeError(
+                    f"Layer {layer_idx - 1} ({prev.name}) has no captured inputs to replay in 'run' mode."
+                )
             prev.mode = "run"
             prev.cached_inputs = deque(prev.collected_inputs)
             prev.collected_inputs = []
@@
         inputs = list(info.collected_inputs)
+        if not inputs:
+            raise RuntimeError(
+                f"Layer {layer_idx} ({info.name}) did not capture any inputs during forward_loop()."
+            )
         # After capture, set to original so calib_func can call the layer's

Also applies to: 335-340
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/utils/activation_collector.py` around lines 271 - 283, The state transition in ActivationCollector._set_layer_states (the block manipulating self._decoder_layers[...] ._seq_calib) must validate prerequisites before flipping modes: check that the target _seq_calib objects exist and that prev.collected_inputs is non-empty before setting prev.mode="run" and assigning prev.cached_inputs = deque(prev.collected_inputs); if collected_inputs is empty raise/return an explicit error instead of silently continuing. Likewise, when setting done.mode="skip" ensure done has required output_meta and clear cached_inputs only after verifying skip semantics. Also update get_input_activations to explicitly raise or return an error when collected_inputs is empty (avoid returning an empty activation), and apply the same guards to the analogous block around lines 335-340 that manipulates _seq_calib, prev, done, collected_inputs and cached_inputs.
52-52: ⚠️ Potential issue | 🟠 Major
Single `output_meta` is not batch-safe for `skip` replay.
`run` overwrites one metadata slot per batch, then `skip` reuses that one shape/type for all later batches. With variable batch shapes (e.g., the last batch), dummy outputs can be wrong.
🔧 Proposed direction (per-batch metadata queue)

diff --git a/modelopt/torch/quantization/utils/activation_collector.py b/modelopt/torch/quantization/utils/activation_collector.py
@@
 @dataclass
 class _LayerCalibState:
@@
-    output_meta: tuple | None = None
+    output_meta_queue: deque = field(default_factory=deque)
@@
         if info.mode == "skip":
-            if info.output_meta is None:
+            if not info.output_meta_queue:
                 raise RuntimeError(
-                    f"Layer {info.name} is in 'skip' mode but has no output_meta. "
+                    f"Layer {info.name} is in 'skip' mode but has no output metadata queue. "
                     "This indicates a state-machine bug: the layer should have run "
                     "in 'run' mode (which sets output_meta) before transitioning to 'skip'."
                 )
-            return LayerActivationCollector._zeros_from_meta(info.output_meta)
+            meta = info.output_meta_queue[0]
+            # rotate to preserve per-batch ordering across repeated forward loops
+            info.output_meta_queue.rotate(-1)
+            return LayerActivationCollector._zeros_from_meta(meta)
@@
         if info.mode == "run":
@@
         output = self._original_forward(*real_args, **real_kwargs)
-        info.output_meta = LayerActivationCollector._extract_output_meta(output)
+        info.output_meta_queue.append(LayerActivationCollector._extract_output_meta(output))
         return output

Also applies to: 173-180, 188-188
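The `rotate(-1)` trick in the proposed queue can be checked in isolation: reading the front entry and then rotating yields the per-batch metadata in capture order on every replay pass. This is a standalone illustration, not the PR's code.

```python
from collections import deque

# One metadata entry per captured batch, in capture order.
meta_queue = deque(["meta_batch0", "meta_batch1", "meta_batch2"])

served = []
for _ in range(6):  # two full replay passes over three batches
    served.append(meta_queue[0])  # read the front entry
    meta_queue.rotate(-1)         # move it to the back, preserving cyclic order

# Each pass sees the batches in the original capture order.
```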
🧹 Nitpick comments (3)
tests/unit/torch/quantization/plugins/test_huggingface.py (1)
210-228: Add a Nemotron-H regression alongside the generic HF path.
The production change adds a dedicated `is_nemotron_h_model`/`get_nemotron_h_decoder_layers` branch plus an ordering constraint, but these tests only exercise the generic `model.model.layers` path. A tiny fake model with `backbone.layers[0].block_type` would keep that new branch from regressing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/quantization/plugins/test_huggingface.py` around lines 210 - 228, The tests currently only exercise the generic HF path and miss the new Nemotron-H branch and ordering; add a regression test alongside the existing HF tests that constructs a tiny fake model exposing backbone.layers[0].block_type (or use the provided tiny fake model variant) and asserts that LayerActivationCollector._decoder_layer_support contains the (is_nemotron_h_model, get_nemotron_h_decoder_layers) entry in the expected order relative to (is_homogeneous_hf_model, get_homogeneous_hf_decoder_layers), and verify LayerActivationCollector.get_decoder_layers(model) returns get_nemotron_h_decoder_layers(model) for that model to prevent regressions.modelopt/torch/quantization/utils/__init__.py (1)
18-34: Use explicit re-exports here.
`from .core_utils import *` now pulls every non-private symbol from `core_utils.py` into the package namespace, but `__all__` still omits `LayerActivationCollector`. That leaves the public surface both wider and less consistent than intended; please switch this file to explicit imports (or restore a constrained `core_utils.__all__`) and add `LayerActivationCollector` to `__all__`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/utils/__init__.py` around lines 18 - 34, Replace the wildcard import from core_utils with explicit imports for the symbols listed in __all__ (EXPORT_MODE, convert_quantization_axis_to_reduce_axis, export_torch_mode, is_quantized, is_quantized_column_parallel_linear, is_quantized_linear, is_quantized_row_parallel_linear, reduce_amax, reduce_sum, replace_function, update_quant_cfg_with_kv_cache_quant, weight_attr_names) by importing them directly from .core_utils, import LayerActivationCollector from .activation_collector, and add "LayerActivationCollector" to the __all__ list so the package public surface is explicit and consistent.tests/unit/torch/quantization/test_sequential_calibrate.py (1)
696-745: This doesn't verify what layer 1 actually replayed.
The final assertion only proves that layer 0's weights were doubled by the end of calibration. The test would still pass if replay into layer 1 used stale activations, so please capture layer 1's actual inputs during `weight_doubling_calib` and assert those values directly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/quantization/test_sequential_calibrate.py` around lines 696 - 745, The test currently only asserts layer 0's final weights changed; instead capture what was actually replayed into layer 1 during calibration and assert it equals the doubled activations. Add a closure list (e.g. captured_inputs = []) and modify weight_doubling_calib to, when the `layer` is `model.layers[1]` (or compare identity), register a temporary forward hook on that layer that appends input[0].detach().clone() to captured_inputs, perform the existing weight doubling and call layer_forward_loop(layer), then remove the hook; after sequential_calibrate assert captured_inputs is non-empty and torch.allclose(captured_inputs[0], activations_before_weight_update * 2.0). This uses the existing symbols test_run_layer_reflects_weight_updates, weight_doubling_calib, sequential_calibrate, model.layers[0], and model.layers[1].
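Capturing a layer's real inputs, as suggested above, uses the standard `register_forward_hook` API; a minimal sketch (the layer and shapes here are illustrative stand-ins for `model.layers[1]`):

```python
import torch

layer = torch.nn.Linear(4, 4)
captured_inputs = []

# Forward hooks receive (module, inputs, output); clone the first
# positional input so later weight updates cannot mutate the record.
handle = layer.register_forward_hook(
    lambda mod, inputs, output: captured_inputs.append(inputs[0].detach().clone())
)

x = torch.randn(2, 4)
layer(x)
handle.remove()  # always detach the hook once the capture is done
```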
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1864-1872: The loop calls calib_func even when
input_getter.get_input_activations(...) returns an empty list, causing silent
no-op calibration; before constructing _layer_forward_loop and calling
calib_func for each layer (symbols: transformer_layers,
input_getter.get_input_activations, forward_loop, layer_inputs,
_layer_forward_loop, calib_func), check whether layer_inputs is empty and if so
reject it (either raise an exception or skip with a clear error/warning and do
not call calib_func) so layers with no captured batches are not treated as
successfully calibrated.
In `@tests/unit/torch/quantization/test_sequential_calibrate.py`:
- Around line 183-185: Wrap the sequence that patches/unpatches in a try/finally
so the model is always unpatched if an exception occurs: call
collector._patch_all_layers() before the try, invoke
collector.get_input_activations(model.layers[0], forward_loop) (and any other
test calls) inside the try, and ensure collector._unpatch_all_layers() is called
in the finally block; apply the same change for the other occurrence that uses
the same three calls to avoid leaving the model patched.
---
Outside diff comments:
In `@tests/unit/torch/quantization/test_sequential_calibrate.py`:
- Around line 372-398: The tests (notably test_seq_calib_empty_forward_loop)
currently accept an empty forward_loop as success which prevents moving to a
fail-fast behavior; update this and the related test (around the other
occurrence) so they assert that sequential_calibrate(...) raises an
AssertionError when forward_loop supplies no replay inputs. Concretely, change
test_seq_calib_empty_forward_loop and the test at the other location to call
sequential_calibrate with forward_loop=lambda m: None inside a
pytest.raises(AssertionError) context (or equivalent) and remove the existing
logic that treats zero-count replays as success; keep references to
sequential_calibrate, calib_func, forward_loop, and the test function names to
locate the code.
---
Duplicate comments:
In `@modelopt/torch/quantization/utils/activation_collector.py`:
- Around line 271-283: The state transition in
ActivationCollector._set_layer_states (the block manipulating
self._decoder_layers[...] ._seq_calib) must validate prerequisites before
flipping modes: check that the target _seq_calib objects exist and that
prev.collected_inputs is non-empty before setting prev.mode="run" and assigning
prev.cached_inputs = deque(prev.collected_inputs); if collected_inputs is empty
raise/return an explicit error instead of silently continuing. Likewise, when
setting done.mode="skip" ensure done has required output_meta and clear
cached_inputs only after verifying skip semantics. Also update
get_input_activations to explicitly raise or return an error when
collected_inputs is empty (avoid returning an empty activation), and apply the
same guards to the analogous block around lines 335-340 that manipulates
_seq_calib, prev, done, collected_inputs and cached_inputs.
---
Nitpick comments:
In `@modelopt/torch/quantization/utils/__init__.py`:
- Around line 18-34: Replace the wildcard import from core_utils with explicit
imports for the symbols listed in __all__ (EXPORT_MODE,
convert_quantization_axis_to_reduce_axis, export_torch_mode, is_quantized,
is_quantized_column_parallel_linear, is_quantized_linear,
is_quantized_row_parallel_linear, reduce_amax, reduce_sum, replace_function,
update_quant_cfg_with_kv_cache_quant, weight_attr_names) by importing them
directly from .core_utils, import LayerActivationCollector from
.activation_collector, and add "LayerActivationCollector" to the __all__ list so
the package public surface is explicit and consistent.
In `@tests/unit/torch/quantization/plugins/test_huggingface.py`:
- Around line 210-228: The tests currently only exercise the generic HF path and
miss the new Nemotron-H branch and ordering; add a regression test alongside the
existing HF tests that constructs a tiny fake model exposing
backbone.layers[0].block_type (or use the provided tiny fake model variant) and
asserts that LayerActivationCollector._decoder_layer_support contains the
(is_nemotron_h_model, get_nemotron_h_decoder_layers) entry in the expected order
relative to (is_homogeneous_hf_model, get_homogeneous_hf_decoder_layers), and
verify LayerActivationCollector.get_decoder_layers(model) returns
get_nemotron_h_decoder_layers(model) for that model to prevent regressions.
In `@tests/unit/torch/quantization/test_sequential_calibrate.py`:
- Around line 696-745: The test currently only asserts layer 0's final weights
changed; instead capture what was actually replayed into layer 1 during
calibration and assert it equals the doubled activations. Add a closure list
(e.g. captured_inputs = []) and modify weight_doubling_calib to, when the
`layer` is `model.layers[1]` (or compare identity), register a temporary forward
hook on that layer that appends input[0].detach().clone() to captured_inputs,
perform the existing weight doubling and call layer_forward_loop(layer), then
remove the hook; after sequential_calibrate assert captured_inputs is non-empty
and torch.allclose(captured_inputs[0], activations_before_weight_update * 2.0).
This uses the existing symbols test_run_layer_reflects_weight_updates,
weight_doubling_calib, sequential_calibrate, model.layers[0], and
model.layers[1].
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d6398879-2fd1-4cd7-98e8-2a6fc14aacbf
📒 Files selected for processing (9)
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/utils/__init__.py
- modelopt/torch/quantization/utils/activation_collector.py
- modelopt/torch/quantization/utils/core_utils.py
- tests/unit/torch/quantization/plugins/test_huggingface.py
- tests/unit/torch/quantization/test_calib.py
- tests/unit/torch/quantization/test_sequential_calibrate.py
- tests/unit/torch/quantization/test_utils.py
for layer_idx, layer in enumerate(transformer_layers):
    print_rank_0(f"Calibrating layer {layer_idx}")
    layer_inputs = input_getter.get_input_activations(layer, forward_loop)

    # Define a forward loop for the current layer
    def _layer_forward_loop(m, _inputs=layer_inputs):
        for args, kwargs_input in _inputs:
            m(*args, **kwargs_input)

    # Call calibration function
    calib_func(layer, _layer_forward_loop, **calib_kwargs)
    del layer_inputs
    torch.cuda.empty_cache()
Reject empty captures before invoking calib_func.
If forward_loop yields no calibration batches, layer_inputs is empty and we silently run calib_func with a no-op replay loop. That makes sequential calibration look successful while leaving the layer uncalibrated.
Suggested guard
for layer_idx, layer in enumerate(transformer_layers):
print_rank_0(f"Calibrating layer {layer_idx}")
layer_inputs = input_getter.get_input_activations(layer, forward_loop)
+ if not layer_inputs:
+ raise RuntimeError(
+ f"No calibration inputs were captured for layer {layer_idx}. "
+ "Please ensure `forward_loop` executes at least one forward pass."
+ )
def _layer_forward_loop(m, _inputs=layer_inputs):
for args, kwargs_input in _inputs:
m(*args, **kwargs_input)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/model_calib.py` around lines 1864 - 1872, The
loop calls calib_func even when input_getter.get_input_activations(...) returns
an empty list, causing silent no-op calibration; before constructing
_layer_forward_loop and calling calib_func for each layer (symbols:
transformer_layers, input_getter.get_input_activations, forward_loop,
layer_inputs, _layer_forward_loop, calib_func), check whether layer_inputs is
empty and if so reject it (either raise an exception or skip with a clear
error/warning and do not call calib_func) so layers with no captured batches are
not treated as successfully calibrated.
Actionable comments posted: 1
♻️ Duplicate comments (4)
modelopt/torch/quantization/model_calib.py (1)
1862-1870: ⚠️ Potential issue | 🟠 Major
Reject empty captures before invoking `calib_func`.
If `forward_loop` produces no batches, `layer_inputs` is empty and this path silently treats the layer as calibrated while replaying nothing. Raise here instead of constructing a no-op `_layer_forward_loop`. `test_seq_calib_empty_forward_loop()` should then flip to expecting that failure.
Suggested guard

 for layer_idx, layer in enumerate(transformer_layers):
     print_rank_0(f"Calibrating layer {layer_idx}")
     layer_inputs = input_getter.get_input_activations(layer, forward_loop)
+    if not layer_inputs:
+        raise RuntimeError(
+            f"No calibration inputs were captured for layer {layer_idx}. "
+            "Please ensure `forward_loop` executes at least one forward pass."
+        )
     def _layer_forward_loop(m, _inputs=layer_inputs):
         for args, kwargs_input in _inputs:
             m(*args, **kwargs_input)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 1862 - 1870, The loop over transformer_layers currently proceeds even if input_getter.get_input_activations(transformer_layer, forward_loop) returns an empty list, causing calib_func to be called with a no-op _layer_forward_loop; change this by checking layer_inputs after calling input_getter.get_input_activations and raise a clear exception (e.g., ValueError) when it's empty (reference symbols: transformer_layers, input_getter.get_input_activations, layer_inputs, _layer_forward_loop, calib_func, forward_loop) so the caller/test (test_seq_calib_empty_forward_loop) can expect/fail accordingly.tests/unit/torch/quantization/test_sequential_calibrate.py (1)
183-185: ⚠️ Potential issue | 🟡 Minor
Always unpatch in these two tests.
These paths still call `_patch_all_layers()`/`_unpatch_all_layers()` without a `try`/`finally`, so an unexpected failure in `get_input_activations()` leaves patched forwards behind and can cascade into later cases.
Also applies to: 626-630
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/quantization/test_sequential_calibrate.py` around lines 183 - 185, The tests call collector._patch_all_layers() then collector.get_input_activations(...) and finally collector._unpatch_all_layers() but lack a try/finally so failures leave patches active; update the test(s) that use _patch_all_layers/_unpatch_all_layers (including the block invoking get_input_activations on model.layers[0] and the duplicated block around lines 626-630) to wrap the call to get_input_activations inside a try/finally where collector._unpatch_all_layers() is invoked in the finally clause to guarantee cleanup even on exceptions.modelopt/torch/quantization/utils/activation_collector.py (2)
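The cleanup pattern being requested is the usual patch-inside-`try`/`finally`; a sketch with a dummy collector standing in for `LayerActivationCollector`:

```python
# Dummy stand-in for LayerActivationCollector, used only to show that
# try/finally guarantees unpatching even when capture raises.
class DummyCollector:
    def __init__(self):
        self.patched = False

    def _patch_all_layers(self):
        self.patched = True

    def _unpatch_all_layers(self):
        self.patched = False

    def get_input_activations(self, layer, forward_loop):
        raise RuntimeError("forward failed mid-capture")


collector = DummyCollector()
collector._patch_all_layers()
try:
    try:
        collector.get_input_activations(None, None)
    finally:
        collector._unpatch_all_layers()  # runs even though capture raised
except RuntimeError:
    pass  # in the real test, pytest would surface this failure
```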
50-52: ⚠️ Potential issue | 🟠 Major
`output_meta` needs to be tracked per replayed batch.
`run` overwrites a single `output_meta`, and `skip` reuses that same structure for every later batch. With variable sequence lengths or a short last batch, skipped layers will synthesize dummy outputs with the wrong shape/structure on subsequent passes. Keep per-batch metadata aligned with the captured inputs and replay it in order instead of storing one shared slot per layer.
Also applies to: 173-180, 182-189
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/utils/activation_collector.py` around lines 50 - 52, The collector currently stores a single output_meta that is overwritten and reused across batches; change output_meta from a single slot to a per-batch structure (e.g., a list/deque) aligned with cached_inputs/collected_inputs so each captured input has its own output metadata; update the ActivationCollector fields (cached_inputs, collected_inputs, output_meta) and modify the run and skip methods to append/read the corresponding per-batch output_meta entry in the same order as inputs (use the same indexing/pop order as cached_inputs) so replayed batches use the exact metadata for that batch instead of a shared slot.
271-282: ⚠️ Potential issue | 🟠 Major
Validate state transitions before flipping modes.
If a layer never captured inputs or never produced replay metadata, `_set_layer_states()` still moves it into `run`/`skip` and the failure only appears later inside `_patched_forward` after some state has already been mutated. Reject empty `prev.collected_inputs` and missing `done.output_meta` here so the error points at the broken transition. `test_mode_transitions_across_calibration_steps()` will need to seed realistic capture state once this guard exists.
Suggested guardrails

 if layer_idx > 1:
     done = self._decoder_layers[layer_idx - 2]._seq_calib
+    if done.output_meta is None:
+        raise RuntimeError(
+            f"Layer {done.name} cannot enter 'skip' before a successful replay."
+        )
     done.mode = "skip"
     # output_meta is intentionally kept: skip mode needs it to produce
     # correctly shaped zero-filled outputs for the parent forward.
     done.cached_inputs.clear()
 if layer_idx > 0:
     prev = self._decoder_layers[layer_idx - 1]._seq_calib
+    if not prev.collected_inputs:
+        raise RuntimeError(
+            f"Layer {prev.name} has no captured inputs to replay."
+        )
     prev.mode = "run"
     prev.cached_inputs = deque(prev.collected_inputs)
     prev.collected_inputs = []

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/utils/activation_collector.py` around lines 271 - 282, In _set_layer_states (the block that flips _seq_calib.mode for self._decoder_layers), validate state before mutating: when setting done = self._decoder_layers[layer_idx - 2]._seq_calib ensure done.output_meta is present (raise a clear error/ValueError if missing) and when preparing prev = self._decoder_layers[layer_idx - 1]._seq_calib ensure prev.collected_inputs is non-empty (raise a clear error/ValueError if empty) instead of proceeding to set modes and swap cached_inputs; this makes failures point to the bad transition rather than surfacing later in _patched_forward.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/utils/__init__.py`:
- Around line 18-34: Add explicit imports for every name listed in __all__
(e.g., EXPORT_MODE, convert_quantization_axis_to_reduce_axis, export_torch_mode,
is_quantized, is_quantized_column_parallel_linear, is_quantized_linear,
is_quantized_row_parallel_linear, reduce_amax, reduce_sum, replace_function,
update_quant_cfg_with_kv_cache_quant, weight_attr_names) at the top of
modelopt/torch/quantization/utils/__init__.py (you may keep the existing from
.core_utils import * for runtime surface), and include LayerActivationCollector
in the __all__ list so mypy and Ruff can resolve the exports; ensure the
explicit imports reference the correct module paths (e.g., from
.activation_collector import LayerActivationCollector) and update the __all__
array to contain all those symbol names.
---
Duplicate comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1862-1870: The loop over transformer_layers currently proceeds
even if input_getter.get_input_activations(transformer_layer, forward_loop)
returns an empty list, causing calib_func to be called with a no-op
_layer_forward_loop; change this by checking layer_inputs after calling
input_getter.get_input_activations and raise a clear exception (e.g.,
ValueError) when it's empty (reference symbols: transformer_layers,
input_getter.get_input_activations, layer_inputs, _layer_forward_loop,
calib_func, forward_loop) so the caller/test (test_seq_calib_empty_forward_loop)
can expect/fail accordingly.
In `@modelopt/torch/quantization/utils/activation_collector.py`:
- Around line 50-52: The collector currently stores a single output_meta that is
overwritten and reused across batches; change output_meta from a single slot to
a per-batch structure (e.g., a list/deque) aligned with
cached_inputs/collected_inputs so each captured input has its own output
metadata; update the ActivationCollector fields (cached_inputs,
collected_inputs, output_meta) and modify the run and skip methods to
append/read the corresponding per-batch output_meta entry in the same order as
inputs (use the same indexing/pop order as cached_inputs) so replayed batches
use the exact metadata for that batch instead of a shared slot.
- Around line 271-282: In _set_layer_states (the block that flips
_seq_calib.mode for self._decoder_layers), validate state before mutating: when
setting done = self._decoder_layers[layer_idx - 2]._seq_calib ensure
done.output_meta is present (raise a clear error/ValueError if missing) and when
preparing prev = self._decoder_layers[layer_idx - 1]._seq_calib ensure
prev.collected_inputs is non-empty (raise a clear error/ValueError if empty)
instead of proceeding to set modes and swap cached_inputs; this makes failures
point to the bad transition rather than surfacing later in _patched_forward.
In `@tests/unit/torch/quantization/test_sequential_calibrate.py`:
- Around line 183-185: The tests call collector._patch_all_layers() then
collector.get_input_activations(...) and finally collector._unpatch_all_layers()
but lack a try/finally so failures leave patches active; update the test(s) that
use _patch_all_layers/_unpatch_all_layers (including the block invoking
get_input_activations on model.layers[0] and the duplicated block around lines
626-630) to wrap the call to get_input_activations inside a try/finally where
collector._unpatch_all_layers() is invoked in the finally clause to guarantee
cleanup even on exceptions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6423b2c7-8f29-433a-85a5-59b87dfcb39f
📒 Files selected for processing (10)
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/utils/__init__.py
- modelopt/torch/quantization/utils/activation_collector.py
- modelopt/torch/quantization/utils/core_utils.py
- modelopt/torch/utils/network.py
- tests/unit/torch/quantization/plugins/test_huggingface.py
- tests/unit/torch/quantization/test_calib.py
- tests/unit/torch/quantization/test_sequential_calibrate.py
- tests/unit/torch/quantization/test_utils.py
💤 Files with no reviewable changes (1)
- modelopt/torch/utils/network.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/torch/quantization/test_calib.py
from .core_utils import *  # noqa: F401,F403
from .activation_collector import LayerActivationCollector  # noqa: F401

__all__ = [
    "EXPORT_MODE",
    "convert_quantization_axis_to_reduce_axis",
    "export_torch_mode",
    "is_quantized",
    "is_quantized_column_parallel_linear",
    "is_quantized_linear",
    "is_quantized_row_parallel_linear",
    "reduce_amax",
    "reduce_sum",
    "replace_function",
    "update_quant_cfg_with_kv_cache_quant",
    "weight_attr_names",
]
Make the package exports explicit here.
This hunk is the source of both CI failures: Ruff cannot resolve the names referenced from __all__ through the star import, and mypy does not see LayerActivationCollector as exported because it is missing from __all__. Keep the star import if you still need the broader runtime surface, but explicitly import the symbols referenced in __all__ and add LayerActivationCollector there.
One minimal way to satisfy both tools without shrinking the current runtime surface
from .core_utils import * # noqa: F401,F403
-from .activation_collector import LayerActivationCollector # noqa: F401
+from .core_utils import (
+ EXPORT_MODE,
+ convert_quantization_axis_to_reduce_axis,
+ export_torch_mode,
+ is_quantized,
+ is_quantized_column_parallel_linear,
+ is_quantized_linear,
+ is_quantized_row_parallel_linear,
+ reduce_amax,
+ reduce_sum,
+ replace_function,
+ update_quant_cfg_with_kv_cache_quant,
+ weight_attr_names,
+)
+from .activation_collector import LayerActivationCollector
__all__ = [
"EXPORT_MODE",
"convert_quantization_axis_to_reduce_axis",
"export_torch_mode",
@@
"replace_function",
"update_quant_cfg_with_kv_cache_quant",
"weight_attr_names",
+ "LayerActivationCollector",
As per coding guidelines for `**/*.py`: use the Ruff linter and the mypy type checker for Python code, both configured in `pyproject.toml`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/utils/__init__.py` around lines 18 - 34, Add
explicit imports for every name listed in __all__ (e.g., EXPORT_MODE,
convert_quantization_axis_to_reduce_axis, export_torch_mode, is_quantized,
is_quantized_column_parallel_linear, is_quantized_linear,
is_quantized_row_parallel_linear, reduce_amax, reduce_sum, replace_function,
update_quant_cfg_with_kv_cache_quant, weight_attr_names) at the top of
modelopt/torch/quantization/utils/__init__.py (you may keep the existing from
.core_utils import * for runtime surface), and include LayerActivationCollector
in the __all__ list so mypy and Ruff can resolve the exports; ensure the
explicit imports reference the correct module paths (e.g., from
.activation_collector import LayerActivationCollector) and update the __all__
array to contain all those symbol names.
What does this PR do?
Type of change: New feature
The current sequential calibration support has O(N^2) complexity for collecting the updated activations for a decoder layer. An earlier fix adopted a modular/plugin-based approach that used hooks to capture the updated activations by running forward on the previous decoder layer with its cached inputs. However, this runs into an issue with nested modules: logic in the parent module may need to be replicated in the lower-level modules to ensure equivalence. For example, in the Nemotron model, the parent module NemotronHModel creates and selects the appropriate mask based on the decoder layer type (mamba vs. attention).
This PR implements a more generic solution for sequential calibration by collecting activations through the full model forward, ensuring that all parent-module logic is preserved. We use a "state" attribute on the modules to indicate whether to recompute or skip a layer during the model forward, which avoids redundant computation when gathering updated activations.
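The capture/run/skip mechanism described above can be sketched as a toy state machine. This is an illustration only: the real implementation patches each decoder layer's forward and handles args/kwargs and output metadata, which the sketch omits.

```python
from collections import deque


class EarlyStop(Exception):
    """Raised after capture so deeper layers are not executed."""


class ToyLayer:
    def __init__(self):
        self.mode = "capture"   # capture -> run -> skip over the sweep
        self.collected = []     # inputs recorded during capture
        self.cached = deque()   # inputs replayed during run

    def forward(self, x):
        if self.mode == "capture":
            self.collected.append(x)
            raise EarlyStop
        if self.mode == "run":
            x = self.cached.popleft()  # replay the cached input
        return x + 1                   # stand-in for the real layer math


l0, l1 = ToyLayer(), ToyLayer()


def model_forward(x):
    for layer in (l0, l1):
        x = layer.forward(x)
    return x


def capture(target, batches):
    for x in batches:
        try:
            model_forward(x)
        except EarlyStop:
            pass
    return list(target.collected)


# Sweep step 1: full model forwards capture layer 0's inputs.
inputs0 = capture(l0, [10, 20])
# ... calibrate layer 0 here using inputs0 ...

# Sweep step 2: layer 0 replays its cached inputs ("run") so layer 1
# captures activations that reflect layer 0's updated computation.
l0.mode, l0.cached = "run", deque(l0.collected)
inputs1 = capture(l1, [10, 20])
```

Because the capture always happens through the full model forward, any parent-module logic (such as mask selection per decoder-layer type) runs unchanged.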
The overall flow is as follows
Usage
# Add a code snippet demonstrating how to use thisTesting
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).
Additional Information
Summary by CodeRabbit
New Features
Improvements
Tests