Conversation
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds pack-quantized model loading support by temporarily patching CompressedLinear during initialization, post-load unpacking/fixing of compressed/BF16/expert weights from safetensors, and runtime on-demand decompression/unpacking in quantized linear implementations.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as get_model
    participant Detector as Pack-quantized Detector
    participant Patcher as CompressedLinear Patch
    participant Loader as Model Loader
    participant Unpacker as Weight Unpacker
    User->>Detector: inspect config
    Detector->>Patcher: request patch if pack-quantized
    Patcher->>Loader: patched CompressedLinear used during init
    Loader->>Loader: initialize layers (dummy/missing weights allowed)
    Loader->>Unpacker: post-load call to unpack weights from safetensors
    Unpacker->>Unpacker: restore BF16 weights, fix expert metadata
    Unpacker->>Patcher: restore original CompressedLinear behavior
    Unpacker->>User: model ready
```
```mermaid
sequenceDiagram
    participant Forward as _QuantCompressedLinear.forward
    participant Check as weight_packed check
    participant Builder as _build_compressed_data
    participant Decompressor as compressor.decompress_weight
    participant Replace as unpack_weight path
    Forward->>Check: is quantization_status COMPRESSED and weight_packed int32?
    alt packed int32
        Check->>Builder: gather weight_packed, scale, shape, zero_point, scheme
        Builder->>Decompressor: provide compressed_data & quant_args
        Decompressor->>Forward: return decompressed weight (one-time log)
        Forward->>Forward: use decompressed weight
        Forward->>Replace: later unpack replaces placeholder with nn.Parameter and cleans attrs
    else non-packed / BF16
        Check->>Forward: use weight_packed directly (no decompression)
    end
```
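The "pack-quantized" format referenced throughout this review stores INT4 weights eight-to-an-int32. A minimal pure-Python sketch of that nibble packing (illustrative only; the real compressed-tensors packer may use a different bit order, signedness, or packing dimension):

```python
def pack_int4(vals):
    """Pack eight unsigned 4-bit values (0..15) into one 32-bit word.

    Illustrative only: the actual compressed-tensors bit layout
    is an assumption here, not a reference implementation.
    """
    assert len(vals) == 8 and all(0 <= v <= 15 for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= v << (4 * i)  # value i occupies bits [4*i, 4*i + 4)
    return word


def unpack_int4(word):
    """Inverse of pack_int4: recover the eight 4-bit values."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


nibbles = [3, 15, 0, 7, 1, 9, 4, 12]
packed = pack_int4(nibbles)
assert unpack_int4(packed) == nibbles
```

This is also why checking `dtype == torch.int32` is a reliable packed-weight signal in the forward path: a dense BF16 layer that merely reuses the `weight_packed` name can never carry that dtype.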
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main     #820   +/-   ##
=======================================
  Coverage   70.11%   70.11%
=======================================
  Files         221      221
  Lines       25459    25459
=======================================
  Hits        17851    17851
  Misses       7608     7608
```

☔ View full report in Codecov by Sentry.
cjluo-nv
left a comment
qq: if you just load Kimi K2.5 using HF and do a generation call (not using ModelOpt), were you able to do it?
| return dtype | ||
| def _patch_compressed_linear_init(): |
Can it be a transformers version issue? I was able to load kimi k2 thinking int4 without an issue. Is this specific to kimi k2.5?
| print("Patched CompressedLinear for transformers compatibility") | ||
| def _unpack_compressed_linear_weights(model, ckpt_path=None): |
we do not need it. We should be able to unpack on the fly with the logic in the quantization plugins.
We need this function as Kimi-K2.5 has BF16 layers (vision, lm_head) alongside compressed INT4 expert layers.
examples/llm_ptq/example_utils.py
Outdated
| ): | ||
| torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16) | ||
| elif has_pack_quantized_config(hf_config): | ||
| # Patch CompressedLinear before loading to handle missing weight attribute |
There was a problem hiding this comment.
I don't think you need this
| if self.quantization_status == QuantizationStatus.COMPRESSED: | ||
| weight_data = self.compressor.decompress_module(self) | ||
| # Check if we should use decompress_module or manual decompress_weight |
is this specific to kimi k2.5?
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Signed-off-by: Zhiyu <zhiyuc@nvidia.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Replace print() with logging, extract duplicated compressed_data builder into _build_compressed_data() helper, fix formatting and remove stale comments. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Pull request overview
Adds support for running PTQ flows on Kimi-K2.5 HuggingFace checkpoints that use compressed-tensors “pack-quantized” weights, by improving CompressedLinear handling during load/inference and adding example-side workarounds for transformers initialization.
Changes:
- Update `_QuantCompressedLinear` to support on-the-fly decompression via `decompress_weight()` and improved compressed metadata handling.
- Add example utilities to monkeypatch `CompressedLinear` during `from_pretrained()` and to (attempt to) restore mixed BF16/compressed weights post-load.
- Extend pack-quantized config detection to handle nested `text_config.quantization_config` (for multi-modal configs).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `modelopt/torch/quantization/plugins/huggingface.py` | Adds on-the-fly decompression path, a new compressed metadata builder, and updated `unpack_weight()` logic for `CompressedLinear`. |
| `examples/llm_ptq/example_utils.py` | Adds `CompressedLinear` monkeypatching during model load and post-load restoration logic; updates pack-quant config detection and load branch behavior. |
Comments suppressed due to low confidence (1)
modelopt/torch/quantization/plugins/huggingface.py:934
- New `_QuantCompressedLinear` behavior (on-the-fly `decompress_weight()` in `forward()` plus the new `unpack_weight()` parameter/buffer cleanup logic) doesn't appear to be covered by the existing unit tests for `modelopt.torch.quantization.plugins.huggingface`. Adding a focused test (guarded with `pytest.importorskip("compressed_tensors")` if needed) would help prevent regressions for pack-quantized models.
```python
def forward(self, input: Tensor) -> Tensor:
    from compressed_tensors.quantization import QuantizationStatus

    if self.quantization_status == QuantizationStatus.COMPRESSED:
        # Real packed weights are int32. If it's float, it's not actually compressed.
        if self.weight_packed.dtype == torch.int32:
            compressed_data, quant_args = self._build_compressed_data()
            if not hasattr(self, "_logged_on_the_fly"):
                logger.debug("On-the-fly decompression for %s", self.__class__.__name__)
                self._logged_on_the_fly = True
            weight_data = self.compressor.decompress_weight(
                compressed_data=compressed_data,
                quantization_args=quant_args,
            )
        else:
            weight_data = self.weight_packed
    else:
        weight_data = self.weight
    return linear(self.input_quantizer(input), self.weight_quantizer(weight_data), self.bias)
```
| model = AutoModelForCausalLM.from_pretrained( | ||
| ckpt_path, | ||
| device_map="auto", | ||
| trust_remote_code=trust_remote_code, | ||
| torch_dtype=torch_dtype, | ||
| torch_dtype="auto", | ||
| ) |
| if ckpt_path is None: | ||
| ckpt_path = getattr(model.config, "_name_or_path", None) | ||
| if not ckpt_path: | ||
| return | ||
| from safetensors import safe_open | ||
| # Load non-expert weights and metadata from safetensors | ||
| checkpoint_weights = {} | ||
| index_path = os.path.join(ckpt_path, "model.safetensors.index.json") | ||
| st_files = [os.path.join(ckpt_path, "model.safetensors")] | ||
| if os.path.exists(index_path): | ||
| with open(index_path) as f: | ||
| index = json.load(f) | ||
| st_files = [os.path.join(ckpt_path, f) for f in set(index.get("weight_map", {}).values())] | ||
| for sf_path in st_files: | ||
| if not os.path.exists(sf_path): | ||
| continue |
| for key in f: | ||
| if ".mlp.experts." not in key or "weight_shape" in key: | ||
| checkpoint_weights[key] = f.get_tensor(key) | ||
| # Hybrid restoration | ||
| for name, module in model.named_modules(): | ||
| if not isinstance(module, CompressedLinear): | ||
| continue | ||
| with torch.no_grad(): | ||
| target_device = next(module.parameters()).device | ||
| # CASE A: Real BF16 weight exists (vision, lm_head) | ||
| if f"{name}.weight" in checkpoint_weights: | ||
| w = checkpoint_weights[f"{name}.weight"].to(target_device) | ||
| module._parameters.pop("weight", None) | ||
| module._buffers.pop("weight", None) | ||
| module.__dict__.pop("weight", None) | ||
| param = torch.nn.Parameter(w, requires_grad=False) | ||
| module._parameters["weight"] = param | ||
| module.__dict__["weight"] = param | ||
| module.quantization_status = QuantizationStatus.FROZEN | ||
| logger.debug("Restored BF16 layer: %s", name) | ||
| # CASE B: Expert (stay compressed, fix metadata) | ||
| elif f"{name}.weight_shape" in checkpoint_weights: | ||
| ws = checkpoint_weights[f"{name}.weight_shape"] | ||
| if f"{name}.weight_packed" in checkpoint_weights: | ||
| module.weight_packed = checkpoint_weights[f"{name}.weight_packed"].to( | ||
| torch.int32 | ||
| ) | ||
| module._parameters.pop("weight", None) |
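The shard-resolution step in the quoted code can be exercised standalone. A minimal sketch mirroring its index-file fallback, run against a synthetic index in a temp directory (the shard file names here are made up):

```python
import json
import os
import tempfile


def resolve_shard_files(ckpt_dir):
    """Return the safetensors shards for a checkpoint directory.

    Mirrors the PR's logic as a sketch: prefer the weight_map from
    model.safetensors.index.json, fall back to a single model.safetensors.
    """
    index_path = os.path.join(ckpt_dir, "model.safetensors.index.json")
    shards = [os.path.join(ckpt_dir, "model.safetensors")]
    if os.path.exists(index_path):
        with open(index_path) as f:
            index = json.load(f)
        # The index maps each tensor name to its shard; dedupe via set().
        shards = sorted(
            os.path.join(ckpt_dir, name)
            for name in set(index.get("weight_map", {}).values())
        )
    return shards


with tempfile.TemporaryDirectory() as d:
    index = {
        "weight_map": {
            "a.weight": "model-00001.safetensors",
            "b.weight": "model-00002.safetensors",
            "a.bias": "model-00001.safetensors",
        }
    }
    with open(os.path.join(d, "model.safetensors.index.json"), "w") as f:
        json.dump(index, f)
    shards = resolve_shard_files(d)
```

Each resolved shard is then opened with `safe_open`, skipping paths that do not exist, as in the diff above.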
| # Skip non-pack-quantized weights (e.g., vision modules stored as BF16) | ||
| if isinstance(compressed_data["weight_packed"], torch.Tensor): | ||
| if compressed_data["weight_packed"].dtype != torch.int32: | ||
| return | ||
| decompressed = self.compressor.decompress_weight( | ||
| compressed_data=compressed_data, | ||
| quantization_args=quant_args, | ||
| ) |
| CompressedLinear.__getattr__ = CompressedLinear._modelopt_original_getattr | ||
| delattr(CompressedLinear, "_modelopt_original_getattr") | ||
| elif hasattr(CompressedLinear, "__getattr__"): | ||
| del CompressedLinear.__getattr__ |
examples/llm_ptq/example_utils.py
Outdated
| # Patch CompressedLinear before loading to handle missing weight attribute | ||
| _patch_compressed_linear_init() | ||
| # Pass torch_dtype="auto" to preserve original dtypes from safetensors | ||
| # This prevents int32 packed weights from being converted to float | ||
| model = AutoModelForCausalLM.from_pretrained( | ||
| ckpt_path, | ||
| device_map="auto", | ||
| trust_remote_code=trust_remote_code, | ||
| torch_dtype=torch_dtype, | ||
| torch_dtype="auto", | ||
| ) | ||
| # Restore original CompressedLinear behavior after loading | ||
| _restore_compressed_linear() |
| ws = checkpoint_weights[f"{name}.weight_shape"] | ||
| if f"{name}.weight_packed" in checkpoint_weights: | ||
| module.weight_packed = checkpoint_weights[f"{name}.weight_packed"].to( | ||
| torch.int32 | ||
| ) | ||
| module._parameters.pop("weight", None) | ||
| module._buffers.pop("weight", None) | ||
| module.__dict__.pop("weight", None) | ||
| shape_param = torch.nn.Parameter(ws.to(torch.int32), requires_grad=False) | ||
| module._parameters.pop("weight_shape", None) | ||
| module.__dict__.pop("weight_shape", None) | ||
| module._parameters["weight_shape"] = shape_param | ||
| module.__dict__["weight_shape"] = shape_param |
| except ImportError: | ||
| return | ||
| if hasattr(CompressedLinear, "_modelopt_init_patched"): |
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
19-31: ⚠️ Potential issue | 🟡 Minor: Move `logger` initialization below the remaining imports.
`logger = logging.getLogger(__name__)` is executable code, so the later imports on Lines 45-66 now trip Ruff E402. That matches the Code Quality failure on this PR.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 19 - 31, The logger initialization is placed before later imports and triggers an E402 import order error; move the statement logger = logging.getLogger(__name__) so it sits after all import statements (i.e., below the remaining imports that follow the current block) and keep the name unchanged; update any references to logger in this module as needed but do not alter its value or placement relative to other executable code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 461-462: The guard currently uses hasattr(CompressedLinear,
"_modelopt_init_patched") which only checks for attribute existence and prevents
re-patching after _restore_compressed_linear sets the flag to False; change the
check to use getattr(CompressedLinear, "_modelopt_init_patched", False) so it
returns the actual boolean value and only short-circuits when True, allowing
re-patching when the attribute is False or missing.
- Around line 686-698: The patching around CompressedLinear must be
exception-safe and preserve caller options: wrap the call to
AutoModelForCausalLM.from_pretrained inside a try/finally so
_restore_compressed_linear() always runs, do not hardcode device_map="auto" but
use the existing device_map variable (or include it from model_kwargs), and pass
through **model_kwargs (which contains torch_dtype, attn_implementation,
max_memory, etc.) to from_pretrained rather than dropping them; update the block
that checks has_pack_quantized_config(hf_config) to call
_patch_compressed_linear_init(), then in a try call
AutoModelForCausalLM.from_pretrained(ckpt_path, **model_kwargs) (ensuring
device_map is included in model_kwargs), and finally call
_restore_compressed_linear() in the finally clause.
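The try/finally shape this prompt asks for can be illustrated with stand-in objects (all names below are hypothetical, not the real ModelOpt helpers):

```python
class CompressedLinearStub:
    """Stand-in for compressed_tensors' CompressedLinear (hypothetical)."""

    patched = False


def _patch(cls):
    cls.patched = True


def _restore(cls):
    cls.patched = False


def load_with_patch(cls, loader):
    """Keep the class patched only for the duration of loading.

    try/finally guarantees restoration even if the loader raises,
    which is the exception-safety property the review asks for
    around from_pretrained().
    """
    _patch(cls)
    try:
        return loader()
    finally:
        _restore(cls)


def _failing_loader():
    raise RuntimeError("load failed")


# Restoration happens even when loading fails.
try:
    load_with_patch(CompressedLinearStub, _failing_loader)
except RuntimeError:
    pass
assert CompressedLinearStub.patched is False
```

Without the finally clause, an exception inside `from_pretrained` would leave the class patched for every later load in the same process.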
- Around line 563-572: When restoring a BF16 weight in the branch that sets
module._parameters["weight"] and module.quantization_status =
QuantizationStatus.FROZEN (the block referencing name, checkpoint_weights,
module, and logger), also remove any leftover compressed-quantization attributes
to avoid duplicated memory: pop "weight_packed", "weight_scale", "weight_shape",
and "weight_zero_point" from module._parameters, module._buffers, and
module.__dict__ (use .pop(..., None)) so those tensors are cleared from the
module after replacing the weight.
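The cleanup this prompt describes is a triple `.pop(..., None)` across an nn.Module's attribute stores; a dict-based sketch with a stand-in module (names hypothetical):

```python
class ModuleStub:
    """Minimal stand-in for an nn.Module's attribute stores (hypothetical)."""

    def __init__(self):
        self._parameters = {"weight_packed": "int32-tensor", "weight_scale": "scale"}
        self._buffers = {"weight_shape": "shape"}
        self.__dict__["weight_zero_point"] = "zp"


STALE_KEYS = ("weight_packed", "weight_scale", "weight_shape", "weight_zero_point")


def drop_stale_quant_attrs(module):
    """Remove leftover compression metadata from every attribute store.

    .pop(key, None) makes each removal a no-op when the key is absent,
    so the same cleanup runs safely on already-clean modules.
    """
    for key in STALE_KEYS:
        module._parameters.pop(key, None)
        module._buffers.pop(key, None)
        module.__dict__.pop(key, None)


m = ModuleStub()
drop_stale_quant_attrs(m)
```

Popping from all three stores matters because nn.Module resolves attributes through `_parameters`, `_buffers`, and `__dict__` in turn; a stale entry in any one of them would shadow or duplicate the restored weight.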
- Around line 528-553: The _unpack_compressed_linear_weights() logic currently
treats ckpt_path (often model.config._name_or_path) as a filesystem path and
skips processing when os.path.exists fails; call the helper
_resolve_model_path(ckpt_path) at the start (or when ckpt_path is assigned) to
convert HF repo IDs to the local cached path, then use the resolved path for the
subsequent os.path.exists checks, index_path construction, st_files list and
safe_open usage so BF16/metadata repair runs for repo IDs as well.
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 907-908: _unbuild_compressed_data() now stores weight_zero_point
into compressed_data but unpack_weight() (and the other unpacking paths around
the same area) never removes it, leaving stale quantization metadata in the
unpacked module/state_dict; update unpack_weight() (and the related
unpack/decompression functions referenced near the weight unpacking logic) to
pop/remove "weight_zero_point" from the compressed data or from the module
attributes after decompression so the unpacked layer doesn't retain
weight_zero_point in memory or state_dict().
---
Outside diff comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 19-31: The logger initialization is placed before later imports
and triggers an E402 import order error; move the statement logger =
logging.getLogger(__name__) so it sits after all import statements (i.e., below
the remaining imports that follow the current block) and keep the name
unchanged; update any references to logger in this module as needed but do not
alter its value or placement relative to other executable code.
📒 Files selected for processing (2): examples/llm_ptq/example_utils.py, modelopt/torch/quantization/plugins/huggingface.py
| if hasattr(CompressedLinear, "_modelopt_init_patched"): | ||
| return |
Use value check instead of existence check to allow re-patching on subsequent loads.
The guard at line 461 uses hasattr() which checks if the attribute exists, not its value. After _restore_compressed_linear() sets _modelopt_init_patched = False at line 509, the attribute still exists on the class. A second pack-quantized load in the same process will find the attribute exists (despite being False) and skip re-patching entirely.
Change to:
if getattr(CompressedLinear, "_modelopt_init_patched", False):
    return

This checks the actual boolean value, allowing re-patching after restore.
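The difference between the two guards is easy to demonstrate on a throwaway class:

```python
class PatchTarget:
    """Throwaway stand-in for the CompressedLinear class object."""


# Simulate the restore path: the flag is set to False, not deleted.
PatchTarget._modelopt_init_patched = False

# hasattr-style guard: the attribute exists, so this wrongly skips re-patching.
assert hasattr(PatchTarget, "_modelopt_init_patched")

# getattr-style guard: only a True value short-circuits, so re-patching proceeds.
assert getattr(PatchTarget, "_modelopt_init_patched", False) is False

PatchTarget._modelopt_init_patched = True
assert getattr(PatchTarget, "_modelopt_init_patched", False) is True
```

The alternative fix of `delattr`-ing the flag in `_restore_compressed_linear()` would also work, but the value-based guard is more robust to partial restores.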
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/llm_ptq/example_utils.py` around lines 461 - 462, The guard
currently uses hasattr(CompressedLinear, "_modelopt_init_patched") which only
checks for attribute existence and prevents re-patching after
_restore_compressed_linear sets the flag to False; change the check to use
getattr(CompressedLinear, "_modelopt_init_patched", False) so it returns the
actual boolean value and only short-circuits when True, allowing re-patching
when the attribute is False or missing.
| if ckpt_path is None: | ||
| ckpt_path = getattr(model.config, "_name_or_path", None) | ||
| if not ckpt_path: | ||
| return | ||
| from safetensors import safe_open | ||
| # Load non-expert weights and metadata from safetensors | ||
| checkpoint_weights = {} | ||
| index_path = os.path.join(ckpt_path, "model.safetensors.index.json") | ||
| st_files = [os.path.join(ckpt_path, "model.safetensors")] | ||
| if os.path.exists(index_path): | ||
| with open(index_path) as f: | ||
| index = json.load(f) | ||
| st_files = [ | ||
| os.path.join(ckpt_path, f) for f in set(index.get("weight_map", {}).values()) | ||
| ] | ||
| for sf_path in st_files: | ||
| if not os.path.exists(sf_path): | ||
| continue | ||
| with safe_open(sf_path, framework="pt") as f: | ||
| for key in f.keys(): | ||
| if ".mlp.experts." not in key or "weight_shape" in key: | ||
| checkpoint_weights[key] = f.get_tensor(key) | ||
Use _resolve_model_path() to handle HuggingFace repo IDs in _unpack_compressed_linear_weights().
The function receives ckpt_path which may be a HuggingFace repo ID (e.g., moonshotai/Kimi-K2.5), but only checks os.path.exists() on it. When model.config._name_or_path retains the repo ID string instead of the cached local path, the BF16/metadata repair becomes a silent no-op. Call _resolve_model_path() before the os.path.exists() checks to convert repo IDs to local cache paths.
🧰 Tools
🪛 GitHub Actions: Code Quality
[error] 550-550: SIM118 Use `key in dict` instead of `key in dict.keys()`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/llm_ptq/example_utils.py` around lines 528 - 553, The
_unpack_compressed_linear_weights() logic currently treats ckpt_path (often
model.config._name_or_path) as a filesystem path and skips processing when
os.path.exists fails; call the helper _resolve_model_path(ckpt_path) at the
start (or when ckpt_path is assigned) to convert HF repo IDs to the local cached
path, then use the resolved path for the subsequent os.path.exists checks,
index_path construction, st_files list and safe_open usage so BF16/metadata
repair runs for repo IDs as well.
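The fix hinges on a resolver that turns a Hugging Face repo ID into a local directory before any filesystem checks. A minimal sketch of that idea — the function name and the `snapshot_download` fallback are assumptions here, not the repo's actual `_resolve_model_path`:

```python
import os


def resolve_model_path(ckpt_path: str) -> str:
    """Return a local directory for ckpt_path, resolving repo IDs via the HF cache."""
    if os.path.isdir(ckpt_path):
        return ckpt_path
    # Not a local directory: assume a repo ID such as "moonshotai/Kimi-K2.5"
    # and let huggingface_hub map it to (or fetch) the cached snapshot dir.
    from huggingface_hub import snapshot_download

    return snapshot_download(ckpt_path)
```

With this in place, the subsequent `os.path.exists()` checks and safetensors path construction operate on a real directory whether the caller passed a path or a repo ID.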
 elif has_pack_quantized_config(hf_config):
     # Patch CompressedLinear before loading to handle missing weight attribute
     _patch_compressed_linear_init()
     # Pass torch_dtype="auto" to preserve original dtypes from safetensors
     # This prevents int32 packed weights from being converted to float
     model = AutoModelForCausalLM.from_pretrained(
         ckpt_path,
         device_map="auto",
         trust_remote_code=trust_remote_code,
-        torch_dtype=torch_dtype,
+        torch_dtype="auto",
     )
     # Restore original CompressedLinear behavior after loading
     _restore_compressed_linear()
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and examine the file
fd "example_utils.py" examples/llm_ptq/

Repository: NVIDIA/Model-Optimizer
Length of output: 99
🏁 Script executed:
# Read the file to understand the function context around lines 686-698
head -n 750 examples/llm_ptq/example_utils.py | tail -n 100

Repository: NVIDIA/Model-Optimizer
Length of output: 4912
🏁 Script executed:
# Get a broader view of the function containing these lines
sed -n '620,720p' examples/llm_ptq/example_utils.py

Repository: NVIDIA/Model-Optimizer
Length of output: 4759
🏁 Script executed:
# Find the function signature and parameters
sed -n '550,630p' examples/llm_ptq/example_utils.py

Repository: NVIDIA/Model-Optimizer
Length of output: 3428
🏁 Script executed:
# Search for _patch_compressed_linear_init and _restore_compressed_linear definitions
rg -n "_patch_compressed_linear_init|_restore_compressed_linear" examples/llm_ptq/example_utils.py -A 5

Repository: NVIDIA/Model-Optimizer
Length of output: 1378
🏁 Script executed:
# Get the full context of the patch and restore functions
sed -n '449,515p' examples/llm_ptq/example_utils.py

Repository: NVIDIA/Model-Optimizer
Length of output: 2756
🏁 Script executed:
# Verify the function signature and confirm device_map and model_kwargs are parameters
sed -n '610,640p' examples/llm_ptq/example_utils.py

Repository: NVIDIA/Model-Optimizer
Length of output: 1365
Use try/finally to ensure patch restoration and preserve caller-supplied parameters.
Hardcoding device_map="auto" ignores the dynamically configured device_map (which may be set to "cpu", "sequential", None, or "auto" based on device parameter, VL model detection, or BART model type). Skipping **model_kwargs drops caller-supplied settings like attn_implementation, max_memory, and torch_dtype that are built earlier in the function.
Additionally, if AutoModelForCausalLM.from_pretrained() raises an exception, the CompressedLinear.__getattr__ patch remains active globally, affecting the rest of the process. The patch must be restored in all code paths using a try/finally block.
Suggested fix
elif has_pack_quantized_config(hf_config):
# Patch CompressedLinear before loading to handle missing weight attribute
_patch_compressed_linear_init()
- # Pass torch_dtype="auto" to preserve original dtypes from safetensors
- # This prevents int32 packed weights from being converted to float
- model = AutoModelForCausalLM.from_pretrained(
- ckpt_path,
- device_map="auto",
- trust_remote_code=trust_remote_code,
- torch_dtype="auto",
- )
- # Restore original CompressedLinear behavior after loading
- _restore_compressed_linear()
+ try:
+ # Pass torch_dtype="auto" to preserve original dtypes from safetensors
+ # This prevents int32 packed weights from being converted to float
+ pack_model_kwargs = {**model_kwargs, "torch_dtype": "auto"}
+ model = AutoModelForCausalLM.from_pretrained(
+ ckpt_path,
+ device_map=device_map,
+ **pack_model_kwargs,
+ )
+ finally:
+ # Restore original CompressedLinear behavior after loading
+ _restore_compressed_linear()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/llm_ptq/example_utils.py` around lines 686 - 698, The patching
around CompressedLinear must be exception-safe and preserve caller options: wrap
the call to AutoModelForCausalLM.from_pretrained inside a try/finally so
_restore_compressed_linear() always runs, do not hardcode device_map="auto" but
use the existing device_map variable (or include it from model_kwargs), and pass
through **model_kwargs (which contains torch_dtype, attn_implementation,
max_memory, etc.) to from_pretrained rather than dropping them; update the block
that checks has_pack_quantized_config(hf_config) to call
_patch_compressed_linear_init(), then in a try call
AutoModelForCausalLM.from_pretrained(ckpt_path, **model_kwargs) (ensuring
device_map is included in model_kwargs), and finally call
_restore_compressed_linear() in the finally clause.
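The try/finally discipline this comment asks for can also be packaged as a context manager, which makes it impossible to forget the restore. A generic sketch with invented names — not the repo's actual helpers:

```python
from contextlib import contextmanager


class FakeLinear:  # stand-in for CompressedLinear in this sketch
    pass


@contextmanager
def class_attr_patched(cls, name, value):
    """Set cls.<name> for the duration of the block, restoring it even on error."""
    missing = object()
    old = getattr(cls, name, missing)
    setattr(cls, name, value)
    try:
        yield
    finally:
        if old is missing:
            delattr(cls, name)
        else:
            setattr(cls, name, old)


# The patch never outlives the load, even when loading raises:
try:
    with class_attr_patched(FakeLinear, "_modelopt_init_patched", True):
        raise RuntimeError("from_pretrained failed")
except RuntimeError:
    pass
assert not hasattr(FakeLinear, "_modelopt_init_patched")
```

The same `finally` guarantee applies whether the restore is spelled inline (as in the suggested fix) or hidden inside a `with` block.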
if hasattr(self, "weight_zero_point"):
    compressed_data["weight_zero_point"] = self.weight_zero_point
Also drop weight_zero_point when unpacking.
_build_compressed_data() now feeds weight_zero_point into decompression, but unpack_weight() never removes it afterward. Unpacked layers will keep stale quantization metadata in memory and state_dict().
Suggested fix
if hasattr(self, "weight_scale"):
del self.weight_scale
+ if hasattr(self, "weight_zero_point"):
+ del self.weight_zero_point
if hasattr(self, "weight_shape"):
if "weight_shape" in self._parameters:
del self._parameters["weight_shape"]

Also applies to: 962-972
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 907 - 908,
_unbuild_compressed_data() now stores weight_zero_point into compressed_data but
unpack_weight() (and the other unpacking paths around the same area) never
removes it, leaving stale quantization metadata in the unpacked
module/state_dict; update unpack_weight() (and the related unpack/decompression
functions referenced near the weight unpacking logic) to pop/remove
"weight_zero_point" from the compressed data or from the module attributes after
decompression so the unpacked layer doesn't retain weight_zero_point in memory
or state_dict().
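The cleanup this prompt describes amounts to popping a fixed set of attribute names from every container a module might hide them in. A generic sketch on a stand-in object rather than a real `nn.Module` — the helper itself is hypothetical, but the attribute names match the review:

```python
# Packed-quantization attributes that should not survive unpacking.
PACKED_ATTRS = ("weight_packed", "weight_scale", "weight_shape", "weight_zero_point")


def drop_packed_attrs(module) -> None:
    """Remove packed-quantization state from parameters, buffers, and instance dict."""
    for attr in PACKED_ATTRS:
        for container in (module._parameters, module._buffers, module.__dict__):
            container.pop(attr, None)


class FakeModule:  # stand-in for an nn.Module; only the containers matter here
    def __init__(self):
        self._parameters = {"weight_scale": 1.0, "weight": "restored"}
        self._buffers = {"weight_zero_point": 0}
        self.weight_packed = b"\x00\x01"


m = FakeModule()
drop_packed_attrs(m)
assert "weight_scale" not in m._parameters and m._parameters["weight"] == "restored"
```

Popping with a default (`pop(attr, None)`) keeps the helper idempotent, so it is safe to call on layers that were never compressed.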
♻️ Duplicate comments (5)
modelopt/torch/quantization/plugins/huggingface.py (1)
960-970: ⚠️ Potential issue | 🟡 Minor

Also remove `weight_zero_point` during unpack cleanup.

`_build_compressed_data()` now carries `weight_zero_point`, but unpack cleanup never clears it, leaving stale quant metadata in unpacked modules.

Suggested fix

 if hasattr(self, "weight_scale"):
     del self.weight_scale
+if hasattr(self, "weight_zero_point"):
+    del self.weight_zero_point
 if hasattr(self, "weight_shape"):
     if "weight_shape" in self._parameters:
         del self._parameters["weight_shape"]
     else:
         delattr(self, "weight_shape")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `modelopt/torch/quantization/plugins/huggingface.py` around lines 960 - 970, the unpack cleanup block must also remove weight_zero_point to avoid leaving stale quant metadata after calling _build_compressed_data(); update the same cleanup in the method that deletes weight_packed, weight_scale and weight_shape to additionally check for and delete weight_zero_point (similar to how weight_shape is removed from _parameters or via delattr), and ensure the quantization_status transition from QuantizationStatus.COMPRESSED to QuantizationStatus.FROZEN remains unchanged.

examples/llm_ptq/example_utils.py (4)
528-543: ⚠️ Potential issue | 🟠 Major

Resolve HuggingFace repo IDs before filesystem checks.

This path assumes `ckpt_path` is local; with repo IDs, unpack/metadata repair can silently no-op. Resolve first, then build safetensor paths from the resolved directory.

Suggested fix

 if ckpt_path is None:
     ckpt_path = getattr(model.config, "_name_or_path", None)
 if not ckpt_path:
     return
+ckpt_path = _resolve_model_path(ckpt_path)
+if not os.path.isdir(ckpt_path):
+    return
 from safetensors import safe_open

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `examples/llm_ptq/example_utils.py` around lines 528 - 543, the code builds index_path and st_files from ckpt_path before ensuring ckpt_path is a resolved local filesystem directory, which breaks when ckpt_path is a HF repo ID; resolve the repo ID to a local snapshot first (i.e. replace ckpt_path from model.config._name_or_path with a resolved local path) before doing os.path.join and os.path.exists checks. Update the logic around ckpt_path, index_path and st_files so resolution happens first (resolve ckpt_path -> local_dir, then set index_path = os.path.join(local_dir, "model.safetensors.index.json") and st_files = [os.path.join(local_dir, "model.safetensors")]) and only then load the index file if it exists.
461-462: ⚠️ Potential issue | 🟠 Major

Use the patch flag value, not attribute existence.

`hasattr()` makes subsequent loads skip patching even after restore sets the flag to `False`. Use `getattr(..., False)` so re-patching works in the same process.

Suggested fix

-if hasattr(CompressedLinear, "_modelopt_init_patched"):
+if getattr(CompressedLinear, "_modelopt_init_patched", False):
     return

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `examples/llm_ptq/example_utils.py` around lines 461 - 462, the current guard uses hasattr(CompressedLinear, "_modelopt_init_patched") which prevents re-patching after the flag is set back to False; change the check to read the boolean flag value instead (e.g. getattr(CompressedLinear, "_modelopt_init_patched", False)) so the code only skips patching when the attribute exists and is True; update the early-return logic around CompressedLinear and its _modelopt_init_patched flag (where the attribute is set/unset) so re-applying the patch in the same process works correctly.
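The distinction this comment relies on is easy to demonstrate in isolation: `hasattr()` only checks that the attribute exists, while `getattr(..., False)` checks its value, so a flag reset to `False` no longer blocks re-patching.

```python
class C:
    pass


C._patched = True   # patch applied
C._patched = False  # restore resets the flag instead of deleting it

assert hasattr(C, "_patched")                   # existence check: still truthy guard
assert getattr(C, "_patched", False) is False   # value check: patching allowed again
```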
684-696: ⚠️ Potential issue | 🟠 Major

Make patch restoration exception-safe and preserve computed load kwargs.

This branch currently hardcodes `device_map="auto"` and drops `model_kwargs`; it also risks leaving the global patch active if load raises.

Suggested fix

 elif has_pack_quantized_config(hf_config):
     # Patch CompressedLinear before loading to handle missing weight attribute
     _patch_compressed_linear_init()
-    # Pass torch_dtype="auto" to preserve original dtypes from safetensors
-    # This prevents int32 packed weights from being converted to float
-    model = AutoModelForCausalLM.from_pretrained(
-        ckpt_path,
-        device_map="auto",
-        trust_remote_code=trust_remote_code,
-        torch_dtype="auto",
-    )
-    # Restore original CompressedLinear behavior after loading
-    _restore_compressed_linear()
+    try:
+        pack_model_kwargs = model_kwargs.copy()
+        pack_model_kwargs["device_map"] = device_map
+        pack_model_kwargs["torch_dtype"] = "auto"
+        model = AutoModelForCausalLM.from_pretrained(
+            ckpt_path,
+            **pack_model_kwargs,
+        )
+    finally:
+        # Restore original CompressedLinear behavior after loading
+        _restore_compressed_linear()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `examples/llm_ptq/example_utils.py` around lines 684 - 696, the branch for pack-quantized configs currently hardcodes device_map, drops the precomputed model_kwargs, and can leave the CompressedLinear patch active on exceptions; modify the block around has_pack_quantized_config to (1) use the existing model_kwargs (instead of hardcoding device_map="auto") when calling AutoModelForCausalLM.from_pretrained(ckpt_path, ...), (2) wrap the load call with a try/finally so _restore_compressed_linear() always runs even if from_pretrained raises, and (3) ensure torch_dtype="auto" is merged into the preserved model_kwargs before passing them to AutoModelForCausalLM.from_pretrained; keep references to _patch_compressed_linear_init, _restore_compressed_linear, has_pack_quantized_config, AutoModelForCausalLM.from_pretrained, ckpt_path, trust_remote_code and model_kwargs to locate and update the code.
561-570: ⚠️ Potential issue | 🟠 Major

Clear compressed tensors after BF16 restoration.

After replacing `weight` and freezing, compressed attrs should be removed; otherwise large layers keep duplicate tensors and inflate memory/state_dict.

Suggested fix

 module._parameters["weight"] = param
 module.__dict__["weight"] = param
 module.quantization_status = QuantizationStatus.FROZEN
+for attr in ("weight_packed", "weight_scale", "weight_shape", "weight_zero_point"):
+    module._parameters.pop(attr, None)
+    module._buffers.pop(attr, None)
+    module.__dict__.pop(attr, None)
 logger.debug("Restored BF16 layer: %s", name)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `examples/llm_ptq/example_utils.py` around lines 561 - 570, after restoring the BF16 weight and freezing it in module (code around checkpoint_weights, module._parameters["weight"], and QuantizationStatus.FROZEN), remove any leftover compressed/quantization artifacts to avoid duplicate large tensors; after setting module._parameters["weight"] and module.quantization_status = QuantizationStatus.FROZEN, add cleanup that pops known compressed keys from module._parameters, module._buffers and module.__dict__ (e.g. "compressed_weight", "weight_compressed", "weight_quantized", "quantizer_state", "scales", "zp" or any key starting with "compressed" or "weight_") and use delattr(module, key) if present so no duplicate large tensors remain in state_dict or memory.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e49da756-845d-4f4e-ad41-4700cf9d85f1
📒 Files selected for processing (2)
examples/llm_ptq/example_utils.py
modelopt/torch/quantization/plugins/huggingface.py
When loading pack-quantized models with trust_remote_code=True, transformers calls _init_weights on CompressedLinear modules that have a missing 'weight' key. Custom model code (e.g. modeling_deepseek.py) accesses module.weight.data directly, which crashes because CompressedLinear has weight_packed instead. The patch returns a no-op dummy to let initialization complete harmlessly. Also adds class docstring to _QuantCompressedLinear explaining the on-the-fly decompression rationale. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
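A toy illustration of the mechanism this commit message describes — all class and method names below are invented for the sketch, not the repo's actual patch: a class-level `__getattr__` fallback hands out an inert stand-in for the missing `weight`, so custom `_init_weights` code that touches `module.weight.data` runs to completion without mutating anything real.

```python
class DummyWeight:
    """Inert stand-in returned for the missing 'weight' attribute."""

    @property
    def data(self):
        return self

    def normal_(self, *args, **kwargs):  # init-style mutators become no-ops
        return self

    def zero_(self, *args, **kwargs):
        return self


class CompressedLinearish:
    """Toy module that stores only packed weights, like CompressedLinear."""

    def __init__(self):
        self.weight_packed = object()  # the real payload lives here

    def __getattr__(self, name):
        # Invoked only when normal lookup fails, i.e. for the absent 'weight'.
        if name == "weight":
            return DummyWeight()
        raise AttributeError(name)


# What custom _init_weights code does; with the fallback it is now harmless:
CompressedLinearish().weight.data.normal_(mean=0.0, std=0.02)
```

Because `__getattr__` is only consulted when normal attribute lookup fails, the fallback never shadows `weight_packed` or a later-restored real `weight`.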
What does this PR do?
Type of change: New model support
Overview: Support Kimi-K2.5 PTQ.
Usage
Testing
You may need
`pip install transformers==4.57.1` and the model file here: https://huggingface.co/nvidia/Kimi-K2.5-NVFP4/blob/main/modeling_kimi_k25.py

Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Bug Fixes
New Features