
Support Kimi-K2.5 PTQ#820

Open
Edwardf0t1 wants to merge 10 commits into main from zhiyu/support-kimi-k2.5-ptq

Conversation

Contributor

@Edwardf0t1 Edwardf0t1 commented Jan 27, 2026

What does this PR do?

Type of change: New model support

Overview: Support Kimi-K2.5 PTQ.

Usage

python3 hf_ptq.py \
    --pyt_ckpt_path moonshotai/Kimi-K2.5 \
    --qformat nvfp4_mlp_only \
    --export_path ./kimi-k2.5-nvfp4 \
    --trust_remote_code

Testing

You may need to pip install transformers==4.57.1 and download the model file from https://huggingface.co/nvidia/Kimi-K2.5-NVFP4/blob/main/modeling_kimi_k25.py

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • Bug Fixes

    • Fixed initialization and weight-restoration when loading pack-quantized models so BF16 and expert weights are correctly restored and stale placeholders removed.
    • Added error handling to avoid failures if optional decompression components are unavailable.
  • New Features

    • Added conditional patch/restore around pack-quantized loads and final unpacking of weights, plus on‑the‑fly decompression during inference and improved logging.


copy-pr-bot bot commented Jan 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

coderabbitai bot commented Jan 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 12fcd61e-adcc-44c4-8723-9109d7e906a8

📥 Commits

Reviewing files that changed from the base of the PR and between 44ce987 and 4053273.

📒 Files selected for processing (2)
  • examples/llm_ptq/example_utils.py
  • modelopt/torch/quantization/plugins/huggingface.py

📝 Walkthrough

Adds pack-quantized model loading support by temporarily patching CompressedLinear during initialization, post-load unpacking/fixing of compressed/BF16/expert weights from safetensors, and runtime on-demand decompression/unpacking in quantized linear implementations.

Changes

Cohort / File(s) Summary
Model loading & patching
examples/llm_ptq/example_utils.py
Adds logger and helpers: _patch_compressed_linear_init, _restore_compressed_linear, _unpack_compressed_linear_weights; detects pack-quantized configs, patches CompressedLinear to supply dummy weights during init, restores behavior after load, and unpacks/fixes weights (BF16, experts) from safetensors with safe error handling.
Quantized linear decompression & unpack
modelopt/torch/quantization/plugins/huggingface.py
Adds _build_compressed_data and updates _QuantCompressedLinear to support on-the-fly decompression when weight_packed is int32 and to robustly unpack COMPRESSED weights into real nn.Parameter, clean stale attrs, and set quantization status to FROZEN. Also integrates logging and related HF utility imports.

Sequence Diagram(s)

sequenceDiagram
    participant User as get_model
    participant Detector as Pack-quantized Detector
    participant Patcher as CompressedLinear Patch
    participant Loader as Model Loader
    participant Unpacker as Weight Unpacker

    User->>Detector: inspect config
    Detector->>Patcher: request patch if pack-quantized
    Patcher->>Loader: patched CompressedLinear used during init
    Loader->>Loader: initialize layers (dummy/missing weights allowed)
    Loader->>Unpacker: post-load call to unpack weights from safetensors
    Unpacker->>Unpacker: restore BF16 weights, fix expert metadata
    Unpacker->>Patcher: restore original CompressedLinear behavior
    Unpacker->>User: model ready
sequenceDiagram
    participant Forward as _QuantCompressedLinear.forward
    participant Check as weight_packed check
    participant Builder as _build_compressed_data
    participant Decompressor as compressor.decompress_weight
    participant Replace as unpack_weight path

    Forward->>Check: is quantization_status COMPRESSED and weight_packed int32?
    alt packed int32
        Check->>Builder: gather weight_packed, scale, shape, zero_point, scheme
        Builder->>Decompressor: provide compressed_data & quant_args
        Decompressor->>Forward: return decompressed weight (one-time log)
        Forward->>Forward: use decompressed weight
        Forward->>Replace: later unpack replaces placeholder with nn.Parameter and cleans attrs
    else non-packed / BF16
        Check->>Forward: use weight_packed directly (no decompression)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 28.57%, which is below the required threshold of 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title 'Support Kimi-K2.5 PTQ' clearly describes the main change: adding post-training quantization support for the Kimi-K2.5 model.
Security Anti-Patterns ✅ Passed Pull request adheres to all SECURITY.md coding practices with no unsafe deserialization, eval/exec, or hardcoded trust parameters introduced.




codecov bot commented Jan 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.11%. Comparing base (bc87981) to head (44ce987).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #820   +/-   ##
=======================================
  Coverage   70.11%   70.11%           
=======================================
  Files         221      221           
  Lines       25459    25459           
=======================================
  Hits        17851    17851           
  Misses       7608     7608           


Collaborator

@cjluo-nv cjluo-nv left a comment

qq: if you just load Kimi K2.5 using HF and do a generation call (not using modelopt), were you able to do it?

return dtype


def _patch_compressed_linear_init():
Collaborator

Can it be a transformers version issue? I was able to load kimi k2 thinking int4 without an issue. Is this specific to kimi k2.5?

print("Patched CompressedLinear for transformers compatibility")


def _unpack_compressed_linear_weights(model, ckpt_path=None):
Collaborator

We do not need it. We should be able to unpack on the fly with the logic in the quantization plugins.

Contributor Author

We need this function as Kimi-K2.5 has BF16 layers (vision, lm_head) alongside compressed INT4 expert layers.
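The hybrid layout the author describes can be sketched as a dispatch over checkpoint keys: a layer with a real `.weight` tensor is restored as BF16, while a layer that only ships packed metadata (`.weight_shape` / `.weight_packed`) stays compressed. Key names follow the compressed-tensors checkpoint layout; the dict below is a made-up stand-in for real safetensors contents:

```python
def classify_layer(name, checkpoint_weights):
    # Dense BF16 layers (vision towers, lm_head) ship a full weight tensor.
    if f"{name}.weight" in checkpoint_weights:
        return "bf16"
    # Expert layers ship only packed INT4 data plus shape metadata.
    if f"{name}.weight_shape" in checkpoint_weights:
        return "packed-int4"
    return "unknown"

ckpt = {
    "lm_head.weight": "bf16-tensor",
    "model.layers.0.mlp.experts.0.gate_proj.weight_shape": "shape-tensor",
}
print(classify_layer("lm_head", ckpt))                                 # → bf16
print(classify_layer("model.layers.0.mlp.experts.0.gate_proj", ckpt))  # → packed-int4
```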

):
    torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
elif has_pack_quantized_config(hf_config):
    # Patch CompressedLinear before loading to handle missing weight attribute
Collaborator

I don't think you need this


if self.quantization_status == QuantizationStatus.COMPRESSED:
    weight_data = self.compressor.decompress_module(self)
    # Check if we should use decompress_module or manual decompress_weight
Collaborator

is this specific to kimi k2.5?

@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/support-kimi-k2.5-ptq branch from 99912fb to 3ff37cd Compare March 13, 2026 06:00

copy-pr-bot bot commented Mar 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@Edwardf0t1 Edwardf0t1 marked this pull request as ready for review March 13, 2026 07:22
@Edwardf0t1 Edwardf0t1 requested review from a team as code owners March 13, 2026 07:22
@Edwardf0t1 Edwardf0t1 requested a review from cjluo-nv March 13, 2026 07:22
Edwardf0t1 and others added 9 commits March 13, 2026 00:27
Signed-off-by: Zhiyu <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Replace print() with logging, extract duplicated compressed_data builder
into _build_compressed_data() helper, fix formatting and remove stale
comments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds support for running PTQ flows on Kimi-K2.5 HuggingFace checkpoints that use compressed-tensors “pack-quantized” weights, by improving CompressedLinear handling during load/inference and adding example-side workarounds for transformers initialization.

Changes:

  • Update _QuantCompressedLinear to support on-the-fly decompression via decompress_weight() and improved compressed metadata handling.
  • Add example utilities to monkeypatch CompressedLinear during from_pretrained() and to (attempt to) restore mixed BF16/compressed weights post-load.
  • Extend pack-quantized config detection to handle nested text_config.quantization_config (for multi-modal configs).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File Description
modelopt/torch/quantization/plugins/huggingface.py Adds on-the-fly decompression path + new compressed metadata builder and updated unpack_weight() logic for CompressedLinear.
examples/llm_ptq/example_utils.py Adds CompressedLinear monkeypatching during model load and post-load restoration logic; updates pack-quant config detection and load branch behavior.
Comments suppressed due to low confidence (1)

modelopt/torch/quantization/plugins/huggingface.py:934

  • New _QuantCompressedLinear behavior (on-the-fly decompress_weight() in forward() plus the new unpack_weight() parameter/buffer cleanup logic) doesn’t appear to be covered by the existing unit tests for modelopt.torch.quantization.plugins.huggingface. Adding a focused test (guarded with pytest.importorskip("compressed_tensors") if needed) would help prevent regressions for pack-quantized models.
    def forward(self, input: Tensor) -> Tensor:
        from compressed_tensors.quantization import QuantizationStatus

        if self.quantization_status == QuantizationStatus.COMPRESSED:
            # Real packed weights are int32. If it's float, it's not actually compressed.
            if self.weight_packed.dtype == torch.int32:
                compressed_data, quant_args = self._build_compressed_data()
                if not hasattr(self, "_logged_on_the_fly"):
                    logger.debug("On-the-fly decompression for %s", self.__class__.__name__)
                    self._logged_on_the_fly = True
                weight_data = self.compressor.decompress_weight(
                    compressed_data=compressed_data,
                    quantization_args=quant_args,
                )
            else:
                weight_data = self.weight_packed
        else:
            weight_data = self.weight

        return linear(self.input_quantizer(input), self.weight_quantizer(weight_data), self.bias)


Comment on lines 689 to 694
model = AutoModelForCausalLM.from_pretrained(
    ckpt_path,
    device_map="auto",
    trust_remote_code=trust_remote_code,
    torch_dtype=torch_dtype,  # removed in this PR
    torch_dtype="auto",       # added in this PR
)
Comment on lines +528 to +546
if ckpt_path is None:
    ckpt_path = getattr(model.config, "_name_or_path", None)
if not ckpt_path:
    return

from safetensors import safe_open

# Load non-expert weights and metadata from safetensors
checkpoint_weights = {}
index_path = os.path.join(ckpt_path, "model.safetensors.index.json")
st_files = [os.path.join(ckpt_path, "model.safetensors")]
if os.path.exists(index_path):
    with open(index_path) as f:
        index = json.load(f)
    st_files = [os.path.join(ckpt_path, f) for f in set(index.get("weight_map", {}).values())]

for sf_path in st_files:
    if not os.path.exists(sf_path):
        continue
Comment on lines +548 to +579
        for key in f:
            if ".mlp.experts." not in key or "weight_shape" in key:
                checkpoint_weights[key] = f.get_tensor(key)

# Hybrid restoration
for name, module in model.named_modules():
    if not isinstance(module, CompressedLinear):
        continue

    with torch.no_grad():
        target_device = next(module.parameters()).device

        # CASE A: Real BF16 weight exists (vision, lm_head)
        if f"{name}.weight" in checkpoint_weights:
            w = checkpoint_weights[f"{name}.weight"].to(target_device)
            module._parameters.pop("weight", None)
            module._buffers.pop("weight", None)
            module.__dict__.pop("weight", None)
            param = torch.nn.Parameter(w, requires_grad=False)
            module._parameters["weight"] = param
            module.__dict__["weight"] = param
            module.quantization_status = QuantizationStatus.FROZEN
            logger.debug("Restored BF16 layer: %s", name)

        # CASE B: Expert (stay compressed, fix metadata)
        elif f"{name}.weight_shape" in checkpoint_weights:
            ws = checkpoint_weights[f"{name}.weight_shape"]
            if f"{name}.weight_packed" in checkpoint_weights:
                module.weight_packed = checkpoint_weights[f"{name}.weight_packed"].to(
                    torch.int32
                )
            module._parameters.pop("weight", None)
Comment on lines +942 to +950
# Skip non-pack-quantized weights (e.g., vision modules stored as BF16)
if isinstance(compressed_data["weight_packed"], torch.Tensor):
    if compressed_data["weight_packed"].dtype != torch.int32:
        return

decompressed = self.compressor.decompress_weight(
    compressed_data=compressed_data,
    quantization_args=quant_args,
)
Comment on lines +505 to +508
    CompressedLinear.__getattr__ = CompressedLinear._modelopt_original_getattr
    delattr(CompressedLinear, "_modelopt_original_getattr")
elif hasattr(CompressedLinear, "__getattr__"):
    del CompressedLinear.__getattr__
Comment on lines +685 to +696
# Patch CompressedLinear before loading to handle missing weight attribute
_patch_compressed_linear_init()
# Pass torch_dtype="auto" to preserve original dtypes from safetensors
# This prevents int32 packed weights from being converted to float
model = AutoModelForCausalLM.from_pretrained(
    ckpt_path,
    device_map="auto",
    trust_remote_code=trust_remote_code,
    torch_dtype=torch_dtype,  # removed in this PR
    torch_dtype="auto",       # added in this PR
)
# Restore original CompressedLinear behavior after loading
_restore_compressed_linear()
Comment on lines +574 to +586
        ws = checkpoint_weights[f"{name}.weight_shape"]
        if f"{name}.weight_packed" in checkpoint_weights:
            module.weight_packed = checkpoint_weights[f"{name}.weight_packed"].to(
                torch.int32
            )
        module._parameters.pop("weight", None)
        module._buffers.pop("weight", None)
        module.__dict__.pop("weight", None)
        shape_param = torch.nn.Parameter(ws.to(torch.int32), requires_grad=False)
        module._parameters.pop("weight_shape", None)
        module.__dict__.pop("weight_shape", None)
        module._parameters["weight_shape"] = shape_param
        module.__dict__["weight_shape"] = shape_param
except ImportError:
    return

if hasattr(CompressedLinear, "_modelopt_init_patched"):
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)

19-31: ⚠️ Potential issue | 🟡 Minor

Move logger initialization below the remaining imports.

logger = logging.getLogger(__name__) is executable code, so the later imports on Lines 45-66 now trip Ruff E402. That matches the Code Quality failure on this PR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 19 - 31, The
logger initialization is placed before later imports and triggers an E402 import
order error; move the statement logger = logging.getLogger(__name__) so it sits
after all import statements (i.e., below the remaining imports that follow the
current block) and keep the name unchanged; update any references to logger in
this module as needed but do not alter its value or placement relative to other
executable code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 461-462: The guard currently uses hasattr(CompressedLinear,
"_modelopt_init_patched") which only checks for attribute existence and prevents
re-patching after _restore_compressed_linear sets the flag to False; change the
check to use getattr(CompressedLinear, "_modelopt_init_patched", False) so it
returns the actual boolean value and only short-circuits when True, allowing
re-patching when the attribute is False or missing.
- Around line 686-698: The patching around CompressedLinear must be
exception-safe and preserve caller options: wrap the call to
AutoModelForCausalLM.from_pretrained inside a try/finally so
_restore_compressed_linear() always runs, do not hardcode device_map="auto" but
use the existing device_map variable (or include it from model_kwargs), and pass
through **model_kwargs (which contains torch_dtype, attn_implementation,
max_memory, etc.) to from_pretrained rather than dropping them; update the block
that checks has_pack_quantized_config(hf_config) to call
_patch_compressed_linear_init(), then in a try call
AutoModelForCausalLM.from_pretrained(ckpt_path, **model_kwargs) (ensuring
device_map is included in model_kwargs), and finally call
_restore_compressed_linear() in the finally clause.
- Around line 563-572: When restoring a BF16 weight in the branch that sets
module._parameters["weight"] and module.quantization_status =
QuantizationStatus.FROZEN (the block referencing name, checkpoint_weights,
module, and logger), also remove any leftover compressed-quantization attributes
to avoid duplicated memory: pop "weight_packed", "weight_scale", "weight_shape",
and "weight_zero_point" from module._parameters, module._buffers, and
module.__dict__ (use .pop(..., None)) so those tensors are cleared from the
module after replacing the weight.
- Around line 528-553: The _unpack_compressed_linear_weights() logic currently
treats ckpt_path (often model.config._name_or_path) as a filesystem path and
skips processing when os.path.exists fails; call the helper
_resolve_model_path(ckpt_path) at the start (or when ckpt_path is assigned) to
convert HF repo IDs to the local cached path, then use the resolved path for the
subsequent os.path.exists checks, index_path construction, st_files list and
safe_open usage so BF16/metadata repair runs for repo IDs as well.

In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 907-908: _unbuild_compressed_data() now stores weight_zero_point
into compressed_data but unpack_weight() (and the other unpacking paths around
the same area) never removes it, leaving stale quantization metadata in the
unpacked module/state_dict; update unpack_weight() (and the related
unpack/decompression functions referenced near the weight unpacking logic) to
pop/remove "weight_zero_point" from the compressed data or from the module
attributes after decompression so the unpacked layer doesn't retain
weight_zero_point in memory or state_dict().

---

Outside diff comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 19-31: The logger initialization is placed before later imports
and triggers an E402 import order error; move the statement logger =
logging.getLogger(__name__) so it sits after all import statements (i.e., below
the remaining imports that follow the current block) and keep the name
unchanged; update any references to logger in this module as needed but do not
alter its value or placement relative to other executable code.
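One of the prompts above asks for the patch around `from_pretrained` to be exception-safe, with restore in a finally clause. A runnable sketch of that shape, using stand-ins for `_patch_compressed_linear_init` and `_restore_compressed_linear` and a simulated load failure:

```python
from contextlib import contextmanager

calls = []

@contextmanager
def patched_compressed_linear():
    calls.append("patch")        # stand-in for _patch_compressed_linear_init()
    try:
        yield
    finally:
        calls.append("restore")  # stand-in for _restore_compressed_linear()

try:
    with patched_compressed_linear():
        # In the real code this is AutoModelForCausalLM.from_pretrained(...)
        raise RuntimeError("simulated from_pretrained failure")
except RuntimeError:
    pass

print(calls)  # → ['patch', 'restore']
```

Even though the load raised, the restore still ran, so a later non-pack-quantized load in the same process sees an unpatched class.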

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3ea0c347-cf35-4862-bcb8-9892a874581f

📥 Commits

Reviewing files that changed from the base of the PR and between bc87981 and 6c35801.

📒 Files selected for processing (2)
  • examples/llm_ptq/example_utils.py
  • modelopt/torch/quantization/plugins/huggingface.py

Comment on lines +461 to +462
if hasattr(CompressedLinear, "_modelopt_init_patched"):
    return
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain: verification scripts against examples/llm_ptq/example_utils.py elided.

Use value check instead of existence check to allow re-patching on subsequent loads.

The guard at line 461 uses hasattr() which checks if the attribute exists, not its value. After _restore_compressed_linear() sets _modelopt_init_patched = False at line 509, the attribute still exists on the class. A second pack-quantized load in the same process will find the attribute exists (despite being False) and skip re-patching entirely.

Change to:

if getattr(CompressedLinear, "_modelopt_init_patched", False):
    return

This checks the actual boolean value, allowing re-patching after restore.
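A minimal repro of the guard bug: once the flag attribute exists, hasattr() is True regardless of the stored value, so a False flag still blocks re-patching. `CompressedLinearStub` is a stand-in class:

```python
class CompressedLinearStub:
    pass

# State after _restore_compressed_linear(): the flag exists but is False.
CompressedLinearStub._modelopt_init_patched = False

# Existence check: would skip re-patching even though the flag is False.
print(hasattr(CompressedLinearStub, "_modelopt_init_patched"))         # → True
# Value check: correctly allows re-patching.
print(getattr(CompressedLinearStub, "_modelopt_init_patched", False))  # → False
```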

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 461 - 462, The guard
currently uses hasattr(CompressedLinear, "_modelopt_init_patched") which only
checks for attribute existence and prevents re-patching after
_restore_compressed_linear sets the flag to False; change the check to use
getattr(CompressedLinear, "_modelopt_init_patched", False) so it returns the
actual boolean value and only short-circuits when True, allowing re-patching
when the attribute is False or missing.

Comment on lines +528 to +553
if ckpt_path is None:
    ckpt_path = getattr(model.config, "_name_or_path", None)
if not ckpt_path:
    return

from safetensors import safe_open

# Load non-expert weights and metadata from safetensors
checkpoint_weights = {}
index_path = os.path.join(ckpt_path, "model.safetensors.index.json")
st_files = [os.path.join(ckpt_path, "model.safetensors")]
if os.path.exists(index_path):
    with open(index_path) as f:
        index = json.load(f)
    st_files = [
        os.path.join(ckpt_path, f) for f in set(index.get("weight_map", {}).values())
    ]

for sf_path in st_files:
    if not os.path.exists(sf_path):
        continue
    with safe_open(sf_path, framework="pt") as f:
        for key in f.keys():
            if ".mlp.experts." not in key or "weight_shape" in key:
                checkpoint_weights[key] = f.get_tensor(key)

Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the function containing these lines
cd examples/llm_ptq && cat -n example_utils.py | sed -n '500,560p'

Repository: NVIDIA/Model-Optimizer

Length of output: 2932


🏁 Script executed:

# Let's also check the function signature and understand the full context
cd examples/llm_ptq && cat -n example_utils.py | sed -n '490,560p'

Repository: NVIDIA/Model-Optimizer

Length of output: 3395


🏁 Script executed:

# Search for where this function is called and how ckpt_path is passed
rg "def.*ckpt_path" examples/llm_ptq/example_utils.py -A 5 -B 2

Repository: NVIDIA/Model-Optimizer

Length of output: 671


🏁 Script executed:

# Look for the function name that contains this code
rg -B 20 "model.safetensors.index.json" examples/llm_ptq/example_utils.py | head -40

Repository: NVIDIA/Model-Optimizer

Length of output: 1736


🏁 Script executed:

# Search for calls to _unpack_compressed_linear_weights
rg "_unpack_compressed_linear_weights" examples/llm_ptq/ -B 3 -A 3

Repository: NVIDIA/Model-Optimizer

Length of output: 1035


🏁 Script executed:

# Check the main example file to see how models are loaded
fd "*.py" examples/llm_ptq/ -type f | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 236


🏁 Script executed:

# Look at the main example to see model loading and function calls
cat examples/llm_ptq/run_ptq.py | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 125


🏁 Script executed:

# Search for where moonshotai/Kimi-K2.5 or other repo IDs appear in the PR context
rg "moonshotai|Kimi" examples/ -r

Repository: NVIDIA/Model-Optimizer

Length of output: 479


🏁 Script executed:

# Find Python files in examples/llm_ptq/
find examples/llm_ptq -name "*.py" | head -10

Repository: NVIDIA/Model-Optimizer

Length of output: 265


🏁 Script executed:

# Look at the context around where _unpack_compressed_linear_weights is called
rg "_unpack_compressed_linear_weights" examples/llm_ptq/example_utils.py -B 30 | head -50

Repository: NVIDIA/Model-Optimizer

Length of output: 2189


🏁 Script executed:

# Search for the load_model_and_tokenizer function signature
rg "def load_model_and_tokenizer" examples/llm_ptq/example_utils.py -A 50 | head -60

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Look for any HF repo ID examples in the codebase
rg "from_pretrained" examples/llm_ptq/ -B 2 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 10904


🏁 Script executed:

# Check the load_model_and_tokenizer function more carefully to see how ckpt_path is used
cat examples/llm_ptq/example_utils.py | sed -n '200,350p'

Repository: NVIDIA/Model-Optimizer

Length of output: 5757


🏁 Script executed:

# Look for the function signature and where _unpack_compressed_linear_weights is called
cat examples/llm_ptq/example_utils.py | sed -n '200,450p'

Repository: NVIDIA/Model-Optimizer

Length of output: 9311


🏁 Script executed:

# Search for any snapshot_download or other HF hub resolution patterns in the codebase
rg "snapshot_download|snapshot_download" examples/ -B 2 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 2180


🏁 Script executed:

# Check if there's any documentation on how ckpt_path is supposed to be handled for HF repo IDs
grep -n "ckpt_path" examples/llm_ptq/example_utils.py | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 1225


🏁 Script executed:

# Find the full load_model_and_tokenizer function
rg "def load_model_and_tokenizer" examples/llm_ptq/example_utils.py -A 150 | head -170

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Look at the exact lines around where _unpack_compressed_linear_weights is called
cat examples/llm_ptq/example_utils.py | sed -n '350,450p'

Repository: NVIDIA/Model-Optimizer

Length of output: 3707


🏁 Script executed:

# Check the get_local_model_path_or_name function more carefully
rg "def get_local_model_path_or_name" examples/llm_ptq/example_utils.py -A 40

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Look for the function that contains the _unpack_compressed_linear_weights call
cat examples/llm_ptq/example_utils.py | grep -n "_unpack_compressed_linear_weights" -B 50 -A 10 | tail -70

Repository: NVIDIA/Model-Optimizer

Length of output: 3508


🏁 Script executed:

# Search for the load_model_and_tokenizer function
rg "def load_model_and_tokenizer" examples/llm_ptq/example_utils.py -A 200 | tail -100

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Check the line numbers around the call to understand the context
cat examples/llm_ptq/example_utils.py | sed -n '600,750p'

Repository: NVIDIA/Model-Optimizer

Length of output: 6804


🏁 Script executed:

# Check if there's a helper function to resolve HF repo IDs to local paths
rg "def get_local_model_path_or_name" examples/llm_ptq/example_utils.py -A 30

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Look at how load_model_and_tokenizer is called - check the calling context
rg "load_model_and_tokenizer\(" examples/llm_ptq/ -B 3 -A 3

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Check the PR context by looking at recent changes to example files
rg "moonshotai" examples/ -B 2 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 527


🏁 Script executed:

# Look at the signature and flow of load_model_and_tokenizer
cat examples/llm_ptq/example_utils.py | sed -n '625,660p'

Repository: NVIDIA/Model-Optimizer

Length of output: 1624


🏁 Script executed:

# Check the full context of how ckpt_path flows through the function
cat examples/llm_ptq/example_utils.py | sed -n '625,645p'

Repository: NVIDIA/Model-Optimizer

Length of output: 1031


🏁 Script executed:

# Look for where load_model_and_tokenizer is actually called in examples
find examples/llm_ptq -name "*.py" -exec grep -l "load_model_and_tokenizer" {} \;


Length of output: 48


🏁 Script executed:

# Let's check hf_ptq.py to see how the model is loaded
cat examples/llm_ptq/hf_ptq.py | head -150


Length of output: 5134


🏁 Script executed:

# Look at the get_local_model_path_or_name function that uses snapshot_download
cat examples/llm_ptq/example_utils.py | sed -n '100,175p'


Length of output: 3212


🏁 Script executed:

# Check if get_local_model_path_or_name is called before _unpack_compressed_linear_weights
rg "get_local_model_path_or_name" examples/llm_ptq/example_utils.py -B 2 -A 2


Length of output: 48


🏁 Script executed:

# Check what model.config._name_or_path contains - search for any documentation or tests
rg "_name_or_path" examples/llm_ptq/ -B 2 -A 2


Length of output: 3605


🏁 Script executed:

# Now let me look at the _resolve_model_path function completely
cat examples/llm_ptq/example_utils.py | sed -n '1,50p'


Length of output: 1505


🏁 Script executed:

# Check the complete _resolve_model_path function
rg "def _resolve_model_path" examples/llm_ptq/example_utils.py -A 50


Length of output: 2321


🏁 Script executed:

# Let me see the complete _resolve_model_path function
cat examples/llm_ptq/example_utils.py | sed -n '1,100p' | tail -50


Length of output: 1958


🏁 Script executed:

# Check where _unpack_compressed_linear_weights is called and verify context
cat examples/llm_ptq/example_utils.py | sed -n '750,770p'


Length of output: 802


🏁 Script executed:

# Search if _resolve_model_path is used elsewhere or with _unpack_compressed_linear_weights
rg "_resolve_model_path|_unpack_compressed_linear" examples/llm_ptq/example_utils.py


Length of output: 352


🏁 Script executed:

# Check if there's been any recent fix or call to _resolve_model_path before _unpack_compressed_linear_weights
cat examples/llm_ptq/example_utils.py | sed -n '730,765p'


Length of output: 1463


🏁 Script executed:

# Check what model.config._name_or_path contains when loaded from a HF repo ID
# by looking at any documentation or tests
rg "config._name_or_path" . -B 3 -A 3


Length of output: 2232


🏁 Script executed:

# Let's verify the actual flow: check if transformers sets _name_or_path to the repo ID or cache path
# by looking for any comments or documentation
rg "from_pretrained" examples/llm_ptq/example_utils.py | head -5


Length of output: 381


🏁 Script executed:

# Most importantly, let's check if there's a case where ckpt_path (the parameter passed to load_model_and_tokenizer) is a repo ID
cat examples/llm_ptq/hf_ptq.py | grep -A 5 "load_model_and_tokenizer\|args.pyt_ckpt_path" | head -20


Length of output: 713


🏁 Script executed:

# Check what types of inputs are expected for ckpt_path in the calling code
rg "pyt_ckpt_path\|ckpt_path" examples/llm_ptq/hf_ptq.py -B 2 -A 2 | head -40


Length of output: 48


🏁 Script executed:

# Let me check the actual behavior by looking at what HF returns for _name_or_path
# Check if there's any initialization of get_model that might resolve the path
rg "def get_model" examples/llm_ptq/example_utils.py -A 5


Length of output: 201


🏁 Script executed:

# Look at the actual line where load_model_and_tokenizer is called to see what's passed as ckpt_path
cat examples/llm_ptq/hf_ptq.py | sed -n '1,100p'


Length of output: 3469


🏁 Script executed:

# Check the argparse definition to see what pyt_ckpt_path expects
rg "pyt_ckpt_path" examples/llm_ptq/hf_ptq.py | head -20


Length of output: 1133


🏁 Script executed:

# Find the argparse definition for pyt_ckpt_path
rg "pyt_ckpt_path" examples/llm_ptq/hf_ptq.py -B 3 -A 3 | grep -A 5 "add_argument.*pyt_ckpt_path"


Length of output: 48


🏁 Script executed:

# Check the full argparse section
cat examples/llm_ptq/hf_ptq.py | sed -n '700,800p'


Length of output: 3988


🏁 Script executed:

# Let's look for where pyt_ckpt_path is defined in argparse
rg "add_argument.*pyt_ckpt_path" examples/llm_ptq/hf_ptq.py -B 1 -A 5


Length of output: 48


🏁 Script executed:

# Search for the argument parser definition
rg "argparse.ArgumentParser" examples/llm_ptq/hf_ptq.py -A 100 | head -150


Length of output: 3535


🏁 Script executed:

# Or search directly for where pyt_ckpt_path argument is added
grep -n "pyt_ckpt_path" examples/llm_ptq/hf_ptq.py | tail -20


Length of output: 1197


🏁 Script executed:

# Check the actual behavior: does transformers keep the repo ID or convert to local path in _name_or_path?
# Let's see if there are any tests or documentation
rg "model.config._name_or_path" . -B 5 -A 5 | head -50


Length of output: 1394


🏁 Script executed:

# Most importantly, let's verify if _resolve_model_path should be called before _unpack_compressed_linear_weights
# by checking how _resolve_model_path is used elsewhere
rg "_resolve_model_path" examples/llm_ptq/example_utils.py -B 2 -A 2


Length of output: 423


Use _resolve_model_path() to handle HuggingFace repo IDs in _unpack_compressed_linear_weights().

The function receives ckpt_path which may be a HuggingFace repo ID (e.g., moonshotai/Kimi-K2.5), but only checks os.path.exists() on it. When model.config._name_or_path retains the repo ID string instead of the cached local path, the BF16/metadata repair becomes a silent no-op. Call _resolve_model_path() before the os.path.exists() checks to convert repo IDs to local cache paths.

🧰 Tools
🪛 GitHub Actions: Code Quality

[error] 550-550: SIM118 Use `key in dict` instead of `key in dict.keys()`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 528 - 553, The
_unpack_compressed_linear_weights() logic currently treats ckpt_path (often
model.config._name_or_path) as a filesystem path and skips processing when
os.path.exists fails; call the helper _resolve_model_path(ckpt_path) at the
start (or when ckpt_path is assigned) to convert HF repo IDs to the local cached
path, then use the resolved path for the subsequent os.path.exists checks,
index_path construction, st_files list and safe_open usage so BF16/metadata
repair runs for repo IDs as well.
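The suggested resolution step can be sketched as a small helper. The name `resolve_model_path` and the injectable `download_fn` parameter are illustrative, not ModelOpt's actual `_resolve_model_path` signature:

```python
import os

def resolve_model_path(ckpt_path, download_fn=None):
    """Return a local directory for ckpt_path.

    A directory path is returned unchanged; anything else is treated as a
    Hugging Face repo ID (e.g. "moonshotai/Kimi-K2.5") and resolved to the
    local cache via download_fn (huggingface_hub.snapshot_download by default).
    """
    if os.path.isdir(ckpt_path):
        return ckpt_path
    if download_fn is None:
        # snapshot_download returns the local snapshot directory,
        # fetching only the files missing from the cache.
        from huggingface_hub import snapshot_download
        download_fn = snapshot_download
    return download_fn(ckpt_path)
```

Calling something like this before any `os.path.exists()` check means the later safetensors scan sees a real directory whether the user passed a filesystem path or a repo ID.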

Comment on lines +686 to +698
elif has_pack_quantized_config(hf_config):
# Patch CompressedLinear before loading to handle missing weight attribute
_patch_compressed_linear_init()
# Pass torch_dtype="auto" to preserve original dtypes from safetensors
# This prevents int32 packed weights from being converted to float
model = AutoModelForCausalLM.from_pretrained(
ckpt_path,
device_map="auto",
trust_remote_code=trust_remote_code,
-                torch_dtype=torch_dtype,
+                torch_dtype="auto",
)
# Restore original CompressedLinear behavior after loading
_restore_compressed_linear()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file
fd "example_utils.py" examples/llm_ptq/


Length of output: 99


🏁 Script executed:

# Read the file to understand the function context around lines 686-698
head -n 750 examples/llm_ptq/example_utils.py | tail -n 100


Length of output: 4912


🏁 Script executed:

# Get a broader view of the function containing these lines
sed -n '620,720p' examples/llm_ptq/example_utils.py


Length of output: 4759


🏁 Script executed:

# Find the function signature and parameters
sed -n '550,630p' examples/llm_ptq/example_utils.py


Length of output: 3428


🏁 Script executed:

# Search for _patch_compressed_linear_init and _restore_compressed_linear definitions
rg -n "_patch_compressed_linear_init|_restore_compressed_linear" examples/llm_ptq/example_utils.py -A 5


Length of output: 1378


🏁 Script executed:

# Get the full context of the patch and restore functions
sed -n '449,515p' examples/llm_ptq/example_utils.py


Length of output: 2756


🏁 Script executed:

# Verify the function signature and confirm device_map and model_kwargs are parameters
sed -n '610,640p' examples/llm_ptq/example_utils.py


Length of output: 1365


Use try/finally to ensure patch restoration and preserve caller-supplied parameters.

Hardcoding device_map="auto" ignores the dynamically configured device_map (which may be set to "cpu", "sequential", None, or "auto" based on device parameter, VL model detection, or BART model type). Skipping **model_kwargs drops caller-supplied settings like attn_implementation, max_memory, and torch_dtype that are built earlier in the function.

Additionally, if AutoModelForCausalLM.from_pretrained() raises an exception, the CompressedLinear.__getattr__ patch remains active globally, affecting the rest of the process. The patch must be restored in all code paths using a try/finally block.

Suggested fix
         elif has_pack_quantized_config(hf_config):
             # Patch CompressedLinear before loading to handle missing weight attribute
             _patch_compressed_linear_init()
-            # Pass torch_dtype="auto" to preserve original dtypes from safetensors
-            # This prevents int32 packed weights from being converted to float
-            model = AutoModelForCausalLM.from_pretrained(
-                ckpt_path,
-                device_map="auto",
-                trust_remote_code=trust_remote_code,
-                torch_dtype="auto",
-            )
-            # Restore original CompressedLinear behavior after loading
-            _restore_compressed_linear()
+            try:
+                # Pass torch_dtype="auto" to preserve original dtypes from safetensors
+                # This prevents int32 packed weights from being converted to float
+                pack_model_kwargs = {**model_kwargs, "torch_dtype": "auto"}
+                model = AutoModelForCausalLM.from_pretrained(
+                    ckpt_path,
+                    device_map=device_map,
+                    **pack_model_kwargs,
+                )
+            finally:
+                # Restore original CompressedLinear behavior after loading
+                _restore_compressed_linear()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 686 - 698, The patching
around CompressedLinear must be exception-safe and preserve caller options: wrap
the call to AutoModelForCausalLM.from_pretrained inside a try/finally so
_restore_compressed_linear() always runs, do not hardcode device_map="auto" but
use the existing device_map variable (or include it from model_kwargs), and pass
through **model_kwargs (which contains torch_dtype, attn_implementation,
max_memory, etc.) to from_pretrained rather than dropping them; update the block
that checks has_pack_quantized_config(hf_config) to call
_patch_compressed_linear_init(), then in a try call
AutoModelForCausalLM.from_pretrained(ckpt_path, **model_kwargs) (ensuring
device_map is included in model_kwargs), and finally call
_restore_compressed_linear() in the finally clause.
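The exception-safe shape this review asks for can be sketched as a context manager. `CompressedLinear` here is a stand-in class, and `patch_init`/`restore_init` are illustrative versions of `_patch_compressed_linear_init`/`_restore_compressed_linear`, not the real helpers:

```python
from contextlib import contextmanager

class CompressedLinear:
    """Stand-in for compressed_tensors' CompressedLinear."""

def patch_init():
    CompressedLinear._modelopt_init_patched = True

def restore_init():
    CompressedLinear._modelopt_init_patched = False

@contextmanager
def patched_compressed_linear():
    """Apply the init patch only for the duration of a model load.

    The finally clause guarantees restoration even when the load raises,
    so the global patch never leaks into the rest of the process.
    """
    patch_init()
    try:
        yield
    finally:
        restore_init()
```

Usage would then look like `with patched_compressed_linear(): model = AutoModelForCausalLM.from_pretrained(ckpt_path, device_map=device_map, **model_kwargs)`, which also keeps the caller-supplied kwargs intact.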

Comment on lines +907 to +908
if hasattr(self, "weight_zero_point"):
compressed_data["weight_zero_point"] = self.weight_zero_point

⚠️ Potential issue | 🟡 Minor

Also drop weight_zero_point when unpacking.

_build_compressed_data() now feeds weight_zero_point into decompression, but unpack_weight() never removes it afterward. Unpacked layers will keep stale quantization metadata in memory and state_dict().

Suggested fix
         if hasattr(self, "weight_scale"):
             del self.weight_scale
+        if hasattr(self, "weight_zero_point"):
+            del self.weight_zero_point
         if hasattr(self, "weight_shape"):
             if "weight_shape" in self._parameters:
                 del self._parameters["weight_shape"]

Also applies to: 962-972

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 907 - 908,
_build_compressed_data() now stores weight_zero_point into compressed_data but
unpack_weight() (and the other unpacking paths around the same area) never
removes it, leaving stale quantization metadata in the unpacked
module/state_dict; update unpack_weight() (and the related unpack/decompression
functions referenced near the weight unpacking logic) to pop/remove
"weight_zero_point" from the compressed data or from the module attributes after
decompression so the unpacked layer doesn't retain weight_zero_point in memory
or state_dict().
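A sketch of that cleanup, assuming the compressed tensors live in an nn.Module-style `_parameters`/`_buffers` layout; `drop_compressed_attrs` is a hypothetical helper name, not ModelOpt API:

```python
COMPRESSED_ATTRS = ("weight_packed", "weight_scale", "weight_shape", "weight_zero_point")

def drop_compressed_attrs(module):
    """Remove leftover packed-quantization tensors after decompression.

    Checks _parameters and _buffers first (mirroring where nn.Module keeps
    registered tensors) and falls back to plain attributes, so the unpacked
    layer's state_dict() no longer carries stale quantization metadata.
    """
    for name in COMPRESSED_ATTRS:
        if name in getattr(module, "_parameters", {}):
            del module._parameters[name]
        elif name in getattr(module, "_buffers", {}):
            del module._buffers[name]
        elif hasattr(module, name):
            delattr(module, name)
```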

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (5)
modelopt/torch/quantization/plugins/huggingface.py (1)

960-970: ⚠️ Potential issue | 🟡 Minor

Also remove weight_zero_point during unpack cleanup.

_build_compressed_data() now carries weight_zero_point, but unpack cleanup never clears it, leaving stale quant metadata in unpacked modules.

Suggested fix
         if hasattr(self, "weight_scale"):
             del self.weight_scale
+        if hasattr(self, "weight_zero_point"):
+            del self.weight_zero_point
         if hasattr(self, "weight_shape"):
             if "weight_shape" in self._parameters:
                 del self._parameters["weight_shape"]
             else:
                 delattr(self, "weight_shape")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 960 - 970,
The unpack cleanup block must also remove weight_zero_point to avoid leaving
stale quant metadata after calling _build_compressed_data(); update the same
cleanup in the method that deletes weight_packed, weight_scale and weight_shape
to additionally check for and delete weight_zero_point (similar to how
weight_shape is removed from _parameters or via delattr), and ensure
quantization_status transitions from QuantizationStatus.COMPRESSED to
QuantizationStatus.FROZEN remains unchanged.
examples/llm_ptq/example_utils.py (4)

528-543: ⚠️ Potential issue | 🟠 Major

Resolve HuggingFace repo IDs before filesystem checks.

This path assumes ckpt_path is local; with repo IDs, unpack/metadata repair can silently no-op. Resolve first, then build safetensor paths from the resolved directory.

Suggested fix
     if ckpt_path is None:
         ckpt_path = getattr(model.config, "_name_or_path", None)
     if not ckpt_path:
         return
+    ckpt_path = _resolve_model_path(ckpt_path)
+    if not os.path.isdir(ckpt_path):
+        return
 
     from safetensors import safe_open
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 528 - 543, The code builds
index_path and st_files from ckpt_path before ensuring ckpt_path is a resolved
local filesystem directory, which breaks when ckpt_path is a HF repo ID; resolve
the repo ID to a local snapshot first (i.e. replace ckpt_path from
model.config._name_or_path with a resolved local path) before doing os.path.join
and os.path.exists checks. Update the logic around ckpt_path, index_path and
st_files so resolution happens first (resolve ckpt_path -> local_dir, then set
index_path = os.path.join(local_dir, "model.safetensors.index.json") and
st_files = [os.path.join(local_dir, "model.safetensors")] and only then load the
index file if it exists).

461-462: ⚠️ Potential issue | 🟠 Major

Use the patch flag value, not attribute existence.

hasattr() makes subsequent loads skip patching even after restore sets the flag to False. Use getattr(..., False) so re-patching works in the same process.

Suggested fix
-    if hasattr(CompressedLinear, "_modelopt_init_patched"):
+    if getattr(CompressedLinear, "_modelopt_init_patched", False):
         return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 461 - 462, The current guard
uses hasattr(CompressedLinear, "_modelopt_init_patched") which prevents
re-patching after the flag is set back to False; change the check to read the
boolean flag value instead (e.g., getattr(CompressedLinear,
"_modelopt_init_patched", False)) so the code only skips patching when the
attribute exists and is True; update the early-return logic around
CompressedLinear and its _modelopt_init_patched flag (where the attribute is
set/unset) so re-applying the patch in the same process works correctly.
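The difference between the two guards can be shown with a toy stand-in (`patch_init`/`restore_init` are illustrative, not the real helpers):

```python
class CompressedLinear:
    """Stand-in; the patch flag lives on the class itself."""

def patch_init():
    # hasattr() would return True here even after restore_init() ran,
    # because restore sets the flag to False rather than deleting it.
    # Reading the value with getattr(..., False) allows re-patching.
    if getattr(CompressedLinear, "_modelopt_init_patched", False):
        return False  # already patched in this process
    CompressedLinear._modelopt_init_patched = True
    return True

def restore_init():
    CompressedLinear._modelopt_init_patched = False
```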

684-696: ⚠️ Potential issue | 🟠 Major

Make patch restoration exception-safe and preserve computed load kwargs.

This branch currently hardcodes device_map="auto" and drops model_kwargs; it also risks leaving the global patch active if load raises.

Suggested fix
         elif has_pack_quantized_config(hf_config):
             # Patch CompressedLinear before loading to handle missing weight attribute
             _patch_compressed_linear_init()
-            # Pass torch_dtype="auto" to preserve original dtypes from safetensors
-            # This prevents int32 packed weights from being converted to float
-            model = AutoModelForCausalLM.from_pretrained(
-                ckpt_path,
-                device_map="auto",
-                trust_remote_code=trust_remote_code,
-                torch_dtype="auto",
-            )
-            # Restore original CompressedLinear behavior after loading
-            _restore_compressed_linear()
+            try:
+                pack_model_kwargs = model_kwargs.copy()
+                pack_model_kwargs["device_map"] = device_map
+                pack_model_kwargs["torch_dtype"] = "auto"
+                model = AutoModelForCausalLM.from_pretrained(
+                    ckpt_path,
+                    **pack_model_kwargs,
+                )
+            finally:
+                # Restore original CompressedLinear behavior after loading
+                _restore_compressed_linear()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 684 - 696, The branch for
pack-quantized configs currently hardcodes device_map and drops the precomputed
model_kwargs and can leave the CompressedLinear patch active on exceptions;
modify the block around has_pack_quantized_config to (1) use the existing
model_kwargs (instead of hardcoding device_map="auto") when calling
AutoModelForCausalLM.from_pretrained(ckpt_path, ...), (2) wrap the load call
with a try/finally so _restore_compressed_linear() always runs even if
from_pretrained raises, and (3) ensure torch_dtype="auto" is merged into the
preserved model_kwargs before passing them to
AutoModelForCausalLM.from_pretrained; keep references to
_patch_compressed_linear_init, _restore_compressed_linear,
has_pack_quantized_config, AutoModelForCausalLM.from_pretrained, ckpt_path,
trust_remote_code and model_kwargs to locate and update the code.

561-570: ⚠️ Potential issue | 🟠 Major

Clear compressed tensors after BF16 restoration.

After replacing weight and freezing, compressed attrs should be removed; otherwise large layers keep duplicate tensors and inflate memory/state_dict.

Suggested fix
                 module._parameters["weight"] = param
                 module.__dict__["weight"] = param
                 module.quantization_status = QuantizationStatus.FROZEN
+                for attr in ("weight_packed", "weight_scale", "weight_shape", "weight_zero_point"):
+                    module._parameters.pop(attr, None)
+                    module._buffers.pop(attr, None)
+                    module.__dict__.pop(attr, None)
                 logger.debug("Restored BF16 layer: %s", name)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 561 - 570, After restoring
the BF16 weight and freezing it in module (code around checkpoint_weights,
module._parameters["weight"], and QuantizationStatus.FROZEN), remove any
leftover compressed/quantization artifacts to avoid duplicate large tensors;
after setting module._parameters["weight"] and module.quantization_status =
QuantizationStatus.FROZEN add cleanup that pops known compressed keys from
module._parameters, module._buffers and module.__dict__ (e.g.
"compressed_weight", "weight_compressed", "weight_quantized", "quantizer_state",
"scales", "zp" or any key starting with "compressed" or "weight_") and use
delattr(module, key) if present so no duplicate large tensors remain in
state_dict or memory.
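A sketch of the restore-then-clean flow, using plain dicts in place of nn.Module storage; `restore_bf16_weight` and the mini `state_dict` helper are hypothetical illustrations of the behavior the review asks for:

```python
def restore_bf16_weight(module, checkpoint_weight):
    """Swap the checkpoint's full-precision tensor back in, then drop
    every compressed artifact so no tensor is stored twice."""
    module._parameters["weight"] = checkpoint_weight
    module.__dict__["weight"] = checkpoint_weight
    for attr in ("weight_packed", "weight_scale", "weight_shape", "weight_zero_point"):
        module._parameters.pop(attr, None)
        module._buffers.pop(attr, None)
        module.__dict__.pop(attr, None)

def state_dict(module):
    # rough stand-in for nn.Module.state_dict()
    return {**module._parameters, **module._buffers}
```

After the cleanup, the state dict contains only the restored weight, so the BF16 layer no longer pays for both the packed and unpacked copies.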
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 528-543: The code builds index_path and st_files from ckpt_path
before ensuring ckpt_path is a resolved local filesystem directory, which breaks
when ckpt_path is a HF repo ID; resolve the repo ID to a local snapshot first
(i.e. replace ckpt_path from model.config._name_or_path with a resolved local
path) before doing os.path.join and os.path.exists checks. Update the logic
around ckpt_path, index_path and st_files so resolution happens first (resolve
ckpt_path -> local_dir, then set index_path = os.path.join(local_dir,
"model.safetensors.index.json") and st_files = [os.path.join(local_dir,
"model.safetensors")] and only then load the index file if it exists).
- Around line 461-462: The current guard uses hasattr(CompressedLinear,
"_modelopt_init_patched") which prevents re-patching after the flag is set back
to False; change the check to read the boolean flag value instead (e.g.,
getattr(CompressedLinear, "_modelopt_init_patched", False)) so the code only
skips patching when the attribute exists and is True; update the early-return
logic around CompressedLinear and its _modelopt_init_patched flag (where the
attribute is set/unset) so re-applying the patch in the same process works
correctly.
- Around line 684-696: The branch for pack-quantized configs currently hardcodes
device_map and drops the precomputed model_kwargs and can leave the
CompressedLinear patch active on exceptions; modify the block around
has_pack_quantized_config to (1) use the existing model_kwargs (instead of
hardcoding device_map="auto") when calling
AutoModelForCausalLM.from_pretrained(ckpt_path, ...), (2) wrap the load call
with a try/finally so _restore_compressed_linear() always runs even if
from_pretrained raises, and (3) ensure torch_dtype="auto" is merged into the
preserved model_kwargs before passing them to
AutoModelForCausalLM.from_pretrained; keep references to
_patch_compressed_linear_init, _restore_compressed_linear,
has_pack_quantized_config, AutoModelForCausalLM.from_pretrained, ckpt_path,
trust_remote_code and model_kwargs to locate and update the code.
- Around line 561-570: After restoring the BF16 weight and freezing it in module
(code around checkpoint_weights, module._parameters["weight"], and
QuantizationStatus.FROZEN), remove any leftover compressed/quantization
artifacts to avoid duplicate large tensors; after setting
module._parameters["weight"] and module.quantization_status =
QuantizationStatus.FROZEN add cleanup that pops known compressed keys from
module._parameters, module._buffers and module.__dict__ (e.g.
"compressed_weight", "weight_compressed", "weight_quantized", "quantizer_state",
"scales", "zp" or any key starting with "compressed" or "weight_") and use
delattr(module, key) if present so no duplicate large tensors remain in
state_dict or memory.

In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 960-970: The unpack cleanup block must also remove
weight_zero_point to avoid leaving stale quant metadata after calling
_build_compressed_data(); update the same cleanup in the method that deletes
weight_packed, weight_scale and weight_shape to additionally check for and
delete weight_zero_point (similar to how weight_shape is removed from
_parameters or via delattr), and ensure quantization_status transitions from
QuantizationStatus.COMPRESSED to QuantizationStatus.FROZEN remains unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e49da756-845d-4f4e-ad41-4700cf9d85f1

📥 Commits

Reviewing files that changed from the base of the PR and between 6c35801 and 44ce987.

📒 Files selected for processing (2)
  • examples/llm_ptq/example_utils.py
  • modelopt/torch/quantization/plugins/huggingface.py

When loading pack-quantized models with trust_remote_code=True,
transformers calls _init_weights on CompressedLinear modules that
lack a 'weight' attribute. Custom model code (e.g. modeling_deepseek.py)
accesses module.weight.data directly, which crashes because
CompressedLinear has weight_packed instead. The patch returns a no-op
dummy to let initialization complete harmlessly.

Also adds class docstring to _QuantCompressedLinear explaining the
on-the-fly decompression rationale.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
