
Support Kimi-K2.5 PTQ#820

Open
Edwardf0t1 wants to merge 10 commits into main from zhiyu/support-kimi-k2.5-ptq

Conversation

Contributor

@Edwardf0t1 Edwardf0t1 commented Jan 27, 2026

What does this PR do?

Type of change: New model support

Overview: Support Kimi-K2.5 PTQ.

Usage

python3 hf_ptq.py \
    --pyt_ckpt_path moonshotai/Kimi-K2.5 \
    --qformat nvfp4_mlp_only \
    --export_path ./kimi-k2.5-nvfp4 \
    --trust_remote_code

Testing

You may need to pip install transformers==4.57.1 and download the model file from https://huggingface.co/nvidia/Kimi-K2.5-NVFP4/blob/main/modeling_kimi_k25.py

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • Bug Fixes

    • Fixed initialization and weight-restoration when loading pack-quantized models so BF16 and expert weights are correctly restored and stale placeholders removed.
    • Added error handling to avoid failures if optional decompression components are unavailable.
  • New Features

    • Added conditional patch/restore around pack-quantized loads and final unpacking of weights, plus on‑the‑fly decompression during inference and improved logging.


copy-pr-bot bot commented Jan 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

coderabbitai bot commented Jan 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 12fcd61e-adcc-44c4-8723-9109d7e906a8

📥 Commits

Reviewing files that changed from the base of the PR and between 44ce987 and 4053273.

📒 Files selected for processing (2)
  • examples/llm_ptq/example_utils.py
  • modelopt/torch/quantization/plugins/huggingface.py

📝 Walkthrough

Adds pack-quantized model loading support by temporarily patching CompressedLinear during initialization, post-load unpacking/fixing of compressed/BF16/expert weights from safetensors, and runtime on-demand decompression/unpacking in quantized linear implementations.

Changes

Cohort / File(s) Summary
Model loading & patching
examples/llm_ptq/example_utils.py
Adds logger and helpers: _patch_compressed_linear_init, _restore_compressed_linear, _unpack_compressed_linear_weights; detects pack-quantized configs, patches CompressedLinear to supply dummy weights during init, restores behavior after load, and unpacks/fixes weights (BF16, experts) from safetensors with safe error handling.
Quantized linear decompression & unpack
modelopt/torch/quantization/plugins/huggingface.py
Adds _build_compressed_data and updates _QuantCompressedLinear to support on-the-fly decompression when weight_packed is int32 and to robustly unpack COMPRESSED weights into real nn.Parameter, clean stale attrs, and set quantization status to FROZEN. Also integrates logging and related HF utility imports.

Sequence Diagram(s)

sequenceDiagram
    participant User as get_model
    participant Detector as Pack-quantized Detector
    participant Patcher as CompressedLinear Patch
    participant Loader as Model Loader
    participant Unpacker as Weight Unpacker

    User->>Detector: inspect config
    Detector->>Patcher: request patch if pack-quantized
    Patcher->>Loader: patched CompressedLinear used during init
    Loader->>Loader: initialize layers (dummy/missing weights allowed)
    Loader->>Unpacker: post-load call to unpack weights from safetensors
    Unpacker->>Unpacker: restore BF16 weights, fix expert metadata
    Unpacker->>Patcher: restore original CompressedLinear behavior
    Unpacker->>User: model ready
sequenceDiagram
    participant Forward as _QuantCompressedLinear.forward
    participant Check as weight_packed check
    participant Builder as _build_compressed_data
    participant Decompressor as compressor.decompress_weight
    participant Replace as unpack_weight path

    Forward->>Check: is quantization_status COMPRESSED and weight_packed int32?
    alt packed int32
        Check->>Builder: gather weight_packed, scale, shape, zero_point, scheme
        Builder->>Decompressor: provide compressed_data & quant_args
        Decompressor->>Forward: return decompressed weight (one-time log)
        Forward->>Forward: use decompressed weight
        Forward->>Replace: later unpack replaces placeholder with nn.Parameter and cleans attrs
    else non-packed / BF16
        Check->>Forward: use weight_packed directly (no decompression)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 28.57%, which is below the required threshold of 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title 'Support Kimi-K2.5 PTQ' clearly describes the main change: adding post-training quantization support for the Kimi-K2.5 model.
Security Anti-Patterns ✅ Passed Pull request adheres to all SECURITY.md coding practices with no unsafe deserialization, eval/exec, or hardcoded trust parameters introduced.




codecov bot commented Jan 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.11%. Comparing base (bc87981) to head (44ce987).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #820   +/-   ##
=======================================
  Coverage   70.11%   70.11%           
=======================================
  Files         221      221           
  Lines       25459    25459           
=======================================
  Hits        17851    17851           
  Misses       7608     7608           


Collaborator

@cjluo-nv cjluo-nv left a comment

qq: if you just load Kimi K2.5 using HF and do a generation call (not using modelopt), were you able to do it?

return dtype


def _patch_compressed_linear_init():
Collaborator

Can it be a transformers version issue? I was able to load kimi k2 thinking int4 without an issue. Is this specific to kimi k2.5?

print("Patched CompressedLinear for transformers compatibility")


def _unpack_compressed_linear_weights(model, ckpt_path=None):
Collaborator

We do not need it. We should be able to unpack on the fly with the logic in the quantization plugins.

Contributor Author

We need this function as Kimi-K2.5 has BF16 layers (vision, lm_head) alongside compressed INT4 expert layers.
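The hybrid layout the author describes can be sketched as a dispatch over checkpoint keys: a layer with a real `.weight` tensor is restored as BF16, while a layer that only ships packed metadata (`.weight_shape` / `.weight_packed`) stays compressed. Key names follow the compressed-tensors checkpoint layout; the dict below is a made-up stand-in for real safetensors contents:

```python
def classify_layer(name, checkpoint_weights):
    # Dense BF16 layers (vision towers, lm_head) ship a full weight tensor.
    if f"{name}.weight" in checkpoint_weights:
        return "bf16"
    # Expert layers ship only packed INT4 data plus shape metadata.
    if f"{name}.weight_shape" in checkpoint_weights:
        return "packed-int4"
    return "unknown"

ckpt = {
    "lm_head.weight": "bf16-tensor",
    "model.layers.0.mlp.experts.0.gate_proj.weight_shape": "shape-tensor",
}
print(classify_layer("lm_head", ckpt))                                 # → bf16
print(classify_layer("model.layers.0.mlp.experts.0.gate_proj", ckpt))  # → packed-int4
```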

):
    torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
elif has_pack_quantized_config(hf_config):
    # Patch CompressedLinear before loading to handle missing weight attribute
Collaborator

I don't think you need this


if self.quantization_status == QuantizationStatus.COMPRESSED:
    weight_data = self.compressor.decompress_module(self)
    # Check if we should use decompress_module or manual decompress_weight
Collaborator

is this specific to kimi k2.5?

@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/support-kimi-k2.5-ptq branch from 99912fb to 3ff37cd Compare March 13, 2026 06:00

copy-pr-bot bot commented Mar 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@Edwardf0t1 Edwardf0t1 marked this pull request as ready for review March 13, 2026 07:22
@Edwardf0t1 Edwardf0t1 requested review from a team as code owners March 13, 2026 07:22
@Edwardf0t1 Edwardf0t1 requested a review from cjluo-nv March 13, 2026 07:22
Edwardf0t1 and others added 9 commits March 13, 2026 00:27
Signed-off-by: Zhiyu <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Replace print() with logging, extract duplicated compressed_data builder
into _build_compressed_data() helper, fix formatting and remove stale
comments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds support for running PTQ flows on Kimi-K2.5 HuggingFace checkpoints that use compressed-tensors “pack-quantized” weights, by improving CompressedLinear handling during load/inference and adding example-side workarounds for transformers initialization.

Changes:

  • Update _QuantCompressedLinear to support on-the-fly decompression via decompress_weight() and improved compressed metadata handling.
  • Add example utilities to monkeypatch CompressedLinear during from_pretrained() and to (attempt to) restore mixed BF16/compressed weights post-load.
  • Extend pack-quantized config detection to handle nested text_config.quantization_config (for multi-modal configs).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File Description
modelopt/torch/quantization/plugins/huggingface.py Adds on-the-fly decompression path + new compressed metadata builder and updated unpack_weight() logic for CompressedLinear.
examples/llm_ptq/example_utils.py Adds CompressedLinear monkeypatching during model load and post-load restoration logic; updates pack-quant config detection and load branch behavior.
Comments suppressed due to low confidence (1)

modelopt/torch/quantization/plugins/huggingface.py:934

  • New _QuantCompressedLinear behavior (on-the-fly decompress_weight() in forward() plus the new unpack_weight() parameter/buffer cleanup logic) doesn’t appear to be covered by the existing unit tests for modelopt.torch.quantization.plugins.huggingface. Adding a focused test (guarded with pytest.importorskip("compressed_tensors") if needed) would help prevent regressions for pack-quantized models.
    def forward(self, input: Tensor) -> Tensor:
        from compressed_tensors.quantization import QuantizationStatus

        if self.quantization_status == QuantizationStatus.COMPRESSED:
            # Real packed weights are int32. If it's float, it's not actually compressed.
            if self.weight_packed.dtype == torch.int32:
                compressed_data, quant_args = self._build_compressed_data()
                if not hasattr(self, "_logged_on_the_fly"):
                    logger.debug("On-the-fly decompression for %s", self.__class__.__name__)
                    self._logged_on_the_fly = True
                weight_data = self.compressor.decompress_weight(
                    compressed_data=compressed_data,
                    quantization_args=quant_args,
                )
            else:
                weight_data = self.weight_packed
        else:
            weight_data = self.weight

        return linear(self.input_quantizer(input), self.weight_quantizer(weight_data), self.bias)


Comment on lines 689 to 694
model = AutoModelForCausalLM.from_pretrained(
    ckpt_path,
    device_map="auto",
    trust_remote_code=trust_remote_code,
    torch_dtype=torch_dtype,  # removed in this PR
    torch_dtype="auto",       # added in this PR
)
Comment on lines +528 to +546
if ckpt_path is None:
    ckpt_path = getattr(model.config, "_name_or_path", None)
if not ckpt_path:
    return

from safetensors import safe_open

# Load non-expert weights and metadata from safetensors
checkpoint_weights = {}
index_path = os.path.join(ckpt_path, "model.safetensors.index.json")
st_files = [os.path.join(ckpt_path, "model.safetensors")]
if os.path.exists(index_path):
    with open(index_path) as f:
        index = json.load(f)
    st_files = [os.path.join(ckpt_path, f) for f in set(index.get("weight_map", {}).values())]

for sf_path in st_files:
    if not os.path.exists(sf_path):
        continue
Comment on lines +548 to +579
        for key in f:
            if ".mlp.experts." not in key or "weight_shape" in key:
                checkpoint_weights[key] = f.get_tensor(key)

# Hybrid restoration
for name, module in model.named_modules():
    if not isinstance(module, CompressedLinear):
        continue

    with torch.no_grad():
        target_device = next(module.parameters()).device

        # CASE A: Real BF16 weight exists (vision, lm_head)
        if f"{name}.weight" in checkpoint_weights:
            w = checkpoint_weights[f"{name}.weight"].to(target_device)
            module._parameters.pop("weight", None)
            module._buffers.pop("weight", None)
            module.__dict__.pop("weight", None)
            param = torch.nn.Parameter(w, requires_grad=False)
            module._parameters["weight"] = param
            module.__dict__["weight"] = param
            module.quantization_status = QuantizationStatus.FROZEN
            logger.debug("Restored BF16 layer: %s", name)

        # CASE B: Expert (stay compressed, fix metadata)
        elif f"{name}.weight_shape" in checkpoint_weights:
            ws = checkpoint_weights[f"{name}.weight_shape"]
            if f"{name}.weight_packed" in checkpoint_weights:
                module.weight_packed = checkpoint_weights[f"{name}.weight_packed"].to(
                    torch.int32
                )
            module._parameters.pop("weight", None)
Comment on lines +942 to +950
# Skip non-pack-quantized weights (e.g., vision modules stored as BF16)
if isinstance(compressed_data["weight_packed"], torch.Tensor):
    if compressed_data["weight_packed"].dtype != torch.int32:
        return

decompressed = self.compressor.decompress_weight(
    compressed_data=compressed_data,
    quantization_args=quant_args,
)
Comment on lines +505 to +508
    CompressedLinear.__getattr__ = CompressedLinear._modelopt_original_getattr
    delattr(CompressedLinear, "_modelopt_original_getattr")
elif hasattr(CompressedLinear, "__getattr__"):
    del CompressedLinear.__getattr__
Comment on lines +685 to +696
# Patch CompressedLinear before loading to handle missing weight attribute
_patch_compressed_linear_init()
# Pass torch_dtype="auto" to preserve original dtypes from safetensors
# This prevents int32 packed weights from being converted to float
model = AutoModelForCausalLM.from_pretrained(
    ckpt_path,
    device_map="auto",
    trust_remote_code=trust_remote_code,
    torch_dtype=torch_dtype,  # removed in this PR
    torch_dtype="auto",       # added in this PR
)
# Restore original CompressedLinear behavior after loading
_restore_compressed_linear()
Comment on lines +574 to +586
        ws = checkpoint_weights[f"{name}.weight_shape"]
        if f"{name}.weight_packed" in checkpoint_weights:
            module.weight_packed = checkpoint_weights[f"{name}.weight_packed"].to(
                torch.int32
            )
        module._parameters.pop("weight", None)
        module._buffers.pop("weight", None)
        module.__dict__.pop("weight", None)
        shape_param = torch.nn.Parameter(ws.to(torch.int32), requires_grad=False)
        module._parameters.pop("weight_shape", None)
        module.__dict__.pop("weight_shape", None)
        module._parameters["weight_shape"] = shape_param
        module.__dict__["weight_shape"] = shape_param
except ImportError:
    return

if hasattr(CompressedLinear, "_modelopt_init_patched"):
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)

19-31: ⚠️ Potential issue | 🟡 Minor

Move logger initialization below the remaining imports.

logger = logging.getLogger(__name__) is executable code, so the later imports on Lines 45-66 now trip Ruff E402. That matches the Code Quality failure on this PR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 19 - 31, The
logger initialization is placed before later imports and triggers an E402 import
order error; move the statement logger = logging.getLogger(__name__) so it sits
after all import statements (i.e., below the remaining imports that follow the
current block) and keep the name unchanged; update any references to logger in
this module as needed but do not alter its value or placement relative to other
executable code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 461-462: The guard currently uses hasattr(CompressedLinear,
"_modelopt_init_patched") which only checks for attribute existence and prevents
re-patching after _restore_compressed_linear sets the flag to False; change the
check to use getattr(CompressedLinear, "_modelopt_init_patched", False) so it
returns the actual boolean value and only short-circuits when True, allowing
re-patching when the attribute is False or missing.
- Around line 686-698: The patching around CompressedLinear must be
exception-safe and preserve caller options: wrap the call to
AutoModelForCausalLM.from_pretrained inside a try/finally so
_restore_compressed_linear() always runs, do not hardcode device_map="auto" but
use the existing device_map variable (or include it from model_kwargs), and pass
through **model_kwargs (which contains torch_dtype, attn_implementation,
max_memory, etc.) to from_pretrained rather than dropping them; update the block
that checks has_pack_quantized_config(hf_config) to call
_patch_compressed_linear_init(), then in a try call
AutoModelForCausalLM.from_pretrained(ckpt_path, **model_kwargs) (ensuring
device_map is included in model_kwargs), and finally call
_restore_compressed_linear() in the finally clause.
- Around line 563-572: When restoring a BF16 weight in the branch that sets
module._parameters["weight"] and module.quantization_status =
QuantizationStatus.FROZEN (the block referencing name, checkpoint_weights,
module, and logger), also remove any leftover compressed-quantization attributes
to avoid duplicated memory: pop "weight_packed", "weight_scale", "weight_shape",
and "weight_zero_point" from module._parameters, module._buffers, and
module.__dict__ (use .pop(..., None)) so those tensors are cleared from the
module after replacing the weight.
- Around line 528-553: The _unpack_compressed_linear_weights() logic currently
treats ckpt_path (often model.config._name_or_path) as a filesystem path and
skips processing when os.path.exists fails; call the helper
_resolve_model_path(ckpt_path) at the start (or when ckpt_path is assigned) to
convert HF repo IDs to the local cached path, then use the resolved path for the
subsequent os.path.exists checks, index_path construction, st_files list and
safe_open usage so BF16/metadata repair runs for repo IDs as well.

In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 907-908: _unbuild_compressed_data() now stores weight_zero_point
into compressed_data but unpack_weight() (and the other unpacking paths around
the same area) never removes it, leaving stale quantization metadata in the
unpacked module/state_dict; update unpack_weight() (and the related
unpack/decompression functions referenced near the weight unpacking logic) to
pop/remove "weight_zero_point" from the compressed data or from the module
attributes after decompression so the unpacked layer doesn't retain
weight_zero_point in memory or state_dict().

---

Outside diff comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 19-31: The logger initialization is placed before later imports
and triggers an E402 import order error; move the statement logger =
logging.getLogger(__name__) so it sits after all import statements (i.e., below
the remaining imports that follow the current block) and keep the name
unchanged; update any references to logger in this module as needed but do not
alter its value or placement relative to other executable code.
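One of the prompts above asks for the patch around `from_pretrained` to be exception-safe, with restore in a finally clause. A runnable sketch of that shape, using stand-ins for `_patch_compressed_linear_init` and `_restore_compressed_linear` and a simulated load failure:

```python
from contextlib import contextmanager

calls = []

@contextmanager
def patched_compressed_linear():
    calls.append("patch")        # stand-in for _patch_compressed_linear_init()
    try:
        yield
    finally:
        calls.append("restore")  # stand-in for _restore_compressed_linear()

try:
    with patched_compressed_linear():
        # In the real code this is AutoModelForCausalLM.from_pretrained(...)
        raise RuntimeError("simulated from_pretrained failure")
except RuntimeError:
    pass

print(calls)  # → ['patch', 'restore']
```

Even though the load raised, the restore still ran, so a later non-pack-quantized load in the same process sees an unpatched class.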

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3ea0c347-cf35-4862-bcb8-9892a874581f

📥 Commits

Reviewing files that changed from the base of the PR and between bc87981 and 6c35801.

📒 Files selected for processing (2)
  • examples/llm_ptq/example_utils.py
  • modelopt/torch/quantization/plugins/huggingface.py

Comment on lines +461 to +462
if hasattr(CompressedLinear, "_modelopt_init_patched"):
    return
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain: verification scripts against examples/llm_ptq/example_utils.py elided.

Use value check instead of existence check to allow re-patching on subsequent loads.

The guard at line 461 uses hasattr() which checks if the attribute exists, not its value. After _restore_compressed_linear() sets _modelopt_init_patched = False at line 509, the attribute still exists on the class. A second pack-quantized load in the same process will find the attribute exists (despite being False) and skip re-patching entirely.

Change to:

if getattr(CompressedLinear, "_modelopt_init_patched", False):
    return

This checks the actual boolean value, allowing re-patching after restore.
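A minimal repro of the guard bug: once the flag attribute exists, hasattr() is True regardless of the stored value, so a False flag still blocks re-patching. `CompressedLinearStub` is a stand-in class:

```python
class CompressedLinearStub:
    pass

# State after _restore_compressed_linear(): the flag exists but is False.
CompressedLinearStub._modelopt_init_patched = False

# Existence check: would skip re-patching even though the flag is False.
print(hasattr(CompressedLinearStub, "_modelopt_init_patched"))         # → True
# Value check: correctly allows re-patching.
print(getattr(CompressedLinearStub, "_modelopt_init_patched", False))  # → False
```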

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 461 - 462, The guard
currently uses hasattr(CompressedLinear, "_modelopt_init_patched") which only
checks for attribute existence and prevents re-patching after
_restore_compressed_linear sets the flag to False; change the check to use
getattr(CompressedLinear, "_modelopt_init_patched", False) so it returns the
actual boolean value and only short-circuits when True, allowing re-patching
when the attribute is False or missing.

Comment on lines +528 to +553
if ckpt_path is None:
    ckpt_path = getattr(model.config, "_name_or_path", None)
if not ckpt_path:
    return

from safetensors import safe_open

# Load non-expert weights and metadata from safetensors
checkpoint_weights = {}
index_path = os.path.join(ckpt_path, "model.safetensors.index.json")
st_files = [os.path.join(ckpt_path, "model.safetensors")]
if os.path.exists(index_path):
    with open(index_path) as f:
        index = json.load(f)
    st_files = [
        os.path.join(ckpt_path, f) for f in set(index.get("weight_map", {}).values())
    ]

for sf_path in st_files:
    if not os.path.exists(sf_path):
        continue
    with safe_open(sf_path, framework="pt") as f:
        for key in f.keys():
            if ".mlp.experts." not in key or "weight_shape" in key:
                checkpoint_weights[key] = f.get_tensor(key)

Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the function containing these lines
cd examples/llm_ptq && cat -n example_utils.py | sed -n '500,560p'

Repository: NVIDIA/Model-Optimizer

Length of output: 2932


🏁 Script executed:

# Let's also check the function signature and understand the full context
cd examples/llm_ptq && cat -n example_utils.py | sed -n '490,560p'

Repository: NVIDIA/Model-Optimizer

Length of output: 3395


🏁 Script executed:

# Search for where this function is called and how ckpt_path is passed
rg "def.*ckpt_path" examples/llm_ptq/example_utils.py -A 5 -B 2

Repository: NVIDIA/Model-Optimizer

Length of output: 671


🏁 Script executed:

# Look for the function name that contains this code
rg -B 20 "model.safetensors.index.json" examples/llm_ptq/example_utils.py | head -40

Repository: NVIDIA/Model-Optimizer

Length of output: 1736


🏁 Script executed:

# Search for calls to _unpack_compressed_linear_weights
rg "_unpack_compressed_linear_weights" examples/llm_ptq/ -B 3 -A 3

Repository: NVIDIA/Model-Optimizer

Length of output: 1035


🏁 Script executed:

# Check the main example file to see how models are loaded
fd "*.py" examples/llm_ptq/ -type f | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 236


🏁 Script executed:

# Look at the main example to see model loading and function calls
cat examples/llm_ptq/run_ptq.py | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 125


🏁 Script executed:

# Search for where moonshotai/Kimi-K2.5 or other repo IDs appear in the PR context
rg "moonshotai|Kimi" examples/ -r

Repository: NVIDIA/Model-Optimizer

Length of output: 479


🏁 Script executed:

# Find Python files in examples/llm_ptq/
find examples/llm_ptq -name "*.py" | head -10

Repository: NVIDIA/Model-Optimizer

Length of output: 265


🏁 Script executed:

# Look at the context around where _unpack_compressed_linear_weights is called
rg "_unpack_compressed_linear_weights" examples/llm_ptq/example_utils.py -B 30 | head -50

Repository: NVIDIA/Model-Optimizer

Length of output: 2189


🏁 Script executed:

# Search for the load_model_and_tokenizer function signature
rg "def load_model_and_tokenizer" examples/llm_ptq/example_utils.py -A 50 | head -60

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Look for any HF repo ID examples in the codebase
rg "from_pretrained" examples/llm_ptq/ -B 2 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 10904


🏁 Script executed:

# Check the load_model_and_tokenizer function more carefully to see how ckpt_path is used
cat examples/llm_ptq/example_utils.py | sed -n '200,350p'

Repository: NVIDIA/Model-Optimizer

Length of output: 5757


🏁 Script executed:

# Look for the function signature and where _unpack_compressed_linear_weights is called
cat examples/llm_ptq/example_utils.py | sed -n '200,450p'

Repository: NVIDIA/Model-Optimizer

Length of output: 9311


🏁 Script executed:

# Search for any snapshot_download or other HF hub resolution patterns in the codebase
rg "snapshot_download|snapshot_download" examples/ -B 2 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 2180


🏁 Script executed:

# Check if there's any documentation on how ckpt_path is supposed to be handled for HF repo IDs
grep -n "ckpt_path" examples/llm_ptq/example_utils.py | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 1225


🏁 Script executed:

# Find the full load_model_and_tokenizer function
rg "def load_model_and_tokenizer" examples/llm_ptq/example_utils.py -A 150 | head -170

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Look at the exact lines around where _unpack_compressed_linear_weights is called
cat examples/llm_ptq/example_utils.py | sed -n '350,450p'

Repository: NVIDIA/Model-Optimizer

Length of output: 3707


🏁 Script executed:

# Check the get_local_model_path_or_name function more carefully
rg "def get_local_model_path_or_name" examples/llm_ptq/example_utils.py -A 40

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Look for the function that contains the _unpack_compressed_linear_weights call
cat examples/llm_ptq/example_utils.py | grep -n "_unpack_compressed_linear_weights" -B 50 -A 10 | tail -70

Repository: NVIDIA/Model-Optimizer

Length of output: 3508


🏁 Script executed:

# Search for the load_model_and_tokenizer function
rg "def load_model_and_tokenizer" examples/llm_ptq/example_utils.py -A 200 | tail -100

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Check the line numbers around the call to understand the context
cat examples/llm_ptq/example_utils.py | sed -n '600,750p'

Repository: NVIDIA/Model-Optimizer

Length of output: 6804


🏁 Script executed:

# Check if there's a helper function to resolve HF repo IDs to local paths
rg "def get_local_model_path_or_name" examples/llm_ptq/example_utils.py -A 30

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Look at how load_model_and_tokenizer is called - check the calling context
rg "load_model_and_tokenizer\(" examples/llm_ptq/ -B 3 -A 3

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Check the PR context by looking at recent changes to example files
rg "moonshotai" examples/ -B 2 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 527


🏁 Script executed:

# Look at the signature and flow of load_model_and_tokenizer
cat examples/llm_ptq/example_utils.py | sed -n '625,660p'

Repository: NVIDIA/Model-Optimizer

Length of output: 1624


🏁 Script executed:

# Check the full context of how ckpt_path flows through the function
cat examples/llm_ptq/example_utils.py | sed -n '625,645p'

Repository: NVIDIA/Model-Optimizer

Length of output: 1031


🏁 Script executed:

# Look for where load_model_and_tokenizer is actually called in examples
find examples/llm_ptq -name "*.py" -exec grep -l "load_model_and_tokenizer" {} \;


Length of output: 48


🏁 Script executed:

# Let's check hf_ptq.py to see how the model is loaded
cat examples/llm_ptq/hf_ptq.py | head -150


Length of output: 5134


🏁 Script executed:

# Look at the get_local_model_path_or_name function that uses snapshot_download
cat examples/llm_ptq/example_utils.py | sed -n '100,175p'


Length of output: 3212


🏁 Script executed:

# Check if get_local_model_path_or_name is called before _unpack_compressed_linear_weights
rg "get_local_model_path_or_name" examples/llm_ptq/example_utils.py -B 2 -A 2


Length of output: 48


🏁 Script executed:

# Check what model.config._name_or_path contains - search for any documentation or tests
rg "_name_or_path" examples/llm_ptq/ -B 2 -A 2


Length of output: 3605


🏁 Script executed:

# Now let me look at the _resolve_model_path function completely
cat examples/llm_ptq/example_utils.py | sed -n '1,50p'


Length of output: 1505


🏁 Script executed:

# Check the complete _resolve_model_path function
rg "def _resolve_model_path" examples/llm_ptq/example_utils.py -A 50


Length of output: 2321


🏁 Script executed:

# Let me see the complete _resolve_model_path function
cat examples/llm_ptq/example_utils.py | sed -n '1,100p' | tail -50


Length of output: 1958


🏁 Script executed:

# Check where _unpack_compressed_linear_weights is called and verify context
cat examples/llm_ptq/example_utils.py | sed -n '750,770p'


Length of output: 802


🏁 Script executed:

# Search if _resolve_model_path is used elsewhere or with _unpack_compressed_linear_weights
rg "_resolve_model_path|_unpack_compressed_linear" examples/llm_ptq/example_utils.py


Length of output: 352


🏁 Script executed:

# Check if there's been any recent fix or call to _resolve_model_path before _unpack_compressed_linear_weights
cat examples/llm_ptq/example_utils.py | sed -n '730,765p'


Length of output: 1463


🏁 Script executed:

# Check what model.config._name_or_path contains when loaded from a HF repo ID
# by looking at any documentation or tests
rg "config._name_or_path" . -B 3 -A 3


Length of output: 2232


🏁 Script executed:

# Let's verify the actual flow: check if transformers sets _name_or_path to the repo ID or cache path
# by looking for any comments or documentation
rg "from_pretrained" examples/llm_ptq/example_utils.py | head -5


Length of output: 381


🏁 Script executed:

# Most importantly, let's check if there's a case where ckpt_path (the parameter passed to load_model_and_tokenizer) is a repo ID
cat examples/llm_ptq/hf_ptq.py | grep -A 5 "load_model_and_tokenizer\|args.pyt_ckpt_path" | head -20


Length of output: 713


🏁 Script executed:

# Check what types of inputs are expected for ckpt_path in the calling code
rg "pyt_ckpt_path\|ckpt_path" examples/llm_ptq/hf_ptq.py -B 2 -A 2 | head -40


Length of output: 48


🏁 Script executed:

# Let me check the actual behavior by looking at what HF returns for _name_or_path
# Check if there's any initialization of get_model that might resolve the path
rg "def get_model" examples/llm_ptq/example_utils.py -A 5


Length of output: 201


🏁 Script executed:

# Look at the actual line where load_model_and_tokenizer is called to see what's passed as ckpt_path
cat examples/llm_ptq/hf_ptq.py | sed -n '1,100p'


Length of output: 3469


🏁 Script executed:

# Check the argparse definition to see what pyt_ckpt_path expects
rg "pyt_ckpt_path" examples/llm_ptq/hf_ptq.py | head -20


Length of output: 1133


🏁 Script executed:

# Find the argparse definition for pyt_ckpt_path
rg "pyt_ckpt_path" examples/llm_ptq/hf_ptq.py -B 3 -A 3 | grep -A 5 "add_argument.*pyt_ckpt_path"


Length of output: 48


🏁 Script executed:

# Check the full argparse section
cat examples/llm_ptq/hf_ptq.py | sed -n '700,800p'


Length of output: 3988


🏁 Script executed:

# Let's look for where pyt_ckpt_path is defined in argparse
rg "add_argument.*pyt_ckpt_path" examples/llm_ptq/hf_ptq.py -B 1 -A 5


Length of output: 48


🏁 Script executed:

# Search for the argument parser definition
rg "argparse.ArgumentParser" examples/llm_ptq/hf_ptq.py -A 100 | head -150


Length of output: 3535


🏁 Script executed:

# Or search directly for where pyt_ckpt_path argument is added
grep -n "pyt_ckpt_path" examples/llm_ptq/hf_ptq.py | tail -20


Length of output: 1197


🏁 Script executed:

# Check the actual behavior: does transformers keep the repo ID or convert to local path in _name_or_path?
# Let's see if there are any tests or documentation
rg "model.config._name_or_path" . -B 5 -A 5 | head -50


Length of output: 1394


🏁 Script executed:

# Most importantly, let's verify if _resolve_model_path should be called before _unpack_compressed_linear_weights
# by checking how _resolve_model_path is used elsewhere
rg "_resolve_model_path" examples/llm_ptq/example_utils.py -B 2 -A 2


Length of output: 423


Use _resolve_model_path() to handle HuggingFace repo IDs in _unpack_compressed_linear_weights().

The function receives ckpt_path which may be a HuggingFace repo ID (e.g., moonshotai/Kimi-K2.5), but only checks os.path.exists() on it. When model.config._name_or_path retains the repo ID string instead of the cached local path, the BF16/metadata repair becomes a silent no-op. Call _resolve_model_path() before the os.path.exists() checks to convert repo IDs to local cache paths.

🧰 Tools
🪛 GitHub Actions: Code Quality

[error] 550-550: SIM118 Use `key in dict` instead of `key in dict.keys()`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 528 - 553, The
_unpack_compressed_linear_weights() logic currently treats ckpt_path (often
model.config._name_or_path) as a filesystem path and skips processing when
os.path.exists fails; call the helper _resolve_model_path(ckpt_path) at the
start (or when ckpt_path is assigned) to convert HF repo IDs to the local cached
path, then use the resolved path for the subsequent os.path.exists checks,
index_path construction, st_files list and safe_open usage so BF16/metadata
repair runs for repo IDs as well.
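The suggested resolution step can be sketched as a small helper. The name `resolve_model_path` and the injectable `download_fn` parameter are illustrative, not ModelOpt's actual `_resolve_model_path` signature:

```python
import os

def resolve_model_path(ckpt_path, download_fn=None):
    """Return a local directory for ckpt_path.

    A directory path is returned unchanged; anything else is treated as a
    Hugging Face repo ID (e.g. "moonshotai/Kimi-K2.5") and resolved to the
    local cache via download_fn (huggingface_hub.snapshot_download by default).
    """
    if os.path.isdir(ckpt_path):
        return ckpt_path
    if download_fn is None:
        # snapshot_download returns the local snapshot directory,
        # fetching only the files missing from the cache.
        from huggingface_hub import snapshot_download
        download_fn = snapshot_download
    return download_fn(ckpt_path)
```

Calling something like this before any `os.path.exists()` check means the later safetensors scan sees a real directory whether the user passed a filesystem path or a repo ID.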

Comment on lines +686 to +698
elif has_pack_quantized_config(hf_config):
# Patch CompressedLinear before loading to handle missing weight attribute
_patch_compressed_linear_init()
# Pass torch_dtype="auto" to preserve original dtypes from safetensors
# This prevents int32 packed weights from being converted to float
model = AutoModelForCausalLM.from_pretrained(
ckpt_path,
device_map="auto",
trust_remote_code=trust_remote_code,
-                torch_dtype=torch_dtype,
+                torch_dtype="auto",
)
# Restore original CompressedLinear behavior after loading
_restore_compressed_linear()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file
fd "example_utils.py" examples/llm_ptq/


Length of output: 99


🏁 Script executed:

# Read the file to understand the function context around lines 686-698
head -n 750 examples/llm_ptq/example_utils.py | tail -n 100


Length of output: 4912


🏁 Script executed:

# Get a broader view of the function containing these lines
sed -n '620,720p' examples/llm_ptq/example_utils.py


Length of output: 4759


🏁 Script executed:

# Find the function signature and parameters
sed -n '550,630p' examples/llm_ptq/example_utils.py


Length of output: 3428


🏁 Script executed:

# Search for _patch_compressed_linear_init and _restore_compressed_linear definitions
rg -n "_patch_compressed_linear_init|_restore_compressed_linear" examples/llm_ptq/example_utils.py -A 5


Length of output: 1378


🏁 Script executed:

# Get the full context of the patch and restore functions
sed -n '449,515p' examples/llm_ptq/example_utils.py


Length of output: 2756


🏁 Script executed:

# Verify the function signature and confirm device_map and model_kwargs are parameters
sed -n '610,640p' examples/llm_ptq/example_utils.py


Length of output: 1365


Use try/finally to ensure patch restoration and preserve caller-supplied parameters.

Hardcoding device_map="auto" ignores the dynamically configured device_map (which may be set to "cpu", "sequential", None, or "auto" based on device parameter, VL model detection, or BART model type). Skipping **model_kwargs drops caller-supplied settings like attn_implementation, max_memory, and torch_dtype that are built earlier in the function.

Additionally, if AutoModelForCausalLM.from_pretrained() raises an exception, the CompressedLinear.__getattr__ patch remains active globally, affecting the rest of the process. The patch must be restored in all code paths using a try/finally block.

Suggested fix
         elif has_pack_quantized_config(hf_config):
             # Patch CompressedLinear before loading to handle missing weight attribute
             _patch_compressed_linear_init()
-            # Pass torch_dtype="auto" to preserve original dtypes from safetensors
-            # This prevents int32 packed weights from being converted to float
-            model = AutoModelForCausalLM.from_pretrained(
-                ckpt_path,
-                device_map="auto",
-                trust_remote_code=trust_remote_code,
-                torch_dtype="auto",
-            )
-            # Restore original CompressedLinear behavior after loading
-            _restore_compressed_linear()
+            try:
+                # Pass torch_dtype="auto" to preserve original dtypes from safetensors
+                # This prevents int32 packed weights from being converted to float
+                pack_model_kwargs = {**model_kwargs, "torch_dtype": "auto"}
+                model = AutoModelForCausalLM.from_pretrained(
+                    ckpt_path,
+                    device_map=device_map,
+                    **pack_model_kwargs,
+                )
+            finally:
+                # Restore original CompressedLinear behavior after loading
+                _restore_compressed_linear()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 686 - 698, The patching
around CompressedLinear must be exception-safe and preserve caller options: wrap
the call to AutoModelForCausalLM.from_pretrained inside a try/finally so
_restore_compressed_linear() always runs, do not hardcode device_map="auto" but
use the existing device_map variable (or include it from model_kwargs), and pass
through **model_kwargs (which contains torch_dtype, attn_implementation,
max_memory, etc.) to from_pretrained rather than dropping them; update the block
that checks has_pack_quantized_config(hf_config) to call
_patch_compressed_linear_init(), then in a try call
AutoModelForCausalLM.from_pretrained(ckpt_path, **model_kwargs) (ensuring
device_map is included in model_kwargs), and finally call
_restore_compressed_linear() in the finally clause.
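The exception-safe shape this review asks for can be sketched as a context manager. `CompressedLinear` here is a stand-in class, and `patch_init`/`restore_init` are illustrative versions of `_patch_compressed_linear_init`/`_restore_compressed_linear`, not the real helpers:

```python
from contextlib import contextmanager

class CompressedLinear:
    """Stand-in for compressed_tensors' CompressedLinear."""

def patch_init():
    CompressedLinear._modelopt_init_patched = True

def restore_init():
    CompressedLinear._modelopt_init_patched = False

@contextmanager
def patched_compressed_linear():
    """Apply the init patch only for the duration of a model load.

    The finally clause guarantees restoration even when the load raises,
    so the global patch never leaks into the rest of the process.
    """
    patch_init()
    try:
        yield
    finally:
        restore_init()
```

Usage would then look like `with patched_compressed_linear(): model = AutoModelForCausalLM.from_pretrained(ckpt_path, device_map=device_map, **model_kwargs)`, which also keeps the caller-supplied kwargs intact.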

Comment on lines +907 to +908
if hasattr(self, "weight_zero_point"):
compressed_data["weight_zero_point"] = self.weight_zero_point

⚠️ Potential issue | 🟡 Minor

Also drop weight_zero_point when unpacking.

_build_compressed_data() now feeds weight_zero_point into decompression, but unpack_weight() never removes it afterward. Unpacked layers will keep stale quantization metadata in memory and state_dict().

Suggested fix
         if hasattr(self, "weight_scale"):
             del self.weight_scale
+        if hasattr(self, "weight_zero_point"):
+            del self.weight_zero_point
         if hasattr(self, "weight_shape"):
             if "weight_shape" in self._parameters:
                 del self._parameters["weight_shape"]

Also applies to: 962-972

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 907 - 908,
_build_compressed_data() now stores weight_zero_point into compressed_data but
unpack_weight() (and the other unpacking paths around the same area) never
removes it, leaving stale quantization metadata in the unpacked
module/state_dict; update unpack_weight() (and the related unpack/decompression
functions referenced near the weight unpacking logic) to pop/remove
"weight_zero_point" from the compressed data or from the module attributes after
decompression so the unpacked layer doesn't retain weight_zero_point in memory
or state_dict().
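A sketch of that cleanup, assuming the compressed tensors live in an nn.Module-style `_parameters`/`_buffers` layout; `drop_compressed_attrs` is a hypothetical helper name, not ModelOpt API:

```python
COMPRESSED_ATTRS = ("weight_packed", "weight_scale", "weight_shape", "weight_zero_point")

def drop_compressed_attrs(module):
    """Remove leftover packed-quantization tensors after decompression.

    Checks _parameters and _buffers first (mirroring where nn.Module keeps
    registered tensors) and falls back to plain attributes, so the unpacked
    layer's state_dict() no longer carries stale quantization metadata.
    """
    for name in COMPRESSED_ATTRS:
        if name in getattr(module, "_parameters", {}):
            del module._parameters[name]
        elif name in getattr(module, "_buffers", {}):
            del module._buffers[name]
        elif hasattr(module, name):
            delattr(module, name)
```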

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (5)
modelopt/torch/quantization/plugins/huggingface.py (1)

960-970: ⚠️ Potential issue | 🟡 Minor

Also remove weight_zero_point during unpack cleanup.

_build_compressed_data() now carries weight_zero_point, but unpack cleanup never clears it, leaving stale quant metadata in unpacked modules.

Suggested fix
         if hasattr(self, "weight_scale"):
             del self.weight_scale
+        if hasattr(self, "weight_zero_point"):
+            del self.weight_zero_point
         if hasattr(self, "weight_shape"):
             if "weight_shape" in self._parameters:
                 del self._parameters["weight_shape"]
             else:
                 delattr(self, "weight_shape")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 960 - 970,
The unpack cleanup block must also remove weight_zero_point to avoid leaving
stale quant metadata after calling _build_compressed_data(); update the same
cleanup in the method that deletes weight_packed, weight_scale and weight_shape
to additionally check for and delete weight_zero_point (similar to how
weight_shape is removed from _parameters or via delattr), and ensure
quantization_status transitions from QuantizationStatus.COMPRESSED to
QuantizationStatus.FROZEN remains unchanged.
examples/llm_ptq/example_utils.py (4)

528-543: ⚠️ Potential issue | 🟠 Major

Resolve HuggingFace repo IDs before filesystem checks.

This path assumes ckpt_path is local; with repo IDs, unpack/metadata repair can silently no-op. Resolve first, then build safetensor paths from the resolved directory.

Suggested fix
     if ckpt_path is None:
         ckpt_path = getattr(model.config, "_name_or_path", None)
     if not ckpt_path:
         return
+    ckpt_path = _resolve_model_path(ckpt_path)
+    if not os.path.isdir(ckpt_path):
+        return
 
     from safetensors import safe_open
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 528 - 543, The code builds
index_path and st_files from ckpt_path before ensuring ckpt_path is a resolved
local filesystem directory, which breaks when ckpt_path is a HF repo ID; resolve
the repo ID to a local snapshot first (i.e. replace ckpt_path from
model.config._name_or_path with a resolved local path) before doing os.path.join
and os.path.exists checks. Update the logic around ckpt_path, index_path and
st_files so resolution happens first (resolve ckpt_path -> local_dir, then set
index_path = os.path.join(local_dir, "model.safetensors.index.json") and
st_files = [os.path.join(local_dir, "model.safetensors")] and only then load the
index file if it exists).

461-462: ⚠️ Potential issue | 🟠 Major

Use the patch flag value, not attribute existence.

hasattr() makes subsequent loads skip patching even after restore sets the flag to False. Use getattr(..., False) so re-patching works in the same process.

Suggested fix
-    if hasattr(CompressedLinear, "_modelopt_init_patched"):
+    if getattr(CompressedLinear, "_modelopt_init_patched", False):
         return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 461 - 462, The current guard
uses hasattr(CompressedLinear, "_modelopt_init_patched") which prevents
re-patching after the flag is set back to False; change the check to read the
boolean flag value instead (e.g., getattr(CompressedLinear,
"_modelopt_init_patched", False)) so the code only skips patching when the
attribute exists and is True; update the early-return logic around
CompressedLinear and its _modelopt_init_patched flag (where the attribute is
set/unset) so re-applying the patch in the same process works correctly.
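The difference between the two guards can be shown with a toy stand-in (`patch_init`/`restore_init` are illustrative, not the real helpers):

```python
class CompressedLinear:
    """Stand-in; the patch flag lives on the class itself."""

def patch_init():
    # hasattr() would return True here even after restore_init() ran,
    # because restore sets the flag to False rather than deleting it.
    # Reading the value with getattr(..., False) allows re-patching.
    if getattr(CompressedLinear, "_modelopt_init_patched", False):
        return False  # already patched in this process
    CompressedLinear._modelopt_init_patched = True
    return True

def restore_init():
    CompressedLinear._modelopt_init_patched = False
```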

684-696: ⚠️ Potential issue | 🟠 Major

Make patch restoration exception-safe and preserve computed load kwargs.

This branch currently hardcodes device_map="auto" and drops model_kwargs; it also risks leaving the global patch active if load raises.

Suggested fix
         elif has_pack_quantized_config(hf_config):
             # Patch CompressedLinear before loading to handle missing weight attribute
             _patch_compressed_linear_init()
-            # Pass torch_dtype="auto" to preserve original dtypes from safetensors
-            # This prevents int32 packed weights from being converted to float
-            model = AutoModelForCausalLM.from_pretrained(
-                ckpt_path,
-                device_map="auto",
-                trust_remote_code=trust_remote_code,
-                torch_dtype="auto",
-            )
-            # Restore original CompressedLinear behavior after loading
-            _restore_compressed_linear()
+            try:
+                pack_model_kwargs = model_kwargs.copy()
+                pack_model_kwargs["device_map"] = device_map
+                pack_model_kwargs["torch_dtype"] = "auto"
+                model = AutoModelForCausalLM.from_pretrained(
+                    ckpt_path,
+                    **pack_model_kwargs,
+                )
+            finally:
+                # Restore original CompressedLinear behavior after loading
+                _restore_compressed_linear()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 684 - 696, The branch for
pack-quantized configs currently hardcodes device_map and drops the precomputed
model_kwargs and can leave the CompressedLinear patch active on exceptions;
modify the block around has_pack_quantized_config to (1) use the existing
model_kwargs (instead of hardcoding device_map="auto") when calling
AutoModelForCausalLM.from_pretrained(ckpt_path, ...), (2) wrap the load call
with a try/finally so _restore_compressed_linear() always runs even if
from_pretrained raises, and (3) ensure torch_dtype="auto" is merged into the
preserved model_kwargs before passing them to
AutoModelForCausalLM.from_pretrained; keep references to
_patch_compressed_linear_init, _restore_compressed_linear,
has_pack_quantized_config, AutoModelForCausalLM.from_pretrained, ckpt_path,
trust_remote_code and model_kwargs to locate and update the code.

561-570: ⚠️ Potential issue | 🟠 Major

Clear compressed tensors after BF16 restoration.

After replacing weight and freezing, compressed attrs should be removed; otherwise large layers keep duplicate tensors and inflate memory/state_dict.

Suggested fix
                 module._parameters["weight"] = param
                 module.__dict__["weight"] = param
                 module.quantization_status = QuantizationStatus.FROZEN
+                for attr in ("weight_packed", "weight_scale", "weight_shape", "weight_zero_point"):
+                    module._parameters.pop(attr, None)
+                    module._buffers.pop(attr, None)
+                    module.__dict__.pop(attr, None)
                 logger.debug("Restored BF16 layer: %s", name)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 561 - 570, After restoring
the BF16 weight and freezing it in module (code around checkpoint_weights,
module._parameters["weight"], and QuantizationStatus.FROZEN), remove any
leftover compressed/quantization artifacts to avoid duplicate large tensors;
after setting module._parameters["weight"] and module.quantization_status =
QuantizationStatus.FROZEN add cleanup that pops known compressed keys from
module._parameters, module._buffers and module.__dict__ (e.g.
"compressed_weight", "weight_compressed", "weight_quantized", "quantizer_state",
"scales", "zp" or any key starting with "compressed" or "weight_") and use
delattr(module, key) if present so no duplicate large tensors remain in
state_dict or memory.
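A sketch of the restore-then-clean flow, using plain dicts in place of nn.Module storage; `restore_bf16_weight` and the mini `state_dict` helper are hypothetical illustrations of the behavior the review asks for:

```python
def restore_bf16_weight(module, checkpoint_weight):
    """Swap the checkpoint's full-precision tensor back in, then drop
    every compressed artifact so no tensor is stored twice."""
    module._parameters["weight"] = checkpoint_weight
    module.__dict__["weight"] = checkpoint_weight
    for attr in ("weight_packed", "weight_scale", "weight_shape", "weight_zero_point"):
        module._parameters.pop(attr, None)
        module._buffers.pop(attr, None)
        module.__dict__.pop(attr, None)

def state_dict(module):
    # rough stand-in for nn.Module.state_dict()
    return {**module._parameters, **module._buffers}
```

After the cleanup, the state dict contains only the restored weight, so the BF16 layer no longer pays for both the packed and unpacked copies.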
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 528-543: The code builds index_path and st_files from ckpt_path
before ensuring ckpt_path is a resolved local filesystem directory, which breaks
when ckpt_path is a HF repo ID; resolve the repo ID to a local snapshot first
(i.e. replace ckpt_path from model.config._name_or_path with a resolved local
path) before doing os.path.join and os.path.exists checks. Update the logic
around ckpt_path, index_path and st_files so resolution happens first (resolve
ckpt_path -> local_dir, then set index_path = os.path.join(local_dir,
"model.safetensors.index.json") and st_files = [os.path.join(local_dir,
"model.safetensors")] and only then load the index file if it exists).
- Around line 461-462: The current guard uses hasattr(CompressedLinear,
"_modelopt_init_patched") which prevents re-patching after the flag is set back
to False; change the check to read the boolean flag value instead (e.g.,
getattr(CompressedLinear, "_modelopt_init_patched", False)) so the code only
skips patching when the attribute exists and is True; update the early-return
logic around CompressedLinear and its _modelopt_init_patched flag (where the
attribute is set/unset) so re-applying the patch in the same process works
correctly.
- Around line 684-696: The branch for pack-quantized configs currently hardcodes
device_map and drops the precomputed model_kwargs and can leave the
CompressedLinear patch active on exceptions; modify the block around
has_pack_quantized_config to (1) use the existing model_kwargs (instead of
hardcoding device_map="auto") when calling
AutoModelForCausalLM.from_pretrained(ckpt_path, ...), (2) wrap the load call
with a try/finally so _restore_compressed_linear() always runs even if
from_pretrained raises, and (3) ensure torch_dtype="auto" is merged into the
preserved model_kwargs before passing them to
AutoModelForCausalLM.from_pretrained; keep references to
_patch_compressed_linear_init, _restore_compressed_linear,
has_pack_quantized_config, AutoModelForCausalLM.from_pretrained, ckpt_path,
trust_remote_code and model_kwargs to locate and update the code.
- Around line 561-570: After restoring the BF16 weight and freezing it in module
(code around checkpoint_weights, module._parameters["weight"], and
QuantizationStatus.FROZEN), remove any leftover compressed/quantization
artifacts to avoid duplicate large tensors; after setting
module._parameters["weight"] and module.quantization_status =
QuantizationStatus.FROZEN add cleanup that pops known compressed keys from
module._parameters, module._buffers and module.__dict__ (e.g.
"compressed_weight", "weight_compressed", "weight_quantized", "quantizer_state",
"scales", "zp" or any key starting with "compressed" or "weight_") and use
delattr(module, key) if present so no duplicate large tensors remain in
state_dict or memory.

In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 960-970: The unpack cleanup block must also remove
weight_zero_point to avoid leaving stale quant metadata after calling
_build_compressed_data(); update the same cleanup in the method that deletes
weight_packed, weight_scale and weight_shape to additionally check for and
delete weight_zero_point (similar to how weight_shape is removed from
_parameters or via delattr), and ensure quantization_status transitions from
QuantizationStatus.COMPRESSED to QuantizationStatus.FROZEN remains unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e49da756-845d-4f4e-ad41-4700cf9d85f1

📥 Commits

Reviewing files that changed from the base of the PR and between 6c35801 and 44ce987.

📒 Files selected for processing (2)
  • examples/llm_ptq/example_utils.py
  • modelopt/torch/quantization/plugins/huggingface.py

When loading pack-quantized models with trust_remote_code=True,
transformers calls _init_weights on CompressedLinear modules that
lack a 'weight' attribute. Custom model code (e.g. modeling_deepseek.py)
accesses module.weight.data directly, which crashes because
CompressedLinear has weight_packed instead. The patch returns a no-op
dummy to let initialization complete harmlessly.

Also adds class docstring to _QuantCompressedLinear explaining the
on-the-fly decompression rationale.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
