Conversation

Note: Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Note: Reviews paused. This branch appears to be under active development, so CodeRabbit has automatically paused this review to avoid overwhelming it with comments on new commits.

📝 Walkthrough: Adds the CLI flag `--calibrate_kv_cache` and a constant-amax path for KV cache quantization.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as CLI
    participant HF as hf_ptq.py
    participant Config as QuantizerConfig
    participant Calib as CalibrationEngine
    participant Export as Exporter
    User->>HF: run with/without --calibrate_kv_cache
    HF->>Config: derive KV quantizer configs (deepcopy)
    alt calibrate_kv_cache = false
        HF->>Config: set constant_amax = 448.0 for KV quantizers
        HF->>Calib: skip KV calibration, enable calibration for other quantizers
    else calibrate_kv_cache = true
        HF->>Calib: enable calibration for KV quantizers only (disable others)
    end
    Calib->>Config: collect stats / compute scales
    HF->>Export: postprocess_state_dict (write KV scales if calibrated)
    Export->>User: export checkpoint (KV scales present when calibrated)
```
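The `calibrate_kv_cache = false` branch of the diagram can be sketched in plain Python with toy dicts. The 448.0 constant and the `"*[kv]_bmm_quantizer"` / `constant_amax` keys come from this PR's description; the helper name is an illustration, not the actual `hf_ptq.py` code:

```python
import copy

FP8_E4M3_MAX = 448.0  # i.e. torch.finfo(torch.float8_e4m3fn).max

def build_kv_cfg(base_cfg: dict, calibrate_kv_cache: bool) -> dict:
    """Simulate deriving the KV quantizer config as shown in the diagram."""
    kv_cfg = copy.deepcopy(base_cfg)  # deepcopy so the shared config is not mutated
    if not calibrate_kv_cache:
        # Default path: fixed scale of 1.0 via a constant amax, no calibration.
        kv_cfg["*[kv]_bmm_quantizer"]["constant_amax"] = FP8_E4M3_MAX
    return kv_cfg

base = {"*[kv]_bmm_quantizer": {"num_bits": (4, 3)}}
cfg = build_kv_cfg(base, calibrate_kv_cache=False)
print(cfg["*[kv]_bmm_quantizer"]["constant_amax"])   # 448.0
print("constant_amax" in base["*[kv]_bmm_quantizer"])  # False: original untouched
```

The deepcopy matters: without it, injecting `constant_amax` would leak into the shared config used by the other (still calibrated) quantizers.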
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Force-pushed edb0827 to 4176c6d.
Codecov Report: ❌ patch coverage check failed.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1012      +/-   ##
==========================================
- Coverage   71.73%   70.10%    -1.64%
==========================================
  Files         211      221       +10
  Lines       23948    25541     +1593
==========================================
+ Hits        17180    17905      +725
- Misses       6768     7636      +868
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 4
🧹 Nitpick comments (1)
examples/llm_ptq/hf_ptq.py (1)
304-323: Consider using a defensive `.pop()` to avoid a potential `KeyError`. Line 311 uses `kv_cache_quant_cfg.pop("default")`, which will raise `KeyError` if the `"default"` key doesn't exist. While current KV configs should have this key, a defensive approach would be more robust:

```diff
- kv_cache_quant_cfg.pop("default")  # keep other quantizers from auto_quantize
+ kv_cache_quant_cfg.pop("default", None)  # keep other quantizers from auto_quantize
```

The overall KV cache calibration logic looks correct:

- When `calibrate_kv_cache=False`: uses constant amax=448.0 (scale=1.0)
- When `calibrate_kv_cache=True`: runs data-driven calibration on KV quantizers only

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 304 - 323, Replace the direct pop("default") call with a defensive removal to avoid KeyError: check for the key or use a safe-pop pattern on kv_cache_quant_cfg (the dict built from getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]) before proceeding; keep the rest of the KV cache path the same (including setting kv_cache_quant_cfg["*[kv]_bmm_quantizer"]["constant_amax"] when not args.calibrate_kv_cache and using mtq.set_quantizer_by_cfg / mtq.set_quantizer_by_cfg_context with language_model and mtq.calibrate when args.calibrate_kv_cache).
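The difference between the two `pop` forms is easy to demonstrate with a plain dict standing in for the real `kv_cache_quant_cfg`:

```python
# Toy config dict standing in for the real kv_cache_quant_cfg.
kv_cache_quant_cfg = {"*[kv]_bmm_quantizer": {"enable": True}}

# kv_cache_quant_cfg.pop("default") would raise KeyError here;
# the two-argument form degrades gracefully when the key is absent.
removed = kv_cache_quant_cfg.pop("default", None)
print(removed)              # None
print(kv_cache_quant_cfg)   # dict is left unchanged
```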
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 946-949: The current assignment to
quant_cfg["quant_cfg"]["*[kv]_bmm_quantizer"]["constant_amax"] can raise KeyError for
nvfp4_rotate because that format uses separate "*q_bmm_quantizer",
"*k_bmm_quantizer", "*v_bmm_quantizer" keys; update the branch that runs when
args.kv_cache_qformat != "none" and not args.calibrate_kv_cache to detect
args.kv_cache_qformat == "nvfp4_rotate" and in that case set "constant_amax" on
each of quant_cfg["quant_cfg"]["*q_bmm_quantizer"], "*k_bmm_quantizer" and
"*v_bmm_quantizer" (checking each key exists before assignment), otherwise keep
the existing assignment to "*[kv]_bmm_quantizer" (also guarding existence) so no
KeyError occurs.
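The suggested guard can be sketched with plain dicts as stand-ins for the real quant config. The helper name is an illustration, not the modelopt API; the key names and the 448.0 constant are taken from the comment above:

```python
FP8_E4M3_MAX = 448.0  # assumed constant amax for FP8 E4M3

def set_constant_amax(quant_cfg: dict, kv_cache_qformat: str) -> None:
    """Set constant_amax on whichever KV quantizer keys the format defines."""
    if kv_cache_qformat == "nvfp4_rotate":
        # nvfp4_rotate uses separate q/k/v keys instead of "*[kv]_bmm_quantizer".
        keys = ("*q_bmm_quantizer", "*k_bmm_quantizer", "*v_bmm_quantizer")
    else:
        keys = ("*[kv]_bmm_quantizer",)
    for key in keys:
        if key in quant_cfg:  # guard so a missing key cannot raise KeyError
            quant_cfg[key]["constant_amax"] = FP8_E4M3_MAX

cfg = {"*k_bmm_quantizer": {}, "*v_bmm_quantizer": {}}
set_constant_amax(cfg, "nvfp4_rotate")  # "*q_bmm_quantizer" absent: silently skipped
print(cfg["*k_bmm_quantizer"])  # {'constant_amax': 448.0}
```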
In `@modelopt/torch/quantization/config.py`:
- Around line 1033-1043: The new constant_amax config must be validated to be a
finite, strictly positive number to prevent downstream broken scales; update the
ModeloptField for constant_amax so that its validator keeps None allowed but
rejects any value that is <= 0, NaN, or infinite, raising a clear
validation error. Specifically, add a validation callback or constraint on the
constant_amax ModeloptField declaration (symbol: constant_amax) that uses
math.isfinite(value) and value > 0, and ensure TensorQuantizer._get_amax() can
continue to trust the validated value without additional checks. Ensure the
error message references constant_amax and explains it must be a finite positive
number.
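A minimal sketch of such a validator as a plain function. In the real code this would be attached to the `constant_amax` ModeloptField (e.g. via a pydantic-style validation callback); the standalone function here is only an illustration:

```python
import math

def validate_constant_amax(value):
    """Validator sketch: None stays allowed; otherwise require a finite value > 0."""
    if value is None:
        return value
    if not (isinstance(value, (int, float)) and math.isfinite(value) and value > 0):
        raise ValueError(
            "constant_amax must be a finite positive number (or None), "
            f"got {value!r}"
        )
    return value

print(validate_constant_amax(448.0))  # 448.0
print(validate_constant_amax(None))   # None
```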
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 709-712: The current early `continue` when module._constant_amax
is set skips updating per-module calibration flags and so prevents static-bias
calibrators from collecting bias stats and also leaves stale _if_quant/_if_calib
state; change the branch in model_calib.py so that when getattr(module,
"_constant_amax", None) is not None you do not unconditionally continue but
instead: if module._calibrator indicates a static-bias calibrator (or if
module._calibrator is not None and supports bias calibration) set
module._if_calib = True (and ensure module._if_quant is set/cleared
consistently) so load_calib_bias()/compute_bias() can run, otherwise explicitly
clear module._if_calib and module._if_quant to avoid preserving stale state;
reference the attributes _constant_amax, _calibrator, _if_calib, _if_quant and
the function finish_stats_collection()/load_calib_bias()/compute_bias() when
making the change.
In `@modelopt/torch/quantization/nn/modules/tensor_quantizer.py`:
- Around line 617-618: get_kv_cache_scaling_factor currently assumes every entry
from get_scaling_factor() is non-None and calls factor.item(), which crashes
when a KV quantizer uses constant_amax (export_amax has no _amax buffer and
get_scaling_factor returns None). Fix by guarding against None: in
get_kv_cache_scaling_factor (the loop that iterates scaling_factors when dtype
== KV_CACHE_FP8), either filter out None values before iterating or add an if
factor is None: continue check before calling factor.item(); ensure this
respects constant_amax/_constant_amax semantics so fixed-amax quantizers are
skipped rather than dereferenced.
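The guard amounts to skipping `None` entries before dereferencing them. A toy stand-in for the loop (not the real `get_kv_cache_scaling_factor`):

```python
def kv_cache_scale_items(scaling_factors):
    """Yield only usable scales; constant-amax quantizers report None."""
    for name, factor in scaling_factors:
        if factor is None:
            continue  # fixed-amax quantizer: no exported scale, skip it
        yield name, float(factor)  # stands in for factor.item()

factors = [("k_scale", 0.5), ("v_scale", None)]
print(dict(kv_cache_scale_items(factors)))  # {'k_scale': 0.5}
```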
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e7b126c8-ce02-4772-8f0c-d36987d90865
📒 Files selected for processing (7):
- CHANGELOG.rst
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- tests/_test_utils/torch/quantization/tensor_quantizer_common.py
Force-pushed bde9317 to ba93ebd.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 313-323: The nvfp4_rotate KV format is left uncalibrated because
the code only sets constant_amax when "*[kv]_bmm_quantizer" is present and only
runs calibration when args.calibrate_kv_cache is True; fix by detecting the
nvfp4_rotate KV format and forcing KV calibration or rejecting the combination:
either set args.calibrate_kv_cache = True when the parsed KV format equals
"nvfp4_rotate" before you build kv_cache_quant_cfg (or immediately before
calling mtq.set_quantizer_by_cfg/mtq.calibrate), or validate arguments earlier
and raise an error if --kv_cache_qformat nvfp4_rotate is used without
--calibrate_kv_cache; apply the same change at the other similar spots that call
mtq.set_quantizer_by_cfg, mtq.set_quantizer_by_cfg_context, and mtq.calibrate so
nvfp4_rotate never runs on the uncalibrated default path.
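The "validate arguments earlier" option can be sketched as follows. The flag names come from this PR; the helper itself is hypothetical:

```python
import argparse

def validate_kv_args(args: argparse.Namespace) -> None:
    """Reject the unsupported combination instead of silently skipping calibration."""
    if args.kv_cache_qformat == "nvfp4_rotate" and not args.calibrate_kv_cache:
        raise SystemExit(
            "--kv_cache_qformat nvfp4_rotate requires --calibrate_kv_cache: "
            "this format has no uncalibrated constant-amax path"
        )

ns = argparse.Namespace(kv_cache_qformat="fp8", calibrate_kv_cache=False)
validate_kv_args(ns)  # fine: fp8 has a constant-amax default path
```

Failing fast here keeps all the later `mtq.set_quantizer_by_cfg` / `mtq.calibrate` call sites free of format-specific special cases.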
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5aaec202-14ac-4b91-92cf-eea7abee0dd2
📒 Files selected for processing (1)
examples/llm_ptq/hf_ptq.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 964-975: The current shortcut unconditionally sets constant_amax
to 448.0 for "*[kv]_bmm_quantizer" which is valid only for FP8 E4M3; change the
logic in the block that references kv_quantizer_cfg, args.kv_cache_qformat, and
args.calibrate_kv_cache so that you either (a) only apply the constant_amax
shortcut when the quantizer format is FP8 (detect via format identifier or
kv_quantizer_cfg properties), or (b) compute constant_amax from the quantizer’s
configured maxbound (read the quantizer format’s maxbound property from
kv_quantizer_cfg) and set constant_amax = maxbound instead of the hardcoded
448.0; ensure you deepcopy quant_cfg before mutating "*[kv]_bmm_quantizer" as
already done.
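Option (b) amounts to reading the maxbound per format instead of hardcoding 448.0. A sketch with an illustrative lookup table — only the FP8 E4M3 value (448.0) comes from this thread; the NVFP4 entry and the function are stand-ins:

```python
# Hypothetical per-format maxbounds; only the fp8 value is taken from the
# review thread, the nvfp4 entry is an illustrative stand-in.
FORMAT_MAXBOUND = {"fp8": 448.0, "nvfp4": 6.0}

def constant_amax_for(fmt: str) -> float:
    """Derive constant_amax from the format's maxbound so the scale is 1.0."""
    maxbound = FORMAT_MAXBOUND[fmt]
    # scale = amax / maxbound, so amax = maxbound yields a scale of exactly 1.0
    return maxbound

print(constant_amax_for("fp8"))  # 448.0
```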
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 72a1fc78-55a0-4637-b95d-7e2673599983
📒 Files selected for processing (1)
examples/llm_ptq/hf_ptq.py
Force-pushed f238de1 to 0a0439c.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/gpu/torch/export/test_export.py (1)
211-225: ⚠️ Potential issue | 🟡 Minor — Cover the new default “no KV scale emitted” branch.

These updated assertions only validate the calibrated path (`_amax` present → exported scale = `amax / maxbound`). The PR’s default behavior is the opposite branch: with `constant_amax`, `postprocess_state_dict()` should omit `k_scale`/`v_scale` entirely so downstream falls back to 1.0. Please add a `KV_CACHE_FP8` case with no `k_bmm_quantizer._amax`/`v_bmm_quantizer._amax` entries; otherwise the main behavior change can regress unnoticed.

Suggested test addition:

```diff
 @pytest.mark.parametrize(
     ("state_dict", "quantization", "maxbound", "expected_state_dict"),
     [
+        (  # Default constant-amax KV cache path should emit no KV scales
+            {
+                "layer1.input_quantizer._pre_quant_scale": torch.tensor([0.128]),
+            },
+            KV_CACHE_FP8,
+            128.0,
+            {
+                "layer1.pre_quant_scale": torch.tensor([0.128]),
+            },
+        ),
         (  # Test replacements and KV cache scaling
             {
                 "layer1.k_bmm_quantizer._amax": torch.tensor([0.128]),
                 "layer1.v_bmm_quantizer._amax": torch.tensor([256.0]),
                 "layer1.input_quantizer._pre_quant_scale": torch.tensor([0.128]),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/export/test_export.py` around lines 211 - 225, Add a test that covers the default "no KV scale emitted" branch by invoking postprocess_state_dict() with KV_CACHE_FP8 and constant_amax but without providing "layer1.k_bmm_quantizer._amax" or "layer1.v_bmm_quantizer._amax" keys; assert that "layer1.k_proj.k_scale" and "layer1.v_proj.v_scale" are not present in the returned state dict (i.e., omitted so downstream will default to 1.0) rather than being computed, mirroring the existing calibrated-case tests that include _amax keys.
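The intent of the suggested test can be shown with a toy model of the postprocessing — a simulation of the behavior described above, not the real `postprocess_state_dict`:

```python
def postprocess_state_dict_sim(state_dict: dict, maxbound: float) -> dict:
    """Toy model: emit k_scale / v_scale only when a calibrated _amax exists."""
    out = {}
    for key, value in state_dict.items():
        if key.endswith("k_bmm_quantizer._amax"):
            out[key.replace("k_bmm_quantizer._amax", "k_proj.k_scale")] = value / maxbound
        elif key.endswith("v_bmm_quantizer._amax"):
            out[key.replace("v_bmm_quantizer._amax", "v_proj.v_scale")] = value / maxbound
        elif key.endswith("input_quantizer._pre_quant_scale"):
            out[key.replace("input_quantizer._pre_quant_scale", "pre_quant_scale")] = value
    return out

# Default constant-amax path: no _amax entries, so no KV scales are emitted
# and downstream falls back to a scale of 1.0.
result = postprocess_state_dict_sim(
    {"layer1.input_quantizer._pre_quant_scale": 0.128}, maxbound=448.0
)
print("layer1.k_proj.k_scale" in result)  # False
```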
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: beab94cb-122c-4edd-94aa-71aa4021788c
📒 Files selected for processing (1)
tests/gpu/torch/export/test_export.py
Force-pushed e30aded to c2e91ed.
Review thread on the new field in `modelopt/torch/quantization/config.py`:

```python
constant_amax: float | None = ModeloptField(
```
@realAsma please share your opinion on this one. My opinion is that we should not introduce this.

@cjluo-nv what would you suggest instead? I thought we wanted a way to specify whether to use constant amax or calibrated amax.

Could you please give a few suggestions to do this?

My thinking is that we should have a way to simulate the inference behavior of dummy scales, so this option is added here.
Pull request overview
This PR changes the default KV cache quantization behavior in hf_ptq.py. Previously, FP8 KV cache quantization was the default and required data-driven calibration. Now, KV cache quantization defaults to none, and when enabled, uses a constant scale of 1.0 (amax=448.0) without calibration by default. A new --calibrate_kv_cache flag opts into the previous data-driven calibration behavior. The implementation adds a constant_amax field to QuantizerAttributeConfig that allows quantizers to skip calibration entirely and use a fixed amax value.
Changes:

- Added a `constant_amax` field to `QuantizerAttributeConfig` and integrated it into `TensorQuantizer._get_amax()` and `enable_stats_collection()` to skip calibration for constant-amax quantizers.
- Changed the `--kv_cache_qformat` default from `fp8` to `none`, added the `--calibrate_kv_cache` flag, and removed the KV scale floor/clamp logic in the export path.
- Added unit tests for `constant_amax` behavior and updated export tests to match the removed clamping.
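The `_get_amax` short-circuit described above can be sketched with a minimal stand-in class (not the real `TensorQuantizer`; the attribute names mirror the PR's description):

```python
class TinyQuantizer:
    """Stand-in showing the constant-amax short-circuit."""
    def __init__(self, constant_amax=None):
        self._constant_amax = constant_amax
        self._amax = None  # would normally be filled by calibration

    def _get_amax(self):
        if self._constant_amax is not None:
            return self._constant_amax  # fixed amax: calibration never needed
        if self._amax is None:
            raise RuntimeError("quantizer was never calibrated")
        return self._amax

q = TinyQuantizer(constant_amax=448.0)
print(q._get_amax())  # 448.0, with no calibration ever run
```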
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| `modelopt/torch/quantization/config.py` | Added `constant_amax` field to `QuantizerAttributeConfig` |
| `modelopt/torch/quantization/nn/modules/tensor_quantizer.py` | Added `constant_amax` to attribute setter mapping and `_get_amax` short-circuit |
| `modelopt/torch/quantization/model_calib.py` | Skip calibration for `constant_amax` quantizers in `enable_stats_collection` |
| `modelopt/torch/export/quant_utils.py` | Handle `None` scaling factors; remove KV scale floor/clamp |
| `examples/llm_ptq/hf_ptq.py` | Changed defaults, added `--calibrate_kv_cache`, `constant_amax` injection logic |
| `tests/_test_utils/torch/quantization/tensor_quantizer_common.py` | Added tests for `constant_amax` behavior |
| `tests/gpu/torch/export/test_export.py` | Updated expected values after removing KV scale clamping |
| `CHANGELOG.rst` | Documented new features and changes |
Edwardf0t1 left a comment:
I think currently by default we still calibrate kv cache for fp8, right? But in the export stage, most values are clamped to 1.0.
@Edwardf0t1, yes that's correct. An example would be Nemotron models: all KV scales are actually clamped to 1.0.
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
cjluo-nv left a comment:
Review
The core logic is sound — skipping calibration for KV cache when scale=1.0 is the engine default is a good perf win. Tests look solid. A few concerns before approving:
1. Naming: cast_to_fp8 is misleading (medium)
The field is used for both fp8_cast and nvfp4_cast, yet _get_amax always returns torch.finfo(torch.float8_e4m3fn).max (448.0). For an NVFP4 quantizer, a field named cast_to_fp8 is confusing. Consider renaming to something more general like use_constant_amax or skip_calibration, or parameterize the amax value. Same applies to the helper _set_kv_cache_cast_to_fp8 which is called for nvfp4_cast too.
2. Breaking default behavior change (medium-high)
Default --kv_cache_qformat changed from fp8 → fp8_cast. This silently changes behavior for every existing script that doesn't explicitly set --kv_cache_qformat. The changelog entry should call this out as a default behavior change, not just a new feature.
3. getattr(module, "_cast_to_fp8", False) is fragile (low)
In model_calib.py, _cast_to_fp8 is checked via getattr with a default. The attribute should be initialized to False in TensorQuantizer.__init__ so it's always present, rather than relying on dynamic attribute checking.
4. Removed KV scale clamping has broader impact (low-medium)
The clamp_(min=1.0) removal in postprocess_state_dict and the floor removal in get_kv_cache_scaling_factor affect all KV export paths, not just cast_to_fp8. Users on --kv_cache_qformat fp8 (calibrated) with scales < 1.0 will now get different exported values. Please confirm downstream engines handle small scales correctly.
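Concern 4 is easy to quantify: with the floor removed, any calibrated amax below the format's maxbound now exports a scale below 1.0. A sketch of the arithmetic (the 448.0 maxbound is from this PR; the sample amax is illustrative):

```python
def export_scale(amax: float, maxbound: float, clamp_floor: bool) -> float:
    """Scale = amax / maxbound; the old export path floored it at 1.0."""
    scale = amax / maxbound
    if clamp_floor:
        scale = max(scale, 1.0)  # removed behavior: clamp_(min=1.0)
    return scale

# A calibrated amax of 112 with FP8 maxbound 448 now exports 0.25, not 1.0.
print(export_scale(112.0, 448.0, clamp_floor=True))   # 1.0 (old behavior)
print(export_scale(112.0, 448.0, clamp_floor=False))  # 0.25 (new behavior)
```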
Thanks for the review. Addressed them in the latest commit:

- Renamed `cast_to_fp8` → `use_constant_amax`.
- Added a Backward Breaking Changes section in CHANGELOG.rst documenting the default change.
- Initialized `self._use_constant_amax = False` in `TensorQuantizer.__init__` before `set_from_attribute_config`, then changed all `getattr(module, "_cast_to_fp8", False)` calls to direct `module._use_constant_amax` access.
- Added a description in the breaking changes section advising users to try the casting methods if they see accuracy degradation with calibrated KV cache.
cjluo-nv left a comment:
All concerns from the previous review have been addressed:

- Renamed `cast_to_fp8` → `use_constant_amax` throughout
- Changelog now has a dedicated backward-breaking changes section
- `_use_constant_amax` properly initialized in `__init__`
- Defensive `.pop("default", None)`

LGTM, thanks!
What does this PR do?
Type of change: New feature
- Added `fp8_cast` and `nvfp4_cast` modes for `--kv_cache_qformat` in `hf_ptq.py`. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization.
- The default `--kv_cache_qformat` changed from `fp8` to `fp8_cast`. To restore the previous calibrated behavior, explicitly pass `--kv_cache_qformat fp8`.
- Added a `use_constant_amax` field to `QuantizerAttributeConfig`. When enabled, the quantizer uses a fixed amax and skips calibration. During calibration, these quantizers are disabled to avoid corrupting the forward pass for other quantizers.
- Removed the KV scale clamping (`clamp_(min=1.0)`) in the HF checkpoint export path. Calibrated scales are now exported as-is.
- Fixed `_compute_kv_cache_dtype` to correctly pass `is_affine` (was hardcoded to `True`, causing NVFP4 non-affine to export as `NVFP4_AFFINE`).
- Updated `get_kv_cache_scaling_factor` to handle `None` scaling factors for cast-mode quantizers.

Usage
Testing

- `test_use_constant_amax`, `test_use_constant_amax_skips_calibration`
- `test_ptq_llama` for `fp8_cast`, `fp8`, `nvfp4_cast`

Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: Partially. The default `--kv_cache_qformat` changed from `fp8` to `fp8_cast` (no calibration, no exported KV scales).
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: ✅
Did you update Changelog?: ✅
Additional Information