Conversation

Note: Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Note: Reviews paused. This branch appears to be under active development, so CodeRabbit has automatically paused this review to avoid overwhelming it with comments on new commits.

📝 Walkthrough: Adds the CLI flag `--calibrate_kv_cache` and a constant-amax path for KV cache quantization.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as CLI
    participant HF as hf_ptq.py
    participant Config as QuantizerConfig
    participant Calib as CalibrationEngine
    participant Export as Exporter
    User->>HF: run with/without --calibrate_kv_cache
    HF->>Config: derive KV quantizer configs (deepcopy)
    alt calibrate_kv_cache = false
        HF->>Config: set constant_amax = 448.0 for KV quantizers
        HF->>Calib: skip KV calibration, enable calibration for other quantizers
    else calibrate_kv_cache = true
        HF->>Calib: enable calibration for KV quantizers only (disable others)
    end
    Calib->>Config: collect stats / compute scales
    HF->>Export: postprocess_state_dict (write KV scales if calibrated)
    Export->>User: export checkpoint (KV scales present when calibrated)
```
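The `calibrate_kv_cache = false` branch of the diagram can be sketched in plain Python with toy dicts. The 448.0 constant and the `"*[kv]_bmm_quantizer"` / `constant_amax` keys come from this PR's description; the helper name is an illustration, not the actual `hf_ptq.py` code:

```python
import copy

FP8_E4M3_MAX = 448.0  # i.e. torch.finfo(torch.float8_e4m3fn).max

def build_kv_cfg(base_cfg: dict, calibrate_kv_cache: bool) -> dict:
    """Simulate deriving the KV quantizer config as shown in the diagram."""
    kv_cfg = copy.deepcopy(base_cfg)  # deepcopy so the shared config is not mutated
    if not calibrate_kv_cache:
        # Default path: fixed scale of 1.0 via a constant amax, no calibration.
        kv_cfg["*[kv]_bmm_quantizer"]["constant_amax"] = FP8_E4M3_MAX
    return kv_cfg

base = {"*[kv]_bmm_quantizer": {"num_bits": (4, 3)}}
cfg = build_kv_cfg(base, calibrate_kv_cache=False)
print(cfg["*[kv]_bmm_quantizer"]["constant_amax"])   # 448.0
print("constant_amax" in base["*[kv]_bmm_quantizer"])  # False: original untouched
```

The deepcopy matters: without it, injecting `constant_amax` would leak into the shared config used by the other (still calibrated) quantizers.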
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Force-pushed edb0827 to 4176c6d.
Codecov Report: ❌ patch coverage check failed.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1012      +/-   ##
==========================================
- Coverage   71.73%   70.10%    -1.64%
==========================================
  Files         211      221       +10
  Lines       23948    25541     +1593
==========================================
+ Hits        17180    17905      +725
- Misses       6768     7636      +868
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 4
🧹 Nitpick comments (1)
examples/llm_ptq/hf_ptq.py (1)
304-323: Consider using a defensive `.pop()` to avoid a potential `KeyError`. Line 311 uses `kv_cache_quant_cfg.pop("default")`, which will raise `KeyError` if the `"default"` key doesn't exist. While current KV configs should have this key, a defensive approach would be more robust:

```diff
- kv_cache_quant_cfg.pop("default")  # keep other quantizers from auto_quantize
+ kv_cache_quant_cfg.pop("default", None)  # keep other quantizers from auto_quantize
```

The overall KV cache calibration logic looks correct:

- When `calibrate_kv_cache=False`: uses constant amax=448.0 (scale=1.0)
- When `calibrate_kv_cache=True`: runs data-driven calibration on KV quantizers only

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 304 - 323, Replace the direct pop("default") call with a defensive removal to avoid KeyError: check for the key or use a safe-pop pattern on kv_cache_quant_cfg (the dict built from getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]) before proceeding; keep the rest of the KV cache path the same (including setting kv_cache_quant_cfg["*[kv]_bmm_quantizer"]["constant_amax"] when not args.calibrate_kv_cache and using mtq.set_quantizer_by_cfg / mtq.set_quantizer_by_cfg_context with language_model and mtq.calibrate when args.calibrate_kv_cache).
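The difference between the two `pop` forms is easy to demonstrate with a plain dict standing in for the real `kv_cache_quant_cfg`:

```python
# Toy config dict standing in for the real kv_cache_quant_cfg.
kv_cache_quant_cfg = {"*[kv]_bmm_quantizer": {"enable": True}}

# kv_cache_quant_cfg.pop("default") would raise KeyError here;
# the two-argument form degrades gracefully when the key is absent.
removed = kv_cache_quant_cfg.pop("default", None)
print(removed)              # None
print(kv_cache_quant_cfg)   # dict is left unchanged
```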
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 946-949: The current assignment to
quant_cfg["quant_cfg"]["*[kv]_bmm_quantizer"]["constant_amax"] can raise KeyError for
nvfp4_rotate because that format uses separate "*q_bmm_quantizer",
"*k_bmm_quantizer", "*v_bmm_quantizer" keys; update the branch that runs when
args.kv_cache_qformat != "none" and not args.calibrate_kv_cache to detect
args.kv_cache_qformat == "nvfp4_rotate" and in that case set "constant_amax" on
each of quant_cfg["quant_cfg"]["*q_bmm_quantizer"], "*k_bmm_quantizer" and
"*v_bmm_quantizer" (checking each key exists before assignment), otherwise keep
the existing assignment to "*[kv]_bmm_quantizer" (also guarding existence) so no
KeyError occurs.
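The suggested guard can be sketched with plain dicts as stand-ins for the real quant config. The helper name is an illustration, not the modelopt API; the key names and the 448.0 constant are taken from the comment above:

```python
FP8_E4M3_MAX = 448.0  # assumed constant amax for FP8 E4M3

def set_constant_amax(quant_cfg: dict, kv_cache_qformat: str) -> None:
    """Set constant_amax on whichever KV quantizer keys the format defines."""
    if kv_cache_qformat == "nvfp4_rotate":
        # nvfp4_rotate uses separate q/k/v keys instead of "*[kv]_bmm_quantizer".
        keys = ("*q_bmm_quantizer", "*k_bmm_quantizer", "*v_bmm_quantizer")
    else:
        keys = ("*[kv]_bmm_quantizer",)
    for key in keys:
        if key in quant_cfg:  # guard so a missing key cannot raise KeyError
            quant_cfg[key]["constant_amax"] = FP8_E4M3_MAX

cfg = {"*k_bmm_quantizer": {}, "*v_bmm_quantizer": {}}
set_constant_amax(cfg, "nvfp4_rotate")  # "*q_bmm_quantizer" absent: silently skipped
print(cfg["*k_bmm_quantizer"])  # {'constant_amax': 448.0}
```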
In `@modelopt/torch/quantization/config.py`:
- Around line 1033-1043: The new constant_amax config must be validated to be a
finite, strictly positive number to prevent downstream broken scales; update the
ModeloptField for constant_amax so that its validator keeps None allowed but
rejects any value that is <= 0, NaN, or infinite, raising a clear
validation error. Specifically, add a validation callback or constraint on the
constant_amax ModeloptField declaration (symbol: constant_amax) that uses
math.isfinite(value) and value > 0, and ensure TensorQuantizer._get_amax() can
continue to trust the validated value without additional checks. Ensure the
error message references constant_amax and explains it must be a finite positive
number.
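A minimal sketch of such a validator as a plain function. In the real code this would be attached to the `constant_amax` ModeloptField (e.g. via a pydantic-style validation callback); the standalone function here is only an illustration:

```python
import math

def validate_constant_amax(value):
    """Validator sketch: None stays allowed; otherwise require a finite value > 0."""
    if value is None:
        return value
    if not (isinstance(value, (int, float)) and math.isfinite(value) and value > 0):
        raise ValueError(
            "constant_amax must be a finite positive number (or None), "
            f"got {value!r}"
        )
    return value

print(validate_constant_amax(448.0))  # 448.0
print(validate_constant_amax(None))   # None
```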
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 709-712: The current early `continue` when module._constant_amax
is set skips updating per-module calibration flags and so prevents static-bias
calibrators from collecting bias stats and also leaves stale _if_quant/_if_calib
state; change the branch in model_calib.py so that when getattr(module,
"_constant_amax", None) is not None you do not unconditionally continue but
instead: if module._calibrator indicates a static-bias calibrator (or if
module._calibrator is not None and supports bias calibration) set
module._if_calib = True (and ensure module._if_quant is set/cleared
consistently) so load_calib_bias()/compute_bias() can run, otherwise explicitly
clear module._if_calib and module._if_quant to avoid preserving stale state;
reference the attributes _constant_amax, _calibrator, _if_calib, _if_quant and
the function finish_stats_collection()/load_calib_bias()/compute_bias() when
making the change.
In `@modelopt/torch/quantization/nn/modules/tensor_quantizer.py`:
- Around line 617-618: get_kv_cache_scaling_factor currently assumes every entry
from get_scaling_factor() is non-None and calls factor.item(), which crashes
when a KV quantizer uses constant_amax (export_amax has no _amax buffer and
get_scaling_factor returns None). Fix by guarding against None: in
get_kv_cache_scaling_factor (the loop that iterates scaling_factors when dtype
== KV_CACHE_FP8), either filter out None values before iterating or add an if
factor is None: continue check before calling factor.item(); ensure this
respects constant_amax/_constant_amax semantics so fixed-amax quantizers are
skipped rather than dereferenced.
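The guard amounts to skipping `None` entries before dereferencing them. A toy stand-in for the loop (not the real `get_kv_cache_scaling_factor`):

```python
def kv_cache_scale_items(scaling_factors):
    """Yield only usable scales; constant-amax quantizers report None."""
    for name, factor in scaling_factors:
        if factor is None:
            continue  # fixed-amax quantizer: no exported scale, skip it
        yield name, float(factor)  # stands in for factor.item()

factors = [("k_scale", 0.5), ("v_scale", None)]
print(dict(kv_cache_scale_items(factors)))  # {'k_scale': 0.5}
```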
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e7b126c8-ce02-4772-8f0c-d36987d90865
📒 Files selected for processing (7):
- CHANGELOG.rst
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- tests/_test_utils/torch/quantization/tensor_quantizer_common.py
Force-pushed bde9317 to ba93ebd.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 313-323: The nvfp4_rotate KV format is left uncalibrated because
the code only sets constant_amax when "*[kv]_bmm_quantizer" is present and only
runs calibration when args.calibrate_kv_cache is True; fix by detecting the
nvfp4_rotate KV format and forcing KV calibration or rejecting the combination:
either set args.calibrate_kv_cache = True when the parsed KV format equals
"nvfp4_rotate" before you build kv_cache_quant_cfg (or immediately before
calling mtq.set_quantizer_by_cfg/mtq.calibrate), or validate arguments earlier
and raise an error if --kv_cache_qformat nvfp4_rotate is used without
--calibrate_kv_cache; apply the same change at the other similar spots that call
mtq.set_quantizer_by_cfg, mtq.set_quantizer_by_cfg_context, and mtq.calibrate so
nvfp4_rotate never runs on the uncalibrated default path.
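The "validate arguments earlier" option can be sketched as follows. The flag names come from this PR; the helper itself is hypothetical:

```python
import argparse

def validate_kv_args(args: argparse.Namespace) -> None:
    """Reject the unsupported combination instead of silently skipping calibration."""
    if args.kv_cache_qformat == "nvfp4_rotate" and not args.calibrate_kv_cache:
        raise SystemExit(
            "--kv_cache_qformat nvfp4_rotate requires --calibrate_kv_cache: "
            "this format has no uncalibrated constant-amax path"
        )

ns = argparse.Namespace(kv_cache_qformat="fp8", calibrate_kv_cache=False)
validate_kv_args(ns)  # fine: fp8 has a constant-amax default path
```

Failing fast here keeps all the later `mtq.set_quantizer_by_cfg` / `mtq.calibrate` call sites free of format-specific special cases.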
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5aaec202-14ac-4b91-92cf-eea7abee0dd2
📒 Files selected for processing (1)
examples/llm_ptq/hf_ptq.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 964-975: The current shortcut unconditionally sets constant_amax
to 448.0 for "*[kv]_bmm_quantizer" which is valid only for FP8 E4M3; change the
logic in the block that references kv_quantizer_cfg, args.kv_cache_qformat, and
args.calibrate_kv_cache so that you either (a) only apply the constant_amax
shortcut when the quantizer format is FP8 (detect via format identifier or
kv_quantizer_cfg properties), or (b) compute constant_amax from the quantizer’s
configured maxbound (read the quantizer format’s maxbound property from
kv_quantizer_cfg) and set constant_amax = maxbound instead of the hardcoded
448.0; ensure you deepcopy quant_cfg before mutating "*[kv]_bmm_quantizer" as
already done.
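Option (b) amounts to reading the maxbound per format instead of hardcoding 448.0. A sketch with an illustrative lookup table — only the FP8 E4M3 value (448.0) comes from this thread; the NVFP4 entry and the function are stand-ins:

```python
# Hypothetical per-format maxbounds; only the fp8 value is taken from the
# review thread, the nvfp4 entry is an illustrative stand-in.
FORMAT_MAXBOUND = {"fp8": 448.0, "nvfp4": 6.0}

def constant_amax_for(fmt: str) -> float:
    """Derive constant_amax from the format's maxbound so the scale is 1.0."""
    maxbound = FORMAT_MAXBOUND[fmt]
    # scale = amax / maxbound, so amax = maxbound yields a scale of exactly 1.0
    return maxbound

print(constant_amax_for("fp8"))  # 448.0
```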
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 72a1fc78-55a0-4637-b95d-7e2673599983
📒 Files selected for processing (1)
examples/llm_ptq/hf_ptq.py
Force-pushed f238de1 to 0a0439c.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/gpu/torch/export/test_export.py (1)
211-225: ⚠️ Potential issue | 🟡 Minor — Cover the new default “no KV scale emitted” branch.

These updated assertions only validate the calibrated path (`_amax` present → exported scale = `amax / maxbound`). The PR’s default behavior is the opposite branch: with `constant_amax`, `postprocess_state_dict()` should omit `k_scale`/`v_scale` entirely so downstream falls back to 1.0. Please add a `KV_CACHE_FP8` case with no `k_bmm_quantizer._amax`/`v_bmm_quantizer._amax` entries; otherwise the main behavior change can regress unnoticed.

Suggested test addition:

```diff
 @pytest.mark.parametrize(
     ("state_dict", "quantization", "maxbound", "expected_state_dict"),
     [
+        (  # Default constant-amax KV cache path should emit no KV scales
+            {
+                "layer1.input_quantizer._pre_quant_scale": torch.tensor([0.128]),
+            },
+            KV_CACHE_FP8,
+            128.0,
+            {
+                "layer1.pre_quant_scale": torch.tensor([0.128]),
+            },
+        ),
         (  # Test replacements and KV cache scaling
             {
                 "layer1.k_bmm_quantizer._amax": torch.tensor([0.128]),
                 "layer1.v_bmm_quantizer._amax": torch.tensor([256.0]),
                 "layer1.input_quantizer._pre_quant_scale": torch.tensor([0.128]),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/export/test_export.py` around lines 211 - 225, Add a test that covers the default "no KV scale emitted" branch by invoking postprocess_state_dict() with KV_CACHE_FP8 and constant_amax but without providing "layer1.k_bmm_quantizer._amax" or "layer1.v_bmm_quantizer._amax" keys; assert that "layer1.k_proj.k_scale" and "layer1.v_proj.v_scale" are not present in the returned state dict (i.e., omitted so downstream will default to 1.0) rather than being computed, mirroring the existing calibrated-case tests that include _amax keys.
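The intent of the suggested test can be shown with a toy model of the postprocessing — a simulation of the behavior described above, not the real `postprocess_state_dict`:

```python
def postprocess_state_dict_sim(state_dict: dict, maxbound: float) -> dict:
    """Toy model: emit k_scale / v_scale only when a calibrated _amax exists."""
    out = {}
    for key, value in state_dict.items():
        if key.endswith("k_bmm_quantizer._amax"):
            out[key.replace("k_bmm_quantizer._amax", "k_proj.k_scale")] = value / maxbound
        elif key.endswith("v_bmm_quantizer._amax"):
            out[key.replace("v_bmm_quantizer._amax", "v_proj.v_scale")] = value / maxbound
        elif key.endswith("input_quantizer._pre_quant_scale"):
            out[key.replace("input_quantizer._pre_quant_scale", "pre_quant_scale")] = value
    return out

# Default constant-amax path: no _amax entries, so no KV scales are emitted
# and downstream falls back to a scale of 1.0.
result = postprocess_state_dict_sim(
    {"layer1.input_quantizer._pre_quant_scale": 0.128}, maxbound=448.0
)
print("layer1.k_proj.k_scale" in result)  # False
```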
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: beab94cb-122c-4edd-94aa-71aa4021788c
📒 Files selected for processing (1)
tests/gpu/torch/export/test_export.py
Force-pushed e30aded to c2e91ed.
Review thread on the new field in `modelopt/torch/quantization/config.py`:

```python
constant_amax: float | None = ModeloptField(
```
@realAsma please share your opinion on this one. My opinion is that we should not introduce this.

@cjluo-nv what would you suggest instead? I thought we wanted a way to specify whether to use constant amax or calibrated amax.

Could you please give a few suggestions to do this?

My thinking is that we should have a way to simulate the inference behavior of dummy scales, so this option is added here.
Pull request overview
This PR changes the default KV cache quantization behavior in hf_ptq.py. Previously, FP8 KV cache quantization was the default and required data-driven calibration. Now, KV cache quantization defaults to none, and when enabled, uses a constant scale of 1.0 (amax=448.0) without calibration by default. A new --calibrate_kv_cache flag opts into the previous data-driven calibration behavior. The implementation adds a constant_amax field to QuantizerAttributeConfig that allows quantizers to skip calibration entirely and use a fixed amax value.
Changes:

- Added a `constant_amax` field to `QuantizerAttributeConfig` and integrated it into `TensorQuantizer._get_amax()` and `enable_stats_collection()` to skip calibration for constant-amax quantizers.
- Changed the `--kv_cache_qformat` default from `fp8` to `none`, added the `--calibrate_kv_cache` flag, and removed the KV scale floor/clamp logic in the export path.
- Added unit tests for `constant_amax` behavior and updated export tests to match the removed clamping.
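The `_get_amax` short-circuit described above can be sketched with a minimal stand-in class (not the real `TensorQuantizer`; the attribute names mirror the PR's description):

```python
class TinyQuantizer:
    """Stand-in showing the constant-amax short-circuit."""
    def __init__(self, constant_amax=None):
        self._constant_amax = constant_amax
        self._amax = None  # would normally be filled by calibration

    def _get_amax(self):
        if self._constant_amax is not None:
            return self._constant_amax  # fixed amax: calibration never needed
        if self._amax is None:
            raise RuntimeError("quantizer was never calibrated")
        return self._amax

q = TinyQuantizer(constant_amax=448.0)
print(q._get_amax())  # 448.0, with no calibration ever run
```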
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| `modelopt/torch/quantization/config.py` | Added `constant_amax` field to `QuantizerAttributeConfig` |
| `modelopt/torch/quantization/nn/modules/tensor_quantizer.py` | Added `constant_amax` to attribute setter mapping and `_get_amax` short-circuit |
| `modelopt/torch/quantization/model_calib.py` | Skip calibration for `constant_amax` quantizers in `enable_stats_collection` |
| `modelopt/torch/export/quant_utils.py` | Handle `None` scaling factors; remove KV scale floor/clamp |
| `examples/llm_ptq/hf_ptq.py` | Changed defaults, added `--calibrate_kv_cache`, `constant_amax` injection logic |
| `tests/_test_utils/torch/quantization/tensor_quantizer_common.py` | Added tests for `constant_amax` behavior |
| `tests/gpu/torch/export/test_export.py` | Updated expected values after removing KV scale clamping |
| `CHANGELOG.rst` | Documented new features and changes |
Edwardf0t1 left a comment:
I think currently by default we still calibrate kv cache for fp8, right? But in the export stage, most values are clamped to 1.0.
@Edwardf0t1, yes that's correct. An example would be Nemotron models: all KV scales are actually clamped to 1.0.
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
cjluo-nv left a comment:
Review
The core logic is sound — skipping calibration for KV cache when scale=1.0 is the engine default is a good perf win. Tests look solid. A few concerns before approving:
1. Naming: cast_to_fp8 is misleading (medium)
The field is used for both fp8_cast and nvfp4_cast, yet _get_amax always returns torch.finfo(torch.float8_e4m3fn).max (448.0). For an NVFP4 quantizer, a field named cast_to_fp8 is confusing. Consider renaming to something more general like use_constant_amax or skip_calibration, or parameterize the amax value. Same applies to the helper _set_kv_cache_cast_to_fp8 which is called for nvfp4_cast too.
2. Breaking default behavior change (medium-high)
Default --kv_cache_qformat changed from fp8 → fp8_cast. This silently changes behavior for every existing script that doesn't explicitly set --kv_cache_qformat. The changelog entry should call this out as a default behavior change, not just a new feature.
3. getattr(module, "_cast_to_fp8", False) is fragile (low)
In model_calib.py, _cast_to_fp8 is checked via getattr with a default. The attribute should be initialized to False in TensorQuantizer.__init__ so it's always present, rather than relying on dynamic attribute checking.
4. Removed KV scale clamping has broader impact (low-medium)
The clamp_(min=1.0) removal in postprocess_state_dict and the floor removal in get_kv_cache_scaling_factor affect all KV export paths, not just cast_to_fp8. Users on --kv_cache_qformat fp8 (calibrated) with scales < 1.0 will now get different exported values. Please confirm downstream engines handle small scales correctly.
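Concern 4 is easy to quantify: with the floor removed, any calibrated amax below the format's maxbound now exports a scale below 1.0. A sketch of the arithmetic (the 448.0 maxbound is from this PR; the sample amax is illustrative):

```python
def export_scale(amax: float, maxbound: float, clamp_floor: bool) -> float:
    """Scale = amax / maxbound; the old export path floored it at 1.0."""
    scale = amax / maxbound
    if clamp_floor:
        scale = max(scale, 1.0)  # removed behavior: clamp_(min=1.0)
    return scale

# A calibrated amax of 112 with FP8 maxbound 448 now exports 0.25, not 1.0.
print(export_scale(112.0, 448.0, clamp_floor=True))   # 1.0 (old behavior)
print(export_scale(112.0, 448.0, clamp_floor=False))  # 0.25 (new behavior)
```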
Thanks for the review. Addressed them in the latest commit:

- Renamed `cast_to_fp8` → `use_constant_amax`.
- Added a Backward Breaking Changes section in CHANGELOG.rst documenting the default change.
- Initialized `self._use_constant_amax = False` in `TensorQuantizer.__init__` before `set_from_attribute_config`, then changed all `getattr(module, "_cast_to_fp8", False)` calls to direct `module._use_constant_amax` access.
- Added a description in the breaking changes section advising users to try the casting methods if they see accuracy degradation with calibrated KV cache.
cjluo-nv left a comment:
All concerns from the previous review have been addressed:

- Renamed `cast_to_fp8` → `use_constant_amax` throughout
- Changelog now has a dedicated backward-breaking changes section
- `_use_constant_amax` properly initialized in `__init__`
- Defensive `.pop("default", None)`

LGTM, thanks!
What does this PR do?
Type of change: New feature
- Added `fp8_cast` and `nvfp4_cast` modes for `--kv_cache_qformat` in `hf_ptq.py`. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization.
- The default `--kv_cache_qformat` changed from `fp8` to `fp8_cast`. To restore the previous calibrated behavior, explicitly pass `--kv_cache_qformat fp8`.
- Added a `use_constant_amax` field to `QuantizerAttributeConfig`. When enabled, the quantizer uses a fixed amax and skips calibration. During calibration, these quantizers are disabled to avoid corrupting the forward pass for other quantizers.
- Removed the KV scale clamping (`clamp_(min=1.0)`) in the HF checkpoint export path. Calibrated scales are now exported as-is.
- Fixed `_compute_kv_cache_dtype` to correctly pass `is_affine` (was hardcoded to `True`, causing NVFP4 non-affine to export as `NVFP4_AFFINE`).
- Updated `get_kv_cache_scaling_factor` to handle `None` scaling factors for cast-mode quantizers.

Usage
Testing

- `test_use_constant_amax`, `test_use_constant_amax_skips_calibration`
- `test_ptq_llama` for `fp8_cast`, `fp8`, `nvfp4_cast`

Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: Partially. The default `--kv_cache_qformat` changed from `fp8` to `fp8_cast` (no calibration, no exported KV scales).
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: ✅
Did you update Changelog?: ✅
Additional Information