[Feature Request] Support ONNX Q/DQ Autotuning with Subgraph Mode #1015

Open
Hale423 wants to merge 9 commits into NVIDIA:main from Hale423:dev-wahao-autotune-subgraph-profile

Conversation


@Hale423 Hale423 commented Mar 10, 2026

Pull Request: ONNX Q/DQ Autotuning with Subgraph Mode

Branch: dev-wahao-autotune-subgraph-profile → main
Type: Feature


Summary

This PR adds automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models using TensorRT performance measurements. It introduces two workflow modes:

  1. Region mode (default) – Pattern-based region analysis with random mutation; optimizes by region pattern and reuses schemes across similar regions.
  2. Subgraph mode – Fusion-aware grouping derived from TensorRT’s layer/fusion boundaries (graph.json); profiles isolated subgraphs for much faster tuning on large or dynamic-shape models (~30 min vs ~25 h in practice).

Subgraph mode is the main addition over a baseline “auto QDQ placement” implementation: it uses TRT fusion info, optional per-layer timing, incremental full-model validation, and cache/resume.


What’s New (vs main)

  • Autotune package – New modelopt.onnx.quantization.autotune package: region discovery, scheme generation, TensorRT benchmarking (Python API + optional trtexec), pattern cache, QDQ baseline import.
  • Subgraph workflow – --mode subgraph: fusion-aware grouping from TensorRT graph.json; per-subgraph QDQ scheme profiling; optional per-layer timing when trtexec supports it (with fallback to total latency).
  • Fusion grouping – fusion_grouping.py: parse TRT graph.json, build fusion groups, infer shapes for extracted subgraphs. If --graph-json is omitted, runs trtexec once to generate graph.json (FP16 build with --exportLayerInfo).
  • Incremental validation – Optional per-group full-model check: apply QDQ groups one by one and keep only those that improve latency; saves optimized_raw.onnx (all qualifying QDQ) and optimized_final.onnx (validated). Default: on (--incremental-validation); use --no-incremental-validation to disable.
  • Cache / resume – Subgraph mode writes progress to autotune_cache.json for Phase 2 (subgraph profiling) and Phase 3 (incremental validation). Re-running the same command resumes from the last checkpoint.
  • trtexec path – --use-trtexec plus --trtexec-args for benchmarking with dynamic shapes (e.g. --optShapes) and custom options (e.g. --useCudaGraph, --stronglyTyped). trtexec profiling flags are optional; on “Unknown option” the code strips them and retries (fallback to total latency).
  • Example – New examples/qdq_placement/: README (Quick Start, region vs subgraph, output layout, subgraph best practices) and set_batch_size.py for fixed-batch ResNet50.

Key Files

  • modelopt/onnx/quantization/autotune/__main__.py – CLI: --mode, --graph-json, --incremental-validation, --use-trtexec, --trtexec-args, etc.
  • modelopt/onnx/quantization/autotune/subgraph_workflow.py – Subgraph pipeline: Phase 1 (fusion grouping), Phase 2 (subgraph profiling), Phase 3 (full-model + incremental validation), cache I/O.
  • modelopt/onnx/quantization/autotune/fusion_grouping.py – Parse graph.json, create fusion groups, generate_graph_json() (trtexec FP16 build when no graph is provided).
  • modelopt/onnx/quantization/autotune/subgraph_extractor.py – Extract subgraph ONNX from the full model given group inputs/outputs and shapes.
  • modelopt/onnx/quantization/autotune/tensorrt_utils.py – trtexec benchmark runner: optional export_profile_path, profiling-flag dedup, and “Unknown option” retry without profiling.
  • modelopt/onnx/quantization/autotune/workflows.py – Dispatcher and benchmark_onnx_model(); passes through export_profile_path when using trtexec.
  • modelopt/onnx/quantization/autotune/autotuner.py – Region-mode autotuner (pattern discovery, scheme generation, state/pattern cache).
  • modelopt/onnx/quantization/autotune/region_*.py – Region search and pattern handling for region mode.
  • examples/qdq_placement/README.md – User-facing example: prerequisites, Quick Start (region + subgraph), output layout, subgraph best practices, optional graph.json behavior.
  • examples/qdq_placement/set_batch_size.py – ResNet50 fixed-batch script for reproducible benchmarking.

How to Test

Region mode (no trtexec):

cd examples/qdq_placement
curl -L -o resnet50_Opset17.onnx https://github.com/onnx/models/raw/main/Computer_Vision/resnet50_Opset17_torch_hub/resnet50_Opset17.onnx
python3 set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx
python3 -m modelopt.onnx.quantization.autotune --model resnet50.bs128.onnx --output ./resnet50_results --quant-type int8 --schemes-per-region 20
# Expect: ./resnet50_results/optimized_final.onnx and logs under ./resnet50_results/logs/

Subgraph mode with trtexec (FP8, optional graph.json):

python3 -m modelopt.onnx.quantization.autotune \
  --model resnet50.bs128.onnx \
  --output ./resnet50_subgraph \
  --mode subgraph \
  --quant-type fp8 \
  --use-trtexec \
  --warmup-runs 5 \
  --timing-runs 20 \
  --incremental-validation \
  --trtexec-args "--stronglyTyped" \
  --schemes-per-region 30
# If --graph-json is omitted, first run will trigger trtexec to generate graph.json under output dir.
# Expect: optimized_raw.onnx, optimized_final.onnx, autotune_cache.json, logs/, subgraphs/

Resume: Kill the subgraph run mid-way, then re-run the same command; it should resume from autotune_cache.json.


Checklist

  • CI / unit tests pass (if applicable).
  • Region mode runs end-to-end (ResNet50 or equivalent).
  • Subgraph mode runs end-to-end with --use-trtexec (with or without --graph-json).
  • Without --graph-json, one trtexec FP16 build runs and produces *.fp16.graph.json in the output dir.
  • Interrupted subgraph run resumes after re-run.
  • examples/qdq_placement/README.md matches behavior (region vs subgraph, outputs, best practices).

Documentation

  • Example: examples/qdq_placement/README.md – Quick Start, subgraph best practices, output layout, optional graph generation.
  • Guides / API: This branch may add or update docs/source/guides/9_qdq_placement.rst and docs/source/reference/2_qdq_placement.rst; confirm they align with the CLI and behavior above when submitting.

Notes

  • main does not currently contain the autotune package; this PR adds it in full (region + subgraph).
  • Subgraph mode is recommended for large or dynamic-shape models; region mode remains the default for compatibility and smaller models.
  • trtexec versions that do not support --exportProfile / --profilingVerbosity are handled by retrying without those flags and using total latency for scheme selection.

willg-nv and others added 9 commits December 8, 2025 04:51
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
- Add export_profile_path support; append --exportProfile/--profilingVerbosity when requested
- Skip adding --separateProfileRun if already present in user trtexec args
- On trtexec 'Unknown option' error, strip profiling flags and retry once without them
- Set _profile_unsupported so later runs use total-latency comparison only
- Extract _exec_and_log for shared run-and-log logic

Made-with: Cursor
@Hale423 Hale423 requested review from a team as code owners March 10, 2026 02:14
@Hale423 Hale423 requested a review from cjluo-nv March 10, 2026 02:14

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@cjluo-nv cjluo-nv requested review from ajrasane and gcunhase March 10, 2026 06:38
Collaborator

@cjluo-nv cjluo-nv left a comment


This PR introduces 16k+ lines of changes. Please consider sharing a design and getting a design review.

@Hale423
Author

Hale423 commented Mar 10, 2026

Thanks for the feedback.

I'm sharing the design below; please take a look.

Design: ONNX Q/DQ Autotuning for TensorRT

Design review document for PR #1015
Branch: dev-wahao-autotune-subgraph-profile
Target: main

1. Background

TensorRT performance for quantized ONNX models depends not only on whether Q/DQ nodes exist, but also on where they are inserted. In practice:

  • Quantizing too aggressively can break fusion opportunities or introduce reformat overhead.
  • Quantizing too conservatively leaves performance on the table.
  • The best placement strategy is model-specific and often difficult to determine manually.

This branch introduces an ONNX Q/DQ autotuning system that searches for better Q/DQ placement using actual TensorRT latency measurements.

The design intentionally supports two workflows:

  1. Region mode for structure-aware, pattern-based optimization across the whole model.
  2. Subgraph mode for faster, fusion-aware tuning on large or dynamic-shape models.

2. Goals

  • Provide an end-to-end ONNX Q/DQ autotuning workflow that optimizes for TensorRT latency.
  • Reuse optimization results across structurally similar parts of the same model.
  • Support warm-start from:
    • Previously learned pattern caches
    • Existing QDQ baseline ONNX models
  • Support large and dynamic-shape models via a trtexec-based path.
  • Provide resumable workflows for long-running tuning jobs.
  • Expose the feature through a simple CLI and user-facing example.

3. Non-goals

  • This system does not perform accuracy validation or calibration quality evaluation.
  • This system does not try to search all possible Q/DQ placements exhaustively.
  • This system does not replace standard PTQ/QAT flows; it only optimizes placement.
  • This system does not target backends other than TensorRT.

4. Scope Relative to main

main does not contain the ONNX Q/DQ autotuning package. This branch adds:

  • A new modelopt.onnx.quantization.autotune package
  • CLI support for region and subgraph autotuning workflows
  • TensorRT benchmarking utilities (Python API and trtexec)
  • Region discovery, pattern matching, insertion-point modeling, scheme generation, state persistence, and warm-start support
  • A subgraph workflow based on TensorRT fusion boundaries
  • User documentation and examples

5. User Experience

5.1 Region Mode

Region mode is the default workflow:

python3 -m modelopt.onnx.quantization.autotune \
    --model model.onnx \
    --output ./results \
    --quant-type int8 \
    --schemes-per-region 30

User-visible properties:

  • Automatically discovers optimization regions
  • Groups structurally equivalent regions into patterns
  • Profiles multiple insertion schemes per pattern
  • Saves:
    • baseline.onnx
    • optimized_final.onnx
    • autotuner_state.yaml
    • autotuner_state_pattern_cache.yaml

5.2 Subgraph Mode

Subgraph mode is intended for large and/or dynamic-shape models:

python3 -m modelopt.onnx.quantization.autotune \
    --model model.onnx \
    --output ./results_subgraph \
    --mode subgraph \
    --quant-type fp8 \
    --use-trtexec \
    --incremental-validation \
    --trtexec-args "--stronglyTyped --optShapes=input:1x3x224x224"

User-visible properties:

  • Uses TensorRT layer/fusion information from graph.json
  • If --graph-json is not provided, runs trtexec once to generate it
  • Profiles isolated subgraphs
  • Saves:
    • optimized_raw.onnx
    • optimized_final.onnx
    • autotune_cache.json
    • logs/
    • subgraphs/

6. High-level Architecture

The implementation is organized around four layers.

6.1 Core Modeling Layer

Files:

  • common.py
  • insertion_points.py
  • region_pattern.py
  • qdq_utils.py

Core abstractions:

  • Region: hierarchical optimization unit in the graph
  • RegionPattern: deterministic structural signature used to group equivalent regions
  • InsertionScheme: a candidate set of Q/DQ insertion points
  • PatternSchemes: all candidate schemes for one pattern plus measurements
  • PatternCache: reusable best schemes across runs
  • Insertion point types:
    • NodeInputInsertionPoint
    • ChildRegionInputInsertionPoint
    • RegionOutputInsertionPoint
    • ResolvedInsertionPoint

Key design choice:

  • Schemes are represented using pattern-relative insertion points, not absolute node IDs.
    This allows one measured scheme to be reused for every region instance with the same structure.
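This idea can be illustrated with a minimal sketch (hypothetical class and field names; the real branch's InsertionScheme/ResolvedInsertionPoint types are richer):

```python
from dataclasses import dataclass

# Hypothetical sketch: a pattern-relative insertion point names a position
# inside a pattern ("node index 2, input 1"), not a concrete ONNX node.
@dataclass(frozen=True)
class PatternRelativePoint:
    node_index: int   # index of the node within the pattern's canonical order
    input_index: int  # which input of that node receives the Q/DQ pair

def resolve(point: PatternRelativePoint, region_nodes: list) -> tuple:
    """Map a pattern-relative point onto one concrete region instance."""
    return region_nodes[point.node_index], point.input_index

# The same measured scheme applies to every region with the same structure:
scheme = [PatternRelativePoint(0, 0), PatternRelativePoint(2, 1)]
layer0 = ["/encoder/layer_0/q_proj/MatMul", "/encoder/layer_0/Add", "/encoder/layer_0/Softmax"]
layer1 = ["/encoder/layer_1/q_proj/MatMul", "/encoder/layer_1/Add", "/encoder/layer_1/Softmax"]
resolved0 = [resolve(p, layer0) for p in scheme]
resolved1 = [resolve(p, layer1) for p in scheme]
```

One scheme measured on layer_0 resolves without change onto layer_1, which is what makes the pattern cache transferable.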

6.2 Region-mode Optimization Layer

Files:

  • autotuner.py
  • torch_region_builder.py
  • region_search.py
  • workflows.py

Responsibilities:

  • Discover optimization regions from the ONNX graph
  • Build structural patterns
  • Generate candidate insertion schemes
  • Export test models with current/best schemes applied
  • Record measurements and update best-known schemes
  • Persist and restore session state

Current implementation detail:

  • The default region discovery path in QDQAutotuner._search_regions() uses TorchRegionBuilder.
  • TorchRegionBuilder builds hierarchical regions from PyTorch-style ONNX node names.
  • region_search.py provides a more general region-search foundation and algorithms, but this branch’s default autotuner path is centered on the torch-name-based hierarchy.

6.3 Subgraph-mode Optimization Layer

Files:

  • subgraph_workflow.py
  • fusion_grouping.py
  • subgraph_extractor.py

Responsibilities:

  • Obtain TensorRT fusion boundaries from graph.json
  • Convert fused TRT layers into ONNX-level optimization groups
  • Extract standalone ONNX subgraphs per group
  • Test heuristic QDQ schemes per group
  • Merge accepted schemes back into the full model
  • Optionally validate improvements incrementally on the full model
  • Cache progress for resume

6.4 Benchmarking Layer

Files:

  • tensorrt_utils.py
  • workflows.py

Responsibilities:

  • Benchmark ONNX models using either:
    • TensorRT Python API
    • trtexec
  • Reuse timing caches and persistent benchmark instances
  • Support plugin libraries
  • Support dynamic-shape arguments
  • Export per-layer profiles when trtexec supports them
  • Fall back gracefully when profiling flags are unsupported

7. Region Mode Design

7.1 Why Region Mode Exists

Region mode is intended to solve the general case:

  • The model may not have TensorRT fusion metadata available yet
  • Similar subgraphs may appear repeatedly across the model
  • A single structural pattern may benefit from the same placement policy everywhere

This workflow optimizes patterns, not isolated instances.

7.2 Region Discovery

The branch uses TorchRegionBuilder as the default discovery mechanism because many production ONNX models are exported from PyTorch and preserve hierarchical naming such as:

/encoder/layer_0/self_attn/q_proj/MatMul

This naming convention enables:

  • Stable hierarchical grouping
  • More semantically meaningful optimization units
  • Better scheme reuse across repeated blocks (e.g. transformer layers)

TorchRegionBuilder:

  • Builds a path trie from node names
  • Creates composite and leaf regions
  • Merges small neighboring regions
  • Probes epilogues and computes region boundaries
  • Can filter out non-quantizable regions
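The name-based grouping can be sketched in toy form (this is an illustration of the idea, not TorchRegionBuilder's actual trie implementation):

```python
from collections import defaultdict

def group_by_prefix(node_names, depth=2):
    """Toy sketch: group PyTorch-exported ONNX node names by their leading
    path components, approximating hierarchical region discovery."""
    groups = defaultdict(list)
    for name in node_names:
        parts = [p for p in name.split("/") if p]  # drop empty leading part
        key = "/".join(parts[:depth])
        groups[key].append(name)
    return dict(groups)

nodes = [
    "/encoder/layer_0/self_attn/q_proj/MatMul",
    "/encoder/layer_0/self_attn/k_proj/MatMul",
    "/encoder/layer_1/self_attn/q_proj/MatMul",
]
regions = group_by_prefix(nodes)
# Both layer_0 MatMuls land in one group; layer_1 forms its own group.
```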

7.3 Pattern-based Scheme Reuse

Once regions are discovered:

  1. A RegionPattern is built from region topology and op types.
  2. Regions with identical signatures share a single PatternSchemes object.
  3. Candidate schemes are generated and measured for that pattern.
  4. The best scheme is reused for all matching regions.

This reduces duplicated work and makes the search tractable.
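A toy version of the signature-based grouping (the real RegionPattern signature also encodes topology, not just the op-type sequence):

```python
def pattern_signature(region_ops):
    """Toy structural signature: the ordered tuple of op types."""
    return tuple(region_ops)

regions = {
    "layer_0": ["MatMul", "Add", "Relu"],
    "layer_1": ["MatMul", "Add", "Relu"],
    "head":    ["MatMul", "Softmax"],
}

# Regions with identical signatures share one scheme pool.
patterns = {}
for name, ops in regions.items():
    patterns.setdefault(pattern_signature(ops), []).append(name)
```

Here layer_0 and layer_1 collapse into one pattern, so a scheme measured once covers both.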

7.4 Warm Start

The region workflow supports two warm-start mechanisms:

  • Pattern cache: load prior best schemes by pattern signature
  • QDQ baseline import: inspect an already-quantized model and infer pattern-level insertion points from it

This allows transfer learning across:

  • Different runs of the same model
  • Similar models
  • User-provided manually tuned quantized baselines

7.5 State and Resume

Region mode saves:

  • Baseline latency
  • Profiled schemes and latencies
  • Best scheme selection
  • Region discovery results

The state file is autotuner_state.yaml, allowing restart from the last profiled region.

8. Subgraph Mode Design

8.1 Why Subgraph Mode Exists

Region mode can become expensive for large models because each candidate scheme requires building and profiling increasingly quantized versions of the full model.

Subgraph mode addresses this by:

  • Aligning optimization units with TensorRT fusion groups
  • Profiling isolated subgraphs instead of the whole model
  • Using a small, heuristic scheme set instead of a broad mutation search

This makes the workflow significantly faster on large and dynamic-shape models.

8.2 Phase 1: Fusion-aware Grouping

Subgraph mode begins by obtaining TensorRT layer information:

  • If --graph-json is provided, use it directly.
  • Otherwise, call generate_graph_json() in fusion_grouping.py.

generate_graph_json() runs trtexec once in FP16 mode and writes:

  • <model>.fp16.graph.json
  • <model>.fp16.build.log

The exported layer info is then parsed to create fusion groups that map TensorRT fused layers back to ONNX nodes.
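As a rough sketch of that parsing step, assuming each TRT layer entry carries a "Name" in which fused ONNX nodes are joined with " + " (the exact graph.json schema varies by TensorRT version, so treat this as illustrative only):

```python
import json

def fusion_groups_from_graph_json(graph_json_text):
    """Split each TensorRT layer name into the ONNX node names it fused."""
    layers = json.loads(graph_json_text).get("Layers", [])
    groups = []
    for layer in layers:
        name = layer["Name"] if isinstance(layer, dict) else layer
        groups.append([part.strip() for part in name.split(" + ")])
    return groups

sample = '{"Layers": [{"Name": "/conv1/Conv + /relu1/Relu"}, {"Name": "/fc/Gemm"}]}'
groups = fusion_groups_from_graph_json(sample)
# First TRT layer maps back to two fused ONNX nodes, second to one.
```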

8.3 Phase 2: Subgraph Profiling

For each quantizable fusion group:

  1. Resolve boundary tensors
  2. Extract a standalone ONNX subgraph
  3. Build a small heuristic set of schemes, for example:
    • baseline
    • full QDQ
    • weight-only
    • activation-only
  4. Benchmark each scheme
  5. Choose the best scheme for that group

Dynamic-shape handling:

  • The workflow parses --minShapes/--optShapes/--maxShapes from --trtexec-args
  • Infers intermediate tensor shapes
  • Constructs subgraph-specific shape arguments for extracted subgraphs
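The shape-argument parsing step can be sketched as follows (simplified; real tensor names may contain escaped colons, which this toy parser does not handle):

```python
def parse_shapes_arg(arg):
    """Parse a trtexec-style shape spec like 'input:1x3x224x224,mask:1x128'
    into {tensor_name: [dims]}."""
    shapes = {}
    for spec in arg.split(","):
        name, dims = spec.rsplit(":", 1)  # rsplit tolerates ':' in names
        shapes[name] = [int(d) for d in dims.split("x")]
    return shapes

shapes = parse_shapes_arg("input:1x3x224x224,mask:1x128")
```

The same parsed dictionary can then be re-serialized with subgraph boundary tensors substituted in, producing per-subgraph --optShapes arguments.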

8.4 Per-layer Comparison

A major issue with isolated subgraph benchmarking is that total subgraph latency can be dominated by artifacts such as:

  • Reformat layers
  • Transposes
  • Layout adaptation overhead

To reduce this noise, subgraph mode attempts to use trtexec per-layer profile export:

  • --exportProfile
  • --profilingVerbosity=detailed
  • --separateProfileRun

If supported, the workflow:

  • Parses the per-layer profile JSON
  • Filters out reformat-style overhead
  • Compares the time spent on compute layers

If the local trtexec version does not support those flags, the code falls back to total subgraph latency.
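A sketch of the filtering idea, assuming the --exportProfile output is a JSON list of per-layer entries with "name" and "averageMs" keys (the actual schema differs across trtexec versions, and the real filter is more careful than a substring match):

```python
import json

def compute_time_ms(profile_json_text):
    """Sum per-layer times while skipping reformat-style overhead layers."""
    total = 0.0
    for entry in json.loads(profile_json_text):
        if "averageMs" not in entry:
            continue  # header/metadata entries
        name = entry.get("name", "").lower()
        if "reformat" in name or "transpose" in name:
            continue  # layout-adaptation artifacts, not real compute
        total += entry["averageMs"]
    return total

sample = json.dumps([
    {"name": "conv1", "averageMs": 0.12},
    {"name": "Reformatting CopyNode", "averageMs": 0.30},
    {"name": "fc", "averageMs": 0.05},
])
compute_ms = compute_time_ms(sample)  # counts conv1 + fc only
```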

8.5 Phase 3: Full-model Merge and Incremental Validation

The subgraph workflow distinguishes two outputs:

  • optimized_raw.onnx: all qualifying groups applied together
  • optimized_final.onnx: final validated model after incremental validation

Incremental validation exists because multiple individually beneficial subgraph changes can still regress full-model latency when combined.

When enabled:

  1. Start from the FP16 baseline full model
  2. Apply candidate groups one at a time
  3. Benchmark the full model after each change
  4. Keep the change only if latency improves

This design preserves:

  • A raw “all accepted by local heuristic” model
  • A validated “accepted by full-model measurement” model
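The validation loop reduces to a greedy keep-if-improves policy, sketched here with a toy benchmark function standing in for a real TensorRT build-and-measure:

```python
def incremental_validate(baseline_latency, candidates, benchmark):
    """Greedy loop: apply each candidate QDQ group on top of the kept set,
    keep it only if full-model latency improves.
    `benchmark(applied)` returns latency with that set of groups applied."""
    best = baseline_latency
    kept = []
    for group in candidates:
        latency = benchmark(kept + [group])
        if latency < best:
            best = latency
            kept.append(group)
    return kept, best

# Toy benchmark: groups "a" and "c" each shave latency, "b" regresses it.
effects = {"a": -1.0, "b": +0.5, "c": -0.3}
bench = lambda applied: 10.0 + sum(effects[g] for g in applied)
kept, final = incremental_validate(10.0, ["a", "b", "c"], bench)
# "b" is rejected because combining it with "a" is slower than "a" alone.
```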

8.6 Cache and Resume

Subgraph mode uses autotune_cache.json in the output directory.

It stores:

  • Phase 2 group profiling results
  • Phase 3 validation progress
  • Current latency and keep/reject decisions

This enables:

  • Resuming long runs after interruption
  • Restarting directly at incremental validation if Phase 2 is already complete
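A minimal sketch of the checkpointing pattern (illustrative field names; the real cache layout is defined in subgraph_workflow.py). Writing via a temp file plus rename keeps an interrupted run from leaving a truncated cache behind:

```python
import json, os, tempfile

def save_checkpoint(path, state):
    """Write the cache atomically: temp file, then rename over the target."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return saved progress, or an empty state for a fresh run."""
    if not os.path.exists(path):
        return {"phase2": {}, "phase3": {}}
    with open(path) as f:
        return json.load(f)

cache = os.path.join(tempfile.mkdtemp(), "autotune_cache.json")
save_checkpoint(cache, {"phase2": {"group_0": {"best": "weight_only"}}, "phase3": {}})
state = load_checkpoint(cache)  # a re-run would skip group_0 in Phase 2
```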

9. Benchmarking Design

9.1 Python API Benchmark

The Python API path is used for standard latency measurement when trtexec is not required.

Design decisions:

  • Reuse a global benchmark instance
  • Reuse TensorRT builder/runtime state
  • Reuse timing cache across measurements

This reduces repeated benchmark setup overhead.

9.2 trtexec Benchmark

trtexec is needed for:

  • Dynamic-shape benchmarking with custom shape flags
  • Generating TensorRT graph.json
  • Optional per-layer profile export

The trtexec path accepts a free-form --trtexec-args string so users can pass deployment-specific settings.

9.3 Profiling-flag Compatibility Fallback

Different TensorRT/trtexec versions do not always support the same profiling flags.

The branch therefore implements the following behavior:

  1. Attempt to run with per-layer profiling flags if a profile is requested
  2. If trtexec fails with Unknown option, remove profiling-related flags
  3. Retry once without profiling
  4. Mark profiling as unsupported for subsequent runs

This keeps subgraph mode functional even on older or restricted trtexec versions.
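The retry policy can be sketched as follows, with a fake runner standing in for a trtexec subprocess call (the real implementation in tensorrt_utils.py also deduplicates flags and logs output):

```python
PROFILING_FLAGS = ("--exportProfile", "--profilingVerbosity", "--separateProfileRun")

def run_with_profiling_fallback(args, run):
    """Attempt the command as given; on an 'Unknown option' failure, strip
    profiling flags and retry once. `run(args)` returns (returncode, stderr)."""
    code, err = run(args)
    if code != 0 and "Unknown option" in err:
        stripped = [a for a in args if not a.startswith(PROFILING_FLAGS)]
        code, err = run(stripped)
        return code, err, True   # profiling marked unsupported from now on
    return code, err, False

# Fake runner simulating an old trtexec that rejects profiling flags.
def fake_run(args):
    if any(a.startswith(PROFILING_FLAGS) for a in args):
        return 1, "Unknown option: --exportProfile"
    return 0, ""

code, err, unsupported = run_with_profiling_fallback(
    ["--onnx=model.onnx", "--exportProfile=p.json"], fake_run)
# The retry succeeds without profiling, and later runs use total latency only.
```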

10. CLI Design

The CLI is defined in __main__.py.

Important options:

  • --mode {region,subgraph}
  • --graph-json
  • --incremental-validation / --no-incremental-validation
  • --use-trtexec
  • --trtexec-args
  • --pattern-cache
  • --qdq-baseline
  • --state-file
  • --plugin-libraries
  • --schemes-per-region

Design rationale:

  • Keep region mode as the default for backward compatibility
  • Make subgraph mode opt-in because it depends on TRT fusion metadata and trtexec
  • Keep advanced deployment settings in --trtexec-args instead of introducing many narrowly scoped flags

11. Data and Output Model

11.1 Region Mode Outputs

  • baseline.onnx
  • optimized_final.onnx
  • autotuner_state.yaml
  • autotuner_state_pattern_cache.yaml
  • logs/
  • region_models/

11.2 Subgraph Mode Outputs

  • optimized_raw.onnx
  • optimized_final.onnx
  • autotune_cache.json
  • logs/
  • subgraphs/
  • optional generated <model>.fp16.graph.json

12. Key Trade-offs

12.1 Region Mode vs Subgraph Mode

Region mode:

  • Pros:
    • General
    • Pattern-reuse friendly
    • Does not require TRT fusion metadata up front
  • Cons:
    • Can be slow on large models
    • More expensive search loop

Subgraph mode:

  • Pros:
    • Much faster
    • Aligned with actual TRT fusion boundaries
    • Better fit for large and dynamic-shape models
  • Cons:
    • Requires TRT graph export or prior graph
    • More heuristic
    • Local subgraph wins may not transfer cleanly to full-model wins

12.2 Why Keep Both Raw and Final Models

We save both optimized_raw.onnx and optimized_final.onnx because they serve different purposes:

  • optimized_raw.onnx preserves all locally accepted group changes
  • optimized_final.onnx preserves only changes that survive full-model validation

This avoids hiding useful intermediate output from advanced users while still giving a safer default final artifact.

12.3 Why Pattern-relative Insertion Points

Absolute node IDs would make learned schemes brittle and model-instance-specific.

Pattern-relative insertion points:

  • Generalize across repeated structures
  • Enable pattern cache reuse
  • Make imported QDQ baselines transferable

13. Limitations and Risks

  • Region-mode default discovery currently relies on PyTorch-style ONNX naming for best results.
  • The autotuner optimizes for latency only; it does not enforce accuracy constraints.
  • Subgraph heuristics do not explore the full combinatorial space.
  • Per-layer profiling depends on local trtexec support.
  • Full-model validation still depends on benchmark stability; small latency differences can be noisy.

14. Alternatives Considered

14.1 Full-model-only Search

Rejected because it is too slow for large models and dynamic-shape deployments.

14.2 Subgraph-only Design

Rejected because it gives up the benefits of pattern reuse and a backend-agnostic structural view of the graph.

14.3 Absolute-node Placement Search

Rejected because it would prevent scheme reuse across equivalent structures and make state transfer less useful.

15. Validation Plan

Recommended design-review validation:

  1. Basic region-mode run

    • Run on ResNet50 fixed-batch example
    • Confirm optimized_final.onnx and state files are produced
  2. Subgraph mode without --graph-json

    • Confirm trtexec first generates *.fp16.graph.json
    • Confirm the workflow proceeds into Phase 2 and Phase 3
  3. Resume behavior

    • Interrupt a subgraph run mid-way
    • Re-run with the same output directory
    • Confirm resume from autotune_cache.json
  4. Profiling fallback

    • Run in an environment where trtexec does not support profiling flags
    • Confirm automatic retry without those flags
  5. Warm-start path

    • Run once to produce a pattern cache
    • Re-run using --pattern-cache
    • Confirm seeding behavior and faster convergence

16. Rollout and Documentation

This branch already includes:

  • CLI integration
  • Example under examples/qdq_placement/
  • User documentation updates
  • API documentation updates

Suggested rollout:

  1. Complete design review
  2. Land PR behind the existing CLI mode separation
  3. Gather feedback on:
    • default region discovery behavior
    • subgraph heuristic quality
    • benchmark noise and validation thresholds

17. Open Questions for Review

  • Should region mode continue to default to TorchRegionBuilder, or should a more general search path be reinstated as the primary default?
  • Should subgraph mode require --use-trtexec, or should it remain flexible while documenting that trtexec is the recommended path?
  • Should the per-group acceptance threshold remain latency-only, or should we expose stronger hysteresis / noise margins for incremental validation?
  • Should we eventually add accuracy hooks so this becomes a joint latency + quality optimization flow?
