An T. Le, Hanoi, Nov 2025 (revised Jan 2026)
In practice, modern foundation models are optimized in two tightly-coupled layers:
- (A) Model-side optimization: quantization, pruning/sparsity, distillation, low-rank / factorized parameterization, etc. (reduce compute/memory while preserving task performance).
- (B) Deployment optimization: compilers + runtimes (e.g., TensorRT-LLM/TensorRT, OpenVINO, ONNX Runtime, LiteRT (ex-TFLite), TVM, ncnn, vendor SDKs) that turn an optimized checkpoint into a hardware-efficient engine.
NOTE: TorchAO / torchao mostly belongs to (A), but it increasingly acts as the “bridge” into (B) via export/compile flows.
This note is a quick mental map of the mainstream pathways for compressing and deploying foundation models (with links to docs + code).
Goal: reduce weights/activations from FP16/FP32 -> INT8/INT4/FP8/FP4 without unacceptable accuracy loss.
Common variants (esp. for transformers) (survey: Zhu et al., 2024):
- Post-Training Quantization (PTQ): no retraining; use a small calibration set.
  - Static (typical INT8): collect activation stats offline; quantize weights + activations.
  - Dynamic (common on CPUs): quantize weights offline; compute some activation quant params at runtime.
  - Transformer-specific PTQ recipes (examples): SmoothQuant, AWQ.
  - Codebases:
    - SmoothQuant: mit-han-lab/smoothquant
    - AWQ: mit-han-lab/llm-awq
- Quantization-Aware Training (QAT): simulate quantization during fine-tuning to recover accuracy when PTQ is too lossy.
  - PyTorch-native QAT for low-bit: torchao QAT docs
  - NVIDIA QAT workflows (for TensorRT/TensorRT-LLM): Model Optimizer docs
- Mixed precision & vendor-specific formats
  - Mixed FP16/BF16 is the default “cheap win.”
  - NVIDIA-specific low-bit floats: FP8 and NVFP4/FP4 are supported through TensorRT + NVIDIA Model Optimizer (ModelOpt). Practical entry points: NVFP4 overview, TensorRT quantized types.
Reality check: the format only matters if your deployment stack has the matching kernels (this is where TensorRT-LLM / TensorRT / OpenVINO / ORT / LiteRT differ).
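To make the static-PTQ mechanics concrete, here is a minimal pure-Python sketch of affine INT8 quantization with min/max calibration. Illustrative only: real stacks (torchao, ModelOpt, NNCF) use per-channel scales, histogram/percentile calibrators, and fused INT8 kernels.

```python
# Toy static PTQ: derive (scale, zero_point) from a calibration sample,
# then quantize/dequantize with that fixed mapping.

def calibrate(values, qmin=-128, qmax=127):
    """Affine INT8 params from observed min/max ("activation statistics")."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)      # range must cover zero exactly
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))           # clamp to the INT8 range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

calib = [-1.0, -0.25, 0.1, 0.5, 2.0]         # tiny calibration set
s, zp = calibrate(calib)
for x in calib:
    x_hat = dequantize(quantize(x, s, zp), s, zp)
    assert abs(x - x_hat) <= s               # error bounded by one quant step
```

Dynamic quantization differs only in *when* `calibrate` runs for activations: at runtime, per batch, instead of offline.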
Example (GR00T-style VLA):
- Vision + language backbones: INT8 / FP8 / FP4 (where supported)
- Action/diffusion head + final layers: FP16/BF16
References:
- GR00T N1.5 technical page: research.nvidia.com/labs/gear/gr00t-n1_5
- Isaac GR00T code/checkpoints: NVIDIA/Isaac-GR00T
Goal: remove “unimportant” parameters (often combined with quantization). Speedups depend heavily on kernel support and sparsity structure.
Common flavors:
- Unstructured pruning (weight sparsity)
  - Easy to apply, but usually needs specialized sparse kernels to see wall-clock speedups.
- Structured pruning (more reliable speedups)
  - Prune attention heads, MLP channels, entire blocks/layers, tokens, etc.
- N:M (semi-structured) sparsity
  - Example: 2:4 sparsity (50% zeros in a constrained pattern), which maps to NVIDIA Sparse Tensor Cores.
  - The acceleration stack often involves TensorRT and/or cuSPARSELt, plus an export path that preserves sparsity metadata.
References:
- cuSPARSELt docs: docs.nvidia.com/cuda/cusparselt
- torchao sparsity overview (incl. semi-structured): torchao sparsity docs
- NVIDIA Model Optimizer (pruning + sparsity): Model Optimizer repo
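A minimal pure-Python sketch of the 2:4 pattern: in every contiguous group of 4 weights, keep the 2 largest-magnitude entries and zero the rest. Real flows (torchao, ModelOpt, cuSPARSELt) additionally store the compressed values plus index metadata so the sparse kernels can skip the zeros.

```python
# Toy 2:4 (N:M) semi-structured sparsity: 50% zeros in a fixed pattern.

def prune_2_of_4(weights):
    assert len(weights) % 4 == 0
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the 2 largest-magnitude weights in this group of 4
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out

row = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.8, 0.01]
sparse = prune_2_of_4(row)
assert sparse == [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.8, 0.0]
assert sparse.count(0.0) == len(sparse) // 2   # exactly 50% zeros
```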
Goal: train a smaller/cheaper student to mimic a larger teacher.
Main flavors:
- Logit distillation: student matches teacher soft logits.
- Feature distillation: align hidden states / attention maps.
- Sequence / behavior distillation: student imitates teacher-generated trajectories/actions.
Tutorials / tooling:
- PyTorch KD tutorial: Knowledge distillation in PyTorch
- NVIDIA distillation workflow (ModelOpt + NeMo/HF): Model Optimizer distillation docs
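The core of logit distillation fits in a few lines. Below is a pure-Python sketch of the temperature-softened KL loss (Hinton et al., 2015); real pipelines such as the PyTorch KD tutorial or ModelOpt compute this over batches and mix it with the ordinary task loss.

```python
# Logit distillation: student matches the teacher's softened distribution.
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-T softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, -2.0]
assert kd_loss(teacher, teacher) < 1e-12      # identical logits: zero loss
# a student close to the teacher scores better than a uniform one:
assert kd_loss([3.5, 1.2, -1.8], teacher) < kd_loss([1.0, 1.0, 1.0], teacher)
```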
Often used for parameter-efficient adaptation, and sometimes for compression if merged or baked into the final model:
- LoRA / low-rank adapters: train low-rank deltas, optionally merge them into the base weights.
  - LoRA ref: microsoft/LoRA
  - Practical LoRA/QLoRA tooling: huggingface/peft
- Matrix/tensor decompositions: SVD / Tucker / CP, etc.
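The "merge" step is plain arithmetic: fold the low-rank delta W' = W + (alpha/r) * (B @ A) into the base weight, so merged inference carries zero adapter overhead. A tiny pure-Python sketch (peft's `merge_and_unload()` does the equivalent per target module):

```python
# LoRA merge: fold the scaled low-rank delta into the base weight matrix.

def matmul(B, A):
    rows, inner, cols = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def merge_lora(W, A, B, alpha, r):
    delta = matmul(B, A)                      # (out, r) @ (r, in) -> (out, in)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]                  # base weight, 2x2
A = [[1.0, 2.0]]                              # (r, in), rank r = 1
B = [[0.5], [-0.5]]                           # (out, r)
merged = merge_lora(W, A, B, alpha=2.0, r=1)
assert merged == [[2.0, 2.0], [-1.0, -1.0]]
```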
NVIDIA Model Optimizer (formerly “TensorRT Model Optimizer”) is the main toolkit for PTQ/QAT + pruning + distillation + speculative decoding + sparsity in the NVIDIA stack.
- Code + docs: NVIDIA/Model-Optimizer · ModelOpt docs
- LLM runtime:
  - TensorRT-LLM repo: NVIDIA/TensorRT-LLM
  - TensorRT-LLM docs: nvidia.github.io/TensorRT-LLM
Pipeline sketch
- Start from a Hugging Face / PyTorch checkpoint (e.g., GR00T N1.5).
- Apply PTQ or QAT with ModelOpt (INT8/FP8/NVFP4, etc.).
- If needed: pruning/sparsity + distillation to recover accuracy at lower cost.
- Export/build a TensorRT(-LLM) engine; deploy (Jetson / server GPUs / Blackwell-class hardware).
Pre-quantized checkpoints:
- Hugging Face collection: Inference-optimized checkpoints (Model Optimizer)
For Intel CPUs/GPUs and many industrial deployments, the “mainline” path today is:
- OpenVINO as the inference runtime: OpenVINO toolkit
- NNCF as the compression backend (PTQ/QAT/weight compression):
Hugging Face-friendly workflow:
- Optimum Intel + OpenVINO/NNCF: Optimum Intel OpenVINO optimization
Intel Neural Compressor (INC) is still relevant as a cross-framework quant/prune/distill toolkit (especially outside OpenVINO-only workflows):
If you want to stay close to PyTorch while exploring low-bit + sparsity:
- torchao docs: pytorch.org/ao
- Repo: pytorch/ao
Key tutorials (PyTorch 2 export quantization):
- PTQ (graph-mode): PyTorch 2 Export PTQ
- QAT (graph-mode): PyTorch 2 Export QAT
Pruning in “vanilla PyTorch”:
torch.nn.utils.prune is useful for simple experiments, but for structured pruning (channels/blocks with dependency handling), libraries like VainF/Torch-Pruning are often more practical.
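What torch.nn.utils.prune's L1 variant does per tensor is just threshold-by-magnitude behind a mask. A dependency-free sketch of that idea (remember: without sparse kernels, zeros alone save no wall-clock time):

```python
# Unstructured magnitude pruning: zero the fraction of weights with
# the smallest |w| (the idea behind l1_unstructured masking).

def l1_prune(weights, amount):
    """Zero the `amount` fraction of entries with smallest absolute value."""
    k = int(len(weights) * amount)            # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)                # pruned entry
            removed += 1
        else:
            pruned.append(w)                  # surviving entry
    return pruned

w = [0.8, -0.05, 0.3, -0.9, 0.02, 0.4]
assert l1_prune(w, 0.5) == [0.8, 0.0, 0.0, -0.9, 0.0, 0.4]
```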
Common choices across heterogeneous edge targets:
- ONNX Runtime (quantization + graph optimizations):
- ORT quantization docs: onnxruntime.ai quantization guide
- LiteRT (formerly TensorFlow Lite) for mobile/embedded:
- Overview: ai.google.dev/edge/litert
- GitHub: google-ai-edge/LiteRT
- Apache TVM (compiler + autotuning): tvm.apache.org
- ncnn (lightweight C++ runtime): Tencent/ncnn
Typical flow:
- Do pruning/KD in PyTorch/TF.
- Export (ONNX / LiteRT / IR).
- Run runtime-specific quantization + graph optimizations.
- Deploy.
If you want config-driven automation across methods:
- DeepSpeed Compression (quantization + pruning + distillation workflows): DeepSpeed model compression tutorial
- LLM Compressor (vLLM-focused compression toolkit): vllm-project/llm-compressor
- Pruna (commercial + OSS tooling): Pruna docs
Note: older “SparseML” references exist, but the upstream repo is archived; treat it as legacy unless your org already depends on it.
A practical “first pass” for a GR00T-style VLA:
- Baseline profiling on target (Jetson / server GPU / CPU box / SoC).
- Quantize most transformer layers (PTQ INT8/FP8/FP4 depending on hardware + stack); keep sensitive heads higher precision.
- Structured pruning / 2:4 sparsity only if your deployment engine has real sparse kernels for your shapes.
- Distill:
  - a smaller VLA, or
  - task-specific students (e.g., a manipulator-only policy).
- Export to the deployment stack:
  - NVIDIA path: PyTorch -> ModelOpt -> TensorRT-LLM/TensorRT
  - Intel path: PyTorch/HF -> OpenVINO IR -> NNCF -> OpenVINO runtime
  - General path: PyTorch -> ONNX -> ORT / TVM / ncnn
  - Mobile path: TF/PyTorch -> LiteRT
A useful mental model: cost splits into prefill (prompt processing) and decode (token-by-token).
- Prefill is usually compute-bound (big GEMMs + attention).
- Decode is often memory / KV-cache bandwidth bound.
So:
- Weight-only INT4/FP4 helps decode if your stack has good kernels.
- Better attention kernels (FlashAttention/FlashInfer) help prefill and reduce memory traffic.
- KV-cache tricks matter most for long context and high concurrency.
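The KV-cache pressure is easy to estimate by hand. A back-of-envelope sketch assuming a Llama-7B-like shape (32 layers, 32 KV heads, head_dim 128, no GQA; with grouped-query attention, shrink n_kv_heads accordingly):

```python
# Why decode goes bandwidth-bound: KV bytes grow linearly in tokens x sequences.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V

per_tok = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128,
                             bytes_per_elem=2)                    # FP16
assert per_tok == 524_288                     # 0.5 MiB of cache per token

# 4k-token context, 16 concurrent sequences -> 32 GiB of KV cache:
total = per_tok * 4096 * 16
assert total == 32 * 1024**3
```

Halve `bytes_per_elem` (INT8 KV) or quarter it (INT4) and the same arithmetic shows why KV-cache quantization matters at long context.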
If you do nothing else, choose a serving engine that gives you:
- continuous / dynamic batching
- paged KV cache (reduces fragmentation under concurrency)
- optional chunked prefill (smooths very long prompts)
Good entry points:
- vLLM: PagedAttention + continuous batching + CUDA/HIP graph execution. Docs: vLLM docs · Repo: vllm-project/vllm
- Hugging Face Text Generation Inference (TGI): production server with dynamic batching + tensor parallelism. Docs: TGI docs · Repo: huggingface/text-generation-inference
- SGLang: high-performance LLM serving framework + runtime stack. Repo: sgl-project/sglang
Reality check: feature parity differs (quant formats, speculative decoding, MoE, multi-modal, etc.).
Always confirm against each runtime’s “supported hardware + quantization” tables.
For transformer-heavy workloads, attention + MLP kernels are usually the make-or-break.
- FlashAttention (training + inference attention kernels): Dao-AILab/flash-attention
- FlashInfer (serving-focused kernels: attention, paged attention, sampling, etc.): Repo: flashinfer-ai/flashinfer · Docs: flashinfer.ai docs
When context length or concurrency grows, KV cache can dominate VRAM and drive latency cliffs.
Two complementary strategies:
- Systems: paged KV cache + chunked prefill (runtime feature).
- Model-side: KV cache quantization/compression (typically 4–8 bit; often mixed precision).
Researchy-but-usable codebases:
- KVQuant: SqueezeAILab/KVQuant
- ZipCache: ThisisBillhe/ZipCache
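A minimal sketch of one KV-quantization flavor: symmetric per-token INT8, i.e., one scale per cached K/V vector, computed on the fly at decode time. Systems like KVQuant go much further (per-channel keys, outlier handling, 4-bit), but the memory math is the same: INT8 halves FP16 KV traffic.

```python
# Per-token symmetric INT8 quantization of a cached K/V vector.

def quantize_kv_vector(vec):
    scale = max(abs(v) for v in vec) / 127 or 1.0   # avoid div-by-zero
    q = [round(v / scale) for v in vec]             # one scale per vector
    return q, scale

def dequantize_kv_vector(q, scale):
    return [qi * scale for qi in q]

k_vec = [0.127, -1.27, 0.635, 0.0]
q, s = quantize_kv_vector(k_vec)
assert q[1] == -127 and q[3] == 0                    # full-scale and zero map exactly
recovered = dequantize_kv_vector(q, s)
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(k_vec, recovered))
```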
Kernel library for efficient pruning decisions:
- Flash-ColReduce: Triton kernels for attention column-wise reductions (sum/mean/max) with O(N) memory; enables efficient token/KV importance estimation without materializing the O(N²) attention matrix. Repo: z-lab/flash-colreduce
If decode is the bottleneck, you can reduce the number of expensive target-model steps:
- Speculative decoding (draft model + verification): romsto/Speculative-Decoding
- Multi-token heads (Medusa): FasterDecoding/Medusa
- Block diffusion parallel drafting (DFlash): lightweight diffusion-based draft model generating multiple tokens in parallel; proven on Qwen/Llama/GPT-OSS; benefits LLM serving + high concurrency. Repo: z-lab/dflash
- NVIDIA Model Optimizer (ModelOpt) for PTQ/QAT + sparsity/distillation: Docs: ModelOpt docs · Repo: NVIDIA/Model-Optimizer
- TensorRT-LLM for engine build + kernels + serving: Docs: TensorRT-LLM docs · Repo: NVIDIA/TensorRT-LLM
- embedl-models (pre-optimized model collection): Repo: embedl/embedl-models
- NTransformer (inference engine for limited VRAM): Repo: xaskasdf/ntransformer
- Serving engines like vLLM can run with HIP backends; quantization support is more kernel-dependent and can be narrower than on NVIDIA hardware.
- PyTorch torch.compile (Inductor) is a good “graph+kernel” optimization baseline across NVIDIA/AMD/Intel GPUs (via Triton): API: torch.compile · Guide: torch.compiler docs
First levers:
- smaller model (distill) and/or weight-only quantization (INT8/INT4).
Runtimes:
- OpenVINO + NNCF (Intel-heavy deployments): OpenVINO · NNCF repo
- ONNX Runtime (cross-platform): ORT quantization docs
- ONNX Runtime GenAI (generation loop tooling): microsoft/onnxruntime-genai
- llama.cpp (local C++ inference; GGUF ecosystem): ggml-org/llama.cpp
- BitNet.cpp (1-bit LLM inference on CPU): microsoft/BitNet
- MLX for training + inference on Apple silicon: MLX repo · MLX docs
- LLM-oriented tooling: mlx-lm
- App conversion pipeline: coremltools: Repo · Guide
- LiteRT (ex-TFLite) runtime + delegates: Docs: LiteRT · Repo: google-ai-edge/LiteRT · Samples: litert-samples · LLM pipeline: google-ai-edge/LiteRT-LM
- ExecuTorch (PyTorch -> on-device runtime): Docs · Repo
- ONNX Runtime + QNN EP (Qualcomm acceleration): ORT QNN EP docs · Qualcomm ORT QNN EP docs
These are often the fastest way to get something working across laptops + edge boxes:
- MLC LLM (TVM-based compiler + runtime for LLM deployment): mlc-ai/mlc-llm · MLC LLM docs
- llama.cpp (GGUF + broad backend support): ggml-org/llama.cpp
- Optimize each submodule separately (vision encoder, LLM, action head) and re-profile end-to-end.
- Watch for non-model bottlenecks: image decode, resizing, tokenization, simulator/robot loop.
Model-side levers (reduce steps or faster steps):
- Reduce steps (often biggest win): Latent Consistency Models (LCM): luosiallen/latent-consistency-model
- Make each step faster (quantize/compile kernels):
- SVDQuant (4-bit diffusion via low-rank outlier absorption): arXiv:2411.05007
- Diffusers bitsandbytes quantization guide: Diffusers bitsandbytes quantization
- Reference implementations / research code: Stability-AI/generative-models
Distillation frameworks (end-to-end model compression):
- FastGen (fast generation from diffusion models): Repo: NVlabs/FastGen
Serving-time optimization (inference-time caching, not training-based):
- Cache-DiT (PyTorch-native, flexible inference engine with hybrid cache acceleration and parallelism for DiTs): Repo: vipshop/cache-dit
- MeanCache (from instantaneous to average velocity for accelerating flow-matching inference): Paper: ICLR 2026 · Repo: UnicomAI/MeanCache
- Profile and label the bottleneck: weights vs KV cache vs kernels vs scheduling.
- If decode/VRAM dominates -> start with weight-only INT4/FP4, but only if your runtime supports it well.
- If long context/concurrency dominates -> fix paged KV + chunked prefill, then consider KV cache quantization.
- If prefill compute dominates -> better kernels (FlashAttention/FlashInfer) + compile (torch.compile / TensorRT).
- If you still can’t hit constraints -> distill (often the only way to cut both compute and memory).
- Model optimization and deployment optimization are inseparable.
- Most “wins” come from matching a compression method to the runtime’s kernels.
- Treat it as a feedback loop: profile -> compress -> compile -> measure -> iterate.