An T. Le, Hanoi, Nov 2025 (revised Jan 2026)
In practice, modern foundation models are optimized in two tightly-coupled layers:
- (A) Model-side optimization: quantization, pruning/sparsity, distillation, low-rank / factorized parameterization, etc. (reduce compute/memory while preserving task performance).
- (B) Deployment optimization: compilers + runtimes (e.g., TensorRT-LLM/TensorRT, OpenVINO, ONNX Runtime, LiteRT (ex-TFLite), TVM, ncnn, vendor SDKs) that turn an optimized checkpoint into a hardware-efficient engine.
NOTE: TorchAO / torchao mostly belongs to (A), but it increasingly acts as the “bridge” into (B) via export/compile flows.
This note is a quick mental map of the mainstream pathways for compressing and deploying foundation models (with links to docs + code).
Goal: reduce weights/activations from FP16/FP32 -> INT8/INT4/FP8/FP4 without unacceptable accuracy loss.
Common variants (esp. for transformers) (survey: Zhu et al., 2024):
- Post-Training Quantization (PTQ): no retraining; use a small calibration set.
  - Static (typical INT8): collect activation stats offline; quantize weights + activations.
  - Dynamic (common on CPUs): quantize weights offline; compute some activation quant params at runtime.
  - Transformer-specific PTQ recipes (examples): SmoothQuant, AWQ.
  - Codebases:
    - SmoothQuant: mit-han-lab/smoothquant
    - AWQ: mit-han-lab/llm-awq
- Quantization-Aware Training (QAT): simulate quantization during fine-tuning to recover accuracy when PTQ is too lossy.
  - PyTorch-native QAT for low-bit: torchao QAT docs
  - NVIDIA QAT workflows (for TensorRT/TensorRT-LLM): Model Optimizer docs
- Mixed precision & vendor-specific formats
  - Mixed FP16/BF16 is the default “cheap win.”
  - NVIDIA-specific low-bit floats: FP8 and NVFP4/FP4 are supported through TensorRT + NVIDIA Model Optimizer (ModelOpt). Practical entry points: NVFP4 overview, TensorRT quantized types.
Reality check: the format only matters if your deployment stack has the matching kernels (this is where TensorRT-LLM / TensorRT / OpenVINO / ORT / LiteRT differ).
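To make the static-PTQ mechanics concrete, here is a minimal pure-Python sketch of affine INT8 quantization with min/max calibration. Illustrative only: real stacks (torchao, ModelOpt, NNCF) use per-channel scales, histogram/percentile calibrators, and fused INT8 kernels.

```python
# Toy static PTQ: derive (scale, zero_point) from a calibration sample,
# then quantize/dequantize with that fixed mapping.

def calibrate(values, qmin=-128, qmax=127):
    """Affine INT8 params from observed min/max ("activation statistics")."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)      # range must cover zero exactly
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))           # clamp to the INT8 range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

calib = [-1.0, -0.25, 0.1, 0.5, 2.0]         # tiny calibration set
s, zp = calibrate(calib)
for x in calib:
    x_hat = dequantize(quantize(x, s, zp), s, zp)
    assert abs(x - x_hat) <= s               # error bounded by one quant step
```

Dynamic quantization differs only in *when* `calibrate` runs for activations: at runtime, per batch, instead of offline.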
Example (GR00T-style VLA):
- Vision + language backbones: INT8 / FP8 / FP4 (where supported)
- Action/diffusion head + final layers: FP16/BF16
References:
- GR00T N1.5 technical page: research.nvidia.com/labs/gear/gr00t-n1_5
- Isaac GR00T code/checkpoints: NVIDIA/Isaac-GR00T
Goal: remove “unimportant” parameters (often combined with quantization). Speedups depend heavily on kernel support and sparsity structure.
Common flavors:
- Unstructured pruning (weight sparsity)
  - Easy to apply, but usually needs specialized sparse kernels to see wall-clock speedups.
- Structured pruning (more reliable speedups)
  - Prune attention heads, MLP channels, entire blocks/layers, tokens, etc.
- N:M (semi-structured) sparsity
  - Example: 2:4 sparsity (50% zeros in a constrained pattern), which maps to NVIDIA Sparse Tensor Cores.
  - The acceleration stack often involves TensorRT and/or cuSPARSELt, plus an export path that preserves sparsity metadata.
References:
- cuSPARSELt docs: docs.nvidia.com/cuda/cusparselt
- torchao sparsity overview (incl. semi-structured): torchao sparsity docs
- NVIDIA Model Optimizer (pruning + sparsity): Model Optimizer repo
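A minimal pure-Python sketch of the 2:4 pattern: in every contiguous group of 4 weights, keep the 2 largest-magnitude entries and zero the rest. Real flows (torchao, ModelOpt, cuSPARSELt) additionally store the compressed values plus index metadata so the sparse kernels can skip the zeros.

```python
# Toy 2:4 (N:M) semi-structured sparsity: 50% zeros in a fixed pattern.

def prune_2_of_4(weights):
    assert len(weights) % 4 == 0
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the 2 largest-magnitude weights in this group of 4
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out

row = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.8, 0.01]
sparse = prune_2_of_4(row)
assert sparse == [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.8, 0.0]
assert sparse.count(0.0) == len(sparse) // 2   # exactly 50% zeros
```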
Goal: train a smaller/cheaper student to mimic a larger teacher.
Main flavors:
- Logit distillation: student matches teacher soft logits.
- Feature distillation: align hidden states / attention maps.
- Sequence / behavior distillation: student imitates teacher-generated trajectories/actions.
Tutorials / tooling:
- PyTorch KD tutorial: Knowledge distillation in PyTorch
- NVIDIA distillation workflow (ModelOpt + NeMo/HF): Model Optimizer distillation docs
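The core of logit distillation fits in a few lines. Below is a pure-Python sketch of the temperature-softened KL loss (Hinton et al., 2015); real pipelines such as the PyTorch KD tutorial or ModelOpt compute this over batches and mix it with the ordinary task loss.

```python
# Logit distillation: student matches the teacher's softened distribution.
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-T softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, -2.0]
assert kd_loss(teacher, teacher) < 1e-12      # identical logits: zero loss
# a student close to the teacher scores better than a uniform one:
assert kd_loss([3.5, 1.2, -1.8], teacher) < kd_loss([1.0, 1.0, 1.0], teacher)
```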
Often used for parameter-efficient adaptation, and sometimes for compression if merged or baked into the final model:
- LoRA / low-rank adapters: train low-rank deltas, optionally merge them into the base weights.
  - LoRA ref: microsoft/LoRA
  - Practical LoRA/QLoRA tooling: huggingface/peft
- Matrix/tensor decompositions: SVD / Tucker / CP, etc.
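The "merge" step is plain arithmetic: fold the low-rank delta W' = W + (alpha/r) * (B @ A) into the base weight, so merged inference carries zero adapter overhead. A tiny pure-Python sketch (peft's `merge_and_unload()` does the equivalent per target module):

```python
# LoRA merge: fold the scaled low-rank delta into the base weight matrix.

def matmul(B, A):
    rows, inner, cols = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def merge_lora(W, A, B, alpha, r):
    delta = matmul(B, A)                      # (out, r) @ (r, in) -> (out, in)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]                  # base weight, 2x2
A = [[1.0, 2.0]]                              # (r, in), rank r = 1
B = [[0.5], [-0.5]]                           # (out, r)
merged = merge_lora(W, A, B, alpha=2.0, r=1)
assert merged == [[2.0, 2.0], [-1.0, -1.0]]
```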
NVIDIA Model Optimizer (formerly “TensorRT Model Optimizer”) is the main toolkit for PTQ/QAT + pruning + distillation + speculative decoding + sparsity in the NVIDIA stack.
- Code + docs: NVIDIA/Model-Optimizer · ModelOpt docs
- LLM runtime:
  - TensorRT-LLM repo: NVIDIA/TensorRT-LLM
  - TensorRT-LLM docs: nvidia.github.io/TensorRT-LLM
Pipeline sketch
- Start from a Hugging Face / PyTorch checkpoint (e.g., GR00T N1.5).
- Apply PTQ or QAT with ModelOpt (INT8/FP8/NVFP4, etc.).
- If needed: pruning/sparsity + distillation to recover accuracy at lower cost.
- Export/build a TensorRT(-LLM) engine; deploy (Jetson / server GPUs / Blackwell-class hardware).
Pre-quantized checkpoints:
- Hugging Face collection: Inference-optimized checkpoints (Model Optimizer)
For Intel CPUs/GPUs and many industrial deployments, the “mainline” path today is:
- OpenVINO as the inference runtime: OpenVINO toolkit
- NNCF as the compression backend (PTQ/QAT/weight compression):
Hugging Face-friendly workflow:
- Optimum Intel + OpenVINO/NNCF: Optimum Intel OpenVINO optimization
Intel Neural Compressor (INC) is still relevant as a cross-framework quant/prune/distill toolkit (especially outside OpenVINO-only workflows):
If you want to stay close to PyTorch while exploring low-bit + sparsity:
- torchao docs: pytorch.org/ao
- Repo: pytorch/ao
Key tutorials (PyTorch 2 export quantization):
- PTQ (graph-mode): PyTorch 2 Export PTQ
- QAT (graph-mode): PyTorch 2 Export QAT
Pruning in “vanilla PyTorch”:
torch.nn.utils.prune is useful for simple experiments, but for structured pruning (channels/blocks with dependency handling), libraries like VainF/Torch-Pruning are often more practical.
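What torch.nn.utils.prune's L1 variant does per tensor is just threshold-by-magnitude behind a mask. A dependency-free sketch of that idea (remember: without sparse kernels, zeros alone save no wall-clock time):

```python
# Unstructured magnitude pruning: zero the fraction of weights with
# the smallest |w| (the idea behind l1_unstructured masking).

def l1_prune(weights, amount):
    """Zero the `amount` fraction of entries with smallest absolute value."""
    k = int(len(weights) * amount)            # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)                # pruned entry
            removed += 1
        else:
            pruned.append(w)                  # surviving entry
    return pruned

w = [0.8, -0.05, 0.3, -0.9, 0.02, 0.4]
assert l1_prune(w, 0.5) == [0.8, 0.0, 0.0, -0.9, 0.0, 0.4]
```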
Common choices across heterogeneous edge targets:
- ONNX Runtime (quantization + graph optimizations):
- ORT quantization docs: onnxruntime.ai quantization guide
- LiteRT (formerly TensorFlow Lite) for mobile/embedded:
- Overview: ai.google.dev/edge/litert
- GitHub: google-ai-edge/LiteRT
- Apache TVM (compiler + autotuning): tvm.apache.org
- ncnn (lightweight C++ runtime): Tencent/ncnn
Typical flow:
- Do pruning/KD in PyTorch/TF.
- Export (ONNX / LiteRT / IR).
- Run runtime-specific quantization + graph optimizations.
- Deploy.
If you want config-driven automation across methods:
- DeepSpeed Compression (quantization + pruning + distillation workflows): DeepSpeed model compression tutorial
- LLM Compressor (vLLM-focused compression toolkit): vllm-project/llm-compressor
- Pruna (commercial + OSS tooling): Pruna docs
Note: older “SparseML” references exist, but the upstream repo is archived; treat it as legacy unless your org already depends on it.
A practical “first pass” for a GR00T-style VLA:
- Baseline profiling on target (Jetson / server GPU / CPU box / SoC).
- Quantize most transformer layers (PTQ INT8/FP8/FP4 depending on hardware + stack); keep sensitive heads higher precision.
- Structured pruning / 2:4 sparsity only if your deployment engine has real sparse kernels for your shapes.
- Distill:
  - a smaller VLA, or
  - task-specific students (e.g., a manipulator-only policy).
- Export to the deployment stack:
  - NVIDIA path: PyTorch -> ModelOpt -> TensorRT-LLM/TensorRT
  - Intel path: PyTorch/HF -> OpenVINO IR -> NNCF -> OpenVINO runtime
  - General path: PyTorch -> ONNX -> ORT / TVM / ncnn
  - Mobile path: TF/PyTorch -> LiteRT
A useful mental model: cost splits into prefill (prompt processing) and decode (token-by-token).
- Prefill is usually compute-bound (big GEMMs + attention).
- Decode is often memory / KV-cache bandwidth bound.
So:
- Weight-only INT4/FP4 helps decode if your stack has good kernels.
- Better attention kernels (FlashAttention/FlashInfer) help prefill and reduce memory traffic.
- KV-cache tricks matter most for long context and high concurrency.
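The KV-cache pressure is easy to estimate by hand. A back-of-envelope sketch assuming a Llama-7B-like shape (32 layers, 32 KV heads, head_dim 128, no GQA; with grouped-query attention, shrink n_kv_heads accordingly):

```python
# Why decode goes bandwidth-bound: KV bytes grow linearly in tokens x sequences.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V

per_tok = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128,
                             bytes_per_elem=2)                    # FP16
assert per_tok == 524_288                     # 0.5 MiB of cache per token

# 4k-token context, 16 concurrent sequences -> 32 GiB of KV cache:
total = per_tok * 4096 * 16
assert total == 32 * 1024**3
```

Halve `bytes_per_elem` (INT8 KV) or quarter it (INT4) and the same arithmetic shows why KV-cache quantization matters at long context.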
If you do nothing else, choose a serving engine that gives you:
- continuous / dynamic batching
- paged KV cache (reduces fragmentation under concurrency)
- optional chunked prefill (smooths very long prompts)
Good entry points:
- vLLM: PagedAttention + continuous batching + CUDA/HIP graph execution. Docs: vLLM docs · Repo: vllm-project/vllm
- Hugging Face Text Generation Inference (TGI): production server with dynamic batching + tensor parallelism. Docs: TGI docs · Repo: huggingface/text-generation-inference
- SGLang: high-performance LLM serving framework + runtime stack. Repo: sgl-project/sglang
Reality check: feature parity differs (quant formats, speculative decoding, MoE, multi-modal, etc.).
Always confirm against each runtime’s “supported hardware + quantization” tables.
For transformer-heavy workloads, attention + MLP kernels are usually the make-or-break.
- FlashAttention (training + inference attention kernels): Dao-AILab/flash-attention
- FlashInfer (serving-focused kernels: attention, paged attention, sampling, etc.): Repo: flashinfer-ai/flashinfer · Docs: flashinfer.ai docs
When context length or concurrency grows, KV cache can dominate VRAM and drive latency cliffs.
Two complementary strategies:
- Systems: paged KV cache + chunked prefill (runtime feature).
- Model-side: KV cache quantization/compression (typically 4–8 bit; often mixed precision).
Researchy-but-usable codebases:
- KVQuant: SqueezeAILab/KVQuant
- ZipCache: ThisisBillhe/ZipCache
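A minimal sketch of one KV-quantization flavor: symmetric per-token INT8, i.e., one scale per cached K/V vector, computed on the fly at decode time. Systems like KVQuant go much further (per-channel keys, outlier handling, 4-bit), but the memory math is the same: INT8 halves FP16 KV traffic.

```python
# Per-token symmetric INT8 quantization of a cached K/V vector.

def quantize_kv_vector(vec):
    scale = max(abs(v) for v in vec) / 127 or 1.0   # avoid div-by-zero
    q = [round(v / scale) for v in vec]             # one scale per vector
    return q, scale

def dequantize_kv_vector(q, scale):
    return [qi * scale for qi in q]

k_vec = [0.127, -1.27, 0.635, 0.0]
q, s = quantize_kv_vector(k_vec)
assert q[1] == -127 and q[3] == 0                    # full-scale and zero map exactly
recovered = dequantize_kv_vector(q, s)
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(k_vec, recovered))
```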
Kernel library for efficient pruning decisions:
- Flash-ColReduce: Triton kernels for attention column-wise reductions (sum/mean/max) with O(N) memory; enables efficient token/KV importance estimation without materializing the O(N²) attention matrix. Repo: z-lab/flash-colreduce
If decode is the bottleneck, you can reduce the number of expensive target-model steps:
- Speculative decoding (draft model + verification): romsto/Speculative-Decoding
- Multi-token heads (Medusa): FasterDecoding/Medusa
- Block diffusion parallel drafting (DFlash): lightweight diffusion-based draft model generating multiple tokens in parallel; proven on Qwen/Llama/GPT-OSS; benefits LLM serving + high concurrency. Repo: z-lab/dflash
- NVIDIA Model Optimizer (ModelOpt) for PTQ/QAT + sparsity/distillation: Docs: ModelOpt docs · Repo: NVIDIA/Model-Optimizer
- TensorRT-LLM for engine build + kernels + serving: Docs: TensorRT-LLM docs · Repo: NVIDIA/TensorRT-LLM
- embedl-models (pre-optimized model collection): Repo: embedl/embedl-models
- NTransformer (inference engine for limited VRAM): Repo: xaskasdf/ntransformer
- Serving engines like vLLM can run with HIP backends; quantization support is more kernel-dependent and can be narrower than on NVIDIA hardware.
- PyTorch torch.compile (Inductor) is a good “graph+kernel” optimization baseline across NVIDIA/AMD/Intel GPUs (via Triton): API: torch.compile · Guide: torch.compiler docs
First levers:
- smaller model (distill) and/or weight-only quantization (INT8/INT4).
Runtimes:
- OpenVINO + NNCF (Intel-heavy deployments): OpenVINO · NNCF repo
- ONNX Runtime (cross-platform): ORT quantization docs
- ONNX Runtime GenAI (generation loop tooling): microsoft/onnxruntime-genai
- llama.cpp (local C++ inference; GGUF ecosystem): ggml-org/llama.cpp
- BitNet.cpp (1-bit LLM inference on CPU): microsoft/BitNet
- MLX for training + inference on Apple silicon: MLX repo · MLX docs
- LLM-oriented tooling: mlx-lm
- App conversion pipeline: coremltools: Repo · Guide
- LiteRT (ex-TFLite) runtime + delegates: Docs: LiteRT · Repo: google-ai-edge/LiteRT · Samples: litert-samples · LLM pipeline: google-ai-edge/LiteRT-LM
- ExecuTorch (PyTorch -> on-device runtime): Docs · Repo
- ONNX Runtime + QNN EP (Qualcomm acceleration): ORT QNN EP docs · Qualcomm ORT QNN EP docs
These are often the fastest way to get something working across laptops + edge boxes:
- MLC LLM (TVM-based compiler + runtime for LLM deployment): mlc-ai/mlc-llm · MLC LLM docs
- llama.cpp (GGUF + broad backend support): ggml-org/llama.cpp
- Optimize each submodule separately (vision encoder, LLM, action head) and re-profile end-to-end.
- Watch for non-model bottlenecks: image decode, resizing, tokenization, simulator/robot loop.
Model-side levers (reduce steps or faster steps):
- Reduce steps (often biggest win): Latent Consistency Models (LCM): luosiallen/latent-consistency-model
- Make each step faster (quantize/compile kernels):
- SVDQuant (4-bit diffusion via low-rank outlier absorption): arXiv:2411.05007
- Diffusers bitsandbytes quantization guide: Diffusers bitsandbytes quantization
- Reference implementations / research code: Stability-AI/generative-models
Distillation frameworks (end-to-end model compression):
- FastGen (fast generation from diffusion models): Repo: NVlabs/FastGen
Serving-time optimization (inference-time caching, not training-based):
- Cache-DiT (PyTorch-native, flexible inference engine with hybrid cache acceleration and parallelism for DiTs): Repo: vipshop/cache-dit
- MeanCache (from instantaneous to average velocity for accelerating flow-matching inference): Paper: ICLR 2026 · Repo: UnicomAI/MeanCache
- Profile and label the bottleneck: weights vs KV cache vs kernels vs scheduling.
- If decode/VRAM dominates -> start with weight-only INT4/FP4, but only if your runtime supports it well.
- If long context/concurrency dominates -> fix paged KV + chunked prefill, then consider KV cache quantization.
- If prefill compute dominates -> better kernels (FlashAttention/FlashInfer) + compile (torch.compile / TensorRT).
- If you still can’t hit constraints -> distill (often the only way to cut both compute and memory).
- Model optimization and deployment optimization are inseparable.
- Most “wins” come from matching a compression method to the runtime’s kernels.
- Treat it as a feedback loop: profile -> compress -> compile -> measure -> iterate.