🚀 The feature, motivation and pitch
Summary
Add an opt-in compile-time option that lets models whose weights don't all fit in GPU memory still run on the CUDA backend. Weights live in CPU memory, and the runtime copies only the few currently needed onto the GPU on demand using a capped CUDA memory pool, overlapping these copies with kernel execution so the overhead is mostly hidden. If the user's GPU memory budget is too small to safely execute the model, load fails with a specific required minimum — never silent corruption at runtime.
Motivation
Today the CUDA backend loads every weight of a model into GPU memory at startup. If the model's weights exceed GPU memory, loading fails before any computation runs.
The model itself doesn't need every weight on the GPU at the same time: each kernel reads only a small subset, but the runtime currently insists on loading everything up front.
How it would work
Three pieces, all opt-in.
1. A graph pass at export time
The user runs a one-line transform on their exported program before compiling for CUDA. The pass does three things:
- Inserts a custom op in front of every weight, so every kernel reads its weight through this op rather than directly.
- Records the order in which weights are read during a forward pass. This order is deterministic — AOTI's generated code executes kernels in a fixed sequence — so it can be computed statically by walking the graph.
- Computes the minimum GPU byte budget required to safely execute the graph (see "Why a minimum budget" below).
Both the access order and the minimum budget are attached to the program as constant methods. The access order is per-weight structured data of non-trivial size and constant methods are the existing mechanism for that kind of data. Per-weight sizes, dtypes, and shapes are not recorded by the pass — they are read at load time directly from the .ptd file.
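As a rough illustration of the access-order step, here is a minimal sketch over a toy graph representation. The `Kernel` type and `record_access_order` function are hypothetical names invented for this example; the real pass would walk the exported program's graph rather than a hand-built list.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Kernel:
    """Hypothetical stand-in for one kernel launch in the generated code."""
    name: str
    weights: Tuple[str, ...]  # weights this kernel reads, in probe order


def record_access_order(kernels: List[Kernel]) -> List[str]:
    """Walk kernels in execution order and list every weight read, in order.

    A weight read by several kernels appears once per read, because each
    read is a separate call to the custom op at runtime.
    """
    order: List[str] = []
    for kernel in kernels:
        order.extend(kernel.weights)
    return order


graph = [
    Kernel("embed", ("tok_emb",)),
    Kernel("qkv", ("wq", "wk", "wv")),
    Kernel("lm_head", ("tok_emb",)),  # tied weight, read a second time
]
access_order = record_access_order(graph)
```

Because the kernel sequence is fixed, the same walk yields the same list on every run, which is what makes it safe to bake into the program.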
2. A CPU-side copy of every weight, sourced from the .ptd file
The .ptd file is memory-mapped — the OS exposes the file's bytes as a region of the process's address space without an explicit read into RAM. The runtime registers this region with the CUDA driver, which marks it as eligible for direct PCIe transfer to the GPU. CPU RAM usage stays minimal: the OS pages bytes in from disk as the runtime touches them and evicts them under memory pressure.
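The memory-mapping half of this can be sketched with the Python standard library. The file layout below is a made-up stand-in (two byte ranges back to back), not the real .ptd format, and the CUDA registration step is only indicated in a comment because it has no stdlib equivalent.

```python
import mmap
import os
import tempfile

# Write a small stand-in file: two "weights" laid out back to back.
# (Assumption: a real .ptd file has its own header and layout.)
payload = bytes(range(16)) + bytes([255] * 8)
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

with open(path, "rb") as f:
    # Map the whole file read-only; pages are loaded lazily when touched.
    region = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# A memoryview slice refers to the mapped pages without copying them.
view = memoryview(region)
second_weight = view[16:24]
first_byte = second_weight[0]

# In the proposal, the runtime would register the mapped region with the
# CUDA driver (cudaHostRegister) at this point; that step is omitted here.

# Views must be released before the mapping can be closed.
second_weight.release()
view.release()
region.close()
os.unlink(path)
```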
3. A capped CUDA memory pool with per-weight allocations
At load time the runtime creates a cudaMemPool_t whose maximum size equals the user's GPU byte budget, configured for cross-stream opportunistic reuse. The weights are allocated individually at their actual sizes, eliminating the bin-packing waste that uniform buffer sizing imposes on models with non-uniform weights.
When the custom op is called for a weight:
- If that weight currently has a live GPU allocation, its data pointer is returned to the kernel directly.
- If not, the runtime checks whether the pool has room for a new allocation of the requested size. If not, it picks the least-recently-used weight, returns its allocation to the pool via cudaFreeAsync, and repeats until there is room. It then cudaMallocAsyncs the requested size, issues an H2D copy of the weight's bytes via PCIe, and returns the new allocation.
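The hit/miss logic above can be modelled without a GPU. The sketch below is a toy simulation in which byte counts stand in for real allocations; `OffloadPoolSim` and its fields are hypothetical names, and the CUDA calls are only marked in comments.

```python
from collections import OrderedDict


class OffloadPoolSim:
    """Toy model of the capped pool; byte counts stand in for GPU memory."""

    def __init__(self, budget_bytes, weight_sizes):
        self.budget = budget_bytes
        self.sizes = weight_sizes      # weight name -> size in bytes
        self.resident = OrderedDict()  # weight name -> size, oldest first
        self.used = 0
        self.evictions = []

    def probe(self, name):
        """Mimic the custom op: report 'hit' or 'miss' for a weight."""
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return "hit"
        size = self.sizes[name]
        # Evict least-recently-used weights until the new one fits
        # (the real runtime would call cudaFreeAsync here). This assumes
        # the budget is at least as large as the weight, which the
        # load-time floor check is meant to guarantee.
        while self.used + size > self.budget:
            victim, victim_size = self.resident.popitem(last=False)
            self.used -= victim_size
            self.evictions.append(victim)
        # The real runtime would cudaMallocAsync and issue the H2D copy here.
        self.resident[name] = size
        self.used += size
        return "miss"


pool = OffloadPoolSim(budget_bytes=48, weight_sizes={"a": 16, "b": 16, "c": 32})
results = [pool.probe(n) for n in ["a", "b", "a", "c"]]
```

In this run the second probe of `a` refreshes its recency, so the later miss on `c` evicts `b` rather than `a`.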
In parallel with the kernel that consumes the returned allocation, the runtime reads the recorded access order to determine which weight will be needed next, and starts an asynchronous copy of that weight into a new allocation on a separate CUDA stream. Synchronization between the copy stream and the compute stream uses CUDA events. The pool handles cross-stream memory reuse transparently — a weight freed on the compute stream becomes available to be re-allocated on the copy stream after stream-ordered synchronization.
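Choosing the prefetch target is a lookup into the recorded order. A minimal sketch follows; `next_to_prefetch` is a hypothetical helper, and skipping weights that are already resident is an assumption of this example rather than something the proposal specifies.

```python
def next_to_prefetch(access_order, position, resident):
    """Return the first weight after `position` that is not on the GPU yet.

    access_order: weight names in the recorded read order.
    position: index of the probe currently being served.
    resident: names of weights that already have a live GPU allocation.
    Returns None when nothing further in this pass needs copying.
    """
    for name in access_order[position + 1:]:
        if name not in resident:
            return name
    return None


order = ["wq", "wk", "wv", "w_out"]
target = next_to_prefetch(order, 0, resident={"wq", "wk"})
```

The real runtime would then issue the copy for `target` on the copy stream and record a CUDA event for the compute stream to wait on.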
At init the runtime also "warms" the pool by allocating and freeing one buffer per distinct weight size encountered in the access order, so first-iteration allocator overhead is amortized away.
Why a minimum budget
Two structural reasons:
- A single kernel that reads N weights (a fused QKV linear, for example) needs all N allocations valid at the same time when it launches. If the budget can't fit them simultaneously, the third probe call evicts the first — and the kernel later reads from memory whose bytes are now a different weight. Silent wrong output.
- Kernel launches are asynchronous. When the next kernel's probes start firing, the previous kernel may not have executed yet, so its weights are still pinned in their allocations. Both kernels' weight sets need to fit simultaneously during the transition.
The pass computes the minimum budget from the static call order, in bytes:
floor_bytes = max over consecutive kernel pairs (K_i, K_{i+1}) of
                (sum of weight bytes in K_i + sum of weight bytes in K_{i+1})
              + prefetch_headroom_bytes
The pair sum accounts for cross-kernel pinning; the headroom term accounts for in-flight prefetches. Because it sums actual weight bytes rather than count × max_weight_size, the floor is tight: models with a few large outlier weights pay only for those outliers, not for the outliers' size multiplied across every allocation.
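The floor formula translates directly into a few lines of Python. This is a sketch under assumptions: the function name, argument shapes, and the handling of graphs with fewer than two kernels are choices made for the example, and the weight sizes are illustrative.

```python
def floor_bytes(kernel_weights, sizes, prefetch_headroom_bytes):
    """Largest combined footprint of two consecutive kernels, plus headroom.

    kernel_weights: list of lists, the weight names read by each kernel,
        in execution order.
    sizes: weight name -> size in bytes.
    """
    per_kernel = [sum(sizes[w] for w in ws) for ws in kernel_weights]
    if not per_kernel:
        worst_pair = 0
    elif len(per_kernel) == 1:
        # Assumption: with a single kernel there is no pair to consider.
        worst_pair = per_kernel[0]
    else:
        worst_pair = max(a + b for a, b in zip(per_kernel, per_kernel[1:]))
    return worst_pair + prefetch_headroom_bytes


MB = 1 << 20
sizes = {"wq": 16 * MB, "wk": 16 * MB, "wv": 16 * MB,
         "w_out": 16 * MB, "lm_head": 256 * MB}
kernels = [["wq", "wk", "wv"], ["w_out"], ["lm_head"]]
required = floor_bytes(kernels, sizes, prefetch_headroom_bytes=32 * MB)
```

Here the per-kernel footprints are 48, 16, and 256 MB, so the worst consecutive pair is 16 + 256 = 272 MB and the floor is 304 MB once the 32 MB headroom is added.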
At load, the runtime asserts the budget covers at least floor_bytes. If not, load fails with the exact required minimum. This eliminates the need for an event-aware eviction policy: above the floor, the LRU candidate is always a weight no kernel is still reading.
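The load-time check itself is small. In this sketch `BudgetTooSmallError` and `check_budget` are hypothetical names, and the message wording is illustrative only.

```python
class BudgetTooSmallError(RuntimeError):
    """Hypothetical error type for a budget below the floor."""


def check_budget(budget_bytes, required_bytes):
    """Fail loudly at load if the budget cannot cover the floor."""
    if budget_bytes < required_bytes:
        raise BudgetTooSmallError(
            f"weight offload budget is {budget_bytes} bytes; "
            f"this model requires at least {required_bytes} bytes"
        )


try:
    check_budget(budget_bytes=100, required_bytes=304)
    message = None
except BudgetTooSmallError as err:
    message = str(err)
```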
User surfaces
At export time (opt in once):
ep = export(model, inputs)
ep = apply_weight_offload(ep)
prog = to_edge_transform_and_lower(ep, partitioner=[CudaPartitioner(...)])
At load time (set the GPU budget):
loaded = Module(pte, weight_offload_budget_mb=512)
Failure modes
| Failure | Where it's caught | Outcome |
| --- | --- | --- |
| GPU memory below the floor | Init | Hard fail with required minimum |
| cudaMemPool_t creation fails | Init | Hard fail with the system-level cause |
| .ptd missing a weight AOTI expects | Init | Hard fail with the missing FQN |
| .ptd shape/dtype disagrees with AOTI's metadata | Init | Hard fail naming the constant |
| AOTI rejects the pre-load placeholder install (open question 1) | Init | Hard fail; user falls back to non-offloaded path |
| cudaHostRegister rejects the mmap region | Init | Hard fail with the system-level cause |
| Recorded schedule disagrees with the actual runtime call sequence | Runtime | Hard fail at the first divergent probe call (this is a bug in the export pass, not a recoverable state) |
| Concurrent execute() on the same loaded module | Documented limitation | Not thread-safe in v3, same as AOTI itself |
There is intentionally no silent fallback for any of these. The cost of a hard init failure with a clear message is a worse first-run experience for some users; the cost of a runtime fallback is a class of confusing intermittent bugs and silent slowdowns. The trade prefers loud-and-early.
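The one runtime failure in the table, a schedule mismatch, could be detected with a cursor over the recorded order. The sketch below uses a hypothetical `ScheduleChecker`; wrapping the cursor around at the end of a forward pass is an assumption of this example.

```python
class ScheduleChecker:
    """Compare each probe call against the recorded access order."""

    def __init__(self, access_order):
        self.order = list(access_order)
        self.pos = 0

    def on_probe(self, name):
        # Assumption: the same order repeats on every forward pass.
        expected = self.order[self.pos % len(self.order)]
        if name != expected:
            raise RuntimeError(
                f"probe {self.pos}: expected {expected!r}, got {name!r}"
            )
        self.pos += 1


checker = ScheduleChecker(["wq", "wk"])
for name in ["wq", "wk", "wq"]:
    checker.on_probe(name)

try:
    checker.on_probe("wq")  # the recorded order says "wk" comes next
    diverged = False
except RuntimeError:
    diverged = True
```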
Non-goals
Real future work, but not in scope for this:
- Spreading a model across multiple GPUs
- Sharing the memory pool between models running in the same process
- Offloading weights that AOTI has fused together during constant folding
- Adjusting the budget after the model is loaded
- Event-aware eviction or pinned-allocation tracking — the floor-based init check obviates these
Open Questions:
cudaMallocAsync overhead in steady state. Per-allocation cost from a warm pool is on the order of microseconds, but for models with many weights per forward pass the cumulative cost may be measurable against kernel time. Init-time pool warming (allocate-and-free one buffer per distinct weight size) is the planned mitigation. Should be measured on a realistic workload before committing to the design.
What happens with constant folding? AOTI may fuse some weights into combined results at compile time; those results sit in a separate path the proposed mechanism does not intercept. Options: document as a limitation, require constant_folding=False when offloading, or extend the pass to cover folded outputs.
Alternatives
Fixed-size buffer pool. Allocate N equal-sized buffers at init, each sized to the largest weight in the model; reuse them via LRU. Considered and rejected: for models with non-uniform weights, the largest weight forces every buffer to its size, wasting most of the committed GPU memory. A model with a 256 MB lm_head and 16 MB attention weights would commit 9 × 256 MB = 2,304 MB (about 2.3 GB) to cover what actually averages 144 MB of in-use weight memory. The proposed per-allocation design recovers this waste.
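The arithmetic in this example can be checked in a few lines. Reading the 144 MB figure as nine buffers each holding a 16 MB weight is an interpretation made for this sketch; the buffer count and sizes come from the example above.

```python
buffers = 9        # buffer count used in the example above
largest_mb = 256   # lm_head, which sets the size of every fixed buffer
typical_mb = 16    # an attention weight

fixed_pool_mb = buffers * largest_mb  # memory committed by the fixed pool
in_use_mb = buffers * typical_mb      # memory actually holding weight bytes
wasted_fraction = 1 - in_use_mb / fixed_pool_mb
```

Under that reading, 2,304 MB is committed for 144 MB of live weight data, so roughly 94% of the fixed pool sits unused.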
Bucketed buffer sizes. Group weights into size buckets, one fixed-size pool per bucket. A middle ground between fixed and per-allocation. Considered and rejected: requires per-bucket floor analysis at export, per-bucket eviction at runtime, and a user-visible (or heuristic) choice of bucket count and sizes. Per-allocation via the CUDA memory pool removes the bucketing decision entirely and lets the pool's internal slab caching handle same-size reuse for the common case.
Custom best-fit allocator over a single cudaMalloc. Conceptually general but reinvents what cudaMallocAsync plus a cudaMemPool_t already provides — including fragmentation handling. No reason to own that complexity ourselves.
Load all weights first, then free the GPU copy. Let AOTI load weights as usual, then replace its pointers with placeholders and free the original GPU allocation. Functionally correct but at startup the GPU briefly holds both the original constants blob and the new pool — exactly the case offloading exists to avoid for the largest models.
Sizing at export time instead of load time. Rejected: a model deployed on different GPUs needs different budgets, and forcing re-export to retune adds friction with no upside. Export records what's model-dependent and not derivable elsewhere (access order, floor in bytes); load decides what's deployment-dependent (budget); per-weight metadata is read from the .ptd.
Event-aware eviction with pinned-allocation tracking. Lets the runtime tolerate budgets below the static floor by blocking on compute-stream events when no unpinned allocation is available. Considered and rejected: it lets users configure budgets that silently degrade to near-synchronous performance, hiding the real cost behind correct-but-slow execution. The floor-based hard fail at init surfaces the cost in a debuggable place.
Storing the access order in a compile spec. Rejected: compile specs are shaped for small configuration flags. The access order is per-weight structured data (kilobytes for realistic models) and belongs in a constant method, which is the existing mechanism for that shape.
No prefetch — synchronous copies only. A simpler implementation but performance would be unusable for the main use case. With no overlap, a model with many weights per layer spends most of its time blocked on PCIe transfers. Prefetch is part of the core proposal.
Speculative prefetch instead of a recorded access order. Could work without the export-time recording. Rejected because the access order through a compiled forward pass is deterministic and statically known — there is no reason to guess.
A runtime toggle to enable or disable offloading. The user already opted in at export by applying the pass. Adding a runtime toggle is surface area without a clear benefit.