
[JAX] Warmup FFIs with "initialize" stage #2800

Open
jberchtold-nvidia wants to merge 5 commits into NVIDIA:main from jberchtold-nvidia:jberchtold/warmup-xla-ffis

Conversation

@jberchtold-nvidia (Collaborator) commented Mar 25, 2026

Description

Add an "initialize" stage to the TE FFIs that didn't previously have one.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add initialize FFI handlers in JAX .cpp extensions
  • Register them as "initialize" stage in pybind.cpp

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
@jberchtold-nvidia jberchtold-nvidia marked this pull request as draft March 25, 2026 21:19
@jberchtold-nvidia (Collaborator, Author) commented:

/te-ci L1 jax

greptile-apps (Bot, Contributor) commented Mar 25, 2026

Greptile Summary

This PR adds XLA FFI kInitialize-stage handlers to JAX transformer engine extensions that previously lacked them, enabling CUDA graph warmup via wrapInStreamCapture. Each new initialize function is a one-line trampoline that captures and immediately destroys a CUDA graph to prime cuBLAS/cuDNN state before any live graph capture occurs. All signatures match their execute counterparts; the previously-reported GroupedGemmV2InitializeFFI argument mismatch has been resolved.
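The one-line trampoline pattern described above might look roughly like the following. This is a hedged sketch: the handler and argument names (GemmFFI, Buffer_Type, Result_Type) are illustrative stand-ins, and only wrapInStreamCapture is a name taken from the summary itself.

```cpp
// Sketch only: Buffer_Type/Result_Type/Error_Type stand in for the XLA FFI
// argument types used by the real handlers; GemmFFI is the existing execute
// handler the trampoline would wrap.
Error_Type GemmInitializeFFI(cudaStream_t stream, Buffer_Type lhs,
                             Buffer_Type rhs, Result_Type out) {
  // Same signature as the execute handler; the body just runs it once inside
  // a throwaway stream capture to prime cuBLAS/cuDNN state before any live
  // CUDA graph capture occurs.
  return wrapInStreamCapture(GemmFFI, stream, lhs, rhs, out);
}
```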

Confidence Score: 5/5

Safe to merge; all initialize handlers are correct one-line trampolines with matching signatures and no logic changes to execute paths.

No P0 or P1 issues found. All initialize FFI signatures were verified against their execute counterparts across all 7 changed files. The GroupedGemmV2 argument-mismatch flagged in a prior review is confirmed resolved. No remaining findings rise above P2.

No files require special attention.

Vulnerabilities

No security concerns identified. The PR only adds warmup trampolines that begin a relaxed-mode CUDA stream capture, call the existing execute FFI, and immediately discard the resulting graph via cudaGraphDestroy. No new I/O, memory ownership transfer, or untrusted input parsing is introduced.
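The capture-and-discard flow described here could be sketched as below. The signature and template shape are assumptions; the real wrapInStreamCapture in the TE sources may differ.

```cpp
#include <cuda_runtime.h>
#include <utility>

// Hedged sketch of the warmup helper: begin a relaxed-mode stream capture,
// run the execute handler once (triggering lazy cuBLAS/cuDNN initialization),
// then end the capture and immediately destroy the resulting graph.
template <typename Fn, typename... Args>
Error_Type wrapInStreamCapture(Fn&& execute, cudaStream_t stream, Args&&... args) {
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed);
  Error_Type err = execute(stream, std::forward<Args>(args)...);
  cudaGraph_t graph = nullptr;
  cudaStreamEndCapture(stream, &graph);
  if (graph != nullptr) cudaGraphDestroy(graph);  // warmup graph is discarded
  return err;
}
```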

Important Files Changed

  • transformer_engine/jax/csrc/extensions.h: Adds XLA_FFI_DECLARE_HANDLER_SYMBOL declarations for 15 new initialize handlers; purely mechanical, no logic.
  • transformer_engine/jax/csrc/extensions/softmax.cpp: Adds 6 softmax initialize handlers; ScaledMaskedSoftmaxBackwardInitialize correctly reuses ScaledSoftmaxBackwardInitializeFFI, matching the existing execute-handler design.
  • transformer_engine/jax/csrc/extensions/quantization.cpp: Adds DBiasQuantizeInitializeFFI and DequantizeInitializeFFI with correct signatures; GroupedQuantize is intentionally left without an initialize handler.
  • transformer_engine/jax/csrc/extensions/attention.cpp: Adds FusedAttnForward/BackwardInitializeFFI with the correct RemainingArgs variadic slot; registered alongside the existing CudnnHandleInitHandler prepare stage.
  • transformer_engine/jax/csrc/extensions/gemm.cpp: Adds GemmInitializeFFI, GemmV2InitializeFFI, and GroupedGemmV2InitializeFFI; all signatures match their execute counterparts; the prior GroupedGemmV2 argument mismatch is resolved.
  • transformer_engine/jax/csrc/extensions/router.cpp: Adds 4 MoE router initialize handlers (TopK fwd/bwd, AuxLoss fwd/bwd); all signatures match their execute counterparts exactly.
  • transformer_engine/jax/csrc/extensions/pybind.cpp: Converts 12 bare FFI registrations to initialize+execute dicts; the intentional omissions for GroupedGemm and GroupedQuantize are consistent with the existing non-graph-safe design.
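The pybind.cpp change presumably follows JAX's stage-keyed dict convention, where jax.ffi.register_ffi_target accepts a dict mapping XLA FFI stage names to handler capsules. A hypothetical sketch (EncapsulateFFI and the handler names are assumptions, not the exact TE code):

```cpp
// Before: registrations exposed a bare execute-handler capsule.
// After: a dict keyed by FFI stage, so JAX registers both handlers.
pybind11::dict GemmRegistration() {
  pybind11::dict d;
  d["initialize"] = EncapsulateFFI(GemmInitializeHandler);  // runs once, warmup
  d["execute"] = EncapsulateFFI(GemmHandler);               // runs every call
  return d;
}
```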

Sequence Diagram

sequenceDiagram
    participant JAX as JAX Runtime
    participant Init as *InitializeFFI
    participant WSC as wrapInStreamCapture
    participant Exec as *ExecuteFFI (called internally)
    participant CUDA as CUDA Driver

    Note over JAX,CUDA: kInitialize stage (warmup — once per model)
    JAX->>Init: call initialize handler
    Init->>WSC: wrapInStreamCapture(ExecuteFFI, stream, args...)
    WSC->>CUDA: cudaStreamBeginCapture(stream, Relaxed)
    WSC->>Exec: ExecuteFFI(stream, args...) — warms cuBLAS/cuDNN state
    Exec-->>WSC: Error_Type
    WSC->>CUDA: cudaStreamEndCapture → cudaGraph_t
    WSC->>CUDA: cudaGraphDestroy (graph discarded)
    WSC-->>Init: Error_Type
    Init-->>JAX: return

    Note over JAX,CUDA: kExecute stage (every inference step)
    JAX->>Exec: ExecuteFFI(stream, args...) — now graph-capture-safe

Reviews (4): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

@jberchtold-nvidia jberchtold-nvidia marked this pull request as ready for review March 30, 2026 19:28
phu0ngng previously approved these changes Apr 1, 2026
@phu0ngng (Collaborator) commented Apr 6, 2026

LGTM!

@phu0ngng (Collaborator) commented Apr 7, 2026

/te-ci JAX L1

(Outdated comment thread on transformer_engine/jax/csrc/extensions/gemm.cpp)
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
@jberchtold-nvidia (Collaborator, Author) commented:

/te-ci L1 jax
