Commit c872065

TimDettmers and claude committed
Add batched NVFP4 MoE GEMM for SM_100 (B200) with benchmark
Batched CUTLASS GEMM using rank-4 problem shape (M,N,K,L) for MoE inference. Fixed max_M per expert with zero-padded activations, CUDA-graph-friendly design.

Achieves 2.0-2.4x speedup over BF16 cuBLAS on B200 across GLM-4.7 MoE shapes (8 experts, 8-128 tokens/expert). Peak 1940 TFLOPS NVFP4 vs 830 TFLOPS BF16.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
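The fixed-max_M padding described in the commit message can be illustrated with a small CPU-only model. This is a sketch, not the CUTLASS kernel from this commit: the function names, the [L, N, K] weight layout, and the toy sizes are assumptions made for illustration. It shows why zero-padded rows are harmless (they produce all-zero output rows that the caller can ignore) while every expert's GEMM keeps the same (M,N,K) shape, which is presumably what keeps the launch CUDA-graph-friendly.

```python
# Illustrative only: plain-Python model of a batched (M, N, K, L) GEMM with
# per-expert zero padding. All names here are hypothetical, not from the commit.

def pad_expert_activations(tokens_per_expert, max_m, k):
    """Pad each expert's [m_i, K] activation rows with zero rows up to max_m."""
    padded = []
    for rows in tokens_per_expert:
        if len(rows) > max_m:
            raise ValueError("expert received more tokens than max_m")
        zero_rows = [[0.0] * k for _ in range(max_m - len(rows))]
        padded.append([list(r) for r in rows] + zero_rows)
    return padded  # shape [L, max_m, K]


def batched_gemm(a, b):
    """C[l][m][n] = sum_k A[l][m][k] * B[l][n][k]; one GEMM per batch index l."""
    out = []
    for a_l, b_l in zip(a, b):
        out.append(
            [[sum(x * w for x, w in zip(row, col)) for col in b_l] for row in a_l]
        )
    return out  # shape [L, max_m, N]


# Two experts (L=2), K=2, N=2, with 1 and 2 routed tokens; max_m fixed at 3.
activations = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]]
weights = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 0.0]]]  # assumed [L, N, K]
MAX_M, K = 3, 2

padded = pad_expert_activations(activations, MAX_M, K)
result = batched_gemm(padded, weights)

# Keep only the rows that correspond to real tokens for each expert.
valid = [result[l][: len(activations[l])] for l in range(len(activations))]
```

Here `result` always has MAX_M rows per expert regardless of routing, and `valid` recovers the rows for real tokens. The trade-off is wasted work on padded rows when experts receive far fewer than max_M tokens.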
1 parent f8aa403 commit c872065

7 files changed

Lines changed: 1096 additions & 6 deletions

CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -301,6 +301,7 @@ if(BUILD_CUDA)
     set(_NVFP4_SM100_SOURCES
         csrc/qutlass/gemm_nvfp4_sm100.cu
         csrc/qutlass/gemm_nvfp4_grouped_sm100.cu
+        csrc/qutlass/gemm_nvfp4_moe_sm100.cu
     )
 
     add_library(nvfp4_sm100a OBJECT ${_NVFP4_SM100_SOURCES})
