Commit c872065
Add batched NVFP4 MoE GEMM for SM_100 (B200) with benchmark
Batched CUTLASS GEMM using rank-4 problem shape (M,N,K,L) for MoE
inference. Fixed max_M per expert with zero-padded activations,
CUDA-graph-friendly design. Achieves 2.0-2.4x speedup over BF16
cuBLAS on B200 across GLM-4.7 MoE shapes (8 experts, 8-128
tokens/expert). Peak 1940 TFLOPS NVFP4 vs 830 TFLOPS BF16.
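The fixed-max_M, zero-padded layout described above can be sketched in host code. Everything below (function name, shapes, dtypes) is illustrative and assumed, not the actual bitsandbytes kernel interface; it only shows why padding each expert's activations to a fixed row count gives the batched GEMM a static (M, N, K, L) problem shape, which is what makes the call CUDA-graph-capturable.

```python
import numpy as np

def pad_expert_activations(expert_tokens, max_m):
    """Hypothetical helper: pack L variable-length per-expert activation
    batches, each (m_i, K) with m_i <= max_m, into one zero-padded
    (L, max_m, K) tensor so the batched GEMM shape never changes."""
    L = len(expert_tokens)
    K = expert_tokens[0].shape[1]
    batch = np.zeros((L, max_m, K), dtype=expert_tokens[0].dtype)
    counts = np.empty(L, dtype=np.int64)
    for i, t in enumerate(expert_tokens):
        m = t.shape[0]
        batch[i, :m] = t  # valid rows; rows m..max_m stay zero
        counts[i] = m
    return batch, counts

# Zero rows multiply to zero, so running the GEMM over all max_m rows
# still yields correct results in the first counts[i] rows per expert.
rng = np.random.default_rng(0)
tokens = [rng.standard_normal((m, 16)).astype(np.float32) for m in (8, 3, 5)]
batch, counts = pad_expert_activations(tokens, max_m=8)
weights = rng.standard_normal((3, 16, 32)).astype(np.float32)  # (L, K, N)
out = np.einsum("lmk,lkn->lmn", batch, weights)  # static-shape batched GEMM
```

Because the shapes are fixed at `max_m` regardless of the real token counts, the same captured graph can be replayed every step; only the padded rows (which contribute zeros) are wasted work.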
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent: f8aa403
7 files changed
Lines changed: 1096 additions & 6 deletions
File tree
- benchmarks
- bitsandbytes
- backends/cuda
- nn
- csrc/qutlass