Summary
Add support for tensor core matrix multiply-accumulate (MMA) operations on modern NVIDIA GPUs.
Motivation
Tensor cores provide much higher matrix-math throughput than the regular FP32 pipeline (e.g., a warp-wide 16x16x16 matrix multiply-accumulate in a single instruction). This is essential for competitive matmul performance on modern GPUs (Volta and later).
Design considerations
- The WMMA (Warp Matrix Multiply-Accumulate) API operates on opaque matrix fragments
- Fragment types: a (M×K), b (K×N), c/d (M×N)
- Supported shapes (M×N×K): 16×16×16, 32×8×16, 8×32×16 (varies by GPU architecture)
- Need words for: load fragment, store fragment, MMA compute
- Need words for: load fragment, store fragment, MMA compute
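The three proposed words correspond to the load/compute/store sequence of the CUDA WMMA API. A minimal sketch of what they would lower to, assuming f16 inputs, f32 accumulation, and the 16×16×16 shape (the kernel name and pointer parameters are illustrative):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of d = a * b + c.
// a: row-major f16 (16x16), b: col-major f16 (16x16), c/d: row-major f32 (16x16).
__global__ void wmma_16x16x16(const half *a, const half *b,
                              const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, a, 16);            // load fragment (leading dim 16)
    wmma::load_matrix_sync(b_frag, b, 16);            // load fragment
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // MMA compute: c += a * b

    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
}
```

Note that every call is warp-collective: all 32 lanes must execute it with the same arguments, which constrains where these words may appear relative to divergent control flow.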
Implementation notes
- Maps to nvvm.wmma or nvvm.mma intrinsics
- Requires floating-point support (depends on f16/f32 types)
- Fragment storage is distributed across warp lanes; each lane holds an opaque, architecture-dependent slice of the tile
- This is a significant feature that may need its own design document
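Because fragment storage is distributed and its lane-to-element mapping is architecture-dependent, any word that touches fragment contents elementwise must go through the portable `num_elements`/`x[]` interface rather than assuming a layout. A sketch (the function name is illustrative):

```cuda
#include <mma.h>
using namespace nvcuda;

// Scale an accumulator fragment in place. Each lane owns an
// architecture-dependent subset of the 16x16 tile; only the
// per-lane element count (num_elements) is portable.
__device__ void scale_fragment(
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> &frag,
    float alpha) {
    for (int i = 0; i < frag.num_elements; ++i)
        frag.x[i] *= alpha;   // lane-local slice of the tile
}
```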
Priority
Nice to have — needed for peak matmul performance but not for correctness.