Summary
Add support for tensor core matrix multiply-accumulate (MMA) operations on modern NVIDIA GPUs.
Motivation
Tensor cores provide much higher matrix-math throughput than the regular FP32 pipeline (e.g., a warp-wide 16x16x16 matrix multiply-accumulate in a single instruction). This is essential for competitive matmul performance on modern GPUs (Volta and later).
Design considerations
- The WMMA (Warp Matrix Multiply-Accumulate) API operates on opaque matrix fragments
- Fragment types: a (M×K), b (K×N), c/d (M×N)
- Supported shapes (M×N×K): 16×16×16, 32×8×16, 8×32×16 (varies by GPU architecture)
- Need words for: load fragment, store fragment, MMA compute
- Need words for: load fragment, store fragment, MMA compute
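The three proposed words correspond to the load/compute/store sequence of the CUDA WMMA API. A minimal sketch of what they would lower to, assuming f16 inputs, f32 accumulation, and the 16×16×16 shape (the kernel name and pointer parameters are illustrative):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of d = a * b + c.
// a: row-major f16 (16x16), b: col-major f16 (16x16), c/d: row-major f32 (16x16).
__global__ void wmma_16x16x16(const half *a, const half *b,
                              const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, a, 16);            // load fragment (leading dim 16)
    wmma::load_matrix_sync(b_frag, b, 16);            // load fragment
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // MMA compute: c += a * b

    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
}
```

Note that every call is warp-collective: all 32 lanes must execute it with the same arguments, which constrains where these words may appear relative to divergent control flow.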
Implementation notes
- Maps to nvvm.wmma or nvvm.mma intrinsics
- Requires floating-point support (depends on f16/f32 types)
- Fragment storage is distributed across warp lanes; each lane holds an opaque, architecture-dependent slice of the tile
- This is a significant feature that may need its own design document
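Because fragment storage is distributed and its lane-to-element mapping is architecture-dependent, any word that touches fragment contents elementwise must go through the portable `num_elements`/`x[]` interface rather than assuming a layout. A sketch (the function name is illustrative):

```cuda
#include <mma.h>
using namespace nvcuda;

// Scale an accumulator fragment in place. Each lane owns an
// architecture-dependent subset of the 16x16 tile; only the
// per-lane element count (num_elements) is portable.
__device__ void scale_fragment(
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> &frag,
    float alpha) {
    for (int i = 0; i < frag.num_elements; ++i)
        frag.x[i] *= alpha;   // lane-local slice of the tile
}
```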
Priority
Nice to have — needed for peak matmul performance but not for correctness.