Extend regular NAX tuning to gen-17 g devices by lentil32 · Pull Request #3295 · ml-explore/mlx

lentil32 · 2026-03-22T12:17:19Z

Proposed changes

Extend the shared regular NAX GEMM selector in mlx/backend/metal/matmul.cpp while preserving the existing tuned s/c/d path.
Apply the same tuned 64x128x(64|256), wm=2, wn=4, swizzle=2 route to M5-generation g devices for float16 / bfloat16, which is the path relevant to issue #3196 ("NA/M5 addmm / matmul on MPS ~10–20% slower than PyTorch for 1280×1280 BF16 (hurts Nano Chat training)").
Keep tile and swizzle selection in a single helper.
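The routing described above can be sketched roughly as follows. This is a minimal illustration, not the code in mlx/backend/metal/matmul.cpp: the names `NaxTile`, `ArchInfo`, `parse_arch`, and `select_regular_nax_tile` are hypothetical, and the sketch assumes that 64x128x(64|256) means bm=64, bn=128, and bk equal to either 64 or 256. The PR description does not say what decides between the two bk values or whether the existing s/c/d path is gated on dtype, so the sketch takes the bk choice as a caller-supplied flag and applies the same dtype gate to both paths. Anything that does not match returns an empty optional, standing in for whatever default the real selector uses.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical tile description; field names are illustrative only.
struct NaxTile {
  int bm, bn, bk; // block tile sizes
  int wm, wn;     // warp (simdgroup) grid
  int swizzle;    // swizzle setting
};

enum class DType { Float16, BFloat16, Float32 };

// Generation number and trailing class letter of an architecture string.
struct ArchInfo {
  int gen;
  char cls;
};

// Parses strings shaped like "applegpu_g17s" into {17, 's'}.
// Returns nullopt when the string does not end in "_g<digits><letter>".
std::optional<ArchInfo> parse_arch(const std::string& arch) {
  const std::size_t pos = arch.rfind("_g");
  if (pos == std::string::npos) {
    return std::nullopt;
  }
  std::size_t i = pos + 2;
  const std::size_t first_digit = i;
  int gen = 0;
  while (i < arch.size() && std::isdigit(static_cast<unsigned char>(arch[i]))) {
    gen = gen * 10 + (arch[i] - '0');
    ++i;
  }
  // Require at least one digit followed by exactly one class letter.
  if (i == first_digit || i + 1 != arch.size()) {
    return std::nullopt;
  }
  return ArchInfo{gen, arch[i]};
}

// Single helper that owns both tile and swizzle selection.
// `large_bk` is an assumed stand-in for the unspecified 64-vs-256 rule.
std::optional<NaxTile> select_regular_nax_tile(
    const ArchInfo& arch, DType dtype, bool large_bk) {
  const bool half_type =
      dtype == DType::Float16 || dtype == DType::BFloat16;
  // Classes that already took the tuned route before this change.
  const bool tuned_before =
      arch.cls == 's' || arch.cls == 'c' || arch.cls == 'd';
  // Class added by this change: gen-17 g devices.
  const bool tuned_now = arch.gen == 17 && arch.cls == 'g';

  if (half_type && (tuned_before || tuned_now)) {
    return NaxTile{64, 128, large_bk ? 256 : 64, 2, 4, 2};
  }
  return std::nullopt; // caller falls back to its untuned default
}
```

Under these assumptions, `parse_arch("applegpu_g17g")` followed by `select_regular_nax_tile(..., DType::BFloat16, false)` yields the 64x128x64 tile, the same result an s device gets.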

This may address #3196 on M5 devices that route through the g regular NAX path.

Benchmarks

Measured manually on an Apple M5 Pro with the issue-shaped BF16 1280x1280 addmm / matmul microbenchmark (30 warmup iterations, 1000 timed iterations), using explicit BF16 flags.
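For readers who want to reproduce the measurement shape (30 warmup iterations, then 1000 timed iterations), a generic timing loop might look like the sketch below. It is not the script used for the numbers in this PR; `mean_ms` is a hypothetical helper, and the workload is left as a caller-supplied callable. Because MLX evaluates lazily, the callable would need to force evaluation of the addmm / matmul result on each call for the timing to be meaningful.

```cpp
#include <chrono>
#include <functional>

// Runs `work` `warmup` times untimed, then `iters` times timed,
// and returns the mean wall-clock time per timed call in milliseconds.
double mean_ms(
    const std::function<void()>& work, int warmup = 30, int iters = 1000) {
  for (int i = 0; i < warmup; ++i) {
    work();
  }
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    work();
  }
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::milli> total = stop - start;
  return total.count() / iters;
}
```

Timing the whole loop and dividing by the iteration count keeps clock overhead out of each sample, at the cost of hiding per-iteration variance.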

This machine reports architecture = applegpu_g17s, so the real-device path is effectively unchanged because it already takes the existing tuned route:

addmm: 0.3073 ms -> 0.3059 ms (-0.5%)
matmul: 0.3108 ms -> 0.3099 ms (-0.3%)

To validate the new g route directly, I reran the same workload with MLX_METAL_GPU_ARCH=applegpu_g17g:

addmm: 0.3620 ms -> 0.3132 ms (-13.5%, 1.16x faster)
matmul: 0.3495 ms -> 0.3208 ms (-8.2%, 1.09x faster)

These results suggest the change improves the targeted g path, but I have not yet validated it on a real device that reports applegpu_g17g.

Checklist

Put an x in the boxes that apply.

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
I have updated the necessary documentation (if needed)

Extend regular NAX tuning to gen-17 g devices

ebc1ab4

lentil32 force-pushed the fix-3196-m5-regular-nax-routing branch from 521be5c to ebc1ab4 on March 22, 2026 12:36

zcbenz requested a review from jagrit06 March 22, 2026 23:11

lentil32 wants to merge 1 commit into ml-explore:main from lentil32:fix-3196-m5-regular-nax-routing
