
[CUDA] Fallback QMM #3315

Open
zcbenz wants to merge 2 commits into ml-explore:main from zcbenz:qmm-naive

Conversation


@zcbenz zcbenz commented Mar 25, 2026

This is an implementation of QMM that is supposed to work on all devices and with all options. It is slow and mainly serves as a fallback for the cases we are not optimizing for.
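For orientation, a naive fallback QMM amounts to dequantizing each packed weight on the fly inside the inner product loop. The CPU sketch below illustrates the idea only; the group size, the affine `scale`/`bias` parameters, and the K-major INT4 packing shown are assumptions for illustration, not the exact layout this kernel uses:

```cpp
#include <cstdint>

// Naive reference for quantized matmul (QMM): Y = X * dequant(W)^T.
// W holds packed INT4 values (two 4-bit weights per byte) with per-group
// affine parameters along K: w = scale * q + bias.
constexpr int kGroupSize = 32;  // illustrative choice

void qmm_naive(const float* x,           // [M, K] activations
               const uint8_t* w_packed,  // [N, K/2] packed INT4 weights
               const float* scales,      // [N, K/kGroupSize]
               const float* biases,      // [N, K/kGroupSize]
               float* y,                 // [M, N] output
               int M, int N, int K) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) {
        // Unpack the k-th 4-bit weight of output row n.
        uint8_t byte = w_packed[n * (K / 2) + k / 2];
        int q = (k % 2 == 0) ? (byte & 0xF) : (byte >> 4);
        // Dequantize with the group's affine parameters.
        int g = k / kGroupSize;
        float w = scales[n * (K / kGroupSize) + g] * q +
                  biases[n * (K / kGroupSize) + g];
        acc += x[m * K + k] * w;
      }
      y[m * N + n] = acc;
    }
  }
}
```

A GPU version follows the same structure with one thread (or warp) per output element; the cost of unpacking and dequantizing in the inner loop is why it reaches only a fraction of cuBLAS bandwidth.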

On A100 it achieves only about half the memory bandwidth of the pipelined version.

FP16xINT4 with K-major weights:

| M | N | K | QMM (GiB/s) | CUBLAS (GiB/s) | QMM (TFlop/s) | CUBLAS (TFlop/s) | Speedup (x) |
|---|---|---|---|---|---|---|---|
| 16 | 4096 | 4096 | 157.1 | 1388.4 | 8.69 | 22.04 | 0.39 |
| 16 | 16384 | 16384 | 288.1 | 1659.6 | 16.27 | 26.50 | 0.61 |
| 64 | 4096 | 4096 | 138.1 | 1295.7 | 28.28 | 80.41 | 0.35 |
| 64 | 16384 | 16384 | 227.4 | 1575.9 | 50.34 | 100.07 | 0.50 |

FP16xINT4 with N-major weights:

| M | N | K | QMM (GiB/s) | CUBLAS (GiB/s) | QMM (TFlop/s) | CUBLAS (TFlop/s) | Speedup (x) |
|---|---|---|---|---|---|---|---|
| 16 | 4096 | 4096 | 172.6 | 1590.6 | 9.55 | 25.25 | 0.38 |
| 16 | 16384 | 16384 | 346.6 | 1596.7 | 19.58 | 25.50 | 0.77 |
| 64 | 4096 | 4096 | 100.7 | 1389.3 | 20.63 | 86.22 | 0.24 |
| 64 | 16384 | 16384 | 187.9 | 1491.8 | 41.60 | 94.74 | 0.44 |

This PR does not implement 3/5/6-bit quants or arbitrary K sizes; I'll do those in follow-up PRs.

Also, building the kernels takes a lot of RAM, so I have switched to a large ARM runner with swap enabled for running tests.

