
[CUDA] Fallback QMM #3315

Open
zcbenz wants to merge 2 commits into ml-explore:main from zcbenz:qmm-naive

Conversation


@zcbenz zcbenz commented Mar 25, 2026

This is an implementation of QMM that is supposed to work on all devices and with all options. It is slow and mainly serves as a fallback for the cases we are not optimizing for.
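For orientation, a naive fallback QMM amounts to dequantizing each packed weight on the fly inside the inner product loop. The CPU sketch below illustrates the idea only; the group size, the affine `scale`/`bias` parameters, and the K-major INT4 packing shown are assumptions for illustration, not the exact layout this kernel uses:

```cpp
#include <cstdint>

// Naive reference for quantized matmul (QMM): Y = X * dequant(W)^T.
// W holds packed INT4 values (two 4-bit weights per byte) with per-group
// affine parameters along K: w = scale * q + bias.
constexpr int kGroupSize = 32;  // illustrative choice

void qmm_naive(const float* x,           // [M, K] activations
               const uint8_t* w_packed,  // [N, K/2] packed INT4 weights
               const float* scales,      // [N, K/kGroupSize]
               const float* biases,      // [N, K/kGroupSize]
               float* y,                 // [M, N] output
               int M, int N, int K) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) {
        // Unpack the k-th 4-bit weight of output row n.
        uint8_t byte = w_packed[n * (K / 2) + k / 2];
        int q = (k % 2 == 0) ? (byte & 0xF) : (byte >> 4);
        // Dequantize with the group's affine parameters.
        int g = k / kGroupSize;
        float w = scales[n * (K / kGroupSize) + g] * q +
                  biases[n * (K / kGroupSize) + g];
        acc += x[m * K + k] * w;
      }
      y[m * N + n] = acc;
    }
  }
}
```

A GPU version follows the same structure with one thread (or warp) per output element; the cost of unpacking and dequantizing in the inner loop is why it reaches only a fraction of cuBLAS bandwidth.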

On A100 it achieves only about half the memory bandwidth of the pipelined version.

FP16xINT4 with K-major weights:

| M | N | K | QMM (GiB/s) | CUBLAS (GiB/s) | QMM (TFlop/s) | CUBLAS (TFlop/s) | Speedup (x) |
|---|---|---|---|---|---|---|---|
| 16 | 4096 | 4096 | 157.1 | 1388.4 | 8.69 | 22.04 | 0.39 |
| 16 | 16384 | 16384 | 288.1 | 1659.6 | 16.27 | 26.50 | 0.61 |
| 64 | 4096 | 4096 | 138.1 | 1295.7 | 28.28 | 80.41 | 0.35 |
| 64 | 16384 | 16384 | 227.4 | 1575.9 | 50.34 | 100.07 | 0.50 |

FP16xINT4 with N-major weights:

| M | N | K | QMM (GiB/s) | CUBLAS (GiB/s) | QMM (TFlop/s) | CUBLAS (TFlop/s) | Speedup (x) |
|---|---|---|---|---|---|---|---|
| 16 | 4096 | 4096 | 172.6 | 1590.6 | 9.55 | 25.25 | 0.38 |
| 16 | 16384 | 16384 | 346.6 | 1596.7 | 19.58 | 25.50 | 0.77 |
| 64 | 4096 | 4096 | 100.7 | 1389.3 | 20.63 | 86.22 | 0.24 |
| 64 | 16384 | 16384 | 187.9 | 1491.8 | 41.60 | 94.74 | 0.44 |

This PR does not implement 3/5/6-bit quants or arbitrary K sizes; I'll do those in follow-up PRs.

Also, building the kernels takes a lot of RAM, so I have switched to a large ARM runner with swap enabled for running tests.

