
add bp with patches benchmark #6712

Merged
a10y merged 1 commit into develop from aduffy/bp-patches-ench on Mar 2, 2026

Conversation

@a10y
Contributor

@a10y a10y commented Feb 27, 2026

We have a benchmark for unpatched bitpacking, but I wanted a baseline as I'm refining #6708.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
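
For context, the patch-fraction cases in the results below (0.1%, 1%, 5%, 10%) could be generated with a helper like this; this is an illustrative sketch, not the benchmark's actual code, and `patch_positions` is a hypothetical name:

```rust
// Hypothetical sketch: pick benchmark inputs at a given patch density.
const CHUNK: usize = 1024; // FastLanes vector chunk size

/// Select `fraction` of the positions in `len` values as patch positions,
/// evenly spaced so every chunk sees roughly the same patch density.
fn patch_positions(len: usize, fraction: f64) -> Vec<usize> {
    let count = ((len as f64) * fraction).round() as usize;
    if count == 0 {
        return Vec::new();
    }
    let stride = len / count;
    (0..count).map(|i| i * stride).collect()
}

fn main() {
    let len = 64 * CHUNK;
    for &frac in &[0.001, 0.01, 0.05, 0.10] {
        println!("{:>4.1}% -> {} patches", frac * 100.0, patch_positions(len, frac).len());
    }
}
```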
@a10y a10y added the changelog/chore A trivial change label Feb 27, 2026
@a10y a10y requested a review from 0ax1 February 27, 2026 16:38
@a10y
Contributor Author

a10y commented Feb 27, 2026

Results on A100 A10

```
Benchmarking bitunpack_cuda_patched_u8/bitunpack_patched/0.1%: Collecting 10 samples in estimated 5.5364 s
bitunpack_cuda_patched_u8/bitunpack_patched/0.1%
                        time:   [419.09 µs 419.24 µs 419.47 µs]
                        thrpt:  [222.03 GiB/s 222.14 GiB/s 222.22 GiB/s]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high severe
Benchmarking bitunpack_cuda_patched_u8/bitunpack_patched/1%: Collecting 10 samples in estimated 5.1481 s (3
bitunpack_cuda_patched_u8/bitunpack_patched/1%
                        time:   [746.30 µs 746.73 µs 747.27 µs]
                        thrpt:  [124.63 GiB/s 124.72 GiB/s 124.79 GiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking bitunpack_cuda_patched_u8/bitunpack_patched/5%: Collecting 10 samples in estimated 5.9989 s (3
bitunpack_cuda_patched_u8/bitunpack_patched/5%
                        time:   [1.5585 ms 1.5594 ms 1.5604 ms]
                        thrpt:  [59.683 GiB/s 59.725 GiB/s 59.758 GiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking bitunpack_cuda_patched_u8/bitunpack_patched/10%: Collecting 10 samples in estimated 5.8198 s (
bitunpack_cuda_patched_u8/bitunpack_patched/10%
                        time:   [2.0051 ms 2.0056 ms 2.0061 ms]
                        thrpt:  [46.424 GiB/s 46.435 GiB/s 46.447 GiB/s]
Found 2 outliers among 10 measurements (20.00%)
```

@codspeed-hq

codspeed-hq bot commented Feb 27, 2026

Merging this PR will improve performance by 17.64%

⚡ 1 improved benchmark
✅ 953 untouched benchmarks
⏩ 1466 skipped benchmarks¹

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|------|-----------|------|------|------------|
| Simulation | chunked_opt_bool_canonical_into[(10, 1000)] | 1.6 ms | 1.4 ms | +17.64% |

Comparing aduffy/bp-patches-ench (047280c) with develop (fc3af37)

Open in CodSpeed

Footnotes

¹ 1466 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived in CodSpeed to remove them from the performance reports.

@a10y a10y enabled auto-merge (squash) February 27, 2026 17:31
@a10y a10y added changelog/skip Do not list PR in the changelog and removed changelog/chore A trivial change labels Feb 28, 2026
@a10y a10y merged commit d63db76 into develop Mar 2, 2026
57 of 59 checks passed
@a10y a10y deleted the aduffy/bp-patches-ench branch March 2, 2026 15:55
a10y added a commit that referenced this pull request Mar 4, 2026
## Summary

This PR adds G-ALP-style patches that allow for data-parallel access.
This lets us remove the additional `execute_patches` kernel launch and
instead insert patch values inside the unpacking kernels, at the step
where we dispatch from shared to global memory.

The benchmarks added in #6712 show a significant speedup with this
change, and performance is pretty much invariant to patch count.

<img width="756" height="457" alt="image"
src="https://github.com/user-attachments/assets/0164a453-c69b-48be-abaf-81797a16f7fa"
/>

### Background

The [G-ALP](https://dl.acm.org/doi/10.1145/3736227.3736242) paper's
contribution is modifying the standard layout of "exceptions" (what we
call Patches in Vortex) to allow for fully data-parallel access. Their
target is an f32 ALP decoding kernel, but the technique is equally applicable to
our unpacking kernels.

The core insight is that storing patches in sorted order, which is great
for single-threaded execution on a super-scalar CPU, results in very
poor GPU performance.

Instead, we need to shuffle them so that they can be accessed ordered by
`(chunk, lane)`. The chunk is the normal FastLanes vector chunk size
(1024). The lane depends on the width of the type. Doing this means that
we get O(1) access to the patches within our kernel.
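
A CPU simulation of what that O(1) access can look like during write-out: assuming patches have been pre-shuffled into `(chunk, lane)` order with per-bucket offsets, each `(chunk, lane)` "thread" only walks its own small patch run. The lane mapping (`pos % CHUNK % LANES`), `LANES = 128` (for u8), and all names here are illustrative assumptions, not the actual Vortex kernel:

```rust
const CHUNK: usize = 1024; // FastLanes vector chunk size
const LANES: usize = 128;  // assumed: 1024-bit registers / 8-bit values

/// Simulate the shared -> global write-out of one chunk, overriding
/// unpacked values with patches. `offsets[chunk_idx * LANES + lane]..
/// offsets[key + 1]` bounds the patches owned by that (chunk, lane) pair,
/// sorted by position within each bucket.
fn write_out_with_patches(
    chunk_idx: usize,
    shared: &[u8; CHUNK],     // "shared memory" holding unpacked values
    patch_pos: &[usize],      // shuffled absolute patch positions
    patch_val: &[u8],         // shuffled patch values
    offsets: &[usize],        // per-(chunk, lane) bucket offsets
    global: &mut [u8],        // "global memory" destination
) {
    for lane in 0..LANES {
        let key = chunk_idx * LANES + lane;
        let mut next = offsets[key]; // this thread's patch cursor
        // Each lane owns positions lane, lane + LANES, lane + 2*LANES, ...
        let mut i = lane;
        while i < CHUNK {
            let pos = chunk_idx * CHUNK + i;
            let mut v = shared[i];
            if next < offsets[key + 1] && patch_pos[next] == pos {
                v = patch_val[next]; // patch overrides the unpacked value
                next += 1;
            }
            global[pos] = v;
            i += LANES;
        }
    }
}

fn main() {
    let shared = [1u8; CHUNK];
    // One patch: position 5 (chunk 0, lane 5) becomes 42.
    let mut offsets = vec![0usize; LANES + 1];
    for k in 6..=LANES {
        offsets[k] = 1;
    }
    let mut global = vec![0u8; CHUNK];
    write_out_with_patches(0, &shared, &[5], &[42], &offsets, &mut global);
    println!("global[5] = {}", global[5]);
}
```

Because each thread reads a contiguous, precomputed slice of patches, no binary search or serial scan over the sorted patch list is needed.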

Replicating Figure 2 from the paper:

<img width="447" height="432" alt="image"
src="https://github.com/user-attachments/assets/cb114056-e071-467f-9a17-be6e4032faf1"
/>

This diff looks large, but in reality the big pieces are:

* Adds a new `patches.h` header file with shared definitions between the
CUDA kernel and the Rust code.
* Adds a `patches.cuh` with some C++ code to make it easier to
seek/iterate over a range of patch values.
* Updates the unpacking kernels to accept an optional set of patches.

This does not change the memory format of the `Patches` type; rather, it
just does a D2H -> transpose on CPU -> H2D transformation.
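
The CPU-side transpose step could be sketched as a counting sort into `(chunk, lane)` buckets; `LANES`, the lane mapping, and all names here are illustrative assumptions rather than the actual Vortex code:

```rust
const CHUNK: usize = 1024; // FastLanes vector chunk size
const LANES: usize = 128;  // assumed: 1024-bit registers / 8-bit values

#[derive(Clone, Copy)]
struct Patch {
    pos: usize, // absolute position in the decoded output
    value: u8,
}

/// Reorder position-sorted patches into (chunk, lane) buckets and build
/// per-bucket offsets, so each GPU thread can later find its patches in O(1).
/// The counting sort is stable, so patches stay position-sorted per bucket.
fn shuffle_patches(patches: &[Patch], n_chunks: usize) -> (Vec<Patch>, Vec<usize>) {
    let n_keys = n_chunks * LANES;
    let key = |p: &Patch| {
        let chunk = p.pos / CHUNK;
        let lane = (p.pos % CHUNK) % LANES; // assumed lane mapping
        chunk * LANES + lane
    };
    // Counting sort: counts -> exclusive prefix sums -> scatter.
    let mut offsets = vec![0usize; n_keys + 1];
    for p in patches {
        offsets[key(p) + 1] += 1;
    }
    for k in 0..n_keys {
        offsets[k + 1] += offsets[k];
    }
    let mut cursor = offsets.clone();
    let mut out = vec![Patch { pos: 0, value: 0 }; patches.len()];
    for p in patches {
        let k = key(p);
        out[cursor[k]] = *p;
        cursor[k] += 1;
    }
    (out, offsets)
}

fn main() {
    let patches = [
        Patch { pos: 0, value: 7 },
        Patch { pos: 129, value: 9 },
        Patch { pos: 1024, value: 3 },
    ];
    let (out, offsets) = shuffle_patches(&patches, 2);
    // Patches for (chunk 1, lane 0) live at offsets[128]..offsets[129].
    println!("chunk 1, lane 0: pos {}", out[offsets[128]].pos);
}
```

The shuffled patches and offsets are then what gets copied back H2D for the kernels to consume.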

## Testing

Uses the existing test suite, which includes a lot of bit-unpacking with
patches.

---------

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Co-authored-by: Alexander Droste <alexander.droste@protonmail.com>