
feat: mod-builder + algebra + ecc CUDA tracegen#2639

Draft
stephenh-axiom-xyz wants to merge 78 commits into develop-v2.0.0-rc.1 from
feat/mod-builder-cuda-tracegen

Conversation

@stephenh-axiom-xyz
Contributor

No description provided.

GunaDD and others added 30 commits March 31, 2026 21:18
New Keccak with Xorin and Keccakf chip and opcode

- [ ] I have performed a self-review of my own code
- [ ] Add negative tests for xorin chip
- [ ] Add negative tests for keccakf chip
- [x] Add unit test to CI
- [x] Add new guest code for E2E test to CI (the keccak example is
updated, but I am thinking of adding another one)
- [ ] Check with Ayush if I implemented the SizedRecord trait correctly
- [ ] Rebase to include Zach's new Plonky3 update and update the keccakf
trace gen so it no longer needs to transpose before passing in the input
- [ ] Maybe add comments justifying the correctness of the
constrain_input_read and constraint_output_write functions

To reviewer: I will still have to complete the above checklist. But you
can start reviewing if you would like to.

Closes INT-5017, INT-5721, INT-5720, INT-5718, INT-5717, INT-5646,
INT-5018
Closes INT-5779

---------

Co-authored-by: Ayush Shukla <ayush@axiom.xyz>
Note that I removed the `cells_used` metric since, with the introduction
of periphery chips (both here and in the new SHA-2), it is very hard to
keep it accurate. There are better ways to get that data, and we should
add tooling for it in a separate feature.
Implemented a new design for constraining the SHA-2 family of hash
functions (specifically SHA-256, SHA-512, SHA-384). The new design adds
incremental hasher functionality, which means we can compute the hash of
a stream of bytes. More specifically, the new `Sha256`, `Sha512`, and
`Sha384` structs provided in the SHA-2 guest library provide the
`update(&[u8])` and `finalize() -> [u8; HASH_SIZE]` methods. We can
instantiate a hasher, `let mut hasher = Sha256::new()`, and call
`hasher.update(data)` on it as many times as we want. The `data`
parameter can be a slice of any size. When we would like to retrieve the
hash, we call `hasher.finalize()`.

The `Sha256` struct in the SHA-2 guest library maintains an array of
bytes that serves as the internal state of the SHA-2 hashing algorithm.
This array is updated using a new opcode: `SHA256_UPDATE dst src input`
which takes in one block of input and pointers to the src/dst hasher
states (the guest library sets `src == dst` for updating the state
in-place). The `Sha256` struct will buffer up to one block of input, and
it will call `SHA2_UPDATE` when necessary to absorb the input into the
state.

The `Sha512` and `Sha384` structs are implemented similarly.

The `Sha256`, `Sha512`, `Sha384` structs implement the `sha2::Digest`
trait, allowing them to be used as a drop-in replacement for the popular
`sha2` crate's `sha2::{Sha256, Sha512, Sha384}` hasher objects.
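As a rough illustration of the one-block buffering described above, here is a minimal Rust sketch. This is not the actual guest library: `Sha256Sketch` is a hypothetical stand-in for `Sha256`, and `absorb_block` stands in for the `SHA2_UPDATE` opcode (here it only counts absorbed blocks, so the buffering logic can be exercised on its own).

```rust
// Minimal sketch (NOT the real guest library) of the "buffer up to one
// block, absorb when full" strategy described above.
const BLOCK_SIZE: usize = 64; // SHA-256 block size in bytes

struct Sha256Sketch {
    buf: [u8; BLOCK_SIZE],
    buf_len: usize,
    blocks_absorbed: u64, // stand-in for the in-place hasher state
}

impl Sha256Sketch {
    fn new() -> Self {
        Self { buf: [0; BLOCK_SIZE], buf_len: 0, blocks_absorbed: 0 }
    }

    // Stand-in for `SHA2_UPDATE dst src input` with src == dst.
    fn absorb_block(&mut self, _block: &[u8; BLOCK_SIZE]) {
        self.blocks_absorbed += 1;
    }

    // Buffer incoming bytes; absorb whenever a full block is available.
    // `data` may be a slice of any size.
    fn update(&mut self, mut data: &[u8]) {
        while !data.is_empty() {
            let take = (BLOCK_SIZE - self.buf_len).min(data.len());
            self.buf[self.buf_len..self.buf_len + take].copy_from_slice(&data[..take]);
            self.buf_len += take;
            data = &data[take..];
            if self.buf_len == BLOCK_SIZE {
                let block = self.buf;
                self.absorb_block(&block);
                self.buf_len = 0;
            }
        }
    }
}
```

Feeding 110 bytes in two `update` calls absorbs exactly one 64-byte block and leaves 46 bytes buffered, regardless of how the input is split across calls.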

The OpenVM book, specs, and the crate docs have been updated.
Additionally, a brief justification of soundness for the main
constraints has been added.

All the SHA-2 guest library integration tests and the SHA-2 circuit unit
tests pass on CI.
Both CPU and GPU trace generation are exercised by these tests.

closes INT-4972 INT-5023 INT-5024 INT-5025 INT-5026
- change padding size to positive instead of non-negative
- change output size of sha384 to 48 bytes instead of 64
Adds execution tests and proving (only for GPU) tests for the Sha2 guest
library.
Test vectors are from
https://csrc.nist.gov/projects/cryptographic-algorithm-validation-program/secure-hashing

closes INT-5932
…2381)

The fields storing the total message length in the `Sha256` and `Sha512`
structs were both of type `usize`. To prevent overflow, the types were
changed to `u64` and `u128`, respectively, following the RustCrypto
implementation.

closes INT-6012
Added a `Keccak256` struct that provides the `tiny-keccak::Hasher`
functionality.
The actual `tiny-keccak::Hasher` trait is only implemented when the
`tiny-keccak` feature is enabled.
The `native_keccak256` function uses a static `Keccak256` instance to
reduce memory allocations.
To avoid compile errors, the default `tiny-keccak` feature must be kept
enabled when the target is not ZKVM.

towards INT-5944
Adds execution tests and proving (only for GPU) tests for the Keccak
guest library.
Test vectors are from https://keccak.team/archives.html
(https://keccak.team/obsolete/KeccakKAT-3.zip).

closes INT-5933
Made the test guest program process all of the test vectors in a single
run.
The test runtime has decreased enough that all the SHA-2 tests now run
on a single machine, as they did before the test suite was added.
The dependency on the RustCrypto sha2 crate is now gated behind the
"import_sha2" feature.
The feature is enabled by default and when it is, the `sha2::Digest`
trait is implemented for our Sha2 structs.
To avoid compile errors, the "import_sha2" feature must be kept enabled
when the target is not ZKVM.

closes INT-5999
…alize memory (#2318)

Closes INT-5723, INT-5724, INT-5726

---------

Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Resolves INT-5728, INT-5727, INT-5729.

Summary of changes:

- Everywhere the code used `Rv32HeapAdapterAir`, we now use
`Rv32VecHeapAdapterAir`.
- Everywhere the code used `Rv32HeapBranchAdapterAir`, we now use a new
`Rv32HeapBranchAdapterAir`, which accesses memory in the same way as
`Rv32VecHeapAdapterAir` but is compatible with the branch CoreAirs.
- No other code uses `Rv32HeapAdapterAir` and
`Rv32HeapBranchAdapterAir`, so the `heap.rs` and `heap_branch.rs` files
were deleted.
- The interface of `Rv32VecHeapAdapterAir` and
`Rv32HeapBranchAdapterAir` now differs from what the CoreAirs expect, so
wrappers in `vec_to_heap.rs` convert between them.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Resolves INT-5950.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Remove CUDA code from the algebra and ecc extensions, since the
production code currently uses hybrid by default. CUDA tests now use
hybrid chips instead of GPU chips.
Resolves INT-5949 INT-5952 INT-5951 INT-5948.

---------

Co-authored-by: Paul Chen <chenpaul.pc@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Resolves INT-6010.

Currently, `AccessAdapterAir` is excluded from the metered execution
checks iff `access_adapters_enabled` is true.
Cast usize pointers to u32 before to_le_bytes() to produce 4-byte arrays
compatible with BLOCK_SIZE=4 memory configuration.
Resolves INT-5725.

The `feat/access-adapter-removal` branch removes memory access adapters
from the persistent memory path. Previously, separate `AccessAdapterAir`
circuits handled the conversion between `CONST_BLOCK_SIZE=4` (the
granularity of memory bus interactions) and `CHUNK=8` (the granularity
of Merkle tree hashing/Poseidon2 digests).

The updated `PersistentBoundaryAir` eliminates this by operating on
8-byte chunks directly while tracking **per-sub-block timestamps**:

```
Old PersistentBoundaryCols:
  expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamp
                                                                         ^^^^^^^^^
                                                                      single timestamp

New PersistentBoundaryCols:
  expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamps[2]
                                                                         ^^^^^^^^^^^^^
                                                                      one per 4-byte block
```

Each 8-byte chunk contains `BLOCKS_PER_CHUNK = CHUNK / CONST_BLOCK_SIZE
= 2` sub-blocks. The boundary AIR emits **two** memory bus interactions
per row (one per 4-byte sub-block), each with its own timestamp.
Untouched sub-blocks within a touched chunk keep `timestamp=0`, which
naturally balances against the initial-state row (also at `t=0`).

---

This PR adapts the GPU trace generation pipeline to the new
per-sub-block-timestamp design. The core challenge: the CPU-side
"touched memory" partition arrives as sorted **4-byte records**, but the
boundary chip and Merkle tree need **8-byte records** with per-block
timestamps and initial-memory fill for untouched sub-blocks.

A new `merge_records` kernel converts `InRec = MemoryInventoryRecord<4,
1>` into `OutRec = MemoryInventoryRecord<8, 2>` in two phases:

1. **`cukernel_build_candidates`** — Each thread inspects one input
record. If it starts a new 8-byte output chunk (different `ptr/8` than
the previous record), it:
   - Reads the full 8-byte initial memory from device
- Overwrites the touched 4-byte sub-block with final values + timestamp
- If the next input record belongs to the same output chunk, also
patches the other sub-block
- Sets `flag[i] = 1`; otherwise `flag[i] = 0` (duplicate within same
chunk)

2. **`cukernel_scatter_compact`** — CUB `ExclusiveSum` on flags produces
output positions; flagged records are scattered into a compact output
array.
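A sequential Rust sketch of the two-phase merge may make the kernel logic concrete. This is illustrative only: the real version runs as CUDA kernels with a CUB `ExclusiveSum`, the type names (`InRec`, `OutRec`) are simplified stand-ins for `MemoryInventoryRecord<4, 1>` and `MemoryInventoryRecord<8, 2>`, the address space is omitted, and initial memory is assumed to be all zeros.

```rust
// Sequential sketch (NOT the actual CUDA code) of merge_records:
// phase 1 builds candidates + flags, phase 2 compacts via exclusive sum.
#[derive(Clone, Copy, Debug, PartialEq)]
struct InRec { ptr: u32, ts: u32, values: [u32; 4] }

#[derive(Clone, Copy, Debug, PartialEq)]
struct OutRec { ptr: u32, timestamps: [u32; 2], values: [u32; 8] }

fn merge_records(input: &[InRec]) -> Vec<OutRec> {
    let zero = OutRec { ptr: 0, timestamps: [0; 2], values: [0; 8] };
    // Phase 1: each "thread" i inspects one input record.
    let mut candidates = vec![zero; input.len()];
    let mut flags = vec![0u32; input.len()];
    for i in 0..input.len() {
        let rec = input[i];
        let chunk = rec.ptr / 8;
        // Does this record start a new 8-byte output chunk?
        if i == 0 || input[i - 1].ptr / 8 != chunk {
            // Start from the 8-byte initial memory (all zeros here),
            // then patch the touched 4-byte sub-block.
            let mut out = OutRec { ptr: chunk * 8, ..zero };
            let b = ((rec.ptr % 8) / 4) as usize;
            out.values[b * 4..b * 4 + 4].copy_from_slice(&rec.values);
            out.timestamps[b] = rec.ts;
            // If the next input record is in the same chunk, patch the
            // other sub-block too.
            if let Some(next) = input.get(i + 1) {
                if next.ptr / 8 == chunk {
                    let b2 = ((next.ptr % 8) / 4) as usize;
                    out.values[b2 * 4..b2 * 4 + 4].copy_from_slice(&next.values);
                    out.timestamps[b2] = next.ts;
                }
            }
            candidates[i] = out;
            flags[i] = 1; // otherwise 0: duplicate within the same chunk
        }
    }
    // Phase 2: exclusive prefix sum on flags gives output positions;
    // flagged candidates are scattered into a compact output array.
    let mut positions = vec![0usize; input.len()];
    let mut total = 0usize;
    for i in 0..input.len() {
        positions[i] = total;
        total += flags[i] as usize;
    }
    let mut out = vec![zero; total];
    for i in 0..input.len() {
        if flags[i] == 1 {
            out[positions[i]] = candidates[i];
        }
    }
    out
}
```

Running this on the three-record example worked through below yields two merged records: `{ptr=0, timestamps=[5,10], values=[1..8]}` and `{ptr=16, timestamps=[3,0], values=[9,0,...,0]}`.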

The `BoundaryRecord` struct is parameterized on `BLOCKS`:

```c++
template <size_t CHUNK, size_t BLOCKS> struct BoundaryRecord {
    uint32_t address_space;
    uint32_t ptr;
    uint32_t timestamps[BLOCKS];   // was: uint32_t timestamp;
    uint32_t values[CHUNK];
};
```

The persistent trace gen kernel writes `timestamps=[0,0]` for initial
rows and the actual per-block timestamps for final rows.

The Rust side:
1. Converts `TimestampedEquipartition<F, CONST_BLOCK_SIZE>` into
GPU-compatible `MemoryInventoryRecord<4,1>` structs
2. Uploads to device and calls the merge kernel
3. Sends merged records to the boundary chip
(`finalize_records_persistent_device`)
4. Converts merged records into Merkle records with `timestamp =
max(timestamps[0], timestamps[1])` for the tree update
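Step 4 above can be sketched as follows. The record types here are hypothetical stand-ins (address space and hash omitted), not the actual crate's structs; only the `max(timestamps[0], timestamps[1])` collapse is the point.

```rust
// Illustrative sketch (NOT the real types) of collapsing per-block
// timestamps into a single Merkle timestamp per merged record.
struct MergedRec { ptr: u32, timestamps: [u32; 2], values: [u32; 8] }
struct MerkleRec { ptr: u32, timestamp: u32, values: [u32; 8] }

fn to_merkle(recs: &[MergedRec]) -> Vec<MerkleRec> {
    recs.iter()
        .map(|r| MerkleRec {
            ptr: r.ptr,
            // The tree update only needs the latest write time per chunk.
            timestamp: r.timestamps[0].max(r.timestamps[1]),
            values: r.values,
        })
        .collect()
}
```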

---

Suppose the VM touches 3 memory cells at `CONST_BLOCK_SIZE=4`
granularity:

| addr_space | ptr | timestamp | values          |
|------------|-----|-----------|-----------------|
| 1          | 0   | 5         | [1, 2, 3, 4]   |
| 1          | 4   | 10        | [5, 6, 7, 8]   |
| 1          | 16  | 3         | [9, 0, 0, 0]   |

Initial memory is all zeros.

```
InRec[0]: { as=1, ptr=0,  timestamps=[5],  values=[1,2,3,4] }
InRec[1]: { as=1, ptr=4,  timestamps=[10], values=[5,6,7,8] }
InRec[2]: { as=1, ptr=16, timestamps=[3],  values=[9,0,0,0] }
```

Each thread computes `output_chunk = ptr / 8`:

| idx | ptr | output_chunk | same as prev? | flag |
|-----|-----|-------------|---------------|------|
| 0   | 0   | 0           | N/A (first)   | 1    |
| 1   | 4   | 0           | yes           | 0    |
| 2   | 16  | 2           | no            | 1    |

**Thread 0** (flag=1): Builds an `OutRec` for chunk `ptr=0`:
- Reads initial memory `[0,0,0,0,0,0,0,0]` from device
- `block_idx = (0 % 8) / 4 = 0` → patches `values[0..4] = [1,2,3,4]`,
`timestamps[0] = 5`
- Next record (idx=1) is same chunk: `block_idx2 = (4 % 8) / 4 = 1` →
patches `values[4..8] = [5,6,7,8]`, `timestamps[1] = 10`
- Result: `{ as=1, ptr=0, timestamps=[5,10], values=[1,2,3,4,5,6,7,8] }`

**Thread 1** (flag=0): Skipped (same output chunk as thread 0).

**Thread 2** (flag=1): Builds an `OutRec` for chunk `ptr=16`:
- Reads initial memory `[0,0,0,0,0,0,0,0]` from device
- `block_idx = (16 % 8) / 4 = 0` → patches `values[0..4] = [9,0,0,0]`,
`timestamps[0] = 3`
- No next record → `timestamps[1] = 0`, `values[4..8] = [0,0,0,0]` (from
initial memory)
- Result: `{ as=1, ptr=16, timestamps=[3,0], values=[9,0,0,0,0,0,0,0] }`

```
flags     = [1, 0, 1]
positions = [0, 1, 1]  (exclusive prefix sum)

out[0] = thread 0's record
out[1] = thread 2's record
out_num_records = 2
```

| Row | expand_dir | as | leaf_label | values            | hash          | timestamps |
|-----|------------|----|------------|-------------------|---------------|------------|
| 0   | +1 (init)  | 1  | 0          | [0,0,0,0,0,0,0,0] | H([0,..,0])   | [0, 0]     |
| 1   | -1 (final) | 1  | 0          | [1,2,3,4,5,6,7,8] | H([1,..,8])   | [5, 10]    |
| 2   | +1 (init)  | 1  | 2          | [0,0,0,0,0,0,0,0] | H([0,..,0])   | [0, 0]     |
| 3   | -1 (final) | 1  | 2          | [9,0,0,0,0,0,0,0] | H([9,0,..,0]) | [3, 0]     |

Each final row generates **two** memory bus sends:
- Row 1: send `(as=1, ptr=0, values=[1,2,3,4], ts=5)` and `(as=1, ptr=4,
values=[5,6,7,8], ts=10)`
- Row 3: send `(as=1, ptr=16, values=[9,0,0,0], ts=3)` and `(as=1,
ptr=20, values=[0,0,0,0], ts=0)`

The `ts=0` sends from initial rows balance against the `ts=0` sub-blocks
of the final rows for untouched memory, eliminating the need for access
adapters.

For the Merkle tree, each record uses a single timestamp =
`max(timestamps[0], timestamps[1])`:

| as | ptr | merkle_ts | values                    |
|----|-----|-----------|---------------------------|
| 1  | 0   | 10        | [1,2,3,4,5,6,7,8]        |
| 1  | 16  | 3         | [9,0,0,0,0,0,0,0]        |

These feed into the existing `update_with_touched_blocks` for Merkle
root computation.

---

- **`merkle_tree/mod.rs`**: Added `MERKLE_TOUCHED_BLOCK_WIDTH = 3 +
DIGEST_WIDTH` constant (distinct from `TIMESTAMPED_BLOCK_WIDTH = 3 +
CONST_BLOCK_SIZE`) since the Merkle tree now consumes 8-value records
directly. Also fixed a potential `ilog2(0)` panic in
`calculate_unpadded_height`.
- **New test** `test_empty_touched_memory_uses_full_chunk_values`
validates that the empty-partition edge case correctly reads initial
memory at `DIGEST_WIDTH` granularity and produces a matching Merkle root
vs CPU.

---------

Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Co-authored-by: Golovanov399 <Golovanov12345@gmail.com>
…tentBoundary (#2453)

The read_initial_chunk function in inventory.cu was storing raw byte/u16
values from initial memory without converting them to BabyBear
Montgomery form. These values end up in PersistentBoundaryAir records,
where boundary.cu's final row reads them via from_raw_array (which
expects Montgomery form). Meanwhile, the initial row correctly converts
bytes via from_u8_array. This mismatch caused a bus balancing failure
for untouched memory blocks in CUDA-only test_sha2.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Metered execution now tracks the block hasher chip's height growth,
matching the pattern used in keccak256 for its permutation chip.

closes INT-5931

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

github-actions bot commented Mar 31, 2026

Claude finished @stephenh-axiom-xyz's task in 2m 45s


Code Review: mod-builder + algebra + ecc CUDA tracegen

  • Gather context and understand PR scope
  • Review mod-builder CUDA changes
  • Review algebra CUDA tracegen
  • Review ECC CUDA tracegen
  • Review Rust-side chip/extension changes
  • Review test changes
  • Post final review

This PR adds specialized CUDA trace generation kernels for mod-builder, algebra (add/sub, mul/div, fp2 variants, is_eq), and ECC (point addition, point doubling) operations. It replaces the previous hybrid CPU/GPU approach with dedicated GPU implementations. Overall the implementation is well-structured and follows consistent patterns, but I found several issues worth addressing.


Critical Issues

1. Array out-of-bounds in is_eq.cu lines 77-78

uint32_t b_lt_diff = modulus[b_diff] - record.b[b_diff];  // Line 77
uint32_t c_lt_diff = modulus[c_diff] - record.c[c_diff];  // Line 78

When b_diff or c_diff equals READ_LIMBS (no differing limb found), these accesses are out of bounds. Lines 70 and 74 use % READ_LIMBS to mask the index, but lines 77-78 use the raw index. This is a buffer overread.


2. Potential integer overflow in workspace calculation (chip.rs:220-221)

let base_bytes = (input_size + var_size + total_carry_count) * size_of::<u32>()
    + num_u32_flags as usize * size_of::<bool>();

This multiplication chain could overflow before being used. Later workspace calculations (lines 585-590) use .checked_mul, but this one doesn't. Consider using checked arithmetic here as well.


High-Priority Issues

3. Uninitialized limbs in BigUintGpu::mod_sub (bigint_ops.cuh:281-295)

The result is initialized with result.num_limbs = prime.num_limbs, but the loop iterates over n = max(prime.num_limbs, max(num_limbs, other.num_limbs)). If n > prime.num_limbs, the result may contain uninitialized limbs beyond result.num_limbs.

4. Carry bits underflow risk in algebra constraint kernels

In addsub.cu:280-282 and similar locations:

uint32_t carry_bits = result.max_overflow_bits - 8;

If max_overflow_bits < 8, this unsigned subtraction wraps around. Add an assertion or guard.

5. Division by zero in EC add_ne (add_ne.cu:208-210)

BigUintGpu lambda = tmp0.mod_div(tmp1, prime, barrett_mu);

Where tmp1 = x2 - x1. The "NE" (Not Equal) name implies x1 != x2, but there's no runtime validation. While the test harness avoids this case, adding a debug assertion would prevent silent corruption if the invariant is violated.

6. Hardcoded array sizes for q_limbs/carry_limbs in ECC cuda.rs

In extensions/ecc/circuit/src/cuda.rs:237-245:

let q_limbs = [builder.q_limbs[0] as u32, builder.q_limbs[1] as u32, builder.q_limbs[2] as u32];

These hardcode size 3, which silently truncates if the vectors are longer. The assertions at lines 223-225 help, but a TryInto<[u32; 3]> conversion would be safer.


Medium-Priority Issues

7. format!("{expr:?}") for expression deduplication (chip.rs:559-578)

Using Debug formatting as a dedup key is fragile (format could change across Rust versions) and slow for large expressions. Consider a structural hash.

8. Inconsistent rem() vs mod_reduce() usage

fp2_addsub.cu:143-147 uses .rem(prime) for addition results, while fp2_muldiv.cu and others use .mod_reduce(prime, barrett_mu). If intentional (rem is sufficient for addition), document why.

9. Missing BN254 test case for is_equal GPU test

modular_chip/tests.rs has GPU test cases for secp256k1_coord, secp256k1_scalar, and BLS12_381 — but no BN254 case, unlike addsub/muldiv which cover BN254.

10. Runtime assert should be static_assert

extensions/rv32-adapters/cuda/include/rv32-adapters/eq_mod.cuh:81:

assert(TOTAL_READ_SIZE == TOTAL_LIMBS);

Both are constexpr, so this should be static_assert for compile-time validation.


Code Quality / Low Priority

11. Massive code duplication across CUDA kernels

addsub.cu, muldiv.cu, fp2_addsub.cu, and fp2_muldiv.cu share near-identical adapter kernel structures, constraint templates, and launch logic. This is understandable for a first pass but should be considered for template-based consolidation to reduce maintenance burden.

12. Magic constant 8 (limb bits) repeated throughout

The limb bit width 8 appears as a magic number in carry calculations, shift operations, and range checks across multiple CUDA files. A constexpr would improve readability.

13. Missing metadata layout documentation

The FieldExprMeta, ExprMeta, and related structs lack documentation about memory layout, pointer lifetime, and alignment requirements. This makes Rust↔CUDA consistency hard to verify by inspection.

14. Pairing cuda.rs — comment-only change looks good

The doc comment update from "hybrid CPU tracegen" to "CUDA prover exports" correctly reflects the new architecture.


Positive Observations

  • Clean separation of adapter, compute, and constraint kernels following a consistent three-phase pattern
  • Proper is_real / padding row handling across all kernels
  • Good use of #[cfg(feature = "cuda")] guards to keep CPU path working
  • Comprehensive test parametrization across multiple moduli (secp256k1, BN254, BLS12_381)
  • The refactoring from HybridXXXChip to specialized XXXChipGpu with explicit infrastructure parameters is a good architectural improvement
  • Tests properly validate GPU vs CPU trace equivalence
  • New mod-builder unit tests (tests.rs) thoroughly cover field expression evaluation with both direct and recorded trace generation paths

Summary

The implementation is solid and well-structured. The critical array bounds issue in is_eq.cu (item 1) should be fixed before merge. Items 2-6 are high priority and worth addressing. The rest are improvements that can be tracked separately.

@github-actions

| group        | app.proof_time_ms | app.cycles | leaf.proof_time_ms |
|--------------|-------------------|------------|--------------------|
| fibonacci    | 3,870             | 12,000,265 | 944                |
| keccak       | 18,896            | 18,655,329 | 3,382              |
| regex        | 1,426             | 4,137,067  | 374                |
| ecrecover    | 662               | 123,583    | 274                |
| pairing      | 851               | 1,745,757  | 279                |
| kitchen_sink | 2,348             | 2,579,903  | 442                |

Note: cells_used metrics omitted because CUDA tracegen does not expose unpadded trace heights.

Commit: a790275

Benchmark Workflow
