GPU Witness Generation
Overview
Accelerate witness generation by offloading computation from CPU to GPU.
The work is organized along two dimensions: chip category and functionality.
            ┌─────────────────┐
            │   StepRecord    │  (from emulator trace)
            └────────┬────────┘
                     │
      ┌──────────────┼──────────────┐
      ▼              ▼              ▼
┌────────────┐  ┌──────────────┐  ┌───────────────┐
│ F-1 Witness│  │ F-2 LK Mult. │  │ F-3 Shard Ctx │
│   Matrix   │  │  (per-chip)  │  │    Records    │
└─────┬──────┘  └──────┬───────┘  └────────┬──────┘
      │                │                   │
      │         ┌──────▼───────┐    ┌──────▼──────────┐
      │         │ finalize lk  │    │ ShardRamCircuit │
      │         │multiplicities│    │      (C-5)      │
      │         └──────┬───────┘    └──────┬──────────┘
      │                │                   │
      │         ┌──────▼───────┐           │
      │         │Table Circuits│           │
      │         │    (C-3)     │           │
      │         └──────┬───────┘           │
      │                │                   │
      └────────────────┼───────────────────┘
                       ▼
               ┌────────────────┐
               │   Proof Gen    │
               └────────────────┘
Chip Categories
| ID | Category | Count | Description |
|---|---|---|---|
| C-1 | RV32IM Base Instructions | 45 | Core integer/memory/branch/jump instructions |
| C-2 | ECALL / Precompile | 17 | Keccak, SHA, elliptic curve, field ops, uint256 |
| C-3 | Table Circuits | 8 | Range, Ops (And/Or/Xor/Ltu/Pow), DoubleU8, Program |
| C-4 | RAM Init / Final Circuits | 7 | RegInit, StaticMemInit, PubIO, Hints/Stack/HeapInit, LocalFinal |
| C-5 | ShardRamCircuit | 1 | Cross-shard RAM consistency (374 witness cols, Poseidon2) |
Functionality Categories
| ID | Functionality | Description |
|---|---|---|
| F-1 | Fill Witness Matrix | Populate `RowMajorMatrix<BB31>` per chip (the main proof matrix) |
| F-2 | Lookup Multiplicity | Accumulate per-table lookup counters |
| F-3 | Shard Context Records | Cross-shard read/write RAM records consumed by ShardRamCircuit |
(Lookup Multiplicity covers 8 tables: Dynamic, DoubleU8, And/Or/Xor/Ltu/Pow, Instruction.)
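To make F-1 and F-2 concrete, here is a minimal CPU-side sketch in Rust of what a per-chip witness generator produces for ADD. Everything in it is an assumption for illustration: the `Step` struct, the 6-column layout (the real ADD/SUB kernel uses 22 columns), plain `u32` values standing in for `BB31` field elements, and the choice to range-check only the result limbs. On the GPU, each row would typically be handled by one thread.

```rust
/// Hypothetical, simplified step record: just the two source operand values.
#[derive(Clone, Copy)]
pub struct Step {
    pub rs1: u32,
    pub rs2: u32,
}

/// Illustrative layout (not the real one): two 16-bit limbs each for rs1, rs2, rd.
pub const NUM_COLS: usize = 6;

/// Split a 32-bit value into [low, high] 16-bit limbs.
fn limbs(v: u32) -> [u32; 2] {
    [v & 0xFFFF, v >> 16]
}

/// F-1: fill a row-major matrix, one row per step.
/// F-2: count how often each 16-bit value is looked up in a range table.
pub fn witgen_add_sketch(steps: &[Step]) -> (Vec<u32>, Vec<u64>) {
    let mut matrix = vec![0u32; steps.len() * NUM_COLS];
    let mut range_mult = vec![0u64; 1 << 16];

    for (row, step) in matrix.chunks_exact_mut(NUM_COLS).zip(steps) {
        let rd = step.rs1.wrapping_add(step.rs2);

        // F-1: write the limbs of rs1, rs2, rd into consecutive columns.
        let cells = [limbs(step.rs1), limbs(step.rs2), limbs(rd)];
        for (dst, src) in row.iter_mut().zip(cells.iter().flatten()) {
            *dst = *src;
        }

        // F-2: in this sketch only the result limbs are range-checked.
        for limb in limbs(rd) {
            range_mult[limb as usize] += 1;
        }
    }
    (matrix, range_mult)
}
```

Calling `witgen_add_sketch` on a slice of steps returns the flat row-major matrix (row `i` occupies cells `i * NUM_COLS .. (i + 1) * NUM_COLS`) together with the multiplicity vector for the single range table modeled here.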
Progress Matrix
|  | F-1 Witness | F-2 Lookup | F-3 Shard |
|---|---|---|---|
| C-1 RV32IM | ✅ | ✅ | ✅ |
| C-2 ECALL | ➡️ | ➡️ | ➡️ |
| C-3 Tables |  |  |  |
| C-4 RAM Init/Final |  |  |  |
| C-5 ShardRam |  |  | ➡️ |
Key dependencies:
- F-2 results from all chips are merged by `finalize_lk_multiplicities()` and consumed by C-3 (Table Circuits).
- F-3 results are consumed by C-5 (ShardRamCircuit).
- C-4 (RAM Init/Final) consumes `MemFinalRecord` from the trace directly (not from F-3).
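The first dependency amounts to a reduction: every chip emits its own lookup counters, and the table circuits need the sum. The sketch below shows that merge step under assumed types. `LkTable`, `LkCounts`, and `merge_lk_counts` are hypothetical names chosen for illustration and do not reflect the actual signature of `finalize_lk_multiplicities()`; only the eight table names come from the list above.

```rust
use std::collections::BTreeMap;

/// The eight lookup tables listed under Functionality Categories.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum LkTable {
    Dynamic,
    DoubleU8,
    And,
    Or,
    Xor,
    Ltu,
    Pow,
    Instruction,
}

/// Hypothetical per-chip F-2 output: sparse (table, entry index) -> count.
pub type LkCounts = BTreeMap<(LkTable, u64), u64>;

/// Sum the counters of all chips into one map that a table circuit
/// could then read its multiplicity column from.
pub fn merge_lk_counts(per_chip: &[LkCounts]) -> LkCounts {
    let mut merged = LkCounts::new();
    for chip in per_chip {
        for (&key, &count) in chip {
            *merged.entry(key).or_insert(0) += count;
        }
    }
    merged
}
```

A sparse map keeps the sketch short. A GPU implementation would more plausibly use dense per-table arrays with atomic increments, but that is a design guess rather than something stated here.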
PRs
Current Status
C-1: RV32IM Base Instructions (45)
Grouped by GPU kernel (shared witness column layout).
| Instructions | Type | F-1: Witness | F-2: Lookup | F-3: Shard |
|---|---|---|---|---|
| ADD, SUB (2) | integer add/sub (R) | `witgen_add/sub` (22 cols) |  |  |
| AND, OR, XOR (3) | bitwise logic (R) | `witgen_logic_r` (28 cols) |  |  |
| SLT, SLTU (2) | set-less-than (R) | `witgen_slt` (26 cols) |  |  |
| SLL, SRL, SRA (3) | shift (R) | `witgen_shift_r` (47 cols) |  |  |
| MUL (1) | multiply low (R) | `witgen_mul` (22 cols) |  |  |
| MULH, MULHU, MULHSU (3) | multiply high (R) | `witgen_mul` (26 cols) |  |  |
| DIV, DIVU, REM, REMU (4) | div/rem (R) | `witgen_div` (39 cols) |  |  |
| ADDI (1) | immediate add (I) | `witgen_addi` (18 cols) |  |  |
| ANDI, ORI, XORI (3) | immediate logic (I) | `witgen_logic_i` (24 cols) |  |  |
| SLTI, SLTIU (2) | immediate compare (I) | `witgen_slti` (22 cols) |  |  |
| SLLI, SRLI, SRAI (3) | immediate shift (I) | `witgen_shift_i` (40 cols) |  |  |
| LUI (1) | load upper imm (U) | `witgen_lui` (16 cols) |  |  |
| AUIPC (1) | add upper imm to PC (U) | `witgen_auipc` (21 cols) |  |  |
| BEQ, BNE (2) | branch equal/neq (B) | `witgen_branch_eq` (19 cols) |  |  |
| BLT, BLTU, BGE, BGEU (4) | branch compare (B) | `witgen_branch_cmp` (22 cols) |  |  |
| JAL (1) | jump and link (J) | `witgen_jal` (13 cols) |  |  |
| JALR (1) | jump and link reg (I) | `witgen_jalr` (22 cols) |  |  |
| LW (1) | load word (I) | `witgen_lw` (23 cols) |  |  |
| LH, LHU, LB, LBU (4) | load sub-word (I) | `witgen_load_sub` (25-29 cols) |  |  |
| SW (1) | store word (S) | `witgen_sw` (23 cols) |  |  |
| SH (1) | store half (S) | `witgen_sh` (24 cols) |  |  |
| SB (1) | store byte (S) | `witgen_sb` (29 cols) |  |  |
| Total: 45 |  | 22 GPU kernels |  |  |
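The grouping in the table can be read as a dispatch from instruction mnemonic to kernel. The following Rust sketch encodes exactly that mapping. The function name `kernel_for` and the constant `RV32IM` are hypothetical, and the kernel names are used only as labels copied from the table; this is not how the host code necessarily selects kernels.

```rust
/// All 45 C-1 mnemonics, in table order.
pub const RV32IM: &[&str] = &[
    "ADD", "SUB", "AND", "OR", "XOR", "SLT", "SLTU", "SLL", "SRL", "SRA",
    "MUL", "MULH", "MULHU", "MULHSU", "DIV", "DIVU", "REM", "REMU",
    "ADDI", "ANDI", "ORI", "XORI", "SLTI", "SLTIU", "SLLI", "SRLI", "SRAI",
    "LUI", "AUIPC", "BEQ", "BNE", "BLT", "BLTU", "BGE", "BGEU",
    "JAL", "JALR", "LW", "LH", "LHU", "LB", "LBU", "SW", "SH", "SB",
];

/// Map a mnemonic to the kernel label from the table, or None if the
/// instruction is not part of C-1 (for example ECALL, which is C-2).
pub fn kernel_for(mnemonic: &str) -> Option<&'static str> {
    let kernel = match mnemonic {
        "ADD" | "SUB" => "witgen_add/sub",
        "AND" | "OR" | "XOR" => "witgen_logic_r",
        "SLT" | "SLTU" => "witgen_slt",
        "SLL" | "SRL" | "SRA" => "witgen_shift_r",
        // The table lists witgen_mul for both the low and high multiply rows.
        "MUL" | "MULH" | "MULHU" | "MULHSU" => "witgen_mul",
        "DIV" | "DIVU" | "REM" | "REMU" => "witgen_div",
        "ADDI" => "witgen_addi",
        "ANDI" | "ORI" | "XORI" => "witgen_logic_i",
        "SLTI" | "SLTIU" => "witgen_slti",
        "SLLI" | "SRLI" | "SRAI" => "witgen_shift_i",
        "LUI" => "witgen_lui",
        "AUIPC" => "witgen_auipc",
        "BEQ" | "BNE" => "witgen_branch_eq",
        "BLT" | "BLTU" | "BGE" | "BGEU" => "witgen_branch_cmp",
        "JAL" => "witgen_jal",
        "JALR" => "witgen_jalr",
        "LW" => "witgen_lw",
        "LH" | "LHU" | "LB" | "LBU" => "witgen_load_sub",
        "SW" => "witgen_sw",
        "SH" => "witgen_sh",
        "SB" => "witgen_sb",
        _ => return None,
    };
    Some(kernel)
}
```

Bucketing step records by `kernel_for` before launch would let each kernel process a contiguous batch of rows that share one column layout, which is the stated reason for grouping by kernel.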