# GPU Witness Generation

## Overview

This issue tracks the effort to accelerate witness generation by offloading computation from the CPU to the GPU.
The work is organized along two dimensions: **chip category** (C-1 to C-5) and **functionality** (F-1 to F-3).


```
                    ┌─────────────────┐
                    │   StepRecord    │  (from emulator trace)
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
     ┌────────────┐  ┌──────────────┐  ┌───────────────┐
     │ F-1 Witness│  │ F-2 LK Mult. │  │ F-3 Shard Ctx │
     │   Matrix   │  │  (per-chip)  │  │   Records     │
     └─────┬──────┘  └──────┬───────┘  └────────┬──────┘
           │                │                   │
           │         ┌──────▼───────┐    ┌──────▼──────────┐
           │         │ finalize lk  │    │ ShardRamCircuit │
           │         │multiplicities│    │      (C-5)      │
           │         └──────┬───────┘    └──────┬──────────┘
           │                │                   │
           │         ┌──────▼───────┐           │
           │         │Table Circuits│           │
           │         │    (C-3)     │           │
           │         └──────┬───────┘           │
           │                │                   │
           └────────────────┼───────────────────┘
                            ▼
                    ┌────────────────┐
                    │   Proof Gen    │
                    └────────────────┘
```

## Chip Categories

| ID | Category | Count | Description |
|----|----------|:-----:|-------------|
| **C-1** | RV32IM Base Instructions | 45 | Core integer/memory/branch/jump instructions |
| **C-2** | ECALL / Precompile | 17 | Keccak, SHA, elliptic curve, field ops, uint256 |
| **C-3** | Table Circuits | 8 | Range, Ops (And/Or/Xor/Ltu/Pow), DoubleU8, Program |
| **C-4** | RAM Init / Final Circuits | 7 | RegInit, StaticMemInit, PubIO, Hints/Stack/HeapInit, LocalFinal |
| **C-5** | ShardRamCircuit | 1 | Cross-shard RAM consistency (374 witness cols, Poseidon2) |

## Functionality Categories

| ID | Functionality | Description |
|----|---------------|-------------|
| **F-1** | Fill Witness Matrix | Populate `RowMajorMatrix<BB31>` per chip (the main proof matrix) |
| **F-2** | Lookup Multiplicity | Accumulate per-table lookup counters  |
| **F-3** | Shard Context Records | Cross-shard read/write RAM records consumed by ShardRamCircuit |
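
To make F-1 concrete, here is a minimal CPU-side sketch of filling a row-major witness buffer, one row per step. Everything in it is an assumption for illustration: the `Step` struct, the six-column layout, the 16-bit limb split, and the power-of-two row padding are hypothetical and do not reflect the real column layout of any chip. On the GPU, each row would typically be written by one thread.

```rust
/// Hypothetical, heavily simplified step record for an ADD-like instruction.
struct Step {
    rs1: u32,
    rs2: u32,
}

/// Split a 32-bit value into two 16-bit limbs (low, high).
/// Each limb fits comfortably in a 31-bit field element.
fn limbs(x: u32) -> [u32; 2] {
    [x & 0xFFFF, x >> 16]
}

/// Fill a row-major matrix with `width` columns, one row per step.
/// Illustrative layout: cols 0-1 = rs1 limbs, cols 2-3 = rs2 limbs,
/// cols 4-5 = limbs of the wrapping sum, remaining columns zero.
/// The row count is padded with zero rows to the next power of two (assumed).
fn fill_witness(steps: &[Step], width: usize) -> Vec<u32> {
    assert!(width >= 6, "this sketch needs at least six columns");
    let rows = steps.len().next_power_of_two();
    let mut matrix = vec![0u32; rows * width];
    for (row, step) in matrix.chunks_exact_mut(width).zip(steps) {
        row[0..2].copy_from_slice(&limbs(step.rs1));
        row[2..4].copy_from_slice(&limbs(step.rs2));
        row[4..6].copy_from_slice(&limbs(step.rs1.wrapping_add(step.rs2)));
    }
    matrix
}
```

Because every row depends only on its own step, the loop body is what a per-thread kernel would compute; the flat `Vec<u32>` stands in for the backing storage of a `RowMajorMatrix<BB31>`.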

F-2 accumulates counters for eight lookup tables: Dynamic, DoubleU8, And, Or, Xor, Ltu, Pow, and Instruction.
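
As an illustration of what F-2 accumulates, the sketch below keeps one sparse counter map per lookup table and increments it on every lookup. The `Table` enum, the `LkMultiplicity` struct, and the `u64` entry encoding are hypothetical names chosen for this example; a GPU version would more likely use atomic increments into dense per-table arrays.

```rust
use std::collections::HashMap;

/// The eight lookup tables named above; the enum itself is hypothetical.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Table {
    Dynamic,
    DoubleU8,
    And,
    Or,
    Xor,
    Ltu,
    Pow,
    Instruction,
}

const NUM_TABLES: usize = 8;

/// One sparse counter map per table: table entry -> number of lookups.
struct LkMultiplicity {
    counts: [HashMap<u64, u64>; NUM_TABLES],
}

impl LkMultiplicity {
    fn new() -> Self {
        Self { counts: std::array::from_fn(|_| HashMap::new()) }
    }

    /// Record one lookup of `entry` in `table`.
    fn record(&mut self, table: Table, entry: u64) {
        *self.counts[table as usize].entry(entry).or_insert(0) += 1;
    }

    /// How many times `entry` was looked up in `table` so far.
    fn get(&self, table: Table, entry: u64) -> u64 {
        self.counts[table as usize].get(&entry).copied().unwrap_or(0)
    }
}
```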

## Progress Matrix

| | F-1 Witness | F-2 Lookup | F-3 Shard |
|---|:---:|:---:|:---:|
| **C-1** RV32IM | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| **C-2** ECALL | ➡️ | ➡️ | ➡️ |
| **C-3** Tables | | | |
| **C-4** RAM Init/Final | | | |
| **C-5** ShardRam | | | ➡️ |

**Key dependencies:**
- F-2 results from all chips are merged by `finalize_lk_multiplicities()` and then consumed by C-3 (Table Circuits).
- F-3 results are consumed by C-5 (ShardRamCircuit).
- C-4 (RAM Init/Final) consumes `MemFinalRecord` directly from the trace, not from F-3.
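
The F-2 to C-3 dependency amounts to a reduction: counters reported by each chip are summed before the table circuits are assigned. A minimal sketch, assuming each chip reports a map keyed by (table index, table entry); `LkCounts` and `merge_lk` are hypothetical stand-ins and do not reproduce the signature of `finalize_lk_multiplicities()`.

```rust
use std::collections::HashMap;

/// (table index, table entry) -> lookup count. Hypothetical representation.
type LkCounts = HashMap<(usize, u64), u64>;

/// Sum the counters reported by every chip into a single map.
fn merge_lk(per_chip: &[LkCounts]) -> LkCounts {
    let mut merged = LkCounts::new();
    for chip in per_chip {
        for (&key, &n) in chip {
            *merged.entry(key).or_insert(0) += n;
        }
    }
    merged
}
```

Addition is commutative, so the merge order does not matter. That is what allows per-chip counters to be produced independently, on CPU or GPU, and combined afterwards.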

## PRs
- ceno-gpu: https://github.com/scroll-tech/ceno-gpu/pull/142
- ceno: #1260 
- ceno: #1259

## Current Status
- [x] **C-1 x F-1** (RV32IM): all 45 instructions have GPU kernels that produce witness matrices
- [x] **C-1 x F-2** (RV32IM): GPU lookup multiplicity accumulation
- [x] **C-1 x F-3** (RV32IM): GPU plus lightweight CPU shard context record collection
- [ ] **C-5**: ShardRamCircuit
- [ ] **C-2**: Ecall_Keccak

---

## C-1: RV32IM Base Instructions (45)

Grouped by GPU kernel (shared witness column layout).

| Instructions | Type | F-1: Witness | F-2: Lookup | F-3: Shard |
|-------------|------|-------------|-------------|------------|
| ADD, SUB (2) | integer add/sub (R) | `witgen_add/sub` (22 cols) | | |
| AND, OR, XOR (3) | bitwise logic (R) | `witgen_logic_r` (28 cols) | | |
| SLT, SLTU (2) | set-less-than (R) | `witgen_slt` (26 cols) | | |
| SLL, SRL, SRA (3) | shift (R) | `witgen_shift_r` (47 cols) | | |
| MUL (1) | multiply low (R) | `witgen_mul` (22 cols) | | |
| MULH, MULHU, MULHSU (3) | multiply high (R) | `witgen_mul` (26 cols) | | |
| DIV, DIVU, REM, REMU (4) | div/rem (R) | `witgen_div` (39 cols) | | |
| ADDI (1) | immediate add (I) | `witgen_addi` (18 cols) | | |
| ANDI, ORI, XORI (3) | immediate logic (I) | `witgen_logic_i` (24 cols) | | |
| SLTI, SLTIU (2) | immediate compare (I) | `witgen_slti` (22 cols) | | |
| SLLI, SRLI, SRAI (3) | immediate shift (I) | `witgen_shift_i` (40 cols) | | |
| LUI (1) | load upper imm (U) | `witgen_lui` (16 cols) | | |
| AUIPC (1) | add upper imm to PC (U) | `witgen_auipc` (21 cols) | | |
| BEQ, BNE (2) | branch equal/neq (B) | `witgen_branch_eq` (19 cols) | | |
| BLT, BLTU, BGE, BGEU (4) | branch compare (B) | `witgen_branch_cmp` (22 cols) | | |
| JAL (1) | jump and link (J) | `witgen_jal` (13 cols) | | |
| JALR (1) | jump and link reg (I) | `witgen_jalr` (22 cols) | | |
| LW (1) | load word (I) | `witgen_lw` (23 cols) | | |
| LH, LHU, LB, LBU (4) | load sub-word (I) | `witgen_load_sub` (25-29 cols) | | |
| SW (1) | store word (S) | `witgen_sw` (23 cols) | | |
| SH (1) | store half (S) | `witgen_sh` (24 cols) | | |
| SB (1) | store byte (S) | `witgen_sb` (29 cols) | | |
| **Total: 45** | | **22 GPU kernels** | | |
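
Since several instructions share one kernel, dispatch reduces to a mapping from mnemonic to kernel group. The sketch below mirrors the table; the kernel names are copied from it verbatim, while the `kernel_for` function and the use of string mnemonics are assumptions made for illustration.

```rust
/// Map an RV32IM mnemonic to the kernel name listed in the table above.
/// Returns `None` for anything outside the 45 base instructions.
fn kernel_for(mnemonic: &str) -> Option<&'static str> {
    Some(match mnemonic {
        "ADD" | "SUB" => "witgen_add/sub",
        "AND" | "OR" | "XOR" => "witgen_logic_r",
        "SLT" | "SLTU" => "witgen_slt",
        "SLL" | "SRL" | "SRA" => "witgen_shift_r",
        "MUL" | "MULH" | "MULHU" | "MULHSU" => "witgen_mul",
        "DIV" | "DIVU" | "REM" | "REMU" => "witgen_div",
        "ADDI" => "witgen_addi",
        "ANDI" | "ORI" | "XORI" => "witgen_logic_i",
        "SLTI" | "SLTIU" => "witgen_slti",
        "SLLI" | "SRLI" | "SRAI" => "witgen_shift_i",
        "LUI" => "witgen_lui",
        "AUIPC" => "witgen_auipc",
        "BEQ" | "BNE" => "witgen_branch_eq",
        "BLT" | "BLTU" | "BGE" | "BGEU" => "witgen_branch_cmp",
        "JAL" => "witgen_jal",
        "JALR" => "witgen_jalr",
        "LW" => "witgen_lw",
        "LH" | "LHU" | "LB" | "LBU" => "witgen_load_sub",
        "SW" => "witgen_sw",
        "SH" => "witgen_sh",
        "SB" => "witgen_sb",
        _ => return None,
    })
}
```

In the table, MUL and the MULH family share the name `witgen_mul` but differ in column count (22 vs 26), so a real dispatcher would have to tell them apart by more than the name returned here.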


