feat(v2.1-rv64): switch to u16 limbs in deferral airs by shuklaayush · Pull Request #2808 · openvm-org/openvm

shuklaayush · 2026-05-22T13:31:29Z

PR 5 of the memory-bus-u16 split, stacked on #2794. Shrinks deferral trace columns whose constraints only operate on packed u16 values, and doubles the sponge absorption rate of the deferral output chip from one digest per row to two.

Why

#2794 still stores deferral commit / output-key columns byte-shaped while packing the memory-bus payloads. Shrinking those columns to u16 cells halves the affected trace width and lets the canonicity sub-AIR use per-u16-cell 16-bit range checks instead of byte-pair BitwiseOperationLookup checks.

Separately, the deferral output chip absorbs only DIGEST_SIZE = 8 bytes per Poseidon row even though the sponge state can hold 2 * DIGEST_SIZE = 16 bytes. Doubling the absorption rate halves the number of output rows for the same byte stream.

What changes

Trace columns to u16 cells

Files: extensions/deferral/circuit/src/{call,output,canonicity}/, extensions/deferral/circuit/src/utils.rs, extensions/deferral/circuit/cuda/.

Call core: input_commit, output_commit, output_len are u16 cells.
Output core: output_commit, output_len are u16 cells.
Canonicity sub-AIR operates over u16 cells; upper-bound checks become per-u16-cell 16-bit range checks (instead of byte-pair bitwise checks).
Deferral rd_val / rs_val register-pointer columns are u16-shaped.
Reuses #2794's split_byte_memory_ops / split_f_memory_ops and byte_memory_op_chunk / f_memory_op_chunk helpers; no new split_memory_ops-style helper is introduced.
CUDA tracegen mirrors the new column shapes.

Sponge absorption rate

Files: extensions/deferral/circuit/src/output/{air,trace,execution,tests}.rs, extensions/deferral/circuit/src/def_fn.rs, extensions/deferral/circuit/src/extension/mod.rs, extensions/deferral/circuit/cuda/src/output.cu.

SPONGE_BYTES_PER_ROW = 2 * DIGEST_SIZE = 16 (was DIGEST_SIZE = 8).
sponge_inputs cells hold u16 values on data rows, each carrying two output bytes; data-row sponge_inputs receive 16-bit range checks. The init row remains [deferral_idx, output_len, 0, ...] and is exempt.
Two memory-bus writes per output row instead of one — the row now consumes 16 bytes of output, not 8.
def_fn::hash_output_raw packs byte pairs into each sponge cell before absorbing.
Bitwise-lookup wiring removed from the output chip; the call chip still uses bitwise lookup for non-canonical commit checks.
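The byte-pair packing described in this list can be sketched as follows. This is a minimal illustration, not the PR's code: `pack_row_le` and `unpack_row_le` are hypothetical names, and little-endian order is assumed from the `le_bytes_to_u16_cells` / `pack_u8_pairs_le` helper names mentioned in the review.

```rust
/// Cells absorbed per sponge row (value taken from the PR text).
const DIGEST_SIZE: usize = 8;
/// Bytes consumed per output row: two bytes per u16-valued cell.
const SPONGE_BYTES_PER_ROW: usize = 2 * DIGEST_SIZE;

/// Hypothetical stand-in for the PR's packing helper.
/// Packs 16 bytes into 8 cells, low byte first (assumed little-endian).
fn pack_row_le(bytes: &[u8; SPONGE_BYTES_PER_ROW]) -> [u16; DIGEST_SIZE] {
    let mut cells = [0u16; DIGEST_SIZE];
    for (cell, pair) in cells.iter_mut().zip(bytes.chunks_exact(2)) {
        *cell = u16::from_le_bytes([pair[0], pair[1]]);
    }
    cells
}

/// Inverse mapping, showing that the packing loses no information.
fn unpack_row_le(cells: &[u16; DIGEST_SIZE]) -> [u8; SPONGE_BYTES_PER_ROW] {
    let mut bytes = [0u8; SPONGE_BYTES_PER_ROW];
    for (pair, cell) in bytes.chunks_exact_mut(2).zip(cells.iter()) {
        pair.copy_from_slice(&cell.to_le_bytes());
    }
    bytes
}
```

Because every cell is built from exactly two bytes, each packed value fits in 16 bits, which is what the data-row range checks rely on.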

Output length contract

output_len remains a byte count.
Raw deferral outputs must be a multiple of 16 bytes; non-16-byte-aligned outputs are invalid with no implicit padding.
Host-side failure-path test asserts the panic when a non-16-aligned raw output is submitted.
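As a rough sketch of the contract above, a host could validate the length before submitting an output. The helper name `output_data_rows` is hypothetical, and it returns a `Result` purely for illustration; per the PR text, the actual host-side path panics on a misaligned length.

```rust
const SPONGE_BYTES_PER_ROW: usize = 16;

/// Hypothetical helper: number of data rows a raw output occupies,
/// or an error when the length is not a multiple of 16 bytes.
fn output_data_rows(raw_len: usize) -> Result<usize, String> {
    if raw_len % SPONGE_BYTES_PER_ROW != 0 {
        // No implicit padding: a misaligned length is rejected outright.
        return Err(format!(
            "raw output length {raw_len} is not a multiple of {SPONGE_BYTES_PER_ROW}"
        ));
    }
    Ok(raw_len / SPONGE_BYTES_PER_ROW)
}
```

For a 64-byte output this gives 4 data rows, where the previous 8-byte rate would have needed 8.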

Migration notes

Hosts producing deferral output must align raw outputs to a multiple of 16 bytes.
Output-chip configs that previously passed bitwise_lu should drop the argument.

resolves int-7834

PR 5 of the memory-bus-u16 split, stacked on PR 1. Two independent parts.

Part 1, internal u16 reshape, OutputKey byte layout unchanged:
- Call core: input_commit, output_commit, output_len trace columns reshape to u16 cells.
- Output core: output_commit, output_len reshape to u16 cells.
- Canonicity sub-AIR over u16 cells.
- Deferral rd_val / rs_val register-pointer columns become u16-shaped.
- CPU + CUDA tracegen updates.
- Reuses PR1's byte-memory and field-memory chunk helpers; no new generic split_memory_ops-style code is introduced.

Part 2, output commitment-format change (breaks vkey compatibility):
- SPONGE_BYTES_PER_ROW = 2 * DIGEST_SIZE = 16; sponge_inputs cells hold u16 values on data rows and receive 16-bit range checks. Init row stays `[deferral_idx, output_len, 0, ...]` and is exempt from u16 checks.
- Two memory-bus writes per output row instead of one.
- def_fn::hash_output_raw packs byte pairs into each sponge cell.
- Bitwise-lookup wiring removed from the output chip only; it no longer needs byte-level range checks. The call chip still uses bitwise.
- CUDA + tests updated; expected output commits regenerated.

Output length contract:
- output_len remains a byte count.
- Raw deferral outputs must be a multiple of 16 bytes; non-16-byte aligned raw outputs are invalid (no implicit padding).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ghost · 2026-05-22T21:20:46Z

Claude finished @shuklaayush's task in 2m 29s

Code Review: PR #2808 — Switch to u16 limbs in deferral AIRs

Constraint Soundness

I independently verified the constraint logic across all changed AIRs.

Canonicity sub-AIR (byte → u16): BabyBear ORDER = 0x78000001 decomposes into u16 BE limbs [0x7800, 0x0001]. The diff - 1 range-check value has max 0x77FF (when x_high = 0), well within [0, 0xFFFF]. The debug_assert! at canonicity/air.rs:46 correctly validates limb coverage.
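The limb comparison behind this bound can be illustrated with a witness-side sketch. It assumes the usual "first differing limb" approach to proving x < ORDER; the function names are hypothetical, and the real sub-AIR expresses this through constraint columns rather than a loop.

```rust
/// BabyBear modulus, as quoted in the review.
const BABYBEAR_ORDER: u32 = 0x7800_0001;

/// Splits a 32-bit value into two u16 limbs, most significant first.
fn to_limbs_be(x: u32) -> [u16; 2] {
    [(x >> 16) as u16, x as u16]
}

/// Hypothetical helper: returns `diff - 1` at the first limb where `x`
/// differs from the modulus, or `None` when `x` is not canonical
/// (that is, `x >= BABYBEAR_ORDER`).
fn canonicity_diff_minus_one(x: u32) -> Option<u16> {
    let order = to_limbs_be(BABYBEAR_ORDER);
    let limbs = to_limbs_be(x);
    for (o, l) in order.iter().zip(limbs.iter()) {
        if o != l {
            // The first differing limb decides the comparison. The value is
            // canonical only if the modulus limb is strictly larger.
            return o.checked_sub(*l).and_then(|diff| diff.checked_sub(1));
        }
    }
    None // x equals the modulus, so it is not canonical
}
```

With a zero high limb the result is 0x7800 - 0 - 1 = 0x77FF, the maximum quoted above, so the value always fits a 16-bit range check.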

Call core range-check coverage (37 per valid row):

input_commit canonicity diff-1: 8 RCs (cells are read from u16 memory bus, so no per-cell RC needed)
output_commit per-cell u16: 16 RCs + canonicity diff-1: 8 RCs
output_len per-cell u16: 2 RCs + address-width: 1 RC
Adapter rd/rs pointer high-u16: 2 RCs
Total: 37 — tracegen at call/trace.rs:179-210,405-413 matches exactly.

Output AIR range-check coverage:

First row: output_commit per-cell (16) + output_len per-cell (2) + output_commit canonicity diff-1 (8) + rd/rs ptr high-u16 (2) + output_len address-width (1) = 29
Data rows: sponge_inputs per-cell u16 (8 per row)
Init-row sponge cells correctly exempted — they're pinned by assert_array_eq against already-canonical columns (deferral_idx / output_len) at output/air.rs:212-216.
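Both range-check tallies above can be re-derived from the per-item counts. The sketch below only restates the review's figures; the function names are hypothetical, and tying the canonicity count to `DIGEST_SIZE` (one check per field element) is an assumption.

```rust
const DIGEST_SIZE: usize = 8;
const F_NUM_U16S: usize = 2;
const COMMIT_NUM_U16S: usize = DIGEST_SIZE * F_NUM_U16S; // 16

/// Range checks per valid call-core row, itemized as in the review.
fn call_core_range_checks() -> usize {
    let input_commit_canonicity = DIGEST_SIZE; // 8, assumed one per element
    let output_commit_cells = COMMIT_NUM_U16S; // 16
    let output_commit_canonicity = DIGEST_SIZE; // 8
    let output_len_cells = 2;
    let output_len_address_width = 1;
    let adapter_ptr_high = 2; // rd and rs pointer high-u16
    input_commit_canonicity
        + output_commit_cells
        + output_commit_canonicity
        + output_len_cells
        + output_len_address_width
        + adapter_ptr_high
}

/// Range checks on the first row of the output AIR, itemized as in the review.
fn output_first_row_range_checks() -> usize {
    let output_commit_cells = COMMIT_NUM_U16S; // 16
    let output_len_cells = 2;
    let output_commit_canonicity = DIGEST_SIZE; // 8
    let ptr_high = 2; // rd and rs pointer high-u16
    let output_len_address_width = 1;
    output_commit_cells + output_len_cells + output_commit_canonicity + ptr_high
        + output_len_address_width
}
```

The two totals differ by exactly the 8 input_commit canonicity checks, which only the call core performs.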

Memory bus address strides: With SPONGE_BYTES_PER_ROW = 16, each data row writes at output_ptr + (section_idx - 1) * 16 + chunk_idx * 8 with SPONGE_ROW_MEMORY_OPS = 2 chunks. Produces contiguous 16-byte writes. Verified.
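The stride formula can be checked with a small sketch. `row_write_addresses` is a hypothetical helper, and treating `section_idx = 0` as the init row (which writes no output) is an assumption inferred from the `section_idx - 1` term.

```rust
const SPONGE_BYTES_PER_ROW: u32 = 16;
const MEMORY_BLOCK_BYTES: u32 = 8;
const SPONGE_ROW_MEMORY_OPS: usize = (SPONGE_BYTES_PER_ROW / MEMORY_BLOCK_BYTES) as usize;

/// Hypothetical helper: byte addresses written by one data row.
/// Assumes data rows start at `section_idx = 1`.
fn row_write_addresses(output_ptr: u32, section_idx: u32) -> [u32; SPONGE_ROW_MEMORY_OPS] {
    assert!(section_idx >= 1, "section 0 is assumed to be the init row");
    let base = output_ptr + (section_idx - 1) * SPONGE_BYTES_PER_ROW;
    // One write per 8-byte chunk of the row.
    std::array::from_fn(|chunk_idx| base + chunk_idx as u32 * MEMORY_BLOCK_BYTES)
}
```

The second write of one row ends where the first write of the next row begins, so consecutive rows cover the output without gaps or overlap.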

Sponge absorption rate: Each u16 sponge cell carries 2 bytes; DIGEST_SIZE = 8 cells × 2 = 16 bytes per row. The bijection via le_bytes_to_u16_cells / pack_u8_pairs_le is fixed and invertible.

Timestamp arithmetic:

Call adapter delta: 2 + COMMIT_MEMORY_OPS + OUTPUT_TOTAL_MEMORY_OPS + 4 * DIGEST_F_MEMORY_OPS = 2 + 4 + 5 + 8 = 19. Matches call/trace.rs:418.
Output execution bridge: section_idx * SPONGE_ROW_MEMORY_OPS + OUTPUT_TOTAL_MEMORY_OPS + 2. Consistent between output/air.rs:382-383 and tracegen.
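The timestamp expressions above reduce to plain integer arithmetic. The sketch below hard-codes the constant values from the review's table; `output_bridge_expr` is a hypothetical name for the output execution bridge expression.

```rust
const COMMIT_MEMORY_OPS: u32 = 4;
const OUTPUT_TOTAL_MEMORY_OPS: u32 = 5;
const DIGEST_F_MEMORY_OPS: u32 = 2;
const SPONGE_ROW_MEMORY_OPS: u32 = 2;

/// Call adapter timestamp delta, following the formula quoted in the review.
const CALL_ADAPTER_TIMESTAMP_DELTA: u32 =
    2 + COMMIT_MEMORY_OPS + OUTPUT_TOTAL_MEMORY_OPS + 4 * DIGEST_F_MEMORY_OPS;

/// Output execution bridge expression for a given section index.
fn output_bridge_expr(section_idx: u32) -> u32 {
    section_idx * SPONGE_ROW_MEMORY_OPS + OUTPUT_TOTAL_MEMORY_OPS + 2
}
```

Each additional section adds `SPONGE_ROW_MEMORY_OPS = 2` to the bridge expression, matching the two memory-bus writes per data row.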

Constant Arithmetic

All constants verified consistent between Rust and CUDA:

Constant	Value	Rust derivation	CUDA derivation
`F_NUM_U16S`	2	`F_NUM_BYTES / U16_CELL_SIZE`	same
`COMMIT_NUM_U16S`	16	`DIGEST_SIZE * F_NUM_U16S`	same
`OUTPUT_LEN_NUM_U16S`	4	`OUTPUT_LEN_NUM_BYTES / U16_CELL_SIZE`	same
`OUTPUT_TOTAL_NUM_U16S`	20	`COMMIT_NUM_U16S + OUTPUT_LEN_NUM_U16S`	same
`SPONGE_BYTES_PER_ROW`	16	`U16_CELL_SIZE * DIGEST_SIZE`	same
`SPONGE_ROW_MEMORY_OPS`	2	`num_byte_memory_ops(16)`	`SPONGE_BYTES_PER_ROW / MEMORY_BLOCK_BYTES`
`COMMIT_MEMORY_OPS`	4	`num_byte_memory_ops(32)`	`COMMIT_NUM_U16S / BLOCK_FE_WIDTH`
`OUTPUT_TOTAL_MEMORY_OPS`	5	`num_f_memory_ops(20)`	`OUTPUT_TOTAL_NUM_U16S / BLOCK_FE_WIDTH`
`DIGEST_F_MEMORY_OPS`	2	`DIGEST_SIZE / BLOCK_FE_WIDTH`	same

Note on COMMIT_MEMORY_OPS: The Rust code derives this as num_byte_memory_ops(COMMIT_NUM_BYTES) = 32/8 = 4, while CUDA derives it as COMMIT_NUM_U16S / BLOCK_FE_WIDTH = 16/4 = 4. Both yield 4 because MEMORY_BLOCK_BYTES / U16_CELL_SIZE == BLOCK_FE_WIDTH (8/2 == 4). The two derivations agree only while that identity holds, so a change to any of the three constants would make the Rust and CUDA values diverge.
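One way to guard the invariant is a compile-time assertion, sketched below. The constant values come from the table; the body of `num_byte_memory_ops` is an assumption (plain division, consistent with the 32/8 = 4 figure), since the real helper is not shown in this thread.

```rust
const U16_CELL_SIZE: usize = 2;
const MEMORY_BLOCK_BYTES: usize = 8;
const BLOCK_FE_WIDTH: usize = 4;
const DIGEST_SIZE: usize = 8;
const F_NUM_BYTES: usize = 4;
const F_NUM_U16S: usize = F_NUM_BYTES / U16_CELL_SIZE; // 2
const COMMIT_NUM_BYTES: usize = DIGEST_SIZE * F_NUM_BYTES; // 32
const COMMIT_NUM_U16S: usize = DIGEST_SIZE * F_NUM_U16S; // 16

/// Assumed shape of the Rust-side helper: bytes divided by the block size.
const fn num_byte_memory_ops(num_bytes: usize) -> usize {
    num_bytes / MEMORY_BLOCK_BYTES
}

// Compilation fails if the structural identity stops holding.
const _: () = assert!(MEMORY_BLOCK_BYTES / U16_CELL_SIZE == BLOCK_FE_WIDTH);
// Compilation fails if the Rust-style and CUDA-style derivations diverge.
const _: () = assert!(num_byte_memory_ops(COMMIT_NUM_BYTES) == COMMIT_NUM_U16S / BLOCK_FE_WIDTH);
```

A CUDA `static_assert` on the same identity would give the equivalent guard on the device side.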

CUDA/Rust Struct Layout Parity

All CUDA structs match their Rust counterparts field-for-field:

DeferralCallReadsBytes, DeferralCallWritesBytes, DeferralCallReads, DeferralCallWrites, DeferralCallCoreCols, DeferralCallAdapterRecord, DeferralCallAdapterCols
DeferralOutputRecordHeader, DeferralOutputCols, DeferralOutputPerCall, DeferralOutputPerRow
CUDA pack_u8_pairs_le / u32_bytes_to_le_u16_cells helpers mirror Rust's le_bytes_to_u16_cells / u32_to_le_u16_cells.
CUDA scale_rv64_ptr_from_u32_bytes extracts bytes [2, 3] as the high u16, matching Rust's scale_rv64_ptr_from_u32_value(ptr >> U16_BITS, ...).
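The equivalence between the byte-indexed and shift-based extractions is easy to confirm in isolation. The function names below are hypothetical illustrations, not the repository's helpers.

```rust
const U16_BITS: u32 = 16;

/// Splits a u32 into two u16 cells, low cell first (little-endian order).
fn split_u32_le(value: u32) -> [u16; 2] {
    let b = value.to_le_bytes();
    [u16::from_le_bytes([b[0], b[1]]), u16::from_le_bytes([b[2], b[3]])]
}

/// CUDA-style extraction: bytes [2, 3] of the little-endian encoding.
fn high_u16_from_bytes(bytes: [u8; 4]) -> u16 {
    u16::from_le_bytes([bytes[2], bytes[3]])
}

/// Rust-style extraction: shift the pointer right by 16 bits.
fn high_u16_from_value(ptr: u32) -> u16 {
    (ptr >> U16_BITS) as u16
}
```

For `0x1234_5678` both extractions return `0x1234`, which is also the second cell produced by `split_u32_le`.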

Extension Wiring

BitwiseOperationLookupBus/BitwiseOperationLookupChip discovery-or-create logic in extension/mod.rs is fully removed. Both call and output chips now use inventory.range_checker().bus for the range bus. The bitwise bus is cleanly excised from both CPU and GPU prover paths, all test harnesses, and all chip constructors. No stale references remain.

verify-stark Output Circuit

Both the AIR (output/air.rs:139-148, 179-188) and tracegen (output/trace.rs:71-81) correctly use U16_BITS-wide range checks for per-cell and canonicity interactions. The next_f_to_digest helper (trace.rs:112-120) properly decomposes F elements into u16-packed cells via u16::from_le_bytes, matching the AIR's expectation.

A previous automated review incorrectly flagged these as bugs (wrong max_bits: 8 and byte-shaped tracegen), but in the actual code both max_bits and the tracegen are already correctly updated to u16 in this PR. Confirmed.

The new RangeCheckerAir::<U16_BITS> is properly added to the circuit's AIR list (lib.rs:108-110, 127) and its trace is generated via RangeCheckerCpuTraceGenerator::<U16_BITS> (trace.rs:48, 99), replacing the previous range_inputs: Vec<usize> approach. Clean refactor.

`expand_to_rv64_block`

The new helper at riscv/circuit/src/adapters/mod.rs:378-391 generalizes zero-padding of N u16 limbs to BLOCK_FE_WIDTH cells. Used for rd_val/rs_val (RV64_PTR_U16S = 2 → 4) and output_len (F_NUM_U16S = 2 → 4). The compile-time assertion N <= BLOCK_FE_WIDTH prevents misuse. Clean and reusable.
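A simplified sketch of the zero-padding idea follows. It works on plain `u16` values and uses a hypothetical name, `expand_to_block`; the real helper's signature and element type are not shown in this thread and may differ.

```rust
const BLOCK_FE_WIDTH: usize = 4;

/// Carrier for the compile-time bound check on `N`.
struct FitsInBlock<const N: usize>;

impl<const N: usize> FitsInBlock<N> {
    /// Evaluated per instantiation; fails the build when N is too large.
    const OK: () = assert!(N <= BLOCK_FE_WIDTH, "too many limbs for one block");
}

/// Hypothetical helper: zero-pads `N` u16 limbs to a full block.
fn expand_to_block<const N: usize>(limbs: [u16; N]) -> [u16; BLOCK_FE_WIDTH] {
    // Referencing the constant forces the bound check for this `N`.
    let () = FitsInBlock::<N>::OK;
    let mut block = [0u16; BLOCK_FE_WIDTH];
    block[..N].copy_from_slice(&limbs);
    block
}
```

Calling it with a two-limb pointer such as `[0x1234, 0xabcd]` yields `[0x1234, 0xabcd, 0, 0]`, while an array of five limbs is rejected at compile time.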

`def_fn::hash_output_raw`

The sponge absorption now packs byte pairs into u16-valued sponge cells via le_bytes_to_u16_cells before calling perm. This matches the output AIR's data-row sponge cell packing. The alignment assertion (output_ref.len().is_multiple_of(SPONGE_BYTES_PER_ROW)) at def_fn.rs:84 correctly enforces the new 16-byte alignment requirement.

Continuations Changes

assert_user_pvs_shape is extracted to root/mod.rs:42-46 and reused across commit/air.rs, commit/trace.rs, memory/air.rs, memory/trace.rs, and RootCircuit::new. It replaces scattered debug_assert! calls with assert!, so the invariant is validated at construction time and also in release builds, where debug_assert! is compiled out by default.
compute_output_f_commit in e2e.rs now delegates to generate_deferral_results + deferral_poseidon2_chip() instead of reimplementing sponge logic inline — ensures the test stays in sync with circuit changes automatically.
Guest programs and test fixtures updated to 16-byte-aligned outputs.

Test Coverage

New test output_raw_len_must_be_sponge_row_aligned (output/tests.rs:298-302) asserts the panic for non-16-aligned outputs
Guest programs updated to 16-byte outputs with recalculated expected values
Randomized test helpers produce SPONGE_BYTES_PER_ROW-aligned output lengths
CUDA tests mirror CPU test structure

No Issues Found

This is a clean, well-structured PR. The u16 cell transition halves the affected deferral trace columns, and the doubled sponge absorption rate halves the output row count. The AIR constraints, trace generation, CUDA kernels, extension wiring, and verify-stark circuit are all internally consistent. The assert_user_pvs_shape extraction in continuations is a nice quality improvement.

ghost · 2026-05-22T21:25:36Z

group	app.proof_time_ms	app.cycles	leaf.proof_time_ms
fibonacci	1,567	4,000,051	438
keccak	14,065	14,365,133	2,411
sha2_bench	9,414	11,167,961	1,440
regex	1,477	4,090,656	360
ecrecover	474	112,210	267
pairing	598	592,827	257
kitchen_sink	1,818	1,979,971	405

Note: cells_used metrics omitted because CUDA tracegen does not expose unpadded trace heights.

Commit: eef7099


ghost · 2026-05-22T21:30:46Z

Claude finished @shuklaayush's task in 2m 31s