feat(v2.1-rv64): switch to u16 limbs in deferral airs #2808

Open
shuklaayush wants to merge 1 commit into feat/memory-bus-u16-pr1 from feat/memory-bus-u16-pr5

@shuklaayush shuklaayush commented May 22, 2026

PR 5 of the memory-bus-u16 split, stacked on #2794. Shrinks deferral trace columns whose constraints only operate on packed u16 values, and doubles the sponge absorption rate of the deferral output chip from DIGEST_SIZE = 8 bytes per row to 2 * DIGEST_SIZE = 16 bytes per row.

Why

#2794 still stores deferral commit / output-key columns byte-shaped while packing the memory-bus payloads. Shrinking those columns to u16 cells halves the affected trace width and lets the canonicity sub-AIR use per-u16-cell 16-bit range checks instead of byte-pair BitwiseOperationLookup checks.

Separately, the deferral output chip absorbs only DIGEST_SIZE = 8 bytes per Poseidon row, one byte per sponge input cell, even though each cell can carry a u16 value and a row could therefore absorb 2 * DIGEST_SIZE = 16 bytes. Doubling the absorption rate halves the number of output rows for the same byte stream.
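To make the row-count claim concrete, here is a minimal sketch of the arithmetic. The helper names (`data_rows`, `rows_before`, `rows_after`) are hypothetical and do not appear in the PR; only `DIGEST_SIZE = 8` is taken from the description.

```rust
/// Number of sponge input cells per row, as stated in the PR description.
const DIGEST_SIZE: usize = 8;

/// Rows needed to absorb `len` bytes at `bytes_per_row` (hypothetical helper).
/// Rounds up so the pre-PR case with unaligned lengths is also covered.
fn data_rows(len: usize, bytes_per_row: usize) -> usize {
    (len + bytes_per_row - 1) / bytes_per_row
}

/// Before this PR: one byte per sponge cell, so 8 bytes per row.
fn rows_before(len: usize) -> usize {
    data_rows(len, DIGEST_SIZE)
}

/// After this PR: one u16 (two bytes) per sponge cell, so 16 bytes per row.
fn rows_after(len: usize) -> usize {
    data_rows(len, 2 * DIGEST_SIZE)
}
```

For a 64-byte raw output, `rows_before(64)` is 8 and `rows_after(64)` is 4. This counts data rows only and ignores the init row.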

What changes

Trace columns to u16 cells

Files: extensions/deferral/circuit/src/{call,output,canonicity}/, extensions/deferral/circuit/src/utils.rs, extensions/deferral/circuit/cuda/.

  • Call core: input_commit, output_commit, output_len are u16 cells.
  • Output core: output_commit, output_len are u16 cells.
  • Canonicity sub-AIR operates over u16 cells; upper-bound checks become per-u16-cell 16-bit range checks (instead of byte-pair bitwise checks).
  • Deferral rd_val / rs_val register-pointer columns are u16-shaped.
  • Reuses #2794's split_byte_memory_ops / split_f_memory_ops and byte_memory_op_chunk / f_memory_op_chunk helpers; no new split_memory_ops-style helper is introduced.
  • CUDA tracegen mirrors the new column shapes.

Sponge absorption rate

Files: extensions/deferral/circuit/src/output/{air,trace,execution,tests}.rs, extensions/deferral/circuit/src/def_fn.rs, extensions/deferral/circuit/src/extension/mod.rs, extensions/deferral/circuit/cuda/src/output.cu.

  • SPONGE_BYTES_PER_ROW = 2 * DIGEST_SIZE = 16 (was DIGEST_SIZE = 8).
  • sponge_inputs cells hold u16 values on data rows, each carrying two output bytes; data-row sponge_inputs receive 16-bit range checks. The init row remains [deferral_idx, output_len, 0, ...] and is exempt.
  • Two memory-bus writes per output row instead of one — the row now consumes 16 bytes of output, not 8.
  • def_fn::hash_output_raw packs byte pairs into each sponge cell before absorbing.
  • Bitwise-lookup wiring removed from the output chip; the call chip still uses bitwise lookup for non-canonical commit checks.

Output length contract

  • output_len remains a byte count.
  • Raw deferral outputs must be a multiple of 16 bytes; non-16-byte-aligned outputs are invalid with no implicit padding.
  • Host-side failure-path test asserts the panic when a non-16-aligned raw output is submitted.
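A host could check the contract before submitting an output. The sketch below is an assumption about how such a guard might look: `check_raw_output_len` is a hypothetical name, and it returns a `Result` for illustration, whereas the PR's host-side path panics on misaligned input.

```rust
/// Bytes absorbed per output row after this PR.
const SPONGE_BYTES_PER_ROW: usize = 16;

/// Hypothetical guard: returns the number of data rows for an aligned
/// output, or an error message for a length that is not a multiple of 16.
/// `len` stays a byte count, matching the output_len contract.
fn check_raw_output_len(len: usize) -> Result<usize, String> {
    if len % SPONGE_BYTES_PER_ROW != 0 {
        return Err(format!(
            "raw output length {} is not a multiple of {}",
            len, SPONGE_BYTES_PER_ROW
        ));
    }
    Ok(len / SPONGE_BYTES_PER_ROW)
}
```

With this sketch, a 32-byte output maps to two data rows, while a 24-byte output is rejected instead of being padded.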

Migration notes

  • Hosts producing deferral output must align raw outputs to a multiple of 16 bytes.
  • Output-chip configs that previously passed bitwise_lu should drop the argument.

resolves int-7834

@shuklaayush shuklaayush force-pushed the feat/memory-bus-u16-pr5 branch from 0b3f6a9 to b09573e Compare May 22, 2026 13:44
@shuklaayush shuklaayush marked this pull request as ready for review May 22, 2026 13:50
@shuklaayush shuklaayush force-pushed the feat/memory-bus-u16-pr5 branch from c27ce01 to 2164e1e Compare May 22, 2026 14:40
@shuklaayush shuklaayush force-pushed the feat/memory-bus-u16-pr5 branch from 2164e1e to caee7ff Compare May 22, 2026 14:53
@shuklaayush shuklaayush marked this pull request as draft May 22, 2026 15:01
@shuklaayush shuklaayush force-pushed the feat/memory-bus-u16-pr5 branch from caee7ff to a322ef8 Compare May 22, 2026 15:22

PR 5 of the memory-bus-u16 split, stacked on PR 1. Two independent parts.

Part 1 — internal u16 reshape, OutputKey byte layout unchanged:
- Call core: input_commit, output_commit, output_len trace columns
  reshape to u16 cells.
- Output core: output_commit, output_len reshape to u16 cells.
- Canonicity sub-AIR over u16 cells.
- Deferral rd_val / rs_val register-pointer columns become u16-shaped.
- CPU + CUDA tracegen updates.
- Reuses PR1's byte-memory and field-memory chunk helpers; no new
  generic split_memory_ops-style code is introduced.

Part 2 — output commitment-format change (breaks vkey compatibility):
- SPONGE_BYTES_PER_ROW = 2 * DIGEST_SIZE = 16; sponge_inputs cells hold
  u16 values on data rows and receive 16-bit range checks. Init row stays
  `[deferral_idx, output_len, 0, ...]` and is exempt from u16 checks.
- Two memory-bus writes per output row instead of one.
- def_fn::hash_output_raw packs byte pairs into each sponge cell.
- Bitwise-lookup wiring removed from the output chip only — it no longer
  needs byte-level range checks. The call chip still uses bitwise.
- CUDA + tests updated; expected output commits regenerated.

Output length contract:
- output_len remains a byte count.
- Raw deferral outputs must be a multiple of 16 bytes; non-16-byte
  aligned raw outputs are invalid (no implicit padding).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shuklaayush shuklaayush force-pushed the feat/memory-bus-u16-pr5 branch from a322ef8 to eef7099 Compare May 22, 2026 21:20

ghost commented May 22, 2026

Claude finished @shuklaayush's task in 2m 29s


Code Review: PR #2808 — Switch to u16 limbs in deferral AIRs

  • Gather PR diff and understand scope
  • Review utils.rs changes (helpers, constants)
  • Review canonicity sub-AIR changes
  • Review call chip (air, trace, mod, tests, cuda)
  • Review output chip (air, trace, execution, tests, cuda)
  • Review def_fn.rs and extension wiring changes
  • Review CUDA header and ABI changes
  • Review riscv adapter and continuations changes
  • Review verify-stark changes
  • Compile findings and post review

Constraint Soundness

I independently verified the constraint logic across all changed AIRs.

Canonicity sub-AIR (byte → u16): BabyBear ORDER = 0x78000001 decomposes into u16 BE limbs [0x7800, 0x0001]. The diff - 1 range-check value has max 0x77FF (when x_high = 0), well within [0, 0xFFFF]. The debug_assert! at canonicity/air.rs:46 correctly validates limb coverage.
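The bound on the diff - 1 value can be checked with a small worked example. This is a sketch of the limb arithmetic only, not the AIR constraints; `be_u16_limbs` and `canonicity_witness` are hypothetical names introduced here.

```rust
/// BabyBear modulus, as quoted in the review.
const ORDER: u32 = 0x7800_0001;

/// Big-endian u16 limbs of a 32-bit value: [high, low].
fn be_u16_limbs(x: u32) -> [u16; 2] {
    [(x >> 16) as u16, (x & 0xFFFF) as u16]
}

/// Hypothetical helper: at the most significant limb where `x` differs from
/// ORDER, return `order_limb - x_limb - 1` if `x` is smaller there.
/// Returns None when `x >= ORDER`, i.e. when `x` is not canonical.
fn canonicity_witness(x: u32) -> Option<u16> {
    let p = be_u16_limbs(ORDER);
    let v = be_u16_limbs(x);
    for i in 0..2 {
        if v[i] != p[i] {
            return if v[i] < p[i] {
                Some(p[i] - v[i] - 1)
            } else {
                None
            };
        }
    }
    None // x == ORDER
}
```

The largest witness occurs at x = 0, where the high limb gives 0x7800 - 0 - 1 = 0x77FF. When the high limb equals 0x7800, the only canonical low limb is 0, and the witness is 0. Both values fit in 16 bits.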

Call core range-check coverage (37 per valid row):

  • input_commit canonicity diff-1: 8 RCs (cells are read from u16 memory bus, so no per-cell RC needed)
  • output_commit per-cell u16: 16 RCs + canonicity diff-1: 8 RCs
  • output_len per-cell u16: 2 RCs + address-width: 1 RC
  • Adapter rd/rs pointer high-u16: 2 RCs
  • Total: 37 — tracegen at call/trace.rs:179-210,405-413 matches exactly.

Output AIR range-check coverage:

  • First row: output_commit per-cell (16) + output_len per-cell (2) + output_commit canonicity diff-1 (8) + rd/rs ptr high-u16 (2) + output_len address-width (1) = 29
  • Data rows: sponge_inputs per-cell u16 (8 per row)
  • Init-row sponge cells correctly exempted — they're pinned by assert_array_eq against already-canonical columns (deferral_idx / output_len) at output/air.rs:212-216.
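The two tallies above can be written as compile-time checks. The constant names are hypothetical, and `output_call_rcs` assumes that a call consists of one first row followed by data rows with no other range-checked rows, which the review does not state explicitly.

```rust
/// Range checks per valid call-core row, itemized as in the review.
const CALL_ROW_RCS: usize = 8 // input_commit canonicity diff - 1
    + 16 // output_commit per-cell u16
    + 8 // output_commit canonicity diff - 1
    + 2 // output_len per-cell u16
    + 1 // output_len address width
    + 2; // adapter rd/rs pointer high u16

/// Range checks on the first row of the output AIR.
const OUTPUT_FIRST_ROW_RCS: usize = 16 + 2 + 8 + 2 + 1;

/// Range checks on each data row (one per sponge_inputs cell).
const OUTPUT_DATA_ROW_RCS: usize = 8;

// Fail the build if the itemized counts drift from the stated totals.
const _: () = assert!(CALL_ROW_RCS == 37);
const _: () = assert!(OUTPUT_FIRST_ROW_RCS == 29);

/// Assumed total for one output call with `data_rows` data rows.
fn output_call_rcs(data_rows: usize) -> usize {
    OUTPUT_FIRST_ROW_RCS + data_rows * OUTPUT_DATA_ROW_RCS
}
```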

Memory bus address strides: With SPONGE_BYTES_PER_ROW = 16, each data row writes at output_ptr + (section_idx - 1) * 16 + chunk_idx * 8 with SPONGE_ROW_MEMORY_OPS = 2 chunks. Produces contiguous 16-byte writes. Verified.
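The stride formula can be sketched as a small function. `data_row_write_addrs` is a hypothetical name, and the sketch assumes data rows start at section_idx = 1, which is what the `(section_idx - 1)` term suggests.

```rust
const SPONGE_BYTES_PER_ROW: u32 = 16;
const MEMORY_BLOCK_BYTES: u32 = 8;
const SPONGE_ROW_MEMORY_OPS: usize = 2;

/// Addresses of the memory-bus writes issued by one data row:
/// output_ptr + (section_idx - 1) * 16 + chunk_idx * 8.
fn data_row_write_addrs(output_ptr: u32, section_idx: u32) -> [u32; SPONGE_ROW_MEMORY_OPS] {
    assert!(section_idx >= 1, "data rows are assumed to start at section_idx = 1");
    let base = output_ptr + (section_idx - 1) * SPONGE_BYTES_PER_ROW;
    std::array::from_fn(|chunk_idx| base + chunk_idx as u32 * MEMORY_BLOCK_BYTES)
}
```

With output_ptr = 0x1000, the first data row writes at 0x1000 and 0x1008 and the second at 0x1010 and 0x1018, so consecutive rows cover the output without gaps or overlap.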

Sponge absorption rate: Each u16 sponge cell carries 2 bytes; DIGEST_SIZE = 8 cells × 2 = 16 bytes per row. The bijection via le_bytes_to_u16_cells / pack_u8_pairs_le is fixed and invertible.
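A minimal sketch of the packing and its inverse follows. The function names `pack_le_pairs` and `unpack_le_pairs` are hypothetical stand-ins; the signatures of the PR's le_bytes_to_u16_cells and pack_u8_pairs_le are not shown in this thread, so this only illustrates the little-endian pairing.

```rust
const DIGEST_SIZE: usize = 8;
const SPONGE_BYTES_PER_ROW: usize = 2 * DIGEST_SIZE;

/// Pack 16 output bytes into 8 u16 sponge cells, little-endian within
/// each pair: cell i = bytes[2i] | (bytes[2i + 1] << 8).
fn pack_le_pairs(bytes: [u8; SPONGE_BYTES_PER_ROW]) -> [u16; DIGEST_SIZE] {
    std::array::from_fn(|i| u16::from_le_bytes([bytes[2 * i], bytes[2 * i + 1]]))
}

/// Inverse of `pack_le_pairs`: recover the 16 bytes from the 8 cells.
fn unpack_le_pairs(cells: [u16; DIGEST_SIZE]) -> [u8; SPONGE_BYTES_PER_ROW] {
    std::array::from_fn(|i| cells[i / 2].to_le_bytes()[i % 2])
}
```

Because every u16 value corresponds to exactly one byte pair, `unpack_le_pairs(pack_le_pairs(b)) == b` for any 16-byte input, which is the invertibility the review relies on.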

Timestamp arithmetic:

  • Call adapter delta: 2 + COMMIT_MEMORY_OPS + OUTPUT_TOTAL_MEMORY_OPS + 4 * DIGEST_F_MEMORY_OPS = 2 + 4 + 5 + 8 = 19. Matches call/trace.rs:418.
  • Output execution bridge: section_idx * SPONGE_ROW_MEMORY_OPS + OUTPUT_TOTAL_MEMORY_OPS + 2. Consistent between output/air.rs:382-383 and tracegen.
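Both timestamp expressions can be evaluated from the constants in the table below. This sketch copies the constant values from the review; `CALL_ADAPTER_DELTA` and `output_bridge_delta` are hypothetical names, and which section_idx the bridge is evaluated at is not specified here.

```rust
const COMMIT_MEMORY_OPS: u32 = 4;
const OUTPUT_TOTAL_MEMORY_OPS: u32 = 5;
const DIGEST_F_MEMORY_OPS: u32 = 2;
const SPONGE_ROW_MEMORY_OPS: u32 = 2;

/// Call adapter timestamp delta: 2 + 4 + 5 + 4 * 2.
const CALL_ADAPTER_DELTA: u32 =
    2 + COMMIT_MEMORY_OPS + OUTPUT_TOTAL_MEMORY_OPS + 4 * DIGEST_F_MEMORY_OPS;

// Fail the build if the delta drifts from the value quoted in the review.
const _: () = assert!(CALL_ADAPTER_DELTA == 19);

/// Output execution bridge expression for a given section index.
fn output_bridge_delta(section_idx: u32) -> u32 {
    section_idx * SPONGE_ROW_MEMORY_OPS + OUTPUT_TOTAL_MEMORY_OPS + 2
}
```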

Constant Arithmetic

All constants verified consistent between Rust and CUDA:

| Constant | Value | Rust derivation | CUDA derivation |
|---|---|---|---|
| F_NUM_U16S | 2 | F_NUM_BYTES / U16_CELL_SIZE | same |
| COMMIT_NUM_U16S | 16 | DIGEST_SIZE * F_NUM_U16S | same |
| OUTPUT_LEN_NUM_U16S | 4 | OUTPUT_LEN_NUM_BYTES / U16_CELL_SIZE | same |
| OUTPUT_TOTAL_NUM_U16S | 20 | COMMIT_NUM_U16S + OUTPUT_LEN_NUM_U16S | same |
| SPONGE_BYTES_PER_ROW | 16 | U16_CELL_SIZE * DIGEST_SIZE | same |
| SPONGE_ROW_MEMORY_OPS | 2 | num_byte_memory_ops(16) | SPONGE_BYTES_PER_ROW / MEMORY_BLOCK_BYTES |
| COMMIT_MEMORY_OPS | 4 | num_byte_memory_ops(32) | COMMIT_NUM_U16S / BLOCK_FE_WIDTH |
| OUTPUT_TOTAL_MEMORY_OPS | 5 | num_f_memory_ops(20) | OUTPUT_TOTAL_NUM_U16S / BLOCK_FE_WIDTH |
| DIGEST_F_MEMORY_OPS | 2 | DIGEST_SIZE / BLOCK_FE_WIDTH | same |

Note on COMMIT_MEMORY_OPS: The Rust code derives this as num_byte_memory_ops(COMMIT_NUM_BYTES) = 32/8 = 4, while CUDA derives it as COMMIT_NUM_U16S / BLOCK_FE_WIDTH = 16/4 = 4. Both yield 4 because MEMORY_BLOCK_BYTES / U16_CELL_SIZE == BLOCK_FE_WIDTH (8/2 == 4). This equivalence holds structurally; worth noting in case this invariant ever breaks.
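One way to guard this invariant is a pair of compile-time assertions. The sketch below is an assumption about how that could look, not code from the PR: the base values are inferred from the table, and the `_RUST` / `_CUDA` suffixes are hypothetical labels for the two derivations.

```rust
// Base values inferred from the table above.
const DIGEST_SIZE: usize = 8;
const F_NUM_BYTES: usize = 4;
const U16_CELL_SIZE: usize = 2;
const MEMORY_BLOCK_BYTES: usize = 8;
const BLOCK_FE_WIDTH: usize = 4;

const F_NUM_U16S: usize = F_NUM_BYTES / U16_CELL_SIZE;
const COMMIT_NUM_U16S: usize = DIGEST_SIZE * F_NUM_U16S;
const COMMIT_NUM_BYTES: usize = COMMIT_NUM_U16S * U16_CELL_SIZE;

/// Rust-style derivation: byte count divided by bytes per memory block.
const COMMIT_MEMORY_OPS_RUST: usize = COMMIT_NUM_BYTES / MEMORY_BLOCK_BYTES;
/// CUDA-style derivation: u16 cell count divided by cells per block.
const COMMIT_MEMORY_OPS_CUDA: usize = COMMIT_NUM_U16S / BLOCK_FE_WIDTH;

// The two derivations agree exactly when a memory block holds
// BLOCK_FE_WIDTH u16 cells; the build fails if that stops being true.
const _: () = assert!(MEMORY_BLOCK_BYTES / U16_CELL_SIZE == BLOCK_FE_WIDTH);
const _: () = assert!(COMMIT_MEMORY_OPS_RUST == COMMIT_MEMORY_OPS_CUDA);
```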

CUDA/Rust Struct Layout Parity

All CUDA structs match their Rust counterparts field-for-field:

  • DeferralCallReadsBytes, DeferralCallWritesBytes, DeferralCallReads, DeferralCallWrites, DeferralCallCoreCols, DeferralCallAdapterRecord, DeferralCallAdapterCols
  • DeferralOutputRecordHeader, DeferralOutputCols, DeferralOutputPerCall, DeferralOutputPerRow
  • CUDA pack_u8_pairs_le / u32_bytes_to_le_u16_cells helpers mirror Rust's le_bytes_to_u16_cells / u32_to_le_u16_cells.
  • CUDA scale_rv64_ptr_from_u32_bytes extracts bytes [2, 3] as the high u16, matching Rust's scale_rv64_ptr_from_u32_value(ptr >> U16_BITS, ...).
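The equivalence in the last bullet comes down to little-endian byte order. The sketch below shows only the high-u16 extraction; the scaling step of the real helpers is omitted because it is not described in this thread, and both function names are hypothetical.

```rust
const U16_BITS: u32 = 16;

/// CUDA-style view: take bytes [2, 3] of the little-endian u32 encoding.
fn high_u16_from_le_bytes(bytes: [u8; 4]) -> u16 {
    u16::from_le_bytes([bytes[2], bytes[3]])
}

/// Rust-style view: shift the u32 value right by 16 bits.
fn high_u16_from_value(ptr: u32) -> u16 {
    (ptr >> U16_BITS) as u16
}
```

For ptr = 0x1234_5678, both paths yield 0x1234.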

Extension Wiring

BitwiseOperationLookupBus/BitwiseOperationLookupChip discovery-or-create logic in extension/mod.rs is fully removed. Both call and output chips now use inventory.range_checker().bus for the range bus. The bitwise bus is cleanly excised from both CPU and GPU prover paths, all test harnesses, and all chip constructors. No stale references remain.

verify-stark Output Circuit

Both the AIR (output/air.rs:139-148, 179-188) and tracegen (output/trace.rs:71-81) correctly use U16_BITS-wide range checks for per-cell and canonicity interactions. The next_f_to_digest helper (trace.rs:112-120) properly decomposes F elements into u16-packed cells via u16::from_le_bytes, matching the AIR's expectation.

A previous automated review incorrectly flagged these as bugs (wrong max_bits: 8 and byte-shaped tracegen), but in the actual code both max_bits and the tracegen are already correctly updated to u16 in this PR. Confirmed.

The new RangeCheckerAir::<U16_BITS> is properly added to the circuit's AIR list (lib.rs:108-110, 127) and its trace is generated via RangeCheckerCpuTraceGenerator::<U16_BITS> (trace.rs:48, 99), replacing the previous range_inputs: Vec<usize> approach. Clean refactor.

expand_to_rv64_block

The new helper at riscv/circuit/src/adapters/mod.rs:378-391 generalizes zero-padding of N u16 limbs to BLOCK_FE_WIDTH cells. Used for rd_val/rs_val (RV64_PTR_U16S = 2 → 4) and output_len (F_NUM_U16S = 2 → 4). The compile-time assertion N <= BLOCK_FE_WIDTH prevents misuse. Clean and reusable.
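The padding behaviour can be sketched as follows. This is not the PR's implementation: `expand_to_block_sketch` and `AssertFits` are hypothetical names, the sketch is specialized to `u16` rather than generic expressions, and it uses an associated-const assertion as one possible way to get the compile-time check.

```rust
const BLOCK_FE_WIDTH: usize = 4;

/// Carrier for the compile-time bound N <= BLOCK_FE_WIDTH.
struct AssertFits<const N: usize>;

impl<const N: usize> AssertFits<N> {
    const OK: () = assert!(N <= BLOCK_FE_WIDTH, "too many limbs for one block");
}

/// Zero-pad N u16 limbs to a full block of BLOCK_FE_WIDTH cells.
fn expand_to_block_sketch<const N: usize>(limbs: [u16; N]) -> [u16; BLOCK_FE_WIDTH] {
    // Referencing the constant forces the bound to be checked when this
    // function is instantiated for a concrete N.
    let () = AssertFits::<N>::OK;
    let mut out = [0u16; BLOCK_FE_WIDTH];
    out[..N].copy_from_slice(&limbs);
    out
}
```

A two-limb pointer such as `[0xAAAA, 0xBBBB]` expands to `[0xAAAA, 0xBBBB, 0, 0]`, while instantiating the function with N = 5 fails to compile.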

def_fn::hash_output_raw

The sponge absorption now packs byte pairs into u16-valued sponge cells via le_bytes_to_u16_cells before calling perm. This matches the output AIR's data-row sponge cell packing. The alignment assertion (output_ref.len().is_multiple_of(SPONGE_BYTES_PER_ROW)) at def_fn.rs:84 correctly enforces the new 16-byte alignment requirement.

Continuations Changes

  • assert_user_pvs_shape extracted to root/mod.rs:42-46 and reused across commit/air.rs, commit/trace.rs, memory/air.rs, memory/trace.rs, and RootCircuit::new — replacing scattered debug_assert! calls with assert!, which is a correctness improvement that validates the invariant at construction time.
  • compute_output_f_commit in e2e.rs now delegates to generate_deferral_results + deferral_poseidon2_chip() instead of reimplementing sponge logic inline — ensures the test stays in sync with circuit changes automatically.
  • Guest programs and test fixtures updated to 16-byte-aligned outputs.

Test Coverage

  • New test output_raw_len_must_be_sponge_row_aligned (output/tests.rs:298-302) asserts the panic for non-16-aligned outputs
  • Guest programs updated to 16-byte outputs with recalculated expected values
  • Randomized test helpers produce SPONGE_BYTES_PER_ROW-aligned output lengths
  • CUDA tests mirror CPU test structure

No Issues Found

This is a clean, well-structured PR. The u16 cell transition halves the affected deferral trace columns, and the doubled sponge absorption rate halves the output row count. The AIR constraints, trace generation, CUDA kernels, extension wiring, and verify-stark circuit are all internally consistent. The assert_user_pvs_shape extraction in continuations is a nice quality improvement.


ghost commented May 22, 2026

| group | app.proof_time_ms | app.cycles | leaf.proof_time_ms |
|---|---|---|---|
| fibonacci | 1,567 | 4,000,051 | 438 |
| keccak | 14,065 | 14,365,133 | 2,411 |
| sha2_bench | 9,414 | 11,167,961 | 1,440 |
| regex | 1,477 | 4,090,656 | 360 |
| ecrecover | 474 | 112,210 | 267 |
| pairing | 598 | 592,827 | 257 |
| kitchen_sink | 1,818 | 1,979,971 | 405 |

Note: cells_used metrics omitted because CUDA tracegen does not expose unpadded trace heights.

Commit: eef7099

Benchmark Workflow

@shuklaayush shuklaayush marked this pull request as ready for review May 22, 2026 21:30

ghost commented May 22, 2026

Claude finished @shuklaayush's task in 2m 31s


Code Review: PR #2808 — Switch to u16 limbs in deferral AIRs

Minor Observations (Non-blocking)

  1. combine_output_cells type inconsistency (utils.rs:255-264): The function signature takes output_len: [T; OUTPUT_LEN_NUM_U16S] (4 cells), but at call sites in call/air.rs:467 and output/air.rs:306, output_len_full has already been expanded to [AB::Expr; BLOCK_FE_WIDTH] (4 cells) via expand_to_rv64_block. This works because OUTPUT_LEN_NUM_U16S == BLOCK_FE_WIDTH == 4, but semantically the function accepts an already-zero-padded array rather than a raw u16 decomposition. This is fine — just noting that OUTPUT_TOTAL_NUM_U16S = COMMIT_NUM_U16S + OUTPUT_LEN_NUM_U16S (not + BLOCK_FE_WIDTH), and the correctness relies on OUTPUT_LEN_NUM_U16S == BLOCK_FE_WIDTH. A const { assert!(...) } in combine_output_cells would make this invariant explicit.

  2. split_cell_memory_ops used for sponge write chunks (output/air.rs:342): split_cell_memory_ops::<AB::Expr, DIGEST_SIZE, SPONGE_ROW_MEMORY_OPS> splits DIGEST_SIZE = 8 sponge cells into 2 chunks of BLOCK_FE_WIDTH = 4. This works cleanly. Just noting that split_cell_memory_ops was named for F-celled memory ops, but here it's used for u16-valued cells being written to byte-addressed memory. The abstraction is correct (both are BLOCK_FE_WIDTH-cell splits), but a brief doc note could help future readers.
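The compile-time assertion suggested in the first observation could look roughly like this. It is a sketch under assumptions: `combine_output_cells_sketch` is a hypothetical stand-in, the commit-then-length ordering is inferred from OUTPUT_TOTAL_NUM_U16S = COMMIT_NUM_U16S + OUTPUT_LEN_NUM_U16S, and the real function's body is not shown in this thread.

```rust
const COMMIT_NUM_U16S: usize = 16;
const OUTPUT_LEN_NUM_U16S: usize = 4;
const BLOCK_FE_WIDTH: usize = 4;
const OUTPUT_TOTAL_NUM_U16S: usize = COMMIT_NUM_U16S + OUTPUT_LEN_NUM_U16S;

// Make the invariant explicit: callers pass a block-expanded output_len,
// which only type-checks because these two widths happen to be equal.
const _: () = assert!(OUTPUT_LEN_NUM_U16S == BLOCK_FE_WIDTH);

/// Concatenate the commit cells and the (already zero-padded) length cells.
fn combine_output_cells_sketch<T: Copy + Default>(
    commit: [T; COMMIT_NUM_U16S],
    output_len: [T; OUTPUT_LEN_NUM_U16S],
) -> [T; OUTPUT_TOTAL_NUM_U16S] {
    let mut out = [T::default(); OUTPUT_TOTAL_NUM_U16S];
    out[..COMMIT_NUM_U16S].copy_from_slice(&commit);
    out[COMMIT_NUM_U16S..].copy_from_slice(&output_len);
    out
}
```

If OUTPUT_LEN_NUM_U16S or BLOCK_FE_WIDTH changes independently, the build fails at the assertion rather than at a less obvious type mismatch at the call sites.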

No Issues Found

This is a clean, well-structured PR. The u16 cell transition halves the affected deferral trace columns, and the doubled sponge absorption rate halves the output row count. The AIR constraints, trace generation, CUDA kernels, extension wiring, and verify-stark circuit are all internally consistent. The assert_user_pvs_shape extraction in continuations is a nice quality improvement.

