Batch main sumcheck across chips by hero78119 · Pull Request #1333 · scroll-tech/ceno

hero78119 · 2026-04-29T13:52:33Z

Problem

Main sumcheck was proved and verified per chip, which duplicated transcript work, selector/claim handling, and PCS opening plumbing across chips.

Design Rationale

Use one global batched main sumcheck proof while keeping PCS openings in the existing suffix path. The verifier mirrors the prover's transcript order, including sampling the ECC bridge challenge before the global combine subset evals challenge, and evaluates the frontloaded expressions itself.
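
As a rough illustration of why per-chip claims can be folded into one global claim, the sketch below combines per-chip hypercube sums with powers of a single challenge and checks that this equals the hypercube sum of the pointwise-combined tables. It is a toy under stated assumptions, not Ceno's implementation: the field (integers modulo 2^31 - 1), the function names, and the requirement that all chips have equal-sized tables are all hypothetical simplifications.

```rust
// Toy field: integers modulo the Mersenne prime 2^31 - 1.
// This is an assumption for illustration, not the field Ceno uses.
const P: u64 = 2_147_483_647;

/// Sum of one chip's evaluation table over the boolean hypercube, mod P.
fn hypercube_sum(evals: &[u64]) -> u64 {
    evals.iter().fold(0, |acc, &e| (acc + e % P) % P)
}

/// Combine per-chip claims as sum_i alpha^i * claim_i (mod P).
fn batch_claims(claims: &[u64], alpha: u64) -> u64 {
    let mut acc = 0u64;
    let mut pow = 1u64; // alpha^0
    for &c in claims {
        acc = (acc + pow * (c % P) % P) % P;
        pow = pow * (alpha % P) % P;
    }
    acc
}

/// Combine the tables pointwise with the same powers of alpha.
/// Assumes every table has the same length (same number of variables).
fn batch_tables(tables: &[Vec<u64>], alpha: u64) -> Vec<u64> {
    let len = tables[0].len();
    let mut out = vec![0u64; len];
    let mut pow = 1u64;
    for t in tables {
        assert_eq!(t.len(), len, "toy sketch assumes equal-sized chips");
        for (o, &e) in out.iter_mut().zip(t) {
            *o = (*o + pow * (e % P) % P) % P;
        }
        pow = pow * (alpha % P) % P;
    }
    out
}

fn main() {
    // Three hypothetical chips, each with a 2-variable table.
    let chips = vec![vec![1, 2, 3, 4], vec![5, 6, 7, 8], vec![9, 10, 11, 12]];
    let alpha = 7; // would come from the transcript in a real protocol
    let claims: Vec<u64> = chips.iter().map(|t| hypercube_sum(t)).collect();
    let combined = batch_tables(&chips, alpha);
    // Linearity: one sumcheck over `combined` covers all per-chip claims.
    assert_eq!(hypercube_sum(&combined), batch_claims(&claims, alpha));
    println!("batched claim = {}", batch_claims(&claims, alpha));
}
```

Chips with different numbers of variables need extra handling that this sketch leaves out.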

Change Highlights

ceno_zkvm: batches main constraints into a single global proof path across chip proofs.
ceno_zkvm: keeps witness/fixed PCS openings per chip after global main verification.
ceno_recursion: mirrors native verifier changes for the batched main proof.
ceno-gpu: supports the batched main proving flow.

Benchmark / Performance Impact

The benchmark session compares the current PR branch against the frontload baseline on block 23817600, with GPU proving and CENO_GPU_ENABLE_WITGEN=0.

E2E / Layer

Metric	Baseline	This PR	Delta
E2E total	75.600s	103.000s	+27.400s (+36.2%)
emulator	10.100s	10.300s	+0.200s (+2.0%)
app_prove wall time	61.000s	87.400s	+26.400s (+43.3%)
app.verify	3.390s	4.040s	+0.650s (+19.2%)
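
For readers who want to re-derive the Delta column, the following sketch recomputes it from the Baseline and This PR values. The helper names are hypothetical and the formatting only imitates the table's last column.

```rust
/// Absolute and relative change from `baseline` to `current` (both in seconds).
fn delta(baseline: f64, current: f64) -> (f64, f64) {
    let abs = current - baseline;
    let pct = abs / baseline * 100.0;
    (abs, pct)
}

/// Format a delta in the style of the table's last column.
fn format_delta(baseline: f64, current: f64) -> String {
    let (abs, pct) = delta(baseline, current);
    format!("{:+.3}s ({:+.1}%)", abs, pct)
}

fn main() {
    // Values copied from the E2E / Layer table.
    let rows = [
        ("E2E total", 75.600, 103.000),
        ("emulator", 10.100, 10.300),
        ("app_prove wall time", 61.000, 87.400),
        ("app.verify", 3.390, 4.040),
    ];
    for (name, base, cur) in rows {
        println!("{name}: {}", format_delta(base, cur));
    }
}
```

Rows with a zero baseline have no meaningful percentage and would need separate handling.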

App Prove Breakdown

Profiler module totals can overlap because chip proving is concurrent; use app_prove wall time above for end-to-end impact. Corrected parser coverage adds the new batched-main span, which is now the main critical-path regression source.

Operation	Baseline	This PR	Delta
prove_batched_main_constraints	0.000s	27.375s	+27.375s (new)
prove_main_constraints	22.622s	0.000s	-22.622s (-100.0%)
extract_witness_mles	24.155s	3.760s	-20.395s (-84.4%)
build_tower_witness_gpu	3.491s	0.323s	-3.168s (-90.7%)
prove_tower_relation_gpu	176.090s	24.008s	-152.082s (-86.4%)
pcs_opening	15.246s	15.207s	-0.039s (-0.3%)
commit_traces	6.827s	6.814s	-0.013s (-0.2%)
parsed rows total	251.118s	78.370s	-172.748s (-68.8%)
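
The caveat that module totals overlap under concurrent chip proving can be made concrete with a small sketch: summing span durations counts overlapping time repeatedly, while wall time is the length of the union of the intervals. The span values and function names below are made up for illustration and do not come from the benchmark.

```rust
/// Sum of individual span durations (what per-module totals add up to).
fn summed_duration(spans: &[(u64, u64)]) -> u64 {
    spans.iter().map(|&(start, end)| end - start).sum()
}

/// Wall time covered by the spans: length of the union of the intervals.
fn wall_time(spans: &[(u64, u64)]) -> u64 {
    let mut sorted = spans.to_vec();
    sorted.sort();
    let mut total = 0;
    let mut current: Option<(u64, u64)> = None;
    for (start, end) in sorted {
        match current {
            // Overlaps (or touches) the running interval: extend it.
            Some((cs, ce)) if start <= ce => current = Some((cs, ce.max(end))),
            // Disjoint: close the running interval and start a new one.
            Some((cs, ce)) => {
                total += ce - cs;
                current = Some((start, end));
            }
            None => current = Some((start, end)),
        }
    }
    if let Some((cs, ce)) = current {
        total += ce - cs;
    }
    total
}

fn main() {
    // Hypothetical (start, end) spans in milliseconds for three concurrent chips.
    let spans = [(0, 10), (2, 12), (5, 8)];
    println!(
        "summed = {} ms, wall = {} ms",
        summed_duration(&spans),
        wall_time(&spans)
    );
}
```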

Benchmark command:

CENO_GPU_ENABLE_WITGEN=0 CENO_CONCURRENT_CHIP_PROVING=1 CENO_GPU_CACHE_LEVEL=0 \
RUSTFLAGS="-C target-feature=+avx2" \
cargo run --features "jemalloc,gpu" --release --bin ceno-reth-benchmark-bin -- \
  --mode prove-app --block-number 23817600 --rpc-url <redacted> \
  --output-dir output --cache-dir rpc-cache

Environment:

GitHub self-hosted GPU runner, CUDA device cc=8.9, 24GB GPU memory.
Rust nightly-2025-11-20, cargo 1.93.0-nightly.
This PR benchmark: run 25594090744, Ceno dd229c00, summary.
Baseline: run 25419833788 / job 74559223217, Ceno frontload baseline, summary.

Testing

RUST_MIN_STACK=33554432 cargo check --package ceno_recursion --bin e2e_aggregate
RUST_MIN_STACK=33554432 cargo run --release --package ceno_recursion --bin e2e_aggregate -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall

Also passed the linked GPU e2e benchmark run above.

Risks and Rollout

Soundness risk is concentrated in transcript ordering and verifier frontload evaluation; native and recursion verifiers now follow the same global proof flow.
Performance is not yet an E2E win in the linked benchmark despite removing per-chip main-constraint cost; further scheduling/host-overlap work is needed before rollout as a performance improvement.
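
To illustrate the transcript-ordering risk, here is a toy Fiat-Shamir transcript in which every challenge depends on everything absorbed or sampled before it. It uses the standard library's DefaultHasher purely for illustration and is not cryptographically secure. The type, the labels, and the helper are hypothetical and do not reflect Ceno's transcript API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Toy transcript: NOT secure, only shows order dependence.
struct ToyTranscript {
    hasher: DefaultHasher,
}

impl ToyTranscript {
    fn new() -> Self {
        Self { hasher: DefaultHasher::new() }
    }

    /// Absorb labelled prover data into the running state.
    fn absorb(&mut self, label: &str, data: &[u8]) {
        self.hasher.write(label.as_bytes());
        self.hasher.write(data);
    }

    /// Derive a challenge from the current state, then feed it back
    /// so later challenges depend on it.
    fn challenge(&mut self, label: &str) -> u64 {
        self.hasher.write(label.as_bytes());
        let c = self.hasher.finish();
        self.hasher.write_u64(c);
        c
    }
}

/// Absorb the same data, then sample challenges in the given order.
fn sample_in_order(order: &[&str]) -> Vec<u64> {
    let mut t = ToyTranscript::new();
    t.absorb("chip_claims", b"claims");
    order.iter().map(|label| t.challenge(label)).collect()
}

fn main() {
    // Hypothetical labels echoing the two challenges named in the description.
    let prover = sample_in_order(&["ecc_bridge", "combine_subset_evals"]);
    let verifier = sample_in_order(&["ecc_bridge", "combine_subset_evals"]);
    assert_eq!(prover, verifier); // same order, same challenges

    // A verifier that samples in the other order derives a different value
    // for the same labelled challenge, so the proof would not verify.
    let swapped = sample_in_order(&["combine_subset_evals", "ecc_bridge"]);
    assert_ne!(prover[1], swapped[0]);
}
```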

Follow-ups

Investigate reducing the new prove_batched_main_constraints critical-path cost.
Keep benchmark summaries explicit that parsed module totals overlap and are not a wall-time decomposition.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

hero78119 added 30 commits April 25, 2026 23:18

refactor GPU compact tower witness flow

ac49ac6

Fix compact tower memory accounting

84a2631

Optimize compact logup ones allocation

12453f6

update dep

7d60f01

Merge branch 'master' into feat/prover_mle_zero_padding

925de92

fix main mem estimation

e9fbe9c

Merge branch 'master' of github.com:scroll-tech/ceno into feat/prover_mle_zero_padding

46e87bb

Merge branch 'feat/prover_mle_zero_padding' of github.com:scroll-tech/ceno into feat/prover_mle_zero_padding

b888fbb

fix mem estimator

5ecce04

snapshot compact tower estimator state

be14006

rollback Cargo.toml, Cargo.lock change

df88dec

fix memory estimation

b57b692

verifier log

c50b793

Pass tower input by value for GPU proving

89b8698

split tower layer by view

f210e1f

Use dense tower build for compact GPU input

99b7a94

Pass logup shape to tower prove estimator

f0d81b6

Deduplicate borrowed tower input booking

917810c

fix logging

4fc8dae

Check scheduler memory estimate in mem tracking

ef9fa30

Refine replay tower proof memory estimate

011a898

clippy fix

f3ca1cf

add missing synchronization, avoid race condition

147f567

Account ShardRam tower prove allocator overhead

94fc7bf

misc: clippy fix

c9401d1

Fix GPU proof memory estimation

d14e66a

Fix GPU proof estimate row basis

ceced51

Tune ShardRam tower proof estimate

d1ab71a

Batch main constraints into single sumcheck

7c6e97c

Restore replay backing before batched main

505e258

hero78119 added 5 commits April 29, 2026 13:35

Replay witness backing incrementally during PCS opening

b2fba0f

wip more log

25d7f42

Improve GPU proof failure diagnostics

2128bf9

Compact ShardRAM main witness extraction

d5513de

Log batched main MLE histograms

2df2590

hero78119 marked this pull request as draft April 29, 2026 13:52

hero78119 mentioned this pull request Apr 29, 2026

Replay PCS traces incrementally #1332

Closed

hero78119 added 9 commits April 30, 2026 09:57

Fix batched main GPU verification

b67c6b7

Use legacy layout for batched main GPU sumcheck

a4d066f

update gkr dependency

29ae6df

Merge branch 'master' of github.com:scroll-tech/ceno into feat/prover_mle_zero_padding

7d1a9de

Merge branch 'feat/prover_mle_zero_padding' into feat/batch_main_sumcheck

7ebc0cf

perf(gpu): trim batched main sumcheck work

fb96061

perf(gpu): use direct layout for batched main

be5a6f5

Experiment staggered batched main sumcheck prover

27ed865

Use dedicated batched main sumcheck prover

268025b

Base automatically changed from feat/prover_mle_zero_padding to master May 4, 2026 07:55

hero78119 added 3 commits May 9, 2026 13:55

chore: checkpoint frontload integration

87f85be

chore: upgrade gkr-backend to alpha.28

dd229c0

feat(recursion): verify batched main sumcheck

23d5a6a

hero78119 changed the title ~~batch main sumcheck~~ Batch main sumcheck across chips May 9, 2026

hero78119 wants to merge 47 commits into master from feat/batch_main_sumcheck
