feat: halo2 gpu test (DO NOT MERGE)#2772
Conversation
## Summary - Renames the constant `PUBLIC_VALUES_AIR_ID` to `MEMORY_AIRS_START_IDX` across 8 files - Removes old unused `BOUNDARY_AIR_ID` - Updates the doc comment to reflect the constant's current purpose as the boundary AIR index - The `PublicValuesChip` was removed in #2480, making the old name misleading Follows up on review comment in #2480. Generated with [Claude Code](https://claude.ai/code) --------- Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com> Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com> Co-authored-by: Ayush Shukla <shuklaayush@users.noreply.github.com> Co-authored-by: Ayush Shukla <ayush@axiom.xyz>
Caused a perf regression in CUDA due to the larger memory merkle tree. (Previously the sdk has an optimization to set it to 0, which was removed in the DEFERRAL_AS renaming PR.)
This makes usage more rigid and predictable: every counter must always start at zero.
This resolves: - In Eq3bAir (INT-6599): - INT-6446: added the transition constraint - INT-6447: replaced the old check with a more general one which works within a proof as opposed to within an AIR - INT-6448: added explicit boundary conditions - In EqSharpUniAir (INT-6600): - INT-6444: added the transition equality check - INT-6445: added the boundary condition - In EqUniAir (INT-6601): - INT-6434: added the boundary condition - In EqNsAir (INT-6602): - INT-6376: introduced two new columns `n_logup` and `n_max`, constrained them to be constants within a proof, added constraints to make `n_less_than_n_logup` and `n_less_than_n_max` actual indicators that `n` is less than them - In EqNegAir (INT-6603): - INT-6371: added the boundary and transition checks for `one_half_pow` - INT-6372: added the boundary and transition checks for `prod_1_r_omega` (resolves INT-6599 INT-6446 INT-6447 INT-6448 INT-6600 INT-6444 INT-6445 INT-6601 INT-6434 INT-6602 INT-6376 INT-6603 INT-6371 INT-6372)
This resolves INT-6401 ergo INT-6595: FractionFolderAir now enforces `cur_q_sum` to be a running sum and also doesn't care about the value of `alpha`, instead GkrInputAir receives a denominator not including `alpha`, which saves us the trouble of storing `alpha` elsewhere.
- reject is_present = 1 on dummy cached rows by requiring valid cached flags - move NodeKind into symbolic_expression/air.rs so the canonical discriminants live with the audited AIR logic - remove the unused claim_bus and eq_neg_internal_bus wiring
This resolves: - INT-6578: added the boundary and transition checks - INT-6580: simplified the tracegen, added the zero check - INT-6581: enforced that `eq_3b` is constant within the proof - INT-6582: added the implication check - INT-6583: just zeroed the denominator because it doesn't need to include `alpha` at that point - INT-6584: added the correct check - INT-6585: removed `node_idx` from the columns, because the relation with `node_idx` is enforced on the side of SymbolicExpressionAir (and also removed it from the tracegen and replaced the corresponding `Vec` in the gpu blob with just a counter) This ticket doesn't resolve the remaining two issues from the air because it would be better to do that on top of #2491. (resolves INT-6578 INT-6580 INT-6581 INT-6582 INT-6583 INT-6584 INT-6585)
This resolves: - INT-6572: added boundary checks - INT-6573: added a couple of new buses and propagated `tidx` to cover `lambda_tidx` - INT-6574: added the zero check for the extreme case - INT-6575: added the boundary conditions - INT-6576: made an already existing constraint stricter (resolves INT-6592 INT-6572 INT-6573 INT-6574 INT-6575 INT-6576)
This resolves the rest of INT-6591: - INT-6577: received `AirId` from ProofShapeAir - INT-6579: received `beta_tidx` from GkrInputAir by analogy with `lambda_tidx` in ConstraintsFoldingAir Since this reuses some logic from #2491, this is on top of that commit. (resolves INT-6591 INT-6577 INT-6579)
…2520) - properly escape underscores in readme's latex - add example traces at end - update cuda image family to g6.*
fix: INT-6590 fix: INT-6396, INT-6397, INT-6398, INT-6399, INT-6402, INT-6439
Resolves INT-7422.
Resolves INT-7433.
## Summary
Refactors the SHA-2 block hasher CUDA tracegen kernel from a monolithic
1-thread-per-block design into a three-phase pipeline for better memory
coalescing on column-major traces.
| Phase | Kernel | Parallelism | Purpose |
|-------|--------------------------|-----------------|----------------------------------------------------------------|
| 1 | `sha2_first_pass_phase1` | 1 thread/block | Compute SHA-2 rounds,
snapshot `{a..h, w_buf}` to scratch |
| 2 | `sha2_first_pass_phase2` | 1 thread/row | Read snapshot, write
trace columns (coalesced) |
| 3 | `sha2_first_pass_phase3` | 1 thread/block | Cross-row dependencies
(`intermed_4/8/12`, carry) |
Also drops unused parameters (`ptr_max_bits`,
`range_checker_ptr`/`num_bins`, `timestamp_max_bits`) from the block
hasher kernel and its Rust-side wrappers, simplifying
`Sha2BlockHasherChipGpu`.
## Changes
- `extensions/sha2/circuit/cuda/src/sha2_hasher.cu`: split first-pass
tracegen into three phases; tightened `CHECK_KERNEL` error propagation
in `launch_sha2_second_pass_dependencies`.
- `extensions/sha2/circuit/src/cuda/{mod.rs,cuda_abi.rs}`: wire the new
scratch buffer and drop unused wrapper arguments.
- `extensions/sha2/circuit/src/extension/cuda.rs`: update
`Sha2BlockHasherChipGpu::new` call sites.
- `extensions/sha2/circuit/src/sha2_chips/tests.rs`: update tests to
match the new constructor signature.
## Correctness notes
- Phase 1 snapshots state before each row's rounds; Phase 2 restores
that snapshot and replays the rounds to compute intermediate trace
values. The round computations are identical across phases.
- Scratch is sized as `num_blocks * ROWS_PER_BLOCK * (8 + BLOCK_WORDS)`
words, exactly matching what's written/read (no oversize).
- Phase 2 launches `num_blocks * ROWS_PER_BLOCK` threads, so every
thread maps to a valid block.
- Padding rows are still handled by the existing
`sha2_fill_first_dummy_row` + `sha2_fill_invalid_rows` kernels.
- Digest row: Phase 1 snapshots the final state (after all rounds) for
`row_in_block == ROUND_ROWS`; Phase 2 reads it to compute `final_hash =
prev_hash + work_vars`.
## Test plan
- [ ] `cargo nextest run --cargo-profile=fast -p openvm-sha2-circuit`
- [ ] CUDA tracegen tests for both SHA-256 and SHA-512 variants (feature
`cuda`)
- [ ] Verify trace matrices match CPU reference for representative
inputs
resolves int-6985
Towards INT-7429.
## Summary
- Mark `finalize_ptr`, `finalize`, and `native_keccak256` as `unsafe fn`
in both `zkvm_impl.rs` and `host_impl.rs`. These functions write 32
bytes to an output pointer/slice via `copy_nonoverlapping` without
runtime size checks (only `debug_assert!`), so callers must guarantee
the buffer is at least 32 bytes.
- Add `assert!` in the `tiny_keccak::Hasher` trait impl in `lib.rs`
since the trait method must be safe and cannot propagate `unsafe` to
callers.
- Safe public APIs (`keccak256()`, `set_keccak256()`) are unchanged —
they pass correctly-sized buffers internally.
**TODO:** The `openvm-org/tiny-keccak` fork and
`openvm-org/openvm-primitives` also call `Keccak256::finalize` /
`native_keccak256` and will need matching updates (add `unsafe {}`
blocks and/or `assert!` at their trait boundaries).
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com>
Towards INT-7422.
Resolves INT-7439, INT-7422.
…ebra+ecc + update sha2 docs (#2710) Resolves INT-7439.
Towards INT-7429. ## Summary - Fix OOB read in `native_xorin` when `len` is not 4-byte aligned: `AlignedBuf::new` copied `adjusted_len` bytes from source buffers, reading 1-3 bytes past the valid region. Replace with `AlignedBuf::uninit` + `copy_nonoverlapping` of only `len` bytes. - Mark `native_xorin` and `native_keccakf` as `unsafe extern "C" fn` since they take raw pointers and require callers to guarantee pointer validity and buffer sizes. Add `unsafe` blocks in callers `xorin` and `keccakf` with safety comments. - Update keccak extension docs code example to reflect the new unsafe signatures. --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Towards INT-7478. - Replace `assert_eq!` with `Err(StaticProgramError::InvalidInstruction)` in multiplication executor `pre_compute_impl` for both bigint (mult.rs) and rv32im (mul/execution.rs). Functions returning `Result` should use error returns instead of panics, consistent with other executors in the codebase. - Add `/// # Safety` doc comments to all 12 `unsafe extern "C" fn` in bigint guest externs (externs.rs), documenting pointer validity, buffer size, and alignment preconditions. Follows precedent set by PRs #2706 (keccak) and #2708 (sha2). - Document unused branch opcode variants. --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Issue found and initially corrected by [Daniel Heim](https://github.com/dghelm)
Resolves INT-7541. Issue found and initially corrected by [Daniel Heim](https://github.com/dghelm).
|
Claude finished @gaxiom's task in 1m 51s —— View job Code Review: PR #2772 —
|
| Consumer | Source branch |
|---|---|
| Workspace crates (13 references) | feat/halo2-gpu (0e0e298c) |
halo2-axiom (GPU fork) |
develop-v2 (f36fefe5) |
This means halo2-axiom from the GPU fork pins its own openvm-cuda-builder to the develop-v2 branch, while the rest of the workspace uses the feat/halo2-gpu branch. Two copies of the same crate at different commits can cause subtle type-mismatch or link issues at runtime. The halo2-gpu fork's Cargo.toml likely needs to be updated to also point at feat/halo2-gpu, or you should add a [patch] for openvm-cuda-builder as well to unify the resolution.
2. SSH source for halo2-axiom patch — CI compatibility (Medium)
[patch.crates-io]
halo2-axiom = { git = "ssh://git@github.com/axiom-crypto/halo2-gpu.git", branch = "feat/cuda-common-bn254" }This uses ssh://git@github.com which requires SSH key authentication. This will fail in any CI environment that doesn't have the appropriate deploy key configured (including standard GitHub Actions runners). Since the title says "DO NOT MERGE" this is likely intentional for local testing, but worth noting. If this needs to work in CI eventually, consider using HTTPS with a token or making the repo public.
3. Legacy / deprecated transitive dependencies (Low)
The GPU fork of halo2-axiom pulls in several older crate versions as new transitive deps:
cfg-if 0.1.10(current: 1.x)bitflags 1.3.2(current: 2.x)env_logger 0.7.1(current: 0.11.x)atty 0.2.14(unmaintained, replaced byis-terminal)humantime 1.3.0(current: 2.x)hermit-abi 0.1.19(current: 0.5.x)
These are not blocking for an experimental branch, but the halo2-gpu fork is carrying a fair amount of dependency debt. Worth tracking as tech debt for the upstream fork.
4. No [patch] for openvm-cuda-common (Low)
halo2-axiom depends on openvm-cuda-common (without a disambiguation suffix in the lockfile), meaning Cargo resolved it to the feat/halo2-gpu branch. This seems correct, but it's worth verifying that the openvm-cuda-common API is compatible between what halo2-axiom expects and what feat/halo2-gpu provides — an incompatibility here would be a compile error, not a silent bug.
5. Comment block is overly detailed for Cargo.toml (Nit)
# Point halo2-base / halo2-ecc at the GPU fork of halo2-axiom (halo2-gpu)
# so the CUDA prover path is exercised. Tracks the Phase 1 cuda-common
# migration branch; once that work merges back into halo2-gpu's main
# line, switch this to a stable branch or pinned rev. Tracking INT-6841
# / INT-7555.Per repo conventions (CLAUDE.md: "default to writing no comments"), this 5-line comment is heavy for a Cargo.toml. The ticket references (INT-6841 / INT-7555) are useful, but the rest is better suited for a commit message or PR description. Consider trimming to a single line like:
# Tracking INT-6841 / INT-7555: GPU fork of halo2-axiom for CUDA prover testingSummary
The main concern is issue #1 — the duplicate openvm-cuda-builder from two different branch commits is likely to cause problems when CUDA features are actually compiled and linked. Issues #2–5 are minor and acceptable for a "DO NOT MERGE" experimental branch.
This PR introduces the `ColumnsAir` trait and makes it required by all AIRs. Via this trait, AIRs can (optionally) provide a list of column names. At powdr, we use this to generate (somewhat) human-readable autoprecompiles for OpenVM ([example](https://github.com/powdr-labs/powdr/blob/main/openvm-riscv/tests/apc_snapshots/complex/rotate.txt)). Unfortunately, this touches all AIRs and therefore a lot of files, but the changes are small and now change to `stark-backend` are needed. This change represents most of our [diff to OpenVM](https://github.com/openvm-org/openvm/compare/v2.0.0-beta.2...powdr-labs:openvm:v2-powdr-beta.2-reorganized?expand=1). --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a [patch.crates-io] entry pointing halo2-axiom at /home/grigorii/dev/halo2/halo2-gpu/halo2_proofs so halo2-base / halo2-ecc pick up the local CUDA prover fork. Tracking INT-6841 / INT-7493. This branch is meant to be consumed as an openvm submodule ref inside v2-proof-system; the local path is intentional. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The local path "/home/grigorii/dev/halo2/halo2-gpu/halo2_proofs" does not exist on CI runners, causing cargo resolution to fail. halo2-gpu is a private repository in the axiom-crypto org, so HTTPS would require token auth; use ssh://git@github.com/... instead so the CI runner's existing SSH agent (loaded from GH_ACTIONS_DEPLOY_PRIVATE_KEY) is reused. Pinned to halo2-gpu rev 22d2b9f3 (tip of chore/refactor-to-cuda-struct as of 2026-04-22) for build reproducibility. Bump this rev when halo2-gpu advances. Local iteration: set CARGO_NET_GIT_FETCH_WITH_CLI=1 so cargo uses the git CLI (and the SSH agent) to fetch. Devs that want to point at a local checkout can add a secondary patch override in their ~/.cargo/config.toml without modifying this manifest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switches openvm's stark-backend and halo2-axiom deps onto the
published Phase 1 branches of INT-7555.
- openvm-{stark-backend,codec-derive,stark-sdk,cpu-backend,
cuda-backend,cuda-builder,cuda-common}: branch="develop-v2" →
"feat/halo2-gpu". Tracking a branch (not a pinned rev) lets
stark-backend commits on feat/halo2-gpu flow through via
`cargo update -p openvm-cuda-common` until this work merges
back to develop-v2.
- halo2-axiom: was pinned to a historical rev, then repointed at
a local path during Phase 1 for dev-loop testing; now tracks
halo2-gpu feat/cuda-common-bn254.
- Removes the temporary
[patch."https://github.com/openvm-org/stark-backend.git"] block
— redundant now that the direct deps point at feat/halo2-gpu.
test_sdk_fibonacci green (gas_cost=335626).
Bumps the [patch.crates-io] override for halo2-axiom from the feat/cuda-common-bn254 staging branch to a specific rev on halo2-gpu's main, now that the perf instrumentation (#40) and the GPU evaluate_h port (#41) have both landed. c879c054 is the squash-merge tip of `perf: move evaluate_h to GPU (#41)` — last commit on halo2-gpu main as of this bump. Pinning by `rev = ...` (instead of `branch = "main"`) keeps cargo's git resolver deterministic so CI doesn't see stale snapshots between `cargo update` runs. Comment block above the patch trimmed to drop the now-irrelevant INT-6841 / INT-7555 staging context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the GPU perf optimization landed on axiom-crypto/halo2-gpu main. Replaces the previous pin at c879c054. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After rebasing onto develop-v2.0.0-rc.1, the lock still pinned the seven openvm-* crates pulled in via the stark-backend git dep to 0e0e298c (the pre-rebase tip of stark-backend's feat/halo2-gpu). The stark-backend branch has since been rebased onto its own develop-v2 to absorb #331 / #334 GKR perf wins; the new tip is a71c091d. Refresh the lock so openvm-workspace builds (SDK tests, fibonacci bench, etc.) actually pick up those perf commits. The develop-v2 openvm-cuda-builder build-dep that halo2-gpu pulls in also moves forward 7803eebf -> 260062e9 as a side effect, which is fine. The same submodule pin will be bumped in v2-proof-system in a separate commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Claude finished @gaxiom's task in 1m 32s —— View job Code Review: PR #2772 —
|
| Source | Rev | Consumer |
|---|---|---|
branch=develop-v2 |
260062e9 |
halo2-axiom (GPU fork, line 4043) |
branch=feat/halo2-gpu |
a71c091d |
All workspace crates (13 references) |
The halo2-axiom fork at rev b691925a hardcodes its openvm-cuda-builder dep to branch = "develop-v2" in its own Cargo.toml, so the workspace [patch] doesn't unify it. This means two different builds of the same crate coexist. When CUDA features are enabled, this can cause type mismatches (e.g., a CudaKernel type from one copy isn't the same as from the other).
Fix options:
- Add
openvm-cuda-builderto the[patch.crates-io]section to force unification. Fix this → - Update the
halo2-gpufork'sCargo.tomlto point atfeat/halo2-gpuinstead ofdevelop-v2.
2. SSH source for halo2-axiom — CI incompatible (Medium)
halo2-axiom = { git = "ssh://git@github.com/axiom-crypto/halo2-gpu.git", rev = "b691925a..." }SSH auth will fail on standard GitHub Actions runners. Acceptable for a "DO NOT MERGE" local-testing branch, but this blocks any CI validation. If CI needs to run, switch to HTTPS with a PAT or make a deploy key available.
3. openvm-cuda-common resolution looks correct (Info)
openvm-cuda-common resolved to feat/halo2-gpu at a71c091d (line 6196), matching the workspace deps. The halo2-axiom fork lists openvm-cuda-common as a plain dependency (line 4044, no disambiguation suffix), and Cargo unified it with the workspace version. This is correct — no action needed.
4. Legacy transitive dependencies from the GPU fork (Low)
The GPU fork of halo2-axiom still pulls in outdated transitive deps:
cfg-if 0.1.10,bitflags 1.3.2,env_logger 0.7.1,atty 0.2.14(unmaintained),humantime 1.3.0,hermit-abi 0.1.19- New GPU deps:
cuda-driver-sys,rustacuda(0.1.3),yastl(viaflume)
Not blocking for an experimental branch, but the halo2-gpu fork carries notable dependency debt. The rustacuda crate (0.1.3) and its cuda-driver-sys dep are quite old — worth tracking for the upstream fork.
5. Comment trimmed but still verbose (Nit)
The comment was trimmed from 5 lines to 2 lines compared to the earlier iteration — good improvement. The current 2-line comment is reasonable for context, though per repo conventions a single line would suffice.
Summary
The main outstanding issue from the previous review — duplicate openvm-cuda-builder (#1) — remains. It's the primary risk when CUDA features are actually compiled. Everything else is acceptable for a "DO NOT MERGE" experimental branch. The pinned rev approach (replacing branch tracking) is a good improvement for reproducibility.
f08f0fe to
0110850
Compare
No description provided.