Skip to content

feat: halo2 gpu test (DO NOT MERGE)#2772

Draft
gaxiom wants to merge 557 commits into
develop-v2.0.0-rc.1from
feat/halo2-gpu
Draft

feat: halo2 gpu test (DO NOT MERGE)#2772
gaxiom wants to merge 557 commits into
develop-v2.0.0-rc.1from
feat/halo2-gpu

Conversation

@gaxiom
Copy link
Copy Markdown
Contributor

@gaxiom gaxiom commented May 4, 2026

No description provided.

stephenh-axiom-xyz and others added 30 commits March 5, 2026 14:22
## Summary

- Renames the constant `PUBLIC_VALUES_AIR_ID` to `MEMORY_AIRS_START_IDX`
across 8 files
- Removes old unused `BOUNDARY_AIR_ID`
- Updates the doc comment to reflect the constant's current purpose as
the boundary AIR index
- The `PublicValuesChip` was removed in #2480, making the old name
misleading

Follows up on review comment in #2480.

Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com>
Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Co-authored-by: Ayush Shukla <shuklaayush@users.noreply.github.com>
Co-authored-by: Ayush Shukla <ayush@axiom.xyz>
Caused a perf regression in CUDA due to the larger memory merkle tree.
(Previously the sdk has an optimization to set it to 0, which was
removed in the DEFERRAL_AS renaming PR.)
)

closes INT-6114

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This makes usage more rigid and predictable: every counter must always
start at zero.
This resolves:
- In Eq3bAir (INT-6599):
  - INT-6446: added the transition constraint
- INT-6447: replaced the old check with a more general one which works
within a proof as opposed to within an AIR
  - INT-6448: added explicit boundary conditions
- In EqSharpUniAir (INT-6600):
  - INT-6444: added the transition equality check
  - INT-6445: added the boundary condition
- In EqUniAir (INT-6601):
  - INT-6434: added the boundary condition
- In EqNsAir (INT-6602):
- INT-6376: introduced two new columns `n_logup` and `n_max`,
constrained them to be constants within a proof, added constraints to
make `n_less_than_n_logup` and `n_less_than_n_max` actual indicators
that `n` is less than them
- In EqNegAir (INT-6603):
- INT-6371: added the boundary and transition checks for `one_half_pow`
- INT-6372: added the boundary and transition checks for
`prod_1_r_omega`

(resolves INT-6599 INT-6446 INT-6447 INT-6448 INT-6600 INT-6444 INT-6445
INT-6601 INT-6434 INT-6602 INT-6376 INT-6603 INT-6371 INT-6372)
This resolves INT-6401 ergo INT-6595: FractionFolderAir now enforces
`cur_q_sum` to be a running sum and also doesn't care about the value of
`alpha`, instead GkrInputAir receives a denominator not including
`alpha`, which saves us the trouble of storing `alpha` elsewhere.
- reject is_present = 1 on dummy cached rows by requiring valid cached
flags
- move NodeKind into symbolic_expression/air.rs so the canonical
discriminants live with the audited AIR logic
- remove the unused claim_bus and eq_neg_internal_bus wiring
This resolves:
- INT-6578: added the boundary and transition checks
- INT-6580: simplified the tracegen, added the zero check
- INT-6581: enforced that `eq_3b` is constant within the proof
- INT-6582: added the implication check
- INT-6583: just zeroed the denominator because it doesn't need to
include `alpha` at that point
- INT-6584: added the correct check
- INT-6585: removed `node_idx` from the columns, because the relation
with `node_idx` is enforced on the side of SymbolicExpressionAir (and
also removed it from the tracegen and replaced the corresponding `Vec`
in the gpu blob with just a counter)

This ticket doesn't resolve the remaining two issues from the air
because it would be better to do that on top of
#2491.

(resolves INT-6578 INT-6580 INT-6581 INT-6582 INT-6583 INT-6584
INT-6585)
This resolves:
- INT-6572: added boundary checks
- INT-6573: added a couple of new buses and propagated `tidx` to cover
`lambda_tidx`
- INT-6574: added the zero check for the extreme case
- INT-6575: added the boundary conditions
- INT-6576: made an already existing constraint stricter

(resolves INT-6592 INT-6572 INT-6573 INT-6574 INT-6575 INT-6576)
This resolves the rest of INT-6591:
- INT-6577: received `AirId` from ProofShapeAir
- INT-6579: received `beta_tidx` from GkrInputAir by analogy with
`lambda_tidx` in ConstraintsFoldingAir

Since this reuses some logic from
#2491, this is on top of that
commit.

(resolves INT-6591 INT-6577 INT-6579)
…2520)

- properly escape underscores in readme's latex
- add example traces at end
- update cuda image family to g6.*
fix: INT-6590

fix: INT-6396, INT-6397, INT-6398, INT-6399, INT-6402, INT-6439
876pol and others added 14 commits April 24, 2026 19:54
## Summary

Refactors the SHA-2 block hasher CUDA tracegen kernel from a monolithic
1-thread-per-block design into a three-phase pipeline for better memory
coalescing on column-major traces.

| Phase | Kernel | Parallelism | Purpose |

|-------|--------------------------|-----------------|----------------------------------------------------------------|
| 1 | `sha2_first_pass_phase1` | 1 thread/block | Compute SHA-2 rounds,
snapshot `{a..h, w_buf}` to scratch |
| 2 | `sha2_first_pass_phase2` | 1 thread/row | Read snapshot, write
trace columns (coalesced) |
| 3 | `sha2_first_pass_phase3` | 1 thread/block | Cross-row dependencies
(`intermed_4/8/12`, carry) |

Also drops unused parameters (`ptr_max_bits`,
`range_checker_ptr`/`num_bins`, `timestamp_max_bits`) from the block
hasher kernel and its Rust-side wrappers, simplifying
`Sha2BlockHasherChipGpu`.

## Changes

- `extensions/sha2/circuit/cuda/src/sha2_hasher.cu`: split first-pass
tracegen into three phases; tightened `CHECK_KERNEL` error propagation
in `launch_sha2_second_pass_dependencies`.
- `extensions/sha2/circuit/src/cuda/{mod.rs,cuda_abi.rs}`: wire the new
scratch buffer and drop unused wrapper arguments.
- `extensions/sha2/circuit/src/extension/cuda.rs`: update
`Sha2BlockHasherChipGpu::new` call sites.
- `extensions/sha2/circuit/src/sha2_chips/tests.rs`: update tests to
match the new constructor signature.

## Correctness notes

- Phase 1 snapshots state before each row's rounds; Phase 2 restores
that snapshot and replays the rounds to compute intermediate trace
values. The round computations are identical across phases.
- Scratch is sized as `num_blocks * ROWS_PER_BLOCK * (8 + BLOCK_WORDS)`
words, exactly matching what's written/read (no oversize).
- Phase 2 launches `num_blocks * ROWS_PER_BLOCK` threads, so every
thread maps to a valid block.
- Padding rows are still handled by the existing
`sha2_fill_first_dummy_row` + `sha2_fill_invalid_rows` kernels.
- Digest row: Phase 1 snapshots the final state (after all rounds) for
`row_in_block == ROUND_ROWS`; Phase 2 reads it to compute `final_hash =
prev_hash + work_vars`.

## Test plan

- [ ] `cargo nextest run --cargo-profile=fast -p openvm-sha2-circuit`
- [ ] CUDA tracegen tests for both SHA-256 and SHA-512 variants (feature
`cuda`)
- [ ] Verify trace matrices match CPU reference for representative
inputs

resolves int-6985
Towards INT-7429.

 ## Summary

- Mark `finalize_ptr`, `finalize`, and `native_keccak256` as `unsafe fn`
in both `zkvm_impl.rs` and `host_impl.rs`. These functions write 32
bytes to an output pointer/slice via `copy_nonoverlapping` without
runtime size checks (only `debug_assert!`), so callers must guarantee
the buffer is at least 32 bytes.
- Add `assert!` in the `tiny_keccak::Hasher` trait impl in `lib.rs`
since the trait method must be safe and cannot propagate `unsafe` to
callers.
- Safe public APIs (`keccak256()`, `set_keccak256()`) are unchanged —
they pass correctly-sized buffers internally.

**TODO:** The `openvm-org/tiny-keccak` fork and
`openvm-org/openvm-primitives` also call `Keccak256::finalize` /
`native_keccak256` and will need matching updates (add `unsafe {}`
blocks and/or `assert!` at their trait boundaries).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com>
…ebra+ecc + update sha2 docs (#2710)

Resolves INT-7439.
Towards INT-7429.

## Summary

- Fix OOB read in `native_xorin` when `len` is not 4-byte aligned:
`AlignedBuf::new` copied `adjusted_len` bytes from source buffers,
reading 1-3 bytes past the valid region. Replace with
`AlignedBuf::uninit` + `copy_nonoverlapping` of only `len` bytes.
- Mark `native_xorin` and `native_keccakf` as `unsafe extern "C" fn`
since they take raw pointers and require callers to guarantee pointer
validity and buffer sizes. Add `unsafe` blocks in callers `xorin` and
`keccakf` with safety comments.
- Update keccak extension docs code example to reflect the new unsafe
signatures.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Towards INT-7478.

- Replace `assert_eq!` with
`Err(StaticProgramError::InvalidInstruction)` in multiplication executor
`pre_compute_impl` for both bigint (mult.rs) and rv32im
(mul/execution.rs). Functions returning `Result` should use error
returns instead of panics, consistent with other executors in the
codebase.
- Add `/// # Safety` doc comments to all 12 `unsafe extern "C" fn` in
bigint guest externs (externs.rs), documenting pointer validity, buffer
size, and alignment preconditions. Follows precedent set by PRs #2706
(keccak) and #2708 (sha2).
- Document unused branch opcode variants.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves INT-7541. Issue found and initially corrected by [Daniel
Heim](https://github.com/dghelm).
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

Claude finished @gaxiom's task in 1m 51s —— View job


Code Review: PR #2772feat: halo2 gpu test (DO NOT MERGE)

  • Gather context and read the diff
  • Review Cargo.toml changes
  • Review Cargo.lock changes
  • Post review feedback

This PR makes two dependency-level changes to support testing the halo2 GPU (CUDA) prover path:

  1. Switches all stark-backend workspace deps from branch = "develop-v2" to branch = "feat/halo2-gpu" (7 crates).
  2. Adds a [patch.crates-io] for halo2-axiom, redirecting it from crates.io to the feat/cuda-common-bn254 branch of the private axiom-crypto/halo2-gpu repo via SSH.

Issues Found

1. Duplicate openvm-cuda-builder — branch mismatch (Medium)

The Cargo.lock contains two copies of openvm-cuda-builder 2.0.0-alpha:

Consumer Source branch
Workspace crates (13 references) feat/halo2-gpu (0e0e298c)
halo2-axiom (GPU fork) develop-v2 (f36fefe5)

This means halo2-axiom from the GPU fork pins its own openvm-cuda-builder to the develop-v2 branch, while the rest of the workspace uses the feat/halo2-gpu branch. Two copies of the same crate at different commits can cause subtle type-mismatch or link issues at runtime. The halo2-gpu fork's Cargo.toml likely needs to be updated to also point at feat/halo2-gpu, or you should add a [patch] for openvm-cuda-builder as well to unify the resolution.

2. SSH source for halo2-axiom patch — CI compatibility (Medium)

[patch.crates-io]
halo2-axiom = { git = "ssh://git@github.com/axiom-crypto/halo2-gpu.git", branch = "feat/cuda-common-bn254" }

This uses ssh://git@github.com which requires SSH key authentication. This will fail in any CI environment that doesn't have the appropriate deploy key configured (including standard GitHub Actions runners). Since the title says "DO NOT MERGE" this is likely intentional for local testing, but worth noting. If this needs to work in CI eventually, consider using HTTPS with a token or making the repo public.

3. Legacy / deprecated transitive dependencies (Low)

The GPU fork of halo2-axiom pulls in several older crate versions as new transitive deps:

  • cfg-if 0.1.10 (current: 1.x)
  • bitflags 1.3.2 (current: 2.x)
  • env_logger 0.7.1 (current: 0.11.x)
  • atty 0.2.14 (unmaintained, replaced by is-terminal)
  • humantime 1.3.0 (current: 2.x)
  • hermit-abi 0.1.19 (current: 0.5.x)

These are not blocking for an experimental branch, but the halo2-gpu fork is carrying a fair amount of dependency debt. Worth tracking as tech debt for the upstream fork.

4. No [patch] for openvm-cuda-common (Low)

halo2-axiom depends on openvm-cuda-common (without a disambiguation suffix in the lockfile), meaning Cargo resolved it to the feat/halo2-gpu branch. This seems correct, but it's worth verifying that the openvm-cuda-common API is compatible between what halo2-axiom expects and what feat/halo2-gpu provides — an incompatibility here would be a compile error, not a silent bug.

5. Comment block is overly detailed for Cargo.toml (Nit)

# Point halo2-base / halo2-ecc at the GPU fork of halo2-axiom (halo2-gpu)
# so the CUDA prover path is exercised. Tracks the Phase 1 cuda-common
# migration branch; once that work merges back into halo2-gpu's main
# line, switch this to a stable branch or pinned rev. Tracking INT-6841
# / INT-7555.

Per repo conventions (CLAUDE.md: "default to writing no comments"), this 5-line comment is heavy for a Cargo.toml. The ticket references (INT-6841 / INT-7555) are useful, but the rest is better suited for a commit message or PR description. Consider trimming to a single line like:

# Tracking INT-6841 / INT-7555: GPU fork of halo2-axiom for CUDA prover testing

Summary

The main concern is issue #1 — the duplicate openvm-cuda-builder from two different branch commits is likely to cause problems when CUDA features are actually compiled and linked. Issues #2–5 are minor and acceptable for a "DO NOT MERGE" experimental branch.

stephenh-axiom-xyz and others added 11 commits May 4, 2026 16:16
)

Co-authored-by: Ayush Shukla <ayush@axiom.xyz>
This PR introduces the `ColumnsAir` trait and makes it required by all
AIRs. Via this trait, AIRs can (optionally) provide a list of column
names. At powdr, we use this to generate (somewhat) human-readable
autoprecompiles for OpenVM
([example](https://github.com/powdr-labs/powdr/blob/main/openvm-riscv/tests/apc_snapshots/complex/rotate.txt)).

Unfortunately, this touches all AIRs and therefore a lot of files, but
the changes are small and now change to `stark-backend` are needed.

This change represents most of our [diff to
OpenVM](https://github.com/openvm-org/openvm/compare/v2.0.0-beta.2...powdr-labs:openvm:v2-powdr-beta.2-reorganized?expand=1).

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a [patch.crates-io] entry pointing halo2-axiom at
/home/grigorii/dev/halo2/halo2-gpu/halo2_proofs so halo2-base / halo2-ecc
pick up the local CUDA prover fork. Tracking INT-6841 / INT-7493.

This branch is meant to be consumed as an openvm submodule ref inside
v2-proof-system; the local path is intentional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The local path "/home/grigorii/dev/halo2/halo2-gpu/halo2_proofs" does not
exist on CI runners, causing cargo resolution to fail. halo2-gpu is a
private repository in the axiom-crypto org, so HTTPS would require token
auth; use ssh://git@github.com/... instead so the CI runner's existing
SSH agent (loaded from GH_ACTIONS_DEPLOY_PRIVATE_KEY) is reused.

Pinned to halo2-gpu rev 22d2b9f3 (tip of chore/refactor-to-cuda-struct
as of 2026-04-22) for build reproducibility. Bump this rev when
halo2-gpu advances.

Local iteration: set CARGO_NET_GIT_FETCH_WITH_CLI=1 so cargo uses the
git CLI (and the SSH agent) to fetch. Devs that want to point at a
local checkout can add a secondary patch override in their
~/.cargo/config.toml without modifying this manifest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switches openvm's stark-backend and halo2-axiom deps onto the
published Phase 1 branches of INT-7555.

- openvm-{stark-backend,codec-derive,stark-sdk,cpu-backend,
  cuda-backend,cuda-builder,cuda-common}: branch="develop-v2" →
  "feat/halo2-gpu". Tracking a branch (not a pinned rev) lets
  stark-backend commits on feat/halo2-gpu flow through via
  `cargo update -p openvm-cuda-common` until this work merges
  back to develop-v2.
- halo2-axiom: was pinned to a historical rev, then repointed at
  a local path during Phase 1 for dev-loop testing; now tracks
  halo2-gpu feat/cuda-common-bn254.
- Removes the temporary
  [patch."https://github.com/openvm-org/stark-backend.git"] block
  — redundant now that the direct deps point at feat/halo2-gpu.

test_sdk_fibonacci green (gas_cost=335626).
Bumps the [patch.crates-io] override for halo2-axiom from the
feat/cuda-common-bn254 staging branch to a specific rev on
halo2-gpu's main, now that the perf instrumentation (#40) and the
GPU evaluate_h port (#41) have both landed.

c879c054 is the squash-merge tip of `perf: move evaluate_h to GPU
(#41)` — last commit on halo2-gpu main as of this bump.

Pinning by `rev = ...` (instead of `branch = "main"`) keeps cargo's
git resolver deterministic so CI doesn't see stale snapshots between
`cargo update` runs.

Comment block above the patch trimmed to drop the now-irrelevant
INT-6841 / INT-7555 staging context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the GPU perf optimization landed on axiom-crypto/halo2-gpu
main. Replaces the previous pin at c879c054.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After rebasing onto develop-v2.0.0-rc.1, the lock still pinned the
seven openvm-* crates pulled in via the stark-backend git dep to
0e0e298c (the pre-rebase tip of stark-backend's feat/halo2-gpu). The
stark-backend branch has since been rebased onto its own develop-v2
to absorb #331 / #334 GKR perf wins; the new tip is a71c091d.

Refresh the lock so openvm-workspace builds (SDK tests, fibonacci
bench, etc.) actually pick up those perf commits. The develop-v2
openvm-cuda-builder build-dep that halo2-gpu pulls in also moves
forward 7803eebf -> 260062e9 as a side effect, which is fine.

The same submodule pin will be bumped in v2-proof-system in a
separate commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 15, 2026

Claude finished @gaxiom's task in 1m 32s —— View job


Code Review: PR #2772feat: halo2 gpu test (DO NOT MERGE)

  • Gather context and read the diff
  • Review Cargo.toml changes
  • Review Cargo.lock changes
  • Post review feedback

This PR (6 commits) makes two dependency-level changes to support testing the halo2 GPU (CUDA) prover path:

  1. Switches all 7 stark-backend workspace deps from branch = "develop-v2" to branch = "feat/halo2-gpu" (now at a71c091d).
  2. Adds a [patch.crates-io] for halo2-axiom, redirecting it to a pinned rev (b691925a) of the private axiom-crypto/halo2-gpu repo via SSH.

Since the last review (May 4), the branch has progressed through 4 more commits — switching from branch tracking to pinned revs and bumping the stark-backend lock.


Issues

1. Duplicate openvm-cuda-builder — still present (Medium)

The lockfile still contains two copies of openvm-cuda-builder 2.0.0-alpha:

Source Rev Consumer
branch=develop-v2 260062e9 halo2-axiom (GPU fork, line 4043)
branch=feat/halo2-gpu a71c091d All workspace crates (13 references)

The halo2-axiom fork at rev b691925a hardcodes its openvm-cuda-builder dep to branch = "develop-v2" in its own Cargo.toml, so the workspace [patch] doesn't unify it. This means two different builds of the same crate coexist. When CUDA features are enabled, this can cause type mismatches (e.g., a CudaKernel type from one copy isn't the same as from the other).

Fix options:

  • Add openvm-cuda-builder to the [patch.crates-io] section to force unification. Fix this →
  • Update the halo2-gpu fork's Cargo.toml to point at feat/halo2-gpu instead of develop-v2.

2. SSH source for halo2-axiom — CI incompatible (Medium)

halo2-axiom = { git = "ssh://git@github.com/axiom-crypto/halo2-gpu.git", rev = "b691925a..." }

SSH auth will fail on standard GitHub Actions runners. Acceptable for a "DO NOT MERGE" local-testing branch, but this blocks any CI validation. If CI needs to run, switch to HTTPS with a PAT or make a deploy key available.

3. openvm-cuda-common resolution looks correct (Info)

openvm-cuda-common resolved to feat/halo2-gpu at a71c091d (line 6196), matching the workspace deps. The halo2-axiom fork lists openvm-cuda-common as a plain dependency (line 4044, no disambiguation suffix), and Cargo unified it with the workspace version. This is correct — no action needed.

4. Legacy transitive dependencies from the GPU fork (Low)

The GPU fork of halo2-axiom still pulls in outdated transitive deps:

  • cfg-if 0.1.10, bitflags 1.3.2, env_logger 0.7.1, atty 0.2.14 (unmaintained), humantime 1.3.0, hermit-abi 0.1.19
  • New GPU deps: cuda-driver-sys, rustacuda (0.1.3), yastl (via flume)

Not blocking for an experimental branch, but the halo2-gpu fork carries notable dependency debt. The rustacuda crate (0.1.3) and its cuda-driver-sys dep are quite old — worth tracking for the upstream fork.

5. Comment trimmed but still verbose (Nit)

The comment was trimmed from 5 lines to 2 lines compared to the earlier iteration — good improvement. The current 2-line comment is reasonable for context, though per repo conventions a single line would suffice.


Summary

The main outstanding issue from the previous review — duplicate openvm-cuda-builder (#1) — remains. It's the primary risk when CUDA features are actually compiled. Everything else is acceptable for a "DO NOT MERGE" experimental branch. The pinned rev approach (replacing branch tracking) is a good improvement for reproducibility.

@jpw-axiom jpw-axiom force-pushed the develop-v2.0.0-rc.1 branch 2 times, most recently from f08f0fe to 0110850 Compare May 18, 2026 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.