Skip to content

Perf & correctness fixes: RMSD warp-shuffle, TFD precision, eigensolver persistence, similarity cache, BFGS sign fix, fused Butina sync reduction#177

Merged
scal444 merged 5 commits into
NVIDIA-BioNeMo:mainfrom
mooreneural:nvMolKit
May 26, 2026
Merged

Conversation

@mooreneural
Copy link
Copy Markdown
Contributor

@mooreneural mooreneural commented May 19, 2026

Scientific correctness and GPU performance improvements across six subsystems. No public API changes.

Correctness fixes

  • tfd_kernels.cu Use double-precision sqrt for TFD pair-index decoding (float32 loses bits past ~1415 conformers); guard ring-torsion branch against numQ==0 division by zero.
  • bfgs_minimize.cu Convergence denominator kept as RDKit's signed-energy form to preserve reference behavior. Added TODO to file upstream RDKit bug and gate the fix on RDKit version once merged (matching the pattern already used in scaleGradKernel).
  • triangle_smooth.cu Replace cudaDeviceSynchronize() with cudaStreamSynchronize(data_.stream()) in copyToHost, scoping the sync to the correct stream.

Performance fixes

  • conformer_rmsd.cu Replace 17 sequential cub::BlockReduce + __syncthreads() calls in computePairRmsd with warp-shuffle reductions. Sync count drops 17→3 (Kabsch) or 17→1 (prealigned); shared memory drops ~1120→400 bytes per block.
  • symmetric_eigensolver.cu Move AsyncDeviceVector from a per-call heap allocation to a persistent states_ member of BatchedEigenSolver::Impl; reused across calls, grows only when batch size increases.
  • similarity_kernels.cu Add per-device int8_t cache for tensor-op capability; cudaGetDeviceProperties called at most once per device per process instead of on every similarity launch.
  • clustering.py (fused_butina) Reduce GPU↔CPU syncs from ~6 to 2 per iteration by batching neigh max+argmax into one tolist() call, batching cluster_count+is_free into one tolist() call, and maintaining a CPU mirror of the indices tensor.

Test plan

  • Conformer RMSD tests (prealigned and Kabsch paths)
  • TFD tests with >1415 conformers and ring torsions with 0 quartets
  • Fused Butina produces identical cluster assignments
  • BFGS converges correctly for negative-energy geometries
  • Similarity tests on sm_80+ and pre-Ampere hardware
  • Eigensolver convergence unchanged across multiple calls on same instance

…FGS, and fused Butina

Scientific / correctness:
- conformer_rmsd.cu: no change to algorithm, only implementation (see below)
- tfd_kernels.cu: use double-precision sqrt for pair-index decoding (float32
  loses bits past ~1415 conformers); guard ring-torsion path against numQ==0
  division by zero
- bfgs_minimize.cu: use |energy| in gradient convergence denominator so
  negative-energy geometries mid-minimisation don't tighten the threshold
  asymmetrically
- triangle_smooth.cu: replace cudaDeviceSynchronize() with stream-scoped
  cudaStreamSynchronize in copyToHost, avoiding a device-wide stall

Performance:
- conformer_rmsd.cu: replace 17 sequential cub::BlockReduce + __syncthreads()
  calls in computePairRmsd with warp-shuffle reductions (warpSumDouble) and a
  single 400-byte shared buffer; reduces sync count to 3 for the alignment path
  and 1 for the prealigned path, and cuts shared memory from ~1120 to 400 bytes
- symmetric_eigensolver.cu: move AsyncDeviceVector<curandState> from a
  stack-local allocation inside launchBatchEigensolverKernel (re-allocated on
  every call) to a persistent states_ member of BatchedEigenSolver::Impl,
  resizing only when the batch grows
- similarity_kernels.cu: add a per-device int8_t cache (g_tensorOpsCache[16])
  for isTensorOpsSupportedCached(); cudaGetDeviceProperties is now called at
  most once per device per process instead of on every similarity launch
- clustering.py (fused_butina): reduce GPU↔CPU synchronisation from ~6 D2H
  syncs per clustering iteration to 2 by (a) batching neigh.max() +
  flip.argmax() into one torch.stack().tolist() call, (b) batching
  cluster_count + is_free into one torch.cat().tolist() call, and (c)
  maintaining a CPU mirror of the indices tensor to avoid a per-centroid sync
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

Six targeted performance and correctness improvements across independent subsystems with no public API changes. Each change is narrow in scope and addresses a documented flaw.

  • clustering.py: Reduces GPU↔CPU round-trips from ~6 to 2 per Butina iteration by batching neigh max+argmax into one tolist() and cluster_count+is_free into a second; a CPU mirror of the indices tensor eliminates per-centroid D2H syncs.
  • conformer_rmsd.cu: Replaces 17 sequential cub::BlockReduce + __syncthreads() calls with a two-phase warp-shuffle pattern, reducing sync count to 3 (Kabsch) or 1 (prealigned) and shared-memory footprint from ~1120 to 400 bytes per block.
  • similarity_kernels.cu: Addresses previously flagged data-race and missing error-check issues by using std::atomic<int8_t> and cudaCheckError in the new isTensorOpsSupportedCached() helper.

Confidence Score: 5/5

All six changes are well-scoped correctness or performance fixes with no regressions introduced.

The warp-shuffle refactor in conformer_rmsd.cu uses the correct compile-time block/warp constants and correct sync points. The clustering.py CPU-mirror approach keeps indices_host and the GPU indices tensor in sync through identical boolean masks. The similarity cache uses std::atomic and cudaCheckError, resolving previous concerns. The eigensolver states buffer is always re-initialized by curand_init inside the kernel, so persistence only removes malloc overhead without altering output. The tfd precision fix and numQ guard, the stream-scoped sync in triangle_smooth.cu, and the BFGS comment are all unambiguously correct.

No files require special attention; clustering.py has the most logic-dense change and is the best candidate for an integration test covering the batched tolist() path.

Important Files Changed

Filename Overview
nvmolkit/clustering.py Batches two GPU→CPU syncs per iteration to two, introduces a CPU mirror of the indices tensor; logic appears correct but the combined-tolist approach deserves careful testing on edge cases (n_start=0, single-molecule clusters).
src/conformer_rmsd.cu Replaces 17 BlockReduce+sync calls with a 3-sync (Kabsch) / 1-sync (prealigned) warp-shuffle pattern; shared-memory layout, sync points, and warp-lane indexing all appear correct for the fixed 128-thread block launch.
src/minimizer/bfgs_minimize.cu Comment-only change documenting the intentional signed-energy denominator and a TODO to gate a future fix on an upstream RDKit version; no functional diff.
src/similarity_kernels.cu Introduces an atomic-guarded per-device capability cache that replaces four identical cudaGetDeviceProperties call sites; addresses both the data-race (now std::atomic<int8_t>) and the missing CUDA error checks flagged in previous review rounds.
src/symmetric_eigensolver.cu Promotes the curandState buffer from a per-call heap allocation to a persistent grows-only member; kernel always calls curand_init each launch so correctness is unchanged; allocation cost is eliminated for equal-or-smaller subsequent batches.
src/tfd/tfd_kernels.cu Switches lower-triangle pair-index decode from float32 to double sqrt (fixes precision past ~1415 conformers) and guards the ring-torsion path against numQ==0 division by zero; both fixes are correct.
src/triangle_smooth.cu Replaces cudaDeviceSynchronize() with cudaStreamSynchronize(data_.stream()) in copyToHost, correctly scoping the sync to the stream used by the D2H copy.

Reviews (5): Last reviewed commit: "Fix ruff format: collapse torch.stack ca..." | Re-trigger Greptile

Comment thread src/similarity_kernels.cu Outdated
Comment thread src/similarity_kernels.cu
mooreneural and others added 2 commits May 18, 2026 19:53
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
similarity_kernels.cu: add blank line between constexpr int kMaxDevices and
int8_t g_tensorOpsCache declarations to break AlignConsecutiveDeclarations
grouping — mixed-type consecutive declarations produced formatting that
differed from what clang-format expected.

clustering.py: collapse torch.stack([...]) call to Black-style multiline form
without magic trailing comma; previous multiline-with-trailing-comma form
triggered Black's trailing-comma expansion rule, failing ruff format.
@scal444
Copy link
Copy Markdown
Collaborator

scal444 commented May 22, 2026

Thanks for the contribution!

The TFD, triangle smooth, eigensolver, and similarity fixes are good to go except for the Greptile comment. I measured a modest speedup in our similarity benchmarks.

The Butina change is good, and showed major (nearly 2x) speedups in most of our benchmarks. I identified one regression which when fixed adds another 5-10% speedup, and made a few suggestions.

The Conformer RMSD change also works, with more modest (~10%) speedups. I think it would be worthwhile to experiment with a fused reduction, creating a combined datatype with all the concurrent reductions and doing a block reduce on that, but this is a real improvement so can go in without that experiment.

Comment thread src/minimizer/bfgs_minimize.cu Outdated
Comment thread nvmolkit/clustering.py Outdated
Comment thread nvmolkit/clustering.py
Comment thread src/conformer_rmsd.cu Outdated
Comment thread src/similarity_kernels.cu
- similarity_kernels.cu: wrap cudaGetDevice and cudaGetDeviceProperties
  with cudaCheckError in isTensorOpsSupportedCached so errors in the cached
  path surface immediately rather than silently producing garbage values

- conformer_rmsd.cu: replace hand-rolled __shfl_down_sync warpSumDouble
  with cub::WarpReduce<double> for better abstraction and CUB idiom
  consistency; behaviour and performance are identical for full 32-thread
  warps since CUB uses shuffles internally

- bfgs_minimize.cu: revert |energy| convergence denominator to match
  RDKit's current signed-energy behaviour; add TODO to file upstream
  bug and gate fix on RDKit version when merged

- clustering.py: eliminate is_free_tensor H2D roundtrip in fused_butina;
  is_free is already updated in-place on GPU by extract_cluster_and_singletons
  so we index x/indices/neigh directly instead of round-tripping through CPU
Comment thread src/minimizer/bfgs_minimize.cu
@mooreneural mooreneural requested a review from scal444 May 25, 2026 15:42
@scal444
Copy link
Copy Markdown
Collaborator

scal444 commented May 26, 2026

This looks great, just needs one more ruff format.

@scal444 scal444 merged commit ea09e2d into NVIDIA-BioNeMo:main May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants