Perf & correctness fixes: RMSD warp-shuffle, TFD precision, eigensolver persistence, similarity cache, BFGS sign fix, fused Butina sync reduction by mooreneural · Pull Request #177 · NVIDIA-BioNeMo/nvMolKit

mooreneural · 2026-05-19T00:45:06Z

Scientific correctness and GPU performance improvements across six subsystems. No public API changes.

Correctness fixes

tfd_kernels.cu Use double-precision sqrt for TFD pair-index decoding (float32 loses bits past ~1415 conformers); guard ring-torsion branch against numQ==0 division by zero.
bfgs_minimize.cu Convergence denominator kept as RDKit's signed-energy form to preserve reference behavior. Added TODO to file upstream RDKit bug and gate the fix on RDKit version once merged (matching the pattern already used in scaleGradKernel).
triangle_smooth.cu Replace cudaDeviceSynchronize() with cudaStreamSynchronize(data_.stream()) in copyToHost, scoping the sync to the correct stream.

Performance fixes

conformer_rmsd.cu Replace 17 sequential cub::BlockReduce + __syncthreads() calls in computePairRmsd with warp-shuffle reductions. Sync count drops 17→3 (Kabsch) or 17→1 (prealigned); shared memory drops ~1120→400 bytes per block.
symmetric_eigensolver.cu Move AsyncDeviceVector from a per-call heap allocation to a persistent states_ member of BatchedEigenSolver::Impl; reused across calls, grows only when batch size increases.
similarity_kernels.cu Add per-device int8_t cache for tensor-op capability; cudaGetDeviceProperties called at most once per device per process instead of on every similarity launch.
clustering.py (fused_butina) Reduce GPU↔CPU syncs from ~6 to 2 per iteration by batching neigh max+argmax into one tolist() call, batching cluster_count+is_free into one tolist() call, and maintaining a CPU mirror of the indices tensor.

Test plan

Conformer RMSD tests (prealigned and Kabsch paths)
TFD tests with >1415 conformers and ring torsions with 0 quartets
Fused Butina produces identical cluster assignments
BFGS converges correctly for negative-energy geometries
Similarity tests on sm_80+ and pre-Ampere hardware
Eigensolver convergence unchanged across multiple calls on same instance

…FGS, and fused Butina Scientific / correctness: - conformer_rmsd.cu: no change to algorithm, only implementation (see below) - tfd_kernels.cu: use double-precision sqrt for pair-index decoding (float32 loses bits past ~1415 conformers); guard ring-torsion path against numQ==0 division by zero - bfgs_minimize.cu: use |energy| in gradient convergence denominator so negative-energy geometries mid-minimisation don't tighten the threshold asymmetrically - triangle_smooth.cu: replace cudaDeviceSynchronize() with stream-scoped cudaStreamSynchronize in copyToHost, avoiding a device-wide stall Performance: - conformer_rmsd.cu: replace 17 sequential cub::BlockReduce + __syncthreads() calls in computePairRmsd with warp-shuffle reductions (warpSumDouble) and a single 400-byte shared buffer; reduces sync count to 3 for the alignment path and 1 for the prealigned path, and cuts shared memory from ~1120 to 400 bytes - symmetric_eigensolver.cu: move AsyncDeviceVector<curandState> from a stack-local allocation inside launchBatchEigensolverKernel (re-allocated on every call) to a persistent states_ member of BatchedEigenSolver::Impl, resizing only when the batch grows - similarity_kernels.cu: add a per-device int8_t cache (g_tensorOpsCache[16]) for isTensorOpsSupportedCached(); cudaGetDeviceProperties is now called at most once per device per process instead of on every similarity launch - clustering.py (fused_butina): reduce GPU↔CPU synchronisation from ~6 D2H syncs per clustering iteration to 2 by (a) batching neigh.max() + flip.argmax() into one torch.stack().tolist() call, (b) batching cluster_count + is_free into one torch.cat().tolist() call, and (c) maintaining a CPU mirror of the indices tensor to avoid a per-centroid sync

greptile-apps · 2026-05-19T00:50:43Z

Greptile Summary

Six targeted performance and correctness improvements across independent subsystems with no public API changes. Each change is narrow in scope and addresses a documented flaw.

clustering.py: Reduces GPU↔CPU round-trips from ~6 to 2 per Butina iteration by batching neigh max+argmax into one tolist() and cluster_count+is_free into a second; a CPU mirror of the indices tensor eliminates per-centroid D2H syncs.
conformer_rmsd.cu: Replaces 17 sequential cub::BlockReduce + __syncthreads() calls with a two-phase warp-shuffle pattern, reducing sync count to 3 (Kabsch) or 1 (prealigned) and shared-memory footprint from ~1120 to 400 bytes per block.
similarity_kernels.cu: Addresses previously flagged data-race and missing error-check issues by using std::atomic<int8_t> and cudaCheckError in the new isTensorOpsSupportedCached() helper.

Confidence Score: 5/5

All six changes are well-scoped correctness or performance fixes with no regressions introduced.

The warp-shuffle refactor in conformer_rmsd.cu uses the correct compile-time block/warp constants and correct sync points. The clustering.py CPU-mirror approach keeps indices_host and the GPU indices tensor in sync through identical boolean masks. The similarity cache uses std::atomic and cudaCheckError, resolving previous concerns. The eigensolver states buffer is always re-initialized by curand_init inside the kernel, so persistence only removes malloc overhead without altering output. The tfd precision fix and numQ guard, the stream-scoped sync in triangle_smooth.cu, and the BFGS comment are all unambiguously correct.

No files require special attention; clustering.py has the most logic-dense change and is the best candidate for an integration test covering the batched tolist() path.

Important Files Changed

Filename	Overview
nvmolkit/clustering.py	Batches two GPU→CPU syncs per iteration to two, introduces a CPU mirror of the indices tensor; logic appears correct but the combined-tolist approach deserves careful testing on edge cases (n_start=0, single-molecule clusters).
src/conformer_rmsd.cu	Replaces 17 BlockReduce+sync calls with a 3-sync (Kabsch) / 1-sync (prealigned) warp-shuffle pattern; shared-memory layout, sync points, and warp-lane indexing all appear correct for the fixed 128-thread block launch.
src/minimizer/bfgs_minimize.cu	Comment-only change documenting the intentional signed-energy denominator and a TODO to gate a future fix on an upstream RDKit version; no functional diff.
src/similarity_kernels.cu	Introduces an atomic-guarded per-device capability cache that replaces four identical cudaGetDeviceProperties call sites; addresses both the data-race (now std::atomic<int8_t>) and the missing CUDA error checks flagged in previous review rounds.
src/symmetric_eigensolver.cu	Promotes the curandState buffer from a per-call heap allocation to a persistent grows-only member; kernel always calls curand_init each launch so correctness is unchanged; allocation cost is eliminated for equal-or-smaller subsequent batches.
src/tfd/tfd_kernels.cu	Switches lower-triangle pair-index decode from float32 to double sqrt (fixes precision past ~1415 conformers) and guards the ring-torsion path against numQ==0 division by zero; both fixes are correct.
src/triangle_smooth.cu	Replaces cudaDeviceSynchronize() with cudaStreamSynchronize(data_.stream()) in copyToHost, correctly scoping the sync to the stream used by the D2H copy.

_{Reviews (5): Last reviewed commit: "Fix ruff format: collapse torch.stack ca..." | Re-trigger Greptile}

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

similarity_kernels.cu: add blank line between constexpr int kMaxDevices and int8_t g_tensorOpsCache declarations to break AlignConsecutiveDeclarations grouping — mixed-type consecutive declarations produced formatting that differed from what clang-format expected. clustering.py: collapse torch.stack([...]) call to Black-style multiline form without magic trailing comma; previous multiline-with-trailing-comma form triggered Black's trailing-comma expansion rule, failing ruff format.

scal444 · 2026-05-22T17:03:48Z

Thanks for the contribution!

The TFD, triangle smooth, eigensolver, and similarity fixes are good to go except for the Greptile comment. I measured a modest speedup in our similarity benchmarks.

The Butina change is good, and showed major (nearly 2x) speedups in most of our benchmarks. I identified one regression which when fixed adds another 5-10% speedup, and made a few suggestions.

The Conformer RMSD change also works, with more modest (~10%) speedups. I think it would be worthwhile to experiment with a fused reduction, creating a combined datatype with all the concurrent reductions and doing a block reduce on that, but this is a real improvement so can go in without that experiment.

- similarity_kernels.cu: wrap cudaGetDevice and cudaGetDeviceProperties with cudaCheckError in isTensorOpsSupportedCached so errors in the cached path surface immediately rather than silently producing garbage values - conformer_rmsd.cu: replace hand-rolled __shfl_down_sync warpSumDouble with cub::WarpReduce<double> for better abstraction and CUB idiom consistency; behaviour and performance are identical for full 32-thread warps since CUB uses shuffles internally - bfgs_minimize.cu: revert |energy| convergence denominator to match RDKit's current signed-energy behaviour; add TODO to file upstream bug and gate fix on RDKit version when merged - clustering.py: eliminate is_free_tensor H2D roundtrip in fused_butina; is_free is already updated in-place on GPU by extract_cluster_and_singletons so we index x/indices/neigh directly instead of round-tripping through CPU

scal444 · 2026-05-26T14:24:52Z

This looks great, just needs one more ruff format.

…n 119-char limit)

greptile-apps Bot reviewed May 19, 2026

View reviewed changes

Comment thread src/similarity_kernels.cu Outdated

Comment thread src/similarity_kernels.cu

mooreneural and others added 2 commits May 18, 2026 19:53

Update src/similarity_kernels.cu

97dc55b

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

scal444 requested changes May 22, 2026

View reviewed changes

Comment thread src/minimizer/bfgs_minimize.cu Outdated

Comment thread nvmolkit/clustering.py Outdated

Comment thread nvmolkit/clustering.py

Comment thread src/conformer_rmsd.cu Outdated

Comment thread src/similarity_kernels.cu

greptile-apps Bot reviewed May 22, 2026

View reviewed changes

Comment thread src/minimizer/bfgs_minimize.cu

mooreneural requested a review from scal444 May 25, 2026 15:42

scal444 approved these changes May 26, 2026

View reviewed changes

Fix ruff format: collapse torch.stack call to single line (fits withi…

c84f5a5

…n 119-char limit)

scal444 merged commit ea09e2d into NVIDIA-BioNeMo:main May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Perf & correctness fixes: RMSD warp-shuffle, TFD precision, eigensolver persistence, similarity cache, BFGS sign fix, fused Butina sync reduction#177

Perf & correctness fixes: RMSD warp-shuffle, TFD precision, eigensolver persistence, similarity cache, BFGS sign fix, fused Butina sync reduction#177
scal444 merged 5 commits into
NVIDIA-BioNeMo:mainfrom
mooreneural:nvMolKit

mooreneural commented May 19, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 19, 2026 •

edited

Loading

Greptile Summary

Uh oh!

Uh oh!

Uh oh!

scal444 commented May 22, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

scal444 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mooreneural commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Scientific correctness and GPU performance improvements across six subsystems. No public API changes.

Correctness fixes

Performance fixes

Test plan

Uh oh!

greptile-apps Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

scal444 commented May 22, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

scal444 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mooreneural commented May 19, 2026 •

edited

Loading

greptile-apps Bot commented May 19, 2026 •

edited

Loading