Perf & correctness fixes: RMSD warp-shuffle, TFD precision, eigensolver persistence, similarity cache, BFGS sign fix, fused Butina sync reduction#177
Conversation
…FGS, and fused Butina Scientific / correctness: - conformer_rmsd.cu: no change to algorithm, only implementation (see below) - tfd_kernels.cu: use double-precision sqrt for pair-index decoding (float32 loses bits past ~1415 conformers); guard ring-torsion path against numQ==0 division by zero - bfgs_minimize.cu: use |energy| in gradient convergence denominator so negative-energy geometries mid-minimisation don't tighten the threshold asymmetrically - triangle_smooth.cu: replace cudaDeviceSynchronize() with stream-scoped cudaStreamSynchronize in copyToHost, avoiding a device-wide stall Performance: - conformer_rmsd.cu: replace 17 sequential cub::BlockReduce + __syncthreads() calls in computePairRmsd with warp-shuffle reductions (warpSumDouble) and a single 400-byte shared buffer; reduces sync count to 3 for the alignment path and 1 for the prealigned path, and cuts shared memory from ~1120 to 400 bytes - symmetric_eigensolver.cu: move AsyncDeviceVector<curandState> from a stack-local allocation inside launchBatchEigensolverKernel (re-allocated on every call) to a persistent states_ member of BatchedEigenSolver::Impl, resizing only when the batch grows - similarity_kernels.cu: add a per-device int8_t cache (g_tensorOpsCache[16]) for isTensorOpsSupportedCached(); cudaGetDeviceProperties is now called at most once per device per process instead of on every similarity launch - clustering.py (fused_butina): reduce GPU↔CPU synchronisation from ~6 D2H syncs per clustering iteration to 2 by (a) batching neigh.max() + flip.argmax() into one torch.stack().tolist() call, (b) batching cluster_count + is_free into one torch.cat().tolist() call, and (c) maintaining a CPU mirror of the indices tensor to avoid a per-centroid sync
|
| Filename | Overview |
|---|---|
| nvmolkit/clustering.py | Batches two GPU→CPU syncs per iteration to two, introduces a CPU mirror of the indices tensor; logic appears correct but the combined-tolist approach deserves careful testing on edge cases (n_start=0, single-molecule clusters). |
| src/conformer_rmsd.cu | Replaces 17 BlockReduce+sync calls with a 3-sync (Kabsch) / 1-sync (prealigned) warp-shuffle pattern; shared-memory layout, sync points, and warp-lane indexing all appear correct for the fixed 128-thread block launch. |
| src/minimizer/bfgs_minimize.cu | Comment-only change documenting the intentional signed-energy denominator and a TODO to gate a future fix on an upstream RDKit version; no functional diff. |
| src/similarity_kernels.cu | Introduces an atomic-guarded per-device capability cache that replaces four identical cudaGetDeviceProperties call sites; addresses both the data-race (now std::atomic<int8_t>) and the missing CUDA error checks flagged in previous review rounds. |
| src/symmetric_eigensolver.cu | Promotes the curandState buffer from a per-call heap allocation to a persistent grows-only member; kernel always calls curand_init each launch so correctness is unchanged; allocation cost is eliminated for equal-or-smaller subsequent batches. |
| src/tfd/tfd_kernels.cu | Switches lower-triangle pair-index decode from float32 to double sqrt (fixes precision past ~1415 conformers) and guards the ring-torsion path against numQ==0 division by zero; both fixes are correct. |
| src/triangle_smooth.cu | Replaces cudaDeviceSynchronize() with cudaStreamSynchronize(data_.stream()) in copyToHost, correctly scoping the sync to the stream used by the D2H copy. |
Reviews (5): Last reviewed commit: "Fix ruff format: collapse torch.stack ca..." | Re-trigger Greptile
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
similarity_kernels.cu: add blank line between constexpr int kMaxDevices and int8_t g_tensorOpsCache declarations to break AlignConsecutiveDeclarations grouping — mixed-type consecutive declarations produced formatting that differed from what clang-format expected. clustering.py: collapse torch.stack([...]) call to Black-style multiline form without magic trailing comma; previous multiline-with-trailing-comma form triggered Black's trailing-comma expansion rule, failing ruff format.
|
Thanks for the contribution! The TFD, triangle smooth, eigensolver, and similarity fixes are good to go except for the Greptile comment. I measured a modest speedup in our similarity benchmarks. The Butina change is good, and showed major (nearly 2x) speedups in most of our benchmarks. I identified one regression which when fixed adds another 5-10% speedup, and made a few suggestions. The Conformer RMSD change also works, with more modest (~10%) speedups. I think it would be worthwhile to experiment with a fused reduction, creating a combined datatype with all the concurrent reductions and doing a block reduce on that, but this is a real improvement so can go in without that experiment. |
- similarity_kernels.cu: wrap cudaGetDevice and cudaGetDeviceProperties with cudaCheckError in isTensorOpsSupportedCached so errors in the cached path surface immediately rather than silently producing garbage values - conformer_rmsd.cu: replace hand-rolled __shfl_down_sync warpSumDouble with cub::WarpReduce<double> for better abstraction and CUB idiom consistency; behaviour and performance are identical for full 32-thread warps since CUB uses shuffles internally - bfgs_minimize.cu: revert |energy| convergence denominator to match RDKit's current signed-energy behaviour; add TODO to file upstream bug and gate fix on RDKit version when merged - clustering.py: eliminate is_free_tensor H2D roundtrip in fused_butina; is_free is already updated in-place on GPU by extract_cluster_and_singletons so we index x/indices/neigh directly instead of round-tripping through CPU
|
This looks great, just needs one more ruff format. |
…n 119-char limit)
Scientific correctness and GPU performance improvements across six subsystems. No public API changes.
Correctness fixes
Performance fixes
Test plan