Skip to content

Reduce conformer bench preprocessing times via single embed + jitter#181

Merged
scal444 merged 3 commits into
NVIDIA-BioNeMo:mainfrom
scal444:split/embed-perturb
May 29, 2026
Merged

Reduce conformer bench preprocessing times via single embed + jitter#181
scal444 merged 3 commits into
NVIDIA-BioNeMo:mainfrom
scal444:split/embed-perturb

Conversation

@scal444
Copy link
Copy Markdown
Collaborator

@scal444 scal444 commented May 27, 2026

Instead of running ETKDG on all conformers, embed one conformer, duplicate, jitter coordinates. Improves bottleneck for TFD, conformer RMSD, and ff minimization on the fly generation

Add bench_utils.perturb_conformer and bench_utils.embed_and_jitter as a
shared helper for the three benches that need 'embed once then perturb'
conformer prep:

* tfd_bench, ff_optimize_bench: parallel embed_and_jitter across many mols.
* conformer_rmsd_bench: per-mol EmbedMolecule + serial jitter loop in the
  existing single-mol benchmark structure (no parallel batch helper needed
  here since the outer loop processes one mol at a time).
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 27, 2026

Greptile Summary

This PR replaces per-conformer ETKDG runs with a single-embed-then-jitter strategy across the TFD, conformer RMSD, and FF minimization benchmarks, significantly reducing preprocessing time. The shared embed_and_jitter helper in bench_utils/molprep.py parallelises the embedding step via process_map and then duplicates/perturbs the single base conformer in-process.

  • embed_and_jitter and perturb_conformer are extracted into bench_utils/molprep.py and exported; ff_optimize_bench.py removes its local _embed_conformers in favour of the shared helper.
  • tfd_bench.py switches from a per-molecule serial loop to the batch API, adding --prep-workers for parallelism control; conformer_rmsd_bench.py inlines the same single-embed + jitter pattern directly in its benchmark loop.

Confidence Score: 5/5

The changes are confined to benchmark preprocessing utilities and do not touch library code; the jitter logic is straightforward and the parallelism is well-guarded.

The refactor is clean: seed handling is consistent, the conformer copy-then-perturb order is correct (all copies are taken from the unperturbed base before the base itself is jittered), and no production library paths are affected.

No files require special attention.

Important Files Changed

Filename Overview
benchmarks/bench_utils/molprep.py Adds perturb_conformer (per-atom uniform jitter in place) and embed_and_jitter (single ETKDGv3 embed via process_map, then jitter-duplicate to desired count); logic is correct and seed usage is consistent
benchmarks/ff_optimize_bench.py Removes local _embed_conformers and replaces it with shared embed_and_jitter; the call correctly passes num_workers=args.rdkit_threads and mols already carry explicit Hs from prep_mols, so add_hs=False default is appropriate
benchmarks/tfd_bench.py Replaces per-molecule serial ETKDG with batch embed_and_jitter; add_hs=True and min_atoms=4 are correctly threaded through; seed hardcoded to 42 in prepare_molecules is functionally equivalent to the old 42+i scheme since _embed_one offsets by index internally
benchmarks/conformer_rmsd_bench.py Switches from EmbedMultipleConfs to single-embed + per-conformer jitter, now guarding num_confs < 2 explicitly; minor: embedding still happens before the num_confs < 2 skip check, so a single-conformer run does one unnecessary embed
benchmarks/bench_utils/init.py Exports embed_and_jitter and perturb_conformer from molprep; straightforward addition

Reviews (2): Last reviewed commit: "Update benchmarks/conformer_rmsd_bench.p..." | Re-trigger Greptile

Comment thread benchmarks/bench_utils/molprep.py
Comment thread benchmarks/conformer_rmsd_bench.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@scal444 scal444 requested a review from evasnow1992 May 27, 2026 12:59
Comment thread benchmarks/tfd_bench.py
if num_confs < 2:
raise ValueError(f"num_confs must be >= 2 for TFD, got {num_confs}")
workers = num_workers if num_workers > 0 else max(1, multiprocessing.cpu_count() // 2)
return embed_and_jitter(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m a little concerned about replacing EmbedMultipleConfs with embed_and_jitter. If I understand correctly, the latter may generate conformers with near-identical torsion angles and without meaningful dihedral rotations. In that case, the TFD values could collapse close to 0, where the numerical comparison between RDKit and nvMolKit TFD implementations may become more variable or less informative.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid concern - I would expect that the values are different enough but I'll confirm with a quick experiment.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an additional spread parameter to avoid RMSD clumps, now both TFD and RMSD have reasonable spreads compared to ETKDG
image

Copy link
Copy Markdown
Collaborator

@evasnow1992 evasnow1992 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment, but no code validity issues. Changes look good to me.

@scal444 scal444 merged commit 5809dad into NVIDIA-BioNeMo:main May 29, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants