Reduce conformer bench preprocessing times via single embed + jitter#181
Conversation
Add bench_utils.perturb_conformer and bench_utils.embed_and_jitter as a shared helper for the three benches that need 'embed once then perturb' conformer prep: * tfd_bench, ff_optimize_bench: parallel embed_and_jitter across many mols. * conformer_rmsd_bench: per-mol EmbedMolecule + serial jitter loop in the existing single-mol benchmark structure (no parallel batch helper needed here since the outer loop processes one mol at a time).
|
| Filename | Overview |
|---|---|
| benchmarks/bench_utils/molprep.py | Adds perturb_conformer (per-atom uniform jitter in place) and embed_and_jitter (single ETKDGv3 embed via process_map, then jitter-duplicate to desired count); logic is correct and seed usage is consistent |
| benchmarks/ff_optimize_bench.py | Removes local _embed_conformers and replaces it with shared embed_and_jitter; the call correctly passes num_workers=args.rdkit_threads and mols already carry explicit Hs from prep_mols, so add_hs=False default is appropriate |
| benchmarks/tfd_bench.py | Replaces per-molecule serial ETKDG with batch embed_and_jitter; add_hs=True and min_atoms=4 are correctly threaded through; seed hardcoded to 42 in prepare_molecules is functionally equivalent to the old 42+i scheme since _embed_one offsets by index internally |
| benchmarks/conformer_rmsd_bench.py | Switches from EmbedMultipleConfs to single-embed + per-conformer jitter, now guarding num_confs < 2 explicitly; minor: embedding still happens before the num_confs < 2 skip check, so a single-conformer run does one unnecessary embed |
| benchmarks/bench_utils/init.py | Exports embed_and_jitter and perturb_conformer from molprep; straightforward addition |
Reviews (2): Last reviewed commit: "Update benchmarks/conformer_rmsd_bench.p..." | Re-trigger Greptile
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
| if num_confs < 2: | ||
| raise ValueError(f"num_confs must be >= 2 for TFD, got {num_confs}") | ||
| workers = num_workers if num_workers > 0 else max(1, multiprocessing.cpu_count() // 2) | ||
| return embed_and_jitter( |
There was a problem hiding this comment.
I’m a little concerned about replacing EmbedMultipleConfs with embed_and_jitter. If I understand correctly, the latter may generate conformers with near-identical torsion angles and without meaningful dihedral rotations. In that case, the TFD values could collapse close to 0, where the numerical comparison between RDKit and nvMolKit TFD implementations may become more variable or less informative.
There was a problem hiding this comment.
Valid concern - I would expect that the values are different enough but I'll confirm with a quick experiment.
evasnow1992
left a comment
There was a problem hiding this comment.
One comment, but no code validity issues. Changes look good to me.

Instead of running ETKDG on all conformers, embed one conformer, duplicate, jitter coordinates. Improves bottleneck for TFD, conformer RMSD, and ff minimization on the fly generation