Consolidate smiles/smarts loaders further in benchmarks, ensure shuffling by scal444 · Pull Request #182 · NVIDIA-BioNeMo/nvMolKit

scal444 · 2026-05-27T12:36:46Z

No description provided.

- load_smiles now shuffles its result deterministically with the seed so benches that consume a head slice get a representative cross-section rather than file-order bias.
- substruct_bench.py drops its local load_pickle/load_smiles/load_smarts helpers in favor of the shared bench_utils versions.
- butina_clustering_bench.py and cross_similarity_bench.py replace inline open()/pd.read_csv loaders with the shared load_smiles helper.
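To illustrate the file-order bias mentioned in the first item, here is a minimal sketch. It does not use the real loader: `shuffled_copy` and the synthetic `file_order` list are hypothetical stand-ins, and the seed value is arbitrary. The point is only that a head slice of a sorted file sees one end of the distribution, while a seeded shuffle gives a reproducible cross-section.

```python
import random


def shuffled_copy(items, seed=42):
    """Return a deterministically shuffled copy of items (hypothetical helper)."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    out = list(items)
    rng.shuffle(out)
    return out


# Stand-in for a SMILES file sorted by molecule size: carbon chains of length 1..100.
file_order = ["C" * n for n in range(1, 101)]

# A head slice of the raw file only ever sees the smallest molecules.
head_unshuffled = file_order[:10]

# A head slice after a seeded shuffle draws from the whole file,
# and the same seed reproduces the same slice on every run.
head_shuffled = shuffled_copy(file_order, seed=42)[:10]

print("unshuffled head sizes:", [len(s) for s in head_unshuffled])
print("shuffled head sizes:  ", sorted(len(s) for s in head_shuffled))
```

Using a dedicated `random.Random(seed)` instance rather than `random.seed` keeps the benchmark's shuffle independent of any other code that touches the global generator.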

greptile-apps · 2026-05-27T12:40:37Z

Greptile Summary

This PR completes the loader consolidation started earlier by ensuring every molecule loader always shuffles its output, eliminating file-order bias even when the input has fewer molecules than max_count. It also migrates butina_clustering_bench.py and cross_similarity_bench.py to use the shared load_smiles loader, and removes the duplicate local load_pickle/load_smiles/load_smarts implementations from substruct_bench.py.

- load_pickle, load_smiles, and load_sdf in bench_utils/loaders.py all gain an else: rng.shuffle(...) branch so benchmarks that consume a head slice always receive a representative, deterministically-seeded cross-section.
- butina_clustering_bench.py and cross_similarity_bench.py replace ad-hoc CSV/SMILES parsing with load_smiles(), gaining reservoir sampling and the new shuffle guarantee.
- ~130 lines of duplicate loader code are deleted from substruct_bench.py and superseded by the shared bench_utils implementations.
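The sampling-plus-shuffle behaviour described above can be sketched roughly as follows. This is not the code in bench_utils/loaders.py: `sample_items` is a hypothetical name, its signature is an assumption, and it shuffles unconditionally instead of in an else branch, which also covers the case where the input has fewer items than max_count.

```python
import random


def sample_items(items, max_count, seed=42):
    """Keep at most max_count items from an iterable, returned in shuffled order.

    Hypothetical stand-in for the shared loaders; the real functions may differ.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for item in items:
        seen += 1
        if len(reservoir) < max_count:
            # Fill phase: the first max_count items are kept in file order.
            reservoir.append(item)
        else:
            # Reservoir sampling (Algorithm R): the new item replaces a random
            # slot with probability max_count / seen.
            j = rng.randrange(seen)
            if j < max_count:
                reservoir[j] = item
    # Shuffle in every case, so a head slice is unbiased even when the input
    # was shorter than max_count and no sampling happened.
    rng.shuffle(reservoir)
    return reservoir


# Input shorter than max_count: everything is kept, but the order is scrambled.
print(sample_items(["CCO", "c1ccccc1", "CC(=O)O", "CN", "C#N"], max_count=10))

# Input longer than max_count: a reproducible sample of 10 from 1000.
print(sample_items(range(1000), max_count=10))
```

The final shuffle matters for the sampled path too: slots that are never replaced keep their original file position, so without it a head slice of the reservoir would still lean toward early entries.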

Confidence Score: 5/5

Safe to merge; changes are confined to benchmark utilities and eliminate a reproducible source of input-ordering bias.

All three loaders correctly handle both the sampling and the no-sampling paths, and both previously-flagged gaps (load_sdf and load_pickle missing shuffle-in-else) are addressed in this diff. The benchmark consumer files are straightforward call-site migrations with no logic changes.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| benchmarks/bench_utils/loaders.py | All three loaders (load_pickle, load_smiles, load_sdf) now shuffle in the else-branch when sampling is not triggered, ensuring file-order bias is removed regardless of input size vs max_count. |
| benchmarks/butina_clustering_bench.py | Replaced manual head-slice CSV parsing with load_smiles(), gaining reservoir sampling and shuffle behaviour. |
| benchmarks/cross_similarity_bench.py | Replaced pandas CSV loading + manual parsing with load_smiles(); removes pandas dependency and gains shuffle; duplicate-padding fallback loop unchanged. |
| benchmarks/substruct_bench.py | Removed local load_pickle, load_smiles, load_smarts duplicates and replaced with imports from bench_utils; no logic changes to benchmark code itself. |

Reviews (2): Last reviewed commit: "Shuffle sdf and pickle too"

evasnow1992 · 2026-05-27T19:50:17Z

-        smis = [line.strip() for line in f.readlines()]
-    mols = [MolFromSmiles(smi, sanitize=True) for smi in smis[: max_size + 100]]
-    mols = [mol for mol in mols if mol is not None]
+    mols = load_smiles(args.input_smiles_file, max_count=max_size + 100, sanitize=True)


Should we pass in a seed here to make it deterministic?

evasnow1992

Two minor comments regarding whether to use a random seed in load_smiles. Not a blocker. Changes look good to me.

Shuffle sdf and pickle too

868b7be

scal444 requested a review from evasnow1992 May 27, 2026 12:59

evasnow1992 reviewed May 27, 2026

Comment thread benchmarks/cross_similarity_bench.py Outdated

evasnow1992 approved these changes May 27, 2026

Add seeds to scrambler

fabcda9

scal444 merged commit 95e273c into NVIDIA-BioNeMo:main May 29, 2026
8 checks passed

scal444 merged 3 commits into NVIDIA-BioNeMo:main from scal444:split/loaders
