Consolidate smiles/smarts loaders further in benchmarks, ensure shuffling #182
- load_smiles now shuffles its result deterministically with the seed, so benches that consume a head slice get a representative cross-section rather than file-order bias.
- substruct_bench.py drops its local load_pickle/load_smiles/load_smarts helpers in favor of the shared bench_utils versions.
- butina_clustering_bench.py and cross_similarity_bench.py replace inline open()/pd.read_csv loaders with the shared load_smiles helper.
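
To illustrate the file-order bias mentioned above, here is a minimal sketch of a seeded shuffle followed by a head slice. `shuffled_head` and the toy `records` list are hypothetical names made up for this example; this is not the code in bench_utils, only an assumption about the general idea.

```python
import random


def shuffled_head(items, n, seed=42):
    """Return the first n items after a seeded shuffle (hypothetical helper)."""
    rng = random.Random(seed)   # private RNG, so global random state is untouched
    shuffled = list(items)      # copy, leaving the caller's list in file order
    rng.shuffle(shuffled)
    return shuffled[:n]


# Toy stand-in for a file whose records are sorted, e.g. by molecule size.
records = [f"mol_{i:03d}" for i in range(100)]

plain_head = records[:10]                       # only the first ten file entries
mixed_head = shuffled_head(records, 10, seed=42)  # drawn from the whole file
```

Because the RNG is seeded, repeated runs with the same seed return the same slice, so benchmark inputs stay comparable between runs.
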
| Filename | Overview |
|---|---|
| benchmarks/bench_utils/loaders.py | All three loaders (load_pickle, load_smiles, load_sdf) now shuffle in the else-branch when sampling is not triggered, ensuring file-order bias is removed regardless of input size vs max_count. |
| benchmarks/butina_clustering_bench.py | Replaced manual head-slice CSV parsing with load_smiles(), gaining reservoir sampling and shuffle behaviour. |
| benchmarks/cross_similarity_bench.py | Replaced pandas CSV loading + manual parsing with load_smiles(); removes pandas dependency and gains shuffle; duplicate-padding fallback loop unchanged. |
| benchmarks/substruct_bench.py | Removed local load_pickle, load_smiles, load_smarts duplicates and replaced with imports from bench_utils; no logic changes to benchmark code itself. |
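
The sampling and else-branch shuffle described in the table can be sketched roughly as follows. This is an illustration under assumptions, not the actual loaders.py code: `sample_lines`, its parameters, and the default seed are hypothetical, and the real loaders also parse molecules, which is left out here.

```python
import random
from typing import Iterable, List


def sample_lines(lines: Iterable[str], max_count: int, seed: int = 42) -> List[str]:
    """Keep at most max_count non-empty lines (hypothetical helper).

    Uses reservoir sampling when the input is larger than max_count,
    otherwise shuffles everything that was read.
    """
    rng = random.Random(seed)
    kept: List[str] = []
    seen = 0
    for raw in lines:
        item = raw.strip()
        if not item:
            continue
        seen += 1
        if len(kept) < max_count:
            kept.append(item)            # fill the reservoir first
        else:
            j = rng.randrange(seen)      # uniform index in [0, seen)
            if j < max_count:
                kept[j] = item           # replace with probability max_count / seen
    if seen <= max_count:
        # Sampling never triggered: shuffle so a later head slice is not
        # biased by file order.
        rng.shuffle(kept)
    return kept


big = [f"C{i}" for i in range(50)]
small = [f"C{i}" for i in range(10)]
subset = sample_lines(big, max_count=5)       # sampling path
permuted = sample_lines(small, max_count=100)  # shuffle-only path
```

In this sketch the shuffle-only path returns every input line in a seeded order, while the sampling path returns a uniform random subset of size `max_count`.
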
Reviews (2): Last reviewed commit: "Shuffle sdf and pickle too"
    - smis = [line.strip() for line in f.readlines()]
    - mols = [MolFromSmiles(smi, sanitize=True) for smi in smis[: max_size + 100]]
    - mols = [mol for mol in mols if mol is not None]
    + mols = load_smiles(args.input_smiles_file, max_count=max_size + 100, sanitize=True)
Should we pass in a seed here to make it deterministic?
evasnow1992 left a comment
Two minor comments regarding whether to use a random seed in load_smiles. Not a blocker. Changes look good to me.