Improve autotune batch size and CPU count scanning by scal444 · Pull Request #179 · NVIDIA-BioNeMo/nvMolKit

scal444 · 2026-05-26T19:19:12Z

Autotune now steps batch sizes in 64-element increments by default, shrinking the search space. The CPU-thread search space is now limited to the physical core count by default.

…s at 8
- cpu_count() now parses /proc/cpuinfo for unique (physical_id, core_id) pairs so SMT siblings are not double-counted, falling back to os.cpu_count().
- The default FF / embed / substruct search spaces use a categorical list of multiples of 64 for batchSize (the kernels are tile-tuned for those sizes).
- batchesPerGpu / workerThreads are now capped at min(8, cpus // num_gpus); 8 is the empirical point of diminishing returns, and the floor division by num_gpus prevents CPU oversubscription across GPUs.
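The per-GPU cap in the last bullet reduces to a single expression. A minimal sketch, assuming `per_gpu_thread_cap` as a hypothetical helper name (not nvMolKit's API); the outer `max(1, ...)` floor is also an assumption, added so the search range never becomes empty when there are more GPUs than cores:

```python
def per_gpu_thread_cap(physical_cpus: int, num_gpus: int, hard_cap: int = 8) -> int:
    """Hypothetical helper: upper bound for batchesPerGpu / workerThreads."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    # Split physical cores evenly across GPUs so concurrent workers don't oversubscribe the CPU.
    per_gpu = physical_cpus // num_gpus
    # 8 is the stated point of diminishing returns; assume a floor of 1 so at least one worker runs.
    return max(1, min(hard_cap, per_gpu))
```

On a 64-core, 4-GPU host the hard cap of 8 wins; on a 16-core, 4-GPU host the core split (4) wins.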

greptile-apps · 2026-05-26T19:24:47Z

Greptile Summary

This PR tightens the autotune search spaces in two ways: batch sizes are switched from log-uniform ranges to stepped integer ranges (multiples of 64/128), and CPU-thread counts are now bounded by physical core count (read from /proc/cpuinfo) rather than logical count. A new (low, high, step) spec type is plumbed through suggest_from_space and collect_int_from_space in _core.py, with a positive-step guard in the suggestion path.
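A `(low, high, step)` spec defines a finite integer grid. The sketch below enumerates that grid with the same positive-step guard the summary describes; `stepped_values` is a hypothetical name, not the function in `_core.py`. If the tuner is backed by Optuna (the TPE mention suggests it, but that is an assumption), `trial.suggest_int(name, low, high, step=step)` samples from the same grid.

```python
def stepped_values(low: int, high: int, step: int) -> list[int]:
    """Hypothetical helper: enumerate the candidates of a (low, high, step) integer spec."""
    # Positive-step guard: a zero or negative step has no meaningful grid.
    if step <= 0:
        raise ValueError(f"step must be positive, got {step}")
    if high < low:
        raise ValueError(f"high ({high}) must be >= low ({low})")
    # range() excludes its stop value, so use high + 1 to include high when it lies on the grid.
    return list(range(low, high + 1, step))
```

For the spaces quoted in this PR, `(64, 1024, 64)` yields 16 candidates and `(128, 1024, 128)` yields 8, while keeping the values ordered so a TPE-style sampler can still exploit locality.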

Stepped batch size: batchSize for FF, embed, and substructure tunables moves from log-uniform to (low, high, step) tuples (e.g. (64, 1024, 64)), reducing the effective search space while preserving numeric ordering for TPE.
Physical core detection: _physical_cpu_count_from_proc reads distinct (physical id, core id) pairs from /proc/cpuinfo and falls back to os.cpu_count() when the file is absent or the expected fields are missing (ARM64, some older kernels).
batchesPerGpu / workerThreads cap: All three tuners now cap the per-GPU thread upper bound at min(8, cpus // num_gpus), reflecting the empirical point of diminishing returns for batched dispatch.
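The physical-core detection and fallback described above can be sketched as follows. `physical_cores_from_cpuinfo` and `resolve_cpu_count` are hypothetical names, not `_physical_cpu_count_from_proc` itself; the sketch only assumes the standard `key : value` block layout of `/proc/cpuinfo`, with processors separated by blank lines.

```python
import os
from typing import Optional


def physical_cores_from_cpuinfo(text: str) -> Optional[int]:
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo-formatted text."""
    cores = set()
    physical_id = core_id = None
    # Processor blocks are separated by blank lines; append one so the last block is flushed.
    for line in text.splitlines() + [""]:
        if not line.strip():
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))  # SMT siblings share a pair, so they dedupe here
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    # None signals "fields absent" (e.g. many ARM64 kernels) so the caller can fall back.
    return len(cores) or None


def resolve_cpu_count(path: str = "/proc/cpuinfo") -> int:
    """Physical cores if detectable, else the logical count, never below 1."""
    try:
        with open(path) as fh:
            physical = physical_cores_from_cpuinfo(fh.read())
    except OSError:
        physical = None  # file missing or unreadable
    return max(1, physical or os.cpu_count() or 1)
```

Returning `None` instead of `0` keeps "no data" distinct from a real count, which is what lets the caller fall back silently on ARM64 hosts.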

Confidence Score: 5/5

Safe to merge; the logic changes are well-scoped and the new physical-core fallback degrades gracefully.

The stepped-range plumbing is correct and tested end-to-end. The physical CPU detection reads a well-known file, returns None on any failure, and the caller clamps to a floor of 1, so there is no crash path. The batchesPerGpu cap at 8 and the narrowed batchSize ranges are intentional tuning decisions with no correctness risk.

_ff_common.py is worth a second look on ARM64/Grace Hopper targets, where /proc/cpuinfo lacks the physical id fields and the function silently falls back to the logical count.

Important Files Changed

Filename	Overview
nvmolkit/autotune/_core.py	Adds stepped integer range `(low, high, step)` support to `suggest_from_space` (with `step <= 0` guard) and `collect_int_from_space` (snapping to nearest multiple of step from low).
nvmolkit/autotune/_ff_common.py	Introduces `_physical_cpu_count_from_proc` to read distinct `(physical id, core id)` pairs from `/proc/cpuinfo`, falling back to `os.cpu_count()` if the file is missing or fields are absent; `default_ff_search_space` switches `batchSize` to stepped multiples of 64 and caps `batchesPerGpu` at 8.
nvmolkit/autotune/tune_embed_molecules.py	Mirrors the FF changes: `batchSize` switched to `(64, 1024, 64)` stepped range and `batchesPerGpu` capped at `min(8, cpus // num_gpus)`.
nvmolkit/autotune/tune_substructure.py	Same pattern: `batchSize` switched to `(128, 1024, 128)` and `workerThreads` per-GPU cap now also bounded at 8.
nvmolkit/tests/test_autotune.py	Updated existing tests for the new per-GPU-8 cap and adds new tests for stepped `batchSize`, the `batchesPerGpu` 8-cap, and `_physical_cpu_count_from_proc` SMT deduplication.
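The table only summarizes how `collect_int_from_space` handles stepped ranges ("snapping to nearest multiple of step from low"). A minimal sketch of that rounding rule, where `snap_to_step` is a hypothetical name and both the round-half-up tie-break and the clamp to `[low, high]` are assumptions rather than confirmed behavior:

```python
def snap_to_step(value: int, low: int, high: int, step: int) -> int:
    """Hypothetical helper: snap value to the nearest grid point low + k*step within [low, high]."""
    if step <= 0:
        raise ValueError(f"step must be positive, got {step}")
    # Integer rounding to the nearest k; exact midpoints round up.
    k = (value - low + step // 2) // step
    # Clamp k so the result never falls below low or past the last grid point <= high.
    k_max = (high - low) // step
    k = max(0, min(k, k_max))
    return low + k * step
```

For the `(64, 1024, 64)` batchSize grid, a value of 100 snaps to 128 and 90 snaps to 64.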

Reviews (3): Last reviewed commit: "formatting"

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

evasnow1992

LGTM. Only one minor comment. Thanks!

scal444 added 3 commits May 26, 2026 13:58

Simplify CPU path

b9ba2f5

Further fixes for stepping

90383ed

greptile-apps Bot reviewed May 26, 2026


Comment thread nvmolkit/autotune/_core.py

Comment thread nvmolkit/autotune/_core.py

scal444 and others added 2 commits May 26, 2026 15:37

Update nvmolkit/autotune/_core.py

8cd7e8d

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

formatting

4b1c4b1

scal444 requested a review from evasnow1992 May 27, 2026 12:59

evasnow1992 reviewed May 27, 2026


Comment thread nvmolkit/autotune/_ff_common.py

evasnow1992 approved these changes May 27, 2026


scal444 merged commit d0bce61 into NVIDIA-BioNeMo:main May 29, 2026
8 checks passed


scal444 merged 5 commits into NVIDIA-BioNeMo:main from scal444:split/autotune