Improve autotune batch size and CPU count scanning#179

Merged
scal444 merged 5 commits into
NVIDIA-BioNeMo:mainfrom
scal444:split/autotune
May 29, 2026

@scal444
Collaborator

@scal444 scal444 commented May 26, 2026

Autotune now steps batch size in 64-element increments by default, which cuts down the search space. The CPU count search space is now limited to physical cores by default.

scal444 added 3 commits May 26, 2026 13:58
…s at 8

- cpu_count() now parses /proc/cpuinfo for unique (physical_id, core_id)
  pairs so SMT siblings don't double-count, falling back to os.cpu_count().
- Default FF / embed / substruct search spaces use a categorical list of
  multiples of 64 for batchSize (kernels are tile-tuned for those sizes).
- batchesPerGpu / workerThreads are now capped at min(8, cpus // num_gpus);
  8 is the empirical point of diminishing returns, and the cpus // num_gpus
  term prevents CPU oversubscription across GPUs.
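
A minimal sketch of the physical-core counting described in the commit message above. This is not the PR's code: the names `physical_core_count` and `cpu_count_sketch` are hypothetical, and the parser takes the file contents as a string so it can be exercised without a real /proc/cpuinfo. It assumes processor stanzas are separated by blank lines and that the relevant keys are spelled `physical id` and `core id`.

```python
import os
from typing import Optional


def physical_core_count(cpuinfo_text: str) -> Optional[int]:
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo-style text.

    Returns None when no stanza carries both fields (hypothetical helper).
    """
    cores = set()
    physical_id = None
    core_id = None
    # The trailing "" flushes the last stanza even without a final blank line.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one logical-processor stanza.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    return len(cores) or None


def cpu_count_sketch(path: str = "/proc/cpuinfo") -> int:
    """Physical core count, falling back to os.cpu_count(), never below 1."""
    try:
        with open(path) as fh:
            count = physical_core_count(fh.read())
    except OSError:
        count = None
    return count or os.cpu_count() or 1
```

With two cores that each expose two SMT siblings, the four logical processors collapse to two pairs, so the sketch reports 2 rather than 4.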
@greptile-apps
Contributor

greptile-apps Bot commented May 26, 2026

Greptile Summary

This PR tightens the autotune search spaces in two ways: batch sizes are switched from log-uniform ranges to stepped integer ranges (multiples of 64/128), and CPU-thread counts are now bounded by physical core count (read from /proc/cpuinfo) rather than logical count. A new (low, high, step) spec type is plumbed through suggest_from_space and collect_int_from_space in _core.py, with a positive-step guard in the suggestion path.

  • Stepped batch size: batchSize for FF, embed, and substructure tunables moves from log-uniform to (low, high, step) tuples (e.g. (64, 1024, 64)), reducing the effective search space while preserving numeric ordering for TPE.
  • Physical core detection: _physical_cpu_count_from_proc reads distinct (physical id, core id) pairs from /proc/cpuinfo and falls back to os.cpu_count() when the file is absent or the expected fields are missing (ARM64, some older kernels).
  • batchesPerGpu / workerThreads cap: All three tuners now cap the per-GPU thread upper bound at min(8, cpus // num_gpus), reflecting the empirical point of diminishing returns for batched dispatch.
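
The two search-space rules summarized in the bullets above can be illustrated in a few lines. This is a sketch under stated assumptions, not the code in `_core.py` or the tuners: `stepped_values` and `per_gpu_thread_cap` are hypothetical names, and the final clamp to at least 1 is an assumption for machines with fewer cores than GPUs.

```python
from typing import List


def stepped_values(low: int, high: int, step: int) -> List[int]:
    """Enumerate the integers a (low, high, step) spec can take, inclusive."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(low, high + 1, step))


def per_gpu_thread_cap(cpus: int, num_gpus: int, cap: int = 8) -> int:
    """Upper bound for per-GPU threads: min(cap, cpus // num_gpus).

    The max(1, ...) clamp is an assumption so the bound is never zero.
    """
    return max(1, min(cap, cpus // max(1, num_gpus)))


# A (64, 1024, 64) spec yields 16 candidates instead of every integer in range.
candidates = stepped_values(64, 1024, 64)
```

For example, `per_gpu_thread_cap(64, 4)` hits the cap of 8, while `per_gpu_thread_cap(16, 4)` is limited to 4 by the core count.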

Confidence Score: 5/5

Safe to merge; the logic changes are well-scoped and the new physical-core fallback degrades gracefully.

The stepped-range plumbing is correct and tested end-to-end. The physical CPU detection reads a well-known file, returns None on any failure, and the caller clamps to a floor of 1, so there is no crash path. The batchesPerGpu cap at 8 and the narrowed batchSize ranges are intentional tuning decisions with no correctness risk.

_ff_common.py is worth a second look on ARM64/Grace Hopper targets, where /proc/cpuinfo lacks physical id fields and the function silently falls back to logical count.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nvmolkit/autotune/_core.py | Adds stepped integer range (low, high, step) support to suggest_from_space (with step <= 0 guard) and collect_int_from_space (snapping to nearest multiple of step from low). |
| nvmolkit/autotune/_ff_common.py | Introduces _physical_cpu_count_from_proc to read distinct (physical id, core id) pairs from /proc/cpuinfo, falling back to os.cpu_count() if the file is missing or fields are absent; default_ff_search_space switches batchSize to stepped multiples of 64 and caps batchesPerGpu at 8. |
| nvmolkit/autotune/tune_embed_molecules.py | Mirrors the FF changes: batchSize switched to (64, 1024, 64) stepped range and batchesPerGpu capped at min(8, cpus // num_gpus). |
| nvmolkit/autotune/tune_substructure.py | Same pattern: batchSize switched to (128, 1024, 128) and workerThreads per-GPU cap now also bounded at 8. |
| nvmolkit/tests/test_autotune.py | Updates existing tests for the new per-GPU cap of 8 and adds new tests for stepped batchSize, the batchesPerGpu cap of 8, and _physical_cpu_count_from_proc SMT deduplication. |
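
The snapping behavior mentioned for collect_int_from_space (nearest multiple of step, measured from low) could look roughly like the sketch below. The function name `snap_to_step`, the round-half-up tie rule, and the clamping of out-of-range values into [low, high] are all assumptions made for illustration; the PR's actual handling of ties and bounds is not shown on this page.

```python
def snap_to_step(value: int, low: int, high: int, step: int) -> int:
    """Snap value to the nearest low + k * step that lies within [low, high].

    Ties round up and out-of-range inputs are clamped (both are assumptions).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    max_k = (high - low) // step
    # Integer round-half-up of (value - low) / step, avoiding float rounding.
    k = (value - low + step // 2) // step
    k = max(0, min(max_k, k))
    return low + k * step
```

With a (64, 1024, 64) spec, a value of 100 snaps to 128 and a value of 90 snaps to 64.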

Reviews (3): Last reviewed commit: "formatting"

Comment thread nvmolkit/autotune/_core.py
Comment thread nvmolkit/autotune/_core.py
scal444 and others added 2 commits May 26, 2026 15:37
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@scal444 scal444 requested a review from evasnow1992 May 27, 2026 12:59
Comment thread nvmolkit/autotune/_ff_common.py
Collaborator

@evasnow1992 evasnow1992 left a comment

LGTM. Only one minor comment. Thanks!

@scal444 scal444 merged commit d0bce61 into NVIDIA-BioNeMo:main May 29, 2026
8 checks passed
2 participants