Add chunked-dispatch mode to eval_runner by cvolkcvolk · Pull Request #751 · isaac-sim/IsaacLab-Arena

cvolkcvolk · 2026-06-02T08:55:51Z

Summary

Adds --chunk_size N to eval_runner. When set and the jobs config has more than N jobs, the runner splits the config into chunks and launches one fresh subprocess per chunk. Unset → unchanged single-process behavior.

Why

Long evaluation sweeps OOM. Each env build/teardown cycle leaves ~72 MB of host memory unreclaimable from Python. Restarting the process is the only way to reclaim that memory

The in-process leak will be partly fixed upstream (isaac-sim/IsaacLab#5896 cutting it from ~540 MB/cycle to ~72 MB/cycle).
Chunking is the workaround until what's left is identified and fixed.

greptile-apps · 2026-06-02T08:59:00Z

Greptile Summary

This PR adds a --chunk_size N flag to eval_runner that partitions a large jobs config into chunks and re-invokes the runner as a fresh subprocess per chunk, allowing the OS to reclaim host memory that cannot be freed in-process between Isaac Sim teardown cycles.

_run_chunk serializes each chunk to a temp JSON file, appends --eval_jobs_config <tmp> to the current sys.argv so argparse picks up the override on re-parse, waits for the child, and cleans up the temp file in a finally block — correctly avoiding recursion because each child's job count is always ≤ chunk_size, making the child's dispatch condition false.
_run_in_chunks drives the loop, exits the parent with the child's non-zero return code on failure, and guards against non-positive chunk_size before dividing. The metrics-aggregation gap across chunks is acknowledged in a TODO.

Confidence Score: 5/5

Safe to merge — the chunked dispatch logic is structurally correct, temp file cleanup is guarded by finally, and child processes cannot recurse because their job count is always less than or equal to chunk_size.

The core dispatch mechanism works correctly: child processes inherit the full environment, the --eval_jobs_config override reliably wins on re-parse, and the non-recursion invariant holds for all chunk sizes. The two findings are both presentational — an off-by-one in the failure log label and a missing completion message — neither affects correctness or data integrity.

No files require special attention; both changed files are straightforward additions.

Important Files Changed

Filename	Overview
isaaclab_arena/evaluation/eval_runner.py	Adds _run_chunk and _run_in_chunks helpers plus a dispatch branch in main(). Core logic is sound — child processes inherit sys.argv with an appended --eval_jobs_config override, temp file is safely cleaned up in finally, and children never recurse because their job count is always <= chunk_size. Minor inconsistency between 0-indexed failure message and 1-indexed chunk labels; no completion message after all chunks succeed.
isaaclab_arena/evaluation/eval_runner_cli.py	Adds --chunk_size argument as type=int with default=None. Straightforward CLI addition; no issues.

Sequence Diagram

sequenceDiagram
    participant U as User
    participant P as Parent eval_runner
    participant T as Temp JSON file
    participant C as Child subprocess

    U->>P: python eval_runner.py --eval_jobs_config master.json --chunk_size N
    P->>P: parse args, load master config
    alt "len(jobs) > chunk_size"
        P->>P: _run_in_chunks()
        loop for each chunk
            P->>T: json.dump(jobs chunk)
            P->>C: subprocess.run with --eval_jobs_config tmp override
            C->>T: load chunk config
            C->>C: enable_cameras_if_required()
            C->>C: SimulationAppContext - run jobs
            C-->>P: exit code
            P->>T: unlink temp file (finally)
            alt "exit code != 0"
                P->>U: sys.exit(returncode)
            end
        end
        P->>U: return exit 0
    else "len(jobs) <= chunk_size"
        P->>P: enable_cameras_if_required()
        P->>P: SimulationAppContext - run all jobs
        P->>U: return
    end

_{Reviews (8): Last reviewed commit: "Add chunk_size usage guidance to help te..." | Re-trigger Greptile}

isaaclab-review-bot

Code Review: Add chunked-dispatch mode to eval_runner

The approach is well-motivated — process restart is the correct strategy for reclaiming C-level memory (USD/Hydra state) that survives Python teardown. The implementation is clean and minimal: dispatcher logic stays in the runner, children reuse the existing in-process flow without code duplication. Resume support via --start_chunk is a nice touch for long sweeps.

A few issues worth addressing before merge:

Summary of Findings

#	Severity	File	Issue
1	🟡 Medium	eval_runner.py	`sys.argv[0]` not reliable for module-style invocation
2	🟡 Medium	eval_runner.py	`--start_chunk` ≥ actual chunk count silently succeeds
3	🟡 Medium	eval_runner.py	`--chunk_size 0` causes `ZeroDivisionError`
4	🟢 Low	eval_runner.py	`raise SystemExit(0)` vs implicit `None` return
5	🟢 Low	eval_runner.py	No unit tests for `_argv_without_dispatcher_flags`

Overall: solid implementation with minor hardening needed. The core dispatch logic is correct and the fail-fast + resume pattern is well thought out.

isaaclab-review-bot

🤖 Isaac Lab Review Bot

Summary

This PR adds a chunked-dispatch mode to eval_runner that splits large job sweeps across subprocess boundaries, allowing the OS to reclaim memory leaks between chunks. The implementation is clean and follows standard subprocess patterns.

Design Assessment

Design is sound. Process-restart is the only reliable way to reclaim C-level memory leaks from USD/Hydra. The implementation correctly:

Preserves single-process behavior when --chunk_size is unset or job count is small
Cleans up temp files in finally blocks
Propagates exit codes for CI integration

Findings

🟡 Warning: chunk_size ≤ 0 validation missing — isaaclab_arena/evaluation/eval_runner.py:135

The validation if chunk_size <= 0 raises ValueError, which is good. However, since chunk_size comes from argparse with type=int, a non-integer value would fail at CLI parsing time. The current validation is sufficient.

🟡 Warning: subprocess signal handling — isaaclab_arena/evaluation/eval_runner.py:154-155

When a subprocess is killed by a signal (e.g., SIGKILL from OOM killer), result.returncode will be negative (-signal). The current message "exit code -9" is technically correct but could be more informative for users debugging OOM issues.

Consider:

if result.returncode < 0:
    sig = -result.returncode
    print(f"[eval_runner] chunk {chunk_idx} killed by signal {sig} (possibly OOM).", flush=True)
else:
    print(f"[eval_runner] chunk {chunk_idx} failed (exit {result.returncode}).", flush=True)

🔵 Suggestion: non-jobs keys dropped — isaaclab_arena/evaluation/eval_runner.py:143

The chunked config only preserves the "jobs" key: json.dump({"jobs": jobs[start:end]}, tmp). If the master config has other top-level keys (e.g., metadata, version), they won't be forwarded to child processes. This is fine if the schema only has jobs, but worth noting if the schema might expand.

Test Coverage

Feature PR: No tests included
Coverage: Partial — existing test_eval_runner.py tests single-process flow but not chunked dispatch
Gaps: The chunked dispatch logic is untested. Consider adding an integration test verifying chunked dispatch produces the same results as single-process execution (a 3-job config with --chunk_size 1 should produce identical metrics to running without chunking).

CI Status

Pre-commit check queued.

Verdict

Minor fixes needed

The implementation is correct and well-structured. The signal-handling improvement is nice-to-have for UX but not blocking.

Update (eb022d2): Reviewed incremental commit that rewrites the chunked-dispatch implementation from scratch, simplifying it significantly:

Changes:

Removed --start_chunk — The previous version had resume support via --start_chunk; this is now removed, simplifying the implementation. Resume can be achieved externally if needed.
Removed _argv_without_dispatcher_flags helper — The new approach forwards sys.argv[1:] directly and appends --eval_jobs_config <chunk_path> at the end, relying on argparse's last-value-wins semantics. This is cleaner and eliminates potential bugs in flag-stripping logic.
Simplified loop — Straightforward for chunk_idx in range(n_chunks) without start_chunk offset.

Previous concerns resolved:

✅ --start_chunk validation concern is now moot (parameter removed)

Remaining observations:

The signal-aware error messaging suggestion still applies
The non-jobs keys observation still applies (low priority)

Update (68b20a7): New commits add random_yaw_init, NotNextTo relation, --env_graph_spec_yaml CLI arg, and reproducible per-env RNG for the pooled placer. These are unrelated to the eval_runner chunked-dispatch feature reviewed above — the new commits do not touch eval_runner.py.

New code assessment: Clean implementation. Highlights:

✅ Conservative bbox enclosure correctly composes marker yaw + sampled yaw
✅ NotNextTo loss has no flat plateau (gradient always points to nearest escape)
✅ get_rngs gives per-env independent, reproducible streams
✅ Comprehensive test coverage for all new features

No new issues found in the incremental diff.

Update (646b5d9): Incremental commit adds a TODO comment in _run_in_chunks acknowledging that per-chunk metrics are not aggregated/persisted and outlining a follow-up plan (--metrics_file forwarding + merge step). Good housekeeping — no functional changes, no new issues.

Update (067759b): Reviewed incremental changes. This commit is a clean refactor:

✅ Extracted _run_chunk helper — good separation of concerns, makes the dispatch loop easier to read
✅ math.ceil(len(jobs) / chunk_size) replaces manual ceiling division — cleaner
✅ Improved comments explaining why process-restart is needed (host memory leaks)
✅ CLI help text now advises when to use --chunk_size

Previous suggestions still applicable (non-blocking):

Signal-aware error message for negative return codes (OOM kills)
Preserving non-jobs keys in chunk configs

No new issues. Code quality is improving with each iteration. 👍

For long evaluation sweeps that build and tear down hundreds of envs in one process, a residual host-RSS leak below the Python layer (e.g. USD asset cache, Hydra render delegate state) eventually OOM-kills the process even after every Python-side cleanup has been done. The only reliable way to reclaim that memory is to let the OS release it on process exit. Adds --chunk_size N. When set and the jobs config has more than N jobs, eval_runner splits the config into chunks of N jobs and dispatches one subprocess per chunk; each child boots a fresh SimulationApp and runs only its chunk in-process. If --chunk_size is unset, or len(jobs) <= chunk_size, the dispatcher is inert: behavior matches the original single-process flow exactly. A child subprocess receives at most chunk_size jobs so it also runs in-process, even with --chunk_size still on the command line. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

xyao-nv · 2026-06-02T17:07:28Z

Do you know where this 72 MB comes from?

cvolkcvolk · 2026-06-03T08:42:14Z

@xyao-nv Mostly from the observation fields which were not released in IsaacLab here

alexmillane

Thanks for part-fixing part-mitigating this. Onwards!

I have a couple of mega nits.

Feel free to merge when you want!

Single-process eval runs already aggregate all jobs into one MetricsLogger summary, but the chunked dispatch path returns before building a logger, so per-chunk metrics are never merged or persisted. Document the gap and the intended follow-up at the early return in main(). Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Extract the per-chunk subprocess body into _run_chunk(), use math.ceil for the chunk count, name the child-command parts for readability, and reword the host-memory comment. No behavior change. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Clarify that --chunk_size should stay unset for normal runs and be set only when a long sweep grows in host memory or gets OOM-killed. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

cvolkcvolk requested review from alexmillane, peterd-NV, qianl-nv, viiik-inside, xyao-nv and zhx06 as code owners June 2, 2026 08:55

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread isaaclab_arena/evaluation/eval_runner.py Outdated

Comment thread isaaclab_arena/evaluation/eval_runner.py Outdated

Comment thread isaaclab_arena/evaluation/eval_runner.py Outdated

isaaclab-review-bot Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread isaaclab_arena/evaluation/eval_runner.py Outdated

Comment thread isaaclab_arena/evaluation/eval_runner.py Outdated

cvolkcvolk force-pushed the cvolk/feature/eval-runner-chunked-dispatch branch 11 times, most recently from e686814 to 4793daf Compare June 2, 2026 09:52

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread isaaclab_arena/evaluation/eval_runner.py Outdated

cvolkcvolk force-pushed the cvolk/feature/eval-runner-chunked-dispatch branch from 4793daf to eb022d2 Compare June 2, 2026 09:58

xyao-nv reviewed Jun 2, 2026

View reviewed changes

Comment thread isaaclab_arena/evaluation/eval_runner_cli.py

xyao-nv reviewed Jun 2, 2026

View reviewed changes

Comment thread isaaclab_arena/evaluation/eval_runner.py

xyao-nv approved these changes Jun 2, 2026

View reviewed changes

Merge branch 'main' into cvolk/feature/eval-runner-chunked-dispatch

68b20a7

alexmillane approved these changes Jun 3, 2026

View reviewed changes

cvolkcvolk added 3 commits June 3, 2026 15:35

Add chunk_size usage guidance to help text

067759b

Clarify that --chunk_size should stay unset for normal runs and be set only when a long sweep grows in host memory or gets OOM-killed. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

cvolkcvolk enabled auto-merge (squash) June 3, 2026 14:06

cvolkcvolk merged commit 3c408b3 into main Jun 3, 2026
6 checks passed

Conversation

cvolkcvolk commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Uh oh!

greptile-apps Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

Code Review: Add chunked-dispatch mode to eval_runner

Summary of Findings

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

🤖 Isaac Lab Review Bot

Summary

Design Assessment

Findings

Test Coverage

CI Status

Verdict

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xyao-nv commented Jun 2, 2026

Uh oh!

cvolkcvolk commented Jun 3, 2026

Uh oh!

alexmillane left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cvolkcvolk commented Jun 2, 2026 •

edited

Loading

greptile-apps Bot commented Jun 2, 2026 •

edited

Loading

isaaclab-review-bot Bot left a comment •

edited

Loading