Emit v1.0 schema bundles from benchmark scripts #5842
Promote the JSON bundle schema produced by the standalone benchmark scripts under scripts/benchmarks/ into a real public-API module, isaaclab.benchmark.schema. Until now there was no single place in lab that defined the shape of training.json / startup.json, even though three lab scripts emit it and downstream tooling (e.g. the in-tree Odin evaluation harness) is starting to consume it.

The module ships frozen dataclasses for TrainingBundle, StartupBundle, and all their building blocks, plus a small write_bundle_file helper that serialises any dataclass tree as schema-v1 JSON. The package __init__ re-exports the public surface so callers can write `from isaaclab.benchmark import TrainingBundle`.

This commit also extends GPUInfoRecorder and MemoryInfoRecorder to report per-device peak alongside the existing mean/std rows. The peak rows are always emitted (initialised to 0.0) so dashboards see a consistent key set regardless of whether any sample was recorded. Existing rows are unchanged.

The benchmark scripts themselves continue to use the legacy output format on develop today; a follow-up PR rewrites them to emit schema-v1 bundles directly via this module.
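As a rough illustration of the pattern this commit describes (frozen dataclasses serialised to JSON), here is a standard-library-only sketch. The class names, field names, the `schema_version` key, and the body of `write_bundle_file_sketch` are all assumptions made for illustration; they are not the actual contents of `isaaclab.benchmark.schema`.

```python
import dataclasses
import json
import os
import tempfile


@dataclasses.dataclass(frozen=True)
class ExampleRuntime:
    """Hypothetical building block; field names are illustrative only."""

    total_s: float
    iterations: int


@dataclasses.dataclass(frozen=True)
class ExampleTrainingBundle:
    """Hypothetical stand-in for a bundle, not the real TrainingBundle."""

    run_id: str
    framework: str
    runtime: ExampleRuntime
    schema_version: str = "1.0"  # assumed key name


def write_bundle_file_sketch(bundle, path: str) -> None:
    """Serialise a dataclass tree to JSON, creating the parent directory first."""
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        # asdict() recurses into nested dataclasses, lists, and dicts.
        json.dump(dataclasses.asdict(bundle), f, indent=2, sort_keys=True)


def demo() -> dict:
    """Write an example bundle to a temporary directory and read it back."""
    bundle = ExampleTrainingBundle(
        run_id="example_run",
        framework="rsl_rl",
        runtime=ExampleRuntime(total_s=12.5, iterations=10),
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "nested", "training.json")
        write_bundle_file_sketch(bundle, path)
        with open(path, encoding="utf-8") as f:
            return json.load(f)
```

Freezing the dataclasses means a bundle cannot be mutated after construction, which keeps the written JSON consistent with the object that was built.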
Wire the three standalone benchmark scripts under scripts/benchmarks/ to emit self-contained JSON bundles conforming to the v1.0 schema added in the previous commit (isaaclab.benchmark.schema):

- benchmark_startup.py now optionally writes a StartupBundle to the path given by --schema_v1_output, with per-phase cProfile top-N data and total durations.
- benchmark_rsl_rl.py now optionally writes a TrainingBundle with the run identity, captured versions/hardware, aggregated runtime and resource metrics, and EMA-smoothed reward / episode-length curves. The EMA factor is configurable via --ema_alpha; --no_series drops the full per-iteration curves and keeps only the scalars.
- benchmark_skrl.py is new: a SKRL-framework counterpart that emits the same TrainingBundle with framework set to "skrl". Pairs with a small skrl_benchmark_trainer subclass that exposes per-iteration reward / episode-length values to the script without touching upstream skrl.

The legacy per-backend output format remains the default when --schema_v1_output is omitted, so existing CI and ad-hoc invocations keep working unchanged.

Shared helpers (_action_sampling.sample_random_actions to keep single-agent + multi-agent benchmark startup working, _schema_helpers to build Versions/Hardware from the recorder metadata and synthesise a fallback run_id) live alongside the scripts.

utils.parse_cprofile_stats now returns ncalls as a fourth tuple element so the schema's CProfileFunction.calls field can be populated.

Updated startup_whitelist.yaml to track the IsaacLab v3 configclass / cloner / scene-init call paths and explicitly fall through to top_n for python_imports and first_step (per file comments).

Added scripts/benchmarks/tests/ covering the new helpers and CLI surfaces, plus source/isaaclab/test/benchmark/test_parse_cprofile_stats.py for the ncalls extension. Added docs/source/features/benchmarking.rst documenting the scripts and the schema.
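The opt-in behaviour of the three flags named in this commit can be sketched with `argparse`. Only the flag names come from the commit message; the parser wiring, the help strings, the `0.05` default for `--ema_alpha`, and the `wants_schema_bundle` helper are assumptions for illustration and may not match the scripts.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser exposing the three schema-related flags."""
    parser = argparse.ArgumentParser(description="Benchmark CLI sketch (illustrative).")
    parser.add_argument(
        "--schema_v1_output",
        type=str,
        default=None,
        help="Path for the schema-v1 JSON bundle; legacy output is used when omitted.",
    )
    parser.add_argument(
        "--ema_alpha",
        type=float,
        default=0.05,  # assumed default
        help="EMA smoothing factor for reward / episode-length curves.",
    )
    parser.add_argument(
        "--no_series",
        action="store_true",
        help="Drop per-iteration curves and keep only the scalars.",
    )
    return parser


def wants_schema_bundle(args: argparse.Namespace) -> bool:
    """Hypothetical helper: the bundle is written only when a path was given."""
    return args.schema_v1_output is not None
```

Defaulting the output path to `None` is what keeps existing invocations on the legacy format: nothing changes unless the caller passes the flag.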
🤖 Isaac Lab Review Bot
Summary
This PR introduces a comprehensive v1.0 JSON schema (isaaclab.benchmark.schema) for benchmark bundles and wires the three standalone benchmark scripts (benchmark_startup.py, benchmark_rsl_rl.py, and the new benchmark_skrl.py) to emit self-contained JSON bundles via an opt-in --schema_v1_output flag. The implementation is well-structured with frozen dataclasses, proper helper modules, and extensive test coverage.
Findings
🔵 Suggestion — scripts/benchmarks/_action_sampling.py:47-52
The multi-agent action sampling creates numpy arrays via list comprehension and stacks, then converts to torch. For large num_envs × number of agents, this could be optimized by sampling directly into a pre-allocated tensor using torch.rand with the appropriate bounds, avoiding the numpy intermediate allocation.
🔵 Suggestion — scripts/benchmarks/benchmark_rsl_rl.py:269-276
The _compute_ema() function is duplicated verbatim in benchmark_skrl.py. Consider extracting this into _schema_helpers.py to keep these two training bundle emitters DRY.
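A shared helper along these lines could serve both scripts. This is only a sketch: it assumes the common recurrence `ema_t = alpha * x_t + (1 - alpha) * ema_{t-1}` seeded with the first sample, which may differ from what `_compute_ema()` in the PR actually does, and the name `compute_ema_sketch` is hypothetical.

```python
from typing import List, Sequence


def compute_ema_sketch(values: Sequence[float], alpha: float) -> List[float]:
    """Return the exponential moving average of values, one output per input.

    Assumed recurrence: the first output equals the first sample, and each
    later output is alpha * x + (1 - alpha) * previous_output.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # seed with the first sample
        else:
            smoothed.append(alpha * float(x) + (1.0 - alpha) * smoothed[-1])
    return smoothed
```

With a single definition, the RSL-RL and SKRL emitters cannot drift apart in how they smooth the same curves.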
🔵 Suggestion — scripts/benchmarks/skrl_benchmark_trainer.py:90-95
The episode length tracking falls back to 0.0 when no episodes have terminated. While documented, consider whether None or float("nan") would be more semantically correct for "no data available" vs. "actual episode length of zero".
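To make the trade-off concrete, here is a small sketch of the NaN-sentinel option. The helper names are hypothetical and unrelated to the trainer's real code. One caveat worth weighing: Python's `json.dumps` writes NaN as the bare token `NaN`, which is not valid JSON, so a bundle emitter would probably want to map it to `null`.

```python
import json
import math
from typing import Optional, Sequence


def mean_episode_length(lengths: Sequence[float]) -> float:
    """Return the mean length, or NaN when no episode has terminated yet."""
    if not lengths:
        return float("nan")
    return sum(lengths) / len(lengths)


def to_json_value(x: float) -> Optional[float]:
    """Map NaN to None so the serialised value becomes JSON null."""
    return None if math.isnan(x) else x


def dump_length(lengths: Sequence[float]) -> str:
    """Serialise the mean episode length as a one-key JSON object."""
    value = to_json_value(mean_episode_length(lengths))
    return json.dumps({"mean_episode_length": value})
```

With this shape, a consumer can distinguish "no data" (`null`) from a measured length of zero.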
🔵 Suggestion — source/isaaclab/isaaclab/benchmark/schema.py:249
write_bundle_file creates the parent directory via os.path.dirname(os.path.abspath(path)) or ".". A bare os.path.dirname would return "" for relative paths like "output.json"; because os.path.abspath runs first here, the result is already non-empty and the or "." fallback acts only as a safety net. Consider documenting this edge case.
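The edge case can be checked with the standard library alone. The sketch below compares a bare `os.path.dirname` against the `abspath`-first form quoted above; the two function names are hypothetical.

```python
import os


def parent_dir_plain(path: str) -> str:
    """Parent directory without abspath; needs the fallback for bare filenames."""
    return os.path.dirname(path) or "."


def parent_dir_abs(path: str) -> str:
    """Parent directory with abspath applied first; always an absolute path."""
    return os.path.dirname(os.path.abspath(path))
```

For a bare filename such as `"output.json"`, `os.path.dirname` alone yields an empty string, while the `abspath`-first form resolves against the current working directory.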
🟡 Warning — scripts/benchmarks/benchmark_rsl_rl.py:595-598
The first_step_s proxy uses the first iteration's collection + learning time from rl_training_times. If rl_training_times has fewer than expected entries (e.g., early termination), the contextlib.suppress(IndexError, KeyError, ValueError) silently falls back to 0.0. This is safe but may mask real issues; consider logging when this fallback is hit.
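One way to keep the safe fallback while making it visible is to catch the same exceptions explicitly and log before returning 0.0. This is a sketch under assumptions: the `"collection"` and `"learning"` keys, the function name, and the logger name are hypothetical and do not reflect the actual layout of `rl_training_times`.

```python
import logging
from typing import Mapping, Sequence

logger = logging.getLogger("benchmark_sketch")


def first_step_seconds(times: Mapping[str, Sequence[float]]) -> float:
    """Sum the first iteration's collection and learning time; log on fallback."""
    try:
        return float(times["collection"][0]) + float(times["learning"][0])
    except (IndexError, KeyError, ValueError) as exc:
        # Same exception set as the suppress() call, but the fallback is recorded.
        logger.warning(
            "first_step_s unavailable (%s: %s); falling back to 0.0",
            type(exc).__name__,
            exc,
        )
        return 0.0
```

The returned value is unchanged from the silent version, so downstream consumers are unaffected; only the log gains a trace of the fallback.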
Test Coverage
Excellent test coverage. The PR includes:
- Unit tests for `sample_random_actions` covering single-agent, multi-agent, heterogeneous action dims, and device placement
- CLI surface tests for both RSL-RL and SKRL scripts (parse-only, no Isaac Sim)
- `BenchmarkTrainer` unit tests with fake env/agent covering timing, reward tracking, multi-env vs single-env reset behavior
- `parse_cprofile_stats` tests validating the new ncalls 4-tuple return
- Recorder tests for peak memory/utilization tracking
Verdict
Minor suggestions only — This is a well-designed, non-breaking feature addition with proper opt-in behavior. The schema design is clean (frozen dataclasses, clear separation of concerns), the backward compatibility is maintained (legacy output when --schema_v1_output is omitted), and the test coverage is thorough. Ready to merge once CI passes.
Description
Wire the three standalone benchmark scripts under `scripts/benchmarks/` to emit self-contained JSON bundles conforming to the v1.0 schema added in #5840 (`isaaclab.benchmark.schema`).

What's in here

Scripts (opt-in via new `--schema_v1_output <path>` flag):
- `benchmark_startup.py` writes a `StartupBundle` with per-phase cProfile top-N data and total durations.
- `benchmark_rsl_rl.py` writes a `TrainingBundle` with run identity, captured `Versions` / `Hardware`, aggregated `Runtime` + `Resources`, and EMA-smoothed `Learning` curves. EMA factor is configurable via `--ema_alpha` (default `0.05`); `--no_series` drops per-iteration curves and keeps only the `final_raw` + `final_ema` scalars.
- `benchmark_skrl.py` is new: the SKRL-framework counterpart that emits the same `TrainingBundle` with `framework: "skrl"`. Pairs with a small `skrl_benchmark_trainer.PerIterRewardTrainer` subclass that exposes per-iteration reward and episode-length values to the script without patching upstream skrl.

The legacy per-backend output format remains the default when `--schema_v1_output` is omitted, so existing invocations and CI keep working unchanged.

Shared helpers:
- `scripts/benchmarks/_action_sampling.py`: single-agent + multi-agent action sampling for the benchmark's first-step phase. Multi-agent envs expose `action_spaces` (dict); single-agent envs expose `single_action_space`. The helper picks the right shape.
- `scripts/benchmarks/_schema_helpers.py`: builds `Versions` / `Hardware` from the recorder metadata and synthesises a fallback `run_id` of the form `<framework>_<backend>_<task>_<YYYYMMDD-HHMMSS>_seed<seed>`.

Other changes:
- `scripts/benchmarks/utils.parse_cprofile_stats` now returns a 4-tuple `(function_label, tottime_ms, cumtime_ms, ncalls)` instead of a 3-tuple, exposing the primitive call count from `pstats` so the schema's `CProfileFunction.calls` field can be populated. Whitelist placeholder rows carry `ncalls=0`.
- `scripts/benchmarks/startup_whitelist.yaml` reworked to track the IsaacLab v3 configclass / cloner / scene-init call paths. Adds an explicit `task_config` phase entry; `python_imports` and `first_step` intentionally fall through to top_n (documented in file comments).

Tests:
- `scripts/benchmarks/tests/` covering: action-sampling shape for single-agent and multi-agent envs; CLI surface tests for `benchmark_rsl_rl.py` and `benchmark_skrl.py` (parse-only, no Isaac Sim launch); `skrl_benchmark_trainer` reward/ep-length collection.
- `source/isaaclab/test/benchmark/test_parse_cprofile_stats.py` for the ncalls extension to `utils.parse_cprofile_stats`.

Docs:
- `docs/source/features/benchmarking.rst`: invocation examples per script, v1.0 schema summary, and CLI-flag reference. Wired into the Features TOC in `docs/index.rst`.

Compatibility
- Opt-in via the `--schema_v1_output` flag. Pre-existing CI and ad-hoc invocations are byte-identical without it.
- `parse_cprofile_stats` is a private helper (`scripts/benchmarks/utils.py`) and only used by the benchmark scripts themselves, so the 3→4-tuple change has no external callers.

Fixes # (no issue)
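For readers unfamiliar with where a call count comes from in `pstats`, the 3→4-tuple change mentioned under Compatibility can be sketched as follows. This is not the code in `scripts/benchmarks/utils.py`: the function name `top_functions`, the label format, and the sort order are assumptions. What it relies on is the documented layout of `pstats.Stats.stats`, whose values are `(cc, nc, tt, ct, callers)` with `cc` being the primitive call count.

```python
import cProfile
import pstats
from typing import List, Tuple


def top_functions(
    profile: cProfile.Profile, top_n: int = 5
) -> List[Tuple[str, float, float, int]]:
    """Return (label, tottime_ms, cumtime_ms, ncalls) rows sorted by cumtime."""
    stats = pstats.Stats(profile)
    rows = []
    # stats.stats maps (filename, lineno, funcname) -> (cc, nc, tt, ct, callers)
    for (filename, lineno, funcname), (cc, _nc, tt, ct, _callers) in stats.stats.items():
        label = f"{filename}:{lineno}({funcname})"  # assumed label format
        rows.append((label, tt * 1000.0, ct * 1000.0, cc))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top_n]


def _work(n: int) -> int:
    """Small workload so the profile has something to record."""
    return sum(i * i for i in range(n))


def demo() -> List[Tuple[str, float, float, int]]:
    """Profile three calls to _work and return the parsed rows."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(3):
        _work(100)
    profiler.disable()
    return top_functions(profiler, top_n=50)
```

Any caller that unpacks exactly three elements from such rows would break on the fourth, which is why the private-helper scoping noted above matters.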
Type of change
Screenshots
N/A — JSON-emitter additions.
Checklist