Standardize MuJoCo timestep and increase camera resolution across robot configs#648
Standardize MuJoCo timestep and increase camera resolution across robot configs#648davetcoleman wants to merge 3 commits into
Conversation
|
Consider whether the change should land upstream in Overlapping files
|
|
Before talking to @JWhitleyWork Claude suggested it implement the "lower camera resolution" change for CI, so that is in here for now. LMK when our servers support GPUs instead! |
53a2cbd to
2e1e693
Compare
2e1e693 to
ce98d97
Compare
Sets `<option timestep="0.003">` in every MuJoCo-using scene file in the workspace, matching the value already in use in qualcomm_sim. Previous values were inconsistent: - 0.025 s (hangar_sim, 40 Hz) - 12.5x larger than default - 0.01 s (factory_sim, 100 Hz) - 0.005 s (dual_arm_sim, kitchen_sim, 200 Hz) - 0.002 s (everything else, via MuJoCo default, 500 Hz) Standardizing on 0.003 s (333 Hz) trades a small wall-time slowdown on the previously-default-0.002 configs for a large stability improvement on hangar_sim and factory_sim, where stiff arm contacts were running at a step well outside the typical safe range. See PickNikRobotics/moveit_pro#19223 for the question filed to Griz/Breelyn about hangar_sim's prior 0.025 s value - that issue may surface a reason to revisit hangar_sim specifically. phoebe_sim (submodule) is intentionally not included in this PR; it needs a separate PR in PickNikRobotics/phoebe_ws. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…obot configs
Sets `resolution="1280 720"` on every <camera> in the workspace's MJCF scene
files, and ensures every top-level scene's <global offwidth/offheight> is at
least 1280x720 so the offscreen render buffer fits the largest camera.
Previously inconsistent:
- 1280x720 (lunar_sim, hangar_sim, picknik_accessories/ur5e wrist_camera)
- 640x480 (lab_sim, kitchen_sim, grinding_sim, april_tag_sim, dual_arm_sim,
factory_sim, space_satellite_sim, kinova_sim and variants)
This was discovered when `lab_sim` failed to start with:
[ros2_control_node] [ERROR] The published robot description file (urdf)
seems not to be genuine. ... Camera resolution mismatch, set:
<global offwidth="1280" offheight="720"/> in the MJCF model.
The wrist_camera in picknik_accessories/mujoco_assets/ur5e/ur5e.xml was
already at 1280x720, but lab_sim's scene declared the global offscreen
buffer at only 640x480, so the hardware plugin rejected the URDF.
Standardizing to 1280x720 everywhere keeps the included ur5e wrist_camera
working in every UR-based config and avoids the same trap silently waiting
in the others.
space_satellite_sim_camera_cal kept at 1920x1080 (intentional higher
resolution for camera-intrinsics calibration data).
phoebe_sim (submodule) is not included; it needs a separate PR in
PickNikRobotics/phoebe_ws (and was already at 1280x720 anyway).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce98d97 to
11d1536
Compare
|
Moving to "ready" and we'll see how CI handles it. |
cf36a51 to
79d0b94
Compare
…ive test progress Two CI-infra changes folded together: 1. Bump the reusable workflow ref to PickNikRobotics/moveit_pro_ci v0.3.1, set enable_gpu: true, and switch the runner label from picknik-16-amd64 to picknik-16-amd64-gpu. v0.3.1 appends the CUDA image suffix when enable_gpu is true (moveit_pro_ci#26) -- without it, v0.3.0 set the label but kept the non-CUDA image, so MuJoCo's EGL rendering still ran through llvmpipe on CPU. 2. Add src/lab_sim/test/conftest.py with two pytest hooks (logstart, logreport) that write directly to fd 2, bypassing pytest's --capture=fd. Without this, a CTest timeout kills pytest before any per-test output is flushed, leaving the CI log silent past 'collected N items'. The hooks emit START / PASSED|FAILED|SKIPPED + duration for every test, so the next timeout (if any) names the offending objective.
79d0b94 to
d8aebf3
Compare
|
Update — the new conftest paid off: CI on Every parametrized invocation of
So this is a sim or |
|
Bisection result — correcting the prior diagnosis. Opened #655 (just commit 3 from this PR, no scene changes) to discriminate scene-vs-image. #655's CI passes the gripper-wait fine: 34 PASSED, 82 SKIPPED, only 2 failed ( This inverts my earlier claim of an image regression. The gripper failure on this PR isn't from Pinged Josh to ask whether any of those specifically would block |
ad934cd to
d8aebf3
Compare
|
Blocked by #655 |
This PR standardizes two MuJoCo settings that were inconsistent across the workspace's robot configs, plus a CI-infra commit that the second commit's resolution bump made necessary (GPU runner + live test progress). Three commits, kept separate intentionally (please do not squash).
Commit 1 —
timestep="0.003"Problem
MuJoCo
<option timestep>was inconsistent across robot configs in this workspace:hangar_simfactory_simdual_arm_sim,kitchen_simlab_sim,april_tag_sim,grinding_sim,lunar_sim,kinova_sim,space_satellite_sim,space_satellite_sim_camera_calhangar_sim's 25 ms step in particular is well outside the range stiff arm contacts can normally tolerate, and means the 600 Hz ros2_control loop is reading stale physics state for ~15 ticks at a time.Alternatives considered
hangar_simis a ~12.5× slowdown — likely intolerable for the mobile-base + Nav2 + lidar load.qualcomm_sim. Trades a ~1.5× speedup on the 8 default-0.002 configs for a ~1.67× / ~3.3× / ~8.3× slowdown on the 4 outliers, in exchange for a uniform value that's stable for all of them.Chosen approach
<option timestep="0.003" />set in every MuJoCo-using scene file in the workspace (11 files).Three edit patterns:
timestep="..."(4 files):hangar_sim,factory_sim,dual_arm_sim,kitchen_sim.timestep="0.003"attribute to an existing<option>(4 files):lab_sim,april_tag_sim,space_satellite_sim,space_satellite_sim_camera_cal.<option timestep="0.003" />(3 files):grinding_sim,lunar_sim,kinova_sim.Related
moveit_pro#19223 — asked Griz / Breelyn about the reasoning behind
hangar_sim's prior 0.025 s value. If they come back with a stability/performance reason specific to mecanum + Nav2 that 0.003 won't satisfy, we may need to revisithangar_simalone.Commit 2 —
resolution="1280 720"Problem
MuJoCo camera resolutions and offscreen buffer sizes were inconsistent across robot configs:
lunar_sim,hangar_sim,picknik_accessories/ur5e wrist_cameraspace_satellite_sim_camera_calpicknik_mujoco_rosrequires<visual><global offwidth offheight>to be ≥ the largest camera in the scene, so any config that includes a 1280×720 camera (e.g. viapicknik_accessories/ur5e) but declares a smaller global currently fails to load with aCamera resolution mismatchURDF rejection.Chosen approach
<camera resolution="640 480">in the workspace's MJCF files bumped to1280 720(13 files, 16 camera tags).<visual><global offwidth offheight></visual>ensured to be ≥ 1280×720 — bumpedlab_sim's existing 640×480 global, and addedoffwidth/offheightattributes to the existing<global>element in 7 other scenes that previously had a global with onlyazimuth/elevation.<visual><global offwidth="1280" offheight="720"/></visual>block toapril_tag_sim, which had no<visual>block at all.space_satellite_sim_camera_calleft at 1920×1080.Skipped
phoebe_sim(submoduleexternal_dependencies/phoebe_ws) is intentionally not included in either commit — it needs a separate PR inPickNikRobotics/phoebe_ws. (It was already at 1280×720 for cameras anyway.)Commit 3 —
enable_gpu: trueon apicknik-16-amd64-gpurunner, plus live per-objective test progressProblem
After commit 2,
lab_sim'sobjectives_integration_teststarted timing out on CI at 600 s (Exit Code: Timeout,Execution Time: 600.058 s), both on humble and jazzy. Baseline runtime onmainis ~388 s. The ~3× per-camera pixel increase (and the addition of the offscreen buffer growth) pushes the perception-heavy objective sweep past the existing timeout budget onpicknik-16-amd64runners.Alternatives considered
mujoco_ci_camera_resolutioninput. Original plan — attempted in PickNikRobotics/moveit_pro_ci#23, closed without merging. Would have kept CPU runners but introduced a CI-vs-product scene divergence.Chosen approach
PickNikRobotics/moveit_pro_ci@v0.3.1, which appends the CUDA image suffix whenenable_gpuis true (moveit_pro_ci#26). Without that fix,v0.3.0set the runner label correctly but the container kept the non-CUDA image, so MuJoCo's EGL rendering still went through llvmpipe on CPU — i.e. the GPU was on the bare metal but not visible to the test process. That was the suspected cause of the first GPU-runner run still timing out at 600 s.enable_gpu: trueand switch the runner label frompicknik-16-amd64topicknik-16-amd64-gpu.Test-diagnostics addition:
src/lab_sim/test/conftest.pyThe first GPU-runner CI run timed out and the log showed pytest's
collected 118 itemsheader and then nothing — no per-test progress, no indication of which objective was running when CTest killed the process. Root cause: pytest's default--capture=fdredirects fds 1 and 2 and only flushes at end-of-run, which doesn't happen on timeout.Folded in here because it's CI-test infra. Two pytest hooks (
pytest_runtest_logstart,pytest_runtest_logreport) write directly to fd 2 viaos.write(2, ...), bypassing the capture. Every test emitsSTART <nodeid>on entry andPASSED|FAILED|SKIPPED <nodeid> (<elapsed>s)on completion, plus the short failure reason on failure. Diagnostics-only — no production code touched.Caveat — fallback if v0.3.1 isn't enough
The bet now is that v0.3.1's CUDA image actually plumbs the GPU through to the container and MuJoCo's EGL uses the GPU's hardware path.
mj_stepitself is still CPU/solver work — the GPU only helps the render path. Iflab_sim's test still times out, the new conftest will name the slow objective (and the timing distribution of those that completed), letting us decide between bumping the test timeout, reviving themujoco_ci_camera_resolutiondirection (re-open #23, merge, cut a tag, repin), or fixing a specific slow objective.Manual verification
hangar_sim— the biggest timestep delta (0.025 → 0.003, ~8× more sim work per second). Verify mobile-base locomotion + Nav2 still keeps up.lab_simandkinova_sim/rafti_fingers— contact-heavy scenes most likely to surface stability regressions from the 0.002 → 0.003 step increase.ros2_control_nodeno longer logsCamera resolution mismatch.lab_sim'sobjectives_integration_testcompletes under 600 s onpicknik-16-amd64-gpufor both humble and jazzy. If it still times out, the GPU-runner bet didn't pan out and we need to revisit Commit 3's strategy.Claude agent checks
Autofilled by Claude when it opens or updates this PR. Each agent that was actually run gets ticked; agents that don't apply are ticked with
SKIPPEDright after the checkbox plus a brief reason. Humans rarely need to touch these. See.claude/rules/agent-delegation.mdfor when each agent is required.code-reviewer(required) — 3 findings (1 P0, 2 P1) addressed in fixup commitstest-runner(required) — no compiled code in diff (XML configs + CI yaml only)platform-architect-bot(required for backend / C++ / ROS / REST / build-CI changes) — 7 findings; applied 3 (kitchen_sim offwidth, 2 kinova_sim timestep); applied 2 P2 comment fixes (lab_sim, ci.yaml); deferred 2 (hangar_sim wall-clock already in manual-verify list, franka3.xml pre-existing pattern)frontend-noah-bot(required whensrc/web/frontend/**changes) — no src/web/frontend/** files changedsecurity-auditor(required for security-sensitive paths) — no security-sensitive paths changedtest-writer(optional)unit-test-reviewer(optional)