
Fix: Coarsen MuJoCo timestep on CI to stop slower-than-realtime flakes #615

Open
JWhitleyWork wants to merge 3 commits into main from fix/ci-mujoco-timestep
Conversation

@JWhitleyWork
Member

Summary

  • Pin the integration-test reusable workflow to the moveit_pro_ci branch that adds the new mujoco_ci_timestep input (companion PR: PickNikRobotics/moveit_pro_ci#18).
  • Pass mujoco_ci_timestep: "0.004" so CI runs the lab_sim scene at 250 Hz instead of MuJoCo's 500 Hz default, doubling the wall-clock budget per step.
  • Local dev runs the scene unmodified — the patch only happens inside the CI job.
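As a quick check of the arithmetic in the second bullet, the sketch below converts a timestep into a step rate and into the wall-clock time a step may take at realtime. The function names are illustrative only and are not part of either workflow.

```python
def step_rate_hz(timestep_s: float) -> float:
    """Physics steps per simulated second for a given timestep."""
    return 1.0 / timestep_s


def realtime_budget_ms(timestep_s: float) -> float:
    """Wall-clock milliseconds one step may take while still keeping up with realtime."""
    return timestep_s * 1000.0


default_dt = 0.002  # MuJoCo's default timestep (500 Hz)
ci_dt = 0.004       # value passed as mujoco_ci_timestep

print("step rates (Hz):", step_rate_hz(default_dt), step_rate_hz(ci_dt))
print("budget ratio:", realtime_budget_ms(ci_dt) / realtime_budget_ms(default_dt))
```

Halving the step rate doubles the per-step budget, which is the whole effect the override relies on.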

Why

The MuJoCo 3.2.7 → 3.6.0 upgrade in moveit_pro (6eedef88a5, Apr 14) made the constraint solver heavier per step. Within 24h, main CI went from 100% green to flaky and to ~92% red within three days:

| Period | Pass | Fail |
| --- | --- | --- |
| Apr 8–14 (pre-upgrade) | 24 | 0 |
| Apr 15 | 4 | 1 |
| Apr 17 | 1 | 7 |
| Apr 20–30 | ~3 | ~37 |
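The red rates implied by these counts can be recomputed directly. This is a small sketch using the table's numbers as given; the Apr 20–30 counts are approximate in the source, so the last rate is approximate too.

```python
# (passed, failed) per period, copied from the table above.
runs = {
    "Apr 8-14 (pre-upgrade)": (24, 0),
    "Apr 15": (4, 1),
    "Apr 17": (1, 7),
    "Apr 20-30": (3, 37),  # approximate counts
}


def failure_rate(passed: int, failed: int) -> float:
    """Fraction of runs in a period that failed."""
    return failed / (passed + failed)


for period, (passed, failed) in runs.items():
    print(f"{period}: {failure_rate(passed, failed):.1%} red")
```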

Failure logs always include the warning `Mujoco model timestep not running in realtime. Increase the model timestep.`, and the timing-sensitive failures follow from it: the MoveGripperAction 15s timeout in Push Button With a Trajectory (~9/10 runs), the GetImage 5s wrist-camera timeout in ML Segment Point Cloud (~4/10), and various MPC pose-tracking variants. Several mitigations have already been merged (memory="64M" arena fix, MPC retunes, tolerance loosening, publisher timeout fixes); none addressed the underlying realtime gap.
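To make the realtime gap concrete, here is a minimal sketch of the relationship between timestep, per-step compute cost, and realtime factor. The 3 ms step cost is a hypothetical number chosen for illustration, not a measurement from CI, and the sketch assumes the cost per step does not depend on the timestep.

```python
def realtime_factor(timestep_s: float, wall_s_per_step: float) -> float:
    """Simulated seconds advanced per wall-clock second (>= 1.0 keeps up with realtime)."""
    return timestep_s / wall_s_per_step


def keeps_realtime(timestep_s: float, wall_s_per_step: float) -> bool:
    """True when a step finishes within the time it is supposed to simulate."""
    return realtime_factor(timestep_s, wall_s_per_step) >= 1.0


# Hypothetical solver cost per step on a CI runner; not a measured value.
assumed_cost_s = 0.003

for dt in (0.002, 0.004):
    factor = realtime_factor(dt, assumed_cost_s)
    print(f"timestep={dt}: factor={factor:.2f}, keeps up={keeps_realtime(dt, assumed_cost_s)}")
```

Under that assumed cost, a 0.002 s timestep advances only about two thirds of a simulated second per wall-clock second, so a wall-clock timeout such as the 15s gripper limit covers correspondingly less simulated motion.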

This PR fixes the root cause for CI specifically by coarsening the MuJoCo timestep, which gives the heavier 3.6.0 solver enough wall-clock budget per step. It does not change the experience on dev machines, where the simulator generally runs faster than realtime and the warning is only diagnostic.

Why CI-only

Bumping the timestep in the scene file would affect local dev too. With integrator="implicitfast" and impratio="10" the scene is well within MuJoCo's stability envelope at 0.004s, but contact-stability for tight grasps on small objects is a real concern that warrants a separate validation pass. Doing this CI-only is the cheapest, lowest-risk route to a green main; we can revisit a global bump (or, longer-term, the test-harness rethink Shaur called out in #610) as a follow-up.

Test plan

  • Trigger CI on this branch and confirm integration-test-in-studio-container passes.
  • Re-run several times (at least 5) to confirm the historical flake rate drops materially.
  • Verify the Override MuJoCo timestep for CI step's log shows the expected scene files were patched (lab_sim/description/scene.xml, etc.).
  • After moveit_pro_ci#18 merges and a new tag is cut, swap the SHA pin for that tagged release.
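The override step itself lives in moveit_pro_ci and is not shown in this PR. As a rough sketch of what patching a scene file could involve, the example below assumes the timestep is set on the MJCF `<option>` element; `override_timestep` is a hypothetical helper and the toy scene is not the contents of lab_sim's scene.xml.

```python
import xml.etree.ElementTree as ET


def override_timestep(mjcf_xml: str, timestep: str) -> str:
    """Return MJCF text with the timestep of every <option> element replaced.

    Hypothetical helper: the real CI step may patch the files differently.
    """
    root = ET.fromstring(mjcf_xml)
    options = list(root.iter("option"))
    if not options:
        # No <option> element present: add one so the override still applies.
        options = [ET.SubElement(root, "option")]
    for option in options:
        option.set("timestep", timestep)
    return ET.tostring(root, encoding="unicode")


# Toy scene for illustration only.
scene = '<mujoco><option timestep="0.002" integrator="implicitfast" impratio="10"/></mujoco>'
patched = override_timestep(scene, "0.004")
print(patched)
```

Note that `xml.etree.ElementTree` drops XML comments when parsing by default, so a text-level substitution may be preferable if the scene files need to stay otherwise byte-identical.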

🤖 Generated with Claude Code

@JWhitleyWork JWhitleyWork added this to the 9.3.0 milestone May 7, 2026
@JWhitleyWork JWhitleyWork force-pushed the fix/ci-mujoco-timestep branch from ee5d05a to 7611d04 Compare May 8, 2026 19:58
@JWhitleyWork JWhitleyWork requested review from Copilot and shaur-k May 8, 2026 19:58
@JWhitleyWork JWhitleyWork self-assigned this May 8, 2026
@JWhitleyWork JWhitleyWork marked this pull request as ready for review May 8, 2026 19:58

Copilot AI left a comment


Pull request overview

Updates the repository CI workflow to reduce MuJoCo integration-test flakiness by overriding the simulator timestep only in CI, giving the heavier MuJoCo 3.6.0 solver more wall-clock budget per step while keeping local development behavior unchanged.

Changes:

  • Pin the reusable workspace_integration_test.yaml workflow to a newer moveit_pro_ci commit that supports the new mujoco_ci_timestep input.
  • Pass mujoco_ci_timestep: "0.004" to run the CI lab simulation at 250 Hz instead of the default 500 Hz.


…me flakes

The MuJoCo 3.2.7 -> 3.6.0 upgrade in moveit_pro (6eedef88a5) made the
constraint solver heavier per step, so the lab_sim scene runs slower
than realtime on CI runners. That surfaces as `Mujoco model timestep
not running in realtime` warnings and timing-related test failures
(MoveGripperAction 15s timeouts, GetImage 5s wrist-camera timeouts).
CI on main has been ~92% red since Apr 17 as a result, and the in-tree
mitigations applied so far (constraint-arena memory, MPC retunes,
push-button tolerance, publisher timeout fixes) did not address the
underlying realtime gap.

Pin to the moveit_pro_ci branch that adds the new `mujoco_ci_timestep`
input (PR PickNikRobotics/moveit_pro_ci#18) and pass "0.004" -- 250 Hz,
~2x the wall-clock budget per step versus the MuJoCo default of 500 Hz.
This only takes effect on CI; local dev runs the scene unmodified.
After moveit_pro_ci tags a release containing this input, swap the
SHA pin for that tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JWhitleyWork JWhitleyWork force-pushed the fix/ci-mujoco-timestep branch from 7611d04 to c70604d Compare May 8, 2026 20:07
@JWhitleyWork JWhitleyWork enabled auto-merge May 8, 2026 20:07
shaur-k previously approved these changes May 8, 2026
JWhitleyWork and others added 2 commits May 8, 2026 15:05
The objective integration test runs ~117 parametrized objectives
against a single shared backend and MuJoCo simulation. Pick/place,
push-button, and similar objectives leave residual world state that
caused order-dependent failures after the MuJoCo 3.6.0 upgrade.

Re-export reset_simulation_before_test from moveit_pro_test_utils so
pytest activates the autouse reset fixture for this test module.
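The mechanism can be illustrated with a self-contained sketch. `FakeSimulation` is a stand-in for the shared backend, and the fixture body is an assumption, since the real `reset_simulation_before_test` in moveit_pro_test_utils is not shown here.

```python
import pytest


class FakeSimulation:
    """Stand-in for the shared MuJoCo simulation; not the real backend."""

    def __init__(self):
        self.moved_objects = []

    def reset(self):
        self.moved_objects.clear()


SIM = FakeSimulation()  # one instance shared by every test in the module


# Assumed fixture body for illustration; in the PR this name is imported
# from moveit_pro_test_utils rather than defined locally.
@pytest.fixture(autouse=True)
def reset_simulation_before_test():
    SIM.reset()  # runs before every test in a module where the fixture is visible
    yield


def test_pick_moves_block():
    SIM.moved_objects.append("block")  # leaves residual world state behind
    assert SIM.moved_objects == ["block"]


def test_world_starts_clean():
    # Passes in any order because the autouse fixture reset the world first.
    assert SIM.moved_objects == []
```

pytest only applies an autouse fixture where it can find it, for example in a conftest.py or in the test module's own namespace, which is why importing the fixture name into the test module is enough to activate it there.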

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>


3 participants