
[codex] Add structured SWE rollout failure observability#1283

Draft
rasdani wants to merge 2 commits into main from codex/swe-rebench-rollout-observability



rasdani commented on May 4, 2026

Summary

  • Add a serialized failure rollout payload with stable reason, origin, error type metadata, message, and bounded diagnostic logs.
  • Serialize failure from state_to_output() while preserving the existing error, error_chain, and long_error_chain fields.
  • Redefine CLI-agent monitor metrics so agent_error means failure.origin == "agent", while agent_nonzero_exit preserves the old narrow non-zero-exit meaning and agent_poll_failed tracks background polling/read failures.
  • Classify sandbox, tunnel, model, and rollout-timeout failures away from agent_error; keep sandbox_oom, sandbox_timeout, and agent_timeout metrics.
  • Collect best-effort failure diagnostics before sandbox deletion: agent stdout/stderr tails, harness/agent log file tails, /tmp/vf_observed_command.log, and /tmp/install_progress.log.
  • Wrap composable agent install/post-install shell commands with grep-friendly observed-command traces and failure lines without masking the original setup failure.
  • Keep rollout logs grep-friendly with event=... key=value messages for aborts, finishes, empty trajectories, setup failures, and agent polling failures.
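As a rough illustration of the first bullet, a failure payload with bounded logs might look like the sketch below. This is not the code in `verifiers/utils/failure_utils.py`: `make_failure_payload`, `bound_log`, `MAX_LOG_CHARS`, the `error_type` key, and the example reason string are all hypothetical names chosen for this example.

```python
import json
from typing import Mapping, Optional

# Hypothetical cap on each diagnostic log; the real limit may differ.
MAX_LOG_CHARS = 12000


def bound_log(text: str, limit: int = MAX_LOG_CHARS) -> str:
    """Keep only the last `limit` characters so the payload stays bounded."""
    return text if len(text) <= limit else text[-limit:]


def make_failure_payload(
    reason: str,
    origin: str,
    exc: BaseException,
    logs: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build a JSON-serializable failure record (illustrative shape only)."""
    return {
        "reason": reason,                  # stable, machine-readable reason
        "origin": origin,                  # e.g. "agent", "sandbox", "model"
        "error_type": type(exc).__name__,  # error type metadata
        "message": str(exc),
        "logs": {str(k): bound_log(v) for k, v in (logs or {}).items()},
    }


payload = make_failure_payload(
    reason="example_reason",
    origin="sandbox",
    exc=MemoryError("container killed"),
    logs={"agent_stderr": "x" * 20000},
)
serialized = json.dumps(payload)  # plain dicts and strings serialize cleanly
```

Keeping the payload to plain dicts and strings is what lets it sit alongside the existing error, error_chain, and long_error_chain fields in serialized output.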

Companion scheduler metrics PR: PrimeIntellect-ai/prime-rl#2411.

Compatibility Note

  • agent_error is intentionally broadened to all agent-origin failures.
  • agent_nonzero_exit preserves the previous narrow non-zero process exit signal.
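To make the distinction concrete, here is a minimal sketch of how the two metrics could be derived. It assumes `agent_nonzero_exit` is computed from the agent process exit code, which this description implies but does not spell out; `monitor_metrics` is a hypothetical helper, not the function in `cli_agent_env.py`.

```python
from typing import Optional


def monitor_metrics(failure: Optional[dict], exit_code: Optional[int]) -> dict:
    """Derive the two agent metrics from a failure payload and an exit code."""
    origin = failure.get("origin") if failure else None
    return {
        # Broad signal: any failure attributed to the agent.
        "agent_error": float(origin == "agent"),
        # Narrow signal (assumed): the agent process exited with a non-zero code.
        "agent_nonzero_exit": float(exit_code is not None and exit_code != 0),
    }


crashed = monitor_metrics({"origin": "agent"}, exit_code=2)
poll_failed = monitor_metrics({"origin": "agent"}, exit_code=None)
model_failed = monitor_metrics({"origin": "model"}, exit_code=None)
clean = monitor_metrics(None, exit_code=0)
```

Under these assumptions, dashboards that relied on the old meaning of agent_error would switch to agent_nonzero_exit.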

Validation

  • uv run ruff check verifiers/utils/failure_utils.py verifiers/envs/experimental/cli_agent_env.py verifiers/envs/experimental/composable/composable_env.py verifiers/envs/experimental/composable/harnesses/opencode.py verifiers/envs/experimental/opencode_env.py verifiers/utils/save_utils.py verifiers/types.py verifiers/envs/environment.py tests/test_cli_agent_env.py tests/test_swe_rollout_observability.py
  • python -m py_compile verifiers/utils/failure_utils.py verifiers/envs/experimental/cli_agent_env.py verifiers/envs/experimental/composable/composable_env.py verifiers/envs/experimental/opencode_env.py verifiers/utils/save_utils.py verifiers/types.py verifiers/envs/environment.py
  • uv run pytest tests/test_cli_agent_env.py tests/test_swe_rollout_observability.py tests/test_environment_extra.py::test_state_to_output_uses_state_usage_not_trajectory tests/test_save_utils.py

rasdani added a commit that referenced this pull request May 5, 2026
…ut/stderr tails

When a rollout completes cleanly (no error, not timed_out) but produced
zero LLM turns, the orchestrator discards it as an empty trajectory and
reschedules — but the orch log only records "Empty trajectory in group X",
without instance_id or any agent output. Bad SWE rows that hang the agent
early are invisible at the orch level.

Emit a WARNING-level rollout_empty_trajectory event from the env-side
log_rollout_finished cleanup hook with stop, exit_code, durations, and
truncated stdout/stderr from the agent process. Combined with PR #1283's
existing instance_id field, this makes the offending dataset row
identifiable directly from the env-server log.
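For illustration, an `event=... key=value` line of the kind described above could be assembled as follows. `format_event`, `MAX_TAIL_CHARS`, the field names `stdout_tail` and `stop`, and the sample values are assumptions made for this sketch; the actual hook may name and truncate fields differently.

```python
import json

# Hypothetical cap on how much agent output is embedded in one log line.
MAX_TAIL_CHARS = 2000


def format_event(event: str, **fields: object) -> str:
    """Render a single-line, grep-friendly `event=... key=value` message."""
    parts = [f"event={event}"]
    for key, value in fields.items():
        text = str(value)
        # Quote empty values and values with whitespace or quotes so each
        # pair stays on one line and remains a single searchable token.
        if text == "" or '"' in text or any(ch.isspace() for ch in text):
            text = json.dumps(text)
        parts.append(f"{key}={text}")
    return " ".join(parts)


agent_stdout = "waiting for input\n"
line = format_event(
    "rollout_empty_trajectory",
    instance_id="example__repo-1",
    stop="agent_completed",
    exit_code=0,
    stdout_tail=agent_stdout[-MAX_TAIL_CHARS:],
)
```

A line built this way can be passed to a logger at WARNING level and later filtered with something like `grep 'event=rollout_empty_trajectory'`.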
rasdani changed the title from "[codex] Add SWE rollout lifecycle observability" to "[codex] Add structured SWE rollout failure observability" on May 8, 2026
rasdani marked this pull request as draft on May 8, 2026 22:47

cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit ed668b8.

if failure is None:
    return None
failure["logs"].update({str(k): tail_text(v) for k, v in logs.items()})
state["failure"] = failure

Redundant tail_text re-application corrupts truncation prefix

Low Severity

add_failure_logs passes logs to ensure_rollout_failure(state, logs=logs), which already applies tail_text to every log value internally (via make_rollout_failure or the existing-failure merge path). Then on line 234, it redundantly calls failure["logs"].update({str(k): tail_text(v) for k, v in logs.items()}), applying tail_text a second time to the same original values. Since callers like collect_failure_diagnostics already tail_text the values before passing them in, this is actually a triple application. For logs originally exceeding DEFAULT_FAILURE_LOG_CHARS (12000), the first tail_text produces a ~12025-char string with a "...<truncated N chars>\n" prefix; subsequent applications re-truncate this, replacing the accurate prefix with a misleading one (e.g., "...<truncated 25 chars>" instead of "...<truncated 1000 chars>").
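The effect can be reproduced with a small stand-in for `tail_text`. The version below is an assumption reconstructed from the description above (keep the last N characters and prepend a "...<truncated N chars>" line); it is not copied from the PR, and the exact counts it produces may differ from those quoted in the report.

```python
DEFAULT_FAILURE_LOG_CHARS = 12000


def tail_text(text: str, limit: int = DEFAULT_FAILURE_LOG_CHARS) -> str:
    """Assumed behavior: keep the tail and note how many chars were dropped."""
    if len(text) <= limit:
        return text
    dropped = len(text) - limit
    return f"...<truncated {dropped} chars>\n" + text[-limit:]


original = "x" * 13000
once = tail_text(original)   # prefix reports the true number of dropped chars
twice = tail_text(once)      # prefix now reports only the length of the old prefix

# The prefix pushes the first result over the limit, so a second pass
# truncates again and overwrites the accurate count with a misleading one.
first_prefix = once.splitlines()[0]
second_prefix = twice.splitlines()[0]
```

With this stand-in, `first_prefix` names the 1000 characters actually dropped, while `second_prefix` does not, which is why applying the truncation exactly once per value matters.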


rasdani force-pushed the codex/swe-rebench-rollout-observability branch 3 times, most recently from d93e863 to 804e913, on May 10, 2026 00:48
rasdani force-pushed the codex/swe-rebench-rollout-observability branch from 804e913 to 8a5e718 on May 10, 2026 02:56