[codex] Add structured SWE rollout failure observability by rasdani · Pull Request #1283 · PrimeIntellect-ai/verifiers

rasdani · 2026-05-04T15:34:19Z

Summary

Add a serialized failure rollout payload with stable reason, origin, error type metadata, message, and bounded diagnostic logs.
Serialize failure from state_to_output() while preserving the existing error, error_chain, and long_error_chain fields.
Redefine CLI-agent monitor metrics so agent_error means failure.origin == "agent", while agent_nonzero_exit preserves the old narrow non-zero-exit meaning and agent_poll_failed tracks background polling/read failures.
Classify sandbox, tunnel, model, and rollout-timeout failures away from agent_error; keep sandbox_oom, sandbox_timeout, and agent_timeout metrics.
Collect best-effort failure diagnostics before sandbox deletion: agent stdout/stderr tails, harness/agent log file tails, /tmp/vf_observed_command.log, and /tmp/install_progress.log.
Wrap composable agent install/post-install shell commands with grep-friendly observed-command traces and failure lines without masking the original setup failure.
Keep rollout logs grep-friendly with event=... key=value messages for aborts, finishes, empty trajectories, setup failures, and agent polling failures.
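As a rough illustration of the payload described in the summary, the sketch below models a failure record with the listed fields and serializes it to a plain dict. The class name `RolloutFailure`, the helper `serialize_failure`, and the example values are hypothetical; the PR's actual types and field layout may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


# Hypothetical shape: field names follow the PR summary, not the actual code.
@dataclass
class RolloutFailure:
    reason: str       # stable, machine-readable reason
    origin: str       # component that failed, e.g. "agent", "sandbox", "model"
    error_type: str   # name of the underlying exception type
    message: str      # human-readable description
    logs: Dict[str, str] = field(default_factory=dict)  # bounded diagnostic tails


def serialize_failure(failure: Optional[RolloutFailure]) -> Optional[dict]:
    """Return a JSON-friendly dict, or None when the rollout did not fail."""
    return None if failure is None else asdict(failure)


payload = serialize_failure(
    RolloutFailure(
        reason="agent_nonzero_exit",
        origin="agent",
        error_type="RuntimeError",
        message="agent exited with code 1",
        logs={"agent_stderr": "Traceback ..."},
    )
)
```

Keeping the payload a flat dict of strings means it can sit next to the existing error, error_chain, and long_error_chain fields without changing how they are read.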

Companion scheduler metrics PR: PrimeIntellect-ai/prime-rl#2411.

Compatibility Note

agent_error is intentionally broadened to all agent-origin failures.
agent_nonzero_exit preserves the previous narrow non-zero process exit signal.
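
The difference between the two signals can be sketched as follows. This is a minimal illustration, assuming the failure payload is a dict with `origin` and `reason` keys; the function name `monitor_metrics` and the rule that `agent_poll_failed` is keyed on `reason` are assumptions, not the PR's implementation.

```python
from typing import Optional


def monitor_metrics(failure: Optional[dict], exit_code: Optional[int]) -> dict:
    """Derive per-rollout monitor metrics (hypothetical helper)."""
    failure = failure or {}
    return {
        # Broadened meaning: any failure attributed to the agent.
        "agent_error": float(failure.get("origin") == "agent"),
        # Previous narrow meaning: the agent process exited non-zero.
        "agent_nonzero_exit": float(exit_code is not None and exit_code != 0),
        # Assumption: polling/read failures are identified by their reason.
        "agent_poll_failed": float(failure.get("reason") == "agent_poll_failed"),
    }


# An agent timeout counts as agent_error but not as a non-zero exit.
timeout_metrics = monitor_metrics({"origin": "agent", "reason": "agent_timeout"}, None)
# A sandbox failure is classified away from agent_error.
sandbox_metrics = monitor_metrics({"origin": "sandbox", "reason": "sandbox_oom"}, None)
```

Dashboards that previously read agent_error as "non-zero exit" would need to switch to agent_nonzero_exit to keep the same series.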

Validation

uv run ruff check verifiers/utils/failure_utils.py verifiers/envs/experimental/cli_agent_env.py verifiers/envs/experimental/composable/composable_env.py verifiers/envs/experimental/composable/harnesses/opencode.py verifiers/envs/experimental/opencode_env.py verifiers/utils/save_utils.py verifiers/types.py verifiers/envs/environment.py tests/test_cli_agent_env.py tests/test_swe_rollout_observability.py
python -m py_compile verifiers/utils/failure_utils.py verifiers/envs/experimental/cli_agent_env.py verifiers/envs/experimental/composable/composable_env.py verifiers/envs/experimental/opencode_env.py verifiers/utils/save_utils.py verifiers/types.py verifiers/envs/environment.py
uv run pytest tests/test_cli_agent_env.py tests/test_swe_rollout_observability.py tests/test_environment_extra.py::test_state_to_output_uses_state_usage_not_trajectory tests/test_save_utils.py

…ut/stderr tails

When a rollout completes cleanly (no error, not timed_out) but produces zero LLM turns, the orchestrator discards it as an empty trajectory and reschedules it. The orchestrator log, however, only records "Empty trajectory in group X", without the instance_id or any agent output, so bad SWE rows that hang the agent early are invisible at the orchestrator level.

Emit a WARNING-level rollout_empty_trajectory event from the env-side log_rollout_finished cleanup hook with stop, exit_code, durations, and truncated stdout/stderr from the agent process. Combined with PR #1283's existing instance_id field, this makes the offending dataset row identifiable directly from the env-server log.
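
A grep-friendly event line of this kind could look like the sketch below. The helper names `format_event` and `log_empty_trajectory`, the logger name, the field names `duration_s`, `stdout_tail`, and `stderr_tail`, and the 200-character limit are all assumptions for illustration; only the event name and the listed fields come from the commit message.

```python
import logging

logger = logging.getLogger("rollout")  # hypothetical logger name


def format_event(event: str, **fields) -> str:
    """Render 'event=... key=value' on one line; strings are repr-quoted
    so embedded newlines are escaped and the line stays greppable."""
    parts = [f"event={event}"]
    for key, value in fields.items():
        rendered = repr(value) if isinstance(value, str) else str(value)
        parts.append(f"{key}={rendered}")
    return " ".join(parts)


def log_empty_trajectory(instance_id, stop, exit_code, duration_s,
                         stdout, stderr, max_chars=200):
    """Emit a WARNING for a clean rollout that produced zero LLM turns."""
    message = format_event(
        "rollout_empty_trajectory",
        instance_id=instance_id,
        stop=stop,
        exit_code=exit_code,
        duration_s=round(duration_s, 2),
        stdout_tail=stdout[-max_chars:],   # keep only the end of each stream
        stderr_tail=stderr[-max_chars:],
    )
    logger.warning(message)
    return message


line = log_empty_trajectory("repo__123", "agent_completed", 0, 12.3456,
                            "installing...\n", "")
```

With this layout, `grep 'event=rollout_empty_trajectory'` on the env-server log lists the affected instance_id values directly.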

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


^{Reviewed by Cursor Bugbot for commit ed668b8.}

cursor · 2026-05-08T22:57:11Z

+    if failure is None:
+        return None
+    failure["logs"].update({str(k): tail_text(v) for k, v in logs.items()})
+    state["failure"] = failure


Redundant tail_text re-application corrupts truncation prefix

Low Severity

add_failure_logs passes logs to ensure_rollout_failure(state, logs=logs), which already applies tail_text to every log value internally (via make_rollout_failure or the existing-failure merge path). Then on line 234, it redundantly calls failure["logs"].update({str(k): tail_text(v) for k, v in logs.items()}), applying tail_text a second time to the same original values. Since callers like collect_failure_diagnostics already tail_text the values before passing them in, this is actually a triple application. For logs originally exceeding DEFAULT_FAILURE_LOG_CHARS (12000), the first tail_text produces a ~12025-char string with a "...<truncated N chars>\n" prefix; subsequent applications re-truncate this, replacing the accurate prefix with a misleading one (e.g., "...<truncated 25 chars>" instead of "...<truncated 1000 chars>").
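
The effect is easy to reproduce with a toy version of the helper. The `tail_text` below is a stand-in written for this illustration with a 20-character limit; it assumes the same "keep the tail, prepend a truncation prefix" behavior the review describes, not the PR's exact implementation.

```python
def tail_text(text: str, max_chars: int = 20) -> str:
    """Stand-in helper: keep the last max_chars characters and prepend
    a prefix stating how many characters were dropped."""
    if len(text) <= max_chars:
        return text
    dropped = len(text) - max_chars
    return f"...<truncated {dropped} chars>\n{text[-max_chars:]}"


original = "x" * 50

once = tail_text(original)   # prefix reports the 30 characters really dropped
twice = tail_text(once)      # prefix now reports only the old prefix's length

print(once.splitlines()[0])   # ...<truncated 30 chars>
print(twice.splitlines()[0])  # ...<truncated 24 chars>
```

The retained tail is identical after both calls; only the reported count becomes wrong. Truncating at a single boundary, for example only where logs enter the failure payload, avoids the problem.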


rasdani changed the title ~~[codex] Add SWE rollout lifecycle observability~~ [codex] Add structured SWE rollout failure observability May 8, 2026

rasdani mentioned this pull request May 8, 2026

[codex] Track rejected rollout failure metrics PrimeIntellect-ai/prime-rl#2411

Draft

rasdani marked this pull request as draft May 8, 2026 22:47

cursor Bot reviewed May 8, 2026


rasdani force-pushed the codex/swe-rebench-rollout-observability branch 3 times, most recently from d93e863 to 804e913 on May 10, 2026 00:48

rasdani added 2 commits May 10, 2026 02:55

Add SWE rollout observability logs

29cf601

Add structured CLI rollout failure observability

8a5e718

rasdani force-pushed the codex/swe-rebench-rollout-observability branch from 804e913 to 8a5e718 on May 10, 2026 02:56

rasdani wants to merge 2 commits into main from codex/swe-rebench-rollout-observability
