[codex] Add structured SWE rollout failure observability #1283
…ut/stderr tails

When a rollout completes cleanly (no error, not timed_out) but produces zero LLM turns, the orchestrator discards it as an empty trajectory and reschedules it. However, the orch log only records "Empty trajectory in group X", without the instance_id or any agent output, so bad SWE rows that hang the agent early are invisible at the orch level.

Emit a WARNING-level rollout_empty_trajectory event from the env-side log_rollout_finished cleanup hook with stop, exit_code, durations, and truncated stdout/stderr from the agent process. Combined with PR #1283's existing instance_id field, this makes the offending dataset row identifiable directly from the env-server log.
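The empty-trajectory check and the resulting log line could look roughly like the sketch below. This is an illustration only, not the code in this PR: the state keys (`trajectory`, `timed_out`, `duration_s`), the helper names, and the 2000-character tail limit are all assumptions made for the example.

```python
import logging
from typing import Optional

logger = logging.getLogger("rollout_events_sketch")  # hypothetical logger name


def format_event(event: str, **fields: object) -> str:
    """Render an `event=... key=value` message; strings are repr-quoted."""
    parts = [f"event={event}"]
    for key, value in fields.items():
        rendered = repr(value) if isinstance(value, str) else str(value)
        parts.append(f"{key}={rendered}")
    return " ".join(parts)


def is_empty_trajectory(state: dict) -> bool:
    """Clean finish (no error, no timeout) that produced zero LLM turns."""
    return (
        state.get("error") is None
        and not state.get("timed_out", False)
        and len(state.get("trajectory", [])) == 0
    )


def log_if_empty_trajectory(
    state: dict, stdout: str, stderr: str, tail_chars: int = 2000
) -> Optional[str]:
    """Emit a WARNING for empty trajectories and return the message, else None."""
    if not is_empty_trajectory(state):
        return None
    message = format_event(
        "rollout_empty_trajectory",
        instance_id=state.get("instance_id"),
        stop=state.get("stop"),
        exit_code=state.get("exit_code"),
        duration_s=state.get("duration_s"),
        stdout_tail=stdout[-tail_chars:],  # assumed tail limit
        stderr_tail=stderr[-tail_chars:],
    )
    logger.warning(message)
    return message


example_state = {
    "instance_id": "row-1",
    "error": None,
    "timed_out": False,
    "trajectory": [],
    "stop": "agent_completed",
    "exit_code": 0,
    "duration_s": 3.2,
}
empty_message = log_if_empty_trajectory(example_state, "agent stdout", "agent stderr")
```

Keeping instance_id in the same line as the stdout/stderr tails means a single grep on the event name is enough to locate the offending row.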
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit ed668b8. Configure here.
    if failure is None:
        return None
    failure["logs"].update({str(k): tail_text(v) for k, v in logs.items()})
    state["failure"] = failure
Redundant tail_text re-application corrupts truncation prefix
Low Severity
add_failure_logs passes logs to ensure_rollout_failure(state, logs=logs), which already applies tail_text to every log value internally (via make_rollout_failure or the existing-failure merge path). Then on line 234, it redundantly calls failure["logs"].update({str(k): tail_text(v) for k, v in logs.items()}), applying tail_text a second time to the same original values. Since callers like collect_failure_diagnostics already tail_text the values before passing them in, this is actually a triple application. For logs originally exceeding DEFAULT_FAILURE_LOG_CHARS (12000), the first tail_text produces a ~12025-char string with a "...<truncated N chars>\n" prefix; subsequent applications re-truncate this, replacing the accurate prefix with a misleading one (e.g., "...<truncated 25 chars>" instead of "...<truncated 1000 chars>").
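The effect of re-applying truncation can be reproduced with a small stand-in. The function below is a hypothetical approximation of tail_text written for this illustration, assuming it keeps the last N characters and prepends a "...<truncated N chars>" line; the real helper's prefix format and exact counts may differ.

```python
DEFAULT_FAILURE_LOG_CHARS = 12000  # limit quoted in the review comment


def tail_text_sketch(text: str, max_chars: int = DEFAULT_FAILURE_LOG_CHARS) -> str:
    """Hypothetical stand-in for tail_text: keep only the last max_chars characters."""
    if len(text) <= max_chars:
        return text
    dropped = len(text) - max_chars
    return f"...<truncated {dropped} chars>\n{text[-max_chars:]}"


original = "x" * 13000                # 1000 characters over the limit
once = tail_text_sketch(original)     # prefix reports the true number dropped
twice = tail_text_sketch(once)        # prefix now reports only the old prefix length

first_line_once = once.splitlines()[0]
first_line_twice = twice.splitlines()[0]

print(first_line_once)
print(first_line_twice)
```

Because the once-truncated string is longer than the limit by exactly the length of its own prefix, the second pass drops that prefix and writes a new one that counts only the prefix characters. Truncating each value once, at a single point in the call chain, avoids the problem.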


Summary
- … failure rollout payload with stable reason, origin, error type metadata, message, and bounded diagnostic logs.
- … failure from state_to_output() while preserving the existing error, error_chain, and long_error_chain fields.
- agent_error means failure.origin == "agent", while agent_nonzero_exit preserves the old narrow non-zero-exit meaning and agent_poll_failed tracks background polling/read failures.
- … agent_error; keep sandbox_oom, sandbox_timeout, and agent_timeout metrics.
- … /tmp/vf_observed_command.log, and /tmp/install_progress.log.
- … event=... key=value messages for aborts, finishes, empty trajectories, setup failures, and agent polling failures.

Companion scheduler metrics PR: PrimeIntellect-ai/prime-rl#2411.
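The metric semantics described in the summary can be sketched as a function over the failure payload. The payload keys used here (`origin`, `reason`) come from the summary, but the specific reason strings and the function name are assumptions for illustration and are not taken from the PR's code.

```python
from typing import Optional


def failure_metrics_sketch(failure: Optional[dict]) -> dict:
    """Derive per-rollout metric flags from a structured failure payload."""
    metrics = {
        "agent_error": 0.0,         # broad: any failure with origin == "agent"
        "agent_nonzero_exit": 0.0,  # narrow: the agent process exited non-zero
        "agent_poll_failed": 0.0,   # background polling/read failures
    }
    if failure is None:
        return metrics
    if failure.get("origin") == "agent":
        metrics["agent_error"] = 1.0
    # Assumed reason strings; the real values may be named differently.
    reason = failure.get("reason")
    if reason == "agent_nonzero_exit":
        metrics["agent_nonzero_exit"] = 1.0
    elif reason == "agent_poll_failed":
        metrics["agent_poll_failed"] = 1.0
    return metrics


nonzero_exit = failure_metrics_sketch(
    {"origin": "agent", "reason": "agent_nonzero_exit", "message": "exit code 1", "logs": {}}
)
other_agent_failure = failure_metrics_sketch(
    {"origin": "agent", "reason": "some_other_agent_reason", "message": "", "logs": {}}
)
no_failure = failure_metrics_sketch(None)
```

In this sketch a non-zero exit sets both agent_error and agent_nonzero_exit, while any other agent-origin failure sets only agent_error, which is the distinction a dashboard built on the old narrow meaning would need to account for.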
Compatibility Note
agent_error is intentionally broadened to all agent-origin failures. agent_nonzero_exit preserves the previous narrow non-zero process exit signal.
Validation
- uv run ruff check verifiers/utils/failure_utils.py verifiers/envs/experimental/cli_agent_env.py verifiers/envs/experimental/composable/composable_env.py verifiers/envs/experimental/composable/harnesses/opencode.py verifiers/envs/experimental/opencode_env.py verifiers/utils/save_utils.py verifiers/types.py verifiers/envs/environment.py tests/test_cli_agent_env.py tests/test_swe_rollout_observability.py
- python -m py_compile verifiers/utils/failure_utils.py verifiers/envs/experimental/cli_agent_env.py verifiers/envs/experimental/composable/composable_env.py verifiers/envs/experimental/opencode_env.py verifiers/utils/save_utils.py verifiers/types.py verifiers/envs/environment.py
- uv run pytest tests/test_cli_agent_env.py tests/test_swe_rollout_observability.py tests/test_environment_extra.py::test_state_to_output_uses_state_usage_not_trajectory tests/test_save_utils.py