A living catalog of known error patterns encountered in CodeScaleBench benchmark runs. Each entry documents the pattern signature, root cause, affected benchmarks, and recommended fix. Use this to quickly diagnose and resolve failures.
Fingerprints are defined in scripts/status_fingerprints.py and matched in order (first match wins).
| ID | Label | Severity | Auto-retry |
|---|---|---|---|
| token_refresh_403 | OAuth token refresh failure | infra | Yes |
| verifier_parse_error | Verifier output parse error | verifier | No |
| api_500 | API 500 server error | api | Yes |
| api_rate_limit | API rate limit / overloaded | api | Yes |
| context_window_exceeded | Context window exceeded | infra | No |
| timeout | Task timeout | task | No |
| mcp_connection | MCP server connection failure | mcp | Yes |
| import_error | Python import error | setup | No |
| docker_compose_fail | Docker/container failure | setup | No |
| permission_denied | Permission denied | infra | No |
| git_error | Git operation failure | setup | No |
| deep_search_polling_only | Deep Search returned polling-only | warning | Yes |
| deep_search_polling_timeout | Deep Search polling timeout (aggregate) | warning | Yes |
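The first-match-wins ordering can be illustrated with a minimal sketch. The three patterns below are transcribed from the entries in this catalog. The list name `FINGERPRINTS`, the function `match_fingerprint`, and the case-insensitive matching are assumptions for illustration and may not mirror scripts/status_fingerprints.py.

```python
import re
from typing import Optional

# Hypothetical subset of the catalog; the real definitions live in
# scripts/status_fingerprints.py and may differ in detail.
# Case-insensitive matching is an assumption.
FINGERPRINTS = [
    ("token_refresh_403", re.compile(
        r"403|Forbidden|token.*refresh|refresh.*token|credentials.*expired",
        re.IGNORECASE)),
    ("api_rate_limit", re.compile(
        r"rate limit|429|too many requests|throttl|overloaded",
        re.IGNORECASE)),
    ("timeout", re.compile(
        r"timeout|timed out|deadline exceeded|SIGTERM|killed.*signal",
        re.IGNORECASE)),
]


def match_fingerprint(exception_info: str) -> Optional[str]:
    """Return the ID of the first fingerprint that matches, else None."""
    for fingerprint_id, pattern in FINGERPRINTS:
        if pattern.search(exception_info):
            return fingerprint_id  # first match wins; later entries are skipped
    return None
```

Because order decides the result, an exception string that mentions both a 403 and a timeout is classified as token_refresh_403 in this sketch, since that entry comes first.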
Pattern: Matches 403, Forbidden, token.*refresh, refresh.*token, or credentials.*expired in exception info.
Root Cause: OAuth token expired or credentials file is stale. The Claude API rejects requests with a 403 when the token needs refreshing.
Fix: Re-authenticate with claude auth or check ~/.claude/.credentials.json. The ensure_fresh_token function in _common.sh handles automatic refresh with a 30-minute margin.
Auto-retry: true
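The 30-minute margin means a token is refreshed before it actually expires. A minimal sketch of that check follows. The function name `needs_refresh` and the representation of the expiry as epoch seconds are assumptions; the actual logic is in the ensure_fresh_token shell function and may differ.

```python
import time
from typing import Optional

REFRESH_MARGIN_SEC = 30 * 60  # 30-minute margin described above


def needs_refresh(expires_at_epoch_sec: float,
                  now: Optional[float] = None) -> bool:
    """True if the token expires within the refresh margin (or already has).

    expires_at_epoch_sec is a hypothetical input: the token expiry as a
    Unix timestamp in seconds.
    """
    if now is None:
        now = time.time()
    return expires_at_epoch_sec - now <= REFRESH_MARGIN_SEC
```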
Pattern: Matches verifier.*parse, verifier.*json, verifier.*decode, JSONDecodeError.*verifier, or reward.*parse in exception info.
Root Cause: The verifier script produced output that couldn't be parsed as valid JSON. The reward.txt or reward.json file is malformed or missing.
Fix: Check verifier script output format; ensure reward.txt/reward.json contains valid JSON with the expected schema.
Auto-retry: false
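When triaging this failure, a quick check of the reward file separates the "missing" case from the "malformed" case. The sketch below only tests that the file exists and parses as JSON; it does not validate the expected schema, which is not specified here. The helper name `diagnose_reward_file` is hypothetical.

```python
import json
from pathlib import Path


def diagnose_reward_file(path: str) -> str:
    """Return 'ok', 'missing', or 'malformed' for a verifier reward file."""
    reward_path = Path(path)
    if not reward_path.is_file():
        return "missing"
    try:
        # Any valid JSON value passes; schema checks are out of scope here.
        json.loads(reward_path.read_text())
    except json.JSONDecodeError:
        return "malformed"
    return "ok"
```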
Pattern: Matches 500 Internal Server Error, api.*500, or server.*error.*5xx in exception info.
Root Cause: Transient server-side error from the Claude API or Sourcegraph API.
Fix: Retry the task. If persistent, check the API status page for outages.
Auto-retry: true
Pattern: Matches rate limit, 429, too many requests, throttl, or overloaded in exception info.
Root Cause: Too many concurrent API requests exceeded account rate limits or the API is overloaded.
Fix: Reduce parallelism (lower PARALLEL_JOBS) or wait before retrying. Check account quotas.
Auto-retry: true
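One common way to "wait before retrying" is capped exponential backoff. The sketch below computes a delay schedule only; it is an illustration, not the retry policy the benchmark harness uses, and the base and cap values are arbitrary assumptions.

```python
from typing import List


def backoff_delays(attempts: int, base_sec: float = 5.0,
                   cap_sec: float = 60.0) -> List[float]:
    """Delay before each retry: base * 2**n, never exceeding cap_sec."""
    return [min(base_sec * (2 ** n), cap_sec) for n in range(attempts)]
```

For example, `backoff_delays(4, base_sec=5, cap_sec=30)` yields delays of 5, 10, 20, and 30 seconds.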
Pattern: Matches conversation is too long, context_window, context window, max_tokens exceeded, maximum context length, or prompt is too long in exception info.
Root Cause: The task required more context than the model's maximum window. Common in large-codebase tasks like TAC find-in-codebase (5-12M tokens) and CrossRepo API upgrades.
Fix: Not a task failure; classify it as an infrastructure limitation. Consider selecting tasks that require less context, or use MCP tools to reduce context usage.
Auto-retry: false
Pattern: Matches timeout, timed out, deadline exceeded, SIGTERM, or killed.*signal in exception info.
Root Cause: The task exceeded its configured time limit (time_limit_sec in task.toml or timeout_hours in the run config).
Fix: Consider increasing timeout_hours or simplifying the task. Check if the agent is stuck in a loop.
Auto-retry: false
Pattern: Matches mcp.*connect, mcp.*refused, mcp.*unavailable, mcp.*error, or sourcegraph.*connect/error/fail in exception info.
Root Cause: The MCP server (Sourcegraph) is unreachable or returned a connection error. Can be caused by network issues, server downtime, or misconfigured MCP settings.
Fix: Check that the MCP server is running and accessible. Verify MCP config in the task setup.
Auto-retry: true
Pattern: Matches ImportError, ModuleNotFoundError, No module named, or cannot import in exception info.
Root Cause: A Python dependency is missing from the Docker image. The Dockerfile or requirements.txt is incomplete.
Fix: Update the Dockerfile or requirements.txt to include the missing dependency, then rebuild the image.
Auto-retry: false
Pattern: Matches docker.*compose/build/pull.*fail, container.*exit/crash/fail, or OOMKill in exception info.
Root Cause: Docker image build failure, container crash, or out-of-memory kill. Can be caused by missing base images, insufficient disk space, or memory limits.
Fix: Check Docker image availability, disk space, and memory limits. For OOMKill, increase container memory allocation.
Auto-retry: false
Pattern: Matches permission denied, EACCES, or Operation not permitted in exception info.
Root Cause: File or directory permission issues in the task workspace. Common when files are owned by a different user (e.g., ubuntu:ubuntu in Docker).
Fix: Check file/directory permissions in the task workspace. May need chmod/chown in the Dockerfile.
Auto-retry: false
Pattern: Matches fatal:.*git, git.*clone/checkout/pull.*fail, or repository not found in exception info.
Root Cause: A git operation failed — usually a clone, checkout, or pull. Can be caused by network issues, invalid repo URLs, or missing credentials for private repos.
Fix: Check repository URL and network access. Verify git credentials if the repo is private.
Auto-retry: false
Pattern: Matches Poll for results using sg_deepsearch_read in session/trajectory content (not exception_info).
Root Cause: The agent called sg_deepsearch and received an async polling link, but sg_deepsearch_read only returned the polling message instead of actual results. The agent didn't retry enough times for Deep Search to complete.
Fix: Deep Search did not return results within the polling window. The run may have degraded quality. Rerun after the preamble retry fix is applied.
Auto-retry: true
Note: This fingerprint matches trajectory/session content, not exception_info. The current fingerprint_error() function won't detect it. It is included in status_fingerprints.py for use by future trajectory-scanning tools.
Pattern: sg_deepsearch_read returns a polling link ({"link":"...","note":"Poll for results using sg_deepsearch_read..."}) instead of actual semantic analysis results. The agent polls 1-2 times then gives up.
Root Cause: Deep Search is asynchronous, typically taking 50-300+ seconds to complete on the Sourcegraph backend. The agent polls at ~53-second intervals but only attempts 1-2 reads before moving on. 70.1% of all Deep Search calls (96/137) returned polling-only responses. At the task level, 38% of tasks (23/60) that invoked Deep Search never received useful results during the entire execution.
Affected Benchmarks:
| Benchmark | Tasks with DS | Got Results | Never Got Results | Success Rate |
|---|---|---|---|---|
| K8s Docs | 5 | 2 | 3 | 40% |
| PyTorch | 12 | 6 | 6 | 50% |
| SWE-bench Pro | 43 | 29 | 14 | 67% |
| Total | 60 | 37 | 23 | 62% |
Fix: SG_full preamble updated in claude_baseline_agent.py to instruct the agent: "After calling sg_deepsearch, call sg_deepsearch_read at least 3-5 times with 10-15 second waits between attempts. Deep Search is asynchronous and typically takes 50-300 seconds." Reruns with the updated preamble should resolve the issue.
Auto-retry: true
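The retry behaviour that the updated preamble asks for can be sketched as a polling loop. This is an illustration of the idea, not the agent's implementation: `read_fn` is a hypothetical callable standing in for an sg_deepsearch_read call, and treating any response without the polling note as a real result is an assumption. The defaults (5 attempts, 15-second waits) are taken from the upper end of the preamble's ranges.

```python
import time
from typing import Callable, Optional

POLLING_MARKER = "Poll for results using sg_deepsearch_read"


def is_polling_only(response: dict) -> bool:
    """True when the response is just the async polling message."""
    return POLLING_MARKER in str(response.get("note", ""))


def read_with_retries(
    read_fn: Callable[[], dict],
    max_attempts: int = 5,
    wait_sec: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call read_fn until it returns something other than the polling message.

    Returns the first non-polling response, or None if every attempt
    returned polling-only output.
    """
    for attempt in range(max_attempts):
        response = read_fn()
        if not is_polling_only(response):
            return response
        if attempt < max_attempts - 1:
            sleep(wait_sec)  # give Deep Search time to finish
    return None
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real waiting.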
Set DEBUG_MODE=true before running configs to capture full verifier diagnostics:

    DEBUG_MODE=true ./configs/test_2config.sh

Debug output is written to /logs/verifier/debug/ inside the container and does not affect scoring. The following files are captured:
| File | Contents |
|---|---|
| environment.txt | All environment variables (secrets filtered out) |
| workspace_git_status.txt | git status output from /workspace |
| workspace_git_diff.txt | git diff output from /workspace (capped at 500 lines) |
| workspace_file_tree.txt | File listing of /workspace (capped at 200 lines) |
Debug capture adds less than 2 seconds of overhead (env, git, find commands only). The DEBUG_MODE environment variable is exported by configs/_common.sh and inherited by Docker containers via Harbor.
Secret filtering excludes any environment variable whose name contains KEY, TOKEN, SECRET, or PASSWORD.
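The filtering rule can be expressed in a few lines. This sketch assumes a case-sensitive substring match on the variable name; the function name `filter_env` is hypothetical and the actual capture script may be implemented differently.

```python
from typing import Dict

SECRET_MARKERS = ("KEY", "TOKEN", "SECRET", "PASSWORD")


def filter_env(env: Dict[str, str]) -> Dict[str, str]:
    """Drop every variable whose name contains one of the secret markers."""
    return {
        name: value
        for name, value in env.items()
        if not any(marker in name for marker in SECRET_MARKERS)
    }
```

Under this rule a substring match is enough to exclude a variable, so a harmless name such as MONKEY_PATCH would also be dropped because it contains KEY.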