fix: recover from BadRequestError caused by oversized payloads#460
seanturner83 wants to merge 5 commits into
Conversation
…yloads

When tool results (e.g. large config files, CIS benchmark YAML) accumulate in conversation history, the serialized payload can exceed the provider's request size limit, causing a persistent HTTP 400. Previously this was not retried, failing the scan immediately.

Now handles BadRequestError in two stages:
1. Bare retry after 2s (transient 400s from provider hiccups)
2. If STRIX_TRUNCATE_ON_OVERSIZE=true, truncates the largest tool_result XML blocks to 1000 chars with a "requires manual review" notice, then retries.

This is opt-in to avoid lossy recovery for users who don't want it.

Observed in practice on repos with 50KB+ CIS defaults YAML and 94KB tfvars.json files hitting Bedrock payload limits at ~6M tokens.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
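The two-stage recovery described in the commit can be sketched roughly as follows. This is an illustrative shape only, not the strix implementation: `call_llm`, `is_bad_request`, and `truncate` are hypothetical callables standing in for the real retry-loop internals.

```python
import os
import time


def call_with_bad_request_recovery(call_llm, messages, is_bad_request,
                                   truncate, delay=2.0):
    """Two-stage HTTP 400 recovery: bare retry, then opt-in truncation."""
    try:
        return call_llm(messages)
    except Exception as exc:
        if not is_bad_request(exc):
            raise
    # Stage 1: bare retry after a short pause, for transient provider 400s.
    time.sleep(delay)
    try:
        return call_llm(messages)
    except Exception as exc:
        if not is_bad_request(exc):
            raise
        if os.environ.get("STRIX_TRUNCATE_ON_OVERSIZE") != "true":
            raise  # lossy recovery is opt-in
    # Stage 2: truncate oversized tool results and retry once more.
    time.sleep(delay)
    return call_llm(truncate(messages))
```

Any failure in the final attempt propagates to the caller unchanged, matching the "fail after recovery is exhausted" behavior the commit describes.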
Greptile Summary

This PR adds a two-stage HTTP 400 recovery path to the LLM retry loop: a bare retry after 2 s (for transient provider hiccups) followed by optional truncation of oversized tool_result blocks.

Confidence Score: 4/5. Safe to merge with minor caveats; core logic is sound and previous blocking issues are fixed. No P0/P1 findings in the new code. Two P2 items remain: the bare retry fires on all 400s (not just oversize payloads), adding latency for permanent config errors; and structured tool_result content blocks in strix/llm/llm.py are not truncated.

Important Files Changed: strix/llm/llm.py
This is a comment left during a code review.
Path: strix/llm/llm.py
Line: 177-183
Comment:
**Bare retry fires on all HTTP 400s, not just payload oversize**
The `_is_bad_request` check intercepts every HTTP 400 — including persistent provider errors like invalid model name, unsupported parameter combinations, or malformed request bodies that will never recover. For these, the bare retry adds a guaranteed 2-second stall and an extra wasted attempt before failing. Consider at least logging a warning, or inspecting the error body for "payload too large" / "context window" signals before incurring the retry.
How can I resolve this? If you propose a fix, please make it concise.
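One possible shape for the suggested guard: inspect the error body for oversize signals before paying for the bare retry. The signal strings below are assumptions based on common provider wording, not a confirmed list; any real check would need tuning against the 400 bodies the providers actually return.

```python
# Hypothetical oversize-signal check; the phrases are illustrative guesses.
OVERSIZE_SIGNALS = (
    "payload too large",
    "context window",
    "too many tokens",
    "request exceeds",
)


def looks_like_oversize_400(exc: Exception) -> bool:
    """Return True when the 400's body hints at a payload-size problem."""
    body = str(exc).lower()
    return any(signal in body for signal in OVERSIZE_SIGNALS)
```

The retry loop could then skip the 2-second stall (or at least log a warning) for 400s that match none of the signals, since those are likely permanent configuration errors.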
---
This is a comment left during a code review.
Path: strix/llm/llm.py
Line: 380-389
Comment:
**Structured `tool_result` content blocks are not truncated**
The code searches for `<tool_result>` XML only inside `type == "text"` content blocks. Some providers / litellm versions represent tool results as structured list entries with `type == "tool_result"` (and a nested `content` key holding the large body). Those are silently skipped, so truncation may not trigger even when the payload contains large structured tool results, leaving the caller in the `_raise_error` path without any reduction in payload size.
How can I resolve this? If you propose a fix, please make it concise.

Reviews (5): Last reviewed commit: "Fix truncate list formatting and tool re..."
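A concise fix along the lines the comment asks for would handle both block shapes. The sketch below is illustrative: the `type == "tool_result"` / nested `content` field names follow the review comment's description, not a confirmed litellm schema, and the 2000/1000 character numbers come from this PR's description.

```python
import re

TOOL_RESULT_RE = re.compile(r"(<tool_result>)(.*?)(</tool_result>)", re.DOTALL)
NOTICE = "... [truncated: requires manual review]"


def truncate_body(text, threshold=2000, keep=1000):
    """Shrink a single tool-result body if it exceeds the threshold."""
    if len(text) <= threshold:
        return text
    return text[:keep] + NOTICE


def truncate_content(blocks, threshold=2000, keep=1000):
    """Truncate oversized tool results in both text and structured blocks."""
    out = []
    for block in blocks:
        if block.get("type") == "text":
            # XML-style results embedded in plain text blocks.
            new_text = TOOL_RESULT_RE.sub(
                lambda m: m.group(1)
                + truncate_body(m.group(2), threshold, keep)
                + m.group(3),
                block.get("text", ""),
            )
            block = {**block, "text": new_text}
        elif block.get("type") == "tool_result" and isinstance(
            block.get("content"), str
        ):
            # Structured tool_result entries with a string body.
            block = {**block, "content": truncate_body(
                block["content"], threshold, keep)}
        out.append(block)
    return out
```

Structured blocks whose `content` is itself a list of sub-blocks would need one more recursive case, but the branch above covers the string-body shape the comment describes.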
…loads

Remove one-shot bad_request_truncated guard so truncation retries on each 400 until nothing remains to truncate. Also scan all messages per pass instead of stopping at the first hit.

Addresses review feedback from Greptile on usestrix#460.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…runcation retry

- Rename max_chars → threshold_chars and add a separate truncate_to_chars parameter so the threshold and truncation target can differ by design, replacing the hardcoded 1000. Callers can now shrink blocks aggressively without re-processing blocks that are already acceptable.
- Add a 2s sleep before the truncation-path `continue` to match the bare-retry pacing. If Bedrock is throttling after the original 400, immediately hitting it again risks a second rejection before the truncated payload is evaluated.
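The parameter split in this commit buys idempotency: once a block has been shrunk below the threshold, later passes leave it alone. A hedged sketch of the idea, with the parameter names taken from the commit message and the body purely illustrative:

```python
def truncate_tool_result(body: str, threshold_chars: int = 2000,
                         truncate_to_chars: int = 1000) -> str:
    """Shrink only blocks above threshold_chars, down to truncate_to_chars.

    Because truncate_to_chars plus the notice stays below threshold_chars,
    a second pass over already-truncated blocks is a no-op.
    """
    if len(body) <= threshold_chars:
        return body
    return body[:truncate_to_chars] + " [truncated: requires manual review]"
```

With a single max_chars, the threshold and target were coupled, so repeated passes could keep re-cutting blocks that were already small enough.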
Addressed both P2 points in 6e25294:
Pulls usestrix#467 (feat: add resume session feature) ahead of upstream merge.

Enables unattended scans to survive crashes or forced termination (e.g. AWS STS session expiry mid-scan) by appending every LLM message to strix_runs/<run>/conversation.jsonl and replaying on --resume.

New CLI:
    strix --list-sessions       # tabulate past scans
    strix --continue (or -c)    # resume most recent
    strix --resume <run_name>   # resume by name
    strix --resume              # open interactive picker

Motivates: https://github.com/seedcx/strix-scan-workflow/actions/runs/25116247499, where the trade-api flag-aware scan exited at 71min with APIConnectionError ("security token included in the request is expired"). The session had produced 10 findings; all were lost once the report upload also hit ExpiredToken. With this commit merged, the CI composite action can wrap strix in a re-assume + --continue loop so each sub-session fits inside the 60min STS budget and total scan duration is bounded only by the overall workflow timeout.

Upstream integration:
- Once usestrix#467 merges, drop from the seedcx-build recipe
- Until then, include alongside usestrix#454 (memory compression), usestrix#460 (truncate retry), usestrix#468 (thinking retry), this one, and the adaptive-thinking commit on this branch

Clean 3-way merge, no conflicts.

Smoke tests:
- uv sync succeeds
- strix --help surfaces --resume / --continue / --list-sessions
- imports (strix.sessions.resume, strix.telemetry.conversation_log) OK

Follow-up (separate): strix-scan-workflow README recipe + composite action consumer changes to invoke resume on re-dispatch.
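The append-and-replay mechanism behind conversation.jsonl can be sketched minimally as below. The function names and message shape are assumptions for illustration, not the strix.telemetry.conversation_log API; the only load-bearing idea is one JSON object per line, appended as messages arrive and read back in order on resume.

```python
import json
from pathlib import Path


def append_message(log_path: Path, message: dict) -> None:
    """Append one LLM message as a single JSONL line (crash-tolerant)."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(message) + "\n")


def replay_messages(log_path: Path) -> list[dict]:
    """Rebuild the conversation for --resume; missing log means fresh start."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Append-only JSONL suits this failure mode: a crash mid-scan loses at most the in-flight message, and no rewrite of earlier history is ever needed.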
Title:
fix: recover from BadRequestError caused by oversized payloads

Summary
Adds a two-stage HTTP 400 recovery path: a bare retry, then truncation of oversized <tool_result> XML blocks when STRIX_TRUNCATE_ON_OVERSIZE=true.

Context
During large-scale scanning (700+ repos), certain repos with very large files caused persistent HTTP 400s that killed scans immediately. The two-stage approach handles both transient 400s (bare retry) and genuine payload oversize (truncation + retry).
How it works
1. On HTTP 400, bare retry after 2s (covers transient provider errors).
2. If the retry also fails and STRIX_TRUNCATE_ON_OVERSIZE=true, scan messages in reverse for <tool_result> blocks exceeding 2000 chars, truncate to 1000 chars with a "requires manual review" notice, and retry.

Test plan
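The reverse scan-and-truncate pass can be sketched end to end as follows. This is a hedged reconstruction from the description above, assuming messages are dicts with a string `content` field; the real strix message shape may differ.

```python
import re

_TOOL_RESULT = re.compile(r"(<tool_result>)(.*?)(</tool_result>)", re.DOTALL)
_NOTICE = " [truncated: requires manual review]"


def truncate_pass(messages, threshold=2000, keep=1000):
    """One truncation pass over all messages, newest first.

    Returns True if anything was truncated, so the caller can keep
    retrying until a pass finds nothing left to shrink.
    """
    changed = False
    for msg in reversed(messages):  # scan in reverse: newest messages first
        content = msg.get("content")
        if not isinstance(content, str):
            continue

        def shrink(m):
            nonlocal changed
            body = m.group(2)
            if len(body) <= threshold:
                return m.group(0)  # small enough, leave untouched
            changed = True
            return m.group(1) + body[:keep] + _NOTICE + m.group(3)

        msg["content"] = _TOOL_RESULT.sub(shrink, content)
    return changed
```

Returning False when nothing changed gives the retry loop its stopping condition: once a pass is a no-op, a further 400 is genuinely unrecoverable and should be raised.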