
fix: recover from BadRequestError caused by oversized payloads#460

Open
seanturner83 wants to merge 5 commits into usestrix:main from seanturner83:fix/bad-request-truncate-retry

Conversation

@seanturner83
Contributor

Title: fix: recover from BadRequestError caused by oversized payloads

Summary

  • Adds BadRequestError (HTTP 400) recovery to the LLM retry loop — first retries bare, then optionally truncates large <tool_result> XML blocks when STRIX_TRUNCATE_ON_OVERSIZE=true
  • Truncation is opt-in (default off) to avoid lossy recovery for users who don't want it
  • Observed on repos with 50KB+ config files (CIS benchmark YAML, large tfvars.json) that push Bedrock payload past its limit at ~6M tokens

Context

During large-scale scanning (700+ repos), certain repos with very large files caused persistent HTTP 400s that killed scans immediately. The two-stage approach handles both transient 400s (bare retry) and genuine payload oversize (truncation + retry).

How it works

  1. On HTTP 400, retry once after 2s (handles transient provider hiccups)
  2. If still failing and STRIX_TRUNCATE_ON_OVERSIZE=true, scan messages in reverse for <tool_result> blocks exceeding 2000 chars, truncate to 1000 chars with a "requires manual review" notice, retry
  3. If neither works, fall through to existing retry/error logic
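The three steps above can be sketched as follows. This is a minimal illustration of the two-stage recovery, not the PR's actual code: the helper names (`truncate_large_tool_results`, `complete_with_recovery`) and the `BadRequestError` stand-in are assumptions for the sketch.

```python
import asyncio
import os
import re

# Opt-in flag, default off (assumed spelling matches the PR description)
TRUNCATE_ON_OVERSIZE = os.getenv("STRIX_TRUNCATE_ON_OVERSIZE", "").lower() == "true"


class BadRequestError(Exception):
    """Stand-in for the provider's HTTP 400 error type."""


def truncate_large_tool_results(messages, threshold_chars=2000, keep_chars=1000):
    """Truncate <tool_result> blocks longer than threshold_chars, scanning
    messages in reverse. Returns True if anything was truncated."""
    changed = False
    pattern = re.compile(r"<tool_result>(.*?)</tool_result>", re.DOTALL)

    def shrink(match):
        nonlocal changed
        body = match.group(1)
        if len(body) > threshold_chars:
            changed = True
            return ("<tool_result>" + body[:keep_chars]
                    + "\n[truncated - requires manual review]</tool_result>")
        return match.group(0)

    for msg in reversed(messages):
        if isinstance(msg.get("content"), str):
            msg["content"] = pattern.sub(shrink, msg["content"])
    return changed


async def complete_with_recovery(call, messages):
    try:
        return await call(messages)
    except BadRequestError:
        await asyncio.sleep(2)  # stage 1: bare retry for transient 400s
        try:
            return await call(messages)
        except BadRequestError:
            if TRUNCATE_ON_OVERSIZE and truncate_large_tool_results(messages):
                return await call(messages)  # stage 2: truncate, then retry
            raise  # stage 3: fall through to existing retry/error logic
```

With the flag off, the sketch degenerates to a single bare retry, matching the "no behavior change without explicit opt-in" claim in the test plan.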

Test plan

  • Confirmed bare retry resolves transient 400s in production scanning
  • Confirmed truncation triggers on repos with large tool results hitting Bedrock payload limits
  • Verify truncation notice appears in scan output when triggered
  • Verify flag defaults to off — no behavior change without explicit opt-in

…yloads

When tool results (e.g. large config files, CIS benchmark YAML) accumulate
in conversation history, the serialized payload can exceed the provider's
request size limit, causing a persistent HTTP 400. Previously this was not
retried, failing the scan immediately.

Now handles BadRequestError in two stages:
1. Bare retry after 2s (transient 400s from provider hiccups)
2. If STRIX_TRUNCATE_ON_OVERSIZE=true, truncates the largest tool_result
   XML blocks to 1000 chars with a "requires manual review" notice, then
   retries. This is opt-in to avoid lossy recovery for users who don't
   want it.

Observed in practice on repos with 50KB+ CIS defaults YAML and 94KB
tfvars.json files hitting Bedrock payload limits at ~6M tokens.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps Bot commented Apr 20, 2026

Greptile Summary

This PR adds a two-stage HTTP 400 recovery path to the LLM retry loop: a bare retry after 2 s (for transient provider hiccups) followed by optional <tool_result> XML truncation when STRIX_TRUNCATE_ON_OVERSIZE=true (for genuine payload-oversize errors against Bedrock). The previous silent-exit-on-last-attempt bug and the missing import re at module level are both correctly addressed.

Confidence Score: 4/5

Safe to merge with minor caveats; core logic is sound and previous blocking issues are fixed.

No P0/P1 findings in the new code. Two P2 items remain: the bare retry fires on all 400s (not just oversize payloads), adding latency for permanent config errors; and structured type=tool_result content blocks are not covered by the truncation path. Neither blocks the stated use-case of recovering from Bedrock payload limits on large YAML/tfvars files.

Focus areas: strix/llm/llm.py — specifically the _is_bad_request catch scope and the _truncate_large_tool_results content-type coverage.

Important Files Changed

Filename Overview
strix/llm/llm.py Adds HTTP 400 recovery with bare retry and optional truncation; import re correctly moved to module-level; previous silent-exit and last-attempt guard bugs fixed; a few P2 edge cases remain around 400 misclassification and truncation coverage.
This is a comment left during a code review.
Path: strix/llm/llm.py
Line: 177-183

Comment:
**Bare retry fires on all HTTP 400s, not just payload oversize**

The `_is_bad_request` check intercepts every HTTP 400 — including persistent provider errors like invalid model name, unsupported parameter combinations, or malformed request bodies that will never recover. For these, the bare retry adds a guaranteed 2-second stall and an extra wasted attempt before failing. Consider at least logging a warning, or inspecting the error body for "payload too large" / "context window" signals before incurring the retry.
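One concise resolution in the spirit of this comment is to inspect the error text before incurring the retry. The signal strings below are assumptions for illustration; real provider error bodies vary and should be checked against observed 400s.

```python
# Heuristic signals that a 400 is a size-limit error rather than a
# permanent configuration error (assumed phrases, not an exhaustive list)
OVERSIZE_SIGNALS = (
    "payload",
    "too large",
    "request size",
    "context window",
    "input is too long",
)


def looks_like_oversize_400(err: Exception) -> bool:
    """Only treat a 400 as retryable if its message hints at payload or
    context-size limits; other 400s (invalid model name, unsupported
    parameters) will never recover and should fail fast."""
    text = str(err).lower()
    return any(sig in text for sig in OVERSIZE_SIGNALS)
```

The bare-retry branch would then be guarded by `looks_like_oversize_400(err)`, or at minimum log a warning when it fires on a non-matching 400.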


---

This is a comment left during a code review.
Path: strix/llm/llm.py
Line: 380-389

Comment:
**Structured `tool_result` content blocks are not truncated**

The code searches for `<tool_result>` XML only inside `type == "text"` content blocks. Some providers / litellm versions represent tool results as structured list entries with `type == "tool_result"` (and a nested `content` key holding the large body). Those are silently skipped, so truncation may not trigger even when the payload contains large structured tool results, leaving the caller in the `_raise_error` path without any reduction in payload size.
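A sketch of covering both block shapes in one helper. The field names follow the shapes described in the comment (`type == "text"` with a `text` key; `type == "tool_result"` with a nested string `content` key); the text path is simplified here to a plain length check rather than the PR's `<tool_result>` XML search.

```python
NOTICE = "\n[truncated - requires manual review]"


def truncate_content_block(block: dict, threshold: int = 2000, keep: int = 1000) -> bool:
    """Truncate one structured content block in place; returns True if it
    was shortened. Covers both text blocks and structured tool_result
    blocks so neither representation is silently skipped."""
    if block.get("type") == "text" and len(block.get("text", "")) > threshold:
        block["text"] = block["text"][:keep] + NOTICE
        return True
    if (block.get("type") == "tool_result"
            and isinstance(block.get("content"), str)
            and len(block["content"]) > threshold):
        block["content"] = block["content"][:keep] + NOTICE
        return True
    return False
```

Providers that nest tool_result content as a list of sub-blocks would need one more recursion step, but the dispatch on `type` is the essential fix.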



@bearsyankees
Collaborator

@greptile

…loads

Remove one-shot bad_request_truncated guard so truncation retries on each
400 until nothing remains to truncate. Also scan all messages per pass
instead of stopping at the first hit.

Addresses review feedback from Greptile on usestrix#460.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
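The commit above removes the one-shot guard so truncation repeats on each 400 until a whole-history pass shrinks nothing. A minimal sketch of that loop, with illustrative names (`truncate_pass`, `retry_with_progressive_truncation`) and a stand-in error type:

```python
import asyncio


class BadRequestError(Exception):
    """Stand-in for the provider's HTTP 400 error."""


def truncate_pass(messages, threshold=2000, keep=1000):
    """One pass over ALL messages (not stopping at the first hit);
    returns True if any message was shortened."""
    changed = False
    for msg in messages:
        body = msg.get("content", "")
        if isinstance(body, str) and len(body) > threshold:
            msg["content"] = body[:keep] + "\n[truncated - requires manual review]"
            changed = True
    return changed


async def retry_with_progressive_truncation(call, messages, max_attempts=5):
    """Retry on each 400, truncating per pass, until a pass finds nothing
    left to shrink — no one-shot guard."""
    for _ in range(max_attempts):
        try:
            return await call(messages)
        except BadRequestError:
            if not truncate_pass(messages):
                raise  # nothing left to truncate; surface the 400
    raise RuntimeError("exhausted retries")
```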
@bearsyankees
Collaborator

@greptile

…runcation retry

- Rename max_chars → threshold_chars and add separate truncate_to_chars
  parameter so the threshold and truncation target can differ by design,
  replacing the hardcoded 1000. Callers can now shrink blocks aggressively
  without re-processing blocks that are already acceptable.
- Add 2s sleep before the truncation-path `continue` to match the bare-retry
  pacing. If Bedrock is throttling after the original 400, immediately hitting
  it again risks a second rejection before the truncated payload is evaluated.
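The threshold/target split described in this commit can be illustrated with a small helper (names follow the commit message; the body is a sketch, not the PR's code):

```python
def truncate_tool_result(body: str, threshold_chars: int = 2000,
                         truncate_to_chars: int = 1000) -> str:
    """Only blocks longer than threshold_chars qualify for truncation;
    qualifying blocks are cut to truncate_to_chars. Keeping the two
    independent lets a pass shrink aggressively (small truncate_to_chars)
    without re-processing blocks already under the threshold."""
    if len(body) <= threshold_chars:
        return body  # already acceptable; never touched again
    return body[:truncate_to_chars] + "\n[truncated - requires manual review]"
```

With a single `max_chars`, a second pass would re-truncate blocks that the first pass already reduced; the separate threshold makes repeated passes idempotent on acceptable blocks.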
@seanturner83
Contributor Author

Addressed both P2 points in 6e25294:

  • Split max_chars into threshold_chars (qualifier for truncation) and truncate_to_chars (retained prefix size), with a docstring explaining why they're independent. The hardcoded 1000 is gone.
  • Added await asyncio.sleep(2) before the truncation path continue to match the bare-retry pacing, with a comment explaining the intent.

Ready for another Greptile pass whenever you want to re-trigger.

@bearsyankees
Collaborator

@greptileai

@bearsyankees
Collaborator

@greptile

seanturner83 added a commit to seanturner83/strix that referenced this pull request Apr 29, 2026
Pulls usestrix#467 (feat: add resume
session feature) ahead of upstream merge. Enables unattended scans
to survive crashes or forced termination (e.g. AWS STS session
expiry mid-scan) by appending every LLM message to
strix_runs/<run>/conversation.jsonl and replaying on --resume.

New CLI:
  strix --list-sessions         # tabulate past scans
  strix --continue (or -c)      # resume most recent
  strix --resume <run_name>     # resume by name
  strix --resume                # open interactive picker

Motivates:
  https://github.com/seedcx/strix-scan-workflow/actions/runs/25116247499
  — trade-api flag-aware scan exited at 71min with
    APIConnectionError ("security token included in the request is
    expired"). Session had produced 10 findings; all lost once the
    report upload also hit ExpiredToken. With this commit merged,
    the CI composite action can wrap strix in a re-assume + --continue
    loop so each sub-session fits inside the 60min STS budget and
    total scan duration is bounded only by the overall workflow
    timeout.

Upstream integration:
  - Once usestrix#467 merges, drop from the seedcx-build recipe
  - Until then, include alongside usestrix#454 (memory compression), usestrix#460
    (truncate retry), usestrix#468 (thinking retry), this one, and the
    adaptive-thinking commit on this branch

Clean 3-way merge, no conflicts. Smoke tests:
  - uv sync succeeds
  - strix --help surfaces --resume / --continue / --list-sessions
  - imports (strix.sessions.resume, strix.telemetry.conversation_log) OK

Follow-up (separate): strix-scan-workflow README recipe + composite
action consumer changes to invoke resume on re-dispatch.
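The append-and-replay mechanism described in the commit above can be sketched as follows. The path layout (`strix_runs/<run>/conversation.jsonl`) comes from the commit message; the function names are illustrative, not from usestrix#467.

```python
import json
from pathlib import Path


def append_message(run_dir: str, message: dict) -> None:
    """Append one LLM message as a JSON line. One append per message means
    a crash or forced termination loses at most the in-flight message."""
    log = Path(run_dir) / "conversation.jsonl"
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a", encoding="utf-8") as f:
        f.write(json.dumps(message) + "\n")


def replay_messages(run_dir: str) -> list[dict]:
    """Rebuild conversation history on --resume by replaying the log."""
    log = Path(run_dir) / "conversation.jsonl"
    if not log.exists():
        return []
    return [json.loads(line)
            for line in log.read_text(encoding="utf-8").splitlines()
            if line.strip()]
```

This is what lets a CI wrapper re-assume credentials and restart with --continue: each sub-session replays the log and picks up where the previous one died.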
