feat: prevent overfitting via prompt changes and post-processing #127
Conversation
    optimize_context, iteration
)
if all_valid:
    return self._handle_success(last_ctx, iteration)
Success returns validation context with inflated iteration number
Medium Severity
When validation passes, _handle_success receives last_ctx — the last validation sample's context — instead of optimize_context (the original passing turn). The validation context's .iteration is set to iteration + i + 1 (a synthetic validation-internal number), so the returned result and the "success" status update carry an inflated iteration number. For example, if the main loop passes on attempt 1 with 2 validation samples, the result reports iteration=3 instead of 1. This propagates into on_passing_result, on_status_update, and the API persistence layer via _persist_and_forward, causing misleading iteration counts in the UI and stored records.
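A minimal sketch of the fix this implies, assuming the control flow described above (the names optimize_context, last_ctx, and the success handler come from the diff; the Context dataclass here is a hypothetical stand-in for the real context object):

```python
from dataclasses import dataclass, replace

@dataclass
class Context:
    iteration: int
    output: str

def handle_success(passing_ctx: Context, iteration: int) -> Context:
    # Report the iteration of the original passing turn, not the
    # synthetic validation-internal counter carried by last_ctx.
    return replace(passing_ctx, iteration=iteration)

# If the main loop passed on attempt 1 and validation ran 2 samples,
# last_ctx.iteration would be 3; passing optimize_context keeps the
# reported value at 1.
optimize_context = Context(iteration=1, output="passing turn")
last_ctx = Context(iteration=3, output="validation sample")
result = handle_success(optimize_context, iteration=1)
assert result.iteration == 1
```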
Reviewed by Cursor Bugbot for commit 3042984.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 288336e.
    f"— replaced {total_count} occurrence(s) with placeholder {placeholder}"
)

return text, warnings
Sync callbacks crash in await_if_needed after return type change
High Severity
await_if_needed checks isinstance(result, str) to detect synchronous returns, but the callback signatures now return OptimizationResponse instead of str. When a synchronous (non-async) callback returns an OptimizationResponse, the isinstance check is False, so the code falls through to await result, which raises a TypeError because OptimizationResponse is not awaitable. All tests use AsyncMock so this path is untested.
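A sketch of one way to fix this, assuming await_if_needed's role as described above: detect awaitables directly with inspect.isawaitable instead of type-checking the synchronous return value (the callback names below are illustrative):

```python
import asyncio
import inspect
from dataclasses import dataclass

@dataclass
class OptimizationResponse:
    output: str

async def await_if_needed(result):
    # Detect awaitables directly rather than checking isinstance(result, str),
    # so sync callbacks returning OptimizationResponse no longer fall
    # through to `await result` and raise TypeError.
    if inspect.isawaitable(result):
        return await result
    return result

def sync_callback() -> OptimizationResponse:
    return OptimizationResponse(output="ok")

async def async_callback() -> OptimizationResponse:
    return OptimizationResponse(output="ok")

async def main():
    a = await await_if_needed(sync_callback())
    b = await await_if_needed(async_callback())
    assert a.output == b.output == "ok"

asyncio.run(main())
```

Testing with a plain (non-AsyncMock) callback would have caught the original crash path.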


Requirements
Describe the solution you've provided
Implements a few things to prevent "overfitting" of responses (the LLM tailoring its response to only one set of inputs and values):
Describe alternatives you've considered
This is an attempted fix for an overfitting problem that was reported.
Additional context
Output example after these changes:
The appropriate placeholders are now present in the output rather than values being hardcoded directly into the prompt.
Note
Medium Risk
Medium risk because it changes the public callback contract (handle_agent_call/handle_judge_call now return OptimizationResponse) and alters optimization control flow by adding post-pass validation loops and retry behavior, which can affect integrations and run-time characteristics.
Overview
Adds a post-pass validation phase (“chaos mode”) that, after an initial passing iteration, reruns the agent on additional distinct sampled inputs/variables (2–5, scaled by pool size) before confirming success; failed validation rejects the candidate and continues with variation generation without consuming the attempt budget.
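The "2–5, scaled by pool size" sampling could look like the following sketch; the scaling rule, function names, and clamping bounds here are illustrative assumptions, not the PR's exact code:

```python
import random

def validation_sample_count(pool_size: int, low: int = 2, high: int = 5) -> int:
    # Scale the number of post-pass validation reruns with the size of
    # the input pool, clamped to the 2-5 range described above.
    return max(low, min(high, pool_size // 2))

def sample_validation_inputs(pool, rng=random):
    # Draw distinct inputs (no repeats) for the chaos-mode reruns.
    n = validation_sample_count(len(pool))
    return rng.sample(pool, k=min(n, len(pool)))

pool = [{"city": "Paris"}, {"city": "Tokyo"}, {"city": "Lima"},
        {"city": "Oslo"}, {"city": "Cairo"}, {"city": "Quito"}]
samples = sample_validation_inputs(pool)
assert 2 <= len(samples) <= 5
assert len({id(s) for s in samples}) == len(samples)  # all distinct
```

random.sample draws without replacement, which matches the "additional distinct sampled inputs" requirement.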
Refactors agent/judge callback plumbing to return a new OptimizationResponse (output + optional TokenUsage), records per-call durations, and persists generation_latency/token usage plus per-judge evaluation latencies/tokens in agent_optimization_result payloads.
Hardens variation generation and overfitting prevention by improving prompts (explicit placeholder key-vs-value guidance + an overfitting warning section), broadening placeholder interpolation to support hyphenated keys, adding deterministic post-processing (restore_variable_placeholders) to revert leaked concrete values back to {{key}}, and retrying variation generation up to 3 times on empty/unparseable JSON while removing structured-output tool injection/handler routing.