Add pandas-debugger RL environment for data pipeline debugging #1344
l69d wants to merge 1 commit into
New environment covering 10 bug categories in pandas/numpy code: off_by_one, dtype_cast, merge_key, agg_axis, fillna_method, groupby_reset, str_strip, sort_ascending, inplace_return, copy_alias. Graded reward signal (0/0.25/0.5/1.0) for richer RL training signal. No sandbox required — subprocess with timeout isolation. 39 tests across all bug categories and edge cases.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Reviewed by Cursor Bugbot for commit 899e4b8. Configure here.
    # our check_expr itself is broken; give benefit of the doubt
    return 0.5

    return score
Reward ladder 0.5 level never triggers for intended case
High Severity
The documented reward ladder says 0.5 = "runs but wrong," but the implementation never returns 0.5 for that case. When the model's code runs without errors but fails the check, it returns 0.25 (same as "syntactically valid but crashes"). The 0.5 path only triggers when the ground-truth check_expr is itself broken, which is a degenerate sanity-check scenario. The advertised 4-level graded signal (0/0.25/0.5/1.0) is effectively 3-level (0/0.25/1.0), reducing the RL training signal richness that the PR specifically highlights as a design goal.
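One way to make all four levels reachable is sketched below. It assumes the caller has already executed the model's code and recorded two booleans (`ran_ok`, `check_passed`); those names, the function name, and the meaning of the 0 level (code that does not parse) are illustrative assumptions, not taken from the PR.

```python
import ast


def graded_reward(code: str, ran_ok: bool, check_passed: bool) -> float:
    """Map execution outcomes onto a 0 / 0.25 / 0.5 / 1.0 ladder.

    Hypothetical helper: `ran_ok` means the code exited without raising,
    `check_passed` means the task's check expression held afterwards.
    """
    try:
        ast.parse(code)
    except SyntaxError:
        return 0.0   # assumed meaning of the lowest level: does not parse
    if not ran_ok:
        return 0.25  # syntactically valid but crashes or times out
    if not check_passed:
        return 0.5   # runs cleanly but produces the wrong result
    return 1.0       # runs and passes the check
```

With this ordering, a ground-truth sanity check can be handled separately (for example by excluding the task) instead of sharing the 0.5 level with "runs but wrong".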
    @@ -0,0 +1,645 @@
    """
Missing environments/README.md update for new environment
Low Severity
This PR adds a new pandas_debugger environment to the environments/ folder but does not update environments/README.md to list it. The project rules require that any PR adding or removing an environment must update that README with the new environment listed under the appropriate category/pattern section.
Triggered by project rule: BugBot Instructions
    }
    patterns = keywords.get(bug_type, [])
    text_lower = text.lower()
    return any(re.search(p, text_lower) for p in patterns)
Regex patterns with uppercase never match lowered text
Low Severity
The patterns "None" and "SettingWithCopy" contain uppercase characters but are matched against text.lower() via re.search, which is case-sensitive by default. These patterns can never match, making them dead code. The function still works in practice because sibling patterns like "inplace" / "return" and "copy" / "view" compensate, but the intended matching for those specific terms is silently broken.
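The mismatch can be reproduced in isolation. The snippet below is a minimal sketch with a made-up pattern list and sample sentence (neither is copied from the PR); it contrasts a case-sensitive search against lowered text with a search that passes `re.IGNORECASE` and leaves the text untouched.

```python
import re

# Illustrative patterns and text, not the PR's actual keyword table.
patterns = ["None", "SettingWithCopy", "inplace"]
text = "The call returns None, which triggers a SettingWithCopyWarning."
text_lower = text.lower()

# Case-sensitive search against lowered text: patterns containing
# uppercase characters can never match.
broken = [p for p in patterns if re.search(p, text_lower)]

# One possible fix: search the original text with IGNORECASE.
fixed = [p for p in patterns if re.search(p, text, re.IGNORECASE)]

print(broken)  # []
print(fixed)   # ['None', 'SettingWithCopy']
```

Lowercasing the patterns themselves would work equally well; the flag simply keeps the keyword table readable.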
    return 1.0

    # Partial: also check if the GROUND TRUTH passes (sanity; should always be True)
    gt_passed, _ = _run_code_safe(fixed_code_gt, check_expr)
Blocking subprocess.run in async reward function serializes rollouts
Medium Severity
The async correctness_reward function calls _run_code_safe, which uses synchronous subprocess.run with a 10-second timeout — potentially twice per rollout (model code + ground-truth sanity check). The project docs explicitly state that sync operations in reward functions block all concurrent rollouts and "should be avoided at all costs." Using asyncio.create_subprocess_exec or wrapping with asyncio.to_thread would prevent event-loop starvation.
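A sketch of the `asyncio.to_thread` option follows (Python 3.9+). `run_code_blocking` is a hypothetical stand-in for the PR's `_run_code_safe`, whose real signature is not shown here, and `correctness_reward_sketch` is likewise an illustrative name with a simplified pass/fail return.

```python
import asyncio
import subprocess
import sys


def run_code_blocking(code: str, timeout: float = 10.0) -> bool:
    """Hypothetical stand-in for _run_code_safe: True if the child exits with 0."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0


async def correctness_reward_sketch(code: str) -> float:
    # to_thread runs the blocking call in a worker thread, so the event
    # loop stays free to schedule other rollouts in the meantime.
    passed = await asyncio.to_thread(run_code_blocking, code)
    return 1.0 if passed else 0.0


async def main() -> list[float]:
    # Two rollouts scored concurrently instead of one after the other.
    return await asyncio.gather(
        correctness_reward_sketch("assert 1 + 1 == 2"),
        correctness_reward_sketch("assert 1 + 1 == 3"),
    )


results = asyncio.run(main())
print(results)  # [1.0, 0.0]
```

`asyncio.create_subprocess_exec` avoids the worker thread entirely, at the cost of rewriting the runner as a coroutine; wrapping the existing function is the smaller change.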


Summary
Adds a new pandas-debugger RL environment for debugging broken pandas/NumPy data pipelines.
What it does
The environment presents the model with a broken pandas/NumPy code snippet containing one of 10 bug categories and asks it to identify and fix the bug.
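For illustration, a single task could be represented roughly as below. The `DebugTask` record, its field names, and `build_program` are assumptions made for this sketch, not the PR's schema; only the idea of pairing buggy code, fixed code, and a check expression comes from the description. The pandas snippets are stored as strings and are not executed here.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class DebugTask:
    """Hypothetical task record; field names are assumptions."""
    bug_type: str
    buggy_code: str
    fixed_code: str
    check_expr: str


def build_program(code: str, check_expr: str) -> str:
    """Append the check as an assert so a failing check exits non-zero."""
    return f"{code.rstrip()}\nassert {check_expr}\n"


task = DebugTask(
    bug_type="inplace_return",
    buggy_code=(
        "import pandas as pd\n"
        "df = pd.DataFrame({'a': [3, 1, 2]})\n"
        "df = df.sort_values('a', inplace=True)  # rebinds df to None\n"
    ),
    fixed_code=(
        "import pandas as pd\n"
        "df = pd.DataFrame({'a': [3, 1, 2]})\n"
        "df = df.sort_values('a')\n"
    ),
    check_expr="df['a'].tolist() == [1, 2, 3]",
)

# Both variants are valid Python; only their runtime behavior differs.
program = build_program(task.fixed_code, task.check_expr)
ast.parse(program)
ast.parse(build_program(task.buggy_code, task.check_expr))
```

The resulting `program` string is what a subprocess runner would execute, so the grader only needs the exit code to tell pass from fail.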
Bug categories (10)
- off_by_one: index/slice off-by-one errors
- dtype_cast: implicit type coercions producing NaN/wrong output
- merge_key: wrong join key or join type
- agg_axis: axis confusion in aggregations
- fillna_method: wrong fill strategy
- groupby_reset: missing reset_index() after groupby
- str_strip: string whitespace issues corrupting matches
- sort_ascending: wrong sort direction
- inplace_return: inplace=True + assignment = None bug
- copy_alias: view vs copy confusion causing silent no-ops

Why this fills a gap
None of the 37 existing environments cover data wrangling/debugging. Data engineering is the #1 ML practitioner pain point — an RL model that can reliably debug pandas pipelines has high practical value.
Reward signal
Graded: 0 / 0.25 / 0.5 / 1.0 — richer than binary for RL training signal.
Tests
39 unit tests across all categories and edge cases. No external services required — uses subprocess with timeout isolation.
Interested in the Application-Only tier if this qualifies. Happy to add more bug categories or increase task complexity based on feedback.
Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com
Note
Medium Risk
Introduces a new environment that executes model-produced Python in a subprocess for scoring; while isolated with timeouts, any code-execution harness changes carry moderate reliability/sandboxing risk.
Overview
Adds a new pandas-debugger environment that trains/evaluates models on fixing single-bug pandas/NumPy data-wrangling snippets, scoring via XML response parsing plus execution-based assertions.
Implements an embedded task bank (buggy vs fixed code + check_expr), a subprocess-based safe runner with timeouts, and a rubric combining correctness_reward, format_reward, and a lightweight reasoning keyword bonus.
Includes packaging metadata (pyproject.toml), documentation (README.md), and a comprehensive pytest suite validating task integrity, reward behavior, and end-to-end environment loading.