
Add pandas-debugger RL environment for data pipeline debugging#1344

Open
Open

l69d wants to merge 1 commit into PrimeIntellect-ai:main from l69d:add-pandas-debugger-env


Conversation


@l69d l69d commented May 11, 2026

Summary

Adds a new pandas-debugger RL environment for debugging broken pandas/NumPy data pipelines.

What it does

The environment presents the model with a broken pandas/NumPy code snippet containing one of 10 bug categories and asks it to identify and fix the bug.

Bug categories (10)

  • off_by_one — index/slice off-by-one errors
  • dtype_cast — implicit type coercions producing NaN/wrong output
  • merge_key — wrong join key or join type
  • agg_axis — axis confusion in aggregations
  • fillna_method — wrong fill strategy
  • groupby_reset — missing reset_index() after groupby
  • str_strip — string whitespace issues corrupting matches
  • sort_ascending — wrong sort direction
  • inplace_return — inplace=True + assignment = None bug
  • copy_alias — view vs copy confusion causing silent no-ops
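
The inplace_return category can be illustrated with a minimal pandas snippet (illustrative only, not a task from the PR's actual task bank):

```python
import pandas as pd

# Buggy pattern: sort_values(..., inplace=True) returns None, so
# assigning the result back clobbers the DataFrame.
df = pd.DataFrame({"a": [3, 1, 2]})
result = df.sort_values("a", inplace=True)
print(result)  # None

# The fix: drop inplace=True and keep the returned sorted copy.
df = pd.DataFrame({"a": [3, 1, 2]})
df = df.sort_values("a")
print(df["a"].tolist())  # [1, 2, 3]
```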

Why this fills a gap

None of the 37 existing environments cover data wrangling/debugging. Data engineering is consistently cited as a top pain point for ML practitioners — an RL model that can reliably debug pandas pipelines has high practical value.

Reward signal

Graded: 0 / 0.25 / 0.5 / 1.0 — a richer training signal for RL than binary pass/fail.
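
A minimal sketch of such a graded ladder (the function name and arguments are assumptions, not the PR's actual code): 0 for an unparseable response, 0.25 for code that crashes, 0.5 for code that runs but fails the check, 1.0 for a passing fix.

```python
def graded_reward(parsed: bool, runs: bool, check_passed: bool) -> float:
    if not parsed:
        return 0.0      # response could not be parsed into code
    if not runs:
        return 0.25     # syntactically valid but raises at runtime
    if not check_passed:
        return 0.5      # executes cleanly but produces wrong output
    return 1.0          # passes the correctness check
```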

Tests

39 unit tests across all categories and edge cases. No external services required — uses subprocess with timeout isolation.

Interested in the Application-Only tier if this qualifies. Happy to add more bug categories or increase task complexity based on feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


Note

Medium Risk
Introduces a new environment that executes model-produced Python in a subprocess for scoring; while isolated with timeouts, any code-execution harness changes carry moderate reliability/sandboxing risk.

Overview
Adds a new pandas-debugger environment that trains/evaluates models on fixing single-bug pandas/NumPy data-wrangling snippets, scoring via XML response parsing plus execution-based assertions.

Implements an embedded task bank (buggy vs fixed code + check_expr), a subprocess-based safe runner with timeouts, and a rubric combining correctness_reward, format_reward, and a lightweight reasoning keyword bonus.
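
A subprocess-based runner with timeout isolation could be sketched as follows (the name `run_code_safe` and the `check_expr` convention are assumptions about the PR's implementation, not its actual code):

```python
import subprocess
import sys

def run_code_safe(code: str, check_expr: str, timeout: float = 10.0):
    # Run the candidate code plus an assertion on check_expr in a
    # fresh interpreter; kill it if it exceeds the timeout.
    script = f"{code}\nassert {check_expr}\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"
```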

Includes packaging metadata (pyproject.toml), documentation (README.md), and a comprehensive pytest suite validating task integrity, reward behavior, and end-to-end environment loading.

Reviewed by Cursor Bugbot for commit 899e4b8. Bugbot is set up for automated code reviews on this repo.

New environment covering 10 bug categories in pandas/numpy code:
off_by_one, dtype_cast, merge_key, agg_axis, fillna_method,
groupby_reset, str_strip, sort_ascending, inplace_return, copy_alias.

Graded reward signal (0/0.25/0.5/1.0) for richer RL training signal.
No sandbox required — subprocess with timeout isolation.
39 tests across all bug categories and edge cases.

@cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.


# our check_expr itself is broken; give benefit of the doubt
return 0.5

return score


Reward ladder 0.5 level never triggers for intended case

High Severity

The documented reward ladder says 0.5 = "runs but wrong," but the implementation never returns 0.5 for that case. When the model's code runs without errors but fails the check, it returns 0.25 (same as "syntactically valid but crashes"). The 0.5 path only triggers when the ground-truth check_expr is itself broken, which is a degenerate sanity-check scenario. The advertised 4-level graded signal (0/0.25/0.5/1.0) is effectively 3-level (0/0.25/1.0), reducing the RL training signal richness that the PR specifically highlights as a design goal.


@@ -0,0 +1,645 @@
"""


Missing environments/README.md update for new environment

Low Severity

This PR adds a new pandas_debugger environment to the environments/ folder but does not update environments/README.md to list it. The project rules require that any PR adding or removing an environment must update that README with the new environment listed under the appropriate category/pattern section.


Triggered by project rule: BugBot Instructions


}
patterns = keywords.get(bug_type, [])
text_lower = text.lower()
return any(re.search(p, text_lower) for p in patterns)


Regex patterns with uppercase never match lowered text

Low Severity

The patterns "None" and "SettingWithCopy" contain uppercase characters but are matched against text.lower() via re.search, which is case-sensitive by default. These patterns can never match, making them dead code. The function still works in practice because sibling patterns like "inplace" / "return" and "copy" / "view" compensate, but the intended matching for those specific terms is silently broken.
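
The dead-pattern behavior and the obvious fix can be demonstrated directly (sample pattern list from the finding; the surrounding text is made up for illustration):

```python
import re

patterns = ["None", "SettingWithCopy"]
text_lower = "assigning the settingwithcopy result returns none".lower()

# Case-sensitive search against lowered text never matches:
dead = any(re.search(p, text_lower) for p in patterns)

# re.IGNORECASE restores the intended matching:
fixed = all(re.search(p, text_lower, re.IGNORECASE) for p in patterns)
print(dead, fixed)  # False True
```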



return 1.0

# Partial: also check if the GROUND TRUTH passes (sanity; should always be True)
gt_passed, _ = _run_code_safe(fixed_code_gt, check_expr)


Blocking subprocess.run in async reward function serializes rollouts

Medium Severity

The async correctness_reward function calls _run_code_safe, which uses synchronous subprocess.run with a 10-second timeout — potentially twice per rollout (model code + ground-truth sanity check). The project docs explicitly state that sync operations in reward functions block all concurrent rollouts and "should be avoided at all costs." Using asyncio.create_subprocess_exec or wrapping with asyncio.to_thread would prevent event-loop starvation.
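
The suggested `asyncio.to_thread` fix could be sketched like this (function names are assumptions, not the PR's code): the blocking call moves to a worker thread so one slow rollout cannot starve the event loop.

```python
import asyncio
import subprocess
import sys

async def run_code_async(code: str, timeout: float = 10.0) -> bool:
    def _blocking() -> bool:
        # Same synchronous subprocess call, but executed off the
        # event loop in a worker thread.
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True, timeout=timeout,
            )
            return proc.returncode == 0
        except subprocess.TimeoutExpired:
            return False

    return await asyncio.to_thread(_blocking)
```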


