Cursor Bugbot has reviewed your changes and found 1 potential issue.
        continue
    step["completion_advantages"] = [step_advantage] * len(
        tokens["completion_ids"]
    )
RubricGroup produces incorrect completion_advantages after aggregation
Medium Severity
When RubricGroup.score_group is used, each sub-rubric's score_group calls _populate_step_completion_advantages with that sub-rubric's partial advantage. Since step advantages are only set when None, the first sub-rubric's partial advantage "sticks" on the steps and completion_advantages, while RubricGroup never recomputes them from the final aggregated reward. The result is completion_advantages reflecting only one sub-rubric's contribution rather than the total.
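The "sticking" behavior described above can be reproduced in a few lines. This is only a minimal sketch, not the library's code: `populate_step_completion_advantages` is a hypothetical stand-in for `_populate_step_completion_advantages`, the step layout (`advantage`, `tokens["completion_ids"]`) is assumed, and the group total is assumed to be a plain sum of the sub-rubric advantages.

```python
def populate_step_completion_advantages(steps: list[dict], advantage: float) -> None:
    """Hypothetical stand-in for _populate_step_completion_advantages."""
    for step in steps:
        # The step advantage is only written when nothing has been set yet.
        if step.get("advantage") is None:
            step["advantage"] = advantage
        n_tokens = len(step["tokens"]["completion_ids"])
        step["completion_advantages"] = [step["advantage"]] * n_tokens


# One trajectory step with three completion tokens and no advantage yet.
steps = [{"advantage": None, "tokens": {"completion_ids": [11, 12, 13]}}]

# Two sub-rubrics score the same group, each passing only its partial advantage.
sub_rubric_advantages = [0.5, -0.25]
for partial in sub_rubric_advantages:
    populate_step_completion_advantages(steps, partial)

# Assumption: the group combines sub-rubric advantages by summing them.
aggregated = sum(sub_rubric_advantages)

print(steps[0]["completion_advantages"])  # values from the first sub-rubric only
print(aggregated)                         # the total the steps never see
```

Under these assumptions the step keeps the first sub-rubric's value, and nothing rewrites it once the aggregated value is known.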
This was already the implementation before: the rubric group only does aggregated trajectory-level rewards. If someone wants step-level advantages, they will probably have a single reward function in a rubric that does this. There is probably a cleaner way to get around this, but I'm not sure it is worth fixing right now.


Description
Added opt-in support for verifier-provided step-level advantages:
- use_verifiers_advantages on the base Environment, auto-populated into state so environments can explicitly signal this mode to prime-rl
- completion_advantages on each trajectory step, computed as per-completion-token values by expanding the step scalar advantage across that step’s completion_ids (one value per token).
Type of Change
Testing
uv run pytest locally.
Checklist
Additional Notes
Note
Medium Risk
Changes rollout/state/output schema and modifies grouped scoring post-processing, which can affect downstream training/serialization expectations despite being additive and covered by tests.
Overview
Adds opt-in plumbing to propagate use_verifiers_advantages from Environment into each rollout State (and makes it available via states_to_outputs) so downstream consumers can detect/verbalize this mode.
Enhances grouped scoring (Rubric.score_group) to populate each trajectory step with a new completion_advantages field, expanding the step’s scalar advantage into a per-completion-token list (and filling missing step reward/advantage from the state as a fallback). Includes tests covering default fallback behavior vs preserving an existing step advantage.
Also hardens gepa.py when reading result.best_candidate by handling non-dict candidates.
Written by Cursor Bugbot for commit 0b96ae2.
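The grouped-scoring behavior summarized in the overview (state fallback, then per-token expansion) might look roughly like the sketch below. It is an illustration under assumptions rather than the PR's implementation: the function name `fill_completion_advantages` and the `trajectory`, `reward`, `advantage`, and `tokens` keys are hypothetical; only `completion_ids` and `completion_advantages` are names taken from the PR text.

```python
def fill_completion_advantages(state: dict) -> None:
    """Hypothetical helper: state fallback, then per-token expansion."""
    for step in state["trajectory"]:
        # Fall back to the state-level values only when the step has none.
        if step.get("reward") is None:
            step["reward"] = state.get("reward")
        if step.get("advantage") is None:
            step["advantage"] = state.get("advantage")

        tokens = step.get("tokens")
        if tokens is None or step["advantage"] is None:
            # Nothing to expand for this step.
            continue

        # One advantage value per completion token.
        step["completion_advantages"] = [step["advantage"]] * len(
            tokens["completion_ids"]
        )


state = {
    "reward": 1.0,
    "advantage": 0.25,
    "trajectory": [
        # No step-level values: falls back to the state.
        {"reward": None, "advantage": None, "tokens": {"completion_ids": [1, 2]}},
        # Existing step advantage is preserved.
        {"reward": 0.5, "advantage": -1.0, "tokens": {"completion_ids": [3, 4, 5]}},
    ],
}
fill_completion_advantages(state)
```

The two steps mirror the two tested cases mentioned in the overview: default fallback from the state, and preservation of an advantage that a step already carries.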