
send step level advantages to prime rl #941

Open
eligotts wants to merge 3 commits into main from eli/step-level-adv

Conversation


@eligotts eligotts commented Feb 20, 2026

Description

Added opt-in support for verifier-provided step-level advantages:

  • Introduced use_verifiers_advantages on the base Environment and auto-populated it into state so environments can explicitly signal this mode to prime-rl
  • Added completion_advantages on each trajectory step, computed by expanding the step's scalar advantage across that step's completion_ids (one value per completion token)
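The per-token expansion described above can be sketched roughly as follows; the helper name is hypothetical, but `completion_ids` and the scalar step advantage follow the description:

```python
def expand_step_advantage(step_advantage: float, completion_ids: list[int]) -> list[float]:
    """Broadcast a step's scalar advantage across its completion tokens,
    yielding one value per completion token (illustrative sketch only)."""
    return [step_advantage] * len(completion_ids)

# A step with advantage 0.5 and four completion tokens:
print(expand_step_advantage(0.5, [101, 7, 42, 13]))  # [0.5, 0.5, 0.5, 0.5]
```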

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Changes rollout/state/output schema and modifies grouped scoring post-processing, which can affect downstream training/serialization expectations despite being additive and covered by tests.

Overview
Adds opt-in plumbing to propagate use_verifiers_advantages from Environment into each rollout State (and makes it available via states_to_outputs) so downstream consumers can detect/verbalize this mode.

Enhances grouped scoring (Rubric.score_group) to populate each trajectory step with a new completion_advantages field, expanding the step’s scalar advantage into a per-completion-token list (and filling missing step reward/advantage from the state as a fallback). Includes tests covering default fallback behavior vs preserving an existing step advantage.
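The fallback behavior mentioned above (fill a step's missing reward/advantage from the state, but preserve an existing step advantage) can be sketched like this; the dict-based step/state shapes and the helper name are assumptions, not the actual schema:

```python
def fill_step_from_state(step: dict, state: dict) -> None:
    """Fall back to state-level values only where the step has none
    of its own (hypothetical sketch of the described behavior)."""
    if step.get("reward") is None:
        step["reward"] = state.get("reward")
    if step.get("advantage") is None:
        step["advantage"] = state.get("advantage")

step = {"reward": None, "advantage": 0.7}
fill_step_from_state(step, {"reward": 1.0, "advantage": -0.2})
print(step)  # {'reward': 1.0, 'advantage': 0.7} -- existing advantage preserved
```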

Also hardens gepa.py when reading result.best_candidate by handling non-dict candidates.

Written by Cursor Bugbot for commit 0b96ae2. This will update automatically on new commits.

@eligotts changed the title from "Eli/step level adv" to "send step level advantages tp prime rl" on Feb 20, 2026

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


    continue
step["completion_advantages"] = [step_advantage] * len(
    tokens["completion_ids"]
)


RubricGroup produces incorrect completion_advantages after aggregation

Medium Severity

When RubricGroup.score_group is used, each sub-rubric's score_group calls _populate_step_completion_advantages with that sub-rubric's partial advantage. Since step advantages are only set when None, the first sub-rubric's partial advantage "sticks" on the steps and completion_advantages, while RubricGroup never recomputes them from the final aggregated reward. The result is completion_advantages reflecting only one sub-rubric's contribution rather than the total.
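The set-only-when-None behavior described above can be demonstrated with a minimal sketch (dict-based steps and the helper are hypothetical, for illustration only):

```python
def populate_advantage(step: dict, advantage: float) -> None:
    # Only fill the advantage when it is not already set, so the first
    # caller's value "sticks" and later calls are silently ignored.
    if step.get("advantage") is None:
        step["advantage"] = advantage
        step["completion_advantages"] = [advantage] * len(step["completion_ids"])

step = {"completion_ids": [101, 7, 42], "advantage": None}
populate_advantage(step, 0.3)  # first sub-rubric's partial advantage
populate_advantage(step, 0.9)  # aggregated value is ignored: 0.3 sticks
print(step["completion_advantages"])  # [0.3, 0.3, 0.3]
```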


@eligotts (Contributor, Author) replied:

This was already the implementation before: RubricGroup only does aggregated trajectory-level rewards. If someone wants step-level advantages, they will probably use a single reward function in a rubric. There is probably a cleaner way to get around this, but I'm not sure it's worth fixing right now.

@eligotts changed the title from "send step level advantages tp prime rl" to "send step level advantages to prime rl" on Feb 20, 2026
