fix: align GRPO prompt format with SFT training format by abrichr · Pull Request #61 · OpenAdaptAI/openadapt-ml

abrichr · 2026-03-22T22:17:24Z

Fixes zero-reward GRPO runs caused by prompt format mismatch between SFT and GRPO. Adds Thought/Action format, JSON parsing, and debug logging.

The GRPO rollout prompt was missing the "Thought:" line and action history that the SFT training uses. Models fine-tuned via SFT output "Thought: ...\nAction: CLICK(...)" but the GRPO prompt didn't prompt for this format, causing verbose free-form output that couldn't be parsed → reward 0.0 → zero gradients.

Changes:
- Add "Thought:" and "Action:" prompt lines matching SFT format
- Add action_history parameter for step context
- Parser extracts action from "Action: ..." line before regex matching
- Parser handles JSON format {"action_type": "click", "coordinate": [x,y]}
- Debug logging of raw VLM output for zero-reward diagnosis

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
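
For readers without the diff at hand, here is a minimal sketch of the idea described above. It is not the code merged in this PR. The function names `build_prompt` and `parse_action`, the prompt wording, the `CLICK(x=..., y=...)` argument syntax, and the returned dict shape are all assumptions made for illustration. Only the "Thought:"/"Action:" lines, the `action_history` parameter, and the JSON keys `action_type` and `coordinate` come from the description.

```python
import json
import re

# Assumed argument syntax for CLICK; the PR text only shows "CLICK(...)".
CLICK_RE = re.compile(
    r"CLICK\(\s*x\s*=\s*(\d+(?:\.\d+)?)\s*,\s*y\s*=\s*(\d+(?:\.\d+)?)\s*\)",
    re.IGNORECASE,
)


def build_prompt(goal, action_history=None):
    """Build a rollout prompt that asks for the SFT-style Thought/Action format.

    The exact wording is hypothetical; the point is that the prompt names the
    same "Thought:" and "Action:" lines the SFT data used, plus prior actions.
    """
    lines = [f"Goal: {goal}"]
    if action_history:
        lines.append("Previous actions:")
        lines.extend(f"  {i}. {a}" for i, a in enumerate(action_history, 1))
    lines.append("Respond in this format:")
    lines.append("Thought: <your reasoning>")
    lines.append("Action: <one action>")
    return "\n".join(lines)


def parse_action(text):
    """Return a click action dict, or None if nothing could be parsed."""
    # Step 1: narrow to the text after "Action:" when that marker is present,
    # so free-form reasoning in the Thought line cannot confuse the matcher.
    marker = re.search(r"^Action:\s*(.+)$", text, re.MULTILINE)
    candidate = marker.group(1).strip() if marker else text.strip()

    # Step 2: try the JSON form {"action_type": "click", "coordinate": [x, y]}.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start != -1 and end > start:
        try:
            obj = json.loads(candidate[start:end + 1])
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and obj.get("action_type") == "click":
            coord = obj.get("coordinate")
            if (
                isinstance(coord, list)
                and len(coord) == 2
                and all(isinstance(c, (int, float)) for c in coord)
            ):
                return {"type": "click", "x": float(coord[0]), "y": float(coord[1])}

    # Step 3: fall back to regex matching on the CLICK(...) form.
    m = CLICK_RE.search(candidate)
    if m:
        return {"type": "click", "x": float(m.group(1)), "y": float(m.group(2))}

    # Unparseable output: the caller would assign reward 0.0 and can log `text`.
    return None
```

In this sketch a `None` result is the case the debug logging targets: logging the raw model output at that point shows whether the model ignored the format or the parser is too strict.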

abrichr merged commit 04e6e9f into main Mar 22, 2026
0 of 4 checks passed

abrichr merged 1 commit into main from fix/sft-grpo-prompt-alignment

abrichr commented Mar 22, 2026
