fix: align GRPO prompt format with SFT training format #61

Merged
abrichr merged 1 commit into main from fix/sft-grpo-prompt-alignment on Mar 22, 2026

@abrichr commented Mar 22, 2026

Fixes zero-reward GRPO runs caused by a prompt format mismatch between SFT and GRPO. Adds the Thought/Action prompt format, JSON action parsing, and debug logging.

The GRPO rollout prompt was missing the "Thought:" line and the action
history that SFT training uses. Models fine-tuned via SFT output
"Thought: ...\nAction: CLICK(...)", but the GRPO prompt did not ask
for this format, so the model produced verbose free-form output that
could not be parsed. Every rollout then scored reward 0.0, which
yields zero gradients.
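Why a uniform reward of 0.0 leads to zero gradients can be illustrated with the group-relative advantage that GRPO-style training typically uses. This is a minimal sketch, not this repository's code: the function name `group_relative_advantages` and the `eps` value are assumptions, and some implementations skip the division by the standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps).

    Hypothetical helper for illustration; real trainers may differ in detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout in the group was unparseable, so every reward is 0.0.
all_unparseable = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Some rollouts parsed and succeeded, so the group carries a learning signal.
mixed = group_relative_advantages([0.0, 1.0, 0.0, 1.0])
```

When all rewards in a group are identical, each advantage is exactly zero, so the policy-gradient term contributes nothing regardless of what the model generated. In the mixed group, successful rollouts get positive advantages and failed ones negative.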

Changes:
- Add "Thought:" and "Action:" prompt lines matching SFT format
- Add action_history parameter for step context
- Parser extracts action from "Action: ..." line before regex matching
- Parser handles JSON format {"action_type": "click", "coordinate": [x,y]}
- Add debug logging of raw VLM output for zero-reward diagnosis
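
The changes above could look roughly like the following sketch. It is an illustration under stated assumptions, not the merged code: `build_prompt`, `parse_action`, the prompt wording, and all three regexes are hypothetical. The DSL branch keeps the arguments of `CLICK(...)` as a raw string because the exact argument syntax is not shown in this PR.

```python
import json
import logging
import re

logger = logging.getLogger(__name__)

# Hypothetical patterns; the repository's actual regexes are not shown in the PR.
ACTION_LINE_RE = re.compile(r"^\s*Action:\s*(.+)$", re.MULTILINE)
DSL_RE = re.compile(r"([A-Za-z_]+)\((.*)\)", re.DOTALL)
JSON_RE = re.compile(r"\{.*\}", re.DOTALL)


def build_prompt(instruction, action_history=None):
    """Build a rollout prompt that asks for the Thought/Action format.

    The wording is an assumption; only the "Thought:" and "Action:" lines
    and the action history are taken from the PR description.
    """
    steps = action_history or []
    history = "\n".join(f"{i}. {a}" for i, a in enumerate(steps, 1))
    return (
        f"Goal: {instruction}\n"
        f"Previous actions:\n{history or 'None'}\n"
        "Respond in this format:\n"
        "Thought: <reasoning>\n"
        "Action: <action>\n"
    )


def parse_action(raw):
    """Return a dict describing the action, or None if nothing parses."""
    # Log the raw output so zero-reward rollouts can be inspected later.
    logger.debug("raw VLM output: %r", raw)

    # Prefer the content of the "Action: ..." line; fall back to the whole text.
    match = ACTION_LINE_RE.search(raw)
    text = match.group(1).strip() if match else raw.strip()

    # JSON form, e.g. {"action_type": "click", "coordinate": [x, y]}.
    json_match = JSON_RE.search(text)
    if json_match:
        try:
            obj = json.loads(json_match.group(0))
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "action_type" in obj:
            return obj

    # DSL form, e.g. CLICK(...); arguments are left unparsed in this sketch.
    dsl_match = DSL_RE.search(text)
    if dsl_match:
        return {
            "action_type": dsl_match.group(1).lower(),
            "args": dsl_match.group(2).strip(),
        }
    return None
```

Returning `None` for unparseable output keeps the failure explicit, so the caller can assign reward 0.0 and rely on the debug log to see what the model actually produced.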

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abrichr merged commit 04e6e9f into main on Mar 22, 2026
0 of 4 checks passed