fix: increase max_new_tokens to 2048 and make configurable via GRPOConfig #62

Merged
abrichr merged 2 commits into main from fix/max-new-tokens on Mar 23, 2026
Conversation


@abrichr abrichr commented Mar 23, 2026

Fixes zero-reward GRPO runs caused by the 100-token generation limit truncating reasoning models' output.

abrichr and others added 2 commits March 22, 2026 18:17
The GRPO rollout prompt was missing the "Thought:" line and action
history that SFT training uses. Models fine-tuned via SFT output
"Thought: ...\nAction: CLICK(...)", but the GRPO prompt didn't ask
for this format, causing verbose free-form output that couldn't
be parsed → reward 0.0 → zero gradients.

Changes:
- Add "Thought:" and "Action:" prompt lines matching SFT format
- Add action_history parameter for step context
- Parser extracts action from "Action: ..." line before regex matching
- Parser handles JSON format {"action_type": "click", "coordinate": [x,y]}
- Debug logging of raw VLM output for zero-reward diagnosis
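The parser changes above can be sketched as follows. This is a hypothetical reconstruction, not the PR's actual code: the function name `parse_action` and the DSL regex are assumptions, but it follows the described flow of extracting the "Action: ..." line before regex matching and handling the JSON `{"action_type": "click", "coordinate": [x, y]}` form.

```python
import json
import re

# Assumed DSL form, e.g. CLICK(120, 340)
ACTION_RE = re.compile(r"(\w+)\((.*?)\)")

def parse_action(raw_output: str):
    """Return (action_type, args), or None if unparseable (→ reward 0.0)."""
    # Extract the "Action: ..." line so verbose "Thought:" text is ignored.
    action_line = None
    for line in raw_output.splitlines():
        if line.strip().startswith("Action:"):
            action_line = line.split("Action:", 1)[1].strip()
            break
    if action_line is None:
        return None

    # JSON format: {"action_type": "click", "coordinate": [x, y]}
    if action_line.startswith("{"):
        try:
            obj = json.loads(action_line)
            return obj["action_type"].upper(), obj.get("coordinate", [])
        except (json.JSONDecodeError, KeyError):
            return None

    # Fallback: regex match on the DSL form
    m = ACTION_RE.match(action_line)
    if m is None:
        return None
    args = [a.strip() for a in m.group(2).split(",") if a.strip()]
    return m.group(1).upper(), args
```

Returning `None` rather than raising keeps the reward function simple: any unparseable rollout maps directly to reward 0.0, which is exactly the failure mode the debug logging is meant to surface.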

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default of 100 tokens truncated reasoning models mid-thought,
producing unparseable output → DONE → reward 0.0 → zero gradients.
Caused 4 failed training runs (~20 GPU-hours wasted).

- Add max_new_tokens to GRPOConfig (default 2048)
- Use config value instead of hardcoded 100
- Add truncation warning when generation hits the limit
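A minimal sketch of the config change, assuming a dataclass-style `GRPOConfig` and a helper named `check_truncation` (both names are illustrative, not the PR's actual identifiers): the limit becomes a configurable field defaulting to 2048, and a warning fires when a rollout uses its full token budget.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class GRPOConfig:
    max_new_tokens: int = 2048  # previously hardcoded to 100

def check_truncation(config: GRPOConfig, generated_token_count: int) -> bool:
    """Warn when generation hit the limit, i.e. output was likely truncated."""
    truncated = generated_token_count >= config.max_new_tokens
    if truncated:
        logger.warning(
            "Generation hit max_new_tokens=%d; output likely truncated "
            "mid-thought, which can yield unparseable actions and reward 0.0.",
            config.max_new_tokens,
        )
    return truncated
```

Hitting the budget is only a heuristic for truncation (a completion can legitimately end exactly at the limit), but it is cheap to check and catches the failure mode that wasted the four training runs above.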

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abrichr abrichr merged commit fecf461 into main Mar 23, 2026
0 of 4 checks passed
