Change online training verifier #1371

jacklanchantin · 2025-10-14T13:44:40Z

What does this PR do? Please describe:

Adds GrpoLossConfig adv_std_normarlization to control standard-deviation normalization of advantages (for DrGRPO)
Adds GrpoLossConfig loss_token_mean to normalize the loss over all tokens
Skips the reference-model logprob computation for the KL term when beta == 0
Adds tis_imp_ratio_cap to apply a truncated importance sampling (TIS) correction
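
For reviewers, here is a minimal pure-Python sketch of what the four options above are generally understood to do. It is not the code in this PR: the function names, argument names, the epsilon, and the simple KL estimate are all assumptions made for illustration, and the actual implementation operates on batched tensors.

```python
import math
from typing import Callable, List


def group_advantages(rewards: List[float], std_normalize: bool) -> List[float]:
    """Center rewards within one rollout group; optionally divide by the group std."""
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if not std_normalize:
        # DrGRPO-style advantages: mean-centered only, no std scaling.
        return centered
    std = math.sqrt(sum(c * c for c in centered) / len(rewards))
    return [c / (std + 1e-6) for c in centered]  # epsilon is an assumed value


def tis_weight(train_logp: float, rollout_logp: float, cap: float) -> float:
    """Truncated importance ratio min(pi_train / pi_rollout, cap) for one token."""
    return min(math.exp(train_logp - rollout_logp), cap)


def aggregate_loss(per_token_losses: List[List[float]], token_mean: bool) -> float:
    """Reduce per-token losses to a scalar."""
    if token_mean:
        # Average over every token in the batch, so long sequences weigh more.
        flat = [x for seq in per_token_losses for x in seq]
        return sum(flat) / len(flat)
    # Otherwise average within each sequence first, then across sequences.
    seq_means = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(seq_means) / len(seq_means)


def kl_term(
    beta: float,
    policy_logps: List[float],
    ref_logps_fn: Callable[[], List[float]],
) -> float:
    """KL penalty; the reference forward pass is skipped when beta == 0."""
    if beta == 0:
        return 0.0  # ref_logps_fn is never called
    ref_logps = ref_logps_fn()
    diffs = [p - r for p, r in zip(policy_logps, ref_logps)]
    return beta * sum(diffs) / len(diffs)  # simple mean log-ratio estimate (assumed)
```

With a cap of 2 (the value mentioned in the commit history), a token whose trainer probability is far above its rollout probability contributes a weight of at most 2, while ratios below the cap pass through unchanged.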

Fixes #{issue number}

Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.

Check list:

Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
Did you read the contributor guideline?
Did you make sure that your PR does only one thing instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests?
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

…airseq2 into jacklanchantin/drgrpo

…2 into jacklanchantin/drgrpo

…dd a comment for calling newer vllm generate function (#1396)
* set VLLM_ALLOW_INSECURE_SERIALIZATION=1 for newer vllm versions
* update generate function to align with newer vllm version

…q2 into jacklanchantin/drgrpo

* do not throttle client-server port
* maybe_sync_fix
* move tokenizer
* ppl/logp reward
* remove prepare_preference_batch func (intended for online dpo)
* keep a no-op prepare_preference_batch method to instantiate
* make logging clearer
* add additional logic to add whitespace if needed; do not reuse the prefix token from fs2
* pass prefix text rather than tokens in rm (it would also support an rm tokenizer different from the policy model)
* clear up string_input flag

Co-authored-by: uralik <kulikov@cs.nyu.edu>
Co-authored-by: jacklanchantin <jacklanchantin@gmail.com>

…airseq2 into jacklanchantin/drgrpo

drgrpo

fd7267f

meta-cla bot added the CLA Signed label Oct 14, 2025

Jack Lanchantin and others added 11 commits October 14, 2025 16:13

get vllm logps

cb6f7a9

Update _wandb.py

d6acc63

remove beta check

7b72df9

Merge branch 'jacklanchantin/drgrpo' of github.com:facebookresearch/f…

7fc3b2f

…airseq2 into jacklanchantin/drgrpo

format

502fa69

revert

79382d3

add importance sampling correction

97e8dca

dont run ref model forward if beta==0

54c9d98

add tis ratio clamp = 2

acb0840

clean up

50d21dd

configs

ccfa63b

jacklanchantin changed the title from "drgrpo" to "Importance Sampling Correction, and DrGRPO args" Oct 22, 2025

Jack Lanchantin added 5 commits October 22, 2025 20:59

clean up

bb49312

default

bd4b073

var name

6919a4c

var name

d910891

only use tis_imp_ratio_cap

b762625

jacklanchantin changed the title from "Importance Sampling Correction, and DrGRPO args" to "Change online training verifier" Oct 22, 2025

Jack Lanchantin added 10 commits October 23, 2025 03:57

batched inputs

5dff68a

use tis_drgrpo files

cce97ce

size

178fb69

match tis_grpo

536ce2b

fix batching/microbatching bugs

a036e92

black/isort

ca043a5

Merge branch 'online_training' of github.com:facebookresearch/fairseq…

55dc39a

…2 into jacklanchantin/drgrpo

revert qwen card

bdf6e4b

bypass reference_model if None

cdbec3c

add SelfAugmentingExtractor for llm judge

2645498

jacklanchantin and others added 3 commits November 13, 2025 17:25

tokenizer

5cc0610

set VLLM_ALLOW_INSECURE_SERIALIZATION=1 for newer vllm versions and a…

7f7b50d

…dd a comment for calling newer vllm generate function (#1396)
* set VLLM_ALLOW_INSECURE_SERIALIZATION=1 for newer vllm versions
* update generate function to align with newer vllm version

comment out unused

14d2571

jacklanchantin force-pushed the jacklanchantin/drgrpo branch from 9090bd5 to 14d2571 November 14, 2025 16:02

jacklanchantin and others added 23 commits November 14, 2025 18:45

logging

fcb24b2

logging

8d40e31

logging

424338a

fix src_key_text

45cd0aa

logging

214be19

maybe sync fix

fe061c3

do not throttle client-server port

7bbd503

Merge branch 'kulikov/init_fix' of github.com:facebookresearch/fairse…

58b1df7

…q2 into jacklanchantin/drgrpo

maybe_sync_fix

1641e5c

move tokenizer

9068b24

move tokenizer

399464d

Merge branch 'jacklanchantin/drgrpo' of github.com:facebookresearch/f…

ba27796

…airseq2 into jacklanchantin/drgrpo

comment

a03edf6

divide by 0 protection in ppl loss

c57b70d

stability

5adcb1d

clip if no think

ba349a6

print prompt text

c5f6cf6

fixes

d14b610

entropy stabilization

4d61714

clipping

fdef776

clean

6c6048c

start nll loss

a6013ce

jacklanchantin force-pushed the jacklanchantin/drgrpo branch 2 times, most recently from f967068 to a6013ce December 2, 2025 05:49

wandb

ad3a5c0


4 participants
