[Pipeline RL] Add support for PipelineRL (#428)
Merged
jlamypoirier merged 134 commits into main, Mar 20, 2026
Conversation
Move the num_labels_in_seq computation from _compute_num_labels_in_seq (called inside forward_backward on the already-packed sequence) to _get_model_input, where document boundaries are available via cropped_lengths. Per-document response token counts are trivially computed and broadcast to token positions, eliminating the need for span-finding on the packed sequence.

Also fixes new_logprobs metric scaling with cross_entropy_splits > 1, and updates test_lm_head to properly handle list-indexed advantages/old_log_probs and verify the new_logprobs extra metric.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
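The broadcast described above can be sketched as follows. This is an illustration with made-up values, not the actual Fast-LLM code (which operates on torch tensors): given per-document token lengths and per-document response-token counts, each document's count is repeated across its token positions, so no span-finding on the packed sequence is needed.

```python
import numpy as np

# Assumed example values: a packed sequence of three documents.
cropped_lengths = np.array([3, 2, 4])      # tokens per document
labels_per_document = np.array([2, 0, 3])  # response tokens per document

# Broadcast each document's count to every one of its token positions.
num_labels_in_seq = np.repeat(labels_per_document, cropped_lengths)
```

Each token position now carries its own document's label count, ready for per-token loss normalization.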
- Use local_world_size=1 (not world_size), since each process sees exactly one GPU via CUDA_VISIBLE_DEVICES in PipelineRL's setup.
- Switch from torch.distributed.broadcast_object_list/broadcast to fast_llm.core.distributed.broadcast_object/broadcast, which work directly on ProcessGroupNCCL backend objects (ProcessGroupPool returns unregistered backends that torch.distributed ops cannot accept).
- Use process_group.shutdown() instead of torch.distributed.destroy_process_group, for the same reason.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs when micro-batches are padded to pad_to_size in LanguageModelBatch.from_documents:

1. advantages and old_log_probabilities (TokenDataBatch) were not padded to match the token batch size. get_cropped_data(label_begin, label_end) then returned fewer elements than logits, causing a shape mismatch in fused_grpo_loss_forward_backward.
2. num_labels_in_seq used cropped_lengths from (begin, label_end), which spans end - begin + prediction_distance tokens, one more than the model input length. Now uses (label_begin, label_end) so segment lengths sum to end - begin, matching the new_log_probs shape.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rafapi previously requested changes on Mar 20, 2026
start_time was set once at the start of iterate() and never reset, causing TimeoutError after 600s of total training time regardless of whether documents were actively flowing. Reset start_time on each successful XREADGROUP response so the timeout only fires when no new documents have arrived for the configured duration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
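The fixed timeout logic can be sketched like this. The function and parameter names are illustrative, not the actual Fast-LLM API, and consume_batch stands in for the XREADGROUP call: the key point is that start_time is reset on every successful read, so the timeout measures idle time rather than total training time.

```python
import time


def iterate(consume_batch, read_timeout_s=600.0, poll_interval_s=1.0):
    """Yield batches until no new documents arrive for read_timeout_s.

    Sketch only: consume_batch() stands in for the Redis XREADGROUP call
    and returns a (possibly empty) list of documents.
    """
    start_time = time.monotonic()
    while True:
        response = consume_batch()
        if response:
            # Reset on each successful read, so the timeout only fires
            # when no new documents have arrived for the configured duration.
            start_time = time.monotonic()
            yield response
        elif time.monotonic() - start_time > read_timeout_s:
            raise TimeoutError(f"no new documents for {read_timeout_s}s")
        else:
            time.sleep(poll_interval_s)
```

With the old behavior (start_time set once, never reset), the same loop would raise after read_timeout_s of total runtime even while documents were flowing.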
searchsorted requires a sorted haystack, but labels_per_document is an unsorted array of per-document label counts. Using it directly caused incorrect doc-index lookups, resulting in wrong (often zero) label counts and nan in the grpo_new_logprobs metric. Fix: use length_cumsum[1:] (sorted) to map each token to its document index, then index labels_per_document with that result.
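A minimal sketch of the fix, using numpy with assumed example values (the real code operates on torch tensors with the array names from the description above): the sorted cumulative lengths map each token position to its document index, and that index then gathers from the unsorted labels_per_document.

```python
import numpy as np

# Assumed example: three documents packed into one sequence.
lengths = np.array([3, 2, 4])              # tokens per document
labels_per_document = np.array([5, 1, 2])  # unsorted per-document label counts

# length_cumsum[1:] is sorted, so searchsorted is valid on it.
length_cumsum = np.concatenate([[0], np.cumsum(lengths)])  # [0, 3, 5, 9]

token_positions = np.arange(length_cumsum[-1])
# Map each token position to its document index via the sorted boundaries;
# side="right" assigns boundary positions to the following document.
doc_index = np.searchsorted(length_cumsum[1:], token_positions, side="right")

# Gather the (unsorted) per-document label count for each token.
per_token_labels = labels_per_document[doc_index]
```

Searching the unsorted labels_per_document directly, as the buggy version did, violates searchsorted's precondition and yields arbitrary document indices.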
Padded tokens and fully masked documents have num_labels_in_seq=0 and loss_mask=0. Without clamping, 0/0=nan poisons the sum even though those positions contribute nothing to the loss. Clamp to min=1 so masked positions produce 0/1=0 instead.
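The 0/0 problem and the clamp fix can be illustrated with made-up values (the real code divides per-token quantities by num_labels_in_seq on torch tensors, e.g. via torch.clamp):

```python
import numpy as np

# Assumed example: two real label positions and one padded position.
num_labels_in_seq = np.array([2.0, 2.0, 0.0])  # 0 at the padded position
loss_mask = np.array([1.0, 1.0, 0.0])
per_token_loss = np.array([0.4, 0.6, 0.0])

# Without clamping, the padded position produces 0/0 = nan,
# which poisons the sum even though it contributes no loss.
with np.errstate(invalid="ignore", divide="ignore"):
    naive = per_token_loss * loss_mask / num_labels_in_seq

# Clamp the denominator to min=1: masked positions become 0/1 = 0.
safe = per_token_loss * loss_mask / np.clip(num_labels_in_seq, 1.0, None)
```

The clamp is safe because any position with a zero label count is also masked out, so dividing it by 1 instead of 0 cannot change the loss.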
bigximik approved these changes on Mar 20, 2026
bigximik (Collaborator) left a comment:
I’ve made several additional changes and addressed @rafapi’s feedback. @jlamypoirier, could you review and confirm? Otherwise, we can merge.
This PR provides the initial integration with PipelineRL, using the GRPO loss.
It introduces:
training_started, step_finished, and training_finished. This enables seamless coordination between Fast-LLM training and PipelineRL-based inference or orchestration components.