Added entropy normalization and different threshold for clipping. #38
Closed
PolarisDane wants to merge 8 commits into
…ted GRPO

Implement RWML from arxiv:2602.05842 as a separate GRPO training pass for world model learning, decoupled from policy training.

Core: the EmbeddingSimilarityReward class computes per-turn binary rewards based on text embedding cosine similarity between model predictions (argmax at observation positions) and actual environment observations.

Training flow per step:
1. compute_log_prob extracts predicted_ids (argmax on logits)
2. RWML GRPO: compute embedding similarity rewards → GRPO advantages → update_actor
3. Policy GRPO: compute task rewards → GRPO advantages → update_actor

Key files:
- world_model_rl.py: EmbeddingSimilarityReward, decode_per_turn_texts, compute_rwml_turn_rewards
- dp_actor.py: return_predicted_ids in forward pass
- http_training_server.py: separate RWML GRPO update step
- generic_agent_loop.py: store per-turn observation texts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
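The reward and advantage steps described in this commit can be sketched roughly as follows. This is a minimal standard-library illustration, not the code from world_model_rl.py: the function names, the 0.8 threshold, the use of the population standard deviation, and the `eps` term are all assumptions made for the example, and the embeddings are taken as plain lists of floats rather than the output of a real embedding model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def binary_turn_rewards(
    predicted: Sequence[Sequence[float]],
    observed: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical value, not taken from the PR
) -> List[float]:
    """One reward per turn: 1.0 if the predicted and observed embeddings
    are at least `threshold` similar, else 0.0."""
    return [
        1.0 if cosine_similarity(p, o) >= threshold else 0.0
        for p, o in zip(predicted, observed)
    ]


def grpo_advantages(group_rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: each reward minus the group mean,
    divided by the group standard deviation (population std, an assumption)."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in group_rewards) / n)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

With binary rewards, a group in which every sample receives the same reward yields all-zero advantages, so such a group contributes no gradient signal in this sketch.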
The rwml config block was defined in the YAML but was never extracted and sent to the training server by the client, unlike wmc_erc, which has explicit extraction logic. Without this, OmegaConf.select(self.config, "rwml") always returned None on the server side.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Alibaba-NLP/gte-large-en-v1.5 uses custom code that requires explicit trust_remote_code=True to load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set HF_HUB_OFFLINE=1 during SentenceTransformer init so it uses the local cache without trying to connect to huggingface.co.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
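One way to set an environment variable only for the duration of a model load is a small context manager, sketched below. The helper name `temporary_env` is hypothetical and this is not necessarily how the commit does it. Note also that a library which reads the variable once at import time would not see a change made this way.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_env(name: str, value: str) -> Iterator[None]:
    """Set os.environ[name] = value inside the `with` block, then restore
    the previous value (or remove the variable if it was unset before)."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# Intended usage (the model construction is left as a comment because it
# needs sentence-transformers and a populated local cache):
#
# with temporary_env("HF_HUB_OFFLINE", "1"):
#     model = SentenceTransformer(model_path, trust_remote_code=True)
```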
Instead of passing the model ID to SentenceTransformer (which tries to reach huggingface.co even when cached), resolve the local snapshot path from the HF hub cache first. Falls back to the original path if not found in cache or if it's already a local directory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
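The lookup-with-fallback described here could look roughly like the sketch below. It uses only the standard library and relies on the hub cache layout `models--{org}--{name}/snapshots/{commit}`, with `refs/{revision}` holding the commit hash. The function name, the `revision` parameter, and the choice to walk the cache directly rather than call a huggingface_hub helper are assumptions for illustration, not the PR's actual implementation.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_local_snapshot(
    model_id: str,
    cache_dir: Optional[str] = None,
    revision: str = "main",
) -> str:
    """Return the cached snapshot directory for `model_id` if one exists,
    otherwise return `model_id` unchanged."""
    # Already a local directory: nothing to resolve.
    if os.path.isdir(model_id):
        return model_id

    if cache_dir is None:
        # Common default location; HF_HUB_CACHE and HF_HOME override it.
        hf_home = os.environ.get(
            "HF_HOME", os.path.join(os.path.expanduser("~"), ".cache", "huggingface")
        )
        cache_dir = os.environ.get("HF_HUB_CACHE", os.path.join(hf_home, "hub"))

    repo_dir = Path(cache_dir) / ("models--" + model_id.replace("/", "--"))
    ref_file = repo_dir / "refs" / revision
    if ref_file.is_file():
        # The ref file contains the commit hash naming the snapshot directory.
        commit = ref_file.read_text().strip()
        snapshot = repo_dir / "snapshots" / commit
        if snapshot.is_dir():
            return str(snapshot)

    # Not cached: fall back to the original identifier.
    return model_id
```

Passing the returned directory to the model loader gives it a local path instead of a model ID, while an uncached model ID passes through unchanged, so the caller does not need a separate code path.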