Added entropy normalization and different threshold for clipping. by PolarisDane · Pull Request #38 · open-tinker/OpenTinker

PolarisDane · 2026-03-17T15:37:58Z

Added entropy normalization and different threshold for clipping.

…ted GRPO

Implement RWML from arxiv:2602.05842 as a separate GRPO training pass for world model learning, decoupled from policy training.

Core: EmbeddingSimilarityReward class computes per-turn binary rewards based on text embedding cosine similarity between model predictions (argmax at observation positions) and actual environment observations.

Training flow per step:
1. compute_log_prob extracts predicted_ids (argmax on logits)
2. RWML GRPO: compute embedding similarity rewards → GRPO advantages → update_actor
3. Policy GRPO: compute task rewards → GRPO advantages → update_actor

Key files:
- world_model_rl.py: EmbeddingSimilarityReward, decode_per_turn_texts, compute_rwml_turn_rewards
- dp_actor.py: return_predicted_ids in forward pass
- http_training_server.py: separate RWML GRPO update step
- generic_agent_loop.py: store per-turn observation texts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
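The reward and advantage steps described above can be sketched in plain Python. This is a minimal illustration, not the code in world_model_rl.py: the function names, the 0.8 threshold, and the use of population standard deviation are assumptions made for the example, and real embeddings would come from the sentence-embedding model rather than hand-written vectors.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def binary_turn_rewards(pred_embs, obs_embs, threshold=0.8):
    """One binary reward per turn: 1.0 if the predicted observation's embedding
    is close enough to the actual observation's embedding, else 0.0.
    The threshold value is a hypothetical placeholder."""
    return [
        1.0 if cosine_similarity(p, o) >= threshold else 0.0
        for p, o in zip(pred_embs, obs_embs)
    ]


def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: center each reward on the group mean and
    divide by the group standard deviation (population std is an assumption)."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

With this shape, a group in which every sample earns the same reward yields all-zero advantages, so such a group contributes no gradient signal to the update.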

The rwml config block was defined in the YAML, but the client never extracted it or sent it to the training server, unlike wmc_erc, which has explicit extraction logic. Without this, OmegaConf.select(self.config, "rwml") always returned None on the server side.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
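The failure mode can be reproduced with plain dictionaries. The sketch below is illustrative only: `select`, `FORWARDED_BLOCKS`, and `build_server_config` are hypothetical names, and the dictionary lookup merely imitates the "return None when the key is missing" behaviour the commit message attributes to OmegaConf.select.

```python
def select(config, dotted_key):
    """Walk a nested dict by dotted key; return None when any part is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


# Optional config blocks the client copies into the server payload.
# Leaving "rwml" out of this tuple reproduces the bug described above.
FORWARDED_BLOCKS = ("wmc_erc", "rwml")


def build_server_config(client_config, forwarded=FORWARDED_BLOCKS):
    """Copy only the explicitly listed blocks into the payload sent to the server."""
    payload = {}
    for name in forwarded:
        block = select(client_config, name)
        if block is not None:
            payload[name] = block
    return payload
```

Because forwarding is opt-in, any new config block added to the YAML also needs an entry on the client side; otherwise the server silently sees None for it.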

Alibaba-NLP/gte-large-en-v1.5 uses custom code that requires explicit trust_remote_code=True to load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Set HF_HUB_OFFLINE=1 during SentenceTransformer init so it uses the local cache without trying to connect to huggingface.co.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
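One way to scope the environment variable to the model load is a small context manager. This is a sketch under assumptions, not the code from the commit: `temporary_env` is a hypothetical helper, and depending on the installed library version the variable may be read when the hub library is first imported, in which case it would have to be set before that import for the override to take effect.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Set an environment variable for the duration of a with-block,
    then restore whatever was there before (including 'unset')."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# Intended usage (the model load itself is omitted so the sketch stays
# dependency-free):
#
#     with temporary_env("HF_HUB_OFFLINE", "1"):
#         model = ...  # construct the embedding model here
```

Restoring the previous value keeps the offline setting from leaking into other code in the same process that may legitimately need network access.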

Instead of passing the model ID to SentenceTransformer (which tries to reach huggingface.co even when cached), resolve the local snapshot path from the HF hub cache first. Falls back to the original path if not found in cache or if it's already a local directory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
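The resolution logic described above might look roughly like the following. It is a stdlib-only sketch that assumes the hub cache layout of `models--<org>--<name>/refs/<revision>` pointing at `snapshots/<commit>`; the function name and the explicit `cache_dir` parameter are hypothetical, and the actual commit may instead use the hub library's own helpers.

```python
from pathlib import Path


def resolve_local_snapshot(model_id_or_path, cache_dir, revision="main"):
    """Return a local snapshot directory for a cached model if one exists,
    otherwise return the input unchanged.

    cache_dir is the hub cache root; it is passed explicitly here because
    its location can be changed through environment variables."""
    candidate = Path(model_id_or_path)
    if candidate.is_dir():
        # Already a local directory: nothing to resolve.
        return str(candidate)

    repo_dir = Path(cache_dir) / ("models--" + model_id_or_path.replace("/", "--"))
    ref_file = repo_dir / "refs" / revision
    if ref_file.is_file():
        # The ref file stores the commit hash that names the snapshot folder.
        commit = ref_file.read_text().strip()
        snapshot = repo_dir / "snapshots" / commit
        if snapshot.is_dir():
            return str(snapshot)

    # Not cached: fall back to the original identifier.
    return model_id_or_path
```

The returned path would then be handed to the embedding model constructor in place of the model ID, together with trust_remote_code=True for models that ship custom code.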

PolarisDane and others added 8 commits March 17, 2026 15:36

Added entropy normalization and different threshold for clipping.

586dd31

update

5562471

Clip by scaling instead of masking.

86a6da5

fix: pass trust_remote_code=True to SentenceTransformer

28de904

fix: load embedding model in offline mode for air-gapped environments

fb8f555

PolarisDane closed this Apr 3, 2026

PolarisDane wants to merge 8 commits into open-tinker:feature/support-co-evolve-v2 from PolarisDane:wm-clip-dev
