Added entropy normalization and different threshold for clipping.#38

Closed
PolarisDane wants to merge 8 commits into open-tinker:feature/support-co-evolve-v2 from PolarisDane:wm-clip-dev

@PolarisDane
Contributor

Added entropy normalization and different threshold for clipping.

PolarisDane and others added 8 commits March 17, 2026 15:36
…ted GRPO

Implement RWML from arxiv:2602.05842 as a separate GRPO training pass for
world model learning, decoupled from policy training.

Core: EmbeddingSimilarityReward class computes per-turn binary rewards based
on text embedding cosine similarity between model predictions (argmax at
observation positions) and actual environment observations.
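As a rough illustration of the binary reward described above, the sketch below scores each turn by cosine similarity between a predicted-text embedding and an observed-text embedding. It is not the PR's `EmbeddingSimilarityReward`; the function names and the `threshold=0.8` default are assumptions, and embeddings are plain lists of floats so the example stays dependency-free.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def binary_turn_rewards(
    pred_embs: Sequence[Sequence[float]],
    obs_embs: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical value, not taken from the PR
) -> List[float]:
    """Return 1.0 for each turn whose similarity reaches the threshold, else 0.0."""
    if len(pred_embs) != len(obs_embs):
        raise ValueError("need one predicted embedding per observed embedding")
    return [
        1.0 if cosine_similarity(p, o) >= threshold else 0.0
        for p, o in zip(pred_embs, obs_embs)
    ]
```

In a real pipeline the vectors would come from a text embedding model applied to the decoded per-turn texts.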

Training flow per step:
1. compute_log_prob extracts predicted_ids (argmax on logits)
2. RWML GRPO: compute embedding similarity rewards → GRPO advantages → update_actor
3. Policy GRPO: compute task rewards → GRPO advantages → update_actor
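Steps 2 and 3 each turn a set of rewards into GRPO advantages before calling `update_actor`. A minimal sketch of that group normalization is shown below, assuming one scalar reward per rollout within a group and a population standard deviation; the function name and `eps` value are illustrative, and the PR's per-turn handling may differ.

```python
import statistics
from typing import List, Sequence


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# The same helper could serve both passes, each with its own reward signal:
rwml_adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])    # embedding-similarity rewards
policy_adv = grpo_advantages([0.2, 0.9, 0.4, 0.5])  # task rewards
```

When every rollout in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.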

Key files:
- world_model_rl.py: EmbeddingSimilarityReward, decode_per_turn_texts, compute_rwml_turn_rewards
- dp_actor.py: return_predicted_ids in forward pass
- http_training_server.py: separate RWML GRPO update step
- generic_agent_loop.py: store per-turn observation texts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The rwml config block was defined in the YAML but never extracted and
sent to the training server by the client, unlike wmc_erc which has
explicit extraction logic. Without this, OmegaConf.select(self.config,
"rwml") always returned None on the server side.
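The failure mode can be pictured with plain dictionaries: if the client forwards only an explicit list of config blocks, any block missing from that list never reaches the server. The helper and key tuple below are hypothetical stand-ins, not the repository's client code.

```python
from typing import Any, Dict, Mapping, Sequence

# Hypothetical list of top-level config blocks the client forwards.
FORWARDED_BLOCKS = ("wmc_erc", "rwml")


def extract_server_overrides(
    client_cfg: Mapping[str, Any],
    keys: Sequence[str] = FORWARDED_BLOCKS,
) -> Dict[str, Any]:
    """Copy only the listed config blocks that are present and not None."""
    return {k: dict(client_cfg[k]) for k in keys if client_cfg.get(k) is not None}


client_cfg = {"wmc_erc": {"enabled": True}, "rwml": {"enabled": True}, "lr": 1e-6}

# Before the fix: "rwml" is not in the forwarded list, so the server never sees it.
before = extract_server_overrides(client_cfg, keys=("wmc_erc",))
# After the fix: the block is extracted and sent along with the rest.
after = extract_server_overrides(client_cfg)
```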

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Alibaba-NLP/gte-large-en-v1.5 uses custom code that requires explicit
trust_remote_code=True to load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Set HF_HUB_OFFLINE=1 during SentenceTransformer init so it uses the
local cache without trying to connect to huggingface.co.
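One way to scope an environment variable to a single initialization call is a small context manager that restores the previous value afterwards. The sketch below is an assumption about how this could be done, not the commit's code, and `temporary_env` is a hypothetical name.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_env(name: str, value: str) -> Iterator[None]:
    """Set an environment variable inside the block, then restore the old state."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# Intended use (model loading is left as a comment to keep the sketch offline):
# with temporary_env("HF_HUB_OFFLINE", "1"):
#     model = SentenceTransformer(model_path, trust_remote_code=True)
```

A caveat: a library that reads such a variable once at import time will not notice a change made later, so this pattern only helps when the variable is consulted during the wrapped call.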

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Instead of passing the model ID to SentenceTransformer (which tries to
reach huggingface.co even when cached), resolve the local snapshot path
from the HF hub cache first. Falls back to the original path if not
found in cache or if it's already a local directory.
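The lookup described above can be sketched with the standard library alone, assuming the usual hub cache layout of `models--{org}--{name}/refs/{revision}` pointing at a directory under `snapshots/`. The function name and default cache location logic are illustrative and are not taken from the PR.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_local_snapshot(
    model_id: str,
    cache_dir: Optional[str] = None,
    revision: str = "main",
) -> str:
    """Return a cached snapshot directory for model_id, or model_id unchanged."""
    # Already a local directory: nothing to resolve.
    if os.path.isdir(model_id):
        return model_id
    if cache_dir is None:
        hf_home = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
        cache_dir = os.environ.get("HF_HUB_CACHE") or os.path.join(hf_home, "hub")
    repo_dir = Path(cache_dir) / ("models--" + model_id.replace("/", "--"))
    ref_file = repo_dir / "refs" / revision
    if ref_file.is_file():
        # The ref file holds the commit hash naming the snapshot directory.
        commit = ref_file.read_text().strip()
        snapshot = repo_dir / "snapshots" / commit
        if snapshot.is_dir():
            return str(snapshot)
    # Not cached: fall back to the original identifier.
    return model_id
```

Passing the returned path to the model loader means the loader sees a local directory and has no reason to contact the hub.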

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@PolarisDane PolarisDane closed this Apr 3, 2026