open-tinker · lwaekfjlk · Mar 5, 2026 · Mar 7, 2026
diff --git a/README.md b/README.md
@@ -18,17 +18,17 @@
 
 Choose an example below to get started. Each example includes step-by-step instructions for setup, training, and inference.
 
-| Task                                             | Description                                                                          | Performance                                                                       |
-| ------------------------------------------------ | ------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------- |
-| **[LLM Single-Turn Math](docs/math_singleturn.md)**                       | Mathematical problem solving                                     | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/bwkq1wl8?nw=nwuserzhusq20)                                                                               |
-| **[LLM Multi-Turn Math](docs/math_multiturn.md)** | Multi-turn mathematical problem solving with tool calling                          | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/f5pt6gcw?nw=nwuserzhusq20)                       |
-| **[LLM Single-LoRA Single-Turn Math](docs/math_lora_singleturn.md)**                  | Math single-turn Trained With LoRA                                                         | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/cl1w5l07?nw=nwuserzhusq20)                        |
-| **[VLM Single-Turn Math](docs/vlm_geo3k_singleturn.md)**                    | geometry 3k math problem solving                                                          | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/aidfc2y1?nw=nwuserzhusq20)                                                                               |
-| **[VLM Multi-Turn Math](docs/vlm_geo3k_multiturn.md)**             | geometry 3k math problem solving with tool calling                                           | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/r39htm2o?nw=nwuserzhusq20)                |
-| **[LLM Gomoku Agent](docs/gomoku_multiturn.md)**       | A multi-turn gomoku agent | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/7a7ggkw3?nw=nwuserzhusq20)                        |
-| **[LLM AlfWorld Agent](docs/alfworld_multiturn.md)**       | A multi-turn alfworld agent | [wandb](https://wandb.ai/1125027232/opentinker-public/runs/3jrlolk7?nw=nwuser1125027232)                        |
-| **[LLM Android World Agent](docs/android_world_multiturn.md)**       | A multi-turn android world agent |                         |
-
+| Task                                                                 | Description                                               | Performance                                                                              |
+| -------------------------------------------------------------------- | --------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
+| **[LLM Single-Turn Math](docs/math_singleturn.md)**                  | Mathematical problem solving                              | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/bwkq1wl8?nw=nwuserzhusq20)               |
+| **[LLM Multi-Turn Math](docs/math_multiturn.md)**                    | Multi-turn mathematical problem solving with tool calling | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/f5pt6gcw?nw=nwuserzhusq20)               |
+| **[LLM Single-LoRA Single-Turn Math](docs/math_lora_singleturn.md)** | Single-turn math trained with LoRA                        | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/cl1w5l07?nw=nwuserzhusq20)               |
+| **[VLM Single-Turn Math](docs/vlm_geo3k_singleturn.md)**             | Geometry3K math problem solving                           | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/aidfc2y1?nw=nwuserzhusq20)               |
+| **[VLM Multi-Turn Math](docs/vlm_geo3k_multiturn.md)**               | Geometry3K math problem solving with tool calling         | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/r39htm2o?nw=nwuserzhusq20)               |
+| **[LLM Gomoku Agent](docs/gomoku_multiturn.md)**                     | A multi-turn Gomoku agent                                 | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/7a7ggkw3?nw=nwuserzhusq20)               |
+| **[LLM AlfWorld Agent](docs/alfworld_multiturn.md)**                 | A multi-turn ALFWorld agent                               | [wandb](https://wandb.ai/1125027232/opentinker-public/runs/3jrlolk7?nw=nwuser1125027232) |
+| **[LLM SciWorld Agent](docs/sciworld_multiturn.md)**                 | A multi-turn ScienceWorld agent                           |                                                                                          |
+| **[LLM Android World Agent](docs/android_world_multiturn.md)**       | A multi-turn Android World agent                          |                                                                                          |
 
 ## 📦 Installation
 
@@ -149,12 +149,12 @@ This 2×2 design space enables four distinct paradigms, each suited to different
 
 ```
 @misc{zhu2026opentinkerseparatingconcernsagentic,
-      title={OpenTinker: Separating Concerns in Agentic Reinforcement Learning}, 
+      title={OpenTinker: Separating Concerns in Agentic Reinforcement Learning},
       author={Siqi Zhu and Jiaxuan You},
       year={2026},
       eprint={2601.07376},
       archivePrefix={arXiv},
       primaryClass={cs.AI},
-      url={https://arxiv.org/abs/2601.07376}, 
+      url={https://arxiv.org/abs/2601.07376},
 }
 ```
diff --git a/docs/sciworld_multiturn.md b/docs/sciworld_multiturn.md
@@ -0,0 +1,117 @@
+# LLM Game Agent (ScienceWorld Multi-Turn)
+
+**Author:** Haofeiy
+
+This example demonstrates training a language model to complete science tasks in the ScienceWorld text-based environment.
+
+## Overview
+
+ScienceWorld is a text-based benchmark of grounded science tasks. Tasks include:
+
+- Boiling / freezing / melting substances
+- Identifying and classifying living things
+- Using instruments (thermometer, microscope, etc.)
+- Combining materials to produce reactions
+- Navigating rooms to find and manipulate objects
+
+OpenTinker support follows the same pattern as ALFWorld:
+
+- `SciWorldGame` wraps the benchmark as an `AbstractGame`
+- `sciworld_server.py` exposes the environment over the generic FastAPI server
+- `sciworld_rl.py` trains against that server through `GameEnvironment`
+
+## Prerequisites
+
+1. Complete the [Installation](../README.md#-installation) steps
+2. Install ScienceWorld: `pip install scienceworld`
+3. Ensure **Java** is available (`java -version`), since ScienceWorld launches a JVM-backed server
+4. Get your IP address if client and scheduler run on different machines: `hostname -I`
+
+## Step 1: Start the Scheduler
+
+```bash
+bash opentinker/scripts/launch_scheduler.sh --scheduler-port <scheduler_port>
+```
+
+## Step 2: Start the ScienceWorld Environment
+
+```bash
+python -m opentinker.environment.sciworld.sciworld_server \
+    --port <env_port> \
+    --max_steps 30 \
+    --split train \
+    --shards 8 \
+    --threads-per-shard 256
+```
+
+Optional task restriction:
+
+```bash
+python -m opentinker.environment.sciworld.sciworld_server \
+    --port <env_port> \
+    --split train \
+    --task-name boil \
+    --task-name find-animal
+```
+
+Useful server options:
+
+- `--split`: ScienceWorld split to sample variations from (`train`, `dev`, `test`)
+- `--task-name`: Repeat to restrict the task pool
+- `--task-id`: Alternative to task names if you prefer numeric task ids
+- `--variation`: Repeat to restrict to explicit variation ids
+- `--simplification-str`: Pass-through simplification string for `env.load()`
+- `--thread-base`: Base ScienceWorld thread number for this server group
+- `--threads-per-shard`: Reserved thread-number block per shard
+
+## Step 3: Run Training
+
+```bash
+python opentinker/client/sciworld_rl.py \
+    tokenizer_path=Qwen/Qwen2.5-3B-Instruct \
+    batch_size=4 \
+    num_steps=1000 \
+    test_freq=10 \
+    scheduler_url=http://<server_endpoint>:<scheduler_port> \
+    interaction.config.env_port=<env_port> \
+    interaction.config.env_host=<client_endpoint> \
+    interaction.config.split=train \
+    interaction.config.local_thread_base=20000
+```
+
+## Notes
+
+- Keep `num_workers=0` for the local prompt-generation dataloaders unless you
+  explicitly manage non-overlapping ScienceWorld thread bases per worker.
+- Use the same `split`, `task_names`, `task_ids`, and `variation_indices`
+  settings on both the environment server and the client config so prompt
+  generation matches the remote environment.
+- If you already run ScienceWorld-backed processes on the same machine, move
+  `--thread-base` and `interaction.config.local_thread_base` to disjoint ranges.
+
+## Reward Structure
+
+| Event            | Reward |
+| ---------------- | ------ |
+| Task Success     | +10.0  |
+| Task Failure     | -1.0   |
+| Per Step Penalty | -0.01  |
+| Invalid Action   | -0.1   |
+
+## Example Actions
+
+The agent interacts with the environment using text commands:
+
+- `look around` - Observe the current room
+- `open door to kitchen` - Navigate between rooms
+- `pick up thermometer` - Pick up an object
+- `use thermometer on water` - Use an instrument
+- `pour water into beaker` - Combine or transfer materials
+- `focus on substance in microscope` - Examine with instruments
+- `inventory` - Check held items
+- `wait` - Wait one step (e.g. for a reaction)
+
+## Configuration Reference
+
+See [`opentinker/client/client_config/sciworld_param.yaml`](../opentinker/client/client_config/sciworld_param.yaml)
+for the full configuration.
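Review note on the Notes section of `docs/sciworld_multiturn.md`: the advice to keep `--thread-base` and `interaction.config.local_thread_base` in disjoint ranges can be checked with a few lines of arithmetic. The sketch below is not part of this PR. It assumes each shard reserves one contiguous block of `--threads-per-shard` thread numbers starting at `--thread-base`, which the option descriptions suggest but do not state outright. The server base of `10000` and the client block size of `256` are hypothetical values chosen for illustration; only `20000`, `8`, and `256` come from the doc.

```python
from typing import NamedTuple


class ThreadRange(NamedTuple):
    """Half-open range [start, stop) of ScienceWorld thread numbers."""

    start: int
    stop: int


def server_range(thread_base: int, shards: int, threads_per_shard: int) -> ThreadRange:
    # Assumption: shard i reserves
    # [thread_base + i * threads_per_shard, thread_base + (i + 1) * threads_per_shard).
    return ThreadRange(thread_base, thread_base + shards * threads_per_shard)


def overlaps(a: ThreadRange, b: ThreadRange) -> bool:
    # Two half-open ranges overlap when each one starts before the other ends.
    return a.start < b.stop and b.start < a.stop


# Hypothetical server base; shards and threads-per-shard are taken from Step 2.
server = server_range(thread_base=10000, shards=8, threads_per_shard=256)
# local_thread_base from Step 3; the client block size of 256 is an assumption.
client = ThreadRange(20000, 20000 + 256)

print(server, client, "overlap:", overlaps(server, client))
```

If the real server default for `--thread-base` sits close to `20000`, the same check would flag the collision before any JVM is launched.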
diff --git a/opentinker/client/client_config/sciworld_param.yaml b/opentinker/client/client_config/sciworld_param.yaml
@@ -0,0 +1,81 @@
+# ScienceWorld Training Configuration
+# Use with: python opentinker/client/sciworld_rl.py
+
+# Project settings
+project_name: opentinker
+experiment_name: sciworld_training
+
+# Logging
+logger_backends: ["console", "wandb"]
+
+# Tracing (optional)
+enable_tracing: true
+weave_project: null
+
+# WandB (optional)
+wandb_key: null
+
+# Model and tokenizer
+tokenizer_path: null
+
+# Training parameters
+batch_size: 4
+num_workers: 0 # Keep single-process; ScienceWorld uses per-instance JVM ports
+# Training duration - set ONE of these (num_steps takes precedence if both set)
+num_epochs: null # Number of epochs (null = use num_steps)
+num_steps: 1000 # Total training steps (null = use num_epochs)
+save_freq: 200
+test_freq: 50 # Validation frequency (every N steps)
+
+# Validation parameters
+val_batch_size: 32 # Total validation samples (null = 50)
+
+# Generation parameters
+temperature: 1 # Sampling temperature (1 = unscaled); lower it for more focused responses
+top_p: 1
+max_new_tokens: 4096 # TOTAL response budget for entire multi-turn trajectory (NOT per-turn!)
+max_prompt_tokens: 2048
+
+# Algorithm (must be agent_loop for multi-turn)
+algorithm: "agent_loop"
+
+# RL Algorithm settings (passed to server via scheduler)
+# adv_estimator options:
+#   - "grpo"          : Standard GRPO (outcome-only advantage)
+#   - "grpo_per_step" : Per-step GRPO with return-based advantages (for multi-turn tasks)
+#   - "gae"           : Generalized Advantage Estimation (for PPO, requires critic)
+adv_estimator: "grpo"
+# rollout_n: number of samples per prompt for GRPO/grpo_per_step
+# For PPO (gae), rollout_n is typically 1
+rollout_n: 8
+
+interaction:
+  name: sciworld
+  class_path: opentinker.environment.gym_environment_interaction.GymEnvironmentInteraction
+  config:
+    env_host: 0.0.0.0
+    env_port: 8092
+    env_endpoint: http://${interaction.config.env_host}:${interaction.config.env_port}
+    env_shards: 8
+    max_steps: 30
+    max_total_steps: 30
+    observation_template: "{observation}"
+    split: train # train, dev, test
+    task_names: null # e.g. ["boil", "find-animal"]
+    task_ids: null
+    variation_indices: null
+    simplification_str: ""
+    jar_path: null
+    local_thread_base: 20000
+
+multi_turn:
+  max_user_turns: ${interaction.config.max_total_steps}
+  max_assistant_turns: ${interaction.config.max_total_steps}
+  max_tokens_per_turn: 512
+  weave_project: "opentinker/sciworld"
+  experiment_name: "sciworld_interaction"
+
+scheduler_url: "http://0.0.0.0:8780"
+scheduler_api_key: null
+
+num_gpus: 8
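Review note on `sciworld_param.yaml`: `max_new_tokens` is documented as the total response budget for the whole trajectory, while `multi_turn.max_tokens_per_turn` caps each turn and `max_total_steps` caps the number of turns. The sketch below works through how those three values interact. It is not code from this PR, the function names are hypothetical, and it assumes only assistant tokens count against `max_new_tokens`. If environment observations are also charged to that budget, the real numbers will be lower.

```python
def full_length_turns(max_new_tokens: int, max_tokens_per_turn: int, max_turns: int) -> int:
    """Turns that can each use the whole per-turn cap before the trajectory budget runs out."""
    if max_tokens_per_turn <= 0:
        raise ValueError("max_tokens_per_turn must be positive")
    return min(max_turns, max_new_tokens // max_tokens_per_turn)


def mean_tokens_to_use_all_turns(max_new_tokens: int, max_turns: int) -> float:
    """Average tokens per turn at which the budget lasts for every allowed turn."""
    if max_turns <= 0:
        raise ValueError("max_turns must be positive")
    return max_new_tokens / max_turns


# Values from sciworld_param.yaml in this diff.
turns = full_length_turns(max_new_tokens=4096, max_tokens_per_turn=512, max_turns=30)
mean = mean_tokens_to_use_all_turns(max_new_tokens=4096, max_turns=30)

print(f"full-length turns before the budget is exhausted: {turns}")
print(f"mean tokens per turn needed to reach all 30 turns: {mean:.1f}")
```

Under these assumptions the trajectory budget, not the 30-step limit, is the binding constraint whenever the model writes long turns.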