26 changes: 13 additions & 13 deletions README.md
@@ -18,17 +18,17 @@

Choose an example below to get started. Each example includes step-by-step instructions for setup, training, and inference.

| Task | Description | Performance |
| -------------------------------------------------------------------- | --------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| **[LLM Single-Turn Math](docs/math_singleturn.md)** | Mathematical problem solving | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/bwkq1wl8?nw=nwuserzhusq20) |
| **[LLM Multi-Turn Math](docs/math_multiturn.md)** | Multi-turn mathematical problem solving with tool calling | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/f5pt6gcw?nw=nwuserzhusq20) |
| **[LLM Single-LoRA Single-Turn Math](docs/math_lora_singleturn.md)** | Single-turn math trained with LoRA | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/cl1w5l07?nw=nwuserzhusq20) |
| **[VLM Single-Turn Math](docs/vlm_geo3k_singleturn.md)** | Geometry3K math problem solving | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/aidfc2y1?nw=nwuserzhusq20) |
| **[VLM Multi-Turn Math](docs/vlm_geo3k_multiturn.md)** | Geometry3K math problem solving with tool calling | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/r39htm2o?nw=nwuserzhusq20) |
| **[LLM Gomoku Agent](docs/gomoku_multiturn.md)** | A multi-turn Gomoku agent | [wandb](https://wandb.ai/zsqzz/Open-Tinker/runs/7a7ggkw3?nw=nwuserzhusq20) |
| **[LLM AlfWorld Agent](docs/alfworld_multiturn.md)** | A multi-turn ALFWorld agent | [wandb](https://wandb.ai/1125027232/opentinker-public/runs/3jrlolk7?nw=nwuser1125027232) |
| **[LLM SciWorld Agent](docs/sciworld_multiturn.md)** | A multi-turn ScienceWorld agent | |
| **[LLM Android World Agent](docs/android_world_multiturn.md)** | A multi-turn Android World agent | |

## 📦 Installation

@@ -149,12 +149,12 @@ This 2×2 design space enables four distinct paradigms, each suited to different

```
@misc{zhu2026opentinkerseparatingconcernsagentic,
      title={OpenTinker: Separating Concerns in Agentic Reinforcement Learning},
      author={Siqi Zhu and Jiaxuan You},
      year={2026},
      eprint={2601.07376},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.07376},
}
```
117 changes: 117 additions & 0 deletions docs/sciworld_multiturn.md
@@ -0,0 +1,117 @@
# LLM Game Agent (ScienceWorld Multi-Turn)

**Author:** Haofeiy

This example demonstrates training a language model to complete science tasks in the ScienceWorld text-based environment.

## Overview

ScienceWorld is a text-based benchmark of grounded science tasks. Tasks include:

- Boiling / freezing / melting substances
- Identifying and classifying living things
- Using instruments (thermometer, microscope, etc.)
- Combining materials to produce reactions
- Navigating rooms to find and manipulate objects

OpenTinker support follows the same pattern as ALFWorld:

- `SciWorldGame` wraps the benchmark as an `AbstractGame`
- `sciworld_server.py` exposes the environment over the generic FastAPI server
- `sciworld_rl.py` trains against that server through `GameEnvironment`

## Prerequisites

1. Complete the [Installation](../README.md#-installation) steps
2. Install ScienceWorld: `pip install scienceworld`
3. Ensure **Java** is available (`java -version`), since ScienceWorld launches a JVM-backed server
4. Get your IP address if client and scheduler run on different machines: `hostname -I`
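
The Java prerequisite can also be checked from Python before launching anything. The snippet below is a minimal sketch and not part of OpenTinker: the helper name `java_available` is hypothetical, and it only tests that a `java` executable is on `PATH` and that `java -version` exits cleanly.

```python
import shutil
import subprocess


def java_available(executable: str = "java") -> bool:
    """Return True if `executable` is on PATH and `<executable> -version` exits with 0.

    Hypothetical helper for illustration; it is not an OpenTinker or ScienceWorld API.
    """
    path = shutil.which(executable)
    if path is None:
        return False
    try:
        # `java -version` writes to stderr; only the exit code is inspected here.
        result = subprocess.run([path, "-version"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if java_available():
        print("Java found")
    else:
        print("Java not found; install a Java runtime before starting the server")
```

This does not verify the Java version that ScienceWorld expects; it only catches the case where no runtime is installed at all.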

## Step 1: Start the Scheduler

```bash
bash opentinker/scripts/launch_scheduler.sh --scheduler-port <scheduler_port>
```

## Step 2: Start the ScienceWorld Environment

```bash
python -m opentinker.environment.sciworld.sciworld_server \
    --port <env_port> \
    --max_steps 30 \
    --split train \
    --shards 8 \
    --threads-per-shard 256
```

Optional task restriction:

```bash
python -m opentinker.environment.sciworld.sciworld_server \
    --port <env_port> \
    --split train \
    --task-name boil \
    --task-name find-animal
```

Useful server options:

- `--split`: ScienceWorld split to sample variations from (`train`, `dev`, `test`)
- `--task-name`: Repeat to restrict the task pool
- `--task-id`: Alternative to task names if you prefer numeric task ids
- `--variation`: Repeat to restrict to explicit variation ids
- `--simplification-str`: Pass-through simplification string for `env.load()`
- `--thread-base`: Base ScienceWorld thread number for this server group
- `--threads-per-shard`: Reserved thread-number block per shard
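
To make the relationship between `--thread-base`, `--threads-per-shard`, and `--shards` concrete, here is a minimal sketch of one plausible allocation scheme. It assumes each shard reserves a contiguous block of thread numbers starting at the base; the actual server may allocate differently, and both the function name `shard_thread_block` and the base value `10000` are illustrative assumptions.

```python
def shard_thread_block(thread_base: int, threads_per_shard: int, shard_index: int) -> range:
    """Return the thread numbers reserved for one shard.

    Assumption (not confirmed by the server source): shard `i` owns the
    contiguous block [base + i * size, base + (i + 1) * size).
    """
    if threads_per_shard <= 0:
        raise ValueError("threads_per_shard must be positive")
    if shard_index < 0:
        raise ValueError("shard_index must be non-negative")
    start = thread_base + shard_index * threads_per_shard
    return range(start, start + threads_per_shard)


if __name__ == "__main__":
    # Illustrative values: 8 shards of 256 thread numbers each, base 10000.
    for shard in range(8):
        block = shard_thread_block(10000, 256, shard)
        print(f"shard {shard}: {block.start}..{block.stop - 1}")
```

Under this assumption, a server group with 8 shards and 256 threads per shard spans 2048 consecutive thread numbers, which is the range to keep clear of other ScienceWorld processes.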

## Step 3: Run Training

```bash
python opentinker/client/sciworld_rl.py \
    tokenizer_path=Qwen/Qwen2.5-3B-Instruct \
    batch_size=4 \
    num_steps=1000 \
    test_freq=10 \
    scheduler_url=http://<server_endpoint>:<scheduler_port> \
    interaction.config.env_port=<env_port> \
    interaction.config.env_host=<client_endpoint> \
    interaction.config.split=train \
    interaction.config.local_thread_base=20000
```

## Notes

- Keep `num_workers=0` for the local prompt-generation dataloaders unless you
  explicitly manage non-overlapping ScienceWorld thread bases per worker.
- Use the same `split`, `task_names`, `task_ids`, and `variation_indices`
  settings on both the environment server and the client config so prompt
  generation matches the remote environment.
- If you already run ScienceWorld-backed processes on the same machine, move
  `--thread-base` and `interaction.config.local_thread_base` to disjoint ranges.
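
The last two notes can be turned into a quick pre-flight check. The sketch below is illustrative only: the helper names are hypothetical, the settings are passed as plain dictionaries rather than read from OpenTinker's config objects, and the range sizes are assumptions because the number of thread numbers each process actually uses is not stated here.

```python
from typing import Any, List, Mapping

# Settings the notes say must match between the environment server and the client.
SHARED_KEYS = ("split", "task_names", "task_ids", "variation_indices")


def mismatched_settings(server: Mapping[str, Any], client: Mapping[str, Any]) -> List[str]:
    """Return the shared keys whose values differ between server and client."""
    return [key for key in SHARED_KEYS if server.get(key) != client.get(key)]


def ranges_overlap(a_start: int, a_size: int, b_start: int, b_size: int) -> bool:
    """Return True if [a_start, a_start + a_size) intersects [b_start, b_start + b_size)."""
    return a_start < b_start + b_size and b_start < a_start + a_size


if __name__ == "__main__":
    server = {"split": "train", "task_names": ["boil", "find-animal"]}
    client = {"split": "train", "task_names": None}
    print("mismatched:", mismatched_settings(server, client))

    # Assumed sizes: 8 shards x 256 threads on the server, same span reserved locally.
    print("overlap:", ranges_overlap(10000, 8 * 256, 20000, 8 * 256))
```

A non-empty mismatch list or an overlap of `True` indicates a configuration that the notes above warn against.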

## Reward Structure

| Event | Reward |
| ---------------- | ------ |
| Task Success | +10.0 |
| Task Failure | -1.0 |
| Per Step Penalty | -0.01 |
| Invalid Action | -0.1 |
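
As a worked example of the table, the sketch below computes a total episode return. It assumes the components are simply added together, that the per-step penalty applies to every step (including invalid ones), and that an episode ends in exactly one of success or failure. How the environment actually combines these terms is not specified here, so treat the function as an illustration rather than the implementation.

```python
TASK_SUCCESS = 10.0
TASK_FAILURE = -1.0
STEP_PENALTY = -0.01
INVALID_ACTION = -0.1


def episode_return(num_steps: int, num_invalid: int, success: bool) -> float:
    """Sum the reward components for one episode under an additive assumption."""
    if num_invalid > num_steps:
        raise ValueError("num_invalid cannot exceed num_steps")
    terminal = TASK_SUCCESS if success else TASK_FAILURE
    return terminal + num_steps * STEP_PENALTY + num_invalid * INVALID_ACTION


if __name__ == "__main__":
    # Solved in 12 steps with 2 invalid actions: 10.0 - 0.12 - 0.2
    print(round(episode_return(12, 2, True), 2))
    # Failed after 30 steps with no invalid actions: -1.0 - 0.3
    print(round(episode_return(30, 0, False), 2))
```

Under these assumptions the terminal reward dominates: even a 30-step successful episode keeps most of the +10.0, so the step penalty mainly breaks ties between successful trajectories of different lengths.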

## Example Actions

The agent interacts with the environment using text commands:

- `look around` - Observe the current room
- `open door to kitchen` - Navigate between rooms
- `pick up thermometer` - Pick up an object
- `use thermometer on water` - Use an instrument
- `pour water into beaker` - Combine or transfer materials
- `focus on substance in microscope` - Examine with instruments
- `inventory` - Check held items
- `wait` - Wait one step (e.g. for a reaction)

## Configuration Reference

See [`opentinker/client/client_config/sciworld_param.yaml`](../opentinker/client/client_config/sciworld_param.yaml)
for the full configuration.
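
One interaction in that file is worth working through: `max_new_tokens` (4096) is described as the total response budget for the whole trajectory, while `max_tokens_per_turn` is 512 and `max_total_steps` is 30. The arithmetic below is a rough sketch that assumes only generated tokens count against the budget; the helper names are hypothetical.

```python
def max_full_length_turns(max_new_tokens: int, max_tokens_per_turn: int) -> int:
    """Number of turns that fit if every turn uses its full per-turn allowance."""
    return max_new_tokens // max_tokens_per_turn


def average_tokens_per_turn(max_new_tokens: int, turns: int) -> float:
    """Average response length available if the budget is spread over `turns` turns."""
    return max_new_tokens / turns


if __name__ == "__main__":
    # Values taken from sciworld_param.yaml.
    print(max_full_length_turns(4096, 512))
    print(round(average_tokens_per_turn(4096, 30), 1))
```

With these values, only 8 turns fit at the full 512 tokens each, and using all 30 steps leaves roughly 136 tokens per turn on average, so long per-turn responses can exhaust the budget well before the step limit.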
81 changes: 81 additions & 0 deletions opentinker/client/client_config/sciworld_param.yaml
@@ -0,0 +1,81 @@
# ScienceWorld Training Configuration
# Use with: python sciworld_rl.py

# Project settings
project_name: opentinker
experiment_name: sciworld_training

# Logging
logger_backends: ["console", "wandb"]

# Tracing (optional)
enable_tracing: true
weave_project: null

# WandB (optional)
wandb_key: null

# Model and tokenizer
tokenizer_path: null

# Training parameters
batch_size: 4
num_workers: 0 # Keep single-process; ScienceWorld uses per-instance JVM ports
# Training duration - set ONE of these (num_steps takes precedence if both set)
num_epochs: null # Number of epochs (null = use num_steps)
num_steps: 1000 # Total training steps (null = use num_epochs)
save_freq: 200
test_freq: 50 # Validation frequency (every N steps)

# Validation parameters
val_batch_size: 32 # Total validation samples (null = 50)

# Generation parameters
temperature: 1 # Sampling temperature (1 = unscaled; lower values give more focused responses)
top_p: 1
max_new_tokens: 4096 # TOTAL response budget for entire multi-turn trajectory (NOT per-turn!)
max_prompt_tokens: 2048

# Algorithm (must be agent_loop for multi-turn)
algorithm: "agent_loop"

# RL Algorithm settings (passed to server via scheduler)
# adv_estimator options:
# - "grpo" : Standard GRPO (outcome-only advantage)
# - "grpo_per_step" : Per-step GRPO with return-based advantages (for multi-turn tasks)
# - "gae" : Generalized Advantage Estimation (for PPO, requires critic)
adv_estimator: "grpo"
# rollout_n: number of samples per prompt for GRPO/grpo_per_step
# For PPO (gae), rollout_n is typically 1
rollout_n: 8

interaction:
  name: sciworld
  class_path: opentinker.environment.gym_environment_interaction.GymEnvironmentInteraction
  config:
    env_host: 0.0.0.0
    env_port: 8092
    env_endpoint: http://${interaction.config.env_host}:${interaction.config.env_port}
    env_shards: 8
    max_steps: 30
    max_total_steps: 30
    observation_template: "{observation}"
    split: train # train, dev, test
    task_names: null # e.g. ["boil", "find-animal"]
    task_ids: null
    variation_indices: null
    simplification_str: ""
    jar_path: null
    local_thread_base: 20000

multi_turn:
  max_user_turns: ${interaction.config.max_total_steps}
  max_assistant_turns: ${interaction.config.max_total_steps}
  max_tokens_per_turn: 512
  weave_project: "opentinker/sciworld"
  experiment_name: "sciworld_interaction"

scheduler_url: "http://0.0.0.0:8780"
scheduler_api_key: null

num_gpus: 8