[examples] add coding_agent_rl: agent-in-sandbox RL minimal demo by jingshenghang · Pull Request #1923 · THUDM/slime

jingshenghang · 2026-05-19T08:28:35Z

End-to-end loop for "coding agent + sandbox execution + test reward":

Boot an E2B sandbox per sample (dataset metadata.image).
Install Node 22 + Claude Code CLI inside; create unprivileged agent user; chown repo; write problem_statement.md.
Launch claude --output-format stream-json pointed at a head-node bridge via ANTHROPIC_BASE_URL; ANTHROPIC_AUTH_TOKEN doubles as the session id for concurrent request demux.
bridge.py translates each /v1/messages call into a SGLang /generate call and streams an Anthropic SSE reply back; per-session it keeps (prompt_ids, response_ids, loss_mask) so no post-hoc retokenize.
After the agent finishes, a fresh eval sandbox applies the model git diff and runs the dataset's tests -> 0/1 reward.
generate.py drops the bridge-collected tokens straight into Sample.

Layout under examples/coding_agent_rl/:
generate.py slime custom-generate entrypoint
sandbox.py all sandbox-side ops (boot/exec/upload, install
Node/Claude Code, run agent, git diff, eval)
bridge.py Anthropic Messages API <-> SGLang /generate
shim; model-agnostic
run_glm47_355b.sh 8-node / 64-GPU / colocate / E2B launch
script for GLM-4.7-355B-A32B
README.md walkthrough + dataset schema + swap-model howto

Model-agnostic: chat template via tokenizer.apply_chat_template; tool call parsing via sglang FunctionCallParser; reasoning parsing via sglang ReasoningParser. Swapping model = changing SWE_TOOL_PARSER / SWE_REASONING_PARSER envs (default glm47/glm45).

No new swe_rollout.py: reuses slime's default sglang_rollout outer loop via --custom-generate-function-path.

A minimal, readable example of coding agent + sandbox execution + test reward in slime (~1500 LoC across 4 files). One training sample: spin up a sandbox -> run Claude Code inside it -> capture the model-produced git diff -> spin up a SECOND clean sandbox, apply the diff, run the dataset's tests -> 0/1 reward -> feed the actual generated tokens (with loss-mask) back to slime, no re-tokenization. Wire-up is one CLI flag: --custom-generate-function-path examples.coding_agent_rl.generate.generate slime's default sglang_rollout.generate_rollout outer loop is reused; only the per-sample generate() is swapped. Files: * generate.py - per-sample entrypoint slime calls. Provision sandbox -> drop PROBLEM_STATEMENT.md -> run agent -> git diff -> eval in a fresh sandbox -> fill Sample. * sandbox.py - E2B sandbox backend. Boot/kill, exec/upload, install_node22 + install_claude_code, long-running agent spawn with done-marker poll, git_diff, fresh-sandbox eval runner (swepro / f2p_script / eval_cmd). * bridge.py - head-node aiohttp shim. Translates the agent's Anthropic Messages API into slime's SGLang /generate (token-native + logprobs) and keeps (prompt_ids, response_ids, loss_mask) per session so the trainer skips re-tokenization. Model-agnostic. * run_glm47_355b.sh - reference launch script (GLM-4.7-355B-A32B, 8 nodes / 64 GPUs, colocate, E2B). All required env vars guarded by \${VAR:?...}; no operator-specific paths. * README.md - file table, sample flow diagram, dataset schema (flat and remote_env_info layouts), required vs optional env knobs, "Swap things out" recipes (model / agent / sandbox backend), and design notes (no re-tokenization, reasoning round-trip, done-marker poll, boot semaphore + retry).

zhuzilin · 2026-05-19T09:25:36Z

+    # Canonical chat log. Each assistant turn we append after /generate carries
+    # reasoning_content so the next round's apply_chat_template re-render matches
+    # the tokens the model actually emitted (preserving prefix match).
+    glm_messages: list[dict] = dataclasses.field(default_factory=list)


应该叫 messages 就可以了

zhuzilin · 2026-05-19T09:32:44Z

@@ -0,0 +1,424 @@
+"""Anthropic Messages API <-> SGLang /generate bridge.


我建议改成叫 middleware 之类的东西，主要是现在 slime 里面已经有 mbridge 和 megatron bridge 了，容易混淆。

zhuzilin · 2026-05-19T09:33:20Z

+    lock: asyncio.Lock = dataclasses.field(default_factory=asyncio.Lock)
+
+
+class _Store:


python 的 asyncio 是不是单线程的？需要这个带锁的 session 吗？

zhuzilin · 2026-05-19T09:38:25Z

+            "message": "Authorization Bearer <session_id> required",
+        }}, status=400)
+
+    s = await store.get(session_id)


有可能需要加个提示，session_id 不能重复

zhuzilin · 2026-05-19T09:39:38Z

+        ideal_text = tok.apply_chat_template(
+            s.glm_messages, tools=s.tools_schema, tokenize=False, add_generation_prompt=True,
+        )
+        ideal_ids = tok.encode(ideal_text, add_special_tokens=False)


这里是不是 tokenize=True 就行？token=True 还能设置 add_special_tokens=False 吗？

zhuzilin · 2026-05-19T09:57:49Z

+            else:
+                logger.warning("[bridge] %s template-rerender mismatch; rebaselining", session_id)
+                s.response_ids = ideal_ids[len(s.prompt_ids):]
+                s.loss_mask = [0] * len(s.response_ids)


这里是按如果不匹配会 mask 掉整条 response 来做的是吗？

zhuzilin · 2026-05-19T10:07:47Z

+                    "content": [], "stop_reason": None, "stop_sequence": None,
+                    "usage": {"input_tokens": in_tokens, "output_tokens": 0}},
+    }))
+    for idx, b in enumerate(blocks):


想确定一下，这里是会先把整个 streaming 的请求变成同步的吗？

…metadata configurable * bridge.py renamed to middleware.py (file + class BridgeHandle -> MiddlewareHandle + log prefix + thread name). The chat-log dataclass field glm_messages was also renamed chat_messages -- the example is model-agnostic and the GLM-specific naming was misleading. * sandbox.py no longer hardcodes ``glm-platform/*`` metadata keys. Set SWE_SANDBOX_METADATA_JSON='{"my-platform/size":"lg",...}' to pass arbitrary routing tags into AsyncSandbox.create(metadata=...). Default is empty, which works for stock E2B accounts. * generate.py defaults SWE_TOOL_PARSER / SWE_REASONING_PARSER to None (no hardcoded GLM-specific parser fallback). The reference launch script run_glm47_355b.sh still sets these to glm47/glm45. * sandbox.run_claude_code's ``bridge_url=`` kwarg renamed to ``middleware_url=`` (caller in generate.py updated). * README + run_glm47_355b.sh updated for the rename and the new metadata env var.

…ddleware - middleware: track each /messages call as a `_Turn` (request snapshot, response, finish/stop reason, parent prefix), expose via pop_session return value and `record_tree` opt-in. - middleware: detect non-linear message updates by hashing prior messages and rebuild the prompt when the client diverges, instead of silently appending. Also translate Anthropic `thinking` blocks into `reasoning_content` so prior reasoning is preserved across turns. - generate: add SWE_SAVE_TRAJECTORY_TREE env knob; when set, stash the exported tree under sample.metadata["trajectory_tree"]. Also allow overriding the Claude Code prompt via SWE_CC_PROMPT. - sandbox: pass --include-partial-messages / --include-hook-events to claude CLI and allow extra args via SWE_CLAUDE_EXTRA_ARGS; quote the trajectory path with shlex.

jingshenghang force-pushed the coding-agent-rl-example branch from 585ef36 to f2fd320 Compare May 19, 2026 09:40

zhuzilin reviewed May 19, 2026

View reviewed changes

jingshenghang and others added 4 commits May 19, 2026 11:35

Merge branch 'main' into coding-agent-rl-example

3bab216

[examples] fix coding agent sandbox node install

04dfa8b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[examples] add coding_agent_rl: agent-in-sandbox RL minimal demo#1923

[examples] add coding_agent_rl: agent-in-sandbox RL minimal demo#1923
jingshenghang wants to merge 5 commits into
THUDM:mainfrom
jingshenghang:coding-agent-rl-example

jingshenghang commented May 19, 2026

Uh oh!

zhuzilin May 19, 2026

Uh oh!

zhuzilin May 19, 2026

Uh oh!

zhuzilin May 19, 2026

Uh oh!

zhuzilin May 19, 2026

Uh oh!

zhuzilin May 19, 2026

Uh oh!

zhuzilin May 19, 2026

Uh oh!

zhuzilin May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -0,0 +1,424 @@
		"""Anthropic Messages API <-> SGLang /generate bridge.

		lock: asyncio.Lock = dataclasses.field(default_factory=asyncio.Lock)


		class _Store:

Conversation

jingshenghang commented May 19, 2026

Uh oh!

zhuzilin May 19, 2026

Choose a reason for hiding this comment

Uh oh!

zhuzilin May 19, 2026

Choose a reason for hiding this comment

Uh oh!

zhuzilin May 19, 2026

Choose a reason for hiding this comment

Uh oh!

zhuzilin May 19, 2026

Choose a reason for hiding this comment

Uh oh!

zhuzilin May 19, 2026

Choose a reason for hiding this comment

Uh oh!

zhuzilin May 19, 2026

Choose a reason for hiding this comment

Uh oh!

zhuzilin May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants