[examples] add coding_agent_rl: agent-in-sandbox RL minimal demo#1923
Open
jingshenghang wants to merge 5 commits into
Open
[examples] add coding_agent_rl: agent-in-sandbox RL minimal demo#1923jingshenghang wants to merge 5 commits into
jingshenghang wants to merge 5 commits into
Conversation
A minimal, readable example of coding agent + sandbox execution + test
reward in slime (~1500 LoC across 4 files). One training sample:
spin up a sandbox -> run Claude Code inside it -> capture the
model-produced git diff -> spin up a SECOND clean sandbox, apply the
diff, run the dataset's tests -> 0/1 reward -> feed the actual
generated tokens (with loss-mask) back to slime, no re-tokenization.
Wire-up is one CLI flag:
--custom-generate-function-path examples.coding_agent_rl.generate.generate
slime's default sglang_rollout.generate_rollout outer loop is reused;
only the per-sample generate() is swapped.
Files:
* generate.py - per-sample entrypoint slime calls. Provision sandbox ->
drop PROBLEM_STATEMENT.md -> run agent -> git diff -> eval in a fresh
sandbox -> fill Sample.
* sandbox.py - E2B sandbox backend. Boot/kill, exec/upload,
install_node22 + install_claude_code, long-running agent spawn with
done-marker poll, git_diff, fresh-sandbox eval runner (swepro /
f2p_script / eval_cmd).
* bridge.py - head-node aiohttp shim. Translates the agent's
Anthropic Messages API into slime's SGLang /generate (token-native +
logprobs) and keeps (prompt_ids, response_ids, loss_mask) per session
so the trainer skips re-tokenization. Model-agnostic.
* run_glm47_355b.sh - reference launch script (GLM-4.7-355B-A32B,
8 nodes / 64 GPUs, colocate, E2B). All required env vars guarded by
\${VAR:?...}; no operator-specific paths.
* README.md - file table, sample flow diagram, dataset schema (flat
and remote_env_info layouts), required vs optional env knobs,
"Swap things out" recipes (model / agent / sandbox backend), and
design notes (no re-tokenization, reasoning round-trip, done-marker
poll, boot semaphore + retry).
585ef36 to
f2fd320
Compare
zhuzilin
reviewed
May 19, 2026
| # Canonical chat log. Each assistant turn we append after /generate carries | ||
| # reasoning_content so the next round's apply_chat_template re-render matches | ||
| # the tokens the model actually emitted (preserving prefix match). | ||
| glm_messages: list[dict] = dataclasses.field(default_factory=list) |
| @@ -0,0 +1,424 @@ | |||
| """Anthropic Messages API <-> SGLang /generate bridge. | |||
Contributor
There was a problem hiding this comment.
我建议改成叫 middleware 之类的东西,主要是现在 slime 里面已经有 mbridge 和 megatron bridge 了,容易混淆。
| lock: asyncio.Lock = dataclasses.field(default_factory=asyncio.Lock) | ||
|
|
||
|
|
||
| class _Store: |
Contributor
There was a problem hiding this comment.
python 的 asyncio 是不是单线程的?需要这个带锁的 session 吗?
| "message": "Authorization Bearer <session_id> required", | ||
| }}, status=400) | ||
|
|
||
| s = await store.get(session_id) |
Contributor
There was a problem hiding this comment.
有可能需要加个提示,session_id 不能重复
| ideal_text = tok.apply_chat_template( | ||
| s.glm_messages, tools=s.tools_schema, tokenize=False, add_generation_prompt=True, | ||
| ) | ||
| ideal_ids = tok.encode(ideal_text, add_special_tokens=False) |
Contributor
There was a problem hiding this comment.
这里是不是 tokenize=True 就行?token=True 还能设置 add_special_tokens=False 吗?
| else: | ||
| logger.warning("[bridge] %s template-rerender mismatch; rebaselining", session_id) | ||
| s.response_ids = ideal_ids[len(s.prompt_ids):] | ||
| s.loss_mask = [0] * len(s.response_ids) |
Contributor
There was a problem hiding this comment.
这里是按如果不匹配会 mask 掉整条 response 来做的是吗?
| "content": [], "stop_reason": None, "stop_sequence": None, | ||
| "usage": {"input_tokens": in_tokens, "output_tokens": 0}}, | ||
| })) | ||
| for idx, b in enumerate(blocks): |
Contributor
There was a problem hiding this comment.
想确定一下,这里是会先把整个 streaming 的请求变成同步的吗?
…metadata configurable
* bridge.py renamed to middleware.py (file + class BridgeHandle ->
MiddlewareHandle + log prefix + thread name). The chat-log dataclass
field glm_messages was also renamed chat_messages -- the example is
model-agnostic and the GLM-specific naming was misleading.
* sandbox.py no longer hardcodes ``glm-platform/*`` metadata keys.
Set SWE_SANDBOX_METADATA_JSON='{"my-platform/size":"lg",...}' to pass
arbitrary routing tags into AsyncSandbox.create(metadata=...). Default
is empty, which works for stock E2B accounts.
* generate.py defaults SWE_TOOL_PARSER / SWE_REASONING_PARSER to None
(no hardcoded GLM-specific parser fallback). The reference launch
script run_glm47_355b.sh still sets these to glm47/glm45.
* sandbox.run_claude_code's ``bridge_url=`` kwarg renamed to
``middleware_url=`` (caller in generate.py updated).
* README + run_glm47_355b.sh updated for the rename and the new
metadata env var.
…ddleware - middleware: track each /messages call as a `_Turn` (request snapshot, response, finish/stop reason, parent prefix), expose via pop_session return value and `record_tree` opt-in. - middleware: detect non-linear message updates by hashing prior messages and rebuild the prompt when the client diverges, instead of silently appending. Also translate Anthropic `thinking` blocks into `reasoning_content` so prior reasoning is preserved across turns. - generate: add SWE_SAVE_TRAJECTORY_TREE env knob; when set, stash the exported tree under sample.metadata["trajectory_tree"]. Also allow overriding the Claude Code prompt via SWE_CC_PROMPT. - sandbox: pass --include-partial-messages / --include-hook-events to claude CLI and allow extra args via SWE_CLAUDE_EXTRA_ARGS; quote the trajectory path with shlex.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
End-to-end loop for "coding agent + sandbox execution + test reward":
agentuser; chown repo; write problem_statement.md.claude --output-format stream-jsonpointed at a head-node bridge via ANTHROPIC_BASE_URL; ANTHROPIC_AUTH_TOKEN doubles as the session id for concurrent request demux.Layout under examples/coding_agent_rl/:
generate.py slime custom-generate entrypoint
sandbox.py all sandbox-side ops (boot/exec/upload, install
Node/Claude Code, run agent, git diff, eval)
bridge.py Anthropic Messages API <-> SGLang /generate
shim; model-agnostic
run_glm47_355b.sh 8-node / 64-GPU / colocate / E2B launch
script for GLM-4.7-355B-A32B
README.md walkthrough + dataset schema + swap-model howto
Model-agnostic: chat template via tokenizer.apply_chat_template; tool call parsing via sglang FunctionCallParser; reasoning parsing via sglang ReasoningParser. Swapping model = changing SWE_TOOL_PARSER / SWE_REASONING_PARSER envs (default glm47/glm45).
No new swe_rollout.py: reuses slime's default sglang_rollout outer loop via --custom-generate-function-path.