Skip to content

[examples] add coding_agent_rl: agent-in-sandbox RL minimal demo#1923

Open
jingshenghang wants to merge 5 commits into
THUDM:mainfrom
jingshenghang:coding-agent-rl-example
Open

[examples] add coding_agent_rl: agent-in-sandbox RL minimal demo#1923
jingshenghang wants to merge 5 commits into
THUDM:mainfrom
jingshenghang:coding-agent-rl-example

Conversation

@jingshenghang
Copy link
Copy Markdown
Collaborator

End-to-end loop for "coding agent + sandbox execution + test reward":

  1. Boot an E2B sandbox per sample (dataset metadata.image).
  2. Install Node 22 + Claude Code CLI inside; create unprivileged agent user; chown repo; write problem_statement.md.
  3. Launch claude --output-format stream-json pointed at a head-node bridge via ANTHROPIC_BASE_URL; ANTHROPIC_AUTH_TOKEN doubles as the session id for concurrent request demux.
  4. bridge.py translates each /v1/messages call into a SGLang /generate call and streams an Anthropic SSE reply back; per-session it keeps (prompt_ids, response_ids, loss_mask) so no post-hoc retokenize.
  5. After the agent finishes, a fresh eval sandbox applies the model git diff and runs the dataset's tests -> 0/1 reward.
  6. generate.py drops the bridge-collected tokens straight into Sample.

Layout under examples/coding_agent_rl/:
generate.py slime custom-generate entrypoint
sandbox.py all sandbox-side ops (boot/exec/upload, install
Node/Claude Code, run agent, git diff, eval)
bridge.py Anthropic Messages API <-> SGLang /generate
shim; model-agnostic
run_glm47_355b.sh 8-node / 64-GPU / colocate / E2B launch
script for GLM-4.7-355B-A32B
README.md walkthrough + dataset schema + swap-model howto

Model-agnostic: chat template via tokenizer.apply_chat_template; tool call parsing via sglang FunctionCallParser; reasoning parsing via sglang ReasoningParser. Swapping model = changing SWE_TOOL_PARSER / SWE_REASONING_PARSER envs (default glm47/glm45).

No new swe_rollout.py: reuses slime's default sglang_rollout outer loop via --custom-generate-function-path.

A minimal, readable example of coding agent + sandbox execution + test
reward in slime (~1500 LoC across 4 files). One training sample:

  spin up a sandbox -> run Claude Code inside it -> capture the
  model-produced git diff -> spin up a SECOND clean sandbox, apply the
  diff, run the dataset's tests -> 0/1 reward -> feed the actual
  generated tokens (with loss-mask) back to slime, no re-tokenization.

Wire-up is one CLI flag:

  --custom-generate-function-path examples.coding_agent_rl.generate.generate

slime's default sglang_rollout.generate_rollout outer loop is reused;
only the per-sample generate() is swapped.

Files:

* generate.py - per-sample entrypoint slime calls. Provision sandbox ->
  drop PROBLEM_STATEMENT.md -> run agent -> git diff -> eval in a fresh
  sandbox -> fill Sample.
* sandbox.py  - E2B sandbox backend. Boot/kill, exec/upload,
  install_node22 + install_claude_code, long-running agent spawn with
  done-marker poll, git_diff, fresh-sandbox eval runner (swepro /
  f2p_script / eval_cmd).
* bridge.py   - head-node aiohttp shim. Translates the agent's
  Anthropic Messages API into slime's SGLang /generate (token-native +
  logprobs) and keeps (prompt_ids, response_ids, loss_mask) per session
  so the trainer skips re-tokenization. Model-agnostic.
* run_glm47_355b.sh - reference launch script (GLM-4.7-355B-A32B,
  8 nodes / 64 GPUs, colocate, E2B). All required env vars guarded by
  \${VAR:?...}; no operator-specific paths.
* README.md   - file table, sample flow diagram, dataset schema (flat
  and remote_env_info layouts), required vs optional env knobs,
  "Swap things out" recipes (model / agent / sandbox backend), and
  design notes (no re-tokenization, reasoning round-trip, done-marker
  poll, boot semaphore + retry).
@jingshenghang jingshenghang force-pushed the coding-agent-rl-example branch from 585ef36 to f2fd320 Compare May 19, 2026 09:40
Comment thread examples/coding_agent_rl/bridge.py Outdated
# Canonical chat log. Each assistant turn we append after /generate carries
# reasoning_content so the next round's apply_chat_template re-render matches
# the tokens the model actually emitted (preserving prefix match).
glm_messages: list[dict] = dataclasses.field(default_factory=list)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

应该叫 messages 就可以了

Comment thread examples/coding_agent_rl/bridge.py Outdated
@@ -0,0 +1,424 @@
"""Anthropic Messages API <-> SGLang /generate bridge.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我建议改成叫 middleware 之类的东西,主要是现在 slime 里面已经有 mbridge 和 megatron bridge 了,容易混淆。

lock: asyncio.Lock = dataclasses.field(default_factory=asyncio.Lock)


class _Store:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python 的 asyncio 是不是单线程的?需要这个带锁的 session 吗?

"message": "Authorization Bearer <session_id> required",
}}, status=400)

s = await store.get(session_id)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有可能需要加个提示,session_id 不能重复

ideal_text = tok.apply_chat_template(
s.glm_messages, tools=s.tools_schema, tokenize=False, add_generation_prompt=True,
)
ideal_ids = tok.encode(ideal_text, add_special_tokens=False)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里是不是 tokenize=True 就行?token=True 还能设置 add_special_tokens=False 吗?

else:
logger.warning("[bridge] %s template-rerender mismatch; rebaselining", session_id)
s.response_ids = ideal_ids[len(s.prompt_ids):]
s.loss_mask = [0] * len(s.response_ids)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里是按如果不匹配会 mask 掉整条 response 来做的是吗?

"content": [], "stop_reason": None, "stop_sequence": None,
"usage": {"input_tokens": in_tokens, "output_tokens": 0}},
}))
for idx, b in enumerate(blocks):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

想确定一下,这里是会先把整个 streaming 的请求变成同步的吗?

jingshenghang and others added 4 commits May 19, 2026 11:35
…metadata configurable

* bridge.py renamed to middleware.py (file + class BridgeHandle ->
  MiddlewareHandle + log prefix + thread name). The chat-log dataclass
  field glm_messages was also renamed chat_messages -- the example is
  model-agnostic and the GLM-specific naming was misleading.

* sandbox.py no longer hardcodes ``glm-platform/*`` metadata keys.
  Set SWE_SANDBOX_METADATA_JSON='{"my-platform/size":"lg",...}' to pass
  arbitrary routing tags into AsyncSandbox.create(metadata=...). Default
  is empty, which works for stock E2B accounts.

* generate.py defaults SWE_TOOL_PARSER / SWE_REASONING_PARSER to None
  (no hardcoded GLM-specific parser fallback). The reference launch
  script run_glm47_355b.sh still sets these to glm47/glm45.

* sandbox.run_claude_code's ``bridge_url=`` kwarg renamed to
  ``middleware_url=`` (caller in generate.py updated).

* README + run_glm47_355b.sh updated for the rename and the new
  metadata env var.
…ddleware

- middleware: track each /messages call as a `_Turn` (request snapshot,
  response, finish/stop reason, parent prefix), expose via pop_session
  return value and `record_tree` opt-in.
- middleware: detect non-linear message updates by hashing prior messages
  and rebuild the prompt when the client diverges, instead of silently
  appending. Also translate Anthropic `thinking` blocks into
  `reasoning_content` so prior reasoning is preserved across turns.
- generate: add SWE_SAVE_TRAJECTORY_TREE env knob; when set, stash the
  exported tree under sample.metadata["trajectory_tree"]. Also allow
  overriding the Claude Code prompt via SWE_CC_PROMPT.
- sandbox: pass --include-partial-messages / --include-hook-events to
  claude CLI and allow extra args via SWE_CLAUDE_EXTRA_ARGS; quote the
  trajectory path with shlex.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants