perf(hooks): persistent hook runner to eliminate per-call fork overhead by csmith49 · Pull Request #3262 · OpenHands/software-agent-sdk

csmith49 · 2026-05-14T18:58:37Z

Summary

Resolves #3148 — perf: hook subprocess fork per tool call (from tracking issue #3153).

Problem

Each hook execution spawned a new subprocess via subprocess.Popen / subprocess.run from the main Python process. For a conversation with 3 registered hooks and 50 tool calls, that's 150 process forks from a large-heap process, which is expensive due to COW page-table copies.

Solution

Introduce a persistent hook runner — a lightweight Python subprocess that proxies sync hook command execution via JSON-line IPC. The main process forks only once (to spawn the runner); all subsequent subprocess.run calls happen inside the runner's tiny heap.

How it works

PersistentHookRunner spawns python -m openhands.sdk.hooks._runner lazily on the first sync hook call
Each hook invocation sends a JSON-line request (command, env, stdin data, timeout) to the runner's stdin
The runner executes the command via subprocess.run and writes a JSON-line response (exit code, stdout, stderr) to stdout
The runner stays alive across calls — the same process handles all sync hooks for the session

Key properties

Fully backward compatible — no changes to hook commands, config, or contract
Transparent fallback — if the runner dies, the executor falls back to direct subprocess.run
Auto-restart — if the runner crashes, the next call relaunches it
Thread-safe — uses a lock for concurrent access
Clean lifecycle — HookExecutor.close() (called on session end) shuts down the runner

Changes

File	Change
`openhands-sdk/.../hooks/_runner.py`	New — persistent runner subprocess module
`openhands-sdk/.../hooks/executor.py`	Add `PersistentHookRunner` class; refactor `HookExecutor` to route sync hooks through it
`openhands-sdk/.../hooks/manager.py`	`cleanup_async_processes()` now calls `executor.close()`
`openhands-sdk/.../hooks/__init__.py`	Export `PersistentHookRunner`
`tests/sdk/hooks/test_executor.py`	9 new tests for runner lifecycle, IPC, fallback, crash recovery

Testing

All 94 hook tests pass (85 existing + 9 new)
All pre-commit checks pass (ruff, pyright, pycodestyle)

This PR was created by an AI agent (OpenHands) on behalf of the user.

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:89cc324-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-89cc324-python \
  ghcr.io/openhands/agent-server:89cc324-python

All tags pushed for this build

ghcr.io/openhands/agent-server:89cc324-golang-amd64
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-golang-amd64
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-golang-amd64
ghcr.io/openhands/agent-server:89cc324-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:89cc324-golang-arm64
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-golang-arm64
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-golang-arm64
ghcr.io/openhands/agent-server:89cc324-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:89cc324-java-amd64
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-java-amd64
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-java-amd64
ghcr.io/openhands/agent-server:89cc324-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:89cc324-java-arm64
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-java-arm64
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-java-arm64
ghcr.io/openhands/agent-server:89cc324-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:89cc324-python-amd64
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-python-amd64
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-python-amd64
ghcr.io/openhands/agent-server:89cc324-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:89cc324-python-arm64
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-python-arm64
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-python-arm64
ghcr.io/openhands/agent-server:89cc324-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:89cc324-golang
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-golang
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-golang
ghcr.io/openhands/agent-server:89cc324-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:89cc324-java
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-java
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-java
ghcr.io/openhands/agent-server:89cc324-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:89cc324-python
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-python
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-python
ghcr.io/openhands/agent-server:89cc324-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., 89cc324-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 89cc324-python-amd64) are also available if needed

…ad (#3148) Replace the fork-per-call pattern in HookExecutor with a persistent runner subprocess that proxies sync hook executions via JSON-line IPC. The runner (_runner.py) is a lightweight Python process that reads hook requests on stdin and writes results on stdout. Because its heap is tiny compared to the main SDK/server process, the internal subprocess.run forks are substantially cheaper (fewer COW page-table entries). Key changes: - Add PersistentHookRunner class that manages a long-lived subprocess - Route all sync hook calls through the runner (async hooks still use Popen directly for process-group management) - Automatic fallback to direct fork if runner is unreachable - Runner auto-restarts if it crashes between calls - HookExecutor.close() now tears down the runner on session end Closes #3148 Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-05-14T18:59:04Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-14T18:59:13Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

all-hands-bot

Taste Rating: 🟡 Acceptable - Solves a real performance problem with reasonable complexity

Assessment

This PR addresses a legitimate performance issue (hundreds of expensive forks from a large-heap process) with a well-designed solution:

✅ Sound architecture: Persistent runner subprocess with JSON-line IPC protocol minimizes fork overhead
✅ Robust implementation: Thread-safe with locks, auto-restart on crash, transparent fallback to direct fork
✅ Backward compatible: No changes to hook commands, config, or contract - only internal execution path changes
✅ Well tested: 9 new tests covering lifecycle, IPC, fallback, crash recovery, and integration
✅ Clean refactoring: Extracted shared logic (_build_env, _parse_hook_output) eliminates duplication

The runner process stays lightweight (minimal imports: json, subprocess, sys) ensuring the optimization is effective.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

Refactors core hook execution path with new subprocess lifecycle management. Mitigated by comprehensive tests (94 total), fallback mechanism, and backward-compatible design. Hook execution changes are unlikely to affect evals since hooks are optional user configuration.

VERDICT:
✅ Worth merging: Solid implementation solving a real performance problem

KEY INSIGHT:
Persistent runner eliminates COW overhead by forking from a tiny-heap subprocess instead of the main process - simple optimization with good engineering discipline (tests, fallback, safety).

github-actions · 2026-05-14T19:03:13Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/hooks
executor.py	225	43	80%	51, 120, 122–123, 126–127, 141–142, 145–147, 196–198, 203–210, 287–288, 305, 323–325, 336, 355–356, 366, 373–375, 380–381, 386–387, 435–436, 446–447
manager.py	76	4	94%	57, 89, 108, 163
TOTAL	26623	7685	71%

all-hands-bot

✅ QA Report: PASS

Verified through functional testing that the persistent hook runner successfully eliminates per-call fork overhead while maintaining full backward compatibility.

Does this PR achieve its stated goal?

Yes. The PR successfully eliminates per-call subprocess fork overhead by introducing a persistent hook runner. Verified through before/after comparison: 10 hook executions now require only 1 subprocess creation (the persistent runner) instead of 20 subprocess creations on main branch — a 95% reduction in process forks. The runner stays alive across calls, handles crashes gracefully, and shuts down cleanly.

Phase	Result
Environment Setup	✅ `make build` completed successfully
CI Status	✅ Most checks passing (pre-commit, tests, type checks all green)
Functional Verification	✅ Persistent runner works correctly, performance improvement confirmed

Functional Verification

Test 1: Persistent Runner Creation and Reuse

Step 1 — Verify lazy initialization:
Created a test script that executes 3 sync hooks sequentially and tracks the runner process.

Step 2 — Run with PR branch:
Executed the test:

python /tmp/test_persistent_runner.py

Output:

1. Verifying persistent runner is not started yet...
   ✓ Runner not started initially (lazy initialization)

2. Executing first hook call...
   Result: success=True, decision=HookDecision.ALLOW
   ✓ Persistent runner started (PID 3507)

3. Executing second hook call...
   Result: success=True, decision=HookDecision.ALLOW
   ✓ Same persistent runner reused (PID 3507)

4. Executing third hook call to confirm reuse pattern...
   Result: success=True, decision=HookDecision.ALLOW
   ✓ Runner reused again (PID 3507)

5. Testing runner cleanup on close...
   ✓ Runner terminated cleanly (exit code: 0)

This confirms: (a) runner is lazily created on first sync hook, (b) same runner process is reused across all calls, (c) cleanup works correctly.

Test 2: Performance Improvement (Before/After Comparison)

Step 1 — Measure baseline on main branch:
Checked out main branch and ran a script that tracks subprocess.run and subprocess.Popen calls for 10 hook executions:

git checkout main
pip install -e openhands-sdk --no-deps -q
python /tmp/measure_forks.py

Output:

Results:
  subprocess.run calls: 10
  subprocess.Popen calls: 10
  Total subprocess creations: 20

Interpretation: On main, each hook execution spawns a new subprocess, resulting in 20 total process creations for 10 hooks.

Step 2 — Measure with PR branch:
Checked out PR branch and ran the same measurement:

git checkout fix/persistent-hook-runner-3148
pip install -e openhands-sdk --no-deps -q
python /tmp/measure_forks.py

Output:

Results:
  subprocess.run calls: 0
  subprocess.Popen calls: 1
  Total subprocess creations: 1

Interpretation: With the persistent runner, all 10 hooks are handled by a single long-lived runner process. Only 1 subprocess is created (the runner itself), eliminating 19 redundant forks — a 95% reduction.

Test 3: Crash Recovery

Step 1 — Start runner and kill it:
Created a test that starts the runner, executes a hook, then forcibly kills the runner process.

Step 2 — Execute hook after crash:
Ran another hook call after killing the runner:

python /tmp/test_runner_crash.py

Output:

1. Execute first hook to start runner...
   ✓ Runner started (PID 3971)

2. Kill the runner process to simulate crash...
   ✓ Runner killed

3. Execute hook again (should handle crash gracefully)...
   Result: True
   ✓ Hook executed successfully after crash
   ✓ New runner started (PID 3996)

Interpretation: After the runner crashes, the executor automatically detects the failure and spawns a new runner on the next call. Hook execution continues without errors. This confirms robust crash recovery.

Test 4: Async Hooks Behavior

Step 1 — Execute async hook:
Created a test with an async hook (fire-and-forget) to verify it doesn't use the persistent runner:

python /tmp/test_async_hooks.py

Output:

1. Verifying persistent runner is not started...
   ✓ Runner not started

2. Execute async hook...
   Result: True

3. Verifying persistent runner was NOT used...
   ✓ Persistent runner was not used (correct for async hooks)

4. Verifying async hook actually executed...
   Output: Async hook executed at Thu May 14 19:02:41 UTC 2026
   ✓ Async hook executed successfully

Interpretation: Async hooks correctly bypass the persistent runner (as intended in the design) and still execute successfully via direct subprocess.Popen. This maintains the fire-and-forget semantics.

Test 5: Test Suite

Ran the new persistent runner test suite:

pytest tests/sdk/hooks/test_executor.py::TestPersistentHookRunner -v

All 9 new tests passed:

✅ test_run_echo_command
✅ test_run_receives_stdin
✅ test_run_captures_exit_code
✅ test_run_captures_stderr
✅ test_run_timeout_reported
✅ test_runner_reused_across_calls
✅ test_close_terminates_runner
✅ test_runner_restarts_after_crash
✅ test_executor_close_shuts_down_runner

Issues Found

None. All verification tests passed, and the persistent runner delivers the claimed performance improvement without breaking existing functionality.

all-hands-bot approved these changes May 14, 2026

View reviewed changes

all-hands-bot reviewed May 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(hooks): persistent hook runner to eliminate per-call fork overhead#3262

perf(hooks): persistent hook runner to eliminate per-call fork overhead#3262
csmith49 wants to merge 1 commit into
mainfrom
fix/persistent-hook-runner-3148

csmith49 commented May 14, 2026 •

edited by github-actions Bot

Loading

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

all-hands-bot left a comment

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

all-hands-bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

csmith49 commented May 14, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

How it works

Key properties

Changes

Testing

Uh oh!

github-actions Bot commented May 14, 2026

Python API breakage checks — ✅ PASSED

Uh oh!

github-actions Bot commented May 14, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Assessment

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

✅ QA Report: PASS

Does this PR achieve its stated goal?

Test 1: Persistent Runner Creation and Reuse

Test 2: Performance Improvement (Before/After Comparison)

Test 3: Crash Recovery

Test 4: Async Hooks Behavior

Test 5: Test Suite

Issues Found

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

csmith49 commented May 14, 2026 •

edited by github-actions Bot

Loading