Skip to content

perf(hooks): persistent hook runner to eliminate per-call fork overhead#3262

Open
csmith49 wants to merge 1 commit into
mainfrom
fix/persistent-hook-runner-3148
Open

perf(hooks): persistent hook runner to eliminate per-call fork overhead#3262
csmith49 wants to merge 1 commit into
mainfrom
fix/persistent-hook-runner-3148

Conversation

@csmith49
Copy link
Copy Markdown
Collaborator

@csmith49 csmith49 commented May 14, 2026

Summary

Resolves #3148perf: hook subprocess fork per tool call (from tracking issue #3153).

Problem

Each hook execution spawned a new subprocess via subprocess.Popen / subprocess.run from the main Python process. For a conversation with 3 registered hooks and 50 tool calls, that's 150 process forks from a large-heap process, which is expensive due to COW page-table copies.

Solution

Introduce a persistent hook runner — a lightweight Python subprocess that proxies sync hook command execution via JSON-line IPC. The main process forks only once (to spawn the runner); all subsequent subprocess.run calls happen inside the runner's tiny heap.

How it works

  1. PersistentHookRunner spawns python -m openhands.sdk.hooks._runner lazily on the first sync hook call
  2. Each hook invocation sends a JSON-line request (command, env, stdin data, timeout) to the runner's stdin
  3. The runner executes the command via subprocess.run and writes a JSON-line response (exit code, stdout, stderr) to stdout
  4. The runner stays alive across calls — the same process handles all sync hooks for the session

Key properties

  • Fully backward compatible — no changes to hook commands, config, or contract
  • Transparent fallback — if the runner dies, the executor falls back to direct subprocess.run
  • Auto-restart — if the runner crashes, the next call relaunches it
  • Thread-safe — uses a lock for concurrent access
  • Clean lifecycleHookExecutor.close() (called on session end) shuts down the runner

Changes

File Change
openhands-sdk/.../hooks/_runner.py New — persistent runner subprocess module
openhands-sdk/.../hooks/executor.py Add PersistentHookRunner class; refactor HookExecutor to route sync hooks through it
openhands-sdk/.../hooks/manager.py cleanup_async_processes() now calls executor.close()
openhands-sdk/.../hooks/__init__.py Export PersistentHookRunner
tests/sdk/hooks/test_executor.py 9 new tests for runner lifecycle, IPC, fallback, crash recovery

Testing

  • All 94 hook tests pass (85 existing + 9 new)
  • All pre-commit checks pass (ruff, pyright, pycodestyle)

This PR was created by an AI agent (OpenHands) on behalf of the user.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:89cc324-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-89cc324-python \
  ghcr.io/openhands/agent-server:89cc324-python

All tags pushed for this build

ghcr.io/openhands/agent-server:89cc324-golang-amd64
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-golang-amd64
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-golang-amd64
ghcr.io/openhands/agent-server:89cc324-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:89cc324-golang-arm64
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-golang-arm64
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-golang-arm64
ghcr.io/openhands/agent-server:89cc324-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:89cc324-java-amd64
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-java-amd64
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-java-amd64
ghcr.io/openhands/agent-server:89cc324-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:89cc324-java-arm64
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-java-arm64
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-java-arm64
ghcr.io/openhands/agent-server:89cc324-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:89cc324-python-amd64
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-python-amd64
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-python-amd64
ghcr.io/openhands/agent-server:89cc324-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:89cc324-python-arm64
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-python-arm64
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-python-arm64
ghcr.io/openhands/agent-server:89cc324-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:89cc324-golang
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-golang
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-golang
ghcr.io/openhands/agent-server:89cc324-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:89cc324-java
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-java
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-java
ghcr.io/openhands/agent-server:89cc324-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:89cc324-python
ghcr.io/openhands/agent-server:89cc324d87645253d5fcd9b340cffec41cf842b3-python
ghcr.io/openhands/agent-server:fix-persistent-hook-runner-3148-python
ghcr.io/openhands/agent-server:89cc324-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 89cc324-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 89cc324-python-amd64) are also available if needed

…ad (#3148)

Replace the fork-per-call pattern in HookExecutor with a persistent
runner subprocess that proxies sync hook executions via JSON-line IPC.

The runner (_runner.py) is a lightweight Python process that reads hook
requests on stdin and writes results on stdout.  Because its heap is
tiny compared to the main SDK/server process, the internal subprocess.run
forks are substantially cheaper (fewer COW page-table entries).

Key changes:
- Add PersistentHookRunner class that manages a long-lived subprocess
- Route all sync hook calls through the runner (async hooks still use
  Popen directly for process-group management)
- Automatic fallback to direct fork if runner is unreachable
- Runner auto-restarts if it crashes between calls
- HookExecutor.close() now tears down the runner on session end

Closes #3148

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions
Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable - Solves a real performance problem with reasonable complexity

Assessment

This PR addresses a legitimate performance issue (hundreds of expensive forks from a large-heap process) with a well-designed solution:

Sound architecture: Persistent runner subprocess with JSON-line IPC protocol minimizes fork overhead
Robust implementation: Thread-safe with locks, auto-restart on crash, transparent fallback to direct fork
Backward compatible: No changes to hook commands, config, or contract - only internal execution path changes
Well tested: 9 new tests covering lifecycle, IPC, fallback, crash recovery, and integration
Clean refactoring: Extracted shared logic (_build_env, _parse_hook_output) eliminates duplication

The runner process stays lightweight (minimal imports: json, subprocess, sys) ensuring the optimization is effective.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

Refactors core hook execution path with new subprocess lifecycle management. Mitigated by comprehensive tests (94 total), fallback mechanism, and backward-compatible design. Hook execution changes are unlikely to affect evals since hooks are optional user configuration.

VERDICT:
Worth merging: Solid implementation solving a real performance problem

KEY INSIGHT:
Persistent runner eliminates COW overhead by forking from a tiny-heap subprocess instead of the main process - simple optimization with good engineering discipline (tests, fallback, safety).

@github-actions
Copy link
Copy Markdown
Contributor

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/hooks
   executor.py2254380%51, 120, 122–123, 126–127, 141–142, 145–147, 196–198, 203–210, 287–288, 305, 323–325, 336, 355–356, 366, 373–375, 380–381, 386–387, 435–436, 446–447
   manager.py76494%57, 89, 108, 163
TOTAL26623768571% 

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Verified through functional testing that the persistent hook runner successfully eliminates per-call fork overhead while maintaining full backward compatibility.

Does this PR achieve its stated goal?

Yes. The PR successfully eliminates per-call subprocess fork overhead by introducing a persistent hook runner. Verified through before/after comparison: 10 hook executions now require only 1 subprocess creation (the persistent runner) instead of 20 subprocess creations on main branch — a 95% reduction in process forks. The runner stays alive across calls, handles crashes gracefully, and shuts down cleanly.

Phase Result
Environment Setup make build completed successfully
CI Status ✅ Most checks passing (pre-commit, tests, type checks all green)
Functional Verification ✅ Persistent runner works correctly, performance improvement confirmed
Functional Verification

Test 1: Persistent Runner Creation and Reuse

Step 1 — Verify lazy initialization:
Created a test script that executes 3 sync hooks sequentially and tracks the runner process.

Step 2 — Run with PR branch:
Executed the test:

python /tmp/test_persistent_runner.py

Output:

1. Verifying persistent runner is not started yet...
   ✓ Runner not started initially (lazy initialization)

2. Executing first hook call...
   Result: success=True, decision=HookDecision.ALLOW
   ✓ Persistent runner started (PID 3507)

3. Executing second hook call...
   Result: success=True, decision=HookDecision.ALLOW
   ✓ Same persistent runner reused (PID 3507)

4. Executing third hook call to confirm reuse pattern...
   Result: success=True, decision=HookDecision.ALLOW
   ✓ Runner reused again (PID 3507)

5. Testing runner cleanup on close...
   ✓ Runner terminated cleanly (exit code: 0)

This confirms: (a) runner is lazily created on first sync hook, (b) same runner process is reused across all calls, (c) cleanup works correctly.


Test 2: Performance Improvement (Before/After Comparison)

Step 1 — Measure baseline on main branch:
Checked out main branch and ran a script that tracks subprocess.run and subprocess.Popen calls for 10 hook executions:

git checkout main
pip install -e openhands-sdk --no-deps -q
python /tmp/measure_forks.py

Output:

Results:
  subprocess.run calls: 10
  subprocess.Popen calls: 10
  Total subprocess creations: 20

Interpretation: On main, each hook execution spawns a new subprocess, resulting in 20 total process creations for 10 hooks.

Step 2 — Measure with PR branch:
Checked out PR branch and ran the same measurement:

git checkout fix/persistent-hook-runner-3148
pip install -e openhands-sdk --no-deps -q
python /tmp/measure_forks.py

Output:

Results:
  subprocess.run calls: 0
  subprocess.Popen calls: 1
  Total subprocess creations: 1

Interpretation: With the persistent runner, all 10 hooks are handled by a single long-lived runner process. Only 1 subprocess is created (the runner itself), eliminating 19 redundant forks — a 95% reduction.


Test 3: Crash Recovery

Step 1 — Start runner and kill it:
Created a test that starts the runner, executes a hook, then forcibly kills the runner process.

Step 2 — Execute hook after crash:
Ran another hook call after killing the runner:

python /tmp/test_runner_crash.py

Output:

1. Execute first hook to start runner...
   ✓ Runner started (PID 3971)

2. Kill the runner process to simulate crash...
   ✓ Runner killed

3. Execute hook again (should handle crash gracefully)...
   Result: True
   ✓ Hook executed successfully after crash
   ✓ New runner started (PID 3996)

Interpretation: After the runner crashes, the executor automatically detects the failure and spawns a new runner on the next call. Hook execution continues without errors. This confirms robust crash recovery.


Test 4: Async Hooks Behavior

Step 1 — Execute async hook:
Created a test with an async hook (fire-and-forget) to verify it doesn't use the persistent runner:

python /tmp/test_async_hooks.py

Output:

1. Verifying persistent runner is not started...
   ✓ Runner not started

2. Execute async hook...
   Result: True

3. Verifying persistent runner was NOT used...
   ✓ Persistent runner was not used (correct for async hooks)

4. Verifying async hook actually executed...
   Output: Async hook executed at Thu May 14 19:02:41 UTC 2026
   ✓ Async hook executed successfully

Interpretation: Async hooks correctly bypass the persistent runner (as intended in the design) and still execute successfully via direct subprocess.Popen. This maintains the fire-and-forget semantics.


Test 5: Test Suite

Ran the new persistent runner test suite:

pytest tests/sdk/hooks/test_executor.py::TestPersistentHookRunner -v

All 9 new tests passed:

  • ✅ test_run_echo_command
  • ✅ test_run_receives_stdin
  • ✅ test_run_captures_exit_code
  • ✅ test_run_captures_stderr
  • ✅ test_run_timeout_reported
  • ✅ test_runner_reused_across_calls
  • ✅ test_close_terminates_runner
  • ✅ test_runner_restarts_after_crash
  • ✅ test_executor_close_shuts_down_runner

Issues Found

None. All verification tests passed, and the persistent runner delivers the claimed performance improvement without breaking existing functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: hook subprocess fork per tool call

2 participants