fix(swe): resolve tool server startup failures with port allocation, retries, and shell fallback by echobt · Pull Request #18 · CortexLM/swe-forge

echobt · 2026-02-18T18:53:05Z

Summary

Fix the Docker sandbox tool server failing to start in all containers during concurrent pipeline execution, eliminating the pervasive "Tool server health check failed after 3s, tools may not work" warnings.

Changes

Port Allocation (`docker_sandbox.rs`)

Replace timestamp-based port derivation with a global AtomicU16 counter (NEXT_PORT) that guarantees unique ports across all concurrent containers
Eliminate the `as u16` truncation bug and the TOCTOU race conditions that occur when multiple sandboxes start simultaneously under --network=host
Port range: 10,000–60,000 with atomic wrap-around
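The allocator described above can be sketched as follows. This is a minimal illustration, not the code from `docker_sandbox.rs`: the constant names `PORT_MIN` and `PORT_MAX` and the exact wrap-around logic are assumptions; only `NEXT_PORT`, the `AtomicU16` type, and the 10,000 to 60,000 range come from the description.

```rust
use std::sync::atomic::{AtomicU16, Ordering};

// Assumed names for the bounds; the PR only states the range itself.
const PORT_MIN: u16 = 10_000;
const PORT_MAX: u16 = 60_000;

// Process-wide counter shared by every sandbox.
static NEXT_PORT: AtomicU16 = AtomicU16::new(PORT_MIN);

/// Returns the current counter value and advances it by one,
/// wrapping from PORT_MAX back to PORT_MIN.
pub fn allocate_port() -> u16 {
    // fetch_update retries the closure until its compare-and-swap succeeds,
    // so two threads never receive the same previous value.
    NEXT_PORT
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |p| {
            Some(if p >= PORT_MAX { PORT_MIN } else { p + 1 })
        })
        .expect("closure always returns Some")
}
```

In a sketch like this, uniqueness holds within one process until the counter wraps, and the counter does not check whether another program on the host already holds the port, which is presumably why the bind-failure handling and retries listed below still matter.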

Tool Server Robustness (`docker_sandbox.rs`, `tool_server.rs`)

Add HTTPServer.allow_reuse_address = True in the embedded Python server to reduce Address already in use errors
Wrap HTTPServer() creation in try/except to catch and log OSError on port bind failure
Increase health check from 6×500ms (3s) to 12×500ms (6s) to handle slow container startup under load
Add full retry loop (up to 2 attempts) that kills stale processes and re-writes the server script
Verify script file size after writing to detect truncated stdin pipe transfers
Change start_tool_server() to return bool indicating success, tracked via tool_server_ok field
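The polling and retry behavior listed above might look roughly like the sketch below. The function names `wait_healthy` and `start_with_retries`, and the closure parameters standing in for "start the server", "probe the health endpoint", and "kill stale processes", are hypothetical; the real `start_tool_server()` is not shown in this PR description.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Polls `probe` up to `polls` times, sleeping `interval` after each failed
/// poll. With polls = 12 and interval = 500 ms the worst case is about 6 s.
pub fn wait_healthy(mut probe: impl FnMut() -> bool, polls: u32, interval: Duration) -> bool {
    for _ in 0..polls {
        if probe() {
            return true;
        }
        sleep(interval);
    }
    false
}

/// Starts the server and waits for it to become healthy, repeating the whole
/// sequence up to `attempts` times. `cleanup` runs before each retry
/// (assumed to stand in for killing stale processes).
pub fn start_with_retries(
    attempts: u32,
    mut start: impl FnMut() -> bool,
    mut probe: impl FnMut() -> bool,
    mut cleanup: impl FnMut(),
    polls: u32,
    interval: Duration,
) -> bool {
    for attempt in 0..attempts {
        if attempt > 0 {
            cleanup();
        }
        if start() && wait_healthy(&mut probe, polls, interval) {
            return true;
        }
    }
    false
}
```

Returning a plain `bool` mirrors the description of `start_tool_server()`; callers can store the result and decide later whether to use the server or a fallback.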

Shell-Based Tool Fallback (`test_generator.rs`)

Add shell_fallback() function implementing all 5 tools (read_file, list_dir, grep_files, search_files, apply_patch) via direct shell commands
When tool server is unavailable or returns connection errors, transparently fall back to shell execution
Extract parse_tool_response() into a standalone function for cleaner code organization
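The fallback decision described above can be illustrated with a small sketch. The `ToolError` enum and `call_with_fallback` function are hypothetical names introduced here; the assumption is that connection-level failures trigger the shell path while errors reported by the tool itself are returned unchanged.

```rust
/// Hypothetical error type separating transport failures from tool failures.
#[derive(Debug, PartialEq)]
pub enum ToolError {
    /// The tool server could not be reached (refused, timed out, ...).
    Connection(String),
    /// The server answered, but the tool reported a failure.
    Tool(String),
}

/// Uses the tool server when it is marked healthy, and falls back to the
/// shell implementation when it is not or when the request cannot connect.
pub fn call_with_fallback(
    server_ok: bool,
    via_server: impl FnOnce() -> Result<String, ToolError>,
    via_shell: impl FnOnce() -> Result<String, ToolError>,
) -> Result<String, ToolError> {
    if !server_ok {
        return via_shell();
    }
    match via_server() {
        // Only transport failures trigger the fallback.
        Err(ToolError::Connection(_)) => via_shell(),
        other => other,
    }
}
```

Keeping tool-level errors out of the fallback path avoids running the same failing operation twice, for example a `read_file` on a path that does not exist.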

Error Handling Improvements (`docker_sandbox.rs`)

Upgrade git install failed from WARN-and-continue to anyhow::bail! — a sandbox without git is unusable
Upgrade checkout failed from WARN-and-continue to anyhow::bail! — running tests against the wrong commit produces invalid results
Both failures now properly destroy the container before returning the error
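The fail-fast pattern above (abort on the first failed setup step, destroying the container first) can be sketched as below. The PR uses `anyhow::bail!`; this illustration uses `Result<(), String>` from the standard library to stay self-contained, and `setup_sandbox` with its closure-based steps is a hypothetical shape, not the actual function.

```rust
/// Runs setup steps in order. On the first failure, calls `destroy` and
/// returns the error instead of logging a warning and continuing.
pub fn setup_sandbox(
    steps: &[(&str, &dyn Fn() -> Result<(), String>)],
    destroy: &mut dyn FnMut(),
) -> Result<(), String> {
    for (name, step) in steps {
        if let Err(e) = step() {
            // Clean up before propagating, so no container is leaked.
            destroy();
            return Err(format!("{name} failed: {e}"));
        }
    }
    Ok(())
}
```

A caller would pass steps such as "git install" and "checkout" in order; any step after a failed one never runs.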

Tests

Add test_allocate_port_returns_valid_range and test_allocate_port_sequential_unique unit tests for the new port allocator

…l fallback

The tool server inside Docker sandbox containers was failing to start reliably, producing "Tool server health check failed after 3s, tools may not work" warnings for every container. This was caused by port collisions (timestamp-based port derivation with --network=host), insufficient health check timeout, no retry mechanism, and no fallback when the server was unavailable.

Port allocation: Replaced timestamp-based port derivation (which suffered from u16 truncation and TOCTOU races) with a global AtomicU16 counter that guarantees unique ports across all concurrent containers. Each sandbox atomically increments the counter, wrapping from 60000 back to 10000.

Tool server Python code (tool_server.rs): Added HTTPServer.allow_reuse_address = True to reduce EADDRINUSE errors, and wrapped server creation in try/except to log port binding failures before exiting cleanly.

Health check robustness (docker_sandbox.rs): Increased health check from 6x500ms (3s) to 12x500ms (6s). Added full retry loop (up to 2 attempts) that kills stale processes, re-writes the script, and restarts on failure. Verifies script was written correctly by checking file size. Changed start_tool_server() to return bool indicating success, exposed via has_tool_server() for callers.

Error handling improvements (docker_sandbox.rs): git install failure now aborts sandbox creation (was a silent warning). git checkout failure now aborts sandbox creation instead of silently continuing on HEAD, which would produce invalid test results.

Shell-based tool fallback (test_generator.rs): When the tool server is unavailable or returns connection errors, tools (read_file, list_dir, grep_files, search_files, apply_patch) now fall back to equivalent shell commands. Extracted parse_tool_response() helper for cleaner code. This ensures the agent loop continues even if the tool server never starts.

…ction

Add validate_file_path() checks for file/directory path arguments and sanitize_shell_arg() escaping for pattern/glob arguments in the shell_fallback() function. Previously, LLM-provided values containing single quotes could break out of shell quoting and execute arbitrary commands inside the Docker container.
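To make the injection risk concrete, here is one way such helpers could be written. The bodies below are assumptions for illustration: the commit message names `sanitize_shell_arg()` and `validate_file_path()` but does not show their implementations or the exact rules they enforce.

```rust
/// Wraps `arg` in single quotes for a POSIX shell. Each embedded single
/// quote becomes '\'' (close the quote, emit an escaped quote, reopen).
pub fn sanitize_shell_arg(arg: &str) -> String {
    format!("'{}'", arg.replace('\'', r"'\''"))
}

/// Conservative path check with assumed rules: the path must be non-empty,
/// contain no NUL or newline, not start with '-', and have no ".." component.
pub fn validate_file_path(path: &str) -> Result<(), String> {
    if path.is_empty() {
        return Err("empty path".to_string());
    }
    if path.chars().any(|c| c == '\0' || c == '\n') {
        return Err("path contains a control character".to_string());
    }
    if path.starts_with('-') {
        // Would otherwise be parsed as an option by most commands.
        return Err("path starts with '-'".to_string());
    }
    if path.split('/').any(|component| component == "..") {
        return Err("path contains a '..' component".to_string());
    }
    Ok(())
}
```

With this quoting, a pattern such as `x'; rm -rf /; '` reaches `grep` as a single literal argument rather than terminating the quoted string early.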

…cation

echobt added 4 commits February 18, 2026 18:52

ci: trigger CI run

376a4eb

docs(tool_server): fix stale port doc comment to reflect dynamic allo…

fcb723d

…cation

echobt merged commit abb69c6 into main Feb 18, 2026
10 checks passed

echobt deleted the fix/tool-server-port-allocation-health-check branch February 18, 2026 19:15

echobt merged 4 commits into main from fix/tool-server-port-allocation-health-check
