asyncio.Lock in messaging.py not released on client disconnect → cascading timeouts
Describe the bug
When a client disconnects from a code execution request (e.g., due to SDK timeout), the per-context asyncio.Lock() in template/server/messaging.py remains held until the Jupyter kernel finishes the execution. All subsequent code executions on the same context block behind this orphaned lock, causing a cascade of timeouts.
The only recovery is calling POST /contexts/{id}/restart, which creates a new ContextWebSocket with a fresh lock. But this clears all kernel state (variables, imports), which is a heavy penalty for what should be a recoverable situation.
Root Cause
In template/server/messaging.py, the ContextWebSocket.execute() method holds an asyncio.Lock() for the entire duration of code execution, including streaming results back:
```python
class ContextWebSocket:
    def __init__(self, context_id, session_id, language, cwd):
        self._lock = asyncio.Lock()

    async def execute(self, code, env_vars, access_token):
        async with self._lock:  # Lock acquired here
            await self._ws.send(request)  # Send to Jupyter kernel
            async for item in self._wait_for_result(message_id):
                yield item  # Stream results while holding lock
```

The HTTP endpoint in `template/server/main.py` wraps this generator in a streaming response:
```python
@app.post("/execute")
async def post_execute(request, exec_request):
    return StreamingListJsonResponse(
        ws.execute(code, env_vars, access_token)
    )
```

When the SDK client times out (default 300s) and closes the HTTP connection, FastAPI/Starlette abandons the streaming generator. However, the `asyncio.Lock` is still held inside the generator's frame — it only releases when:
- The kernel finishes execution and `_wait_for_result()` hits `EndOfExecution`, OR
- The generator is garbage collected (non-deterministic), OR
- The context is restarted via `POST /contexts/{id}/restart`
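The failure mode is reproducible outside the e2b codebase. A minimal sketch (the generator and names below are illustrative, not from `messaging.py`) shows that abandoning a suspended async generator leaves its lock held — release only happens later, when the garbage collector finalizes the frame:

```python
import asyncio

async def main() -> bool:
    lock = asyncio.Lock()

    async def execute():
        # Mirrors ContextWebSocket.execute(): the lock is held across yields.
        async with lock:
            for chunk in ("partial", "results"):
                yield chunk

    agen = execute()
    await agen.__anext__()  # consumer pulls one item; lock is now acquired
    del agen                # "client disconnect": generator abandoned mid-stream
    # aclose() is only *scheduled* by the GC finalizer hook; at this point
    # the suspended generator frame still owns the lock.
    return lock.locked()

print(asyncio.run(main()))  # True: the lock outlives its consumer
```

Any coroutine that tries to acquire the same lock in this window blocks, which is exactly the cascade described below.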
The Cascade
```
T=0:00   Client sends long-running code → lock acquired → kernel executing
T=5:00   SDK timeout (300s) → client HTTP disconnect → lock STILL held
T=5:01   Client retries with new code → blocked on lock
T=10:01  Retry also times out → next retry blocked behind both
...      each retry adds another timeout duration of queue time ...
```
envd Log Evidence
We observed this directly via `journalctl -u envd` logs inside a sandbox. Normal executions show sequential lock acquire/release:

```
06:14:05 Execution abc... finished [LOCK RELEASED]
06:14:05 Input accepted for def... [LOCK ACQUIRED]
06:14:15 Execution def... finished [LOCK RELEASED]
```
During a deadlock incident, we captured:

```
05:43:45 d8a56844 → Sending code (SessionView.from_id)
05:43:45 d8a56844 → Input accepted [LOCK ACQUIRED]
...       kernel running for ~5 min (large HTTP request) ...
≈05:48:45 SDK timeout fires (300s) → client disconnects
***       4.6 minutes of TOTAL SILENCE — all executions blocked ***
05:53:23 Next execution finally gets lock [after kernel finished internally]
```
Execution d8a56844 has "Sending code" + "Input accepted" but no "finished execution" event — a confirmed orphaned lock holder. Total lock hold: 578s (9.6 min), of which 278s (4.6 min) was orphaned after the client disconnected.
Suggested Fix
Instead of holding the lock for the entire generator lifetime, consider one of:
- Release the lock after sending to the kernel — The lock's purpose is to serialize sends to the Jupyter kernel WebSocket. Once the message is sent and accepted, the lock could be released. Streaming results doesn't require the lock since `_wait_for_result` reads from a per-execution queue.
- Add a `finally` clause to release on generator close — Wrap the lock acquisition so that when the generator is closed (by FastAPI on client disconnect), the lock is explicitly released:

  ```python
  async def execute(self, code, env_vars, access_token):
      await self._lock.acquire()
      try:
          await self._ws.send(request)
          async for item in self._wait_for_result(message_id):
              yield item
      finally:
          self._lock.release()
  ```

  Note: `async with self._lock` inside an async generator may not trigger `__aexit__` on generator close in all Python versions.
- Add a lock timeout — Use `asyncio.wait_for(self._lock.acquire(), timeout=N)` so blocked executions fail fast instead of cascading.
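Options 2 and 3 compose naturally. Below is a self-contained sketch of the combined approach; `guarded_execute` and `LOCK_ACQUIRE_TIMEOUT` are invented names for illustration and not part of the e2b codebase:

```python
import asyncio

LOCK_ACQUIRE_TIMEOUT = 0.1  # hypothetical value; tune to expected execution time

async def guarded_execute(lock: asyncio.Lock, work):
    """Stream items from work() while holding lock, releasing the lock even
    when the consumer (e.g. Starlette on client disconnect) closes the
    generator early."""
    try:
        await asyncio.wait_for(lock.acquire(), timeout=LOCK_ACQUIRE_TIMEOUT)
    except asyncio.TimeoutError:
        # Fail fast instead of queueing behind an orphaned lock holder.
        raise RuntimeError("context busy: lock not acquired in time")
    try:
        async for item in work():
            yield item
    finally:
        # Runs on normal exhaustion *and* on GeneratorExit from aclose().
        lock.release()

async def demo():
    lock = asyncio.Lock()

    async def work():
        yield "ok"

    # Normal path: results stream through and the lock is released afterwards.
    items = [i async for i in guarded_execute(lock, work)]
    print(items, lock.locked())  # ['ok'] False

    # Contended path: a stuck holder makes the next caller fail fast.
    await lock.acquire()
    try:
        async for _ in guarded_execute(lock, work):
            pass
    except RuntimeError as exc:
        print(exc)  # context busy: lock not acquired in time
    finally:
        lock.release()

asyncio.run(demo())
```

With this shape, a disconnected client's retry either succeeds immediately (option 2 released the lock on generator close) or fails within `LOCK_ACQUIRE_TIMEOUT` with a clear error, instead of stacking timeouts.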
Impact
This affects any SDK user whose code execution exceeds the SDK timeout. In our production environment:
- One session had 8 consecutive timeouts — even `print('hello')` was blocked
- Another had 16 consecutive timeouts before full sandbox destruction recovered it
- Our workaround: call `restartCodeContext()` on every timeout, but this clears kernel state
Related Issues
- Unstable service E2B#1017 — "Unstable service" / sandbox hangs (closed with a fix in Dec 2025, but this specific lock issue persists)
- Executing AsyncCommandHandle.kill() will cause blocking and cannot kill the process. E2B#1034 — `kill()` blocks, process keeps running after disconnect
- [Bug]: commands.run hangs indefinitely when sandbox becomes unreachable — request_timeout does not set read timeout on streaming calls E2B#1128 — `commands.run` hangs indefinitely, no read timeout on streams
Environment
- E2B SDK: `@e2b/code-interpreter` v1.x (JS/TS)
- Python version inside sandbox: 3.12
- Affected file: `template/server/messaging.py` (deployed to sandbox as `/root/.server/messaging.py`)