[Repo Assist] fix(node): fire-and-forget capability dispatch to unblock WS receive loop#426
Conversation
…loop (#363) Slow capabilities (screen.record, stt.transcribe, system.run) were blocking the single-threaded WebSocket receive loop, starving health pings and pair events for their entire duration. The gateway could incorrectly declare the node dead during a long capture. Changes: - Add SemaphoreSlim(8) to WindowsNodeClient to bound concurrent invocations - Both dispatch paths (HandleNodeInvokeAsync / HandleNodeInvokeEventAsync) now fire-and-forget via Task.Run, returning the receive loop immediately - Args JsonElement cloned before fire-and-forget to outlive parent JsonDocument - When all 8 slots are occupied, gateway receives immediate 'node busy, retry' response rather than queueing unbounded work - Health/pair/challenge events continue to be handled inline on the receive loop (cheap, latency-sensitive) - Update tests to await cap.ExecutedTask for the async dispatch Closes #363 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Serialize WebSocket sends before enabling concurrent capability invoke responses, and avoid losing semaphore slots if shutdown cancellation fires before a scheduled invoke task starts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Added maintainer-side hardening in commit 83872b8 after review found one concurrency blocker: capability invoke completions can now run in parallel, so their response sends must not call ClientWebSocket.SendAsync concurrently. The follow-up serializes WebSocket sends in WebSocketClientBase and avoids passing the shutdown token to Task.Run so a canceled schedule cannot strand an invoke semaphore slot.\n\nLocal validation in D:\github\moltbot-windows-hub.pr426 passed:\n- .\build.ps1\n- dotnet test .\tests\OpenClaw.Shared.Tests\OpenClaw.Shared.Tests.csproj (1620 passed, 28 skipped)\n- dotnet test .\tests\OpenClaw.Tray.Tests\OpenClaw.Tray.Tests.csproj (953 passed) |
|
Post-refactor triage note: this is now the only remaining automation PR. The maintainer hardening commit is in place and local validation passed earlier, but the PR is still draft, so I am leaving it open and unmerged until it is marked ready and mergeability/checks settle. |
🤖 This PR was created by Repo Assist, an automated AI assistant.
Summary
Fixes head-of-line blocking in
WindowsNodeClient— a slow capability (5-minutescreen.record, longstt.transcribe, blockingsystem.run) blocked all subsequent WebSocket messages including health pings, which could cause the gateway to declare the node dead and tear down the session.Closes #363
Root Cause
Both capability dispatch paths (
HandleNodeInvokeAsyncfortype:reqandHandleNodeInvokeEventAsyncfortype:event) were directlyawaitingcapability.ExecuteAsync(...)on the WS receive loop. The loop cannot advance to the next frame until the capability completes.Fix
SemaphoreSlim(8)added toWindowsNodeClientto bound concurrent invocations.Task.Run, returning the receive loop immediately after launching the work.health,connect.challenge,node.pair.*,device.pair.*) continue to be handled inline on the receive loop.JsonElement argscloned before fire-and-forget to outlive the parentJsonDocumentthat is disposed whenProcessMessageAsyncreturns."node busy, retry"error rather than unbounded queuing.OperationCanceledExceptionduring shutdown is silently swallowed per the standard pattern.Trade-offs
Test Status
Three existing
CommandDispatch_*tests updated to awaitcap.ExecutedTask(a newTaskCompletionSourceonMockCapability) since capability execution is no longer synchronous with theProcessMessageAsynccall.All 41
WindowsNodeClientTestspass.