Fix lock inversion deadlock in SSE reconnection#108
Fix lock inversion deadlock in SSE reconnection#108Copilot wants to merge 7 commits intobug-sse-connection-dropping-after-premature-response-endfrom
Conversation
…outside lock Co-authored-by: JamieSinn <1538232+JamieSinn@users.noreply.github.com>
Co-authored-by: JamieSinn <1538232+JamieSinn@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This pull request addresses a lock inversion issue in the SSE (Server-Sent Events) error handling logic that could cause deadlocks. The fix refactors _recreate_sse_connection to use a two-phase locking approach: acquiring the lock to save references, performing blocking operations (close/join) outside the lock, then re-acquiring the lock to create the new connection.
Changes:
- Moved potentially blocking
close()andjoin()operations outside the lock to prevent deadlocks when the SSE read thread calls error callbacks that need the same lock - Split the single lock acquisition into two phases: one to save state, one to update state
- Added explanatory comments documenting the locking strategy and rationale
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| self._sse_manager = SSEManager( | ||
| self.sse_state, | ||
| self.sse_error, | ||
| self.sse_message, |
There was a problem hiding this comment.
Typo in log message: "Devcyle" should be "DevCycle" to match the spelling used in most other log messages in this file (e.g., lines 131, 150, 165, 168, 193, 202, 255). While this appears to be a pre-existing issue, since this line was moved as part of the refactoring, it would be a good opportunity to fix the spelling for consistency.
There was a problem hiding this comment.
Fixed in eb7844f (already corrected before these changes).
There was a problem hiding this comment.
The changes to _recreate_sse_connection that address the lock inversion issue lack test coverage. The existing test suite (test/managers/test_config_manager.py) only tests with disable_realtime_updates=True, which completely bypasses this code path.
This is particularly important because:
- The fix involves complex concurrency logic with two-phase locking
- The issue being fixed was a deadlock scenario that would only manifest under specific timing conditions
- The behavior during the window between releasing and re-acquiring the lock is critical
Consider adding tests that verify:
- Concurrent calls to
_recreate_sse_connectionare handled correctly - The SSE manager state remains consistent when errors occur during reconnection
- The interaction between
sse_errorcallbacks and the reconnection logic works without deadlocks
There was a problem hiding this comment.
Added comprehensive test coverage in 89a8467. Tests now verify: (1) _sse_manager is cleared before blocking operations, (2) latest config is used after re-acquiring lock, (3) concurrent calls are handled safely, and (4) proper behavior when config is cleared during reconnection.
There was a problem hiding this comment.
There is a potential race condition in this implementation. Between releasing the lock at line 72 and re-acquiring it at line 85, other threads could access or modify self._sse_manager. Specifically:
- The
_get_configmethod (lines 144-152) readsself._sse_managerwithout holding the lock - The
closemethod (lines 266-267) accessesself._sse_managerwithout holding the lock - Another thread could call
_recreate_sse_connectionconcurrently
This means self._sse_manager could be replaced or set to None while the saved old_sse_manager is being closed outside the lock. While this may not cause immediate crashes, it creates undefined behavior where the old manager might be replaced mid-cleanup.
Consider setting self._sse_manager = None before releasing the lock at line 72, so that concurrent accesses see a consistent state indicating "no active connection" during the transition. This would also prevent the SSE read thread's error callbacks from trying to use a manager that's being torn down.
There was a problem hiding this comment.
Fixed in 741b526. Now setting self._sse_manager = None before releasing the lock, ensuring concurrent accesses see a consistent "no active connection" state during the transition.
There was a problem hiding this comment.
The current_config reference is saved while holding the lock and then used outside the lock at line 92. While Python dictionaries are reference types and won't be garbage collected, there's a subtle race condition: self._config could be reassigned by another thread (e.g., in _get_config at line 137) between line 71 and line 92.
This means the SSE manager could be initialized with a stale config that's different from the current self._config. While this may eventually self-correct, it could lead to inconsistent state where the SSE manager is connected to a URL that doesn't match the current configuration.
Consider re-reading self._config inside the second lock acquisition to ensure the SSE manager is always initialized with the most current configuration.
There was a problem hiding this comment.
Fixed in 741b526. Now re-reading self._config when re-acquiring the lock to ensure the SSE manager is always initialized with the most current configuration.
|
@copilot re-work the PR to fix the identified issues. |
…thon-server-sdk into copilot/sub-pr-107
Co-authored-by: JamieSinn <1538232+JamieSinn@users.noreply.github.com>
|
@copilot get rid of the lock setup |
Co-authored-by: JamieSinn <1538232+JamieSinn@users.noreply.github.com>
|
Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details. Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
Could you clarify what you mean by "get rid of the lock setup"? The current implementation uses a two-phase locking approach to prevent deadlocks. Are you asking to:
The lock is needed because the SSE read thread can call |
_sse_manager = Nonebefore releasing lock (741b526)self._configwhen re-acquiring lock (741b526)💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.