fix(python): prevent MCP message_handler deadlock on notification reload #4866

giles17 wants to merge 2 commits into microsoft:main from
Conversation
When an MCP server sends a `notifications/tools/list_changed` or `notifications/prompts/list_changed` notification, the `message_handler` previously awaited `load_tools()`/`load_prompts()` directly. Since the handler runs on the MCP SDK's single-threaded receive loop, this caused a deadlock: `load_tools()` sends a `list_tools` request and waits for its response, but the receive loop cannot deliver that response while blocked in the handler. This manifested as a timeout in `call_tool()`, which then surfaced as `"Error: Function failed."` to the model instead of the real tool output. The MATLAB MCP server reliably triggers this because it sends a `tools/list_changed` notification during tool execution.

Fix: schedule reloads as background `asyncio.Task`s via a new `_schedule_reload()` helper, freeing the receive loop immediately.

Fixes microsoft#4828

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
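The failure mode and the fix can be illustrated with a self-contained sketch. The class and helper names below mirror the PR description but are stand-ins, not the actual agent_framework code:

```python
import asyncio

class FakeTool:
    """Illustrative stand-in for MCPTool; models only the deadlock pattern."""

    def __init__(self) -> None:
        self._pending_reload_tasks: set[asyncio.Task] = set()
        self._response_ready = asyncio.Event()

    async def load_tools(self) -> None:
        # Like session.list_tools(): waits for a "response" that only the
        # receive loop can deliver.
        await self._response_ready.wait()

    def _schedule_reload(self, coro) -> None:
        # Fire-and-forget: run the reload off the receive loop, keeping a
        # strong reference so the task is not garbage-collected early.
        task = asyncio.create_task(coro)
        self._pending_reload_tasks.add(task)
        task.add_done_callback(self._pending_reload_tasks.discard)

    async def message_handler_blocking(self) -> None:
        await self.load_tools()  # old behavior: blocks the receive loop

    async def message_handler_fixed(self) -> None:
        self._schedule_reload(self.load_tools())  # returns immediately

async def main() -> list[str]:
    results = []
    tool = FakeTool()
    try:
        # The receive loop awaits the handler inline; it can never move on
        # to deliver the response, so the reload waits forever.
        await asyncio.wait_for(tool.message_handler_blocking(), timeout=0.1)
        results.append("blocking: ok")
    except asyncio.TimeoutError:
        results.append("blocking: deadlock (timed out)")

    # The fixed handler returns at once; the loop is free to deliver the response.
    await asyncio.wait_for(tool.message_handler_fixed(), timeout=0.1)
    tool._response_ready.set()  # receive loop delivers the response
    await asyncio.gather(*tool._pending_reload_tasks)
    results.append("fixed: handler returned, reload completed")
    return results

for line in asyncio.run(main()):
    print(line)
```

The blocking variant times out; the fixed variant returns immediately and the reload then completes once the (simulated) response arrives.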
Python Test Coverage Report •
Python Unit Test Overview
Pull request overview
Prevents a deadlock in the Python MCP tool integration by ensuring MCP notification handling doesn’t synchronously await reload operations on the SDK’s single-threaded receive loop.
Changes:
- Add background-task scheduling for tool/prompt reloads triggered by MCP `list_changed` notifications.
- Track pending reload tasks to keep them alive until completion.
- Update and extend MCP tests to cover the non-blocking/deadlock regression scenario and reload-failure behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `python/packages/core/agent_framework/_mcp.py` | Schedules reloads via background tasks instead of awaiting them in `message_handler`; tracks pending tasks. |
| `python/packages/core/tests/core/test_mcp.py` | Updates notification tests for async scheduling; adds regression tests for the deadlock scenario and reload-failure handling. |
giles17 left a comment
Automated Code Review
Reviewers: 4 | Confidence: 90%
✓ Correctness
The fix correctly addresses a real deadlock where `message_handler` (running on the MCP SDK's single-threaded receive loop) previously awaited `load_tools()`/`load_prompts()`, which themselves need the receive loop to process responses. The solution uses `asyncio.create_task()` to schedule reloads as background tasks, with proper strong-reference tracking via `_pending_reload_tasks` and a done-callback for cleanup. Exception handling in the background task prevents errors from propagating to the receive loop. Tests cover the non-blocking behavior, task lifecycle, and error logging. The implementation follows standard asyncio fire-and-forget patterns correctly.
✓ Security Reliability
The fix correctly addresses a real deadlock by scheduling reload coroutines as background tasks instead of awaiting them inside the MCP SDK's single-threaded receive loop. The implementation is sound: exceptions are caught and logged, tasks are prevented from GC via a strong-reference set, and the done-callback cleanup pattern is correct. The one reliability concern is that `close()`/`_close_on_owner()` does not cancel or await pending reload tasks before tearing down the session, which could cause reload tasks to run against a `None` session and produce confusing warnings or mask real errors during shutdown.
✓ Test Coverage
The test coverage for this fix is good. The PR adds three well-targeted tests: the existing notification test is updated to account for the async scheduling (with `asyncio.sleep(0)`), a new deadlock regression test uses an `asyncio.Event` to verify the handler returns immediately even when the reload blocks, and a failure-logging test confirms exceptions don't propagate. However, two minor gaps exist: the reload-failure test doesn't verify the task is cleaned up from `_pending_reload_tasks` after the error, and the `load_prompts` notification path lacks a dedicated non-blocking test (only `load_tools` is tested in the deadlock scenario).
✓ Design Approach
The fix correctly addresses a real deadlock: `message_handler` ran on the MCP SDK's single-threaded receive loop and previously awaited `load_tools()`/`load_prompts()`, which in turn called `session.list_tools()`, a request whose response could never arrive because the receive loop was blocked. Scheduling reloads as background tasks via `asyncio.create_task()` with a strong-reference set is the right approach. The tests are well-structured, especially the event-based non-blocking regression test. One design concern: if the server sends a burst of `notifications/tools/list_changed` notifications, multiple concurrent reload tasks are created, each capturing `existing_names` as a local snapshot of `self._functions` at task-start time. Two concurrent tasks can race to append the same newly-advertised tool (the `existing_names` dedup check is local, not atomic with the append), producing duplicate entries in `_functions`. Additionally, `coro: Any` is looser than necessary; `Coroutine[Any, Any, None]` would let type checkers catch misuse. Neither blocks the merge (the duplicate-tool race is an edge case unlikely to cause user-visible harm, and the type annotation is minor) but both are worth a follow-up.
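The snapshot race described here can be reproduced in isolation. In this sketch, `functions` and the `existing` snapshot are illustrative stand-ins for `self._functions` and `existing_names`:

```python
import asyncio

async def main() -> list[str]:
    functions = ["existing_tool"]

    async def reload(advertised: list[str]) -> None:
        # Each reload snapshots the known names at task-start time.
        existing = set(functions)
        await asyncio.sleep(0)  # yield: the other reload takes its own snapshot
        for name in advertised:
            if name not in existing:  # local check, not atomic with the append
                functions.append(name)

    # A burst of two list_changed notifications, both advertising "new_tool".
    await asyncio.gather(reload(["new_tool"]), reload(["new_tool"]))
    return functions

print(asyncio.run(main()))  # ['existing_tool', 'new_tool', 'new_tool'] (duplicate)
```

Both tasks take their snapshot before either appends, so each believes `new_tool` is missing and both append it.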
Suggestions
- In `test_mcp_tool_message_handler_reload_failure_is_logged`, assert that `_pending_reload_tasks` is empty after the failed task completes (e.g. `assert len(tool._pending_reload_tasks) == 0`) to verify the done-callback cleanup works on the error path, consistent with the pattern in the deadlock test.
- Tighten the `coro: Any` parameter type in `_schedule_reload` to `Coroutine[Any, Any, None]` (from `collections.abc`) so type checkers can flag accidental misuse, even though usage is internal-only.
- Consider adding the `prompts/list_changed` notification to `test_mcp_tool_message_handler_does_not_block_receive_loop` (or a sibling test) to ensure both notification paths are covered for non-blocking behavior, not just `tools/list_changed`.
Automated review by giles17's agents
…cleanup, tests

- Fix `exc_info=exc` -> `exc_info=True` in `_schedule_reload` and `message_handler`
- Tighten `_schedule_reload` param type from `Any` to `Coroutine[Any, Any, None]`
- Coalesce reloads: cancel-and-replace per reload kind to prevent unbounded growth
- Cancel pending reload tasks in `_close_on_owner` before tearing down session
- Re-raise `CancelledError` in `_safe_reload` to respect task cancellation
- Replace flaky `asyncio.sleep(0)` with `asyncio.wait_for`/`gather` in tests
- Add `caplog` assertions to verify reload failure is actually logged
- Assert `_pending_reload_tasks` cleanup on error path

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
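The cancel-and-replace coalescing named in the commit can be sketched roughly as follows. This is a minimal illustration, not the actual `_mcp.py` code; `ReloadScheduler` and its members are hypothetical stand-ins:

```python
import asyncio
from collections.abc import Coroutine
from typing import Any

class ReloadScheduler:
    """At most one pending reload per kind: a newer notification cancels
    and replaces an older, still-pending reload of the same kind."""

    def __init__(self) -> None:
        self._pending: dict[str, asyncio.Task] = {}

    def schedule(self, kind: str, coro: Coroutine[Any, Any, None]) -> asyncio.Task:
        existing = self._pending.get(kind)
        if existing is not None and not existing.done():
            existing.cancel()  # coalesce: drop the superseded reload
        task = asyncio.create_task(coro, name=f"reload:{kind}")
        self._pending[kind] = task

        def _cleanup(t: asyncio.Task, kind: str = kind) -> None:
            # Only drop the entry if it still refers to this task.
            if self._pending.get(kind) is t:
                del self._pending[kind]

        task.add_done_callback(_cleanup)
        return task

async def main() -> tuple[bool, bool]:
    gate = asyncio.Event()

    async def reload() -> None:
        await gate.wait()

    sched = ReloadScheduler()
    first = sched.schedule("tools", reload())
    second = sched.schedule("tools", reload())  # cancels `first`
    gate.set()
    await asyncio.gather(first, second, return_exceptions=True)
    return first.cancelled(), second.cancelled()

print(asyncio.run(main()))  # (True, False)
```

Only the most recent reload per kind survives a notification burst, which also bounds the size of the pending-task set.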
giles17 left a comment
Automated Code Review
Reviewers: 4 | Confidence: 91%
✗ Correctness
The core deadlock fix is correct and well-motivated: scheduling reloads as background tasks via `asyncio.create_task` instead of awaiting them inline in `message_handler` prevents the classic deadlock on the MCP SDK's single-threaded receive loop. The cancel-and-replace logic, strong-reference set, and done-callback cleanup are all sound. Tests are comprehensive. However, the `exc_info=message` → `exc_info=True` change on the error-logging line is a regression: because `message_handler` is NOT called from within an `except` block (the MCP SDK passes the exception as a parameter at lines 358 and 516 of `shared/session.py`), `sys.exc_info()` returns `(None, None, None)` and the exception type/traceback info is lost. The original `exc_info=message` correctly lets Python's logging extract `(type(message), message, message.__traceback__)` from the exception instance directly.
✗ Security Reliability
The core fix, scheduling MCP reloads as background tasks via `asyncio.create_task` instead of awaiting them inline in `message_handler`, is correct and addresses a real deadlock. However, the change from `exc_info=message` to `exc_info=True` on the error-logging line is a regression: `message_handler` is not inside an `except` block, so `sys.exc_info()` (used when `exc_info=True`) returns `(None, None, None)`, silently discarding the exception's traceback. The original `exc_info=message` correctly passes the exception instance, which Python's logging (3.5+) converts to a full traceback tuple. Additionally, `_close_on_owner` cancels pending reload tasks but does not await them, which can produce 'Task was destroyed but it is pending' warnings and allows cancelled tasks to race against session teardown.
✓ Test Coverage
The PR adds three new/updated tests covering the core deadlock fix: non-blocking behavior of `message_handler`, background task scheduling, and reload-failure logging. These tests are well-structured and meaningfully validate the primary fix. However, the new cancel-and-replace logic in `_schedule_reload` (which cancels a previous pending reload of the same kind when a new notification arrives) and the pending-task cleanup in `_close_on_owner` both lack dedicated test coverage. These are behavioral additions that could harbor bugs (e.g., incorrect task name comparison) without any test exercising them.
✗ Design Approach
The core fix, scheduling reloads as background tasks via `asyncio.create_task` to avoid deadlocking the MCP receive loop, is correct and well-motivated. The cancel-and-replace deduplication, the `_pending_reload_tasks` strong-reference set, the explicit `CancelledError` re-raise, and the teardown cleanup in `_close_on_owner` are all sound improvements. However, the diff introduces one clear regression: changing `exc_info=message` to `exc_info=True` in the exception-logging branch of `message_handler`. Since that code is not inside an `except` block, `exc_info=True` resolves to `sys.exc_info()`, which returns `(None, None, None)`, silently dropping the exception traceback. The original `exc_info=message` correctly passes the exception object directly to the logger, which Python 3.x logging accepts and uses to attach the traceback. This should be reverted.
Flagged Issues
- Changing `exc_info=message` to `exc_info=True` on the error-logging line (line 803) loses exception information. `message_handler` is called by the MCP SDK with the exception as a parameter, NOT from within an `except` block (see `mcp/shared/session.py` lines 358, 516), so `sys.exc_info()` returns `(None, None, None)` and no traceback is ever logged. The original `exc_info=message` was correct: Python 3.x logging accepts an exception instance and extracts `(type, value, __traceback__)` from it. Revert this line.
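The flagged behavior is easy to verify with plain `logging`; the helper below is illustrative:

```python
import io
import logging
import sys

def log_error(exc: Exception, use_true: bool) -> str:
    """Log an exception that arrives as a parameter (no surrounding
    except block, like message_handler) and return the handler output."""
    buf = io.StringIO()
    logger = logging.getLogger(f"demo.exc_info.{use_true}")
    logger.handlers = [logging.StreamHandler(buf)]
    logger.propagate = False
    # Outside any except block, sys.exc_info() has nothing to report.
    assert sys.exc_info() == (None, None, None)
    logger.error("Error from MCP server: %s", exc,
                 exc_info=True if use_true else exc)
    return buf.getvalue()

try:
    raise RuntimeError("boom")
except RuntimeError as err:
    captured = err

print("Traceback" in log_error(captured, use_true=True))   # False: exc_info=True finds nothing
print("Traceback" in log_error(captured, use_true=False))  # True: instance carries __traceback__
```

With `exc_info=True` the log line is followed only by `NoneType: None`; passing the exception instance yields the full traceback.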
Suggestions
- In `_close_on_owner`, await the cancelled reload tasks (e.g. `await asyncio.gather(*tasks, return_exceptions=True)`) before proceeding to session teardown. Without this, cancelled tasks may still be in-flight when `self.session` is set to `None`, producing 'Task was destroyed but it is pending' warnings and a subtle race condition.
- Add a test for the cancel-and-replace behavior: send two rapid `notifications/tools/list_changed` notifications with a blocking `load_tools` (via `asyncio.Event`), verify the first task is cancelled and only one task remains pending. This exercises the `existing.cancel()` path and the task-naming scheme in `_schedule_reload`.
- Add a test for `_close_on_owner` cleanup: create an `MCPTool` with a pending reload task, call `_close_on_owner`, and verify the task is cancelled and `_pending_reload_tasks` is empty. The existing test mocks the entire method, so the new cancellation code has no coverage.
- Ensure the `caplog`-based version of `test_mcp_tool_message_handler_reload_failure_is_logged` is what ships. The on-disk version without `caplog` only checks the mock was called and doesn't assert the warning was actually logged, so a regression that silently swallows errors would go undetected.
Automated review by giles17's agents
| """ | ||
| if isinstance(message, Exception): | ||
| logger.error("Error from MCP server: %s", message, exc_info=message) | ||
| logger.error("Error from MCP server: %s", message, exc_info=True) |
Bug: message_handler is called by the MCP SDK with the exception as a parameter—NOT from within an except block (see mcp/shared/session.py:359 and :516). With exc_info=True, sys.exc_info() returns (None, None, None), so the exception type and traceback are silently lost. The original exc_info=message correctly passes the exception instance, letting Python's logging extract the full traceback directly.
Suggested change:

```diff
-logger.error("Error from MCP server: %s", message, exc_info=True)
+logger.error("Error from MCP server: %s", message, exc_info=message)
```
```python
    async def _close_on_owner(self) -> None:
        # Cancel any pending reload tasks before tearing down the session.
        for task in list(self._pending_reload_tasks):
            task.cancel()
        self._pending_reload_tasks.clear()
```
Reliability concern: task.cancel() only requests cancellation. Without awaiting the tasks, they may still be in-flight when _safe_close_exit_stack() nullifies self.session, and Python may emit 'Task was destroyed but it is pending' warnings. Consider gathering the cancelled tasks before proceeding.
Suggested change:

```diff
 async def _close_on_owner(self) -> None:
     # Cancel any pending reload tasks before tearing down the session.
-    for task in list(self._pending_reload_tasks):
+    tasks = list(self._pending_reload_tasks)
+    for task in tasks:
         task.cancel()
     self._pending_reload_tasks.clear()
+    if tasks:
+        await asyncio.gather(*tasks, return_exceptions=True)
```
```python
    assert reload_warnings[0].levelname == "WARNING"
    assert reload_warnings[0].exc_info is not None
```
Consider adding a companion test for the cancel-and-replace logic: send two notifications/tools/list_changed in quick succession with a blocking load_tools (via asyncio.Event), and assert that the first task gets cancelled and only one task remains in _pending_reload_tasks. This is the only way to verify the existing.cancel() path and the task-naming scheme work correctly.
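A sketch of such a test against a minimal stand-in (`FakeMCPTool` and its members are hypothetical; the real test would exercise `MCPTool` itself):

```python
import asyncio

class FakeMCPTool:
    """Hypothetical stand-in modeling cancel-and-replace scheduling."""

    def __init__(self) -> None:
        self._pending_reload_tasks: set[asyncio.Task] = set()
        self.release_reload = asyncio.Event()

    async def load_tools(self) -> None:
        await self.release_reload.wait()  # block until the test releases it

    def _schedule_reload(self, kind: str) -> None:
        # Cancel-and-replace: at most one pending reload per kind.
        for task in list(self._pending_reload_tasks):
            if task.get_name() == f"reload:{kind}" and not task.done():
                task.cancel()
        task = asyncio.create_task(self.load_tools(), name=f"reload:{kind}")
        self._pending_reload_tasks.add(task)
        task.add_done_callback(self._pending_reload_tasks.discard)

    async def message_handler(self, notification: str) -> None:
        if notification == "notifications/tools/list_changed":
            self._schedule_reload("tools")

async def test_cancel_and_replace() -> None:
    tool = FakeMCPTool()
    # Two rapid list_changed notifications while load_tools blocks.
    await tool.message_handler("notifications/tools/list_changed")
    await tool.message_handler("notifications/tools/list_changed")
    await asyncio.sleep(0)  # let the cancellation take effect
    pending = [t for t in tool._pending_reload_tasks if not t.done()]
    assert len(pending) == 1  # first task cancelled, one reload remains
    tool.release_reload.set()
    await asyncio.gather(*pending)
    assert len(tool._pending_reload_tasks) == 0  # done-callback cleaned up

asyncio.run(test_cancel_and_replace())
print("cancel-and-replace test passed")
```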
```python
    assert len(tool._pending_reload_tasks) == 0


async def test_mcp_tool_message_handler_reload_failure_is_logged(caplog: pytest.LogCaptureFixture):
```
This test verifies the exception doesn't propagate, but (in its current on-disk form without caplog) doesn't assert the warning was actually logged. The diff version with caplog and log-record assertions is a meaningful improvement—make sure that version is what ships. Without the log assertion, a regression that silently swallows errors without logging would go undetected.
Problem

When an MCP server (e.g. MATLAB's `matlab-mcp-core-server`) sends a `notifications/tools/list_changed` notification during or after tool execution, `MCPTool.message_handler()` previously awaited `load_tools()`/`load_prompts()` directly. Since the handler runs on the MCP SDK's single-threaded `_receive_loop`, this caused a deadlock:

1. `message_handler` awaits `load_tools()`
2. `load_tools()` calls `session.list_tools()`, which sends a request and waits for a response
3. The receive loop cannot deliver that response while it is blocked inside the handler

This manifested as a timeout in
`call_tool()`, which then surfaced as `"Error: Function failed."` to the model instead of the real tool output.

Fix
Schedule reloads as background `asyncio.Task`s via a new `_schedule_reload()` helper method, freeing the receive loop to continue processing responses immediately. Strong references are kept in `_pending_reload_tasks` to prevent garbage collection before completion, and errors are logged rather than propagated into the receive loop.

Changes
- `python/packages/core/agent_framework/_mcp.py`: Add `_pending_reload_tasks` set and `_schedule_reload()` helper; change `message_handler` to use it instead of a direct `await`
- `python/packages/core/tests/core/test_mcp.py`: Update existing notification test; add regression test for the deadlock scenario; add test for reload-failure logging

Testing
All 89 existing MCP tests pass (2 skipped). New tests verify:

- `message_handler` returns immediately even when `load_tools` blocks (deadlock regression)

Fixes #4828