
Python: fix(python): prevent MCP message_handler deadlock on notification reload#4866

Draft
giles17 wants to merge 2 commits into microsoft:main from giles17:fix/mcp-message-handler-deadlock-4828

Conversation


@giles17 giles17 commented Mar 23, 2026

Problem

When an MCP server (e.g. MATLAB's matlab-mcp-core-server) sends a notifications/tools/list_changed notification during or after tool execution, MCPTool.message_handler() previously awaited load_tools()/load_prompts() directly. Since the handler runs on the MCP SDK's single-threaded _receive_loop, this caused a deadlock:

  1. message_handler awaits load_tools()
  2. load_tools() calls session.list_tools(), which sends a request and waits for a response
  3. The response can never arrive because the receive loop is blocked inside the handler
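The three steps can be reduced to a self-contained toy (no MCP SDK involved; the names are illustrative, and a wait_for timeout stands in for the call_tool() timeout so the snippet terminates instead of hanging):

```python
import asyncio

async def demo_deadlock() -> bool:
    """Toy reproduction of the deadlock: the receive loop awaits a call
    that itself waits on the receive loop."""
    response_ready = asyncio.Event()

    async def list_tools() -> None:
        # Sends a request, then waits for the receive loop to deliver
        # the response -- but the receive loop is what called us.
        await response_ready.wait()

    async def receive_loop() -> None:
        # Simplified receive loop: handles one notification inline.
        await list_tools()    # blocked here forever...
        response_ready.set()  # ...so the "response" is never delivered

    try:
        await asyncio.wait_for(receive_loop(), timeout=0.1)
        return False
    except asyncio.TimeoutError:
        return True  # deadlocked, as expected
```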

This manifested as a timeout in call_tool(), which then surfaced as "Error: Function failed." to the model instead of the real tool output.

Fix

Schedule reloads as background asyncio.Tasks via a new _schedule_reload() helper method, freeing the receive loop to continue processing responses immediately. Strong references are kept in _pending_reload_tasks to prevent garbage collection before completion, and errors are logged rather than propagated into the receive loop.
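A minimal sketch of that pattern, reusing the names from this description (_schedule_reload, _pending_reload_tasks) but otherwise illustrative rather than the actual diff:

```python
import asyncio
import logging
from collections.abc import Coroutine
from typing import Any

logger = logging.getLogger(__name__)

class MCPToolSketch:
    """Illustrative only: schedules reloads off the receive loop."""

    def __init__(self) -> None:
        self._pending_reload_tasks: set[asyncio.Task[None]] = set()

    def _schedule_reload(self, coro: Coroutine[Any, Any, None]) -> None:
        # create_task returns immediately, so the caller (the MCP
        # receive loop) is free to keep delivering responses.
        task = asyncio.create_task(self._safe_reload(coro))
        # Keep a strong reference so the task isn't garbage-collected
        # before it finishes; drop it in a done-callback.
        self._pending_reload_tasks.add(task)
        task.add_done_callback(self._pending_reload_tasks.discard)

    async def _safe_reload(self, coro: Coroutine[Any, Any, None]) -> None:
        try:
            await coro
        except asyncio.CancelledError:
            raise
        except Exception:
            # Log instead of letting the error escape into the receive loop.
            logger.warning("Background reload failed", exc_info=True)
```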

Changes

  • python/packages/core/agent_framework/_mcp.py: Add _pending_reload_tasks set, _schedule_reload() helper; change message_handler to use it instead of direct await
  • python/packages/core/tests/core/test_mcp.py: Update existing notification test; add regression test for deadlock scenario; add test for reload failure logging

Testing

All 89 existing MCP tests pass (2 skipped). New tests verify:

  • message_handler returns immediately even when load_tools blocks (deadlock regression)
  • Background reload errors are logged, not raised

Fixes #4828

When an MCP server sends a notifications/tools/list_changed or
notifications/prompts/list_changed notification, the message_handler
previously awaited load_tools()/load_prompts() directly. Since the
handler runs on the MCP SDK's single-threaded receive loop, this
caused a deadlock: load_tools() sends a list_tools request and waits
for its response, but the receive loop cannot deliver that response
while blocked in the handler.

This manifested as a timeout in call_tool(), which then surfaced as
"Error: Function failed." to the model instead of the real tool
output. The MATLAB MCP server reliably triggers this because it sends
a tools/list_changed notification during tool execution.

Fix: schedule reloads as background asyncio.Tasks via a new
_schedule_reload() helper, freeing the receive loop immediately.

Fixes microsoft#4828

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 23, 2026 19:24
@github-actions github-actions bot changed the title fix(python): prevent MCP message_handler deadlock on notification reload Python: fix(python): prevent MCP message_handler deadlock on notification reload Mar 23, 2026

markwallace-microsoft commented Mar 23, 2026

Python Test Coverage

Python Test Coverage Report

File                             Stmts   Miss  Cover  Missing
packages/core/agent_framework
   _mcp.py                         546     76    86%  96–97, 107–112, 123, 128, 171, 180, 192–193, 244, 253, 316, 324, 383, 497, 532, 545, 547–550, 569–570, 583–586, 588–589, 593, 650, 685, 687, 691–692, 694–695, 749, 764, 782, 834, 840, 861, 971, 998, 1011–1016, 1040, 1099–1100, 1106–1108, 1127, 1152–1153, 1157–1161, 1178–1182, 1329
TOTAL                           27318   3224    88%

Python Unit Test Overview

Tests   Skipped   Failures   Errors   Time
5356    20        0          0        1m 24s

@giles17 giles17 marked this pull request as draft March 23, 2026 19:27

Copilot AI left a comment


Pull request overview

Prevents a deadlock in the Python MCP tool integration by ensuring MCP notification handling doesn’t synchronously await reload operations on the SDK’s single-threaded receive loop.

Changes:

  • Add background-task scheduling for tool/prompt reloads triggered by MCP list_changed notifications.
  • Track pending reload tasks to keep them alive until completion.
  • Update and extend MCP tests to cover the non-blocking/deadlock regression scenario and reload-failure behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • python/packages/core/agent_framework/_mcp.py: Schedules reloads via background tasks instead of awaiting them in message_handler; tracks pending tasks.
  • python/packages/core/tests/core/test_mcp.py: Updates notification tests for async scheduling; adds regression tests for deadlock scenario and reload failure handling.


@giles17 giles17 left a comment


Automated Code Review

Reviewers: 4 | Confidence: 90%

✓ Correctness

The fix correctly addresses a real deadlock where message_handler (running on the MCP SDK's single-threaded receive loop) previously awaited load_tools()/load_prompts() which themselves need the receive loop to process responses. The solution uses asyncio.create_task() to schedule reloads as background tasks, with proper strong-reference tracking via _pending_reload_tasks and a done-callback for cleanup. Exception handling in the background task prevents errors from propagating to the receive loop. Tests cover the non-blocking behavior, task lifecycle, and error logging. The implementation follows standard asyncio fire-and-forget patterns correctly.

✓ Security Reliability

The fix correctly addresses a real deadlock by scheduling reload coroutines as background tasks instead of awaiting them inside the MCP SDK's single-threaded receive loop. The implementation is sound: exceptions are caught and logged, tasks are prevented from GC via a strong-reference set, and the done-callback cleanup pattern is correct. The one reliability concern is that close() / _close_on_owner() does not cancel or await pending reload tasks before tearing down the session, which could cause reload tasks to run against a None session and produce confusing warnings or mask real errors during shutdown.

✓ Test Coverage

The test coverage for this fix is good. The PR adds three well-targeted tests: the existing notification test is updated to account for the async scheduling (with asyncio.sleep(0)), a new deadlock regression test uses an asyncio.Event to verify the handler returns immediately even when the reload blocks, and a failure-logging test confirms exceptions don't propagate. However, two minor gaps exist: the reload-failure test doesn't verify the task is cleaned up from _pending_reload_tasks after the error, and the load_prompts notification path lacks a dedicated non-blocking test (only load_tools is tested in the deadlock scenario).

✓ Design Approach

The fix correctly addresses a real deadlock: message_handler ran on the MCP SDK's single-threaded receive loop and previously awaited load_tools()/load_prompts(), which in turn called session.list_tools() — a request whose response could never arrive because the receive loop was blocked. Scheduling reloads as background tasks via asyncio.create_task() with a strong-reference set is the right approach. The tests are well-structured, especially the event-based non-blocking regression test. One design concern: if the server sends a burst of notifications/tools/list_changed notifications, multiple concurrent reload tasks are each created, all capturing existing_names as a local snapshot of self._functions at task-start time. Two concurrent tasks can race to append the same newly-advertised tool (the existing_names dedup check is local, not atomic with the append), producing duplicate entries in _functions. Additionally, coro: Any is looser than necessary; Coroutine[Any, Any, None] would let type checkers catch misuse. Neither blocks the merge — the duplicate-tool race is an edge case unlikely to cause user-visible harm, and the type annotation is minor — but both are worth a follow-up.

Suggestions

  • In test_mcp_tool_message_handler_reload_failure_is_logged, assert that _pending_reload_tasks is empty after the failed task completes (e.g. assert len(tool._pending_reload_tasks) == 0) to verify the done-callback cleanup works on the error path, consistent with the pattern in the deadlock test.
  • Tighten the coro: Any parameter type in _schedule_reload to Coroutine[Any, Any, None] (from collections.abc) so type checkers can flag accidental misuse, even though usage is internal-only.
  • Consider adding the prompts/list_changed notification to test_mcp_tool_message_handler_does_not_block_receive_loop (or a sibling test) to ensure both notification paths are covered for non-blocking behavior, not just tools/list_changed.
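The event-based non-blocking check suggested above can be sketched as a standalone toy (all names here are hypothetical, not the actual test_mcp.py code):

```python
import asyncio

async def demo_non_blocking_handler() -> bool:
    release = asyncio.Event()
    pending: set[asyncio.Task[None]] = set()

    async def blocking_load_tools() -> None:
        # Stands in for load_tools() waiting on a response that only
        # the receive loop can deliver.
        await release.wait()

    async def message_handler() -> None:
        # Fire-and-forget, as in the fix: the reload is scheduled,
        # not awaited.
        task = asyncio.create_task(blocking_load_tools())
        pending.add(task)
        task.add_done_callback(pending.discard)

    # The handler must return promptly even though the reload is blocked.
    await asyncio.wait_for(message_handler(), timeout=1.0)
    handler_returned_while_blocked = len(pending) == 1

    release.set()
    await asyncio.gather(*pending)
    return handler_returned_while_blocked
```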

Automated review by giles17's agents

…cleanup, tests

- Fix exc_info=exc -> exc_info=True in _schedule_reload and message_handler
- Tighten _schedule_reload param type from Any to Coroutine[Any, Any, None]
- Coalesce reloads: cancel-and-replace per reload kind to prevent unbounded growth
- Cancel pending reload tasks in _close_on_owner before tearing down session
- Re-raise CancelledError in _safe_reload to respect task cancellation
- Replace flaky asyncio.sleep(0) with asyncio.wait_for/gather in tests
- Add caplog assertions to verify reload failure is actually logged
- Assert _pending_reload_tasks cleanup on error path
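The coalescing bullet above (cancel-and-replace per reload kind) can be illustrated with a hypothetical helper, not the PR's actual code: cancel any in-flight reload of the same kind before scheduling the replacement, so a notification burst leaves at most one pending task per kind.

```python
import asyncio
from collections.abc import Coroutine
from typing import Any

class ReloadCoalescer:
    """Illustrative cancel-and-replace: at most one pending reload per kind."""

    def __init__(self) -> None:
        self._pending: dict[str, asyncio.Task[None]] = {}

    def schedule(self, kind: str, coro: Coroutine[Any, Any, None]) -> None:
        existing = self._pending.get(kind)
        if existing is not None and not existing.done():
            # A newer notification supersedes the in-flight reload.
            existing.cancel()
        task = asyncio.create_task(coro, name=f"reload-{kind}")

        def _cleanup(done: asyncio.Task[None]) -> None:
            # Only drop the entry if it still points at this task; a
            # superseded task must not remove its replacement.
            if self._pending.get(kind) is done:
                del self._pending[kind]

        task.add_done_callback(_cleanup)
        self._pending[kind] = task
```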

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@giles17 giles17 left a comment


Automated Code Review

Reviewers: 4 | Confidence: 91%

✗ Correctness

The core deadlock fix is correct and well-motivated: scheduling reloads as background tasks via asyncio.create_task instead of awaiting them inline in message_handler prevents the classic deadlock on the MCP SDK's single-threaded receive loop. The cancel-and-replace logic, strong-reference set, and done-callback cleanup are all sound. Tests are comprehensive. However, the change from exc_info=message to exc_info=True on the error-logging line is a regression: because message_handler is NOT called from within an except block (the MCP SDK passes the exception as a parameter at lines 358 and 516 of shared/session.py), sys.exc_info() returns (None, None, None) and the exception type/traceback info is lost. The original exc_info=message correctly lets Python's logging extract (type(message), message, message.__traceback__) from the exception instance directly.

✗ Security Reliability

The core fix — scheduling MCP reloads as background tasks via asyncio.create_task instead of awaiting them inline in message_handler — is correct and addresses a real deadlock. However, the change from exc_info=message to exc_info=True on the error-logging line is a regression: message_handler is not inside an except block, so sys.exc_info() (used when exc_info=True) returns (None, None, None), silently discarding the exception's traceback. The original exc_info=message correctly passes the exception instance, which Python's logging (3.5+) converts to a full traceback tuple. Additionally, _close_on_owner cancels pending reload tasks but does not await them, which can produce 'Task was destroyed but it is pending' warnings and allows cancelled tasks to race against session teardown.

✓ Test Coverage

The PR adds three new/updated tests covering the core deadlock fix: non-blocking behavior of message_handler, background task scheduling, and reload-failure logging. These tests are well-structured and meaningfully validate the primary fix. However, the new cancel-and-replace logic in _schedule_reload (which cancels a previous pending reload of the same kind when a new notification arrives) and the pending-task cleanup in _close_on_owner both lack dedicated test coverage. These are behavioral additions that could harbor bugs (e.g., incorrect task name comparison) without any test exercising them.

✗ Design Approach

The core fix — scheduling reloads as background tasks via asyncio.create_task to avoid deadlocking the MCP receive loop — is correct and well-motivated. The cancel-and-replace deduplication, the _pending_reload_tasks strong-reference set, the explicit CancelledError re-raise, and the teardown cleanup in _close_on_owner are all sound improvements. However, the diff introduces one clear regression: changing exc_info=message to exc_info=True in the exception-logging branch of message_handler. Since that code is not inside an except block, exc_info=True resolves to sys.exc_info() which returns (None, None, None) — silently dropping the exception traceback. The original exc_info=message correctly passes the exception object directly to the logger, which Python 3.x logging accepts and uses to attach the traceback. This should be reverted.

Flagged Issues

  • Changing exc_info=message to exc_info=True on the error-logging line (line 803) loses exception information. message_handler is called by the MCP SDK with the exception as a parameter—NOT from within an except block (see mcp/shared/session.py lines 358, 516)—so sys.exc_info() returns (None, None, None) and no traceback is ever logged. The original exc_info=message was correct: Python 3.x logging accepts an exception instance and extracts (type, value, __traceback__) from it. Revert this line.
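The flagged behavior is easy to demonstrate outside the SDK. This standalone snippet (hypothetical logger name) shows that passing the exception instance attaches the exception info, while exc_info=True outside an except block formats as "NoneType: None":

```python
import io
import logging

def demo_exc_info() -> str:
    logger = logging.getLogger("exc_info_demo")
    logger.propagate = False
    stream = io.StringIO()
    logger.addHandler(logging.StreamHandler(stream))

    # The exception arrives as a parameter; nothing is being raised
    # here, so sys.exc_info() is empty.
    message = ValueError("boom")

    # Passing the instance: logging builds
    # (type(message), message, message.__traceback__) itself.
    logger.error("Error from MCP server: %s", message, exc_info=message)

    # exc_info=True falls back to sys.exc_info() -> (None, None, None),
    # which formats as "NoneType: None" instead of a traceback.
    logger.error("Error from MCP server: %s", message, exc_info=True)

    return stream.getvalue()
```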

Suggestions

  • In _close_on_owner, await the cancelled reload tasks (e.g. await asyncio.gather(*tasks, return_exceptions=True)) before proceeding to session teardown. Without this, cancelled tasks may still be in-flight when self.session is set to None, producing 'Task was destroyed but it is pending' warnings and a subtle race condition.
  • Add a test for the cancel-and-replace behavior: send two rapid notifications/tools/list_changed notifications with a blocking load_tools (via asyncio.Event), verify the first task is cancelled and only one task remains pending. This exercises the existing.cancel() path and the task-naming scheme in _schedule_reload.
  • Add a test for _close_on_owner cleanup: create an MCPTool with a pending reload task, call _close_on_owner, and verify the task is cancelled and _pending_reload_tasks is empty. The existing test mocks the entire method, so the new cancellation code has no coverage.
  • Ensure the caplog-based version of test_mcp_tool_message_handler_reload_failure_is_logged is what ships. The on-disk version without caplog only checks the mock was called and doesn't assert the warning was actually logged, so a regression that silently swallows errors would go undetected.

Automated review by giles17's agents

"""
if isinstance(message, Exception):
logger.error("Error from MCP server: %s", message, exc_info=message)
logger.error("Error from MCP server: %s", message, exc_info=True)

Bug: message_handler is called by the MCP SDK with the exception as a parameter—NOT from within an except block (see mcp/shared/session.py:359 and :516). With exc_info=True, sys.exc_info() returns (None, None, None), so the exception type and traceback are silently lost. The original exc_info=message correctly passes the exception instance, letting Python's logging extract the full traceback directly.

Suggested change
-        logger.error("Error from MCP server: %s", message, exc_info=True)
+        logger.error("Error from MCP server: %s", message, exc_info=message)

Comment on lines 968 to +972
async def _close_on_owner(self) -> None:
    # Cancel any pending reload tasks before tearing down the session.
    for task in list(self._pending_reload_tasks):
        task.cancel()
    self._pending_reload_tasks.clear()

Reliability concern: task.cancel() only requests cancellation. Without awaiting the tasks, they may still be in-flight when _safe_close_exit_stack() nullifies self.session, and Python may emit 'Task was destroyed but it is pending' warnings. Consider gathering the cancelled tasks before proceeding.

Suggested change
 async def _close_on_owner(self) -> None:
     # Cancel any pending reload tasks before tearing down the session.
-    for task in list(self._pending_reload_tasks):
-        task.cancel()
-    self._pending_reload_tasks.clear()
+    tasks = list(self._pending_reload_tasks)
+    for task in tasks:
+        task.cancel()
+    self._pending_reload_tasks.clear()
+    if tasks:
+        await asyncio.gather(*tasks, return_exceptions=True)

assert reload_warnings[0].levelname == "WARNING"
assert reload_warnings[0].exc_info is not None



Consider adding a companion test for the cancel-and-replace logic: send two notifications/tools/list_changed in quick succession with a blocking load_tools (via asyncio.Event), and assert that the first task gets cancelled and only one task remains in _pending_reload_tasks. This is the only way to verify the existing.cancel() path and the task-naming scheme work correctly.

assert len(tool._pending_reload_tasks) == 0


async def test_mcp_tool_message_handler_reload_failure_is_logged(caplog: pytest.LogCaptureFixture):

This test verifies the exception doesn't propagate, but (in its current on-disk form without caplog) doesn't assert the warning was actually logged. The diff version with caplog and log-record assertions is a meaningful improvement—make sure that version is what ships. Without the log assertion, a regression that silently swallows errors without logging would go undetected.



Successfully merging this pull request may close these issues.

Python: [Bug]: MCPStdioTool drops successful MATLAB MCP results and sends "Error: Function failed." instead

3 participants