| 1 | +# Interaction-model test suite |
| 2 | + |
| 3 | +This suite enumerates the MCP interaction model as end-to-end tests: one test per piece of |
| 4 | +functionality, asserting the full client↔server round trip through the public API. It exists to |
| 5 | +pin the SDK's observable behaviour — every request type, every notification direction, every |
| 6 | +error plane — so that internal rewrites of the send/receive path can be proven equivalent by |
| 7 | +running the suite before and after. |
| 8 | + |
| 9 | +```bash |
| 10 | +uv run --frozen pytest tests/interaction/ |
| 11 | +``` |
| 12 | + |
| 13 | +The whole suite is in-memory and event-driven; it runs in about a second. |
| 14 | + |
| 15 | +## Ground rules |
| 16 | + |
| 17 | +- **Public API only.** Tests drive a `Client` connected to a `Server` or `MCPServer`. Nothing |
| 18 | + reaches into session internals, so the suite keeps working when those internals change. |
| 19 | + `ClientSession` is used directly only for behaviours `Client` cannot express (skipping |
| 20 | + initialization, requesting a non-default protocol version). |
| 21 | +- **Pin current behaviour.** Every test passes against the current `main`, including behaviours |
| 22 | + that diverge from the specification. A failing or xfailed test proves nothing about whether a |
| 23 | + rewrite preserved behaviour; a passing test that pins the exact output, even a wrong one, does. Known |
| 24 | + divergences are recorded as data on the requirement (see below), not worked around in the test. |
| 25 | +- **Spec-mandated assertions, not implementation quirks.** Error *codes* are asserted against |
| 26 | + the constants in `mcp.types`; error *message strings* are pinned only where they are the |
| 27 | + SDK's own deliberate output. |
| 28 | +- **No sleeps, no real I/O.** Concurrency is coordinated with `anyio.Event`; every wait that |
| 29 | + could hang is bounded by `anyio.fail_after(5)`. The streamable HTTP tests drive the Starlette |
| 30 | + app in-process through `httpx.ASGITransport` — no sockets, threads, or subprocesses anywhere. |
| 31 | + |
| 32 | +## Layout |
| 33 | + |
| 34 | +```text |
| 35 | +tests/interaction/ |
| 36 | + _requirements.py the requirements manifest (see below) |
| 37 | + _helpers.py shared type aliases + the wire-recording transport |
| 38 | + test_coverage.py enforces the manifest ↔ test contract |
| 39 | + lowlevel/ one file per feature area, against the low-level Server |
| 40 | + mcpserver/ the same feature areas in MCPServer's natural idiom |
| 41 | + transports/ a smoke subset over the streamable HTTP framing |
| 42 | +``` |
| 43 | + |
| 44 | +The two server APIs produce genuinely different wire output for the same conceptual feature |
| 45 | +(`MCPServer` generates schemas, converts exceptions to `isError` results, attaches structured |
| 46 | +content), so they get parallel directories with mirrored file names rather than one parametrized |
| 47 | +test body — each directory pins its flavour's true output exactly. |
| 48 | + |
| 49 | +## The requirements manifest |
| 50 | + |
| 51 | +`_requirements.py` maps every behaviour the suite covers to the reason it must hold: |
| 52 | + |
| 53 | +```python |
| 54 | +"tools:call:content:text": Requirement( |
| 55 | + source=f"{SPEC_BASE_URL}/server/tools#text-content", |
| 56 | + behavior="tools/call delivers arguments to the tool handler and returns its text content.", |
| 57 | +), |
| 58 | +``` |
| 59 | + |
| 60 | +- **`source`** is a deep link into the MCP specification for externally mandated behaviour, |
| 61 | + the literal string `"sdk"` for behaviour the SDK chose where the spec is silent, or |
| 62 | + `"issue:#n"` for a regression lock. |
| 63 | +- **`behavior`** describes what the suite *asserts* — which is always the SDK's current |
| 64 | + behaviour, never an aspiration. |
| 65 | +- **`divergence`** records the gap when current behaviour differs from what `source` mandates, |
| 66 | + with an issue link once one exists. The test still pins current behaviour. |
| 67 | +- **`deferred`** marks a behaviour that is deliberately not covered, with the reason. |
| 68 | + |
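The four fields above could be modelled roughly as follows. This is a minimal sketch, not the contents of `_requirements.py`: the dataclass layout, the optional-field defaults, the `REQUIREMENTS` dict name, and the `SPEC_BASE_URL` value are all assumptions made for illustration. Only the field names and the `Divergence(note=..., issue=...)` shape come from this document.

```python
from __future__ import annotations

from dataclasses import dataclass

# Placeholder value (assumption); the real constant is defined in _requirements.py.
SPEC_BASE_URL = "https://example.invalid/specification"


@dataclass(frozen=True)
class Divergence:
    """The gap between current behaviour and what `source` mandates."""

    note: str
    issue: str | None = None  # link added once an issue exists


@dataclass(frozen=True)
class Requirement:
    source: str  # spec deep link, the literal "sdk", or "issue:#n"
    behavior: str  # what the suite asserts today, never an aspiration
    divergence: Divergence | None = None  # assumed optional
    deferred: str | None = None  # reason the behaviour is deliberately uncovered


# Hypothetical container name; the entry itself is the one quoted above.
REQUIREMENTS: dict[str, Requirement] = {
    "tools:call:content:text": Requirement(
        source=f"{SPEC_BASE_URL}/server/tools#text-content",
        behavior="tools/call delivers arguments to the tool handler and returns its text content.",
    ),
}
```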
| 69 | +Tests link themselves to the manifest with a decorator: |
| 70 | + |
| 71 | +```python |
| 72 | +@requirement("tools:call:content:text") |
| 73 | +async def test_call_tool_returns_text_content() -> None: ... |
| 74 | +``` |
| 75 | + |
| 76 | +`test_coverage.py` enforces the contract in both directions: every non-deferred requirement must |
| 77 | +be exercised by at least one test, every deferred requirement by none, and an unknown ID fails at |
| 78 | +import time. A behaviour without a manifest entry cannot be silently half-tested, and a manifest |
| 79 | +entry without a test cannot be silently aspirational. |
| 80 | + |
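One way such a two-way contract can be enforced is sketched below. It is an illustration under stated assumptions, not the code in `test_coverage.py`: `MANIFEST`, `EXERCISED`, `coverage_problems`, and the `example:deferred` ID are hypothetical names, and the manifest is reduced to "ID → deferred reason" to keep the block short.

```python
from __future__ import annotations

from collections.abc import Callable
from typing import TypeVar

F = TypeVar("F", bound=Callable[..., object])

# Hypothetical stand-in for the manifest: ID -> deferred reason (None means "must be tested").
MANIFEST: dict[str, str | None] = {
    "tools:call:content:text": None,
    "example:deferred": "illustrative deferred entry",
}

# Filled in as decorated tests are imported: ID -> names of the tests exercising it.
EXERCISED: dict[str, list[str]] = {}


def requirement(requirement_id: str) -> Callable[[F], F]:
    # Decorators run when the test module is imported, so an unknown ID fails at import time.
    if requirement_id not in MANIFEST:
        raise KeyError(f"unknown requirement ID: {requirement_id}")

    def decorate(test: F) -> F:
        EXERCISED.setdefault(requirement_id, []).append(test.__name__)
        return test

    return decorate


def coverage_problems() -> list[str]:
    """Report violations in both directions: untested requirements and tested deferrals."""
    problems: list[str] = []
    for requirement_id, deferred in MANIFEST.items():
        tested = requirement_id in EXERCISED
        if deferred is None and not tested:
            problems.append(f"{requirement_id}: no test exercises it")
        if deferred is not None and tested:
            problems.append(f"{requirement_id}: deferred but tested")
    return problems


@requirement("tools:call:content:text")
async def test_call_tool_returns_text_content() -> None: ...
```

With this shape, a coverage test only needs to assert that `coverage_problems()` returns an empty list once every test module has been imported.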
| 81 | +### The divergence lifecycle |
| 82 | + |
| 83 | +1. A test reveals that the SDK does not do what the spec says. The test pins what the SDK |
| 84 | + *actually does* and a `Divergence(note=..., issue=...)` goes on the requirement. |
| 85 | +2. When the behaviour is eventually fixed, the pinned test fails. Whoever makes the change finds |
| 86 | + the divergence note explaining that the old behaviour was a known gap, re-pins the test to the |
| 87 | + spec-correct output, and deletes the `Divergence`. |
| 88 | +3. An empty divergence list means the SDK is spec-conformant on every behaviour the suite covers. |
| 89 | + |
| 90 | +This is also the triage key for any rewrite: a test that fails on the new code path either has a |
| 91 | +divergence note (the rewrite accidentally fixed a known gap — decide whether to keep the fix) or |
| 92 | +it does not (the rewrite broke something that was correct — fix the rewrite). |
| 93 | + |
| 94 | +### When a new spec revision is released |
| 95 | + |
| 96 | +1. Update `SPEC_REVISION` and walk the new revision's changelog. |
| 97 | +2. For each changed interaction, find its requirements (the IDs use the wire method strings the |
| 98 | + changelog speaks in), re-audit the tests against the new text, and update `source` links and |
| 99 | + assertions where behaviour legitimately changed. |
| 100 | +3. New interactions get new requirements and new tests; removed interactions get their |
| 101 | + requirements deleted along with their tests. |
| 102 | +4. A behaviour that is correct under both revisions needs no change beyond the `source` link. |
| 103 | + |
| 104 | +## Writing a test |
| 105 | + |
| 106 | +The shortest complete example of the conventions: |
| 107 | + |
| 108 | +```python |
| 109 | +@requirement("tools:call:content:text") |
| 110 | +async def test_call_tool_returns_text_content() -> None: |
| 111 | + """Arguments reach the tool handler; its content comes back as the call result.""" |
| 112 | + |
| 113 | + async def call_tool(ctx: ServerRequestContext, params: types.CallToolRequestParams) -> CallToolResult: |
| 114 | + assert params.name == "add" |
| 115 | + assert params.arguments is not None |
| 116 | + return CallToolResult(content=[TextContent(text=str(params.arguments["a"] + params.arguments["b"]))]) |
| 117 | + |
| 118 | + server = Server("adder", on_call_tool=call_tool) |
| 119 | + |
| 120 | + async with Client(server) as client: |
| 121 | + result = await client.call_tool("add", {"a": 2, "b": 3}) |
| 122 | + |
| 123 | + assert result == snapshot(CallToolResult(content=[TextContent(text="5")])) |
| 124 | +``` |
| 125 | + |
| 126 | +- **The server is defined inside the test** (or in a small fixture at the top of the file when |
| 127 | + several tests genuinely share it). The whole observable behaviour fits on one screen. |
| 128 | +- **Test names are behaviour sentences** — they state the observable outcome, not the feature |
| 129 | + being poked. Docstrings add the one or two sentences of context a reviewer needs, including |
| 130 | + whether the assertion is spec-mandated, SDK-defined, or a known divergence. |
| 131 | +- **Handlers assert their dispatch identity first** (`assert params.name == "add"`), proving the |
| 132 | + request that arrived is the request the test sent. |
| 133 | +- **The result proves the round trip.** Server-side observations travel back to the test through |
| 134 | + the protocol itself (a tool returns what it saw) or through a closure-captured list; the test |
| 135 | + asserts after the call returns. |
| 136 | +- **Order within a test**: server handlers → server construction → client callbacks → connect → |
| 137 | + act → assert. The test reads in the order the conversation happens. |
| 138 | +- A registered handler or tool that a test never invokes gets a `raise NotImplementedError` body |
| 139 | + so it cannot silently become load-bearing. |
| 140 | + |
| 141 | +### Choosing an assertion |
| 142 | + |
| 143 | +| The property under test is… | Assert with | |
| 144 | +|---|---| |
| 145 | +| the result of a transformation (arguments → output, exception → error result) | `result == snapshot(...)` of the full object, so any field the implementation adds or drops fails the test | |
| 146 | +| pass-through of an opaque value (`_meta`, cursors) | identity against the same variable that was sent — a snapshot of a pass-through value only matches the input because a human checked two literals correspond | |
| 147 | +| an error | `pytest.raises(MCPError)` and a snapshot of `exc.value.error` when the message is the SDK's own; a plain `==` on `.code` against the `mcp.types` constant when it is not | |
| 148 | +| third-party output embedded in a result (validation messages) | the stable prefix only — never pin text that changes with a dependency upgrade | |
| 149 | + |
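Two rows of the table are easy to get wrong, so here is a small sketch of each. It deliberately avoids the SDK: `echo_meta` is a hypothetical stand-in for a round trip that passes `_meta` through, and the validation message is an invented example of third-party text.

```python
def echo_meta(meta: dict[str, object]) -> dict[str, object]:
    """Hypothetical stand-in for a round trip that returns `_meta` untouched."""
    return dict(meta)  # a copy, as a value would be after crossing the wire


# Pass-through: compare against the variable that was sent, not a second literal.
sent_meta: dict[str, object] = {"trace": "abc-123"}
received_meta = echo_meta(sent_meta)
assert received_meta == sent_meta

# Third-party output: pin only the stable prefix, not the dependency's wording.
message = "Input validation error: 'a' is a required property"  # invented text
assert message.startswith("Input validation error")
```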
| 150 | +### Notifications and concurrency |
| 151 | + |
| 152 | +The client's receive loop dispatches each incoming message to completion before reading the next, |
| 153 | +and the in-memory transport delivers everything on one ordered stream. Together these guarantee |
| 154 | +that every notification a server handler emits before sending its response is delivered to the |
| 155 | +client callback before the originating request returns, so tests collect notifications into a |
| 156 | +plain list and assert after the call, with no synchronisation. The exceptions: |
| 157 | + |
| 158 | +- a notification not triggered by a request the test is awaiting needs an `anyio.Event` set in |
| 159 | + the receiving handler and awaited under `anyio.fail_after(5)`; |
| 160 | +- the ordering guarantee does not survive transports that split messages across streams (the |
| 161 | + streamable HTTP standalone GET stream) — see `transports/test_streamable_http.py`. |
| 162 | + |
| 163 | +### Coverage |
| 164 | + |
| 165 | +CI requires 100% line and branch coverage, including `tests/`, and `strict-no-cover` fails the |
| 166 | +build if a line marked `# pragma: no cover` is ever executed. When a new test starts covering a |
| 167 | +pragma'd line in `src/`, delete the pragma in the same change. Do not add new `# pragma`, |
| 168 | +`# type: ignore`, or `# noqa` comments; restructure instead. |