Commit 2f0da6e

test: document the interaction suite's conventions and manifest workflow
1 parent d739975 commit 2f0da6e

1 file changed

Lines changed: 168 additions & 0 deletions

tests/interaction/README.md

@@ -0,0 +1,168 @@
# Interaction-model test suite

This suite enumerates the MCP interaction model as end-to-end tests: one test per piece of
functionality, asserting the full client↔server round trip through the public API. It exists to
pin the SDK's observable behaviour (every request type, every notification direction, every
error plane) so that internal rewrites of the send/receive path can be proven equivalent by
running the suite before and after.

```bash
uv run --frozen pytest tests/interaction/
```

The whole suite is in-memory and event-driven; it runs in about a second.

## Ground rules

- **Public API only.** Tests drive a `Client` connected to a `Server` or `MCPServer`. Nothing
  reaches into session internals, so the suite keeps working when those internals change.
  `ClientSession` is used directly only for behaviours `Client` cannot express (skipping
  initialization, requesting a non-default protocol version).
- **Pin current behaviour.** Every test passes against the current `main`, including behaviours
  that diverge from the specification. A failing or xfailed test proves nothing about whether a
  rewrite preserved behaviour; a passing test that exactly pins the wrong output does. Known
  divergences are recorded as data on the requirement (see below), not worked around in the test.
- **Spec-mandated assertions, not implementation quirks.** Error *codes* are asserted against
  the constants in `mcp.types`; error *message strings* are pinned only where they are the
  SDK's own deliberate output.
- **No sleeps, no real I/O.** Concurrency is coordinated with `anyio.Event`; every wait that
  could hang is bounded by `anyio.fail_after(5)`. The streamable HTTP tests drive the Starlette
  app in-process through `httpx.ASGITransport`, with no sockets, threads, or subprocesses anywhere.

## Layout

```text
tests/interaction/
  _requirements.py    the requirements manifest (see below)
  _helpers.py         shared type aliases + the wire-recording transport
  test_coverage.py    enforces the manifest ↔ test contract
  lowlevel/           one file per feature area, against the low-level Server
  mcpserver/          the same feature areas in MCPServer's natural idiom
  transports/         a smoke subset over the streamable HTTP framing
```

The two server APIs produce genuinely different wire output for the same conceptual feature
(`MCPServer` generates schemas, converts exceptions to `isError` results, attaches structured
content), so they get parallel directories with mirrored file names rather than one parametrized
test body; each directory pins its flavour's true output exactly.

## The requirements manifest

`_requirements.py` maps every behaviour the suite covers to the reason it must hold:

```python
"tools:call:content:text": Requirement(
    source=f"{SPEC_BASE_URL}/server/tools#text-content",
    behavior="tools/call delivers arguments to the tool handler and returns its text content.",
),
```

- **`source`** is a deep link into the MCP specification for externally mandated behaviour,
  the literal string `"sdk"` for behaviour the SDK chose where the spec is silent, or
  `"issue:#n"` for a regression lock.
- **`behavior`** describes what the suite *asserts*, which is always the SDK's current
  behaviour, never an aspiration.
- **`divergence`** records the gap when current behaviour differs from what `source` mandates,
  with an issue link once one exists. The test still pins current behaviour.
- **`deferred`** marks a behaviour that is deliberately not covered, with the reason.

Tests link themselves to the manifest with a decorator:

```python
@requirement("tools:call:content:text")
async def test_call_tool_returns_text_content() -> None: ...
```

`test_coverage.py` enforces the contract in both directions: every non-deferred requirement must
be exercised by at least one test, every deferred requirement by none, and an unknown ID fails at
import time. A behaviour without a manifest entry cannot be silently half-tested, and a manifest
entry without a test cannot be silently aspirational.
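
As a rough illustration of the two-way contract described above, the sketch below shows one way a manifest, a linking decorator, and a coverage check could fit together. It is not the suite's actual code: the `COVERED` registry, the `contract_violations` helper, the trimmed-down `Requirement` fields, and the `example:deferred` ID are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

F = TypeVar("F", bound=Callable[..., object])


@dataclass(frozen=True)
class Requirement:
    source: str
    behavior: str
    deferred: Optional[str] = None  # reason the behaviour is deliberately not covered


# Hypothetical two-entry manifest; the real one lives in _requirements.py.
REQUIREMENTS = {
    "tools:call:content:text": Requirement(source="sdk", behavior="returns text content"),
    "example:deferred": Requirement(source="sdk", behavior="placeholder", deferred="illustration only"),
}

# Hypothetical registry: requirement ID -> names of the tests that claim it.
COVERED: dict = {}


def requirement(req_id: str) -> Callable[[F], F]:
    """Link a test to a manifest entry; an unknown ID raises as soon as the decorator runs."""
    if req_id not in REQUIREMENTS:
        raise KeyError(f"unknown requirement: {req_id}")

    def decorate(test: F) -> F:
        COVERED.setdefault(req_id, []).append(test.__name__)
        return test

    return decorate


def contract_violations() -> list:
    """Return one message per requirement that breaks the manifest/test contract."""
    problems = []
    for req_id, req in REQUIREMENTS.items():
        tested = bool(COVERED.get(req_id))
        if req.deferred is None and not tested:
            problems.append(f"{req_id}: no test exercises it")
        if req.deferred is not None and tested:
            problems.append(f"{req_id}: deferred but tested")
    return problems


@requirement("tools:call:content:text")
def test_call_tool_returns_text_content() -> None: ...
```

Because the decorator runs when the test module is imported, a mistyped ID would surface during collection rather than as a silently unlinked test.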

### The divergence lifecycle

1. A test reveals that the SDK does not do what the spec says. The test pins what the SDK
   *actually does* and a `Divergence(note=..., issue=...)` goes on the requirement.
2. When the behaviour is eventually fixed, the pinned test fails. Whoever makes the change finds
   the divergence note explaining that the old behaviour was a known gap, re-pins the test to the
   spec-correct output, and deletes the `Divergence`.
3. An empty divergence list means the SDK is spec-conformant on every behaviour the suite covers.

This is also the triage key for any rewrite: a test that fails on the new code path either has a
divergence note (the rewrite accidentally fixed a known gap; decide whether to keep the fix) or
it does not (the rewrite broke something that was correct; fix the rewrite).
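
The triage rule can be read as a two-way classification. The following is a minimal sketch under stated assumptions: the `triage` function and the requirement IDs in the usage lines are hypothetical, and `Divergence` is reduced to the two fields named above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Divergence:
    note: str
    issue: Optional[str] = None


def triage(failed_ids: list, divergences: dict) -> dict:
    """Classify each failing requirement ID by whether a divergence is recorded for it."""
    verdicts = {}
    for req_id in failed_ids:
        if req_id in divergences:
            # The pinned output was a known gap, so the rewrite changed it.
            verdicts[req_id] = "known gap changed: decide whether to keep the fix"
        else:
            # The pinned output was believed correct, so the rewrite regressed it.
            verdicts[req_id] = "regression: fix the rewrite"
    return verdicts


# Hypothetical IDs, for illustration only.
known = {"example:known-gap": Divergence(note="SDK omits a field the spec requires")}
verdicts = triage(["example:known-gap", "example:was-correct"], known)
```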

### When a new spec revision is released

1. Update `SPEC_REVISION` and walk the new revision's changelog.
2. For each changed interaction, find its requirements (the IDs use the wire method strings the
   changelog speaks in), re-audit the tests against the new text, and update `source` links and
   assertions where behaviour legitimately changed.
3. New interactions get new requirements and new tests; removed interactions get their
   requirements deleted along with their tests.
4. A behaviour that is correct under both revisions needs no change beyond the `source` link.

## Writing a test

The shortest complete example of the conventions:

```python
@requirement("tools:call:content:text")
async def test_call_tool_returns_text_content() -> None:
    """Arguments reach the tool handler; its content comes back as the call result."""

    async def call_tool(ctx: ServerRequestContext, params: types.CallToolRequestParams) -> CallToolResult:
        assert params.name == "add"
        assert params.arguments is not None
        return CallToolResult(content=[TextContent(text=str(params.arguments["a"] + params.arguments["b"]))])

    server = Server("adder", on_call_tool=call_tool)

    async with Client(server) as client:
        result = await client.call_tool("add", {"a": 2, "b": 3})

    assert result == snapshot(CallToolResult(content=[TextContent(text="5")]))
```

- **The server is defined inside the test** (or in a small fixture at the top of the file when
  several tests genuinely share it). The whole observable behaviour fits on one screen.
- **Test names are behaviour sentences**: they state the observable outcome, not the feature
  being poked. Docstrings add the one or two sentences of context a reviewer needs, including
  whether the assertion is spec-mandated, SDK-defined, or a known divergence.
- **Handlers assert their dispatch identity first** (`assert params.name == "add"`), proving the
  request that arrived is the request the test sent.
- **The result proves the round trip.** Server-side observations travel back to the test through
  the protocol itself (a tool returns what it saw) or through a closure-captured list; the test
  asserts after the call returns.
- **Order within a test**: server handlers → server construction → client callbacks → connect →
  act → assert. The test reads in the order the conversation happens.
- A registered handler or tool that a test never invokes gets a `raise NotImplementedError` body
  so it cannot silently become load-bearing.

### Choosing an assertion

| The property under test is… | Assert with |
|---|---|
| the result of a transformation (arguments → output, exception → error result) | `result == snapshot(...)` of the full object, so any field the implementation adds or drops fails the test |
| pass-through of an opaque value (`_meta`, cursors) | identity against the same variable that was sent; a snapshot of a pass-through value only matches the input because a human checked two literals correspond |
| an error | `pytest.raises(MCPError)` and a snapshot of `exc.value.error` when the message is the SDK's own; a plain `==` on `.code` against the `mcp.types` constant when it is not |
| third-party output embedded in a result (validation messages) | the stable prefix only; never pin text that changes with a dependency upgrade |
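
Two of the rows above, pass-through values and third-party output, can be illustrated without the SDK. In this sketch `echo_meta` and `validation_error_text` are hypothetical stand-ins for a real round trip, and the message text is invented for the example rather than taken from any library.

```python
def echo_meta(meta: dict) -> dict:
    """Hypothetical stand-in for a round trip that must return an opaque value untouched."""
    return dict(meta)


def validation_error_text() -> str:
    """Hypothetical stand-in for a result that embeds a dependency's validation message."""
    return "Error executing tool add: details that vary with the validator version"


# Pass-through: compare against the variable that was sent, not a second literal.
sent_meta = {"trace": "abc123"}
received_meta = echo_meta(sent_meta)
assert received_meta == sent_meta

# Third-party output: pin only the prefix the SDK controls.
message = validation_error_text()
assert message.startswith("Error executing tool add:")
```

Comparing against `sent_meta` means the test cannot drift out of step with its own input, which a pair of hand-copied literals could.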

### Notifications and concurrency

The client's receive loop dispatches each incoming message to completion before reading the next,
and the in-memory transport delivers everything on one ordered stream. Together these guarantee
that every notification a server handler emits ahead of its response is delivered to the client
callback before the originating request returns, so tests collect notifications into a plain list
and assert after the call, with no synchronisation. The exceptions:

- a notification not triggered by a request the test is awaiting needs an `anyio.Event` set in
  the receiving handler and awaited under `anyio.fail_after(5)`;
- the ordering guarantee does not survive transports that split messages across streams (the
  streamable HTTP standalone GET stream); see `transports/test_streamable_http.py`.
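
The ordering argument can be modelled with the standard library alone. The sketch below is a toy, not the SDK's transport: it uses `asyncio.Queue` as the single ordered stream and `asyncio.wait_for` as a stand-in for `anyio.fail_after(5)`, and every name in it is an assumption made for illustration.

```python
import asyncio


async def demo() -> tuple:
    stream = asyncio.Queue()  # one ordered stream carrying every message
    notifications = []  # plain list filled by the "client callback"
    response = asyncio.get_running_loop().create_future()

    async def server_handler() -> None:
        # Notifications are written to the stream before the response.
        await stream.put(("notification", "progress 1/2"))
        await stream.put(("notification", "progress 2/2"))
        await stream.put(("response", "done"))

    async def receive_loop() -> None:
        # Each message is handled to completion before the next one is read.
        while True:
            kind, payload = await stream.get()
            if kind == "notification":
                notifications.append(payload)
            else:
                response.set_result(payload)
                return

    receiver = asyncio.create_task(receive_loop())
    await server_handler()
    # Bounded wait so a bug hangs for at most five seconds instead of forever.
    result = await asyncio.wait_for(response, timeout=5)
    await receiver
    return notifications, result


notifications, result = asyncio.run(demo())
```

By the time `result` is available, both notifications are already in the list, which is why no extra synchronisation is needed in this model. A notification that arrives outside an awaited request has no such anchor and would need an event instead.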

### Coverage

CI requires 100% line and branch coverage, including `tests/`, and `strict-no-cover` fails the
build if a line marked `# pragma: no cover` is ever executed. When a new test starts covering a
pragma'd line in `src/`, delete the pragma in the same change. Do not add new `# pragma`,
`# type: ignore`, or `# noqa` comments; restructure instead.
