Skip to content

Tolerate spec-violating list methods on backend init#5232

Open
tgrunnagle wants to merge 1 commit intomainfrom
issue_5231
Open

Tolerate spec-violating list methods on backend init#5232
tgrunnagle wants to merge 1 commit intomainfrom
issue_5231

Conversation

@tgrunnagle
Copy link
Copy Markdown
Contributor

Summary

A vMCP backend that advertises capabilities.resources (or capabilities.prompts) in its initialize response but does not actually implement resources/list (or prompts/list) currently fails per-session backend init outright. Every tool from that backend silently disappears from tools/list, leaving users with no signal that anything went wrong — a real problem against third-party servers like Atlassian's Rovo MCP that we don't control. This PR makes the per-session bootstrap tolerate that specific spec violation while keeping all other failure modes fatal.

In initAndQueryCapabilities, a JSON-RPC -32601 Method not found from resources/list or prompts/list is now treated as "the backend has no resources/prompts" rather than a fatal init error, with a WARN log recording the spec violation. tools/list errors and any non--32601 error from the resources/prompts list calls remain fatal.

Closes #5231

Changes Made

pkg/vmcp/session/internal/backend/mcp_session.go

  • Detect mcp.ErrMethodNotFound from resources/list and prompts/list via errors.Is and recover with an empty result set instead of returning an error.
  • Emit a WARN log on recovery, including backendID and the offending list method, so operators can flag the upstream bug without ERROR-level noise.
  • Leave tools/list strict (a backend with no tool surface is not useful to expose) and leave non--32601 list errors fatal (we are not silencing arbitrary failures).

pkg/vmcp/session/internal/backend/mcp_session_capabilities_test.go (new)

  • Table-driven tests covering each acceptance-criteria scenario via a fake streamable-http MCP backend.

Implementation Details

  • The -32601 detection relies on mcp-go's JSONRPCErrorDetails.AsError() wrapping the sentinel mcp.ErrMethodNotFound, so errors.Is(listErr, mcp.ErrMethodNotFound) is the correct discriminator across upstream message variations (e.g., the doubled "method not found: Method not found" string seen in production).
  • The recovery is scoped narrowly to the spec-violation case. Transport errors, timeouts, and other JSON-RPC error codes still abort init, preserving visibility into genuine problems.

Testing

  • New unit tests in pkg/vmcp/session/internal/backend/mcp_session_capabilities_test.go cover the acceptance criteria:
    • resources/list returns -32601 after the server advertised resources -> init succeeds with no resources, tools remain reachable.
    • prompts/list returns -32601 after the server advertised prompts -> init succeeds with no prompts, tools remain reachable.
    • Both list methods return -32601 simultaneously -> init still succeeds, tools remain reachable.
    • tools/list returns -32601 -> init still fails (regression guard).
    • resources/list returns -32601 INTERNAL_ERROR (non--32601) -> init still fails (regression guard).
    • prompts/list returns INVALID_PARAMS (non--32601) -> init still fails (regression guard).
  • task test (unit tests) and task lint-fix (linting) run clean on the touched package.

Additional Notes

  • Manual verification against the Atlassian Rovo MCP server (https://mcp.atlassian.com/v1/mcp) called out in the issue's acceptance criteria has not been performed in this PR; the unit tests exercise the same code path with a fake backend that produces the equivalent -32601 response. Reviewers who have the Rovo wiring available should confirm tools/list returns the workload-prefixed Atlassian tools after OAuth completes.
  • The same fragility class is worth a follow-up audit elsewhere in the session bootstrap (subscriptions, completions) — anywhere we derive "the server supports method X" from the initialize response and then unconditionally call X. Out of scope for this PR.

When a backend advertises resources or prompts capability in its
initialize response but returns JSON-RPC -32601 to resources/list or
prompts/list, treat that as "the backend has no resources/prompts"
rather than a fatal init error. This unblocks third-party MCP servers
(e.g. Atlassian Rovo) whose initialize response contradicts their
implemented method set, so users still get the backend's tools instead
of silently losing them.

Implements changes for issue #5231:
- Recover from errors.Is(err, mcp.ErrMethodNotFound) on resources/list
  and prompts/list with a WARN log naming the backend and method.
- Keep tools/list failures and non-(-32601) errors from list methods
  fatal so we are not silencing arbitrary failures.
- Add table-driven unit tests with a fake JSON-RPC backend covering
  the recoverable, fatal, and regression-guard cases.
@github-actions github-actions Bot added the size/M Medium PR: 300-599 lines changed label May 8, 2026
@tgrunnagle tgrunnagle marked this pull request as ready for review May 8, 2026 20:36
@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 86.48649% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.93%. Comparing base (9211a36) to head (b8c1d63).

Files with missing lines Patch % Lines
pkg/vmcp/session/internal/backend/mcp_session.go 86.48% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5232      +/-   ##
==========================================
+ Coverage   67.86%   67.93%   +0.06%     
==========================================
  Files         610      610              
  Lines       62522    62536      +14     
==========================================
+ Hits        42431    42482      +51     
+ Misses      16910    16873      -37     
  Partials     3181     3181              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@lorr1 lorr1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-agent review summary

Recommendation: COMMENT (non-blocking).

The fix is correct, well-scoped, and end-to-end safe. errors.Is(listErr, mcp.ErrMethodNotFound) is the right discriminator against the mcp-go v0.49.0 sentinel (JSONRPCErrorDetails.AsError() wraps it via fmt.Errorf("%w: ..."), so errors.Is matches both bare and wrapped variants). Switch ordering is nil-safe. tools/list strictness is preserved. Empty caps.Resources/caps.Prompts flow correctly through pkg/vmcp/session/factory.go and the aggregator merge — a downstream client resources/list against vMCP simply returns an empty merged set.

Four specialist reviewers (Go correctness, MCP protocol, test coverage, general code quality) consulted. Codex cross-review was attempted but timed out; not blocking.

Summary of inline findings

# Severity Theme
1 MEDIUM Switch default arm (the success-path populate loop) is unverified
2 LOW Comment phrasing ("spec violation" / "empty resource set")
3 LOW wantToolsCalled bool redundant with int counters
4 LOW assert.True(strings.Contains(...)) should be assert.Contains
5 LOW WARN log lacks workload name and base URL
6 LOW errors.Is won't match HTTP-level method-missing variants (404/501)

Out-of-scope / informational

  • Pre-existing: list calls do not loop on nextCursor — file a follow-up issue if pagination matters.
  • PR description references mcp-go v0.43.2; go.mod actually pins v0.49.0. Sentinel and wrapping semantics are identical, so the code is correct — doc nit only.
  • WARN may be noisy under reconnect churn; rate-limit per (backendID, method) only if it shows up in practice.
  • vMCP does not currently forward notifications/resources/list_changed, so the empty-recovery path does not create phantom subscriptions today.

// returns JSON-RPC -32601 to the corresponding list method, init must succeed
// with an empty capability set rather than aborting and dropping the backend's
// tools.
func TestInitAndQueryCapabilities_RecoversFromMethodNotFound(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Success-path (default switch arm) is unverified.

The production refactor moves the populate loop into a new default arm of the switch. None of the cases here exercise that arm with a non-empty payload — every recovery test injects an error, and every fatal-error test aborts before reaching it. A future regression that inverts a condition or drops a field assignment in the populate loop would not be caught.

Consider adding at least one positive subtable row that returns a non-empty resources (and one for prompts, ideally with arguments) and asserts field-by-field mapping including BackendID == target.WorkloadID. The existing fakeBackend already supports this — set advertiseResources: true and populate resources: []mcp.Resource{{...}}.

Comment on lines +391 to +394
// Spec violation: backend advertised the resources capability in its
// initialize response but does not implement resources/list. Recover
// with an empty resource set so the backend's tools remain usable
// instead of failing init outright. See issue #5231.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] Comment phrasing. Three concerns converge on this comment block (mirrored in the prompts case at line 418):

  1. "Recover with an empty resource set" reads as if a slice is allocated. Actually caps.Resources simply remains at its zero value (nil) — range over nil and len are both zero, so the behavior is correct, but the wording mildly misleads about state.
  2. The MCP spec doesn't contain a normative MUST for "if you advertise capability X you must implement X/list." The label "spec violation" is rhetorically reasonable but not directly citable from the spec.
  3. The "why" (keep tools usable) is the third sentence in; the first two restate what the surrounding errors.Is(...) and slog.Warn already convey.

Suggest tightening to one rationale-focused sentence, e.g.:

Suggested change
// Spec violation: backend advertised the resources capability in its
// initialize response but does not implement resources/list. Recover
// with an empty resource set so the backend's tools remain usable
// instead of failing init outright. See issue #5231.
// Tolerate -32601 here so a backend that advertises the resources
// capability but does not implement resources/list (e.g. Atlassian Rovo,
// see #5231) still contributes its tools instead of being dropped.

Apply symmetrically to the prompts block at line 418.

wantTools int
wantResources int
wantPrompts int
wantToolsCalled bool
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] wantToolsCalled bool is redundant and asymmetric.

Every populated row sets wantToolsCalled: true and the assertion at line 305 is hard-coded to 1, so the bool encodes no information beyond "do this assert." Meanwhile resources and prompts use explicit int counters (wantResListCalls, wantPromptCalls).

Suggest replacing with wantToolsCalls int, set to 1 in each row, and applying the same pattern as the other counters at the assertion site:

if tc.wantToolsCalls > 0 {
    assert.Equal(t, tc.wantToolsCalls, tc.fb.callCount(string(mcp.MethodToolsList)))
}

Makes future cases that assert tools/list was not called (e.g., when neither resources nor prompts is advertised) trivial to add.


_, err := initAndQueryCapabilities(context.Background(), c, target)
require.Error(t, err)
assert.True(t, strings.Contains(err.Error(), tc.errSubstr),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] Use assert.Contains (or require.ErrorContains) instead of reinventing it.

Gives a clearer failure message (testify shows both haystack and needle automatically) and lets you drop the strings import.

Suggested change
assert.True(t, strings.Contains(err.Error(), tc.errSubstr),
assert.ErrorContains(t, err, tc.errSubstr)

Comment on lines +395 to +398
slog.Warn("backend advertised resources capability but does not implement resources/list",
"backendID", target.WorkloadID,
"method", "resources/list",
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] WARN log lacks workload name and base URL.

When this WARN fires in production it's the operator's primary breadcrumb that a backend silently produced an empty resource/prompt set. backendID (= target.WorkloadID) requires an extra ID-to-name lookup to correlate to a specific MCP server. Adding name and baseURL makes the log directly greppable. Same applies to the prompts WARN at line 422.

Suggested change
slog.Warn("backend advertised resources capability but does not implement resources/list",
"backendID", target.WorkloadID,
"method", "resources/list",
)
slog.Warn("backend advertised resources capability but does not implement resources/list",
"backendID", target.WorkloadID,
"name", target.WorkloadName,
"baseURL", target.BaseURL,
"method", "resources/list",
)

resResult, listErr := c.ListResources(ctx, mcp.ListResourcesRequest{})
if listErr != nil {
switch {
case errors.Is(listErr, mcp.ErrMethodNotFound):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] errors.Is will not match HTTP-level method-missing variants.

Confirmed mcp-go@v0.49.0's JSONRPCErrorDetails.AsError() wraps mcp.ErrMethodNotFound for the -32601 JSON-RPC envelope, so this discriminator is correct for the Atlassian Rovo case (which returns a proper JSON-RPC error response).

Caveat worth noting: some real backends behind reverse proxies or API gateways surface "this method isn't here" as HTTP 404 / 405 / 501 rather than a JSON-RPC -32601 envelope. Those return non-sentinel errors from the transport layer, so they will fall through to the fatal case listErr != nil arm under this PR. The narrow scope is defensible (the issue specifically calls out -32601), but the next operator hitting an HTTP-404 backend will wonder why their case wasn't recovered.

No change required for this PR. Worth either (a) a follow-up issue to enumerate which transport-level signals should also be treated as method-missing, or (b) a half-line in the recovery comment noting that only JSON-RPC -32601 envelopes are recovered (HTTP-level method absence is intentionally fatal).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

vMCP backend init fails fatally when server advertises resources capability without resources/list

2 participants