Skip to content

feat(mcp): add hyperdx_describe_source tool and slim list_sources to catalog#2309

Merged
kodiakhq[bot] merged 11 commits into
mainfrom
brandon/mcp-source-refactor
May 21, 2026
Merged

feat(mcp): add hyperdx_describe_source tool and slim list_sources to catalog#2309
kodiakhq[bot] merged 11 commits into
mainfrom
brandon/mcp-source-refactor

Conversation

@brandon-pereira
Copy link
Copy Markdown
Member

@brandon-pereira brandon-pereira commented May 19, 2026

Problem

The #1 MCP error class (39% of clickstack failures, 9 of 23 errors) is unknown column or identifier — the LLM invents column names like TimestampTime, service.name, count, Timestamp, rows_read, memory_usage that don't exist in the ClickHouse schema.

Root causes:

  1. hyperdx_list_sources returned full column schemas for ALL sources in one response (50KB+), which overwhelms LLM context
  2. No way to discover actual column values — the LLM guesses SeverityText:error when the real data uses ERROR or Error
  3. Prompt guidance says "don't assume values" but provides no tool to discover them

Solution

New tool: hyperdx_describe_source

Per-source schema discovery that returns:

  • Column schema (name, ClickHouse type, JS type) from DESCRIBE TABLE
  • Map attribute keys (e.g. SpanAttributes, ResourceAttributes) — max 50 per map column
  • Low-cardinality value samples for LowCardinality(String) columns like SeverityText, StatusCode, ServiceName — uses existing getAllKeyValues() rollup tables for performance
  • Map attribute value samples for top-10 keys per map column — also via rollup tables
  • 10-second hard timeout with structured AbortError handling
  • Team-scoped access via getSource(teamId, sourceId)

Slimmed hyperdx_list_sources

Now a lightweight MongoDB-only catalog:

  • Returns: id, name, kind, connectionId, timestampColumn, keyColumns
  • No ClickHouse queries — instant response
  • Includes nextStep hint pointing to describe_source

Two-step workflow

hyperdx_list_sources          →  pick sourceId(s)
hyperdx_describe_source       →  get columns, attribute keys, sample values
hyperdx_timeseries / table / search / ...  →  query with real column names

Changes

File Change
tools/sources/describeSource.ts NEWhyperdx_describe_source tool
tools/sources/index.ts NEW — registers both source tools
tools/sources/listSources.ts MOVED from dashboards/, slimmed to catalog
tools/dashboards/listSources.ts DELETED — moved to sources/
tools/dashboards/index.ts Removed source tool registrations
mcpServer.ts Registers new sourcesTools module
tools/query/{search,table,timeseries,sql,eventPatterns}.ts Tool descriptions reference describe_source
prompts/dashboards/content.ts Workflow, SeverityText guidance, common mistakes updated
__tests__/sources.test.ts NEW — 11 tests (catalog, schema, LC values, metrics, cross-team)
__tests__/dashboards.test.ts Removed source tests (moved to sources.test.ts)
MCP.md Documents two-step workflow and new tool

No changes to common-utils — reuses existing getAllKeyValues() and getMapValues() methods.

Changeset

Patch release — this is a new tool addition and behavioral change to list_sources, but the MCP server is still in beta (version: ${CODE_VERSION}-beta), so a patch bump is appropriate.

Testing

Ensure agents call the new tool locally, and it queries the correct SeverityText (not hallucinate)
Screenshot 2026-05-19 at 1 28 39 PM

  • make ci-lint — passes (0 errors)
  • make ci-unit — 957 tests pass across 19 suites

…catalog

Addresses the #1 MCP error class (39% of failures): LLMs guessing column
names that don't exist (TimestampTime, service.name, count, etc.).

New tool: hyperdx_describe_source
- Full column schema (name, CH type, JS type) from DESCRIBE TABLE
- Map attribute keys (SpanAttributes, ResourceAttributes, etc.)
- Sampled low-cardinality values for columns like SeverityText, StatusCode,
  ServiceName — uses existing rollup tables via getAllKeyValues for
  performance, falls back to getMapValues
- Sampled map attribute values for top-10 keys per map column
- 10-second hard timeout with structured AbortError handling
- Team-scoped access control via getSource(teamId, sourceId)

Slimmed hyperdx_list_sources to a lightweight catalog:
- Returns only id/name/kind/connectionId/keyColumns from MongoDB
- No ClickHouse queries — instant response
- Includes nextStep hint pointing to describe_source

Moved source tools to dedicated module:
- New packages/api/src/mcp/tools/sources/ directory
- Registered as sourcesTools in mcpServer.ts
- Dedicated test file: sources.test.ts (11 tests)

Updated all tool descriptions and prompts:
- All 5 split query tools reference describe_source
- content.ts workflow updated to 5-step flow
- SeverityText/StatusCode guidance references lowCardinalityValues
- MCP.md documents the two-step discovery workflow
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 19, 2026

🦋 Changeset detected

Latest commit: 51d8327

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
@hyperdx/api Patch
@hyperdx/app Patch
@hyperdx/otel-collector Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel
Copy link
Copy Markdown

vercel Bot commented May 19, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 21, 2026 8:35pm

Request Review

@github-actions github-actions Bot added the review/tier-3 Standard — full human review required label May 19, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

🟡 Tier 3 — Standard

Introduces new logic, modifies core functionality, or touches areas with non-trivial risk.

Why this tier:

  • Diff size: 757 production lines changed (Tier 2 max: < 250)

Review process: Full human review — logic, architecture, edge cases.
SLA: First-pass feedback within 1 business day.

Stats
  • Production files changed: 12
  • Production lines changed: 757 (+ 400 in test files, excluded from tier calculation)
  • Branch: brandon/mcp-source-refactor
  • Author: brandon-pereira

To override this classification, remove the review/tier-3 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

E2E Test Results

All tests passed • 177 passed • 3 skipped • 1253s

Status Count
✅ Passed 177
❌ Failed 0
⚠️ Flaky 5
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

PR Review

Solid refactor with good test coverage (cross-team access, metric-source edge case, partial-result handling). A few items worth addressing:

  • ⚠️ Leaked timer in describeSource.ts:1220setTimeout is never cleared on the success path. After a fast response, the timer still fires ~10s later, calls controller.abort() (harmless) and rejects (silently dropped by Promise.race), but keeps the event loop alive. Store the id and clearTimeout(id) in a finally.
  • ⚠️ getColumns() / getMapKeys() ignore the abort signal (describeSource.ts:999, 1019) → the wall-clock timeout only bounds the response, not the work. A slow DESCRIBE TABLE keeps running after the tool already returned an error. Pass signal through (or accept this as best-effort and add a note).
  • ⚠️ Partial-result flag is under-reported (describeSource.ts:1037, 1090) → skippedStages is only pushed when results are completely empty. If the abort fires mid-flight after one column succeeded, partial stays false even though sampling was truncated. Track whether each stage started but didn't finish, not whether it produced zero rows.
  • 💡 lcColumns regex strips whitespace then substring-matches String (describeSource.ts:1045-1051) → also matches LowCardinality(FixedString(N)), Array(LowCardinality(String)), etc. Probably fine, but worth a comment if intentional, or tighten to ^LowCardinality\((Nullable\()?String/.
  • 💡 20 parallel getAllKeyValues calls per describe (LC columns + 10 keys × N map columns) → acceptable since they hit the rollup cache, but worth confirming there's no per-connection concurrency cliff on cold-cache calls.

No security issues (team scoping via getSource(teamId, sourceId) is correct; map keys flow through parameterized queries in getAllKeyValues).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

Deep Review

✅ No critical issues found.

🟡 P2 -- recommended

  • packages/api/src/mcp/tools/sources/describeSource.ts:147,167,219,256 -- the new 10s Promise.race timeout cannot actually cancel server-side ClickHouse work because getColumns, getMapKeys, and the getAllKeyValues fallback to getMapValues in packages/common-utils/src/core/metadata.ts:262,343,1424 don't accept or honor AbortSignal, so each timed-out describe leaves multiple in-flight queries running for up to CH's max_execution_time: 15 per stage.

    • Fix: Add signal?: AbortSignal to getColumns, getMapKeys, and getMapValues, thread it through the getAllKeyValues fallback path, and forward controller.signal from every metadata call site in describeSource.
    • ce-correctness-reviewer, ce-reliability-reviewer, ce-adversarial-reviewer
  • packages/api/src/mcp/tools/sources/describeSource.ts:231-233,270-272 -- both LC-value and map-value sampling catch every error with an empty catch, so a 503 / permissions error / rollup-table failure silently produces a response with no lowCardinalityValues, no partial: true flag, and no skippedStages entry, defeating the tool's stated contract that "these are the REAL values in your data".

    • Fix: Log the error via logger.warn and push the stage name into skippedStages so meta.partial reflects sampling failures, not just signal.aborted.
    • ce-correctness-reviewer, ce-reliability-reviewer, ce-kieran-typescript-reviewer, ce-maintainability-reviewer
  • packages/api/src/mcp/tools/sources/describeSource.ts:40 -- a syntactically invalid sourceId such as "abc" passes Zod's z.string() but causes Source.findOne to throw a Mongoose CastError that bypasses the friendly "Source ... not found" response at line 47 and surfaces as an unhandled internal error to the MCP client.

    • Fix: Validate the input with mongoose.Types.ObjectId.isValid(sourceId) (or catch CastError around getSource) and return the same isError "not found" response on invalid IDs.
    • ce-correctness-reviewer, ce-security-reviewer, ce-testing-reviewer
  • packages/api/src/mcp/tools/sources/describeSource.ts:362-403 -- the headline DESCRIBE_TIMEOUT mechanism and the partial / skippedStages contract have zero direct test coverage, so a regression that drops the timeout error message or stops setting partial would ship undetected.

    • Fix: Inject DESCRIBE_TIMEOUT_MS (or mock a metadata method to return a never-resolving promise) in tests and assert both the "Schema discovery timed out." text and output.source.partial === true with the expected skippedStages entries.
    • ce-testing-reviewer, ce-kieran-typescript-reviewer
  • packages/api/src/mcp/__tests__/sources.test.ts:174-340 -- no test seeds ClickHouse data and asserts lowCardinalityValues or mapAttributeValues contain actual sampled values, so the PR's central "discover real values so the LLM doesn't guess" claim is unverified and would silently regress.

    • Fix: Seed distinct SeverityText / ServiceName rows (and a ResourceAttributes map) via bulkInsertLogs, then assert output.source.lowCardinalityValues.SeverityText contains the seeded values and output.source.mapAttributeValues["ResourceAttributes['service.name']"] is non-empty.
    • ce-testing-reviewer
  • packages/api/src/mcp/tools/sources/describeSource.ts:364-369 -- the setTimeout handle is never cleared on the success path, so the timer still fires controller.abort() and rejects an internal promise up to 10s after the response is sent, retaining the meta blob and AbortController closure for that whole window.

    • Fix: Capture the handle (const timer = setTimeout(...)) and clearTimeout(timer) in a finally block around Promise.race.
    • ce-correctness-reviewer, ce-reliability-reviewer, ce-kieran-typescript-reviewer, ce-maintainability-reviewer, ce-adversarial-reviewer
🔵 P3 nitpicks (10)
  • packages/api/src/mcp/__tests__/sources.test.ts:101-170 -- hyperdx_list_sources lost its cross-team isolation test in the move from dashboards.test.ts; the empty-team test does not prove team scoping.

    • Fix: Add a test that creates a second team with sources, calls list_sources as team A, and asserts team B's IDs are absent from output.sources and output.connections.
    • ce-correctness-reviewer, ce-security-reviewer, ce-testing-reviewer
  • packages/api/src/mcp/tools/sources/describeSource.ts:53-94, packages/api/src/mcp/tools/sources/listSources.ts:38-79 -- the base-meta construction (id/name/kind/connectionId/timestampColumn plus optional keyColumns per SourceKind) is duplicated nearly verbatim across both files.

    • Fix: Extract a buildSourceMeta(source) helper into a shared module and call it from both tool registrations.
    • ce-kieran-typescript-reviewer, ce-maintainability-reviewer
  • packages/api/src/mcp/tools/sources/describeSource.ts:62-72, packages/api/src/mcp/tools/sources/listSources.ts:47-55 -- structural narrowing via 'eventAttributesExpression' in source instead of switching on source.kind or using the existing isLogSource / isTraceSource type guards leaves the narrowing brittle to future variant additions.

    • Fix: Replace the in checks with a switch (source.kind) block or the typed guards from @hyperdx/common-utils.
    • ce-kieran-typescript-reviewer, ce-maintainability-reviewer
  • packages/api/src/mcp/tools/sources/describeSource.ts:53, packages/api/src/mcp/tools/sources/listSources.ts:39 -- meta is typed as Record<string, unknown>, which discards every structural guarantee of the response shape and silently accepts typoed keys like meta.keycolumns.

    • Fix: Define an explicit SourceMeta discriminated union on SourceKind and type the builder accordingly.
  • packages/api/src/mcp/tools/sources/describeSource.ts:374, packages/api/src/mcp/tools/sources/listSources.ts:34-35 -- teamId.toString() is called on a value already typed string in McpContext, suggesting uncertainty about the type at a security-sensitive multi-tenant boundary.

    • Fix: Drop the .toString() calls and rely on McpContext.teamId being typed correctly.
  • packages/api/src/mcp/__tests__/sources.test.ts:292-301 -- the nextSteps.query assertion only checks that the source ID appears, not that any of hyperdx_timeseries, hyperdx_table, or hyperdx_search are referenced, so a refactor that drops a tool name from the hint would pass silently.

    • Fix: Add expect(output.nextSteps.query).toMatch(/hyperdx_(timeseries|table|search)/).
  • packages/api/src/mcp/tools/sources/describeSource.ts:188 -- when getMapKeys returns [] for a column, meta.mapAttributeKeys still records the empty array, surfacing a noisy { SpanAttributes: [] } entry to the LLM.

    • Fix: Filter mapKeysResults to entries with non-empty arrays before assigning to meta.mapAttributeKeys.
  • packages/api/src/mcp/tools/sources/describeSource.ts:27 -- MAX_LC_VALUES = 20 reads as a global cap on values returned but is actually the per-column cap passed as maxValuesPerKey to getAllKeyValues.

    • Fix: Rename to MAX_LC_VALUES_PER_COLUMN and update the inline comment to match.
  • packages/api/src/mcp/tools/sources/describeSource.ts:35-322 -- describeSourceSchema is ~285 lines spanning four numbered "stage" sections with four slightly different signal.aborted skip-detection conditionals, making each stage hard to unit-test in isolation.

    • Fix: Split into fetchColumns / fetchMapKeys / sampleLowCardinalityValues / sampleMapAttributeValues helpers that each return { result, skipped }, then aggregate skippedStages centrally in the orchestrator.
    • ce-kieran-typescript-reviewer, ce-maintainability-reviewer
  • packages/api/src/mcp/tools/sources/describeSource.ts:24 -- DESCRIBE_TIMEOUT_MS is hard-coded at 10s, which is the one constant in this file an operator is plausibly going to want to tune for slow ClickHouse clusters.

    • Fix: Read from an env var such as HYPERDX_MCP_DESCRIBE_TIMEOUT_MS with 10_000 as the default.

Reviewers (7): ce-correctness-reviewer, ce-reliability-reviewer, ce-security-reviewer, ce-kieran-typescript-reviewer, ce-testing-reviewer, ce-adversarial-reviewer, ce-maintainability-reviewer.

Testing gaps:

  • Timeout path ("Schema discovery timed out.") is not exercised by any test.
  • partial: true / skippedStages flag is not asserted by any test.
  • lowCardinalityValues and mapAttributeValues are not verified against seeded data.
  • hyperdx_list_sources lacks an explicit cross-team isolation test (the empty-team case only checks zero results).
  • Malformed sourceId (non-ObjectId string) is not exercised — the existing "not found" test uses a valid 24-char hex ID.

@brandon-pereira
Copy link
Copy Markdown
Member Author

Will fix P0 ObjectId validation separately - several tools affected.

…nflict

getLoggedInAgent(server) with default MOCK_USER credentials returns
409 Conflict when called twice in the same test run because the user
already exists from beforeEach. Pass unique credentials for the second
team registration.
@brandon-pereira brandon-pereira requested review from a team and wrn14897 and removed request for a team May 19, 2026 21:59
Copy link
Copy Markdown
Member

@wrn14897 wrn14897 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

* Core schema-discovery logic. Extracted so the caller can wrap it in
* Promise.race for wall-clock timeout enforcement.
*/
async function describeSourceSchema(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit worried about the perf of this method. There are a lot of db calls, and I wonder if we really need all of them

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will explore reducing DB calls if the need arises - currently this helps reduce LLM hallucinations

@kodiakhq kodiakhq Bot merged commit d134212 into main May 21, 2026
19 of 33 checks passed
@kodiakhq kodiakhq Bot deleted the brandon/mcp-source-refactor branch May 21, 2026 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge review/tier-3 Standard — full human review required

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants