feat(mcp): add hyperdx_describe_source tool and slim list_sources to catalog by brandon-pereira · Pull Request #2309 · hyperdxio/hyperdx

brandon-pereira · 2026-05-19T20:06:13Z

Problem

The #1 MCP error class (39% of clickstack failures, 9 of 23 errors) is unknown column or identifier — the LLM invents column names like TimestampTime, service.name, count, Timestamp, rows_read, memory_usage that don't exist in the ClickHouse schema.

Root causes:

hyperdx_list_sources returned full column schemas for ALL sources in one response (50KB+), which overwhelms LLM context
No way to discover actual column values — the LLM guesses SeverityText:error when the real data uses ERROR or Error
Prompt guidance says "don't assume values" but provides no tool to discover them

Solution

New tool: `hyperdx_describe_source`

Per-source schema discovery that returns:

Column schema (name, ClickHouse type, JS type) from DESCRIBE TABLE
Map attribute keys (e.g. SpanAttributes, ResourceAttributes) — max 50 per map column
Low-cardinality value samples for LowCardinality(String) columns like SeverityText, StatusCode, ServiceName — uses existing getAllKeyValues() rollup tables for performance
Map attribute value samples for top-10 keys per map column — also via rollup tables
10-second hard timeout with structured AbortError handling
Team-scoped access via getSource(teamId, sourceId)

Slimmed `hyperdx_list_sources`

Now a lightweight MongoDB-only catalog:

Returns: id, name, kind, connectionId, timestampColumn, keyColumns
No ClickHouse queries — instant response
Includes nextStep hint pointing to describe_source

Two-step workflow

hyperdx_list_sources          →  pick sourceId(s)
hyperdx_describe_source       →  get columns, attribute keys, sample values
hyperdx_timeseries / table / search / ...  →  query with real column names

Changes

File	Change
`tools/sources/describeSource.ts`	NEW — `hyperdx_describe_source` tool
`tools/sources/index.ts`	NEW — registers both source tools
`tools/sources/listSources.ts`	MOVED from dashboards/, slimmed to catalog
`tools/dashboards/listSources.ts`	DELETED — moved to sources/
`tools/dashboards/index.ts`	Removed source tool registrations
`mcpServer.ts`	Registers new `sourcesTools` module
`tools/query/{search,table,timeseries,sql,eventPatterns}.ts`	Tool descriptions reference `describe_source`
`prompts/dashboards/content.ts`	Workflow, SeverityText guidance, common mistakes updated
`__tests__/sources.test.ts`	NEW — 11 tests (catalog, schema, LC values, metrics, cross-team)
`__tests__/dashboards.test.ts`	Removed source tests (moved to sources.test.ts)
`MCP.md`	Documents two-step workflow and new tool

No changes to common-utils — reuses existing getAllKeyValues() and getMapValues() methods.

Changeset

Patch release — this is a new tool addition and behavioral change to list_sources, but the MCP server is still in beta (version: ${CODE_VERSION}-beta), so a patch bump is appropriate.

Testing

Ensure agents call the new tool locally, and it queries the correct SeverityText (not hallucinate)

make ci-lint — passes (0 errors)
make ci-unit — 957 tests pass across 19 suites

…catalog Addresses the #1 MCP error class (39% of failures): LLMs guessing column names that don't exist (TimestampTime, service.name, count, etc.). New tool: hyperdx_describe_source - Full column schema (name, CH type, JS type) from DESCRIBE TABLE - Map attribute keys (SpanAttributes, ResourceAttributes, etc.) - Sampled low-cardinality values for columns like SeverityText, StatusCode, ServiceName — uses existing rollup tables via getAllKeyValues for performance, falls back to getMapValues - Sampled map attribute values for top-10 keys per map column - 10-second hard timeout with structured AbortError handling - Team-scoped access control via getSource(teamId, sourceId) Slimmed hyperdx_list_sources to a lightweight catalog: - Returns only id/name/kind/connectionId/keyColumns from MongoDB - No ClickHouse queries — instant response - Includes nextStep hint pointing to describe_source Moved source tools to dedicated module: - New packages/api/src/mcp/tools/sources/ directory - Registered as sourcesTools in mcpServer.ts - Dedicated test file: sources.test.ts (11 tests) Updated all tool descriptions and prompts: - All 5 split query tools reference describe_source - content.ts workflow updated to 5-step flow - SeverityText/StatusCode guidance references lowCardinalityValues - MCP.md documents the two-step discovery workflow

changeset-bot · 2026-05-19T20:06:18Z

🦋 Changeset detected

Latest commit: 51d8327

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages

Name	Type
@hyperdx/api	Patch
@hyperdx/app	Patch
@hyperdx/otel-collector	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

vercel · 2026-05-19T20:06:20Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
hyperdx-oss	Ready	Preview, Comment	May 21, 2026 8:35pm

github-actions · 2026-05-19T20:06:44Z

🟡 Tier 3 — Standard

Introduces new logic, modifies core functionality, or touches areas with non-trivial risk.

Why this tier:

Diff size: 757 production lines changed (Tier 2 max: < 250)

Review process: Full human review — logic, architecture, edge cases.
SLA: First-pass feedback within 1 business day.

Stats

Production files changed: 12
Production lines changed: 757 (+ 400 in test files, excluded from tier calculation)
Branch: brandon/mcp-source-refactor
Author: brandon-pereira

To override this classification, remove the review/tier-3 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

github-actions · 2026-05-19T20:07:56Z

E2E Test Results

✅ All tests passed • 177 passed • 3 skipped • 1253s

Status	Count
✅ Passed	177
❌ Failed	0
⚠️ Flaky	5
⏭️ Skipped	3

Tests ran across 4 shards in parallel.

View full report →

github-actions · 2026-05-19T20:08:24Z

PR Review

Solid refactor with good test coverage (cross-team access, metric-source edge case, partial-result handling). A few items worth addressing:

⚠️ Leaked timer in describeSource.ts:1220 → setTimeout is never cleared on the success path. After a fast response, the timer still fires ~10s later, calls controller.abort() (harmless) and rejects (silently dropped by Promise.race), but keeps the event loop alive. Store the id and clearTimeout(id) in a finally.
⚠️ getColumns() / getMapKeys() ignore the abort signal (describeSource.ts:999, 1019) → the wall-clock timeout only bounds the response, not the work. A slow DESCRIBE TABLE keeps running after the tool already returned an error. Pass signal through (or accept this as best-effort and add a note).
⚠️ Partial-result flag is under-reported (describeSource.ts:1037, 1090) → skippedStages is only pushed when results are completely empty. If the abort fires mid-flight after one column succeeded, partial stays false even though sampling was truncated. Track whether each stage started but didn't finish, not whether it produced zero rows.
💡 lcColumns regex strips whitespace then substring-matches String (describeSource.ts:1045-1051) → also matches LowCardinality(FixedString(N)), Array(LowCardinality(String)), etc. Probably fine, but worth a comment if intentional, or tighten to ^LowCardinality\((Nullable\()?String/.
💡 20 parallel getAllKeyValues calls per describe (LC columns + 10 keys × N map columns) → acceptable since they hit the rollup cache, but worth confirming there's no per-connection concurrency cliff on cold-cache calls.

No security issues (team scoping via getSource(teamId, sourceId) is correct; map keys flow through parameterized queries in getAllKeyValues).

…escribeSource

github-actions · 2026-05-19T20:24:03Z

Deep Review

✅ No critical issues found.

🟡 P2 -- recommended

packages/api/src/mcp/tools/sources/describeSource.ts:147,167,219,256 -- the new 10s Promise.race timeout cannot actually cancel server-side ClickHouse work because getColumns, getMapKeys, and the getAllKeyValues fallback to getMapValues in packages/common-utils/src/core/metadata.ts:262,343,1424 don't accept or honor AbortSignal, so each timed-out describe leaves multiple in-flight queries running for up to CH's max_execution_time: 15 per stage.
- Fix: Add signal?: AbortSignal to getColumns, getMapKeys, and getMapValues, thread it through the getAllKeyValues fallback path, and forward controller.signal from every metadata call site in describeSource.
- _{ce-correctness-reviewer, ce-reliability-reviewer, ce-adversarial-reviewer}
packages/api/src/mcp/tools/sources/describeSource.ts:231-233,270-272 -- both LC-value and map-value sampling catch every error with an empty catch, so a 503 / permissions error / rollup-table failure silently produces a response with no lowCardinalityValues, no partial: true flag, and no skippedStages entry, defeating the tool's stated contract that "these are the REAL values in your data".
- Fix: Log the error via logger.warn and push the stage name into skippedStages so meta.partial reflects sampling failures, not just signal.aborted.
- _{ce-correctness-reviewer, ce-reliability-reviewer, ce-kieran-typescript-reviewer, ce-maintainability-reviewer}
packages/api/src/mcp/tools/sources/describeSource.ts:40 -- a syntactically invalid sourceId such as "abc" passes Zod's z.string() but causes Source.findOne to throw a Mongoose CastError that bypasses the friendly "Source ... not found" response at line 47 and surfaces as an unhandled internal error to the MCP client.
- Fix: Validate the input with mongoose.Types.ObjectId.isValid(sourceId) (or catch CastError around getSource) and return the same isError "not found" response on invalid IDs.
- _{ce-correctness-reviewer, ce-security-reviewer, ce-testing-reviewer}
packages/api/src/mcp/tools/sources/describeSource.ts:362-403 -- the headline DESCRIBE_TIMEOUT mechanism and the partial / skippedStages contract have zero direct test coverage, so a regression that drops the timeout error message or stops setting partial would ship undetected.
- Fix: Inject DESCRIBE_TIMEOUT_MS (or mock a metadata method to return a never-resolving promise) in tests and assert both the "Schema discovery timed out." text and output.source.partial === true with the expected skippedStages entries.
- _{ce-testing-reviewer, ce-kieran-typescript-reviewer}
packages/api/src/mcp/__tests__/sources.test.ts:174-340 -- no test seeds ClickHouse data and asserts lowCardinalityValues or mapAttributeValues contain actual sampled values, so the PR's central "discover real values so the LLM doesn't guess" claim is unverified and would silently regress.
- Fix: Seed distinct SeverityText / ServiceName rows (and a ResourceAttributes map) via bulkInsertLogs, then assert output.source.lowCardinalityValues.SeverityText contains the seeded values and output.source.mapAttributeValues["ResourceAttributes['service.name']"] is non-empty.
- _{ce-testing-reviewer}
packages/api/src/mcp/tools/sources/describeSource.ts:364-369 -- the setTimeout handle is never cleared on the success path, so the timer still fires controller.abort() and rejects an internal promise up to 10s after the response is sent, retaining the meta blob and AbortController closure for that whole window.
- Fix: Capture the handle (const timer = setTimeout(...)) and clearTimeout(timer) in a finally block around Promise.race.
- _{ce-correctness-reviewer, ce-reliability-reviewer, ce-kieran-typescript-reviewer, ce-maintainability-reviewer, ce-adversarial-reviewer}

🔵 P3 nitpicks (10)

packages/api/src/mcp/__tests__/sources.test.ts:101-170 -- hyperdx_list_sources lost its cross-team isolation test in the move from dashboards.test.ts; the empty-team test does not prove team scoping.
- Fix: Add a test that creates a second team with sources, calls list_sources as team A, and asserts team B's IDs are absent from output.sources and output.connections.
- _{ce-correctness-reviewer, ce-security-reviewer, ce-testing-reviewer}
packages/api/src/mcp/tools/sources/describeSource.ts:53-94, packages/api/src/mcp/tools/sources/listSources.ts:38-79 -- the base-meta construction (id/name/kind/connectionId/timestampColumn plus optional keyColumns per SourceKind) is duplicated nearly verbatim across both files.
- Fix: Extract a buildSourceMeta(source) helper into a shared module and call it from both tool registrations.
- _{ce-kieran-typescript-reviewer, ce-maintainability-reviewer}
packages/api/src/mcp/tools/sources/describeSource.ts:62-72, packages/api/src/mcp/tools/sources/listSources.ts:47-55 -- structural narrowing via 'eventAttributesExpression' in source instead of switching on source.kind or using the existing isLogSource / isTraceSource type guards leaves the narrowing brittle to future variant additions.
- Fix: Replace the in checks with a switch (source.kind) block or the typed guards from @hyperdx/common-utils.
- _{ce-kieran-typescript-reviewer, ce-maintainability-reviewer}
packages/api/src/mcp/tools/sources/describeSource.ts:53, packages/api/src/mcp/tools/sources/listSources.ts:39 -- meta is typed as Record<string, unknown>, which discards every structural guarantee of the response shape and silently accepts typoed keys like meta.keycolumns.
- Fix: Define an explicit SourceMeta discriminated union on SourceKind and type the builder accordingly.
packages/api/src/mcp/tools/sources/describeSource.ts:374, packages/api/src/mcp/tools/sources/listSources.ts:34-35 -- teamId.toString() is called on a value already typed string in McpContext, suggesting uncertainty about the type at a security-sensitive multi-tenant boundary.
- Fix: Drop the .toString() calls and rely on McpContext.teamId being typed correctly.
packages/api/src/mcp/__tests__/sources.test.ts:292-301 -- the nextSteps.query assertion only checks that the source ID appears, not that any of hyperdx_timeseries, hyperdx_table, or hyperdx_search are referenced, so a refactor that drops a tool name from the hint would pass silently.
- Fix: Add expect(output.nextSteps.query).toMatch(/hyperdx_(timeseries|table|search)/).
packages/api/src/mcp/tools/sources/describeSource.ts:188 -- when getMapKeys returns [] for a column, meta.mapAttributeKeys still records the empty array, surfacing a noisy { SpanAttributes: [] } entry to the LLM.
- Fix: Filter mapKeysResults to entries with non-empty arrays before assigning to meta.mapAttributeKeys.
packages/api/src/mcp/tools/sources/describeSource.ts:27 -- MAX_LC_VALUES = 20 reads as a global cap on values returned but is actually the per-column cap passed as maxValuesPerKey to getAllKeyValues.
- Fix: Rename to MAX_LC_VALUES_PER_COLUMN and update the inline comment to match.
packages/api/src/mcp/tools/sources/describeSource.ts:35-322 -- describeSourceSchema is ~285 lines spanning four numbered "stage" sections with four slightly different signal.aborted skip-detection conditionals, making each stage hard to unit-test in isolation.
- Fix: Split into fetchColumns / fetchMapKeys / sampleLowCardinalityValues / sampleMapAttributeValues helpers that each return { result, skipped }, then aggregate skippedStages centrally in the orchestrator.
- _{ce-kieran-typescript-reviewer, ce-maintainability-reviewer}
packages/api/src/mcp/tools/sources/describeSource.ts:24 -- DESCRIBE_TIMEOUT_MS is hard-coded at 10s, which is the one constant in this file an operator is plausibly going to want to tune for slow ClickHouse clusters.
- Fix: Read from an env var such as HYPERDX_MCP_DESCRIBE_TIMEOUT_MS with 10_000 as the default.

Reviewers (7): ce-correctness-reviewer, ce-reliability-reviewer, ce-security-reviewer, ce-kieran-typescript-reviewer, ce-testing-reviewer, ce-adversarial-reviewer, ce-maintainability-reviewer.

Testing gaps:

Timeout path ("Schema discovery timed out.") is not exercised by any test.
partial: true / skippedStages flag is not asserted by any test.
lowCardinalityValues and mapAttributeValues are not verified against seeded data.
hyperdx_list_sources lacks an explicit cross-team isolation test (the empty-team case only checks zero results).
Malformed sourceId (non-ObjectId string) is not exercised — the existing "not found" test uses a valid 24-char hex ID.

brandon-pereira · 2026-05-19T21:28:01Z

Will fix P0 ObjectId validation separately - several tools affected.

…nflict getLoggedInAgent(server) with default MOCK_USER credentials returns 409 Conflict when called twice in the same test run because the user already exists from beforeEach. Pass unique credentials for the second team registration.

…factor # Conflicts: # packages/api/src/mcp/prompts/dashboards/content.ts

wrn14897

lgtm

wrn14897 · 2026-05-21T18:38:05Z

+ * Core schema-discovery logic. Extracted so the caller can wrap it in
+ * Promise.race for wall-clock timeout enforcement.
+ */
+async function describeSourceSchema(


I'm a bit worried about the perf of this method. There are a lot of db calls, and I wonder if we really need all of them

Will explore reducing DB calls if the need arises - currently this helps reduce LLM hallucinations

github-actions Bot added the review/tier-3 Standard — full human review required label May 19, 2026

chore: add patch changeset for describe_source tool

f517435

fix(mcp): replace hyperdx_query references with split tool names in d…

6916914

…escribeSource

vercel Bot deployed to Preview May 19, 2026 20:14 View deployment

pr feedback

16e1684

vercel Bot deployed to Preview May 19, 2026 21:06 View deployment

vercel Bot deployed to Preview May 19, 2026 21:50 View deployment

fix failing test

99f3db6

brandon-pereira requested review from a team and wrn14897 and removed request for a team May 19, 2026 21:59

vercel Bot deployed to Preview May 19, 2026 22:02 View deployment

fix more failing tests

5aac94a

vercel Bot deployed to Preview May 19, 2026 22:49 View deployment

brandon-pereira added 2 commits May 20, 2026 16:39

Merge remote-tracking branch 'origin/main' into brandon/mcp-source-re…

07cdee5

…factor # Conflicts: # packages/api/src/mcp/prompts/dashboards/content.ts

fix: adapt describeSource to batched getAllKeyValues API

60021f3

vercel Bot deployed to Preview May 20, 2026 22:48 View deployment

fix: remove em-dashes from dashboard prompt to pass prose-lint test

46706b5

vercel Bot deployed to Preview May 20, 2026 23:00 View deployment

wrn14897 approved these changes May 21, 2026

View reviewed changes

wrn14897 reviewed May 21, 2026

View reviewed changes

brandon-pereira added the automerge label May 21, 2026

Merge branch 'main' into brandon/mcp-source-refactor

51d8327

vercel Bot deployed to Preview May 21, 2026 20:35 View deployment

kodiakhq Bot merged commit d134212 into main May 21, 2026
19 of 33 checks passed

kodiakhq Bot deleted the brandon/mcp-source-refactor branch May 21, 2026 20:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mcp): add hyperdx_describe_source tool and slim list_sources to catalog#2309

feat(mcp): add hyperdx_describe_source tool and slim list_sources to catalog#2309
kodiakhq[bot] merged 11 commits into
mainfrom
brandon/mcp-source-refactor

brandon-pereira commented May 19, 2026 •

edited

Loading

Uh oh!

changeset-bot Bot commented May 19, 2026 •

edited

Loading

Uh oh!

vercel Bot commented May 19, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 19, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 19, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 19, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 19, 2026 •

edited

Loading

Uh oh!

brandon-pereira commented May 19, 2026

Uh oh!

wrn14897 left a comment

Uh oh!

wrn14897 May 21, 2026

Uh oh!

brandon-pereira May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

brandon-pereira commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

New tool: hyperdx_describe_source

Slimmed hyperdx_list_sources

Two-step workflow

Changes

Changeset

Testing

Uh oh!

changeset-bot Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

vercel Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🟡 Tier 3 — Standard

Uh oh!

github-actions Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Test Results

Uh oh!

github-actions Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review

Uh oh!

github-actions Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deep Review

🟡 P2 -- recommended

Uh oh!

brandon-pereira commented May 19, 2026

Uh oh!

wrn14897 left a comment

Choose a reason for hiding this comment

Uh oh!

wrn14897 May 21, 2026

Choose a reason for hiding this comment

Uh oh!

brandon-pereira May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

brandon-pereira commented May 19, 2026 •

edited

Loading

New tool: `hyperdx_describe_source`

Slimmed `hyperdx_list_sources`

changeset-bot Bot commented May 19, 2026 •

edited

Loading

vercel Bot commented May 19, 2026 •

edited

Loading

github-actions Bot commented May 19, 2026 •

edited

Loading

github-actions Bot commented May 19, 2026 •

edited

Loading

github-actions Bot commented May 19, 2026 •

edited

Loading

github-actions Bot commented May 19, 2026 •

edited

Loading