Add /metrics-viz page for browsing Prometheus metrics #34968

Open
antiguru wants to merge 6 commits into MaterializeInc:main from antiguru:metrics_viz

Conversation

@antiguru
Member

@antiguru antiguru commented Feb 10, 2026

  • Adds a new /metrics-viz page that parses and visualizes Prometheus metrics from environmentd and clusterd endpoints
  • Endpoint discovery via catalog query (including per-process endpoints for multi-process replicas)
  • Search/filter with URL persistence
  • Collapsible prefix groups with label dimension toggles for rollup granularity
  • Counter/gauge table display with copy-to-clipboard
  • Per-label-combination D3 histogram charts
  • Poll mode with configurable interval showing rate/s and delta columns
  • Save/load metrics snapshots to/from files for offline analysis
  • Playwright E2E tests

🤖 Generated with Claude Code
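
To make the poll-mode "rate/s and delta columns" item above concrete, here is a minimal sketch of how two consecutive counter samples might be compared. The function name `computeDelta` and the `{ timestamp, value }` sample shape are assumptions for illustration and are not taken from the PR. Treating a decreasing counter as a restart from zero is a common convention, not something this PR confirms.

```javascript
// Hypothetical helper: compare one counter sample across two polls.
// `prev` and `curr` are assumed to look like { timestamp: <ms>, value: <number> }.
function computeDelta(prev, curr) {
  const elapsedSeconds = (curr.timestamp - prev.timestamp) / 1000;
  // Assumption: a decreasing counter means the process restarted, so the
  // whole current value counts as the delta.
  const delta = curr.value >= prev.value ? curr.value - prev.value : curr.value;
  // No rate can be computed when no time has passed between the polls.
  const rate = elapsedSeconds > 0 ? delta / elapsedSeconds : null;
  return { delta, rate };
}
```

Gauges can legitimately decrease, so a gauge column would presumably skip the restart branch and report the signed difference instead.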

@antiguru
Member Author

image

@antiguru antiguru marked this pull request as ready for review February 10, 2026 12:45
@antiguru antiguru requested a review from a team as a code owner February 10, 2026 12:45
Contributor

@SangJunBak SangJunBak left a comment

Reviewed the first commit

Comment on lines +214 to +217
.route(
"/metrics-viz",
routing::get(metrics_viz::handle_metrics_viz),
)
Contributor

I realize these interactive JS pages won't work if password authentication is enabled. This was already an issue, but it might be worth documenting somewhere.

Comment on lines 17 to 33
async function query(sql) {
  const response = await fetch('/api/sql', {
    method: 'POST',
    body: JSON.stringify({ query: sql }),
    headers: { 'Content-Type': 'application/json' },
  });
  if (!response.ok) {
    const text = await response.text();
    throw `request failed: ${response.status} ${response.statusText}: ${text}`;
  }
  const data = await response.json();
  for (const result of data.results) {
    if (result.error) {
      throw `SQL error: ${result.error.message}`;
    }
  }
  return data;
Contributor

I see this repeated in Memory.jsx and hierarchical-memory.jsx. We should probably extract it into a top-level fetch.js.
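
As a rough sketch of what such a shared helper could look like, the version below keeps the request logic from the hunk above. Two things are assumptions rather than the PR's code: the injectable `fetchImpl` parameter (added so the helper can be exercised without a server) and throwing `Error` objects instead of bare strings, which is a swapped-in choice to preserve stack traces. `firstSqlError` is a hypothetical name.

```javascript
// Hypothetical shared helper, e.g. for a top-level fetch.js.
// Returns the message of the first failed statement, or null if none failed.
function firstSqlError(data) {
  const failed = (data.results ?? []).find(result => result.error);
  return failed ? failed.error.message : null;
}

// `fetchImpl` defaults to the global fetch; tests can pass a stub instead.
async function query(sql, fetchImpl = fetch) {
  const response = await fetchImpl('/api/sql', {
    method: 'POST',
    body: JSON.stringify({ query: sql }),
    headers: { 'Content-Type': 'application/json' },
  });
  if (!response.ok) {
    const text = await response.text();
    throw new Error(`request failed: ${response.status} ${response.statusText}: ${text}`);
  }
  const data = await response.json();
  const message = firstSqlError(data);
  if (message !== null) throw new Error(`SQL error: ${message}`);
  return data;
}
```

Each page would then call `query(sql)` from the shared file instead of carrying its own copy.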

}

async function discoverEndpoints() {
const endpoints = [{ label: 'environmentd', url: '/metrics' }];
Contributor

Note that the /metrics endpoint can be on a different port than these pages! I think this is in the same vein as the password comment, i.e. this only works with the src/materialized/ci/listener_configs/no_auth.json listener config.

JOIN mz_catalog.mz_clusters c ON c.id = r.cluster_id
ORDER BY c.name, r.name
`);
const rows = data.results[0].rows;
Contributor

results can be empty. Maybe use data.results[0]?.rows ?? [] instead?
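
A minimal sketch of that guard, wrapped in a hypothetical helper named `firstRows` (the name is an assumption, not from the PR). Optional chaining short-circuits to `undefined` when any step is missing, and `??` then supplies the empty array.

```javascript
// Hypothetical helper: rows of the first statement, or [] when the response
// has no results (or no rows) at all.
function firstRows(data) {
  return data?.results?.[0]?.rows ?? [];
}
```

Callers can then iterate over the result unconditionally, since an empty response yields an empty list rather than a TypeError.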

);
}

function CollapsibleScalarTable({ family }) {
Contributor

Maybe we just group this and RawData into an Expandable component?

function Expandable({ children, type, numValues }) {
  const [expanded, setExpanded] = useState(false);
  return (
    <div>
      <button style={styles.rawToggle} onClick={() => setExpanded(!expanded)}>
        {expanded ? `hide ${type}` : `show ${type}`} ({numValues} values)
      </button>
      {expanded && children}
    </div>
  );
}
...
<Expandable type="table" numValues={family.series.length}>
 <ScalarTable family={family} />
</Expandable>

placeholder="Search metrics..."
value={search}
onChange={e => setSearch(e.target.value)}
className="search-box"
Contributor

It's odd that we're using a className here.

* @param {string} text - Raw Prometheus metrics text
* @returns {{ families: Map<string, MetricFamily>, groups: Map<string, string[]> }}
*/
function parsePrometheusText(text) {
Contributor

We should write some tests for each of these functions, if nothing else just for documentation.
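
One way tests can double as documentation is a small table of input/expected pairs. The sketch below is illustrative only: `parseHelp` is a hypothetical stand-in for the HELP-line branch of the parser (it returns null where the real code skips the line), and the metric names are made up.

```javascript
// Hypothetical stand-in for the HELP-line branch of the parser.
// Input is the text after "# HELP ", e.g. "mz_foo_total Number of foos."
function parseHelp(rest) {
  const spaceIdx = rest.indexOf(' ');
  if (spaceIdx === -1) return null; // a name with no help text
  return { name: rest.slice(0, spaceIdx), help: rest.slice(spaceIdx + 1) };
}

// Each case reads as documentation: input first, expected result second.
const helpCases = [
  ['mz_foo_total Number of foos.', { name: 'mz_foo_total', help: 'Number of foos.' }],
  ['mz_bar', null],
];

// Runs every case and throws on the first mismatch; returns the case count.
function runHelpCases() {
  for (const [input, expected] of helpCases) {
    const actual = parseHelp(input);
    if (JSON.stringify(actual) !== JSON.stringify(expected)) {
      throw new Error(`parseHelp(${JSON.stringify(input)}) returned ${JSON.stringify(actual)}`);
    }
  }
  return helpCases.length;
}
```

In a real suite the loop would presumably be replaced by the test runner's own assertion and case-naming facilities.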

if (spaceIdx === -1) continue;
const name = rest.slice(0, spaceIdx);
const help = rest.slice(spaceIdx + 1);
ensureFamily(families, name).help = help;
Contributor

It's odd that we mutate inside this function and then mutate the result again right after. Maybe rename "ensureFamily" to "updateFamily" and take help and type as optional parameters.
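
A sketch of what that signature could look like. The family shape (`name`, `help`, `type`, `samples`) and the `'untyped'` default are assumptions for illustration; the PR's actual fields may differ.

```javascript
// Hypothetical updateFamily: create the family on first sight, then apply
// whichever of help/type the caller supplied. Omitted fields are left alone.
function updateFamily(families, name, { help, type } = {}) {
  let family = families.get(name);
  if (!family) {
    family = { name, help: '', type: 'untyped', samples: [] };
    families.set(name, family);
  }
  if (help !== undefined) family.help = help;
  if (type !== undefined) family.type = type;
  return family;
}
```

With this shape the HELP branch becomes a single call, `updateFamily(families, name, { help })`, so all mutation happens in one place.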

}
}

function buildScalarSeries(family) {
Contributor

nit: Maybe make this function pure (return the series instead of mutating the family) and do the mutation in the caller?

Comment on lines 208 to 233
  const labels = {};
  if (!str) return labels;
  let i = 0;
  while (i < str.length) {
    while (i < str.length && (str[i] === ',' || str[i] === ' ')) i++;
    if (i >= str.length) break;
    const eqIdx = str.indexOf('=', i);
    if (eqIdx === -1) break;
    const key = str.slice(i, eqIdx);
    if (str[eqIdx + 1] !== '"') break;
    let j = eqIdx + 2;
    let value = '';
    while (j < str.length) {
      if (str[j] === '\\' && j + 1 < str.length) {
        value += str[j + 1];
        j += 2;
      } else if (str[j] === '"') {
        break;
      } else {
        value += str[j];
        j++;
      }
    }
    labels[key] = value;
    i = j + 1;
  }
Contributor

Let's replace this code with a regex. Something like:

const re = /(\w+)="([^"]+)"/g;
const str = 'context="declare",le="0.001"';
[...str.matchAll(re)].map(m => [m[1], m[2]]); 
Screenshot 2026-02-10 at 12 41 16 PM
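
One caveat with a pattern like `[^"]+` is that it skips empty label values and stops at an escaped quote, both of which the hand-written loop handled. The sketch below is one possible regex-based variant that covers those cases. It is an illustration, not the implementation that landed in the PR, and the `\n` handling follows the Prometheus text-format escaping rules rather than the original loop, which kept the literal character after a backslash.

```javascript
// Label name, then a quoted value made of non-quote/non-backslash characters
// or backslash escapes. The value may be empty.
const LABEL_RE = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;

function parseLabels(str) {
  const labels = {};
  for (const match of (str || '').matchAll(LABEL_RE)) {
    // Undo the escapes: \n becomes a newline, \" and \\ drop the backslash.
    labels[match[1]] = match[2].replace(/\\(.)/g, (_, c) => (c === 'n' ? '\n' : c));
  }
  return labels;
}
```

`matchAll` works on a copy of the regex, so reusing the module-level `LABEL_RE` across calls does not leak `lastIndex` state.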

Contributor

@SangJunBak SangJunBak left a comment

Overall code looks correct! Would like to see my comments addressed, but none of them need to block this from merging.

Some general things I noticed that would help in maintainability but aren't blocking:

  • It's hard to look at a function and predict what its input is going to be. Static typing via JSDoc annotations could help with this, even if only for the functions. If Claude can one-shot it, it might be worth doing.
  • Many of the "aggregation" / math functions are essentially countBy, sumBy, flatMap, and groupBy operations. They would be much more readable if implemented in a more declarative way.

lodash has an implementation of each of these helpers, but they shouldn't be hard to implement ourselves either.
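
To illustrate the declarative style suggested here, a minimal sketch of two such helpers and one rollup built on them. The sample shape (`{ labels, value }`) and the `sumByLabel` name are assumptions; unlike lodash's `groupBy`, this version returns a `Map` rather than a plain object.

```javascript
// Group items into a Map keyed by keyFn(item), preserving first-seen order.
function groupBy(items, keyFn) {
  const groups = new Map();
  for (const item of items) {
    const key = keyFn(item);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(item);
  }
  return groups;
}

// Sum valueFn(item) over all items; an empty list sums to 0.
function sumBy(items, valueFn) {
  return items.reduce((total, item) => total + valueFn(item), 0);
}

// Example rollup: total value per distinct value of one label.
function sumByLabel(samples, label) {
  const totals = new Map();
  for (const [key, group] of groupBy(samples, s => s.labels[label])) {
    totals.set(key, sumBy(group, s => s.value));
  }
  return totals;
}
```

The rollup then reads as "group by label, sum each group", without any manual Map bookkeeping at the call site.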


// --- Copy to clipboard ---

function CopyButton({ getText, style: extraStyle }) {
Contributor

I notice extraStyle isn't being used.

}
const group = groups.get(key);
for (const b of series.deCumulatedBuckets) {
const leKey = b.le === Infinity ? '+Inf' : String(b.le);
Contributor

I see a couple of places where we convert between '+Inf' and Infinity. It might be simpler to use String(b.le) even when b.le is Infinity, and show it as +Inf only where we actually render it in the HTML-like components.
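
A sketch of how that split could look, with the conversion confined to the parsing and display edges. The three function names are hypothetical. One detail worth noting: `Number('+Inf')` is `NaN` in JavaScript, so the Prometheus spelling needs a special case on the way in.

```javascript
// Parse the `le` label of a histogram bucket into a number.
// Number('+Inf') is NaN, so the Prometheus spelling is special-cased.
function parseLe(text) {
  return text === '+Inf' ? Infinity : Number(text);
}

// Internal map key: plain String(), which yields 'Infinity' for the last bucket.
function leKey(le) {
  return String(le);
}

// Display-only formatting, applied where the value is rendered.
function formatLe(le) {
  return le === Infinity ? '+Inf' : String(le);
}
```

Aggregation code can then use `leKey` everywhere, and only the rendering components need to know about the `+Inf` spelling.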

Comment on lines 665 to 672
const [data, setData] = useState(null);
const [loading, setLoading] = useState(true);
const [error, setError] = useState(null);
const [search, setSearch] = useState(params.get('search') || '');
const [polling, setPolling] = useState(false);
const [pollInterval, setPollInterval] = useState(5000);
const snapshotsRef = useRef([]); // [{ families, timestamp }, ...]
const [snapshotVersion, setSnapshotVersion] = useState(0); // trigger re-render on snapshot update
Contributor

I think we can simplify the state here by combining data, snapshotVersion, and snapshotsRef into one state variable. Then we can derive each of the respective values from this one state variable.
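
One possible shape for that single state variable is an array of snapshots, with everything else derived from it. The helpers below are a sketch under that assumption: the names and the default of keeping two snapshots are hypothetical, and the functions are kept pure so they can be used with a React state setter.

```javascript
// Hypothetical pure helpers around one `snapshots` array of
// { families, timestamp } entries, oldest first.
// `maxSnapshots` is assumed to be at least 1.
function appendSnapshot(snapshots, families, timestamp, maxSnapshots = 2) {
  // Return a new array so a React state setter sees a changed reference.
  return [...snapshots, { families, timestamp }].slice(-maxSnapshots);
}

// The most recent snapshot, or null before the first fetch completes.
function latestSnapshot(snapshots) {
  return snapshots.length > 0 ? snapshots[snapshots.length - 1] : null;
}

// The snapshot before the latest one, or null if there is no baseline yet.
function previousSnapshot(snapshots) {
  return snapshots.length > 1 ? snapshots[snapshots.length - 2] : null;
}
```

With `const [snapshots, setSnapshots] = useState([])`, a poll tick would call `setSnapshots(s => appendSnapshot(s, families, Date.now()))`. The old `data` value becomes `latestSnapshot(snapshots)?.families`, and no separate version counter is needed to trigger a re-render.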

antiguru and others added 5 commits February 16, 2026 14:57
Add an interactive /metrics-viz page that fetches and parses Prometheus
exposition format from environmentd and cluster replica endpoints.

Features:
- Prometheus text format parser with histogram de-cumulation
- Grouped metric display with search filtering
- Label dimension toggles for rollup granularity
- Copy-to-clipboard for tables, histograms, and raw data
- URL query param persistence for endpoint and search state
- Polling mode with rate/delta computation across snapshots
- Save/load metrics snapshots to/from files
- Auto-discovery of per-process replica endpoints

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
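
The "histogram de-cumulation" mentioned in the commit message above can be pictured with a short sketch. Prometheus histogram buckets are cumulative (each `le` bucket counts all observations at or below its bound), so per-bucket counts are the differences between neighbours. The `{ le, count }` shape and the `deCumulate` name are assumptions for illustration, not the PR's code.

```javascript
// Convert cumulative Prometheus buckets into per-bucket counts.
// `buckets` is assumed to be [{ le: <number>, count: <cumulative count> }, ...],
// with the +Inf bucket represented as le === Infinity.
function deCumulate(buckets) {
  // Sort a copy by upper bound so the input is not mutated.
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  let previous = 0;
  return sorted.map(({ le, count }) => {
    const own = count - previous; // observations that fell in this bucket only
    previous = count;
    return { le, count: own };
  });
}
```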

- Use port 6878 (internal HTTP) instead of 6876 for test base URL
- Fix dropdown option visibility check to use count assertion
- Add more robust waiting for async endpoint discovery

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…dback

- Extract shared query() into fetch.js, removing duplicates from
  metrics.jsx, memory.jsx, and hierarchical-memory.jsx
- Create generic Expandable component replacing CollapsibleScalarTable
  and RawData
- Replace className selectors with aria-label for test targeting
- Rename ensureFamily to updateFamily with optional help/type params
- Make buildScalarSeries return value instead of mutating
- Replace manual parseLabels with regex-based implementation
- Add unit tests for prometheus.js parsing functions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Remove unused `extraStyle` param from CopyButton
- Simplify Infinity handling in aggregateHistogramSeries
- Consolidate snapshot state (data, snapshotsRef, snapshotVersion -> snapshots)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add JSDoc @typedef for Labels, Sample, MetricFamily, HistogramSeries,
  ScalarSeries, and delta-enriched variants
- Add @param/@returns annotations to all non-component functions
- Introduce groupBy, keyBy, pickLabels utilities to replace manual
  Map-building patterns in aggregation functions
- Refactor buildHistogramSeries/buildScalarSeries to take explicit
  parameters and return values instead of mutating the family object

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Runs `tsc --checkJs --noEmit` in CI (before Playwright tests) to validate
JSDoc type annotations in prometheus.js, metrics.jsx, and fetch.js. Uses
@types/react@16, @types/react-dom@16, and @types/d3@5 matching the vendored
runtime versions. Fixes minor type issues caught by the new check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@antiguru antiguru requested a review from a team as a code owner February 16, 2026 14:38
Contributor

@def- def- left a comment

Test/lint changes LGTM; just one note about helping devs get the lint green locally.

Comment on lines +20 to +21
try npm install --prefix test/dataflow-visualizer --silent
try npx --prefix test/dataflow-visualizer tsc -p test/dataflow-visualizer/tsconfig.typecheck.json
Contributor

We could check that npm exists and, if not, print a note on how to install it.

3 participants