console: worker tree map in the console for cluster/ objects page #36645

Draft
leedqin wants to merge 2 commits into MaterializeInc:main from leedqin:console-worker-tree-map

@leedqin leedqin commented May 20, 2026

Showing Cluster CPU worker skew

[Two screenshots]

By operator in the object details

[Screenshot]

leedqin and others added 2 commits May 20, 2026 16:33
Surfaces an interactive heatmap on the cluster Overview page that
answers "where is CPU going?" and "is it skewed across workers?",
mirroring steps 1 and 3 of the dataflow-troubleshooting docs
ladder.

A "Where is CPU going?" button on the Resource Usage section opens
a side drawer with per-replica tabs. Each tab renders one row per
dataflow on the cluster and one cell per worker, colored by
elapsed_ns normalized within the row. Horizontal patterns reveal
object-level skew (often a bad GROUP BY key); vertical patterns and
the cluster-wide footer row reveal worker-level issues (such as a
noisy neighbor). A skew badge (max/min) and a tooltip showing
ratio-to-average match the docs' canonical skew metric (>2 threshold).
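
A rough sketch of the per-row math described above, not the code in this PR. The names `summarizeRow`, `RowSkew`, and `SKEW_THRESHOLD` are hypothetical, and applying the >2 threshold to the ratio-to-average is my reading of the description rather than something it states outright.

```typescript
// Hypothetical helper: one row = elapsed_ns per worker for a single dataflow.
interface RowSkew {
  intensities: number[];     // cell color input in [0, 1], normalized within the row
  ratiosToAverage: number[]; // tooltip value: this worker's elapsed_ns / row average
  maxOverMin: number;        // badge value; Infinity when one worker did no work
  skewed: boolean;           // assumption: any ratio-to-average above the threshold
}

// Assumed to correspond to the ">2 threshold" mentioned in the description.
const SKEW_THRESHOLD = 2;

function summarizeRow(elapsedNs: number[]): RowSkew {
  if (elapsedNs.length === 0) {
    return { intensities: [], ratiosToAverage: [], maxOverMin: 1, skewed: false };
  }
  const max = Math.max.apply(null, elapsedNs);
  const min = Math.min.apply(null, elapsedNs);
  const avg = elapsedNs.reduce((a, b) => a + b, 0) / elapsedNs.length;

  // Normalizing by the row maximum keeps a quiet dataflow readable next to a hot one.
  const intensities = elapsedNs.map((v) => (max > 0 ? v / max : 0));
  const ratiosToAverage = elapsedNs.map((v) => (avg > 0 ? v / avg : 0));
  const maxOverMin = max === 0 ? 1 : min === 0 ? Infinity : max / min;
  const worstRatio = Math.max.apply(null, ratiosToAverage);

  return { intensities, ratiosToAverage, maxOverMin, skewed: worstRatio > SKEW_THRESHOLD };
}
```

For a row of `[100, 100, 100, 500]` the average is 200, so the last worker sits at 2.5x the average and the row would be flagged, with a max/min badge of 5.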

Driven by mz_introspection.mz_scheduling_elapsed_per_worker joined
to mz_dataflow_operator_dataflows and mz_compute_exports so each
row links back to its maintained-object detail page. The button
is hidden for system clusters and clusters with replication_factor=0.
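
To make the data flow concrete, here is a minimal sketch of how flat query results could be pivoted into one row per dataflow plus a cluster-wide footer, along with the visibility rule for the button. The record shape, field names, and both function names are assumptions for illustration; the PR's actual query and types may differ.

```typescript
// Hypothetical shape of one joined result record (field names are assumptions).
interface ElapsedRecord {
  dataflowId: number;
  dataflowName: string;
  exportId: string; // used to link back to the maintained-object detail page
  workerId: number;
  elapsedNs: number;
}

interface DataflowRow {
  dataflowId: number;
  label: string;
  exportId: string;
  cells: number[]; // indexed by worker id
}

function zeros(n: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < n; i++) out.push(0);
  return out;
}

// Groups records by dataflow and sums elapsed_ns per worker into a footer row.
function pivotByDataflow(
  records: ElapsedRecord[],
  workerCount: number
): { rows: DataflowRow[]; footer: number[] } {
  const byId: Record<string, DataflowRow> = {};
  const rows: DataflowRow[] = [];
  const footer = zeros(workerCount);
  for (const rec of records) {
    if (rec.workerId < 0 || rec.workerId >= workerCount) continue; // ignore stray ids
    let row = byId[String(rec.dataflowId)];
    if (!row) {
      row = {
        dataflowId: rec.dataflowId,
        label: rec.dataflowName,
        exportId: rec.exportId,
        cells: zeros(workerCount),
      };
      byId[String(rec.dataflowId)] = row;
      rows.push(row);
    }
    row.cells[rec.workerId] += rec.elapsedNs;
    footer[rec.workerId] += rec.elapsedNs;
  }
  return { rows, footer };
}

// Mirrors the stated rule: no button for system clusters or replication_factor=0.
function showCpuButton(cluster: { isSystem: boolean; replicationFactor: number }): boolean {
  return !cluster.isSystem && cluster.replicationFactor > 0;
}
```

This sketch assumes at most one record per dataflow and worker; if a dataflow had several exports, the join would need deduplicating before summing.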

The heat gradient is theme-aware via useColorModeValue: pale slate
base in light mode, dark slate base in dark mode, sharing a warm
mid-stop and red high-stop. Colorblind-safe (amber to red).
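
The three-stop gradient could be sketched roughly as below. The RGB values are placeholders I picked, not the PR's palette, and `heatColor` is a hypothetical name. In a component, the base color would presumably come from Chakra's `useColorModeValue(light, dark)`; here the mode is passed in so the sketch stays free of React.

```typescript
type Rgb = [number, number, number];
type ColorMode = "light" | "dark";

// Placeholder palette (assumed values): a mode-specific base, shared mid and high stops.
const BASE: Record<ColorMode, Rgb> = {
  light: [241, 245, 249], // pale slate
  dark: [30, 41, 59],     // dark slate
};
const MID: Rgb = [245, 158, 11]; // amber
const HIGH: Rgb = [220, 38, 38]; // red

// Linear interpolation between two colors, channel by channel.
function mix(from: Rgb, to: Rgb, t: number): Rgb {
  return [
    Math.round(from[0] + (to[0] - from[0]) * t),
    Math.round(from[1] + (to[1] - from[1]) * t),
    Math.round(from[2] + (to[2] - from[2]) * t),
  ];
}

// Maps a row-normalized intensity in [0, 1] to a CSS color string.
function heatColor(intensity: number, mode: ColorMode): string {
  const t = isNaN(intensity) ? 0 : Math.min(1, Math.max(0, intensity));
  const rgb = t < 0.5 ? mix(BASE[mode], MID, t / 0.5) : mix(MID, HIGH, (t - 0.5) / 0.5);
  return "rgb(" + rgb[0] + ", " + rgb[1] + ", " + rgb[2] + ")";
}
```

Only the low end differs between modes, so cells at or above the mid-stop look the same in both themes.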

The WorkerSkewHeatmap component is built generic on a HeatmapRow
contract so the upcoming object-detail Performance tab can reuse
it with operator-scoped rows.
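
One way a row contract like this could look, as a sketch only. The fields on `HeatmapRow`, the `prepareRows` helper, and the hottest-first ordering are all assumptions; the PR's real interface is not shown in this description.

```typescript
// Hypothetical minimal contract a heatmap component could require of any row type.
interface HeatmapRow {
  key: string;     // stable identity for rendering
  label: string;   // text shown at the start of the row
  cells: number[]; // elapsed_ns per worker
  href?: string;   // link target; rows without one render as non-clickable
}

interface PreparedRow<R extends HeatmapRow> {
  row: R; // the caller's original row, with its extra fields intact
  clickable: boolean;
  totalNs: number;
}

// Generic over R, so dataflow rows and operator rows share one code path.
// Sorting hottest-first is a choice made for this sketch.
function prepareRows<R extends HeatmapRow>(rows: R[]): PreparedRow<R>[] {
  return rows
    .map((row) => ({
      row,
      clickable: row.href !== undefined,
      totalNs: row.cells.reduce((a, b) => a + b, 0),
    }))
    .sort((a, b) => b.totalNs - a.totalNs);
}

// Two illustrative row types that satisfy the contract.
interface DataflowHeatmapRow extends HeatmapRow {
  exportId: string;
}
interface OperatorHeatmapRow extends HeatmapRow {
  operatorId: number;
}
```

Because the helper only depends on the contract, a caller gets its own row type back and can still read fields such as `exportId` or `operatorId` from the result.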

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Completes the dataflow-troubleshooting funnel on the maintained-object
detail drawer. Where the cluster heatmap answers "which dataflow on
this cluster is hot or skewed?", this surface answers the natural
next question: "within this object's dataflow, which operator is
the bottleneck, and on which workers?"

Adds a Performance tab to ObjectDetailPanel (next to Definition and
Freshness). The tab contains a replica selector that defaults to the
first ready replica and shows the same per-row-normalized heatmap as
the cluster drawer, with rows scoped to operators inside this object's
dataflow. Rows are joined via mz_compute_exports.export_id = object
GlobalId, and structural operators (BuildRegion, InputRegion,
LogOperatorHydration, etc.) are filtered out so the surface only shows
operators a user can reason about.
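
A small sketch of the two selection rules in this paragraph. The exclusion list contains only the three operator names given above (the full list is elided with "etc." and is not guessed here), exact-name matching is an assumption, and the function and type names are hypothetical.

```typescript
// Only the names listed in the commit message; the real list is longer.
const STRUCTURAL_OPERATORS = ["BuildRegion", "InputRegion", "LogOperatorHydration"];

// Hypothetical shape of one operator-level record.
interface OperatorRecord {
  operatorId: number;
  name: string;
  workerId: number;
  elapsedNs: number;
}

// Assumption: structural operators are matched by exact name.
function isStructural(name: string): boolean {
  return STRUCTURAL_OPERATORS.indexOf(name) !== -1;
}

// Drops structural operators so only user-meaningful ones remain.
function userFacingOperators(records: OperatorRecord[]): OperatorRecord[] {
  return records.filter((rec) => !isStructural(rec.name));
}

interface ReplicaInfo {
  name: string;
  ready: boolean;
}

// Picks the first ready replica. Returning undefined when none is ready is an
// assumption; the description does not say what the selector does in that case.
function defaultReplica(replicas: ReplicaInfo[]): ReplicaInfo | undefined {
  for (const replica of replicas) {
    if (replica.ready) return replica;
  }
  return undefined;
}
```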

The WorkerSkewHeatmap component landed as a generic component on a
HeatmapRow contract in the previous commit; this commit defines an
OperatorRow that satisfies it and marks rows as non-clickable
(operators have no further drill target).

Empty states cover (a) objects not bound to a cluster (tables) and
(b) clusters with no replicas. Uses the Console Alert wrapper to
match existing conventions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>