plugins/trogonstack-datadog/skills/datadog-design-dashboard/SKILL.md
5 additions & 0 deletions
@@ -71,6 +71,8 @@ Before designing, understand what you are building observability for. The metric
**Skip domain discovery if**: You already have deep context about the service from prior conversations, or the user has provided detailed specifications.

+**Gate**: Before designing the Business group, you must be able to name at least 3 domain outcomes specific to this service, in plain language a product manager would recognize. Examples: "order placed", "payment completed", "message delivered". If you cannot name them, ask the user before proceeding. Do not substitute transport-layer metrics (gRPC error rate, HTTP request rate) as placeholders; those are `P`, not `B`. See the B trap in [references/widgets.md](references/widgets.md).
+
---

## Design
@@ -100,6 +102,8 @@ pup metrics list --filter="trace.*" --tag-filter="service:<service-name>" --agen
Use the actual metric names and tag values you find here when writing widget queries; do not guess or invent them. If a metric you expect does not appear, flag it to the user before building widgets around it.

+**This applies to all query types**: metric queries, APM span filters (`operation_name`, `resource_name`, span tags), and log filters. The `**Configuration**` sections in [references/widgets.md](references/widgets.md) describe JSON structure and field constraints only; they are not prescriptive queries. Always verify the actual filter values with `pup` before using them.
+
### 3. Choose a framework

Match the dashboard purpose to a framework. Read [references/frameworks.md](references/frameworks.md) for detailed metric mappings and group structures.
@@ -213,6 +217,7 @@ Check:
- Do its widgets use the `B0-N:` prefix?
- Does it contain 5-8 metrics covering: customer-visible success rates, key transaction flows, and SLA-impacting latency?
- Can someone determine "are customers affected?" within 5 seconds of opening the dashboard?
+- **B trap check**: For each B-prefixed widget, ask "Can a product manager interpret this without knowing the transport protocol?" If not (for example gRPC error rate, HTTP request rate, or queue depth), it is `P`, not `B`, regardless of where it is placed. Flag it and recommend moving it to the appropriate platform group.
plugins/trogonstack-datadog/skills/datadog-design-dashboard/references/layouts.md
3 additions & 0 deletions
@@ -337,3 +337,6 @@ Datadog uses a 12-column grid.
| Dashboard title with environment/region | Forces duplication | Put context in template variables |
| Identical widgets with different filters | Redundant, hard to maintain | Use template variables + saved views |
| Y-axis auto-scaling with distant threshold | Normal traffic compressed into flat band | Set `yaxis.max` near threshold; see [thresholds.md](thresholds.md) |
+| Domain-specific filters in platform groups | Platform group silently shows only one domain; new domains are invisible | Platform groups (Commanded, Oban, Broadway, etc.) must scope only by template variables (`$env`, `$service`). Hardcoded handler names, queue names, or domain names belong in domain groups, not platform groups. |
+| Individual handler/worker widget when global `by {dimension}` view exists | Redundant widget adds noise without adding signal | Before adding a widget scoped to a specific handler, queue, or worker in a platform group, check: does a global `by {handler_name}` / `by {queue}` timeseries already exist? If yes, the specific widget adds nothing, because the global view already surfaces it when it spikes. Only add specific widgets when they have their own alert threshold or SLO that justifies the dedicated callout, and place them in the domain group, not the platform group. |
+| Transport metrics (`P`) placed in Business group | Misleads readers into thinking protocol health = business health; obscures what domain outcomes actually are | gRPC error rate, HTTP request rate, and apdex are `P` regardless of placement. See [widgets.md](widgets.md) for the full B trap guide. |
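The platform-group rule in the table above (scope only by template variables) could be checked mechanically before a dashboard is submitted. The following is a minimal sketch, not part of the skill itself: it assumes metric queries use a comma-separated `{...}` filter scope and that template variables appear as bare `$name` terms. The helper `hardcoded_filters` and the metric name `oban.job.count` are hypothetical.

```python
import re

# Matches the first {...} block in a metric query, which is assumed to be
# the filter scope (a later {...} after "by" is the grouping, not a filter).
SCOPE = re.compile(r"\{([^}]*)\}")


def hardcoded_filters(query: str) -> list[str]:
    """Return filter terms in the query scope that are not template variables."""
    match = SCOPE.search(query)
    if match is None:
        return []
    terms = [term.strip() for term in match.group(1).split(",")]
    # "*" means "no filter"; terms starting with "$" are template variables.
    return [t for t in terms if t and t != "*" and not t.startswith("$")]


# Hypothetical platform-group queries, for illustration only.
ok_query = "sum:oban.job.count{$env,$service} by {queue}"
bad_query = "sum:oban.job.count{$env,$service,queue:emails} by {queue}"

print(hardcoded_filters(ok_query))
print(hardcoded_filters(bad_query))
```

Any term the helper returns for a platform-group widget is a candidate for moving to a domain group. Queries that use boolean filter syntax instead of commas would need a different parser.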
**`trace_stream` query schema constraints** — the `query` object for `trace_stream` accepts **only** these fields:
+- `data_source`: must be `"trace_stream"`
+- `indexes`: array, usually `[]`
+- `query_string`: the filter expression
+
+**Do NOT include** `sort`, `storage`, `compute`, or any other fields inside the `query` object for `trace_stream`. If they are present, the Datadog API returns a 400 validation error. These fields are valid for `logs_stream` but not `trace_stream`.
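The field constraints above can be expressed as a small pre-flight check. This is an illustrative sketch only: the function name `check_trace_stream_query` and the sample `query_string` value are assumptions, and the check covers just the three allowed fields named in the text, not the full Datadog widget schema.

```python
# The only fields the text above allows inside a trace_stream query object.
ALLOWED_TRACE_STREAM_FIELDS = {"data_source", "indexes", "query_string"}


def check_trace_stream_query(query: dict) -> list[str]:
    """Return a list of problems found in a trace_stream query object."""
    problems = []
    if query.get("data_source") != "trace_stream":
        problems.append('data_source must be "trace_stream"')
    for field in sorted(ALLOWED_TRACE_STREAM_FIELDS - set(query)):
        problems.append(f"missing field: {field}")
    for field in sorted(set(query) - ALLOWED_TRACE_STREAM_FIELDS):
        problems.append(f"unexpected field: {field}")
    return problems


# Hypothetical query objects, for illustration only.
good = {"data_source": "trace_stream", "indexes": [], "query_string": "service:checkout"}
bad = {**good, "sort": {"column": "timestamp", "order": "desc"}}

print(check_trace_stream_query(good))
print(check_trace_stream_query(bad))
```

Running a check like this before submitting the dashboard JSON surfaces a stray `sort`, `storage`, or `compute` field locally instead of as an API validation error.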
Shows the change in a metric value over a time period.
@@ -284,6 +305,97 @@ When assigning prefixes, use the domain discovery context:
The priority number comes from the ops review order: what do you look at first when paged at 3am? That's `0`.

+### The B trap: transport metrics are not business metrics
+
+The most common misclassification is putting transport-layer health metrics in the Business group. **gRPC error rate, HTTP error rate, and request throughput are `P`, not `B`**, even when they appear in the Business group and even when the service's only interface is gRPC or HTTP.
+
+Ask: **"Can a product manager interpret this without knowing what gRPC or HTTP is?"** If no, it's `P`.
+
+| Looks like B | Actually | Why |
+|---|---|---|
+| gRPC error rate | `P0` | Transport layer: how the code communicates, not what it does |
+| HTTP request rate | `P0` | Transport layer |
+| gRPC apdex | `P0` | Protocol health score, not a business outcome |
+| Event handler lag | `D0` | Technical domain process health |
+| Order completion rate | `B0` | Customer action that a PM can interpret |
+| Checkout success rate | `B0` | Business outcome that maps directly to customer value |
+| Payment processed rate | `B1` | Business transaction throughput |
+
+**Rule**: If you cannot complete the sentence "Customers are affected because ___" using only the metric name, it is not `B`.
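As a rough first pass before the manual review, B-prefixed widget titles that mention a transport term can be flagged automatically. The sketch below is a keyword heuristic and does not replace the product-manager question. The keyword list is drawn from the examples in the table, and the `B<priority>-<n>:` title format and the function name `b_trap_suspects` are assumptions.

```python
import re

# Hypothetical keyword list, taken from the transport-layer examples above.
TRANSPORT_TERMS = ("grpc", "http", "apdex", "request rate", "queue depth")

# Assumed title format: "B0-1: Some title".
B_PREFIX = re.compile(r"^B\d+-\d+:")


def b_trap_suspects(widget_titles: list[str]) -> list[str]:
    """Return B-prefixed titles that mention a transport-layer term."""
    suspects = []
    for title in widget_titles:
        if not B_PREFIX.match(title):
            continue  # only Business-group widgets are checked
        lowered = title.lower()
        if any(term in lowered for term in TRANSPORT_TERMS):
            suspects.append(title)
    return suspects


# Hypothetical widget titles, for illustration only.
titles = [
    "B0-1: Order completion rate",
    "B0-2: gRPC error rate",
    "P0-1: HTTP request rate",
    "B1-1: Payment processed rate",
]

print(b_trap_suspects(titles))
```

A flagged title is a prompt to re-ask the question, not a verdict: a widget can pass the keyword check and still fail the "Customers are affected because ___" test.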
+
+### Platform metric catalog
+
+When you encounter metrics in a codebase or Datadog, use this to classify them correctly. This is not a list of widgets to add; it is a guide for recognising what layer a metric belongs to. Only include metrics that actually exist and are relevant to the service being observed.
+
+The signals listed under each component type are examples of what tends to exist, not requirements. Every service is different.
Any server that accepts requests or connections — regardless of protocol. Request rate, error rate, latency, and apdex are always `P` for these. They describe how the transport layer is performing, not what the business is doing.