Before designing, understand what you are building observability for.

**Skip domain discovery if**: You already have deep context about the service from prior conversations or the user has provided detailed specifications.

**Gate**: Before designing the Business group, you must be able to name at least 3 domain outcomes specific to this service — in plain language a product manager would recognize. Examples: "order placed", "payment completed", "message delivered". If you cannot name them, ask the user before proceeding. Do not substitute transport-layer metrics (gRPC error rate, HTTP request rate) as placeholders — those are `P`, not `B`. See the B trap in [references/widgets.md](references/widgets.md).

---

## Design
```shell
pup metrics list --filter="trace.*" --tag-filter="service:<service-name>"
```

Use the actual metric names and tag values you find here when writing widget queries — do not guess or invent them. If a metric you expect does not appear, flag it to the user before building widgets around it.

**This applies to all query types**: metric queries, APM span filters (`operation_name`, `resource_name`, span tags), and log filters. The `**Configuration**` sections in [references/widgets.md](references/widgets.md) describe JSON structure and field constraints only — they are not prescriptive queries. Always verify the actual filter values with `pup` before using them.

### 3. Choose a framework

Match the dashboard purpose to a framework. Read [references/frameworks.md](references/frameworks.md) for detailed metric mappings and group structures.
Check:
- Do its widgets use the `B0-N:` prefix?
- Does it contain 5-8 metrics covering: customer-visible success rates, key transaction flows, and SLA-impacting latency?
- Can someone determine "are customers affected?" within 5 seconds of opening the dashboard?
- **B trap check**: For each B-prefixed widget, ask "Can a product manager interpret this without knowing the transport protocol?" If no — gRPC error rate, HTTP request rate, queue depth — it is `P`, not `B`, regardless of where it is placed. Flag and recommend moving to the appropriate platform group.

**Findings format**:

---

Datadog uses a 12-column grid.

| Anti-pattern | Problem | Fix |
|---|---|---|
| Dashboard title with environment/region | Forces duplication | Put context in template variables |
| Identical widgets with different filters | Redundant, hard to maintain | Use template variables + saved views |
| Y-axis auto-scaling with distant threshold | Normal traffic compressed into flat band | Set `yaxis.max` near threshold — see [thresholds.md](thresholds.md) |
| Domain-specific filters in platform groups | Platform group silently shows only one domain; new domains are invisible | Platform groups (Commanded, Oban, Broadway, etc.) must scope only by template variables (`$env`, `$service`). Hardcoded handler names, queue names, or domain names belong in domain groups, not platform groups. |
| Individual handler/worker widget when a global `by {dimension}` view exists | Redundant widget adds noise without adding signal | Before adding a widget scoped to one handler, queue, or worker in a platform group, check whether a global `by {handler_name}` / `by {queue}` timeseries already exists. If so, the global view already surfaces spikes and the specific widget adds nothing. Add a dedicated widget only when it carries its own alert threshold or SLO, and place it in the domain group, not the platform group. |
| Transport metrics (`P`) placed in Business group | Misleads readers into thinking protocol health = business health; obscures what domain outcomes actually are | gRPC error rate, HTTP request rate, and apdex are `P` regardless of placement. See [widgets.md](widgets.md) for the full B trap guide. |
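The platform-group scoping rule above is easiest to see side by side. A sketch in Datadog query syntax — the metric name, queue name, and template variables are illustrative, not taken from any particular service:

```
# Platform group: scope only by template variables.
# Every domain's queues appear automatically via the `by` grouping.
sum:oban.job.errors{$env,$service} by {queue}

# Domain group: hardcoding a domain-specific queue name is acceptable here,
# because the group is explicitly about that domain.
sum:oban.job.errors{$env,$service,queue:payments}
```

The first query keeps the platform group domain-agnostic: when a new domain adds a queue, it shows up without any dashboard edit. The second would silently hide every other domain if placed in a platform group.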

---

## List Stream (Trace / Log)

Live-updating list of trace spans or log entries. JSON widget type: `list_stream`.

**Use for**: Surfacing individual spans for investigation — failed jobs, slow commands, high-lag handler executions.

**Data sources**:
- `trace_stream` — APM trace spans
- `logs_stream` — Log entries

**`trace_stream` query schema constraints** — the `query` object for `trace_stream` accepts **only** these fields:
- `data_source` — must be `"trace_stream"`
- `indexes` — array, usually `[]`
- `query_string` — the filter expression

**Do NOT include** `sort`, `storage`, `compute`, or any other fields inside the `query` object for `trace_stream`. The Datadog API will return a 400 validation error. These fields are valid for `logs_stream` but not `trace_stream`.

**Sizing**: Minimum 6 columns, recommended 12 columns (full width).
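Putting the constraints together, a minimal `trace_stream` widget sketch — the title, columns, filter values, and layout height are illustrative, not prescriptive; only the `query` field set is constrained as described above:

```json
{
  "definition": {
    "type": "list_stream",
    "title": "D0: Recent failed handler executions",
    "requests": [
      {
        "response_format": "event_list",
        "columns": [
          { "field": "timestamp", "width": "auto" },
          { "field": "resource_name", "width": "auto" }
        ],
        "query": {
          "data_source": "trace_stream",
          "indexes": [],
          "query_string": "service:$service status:error"
        }
      }
    ]
  },
  "layout": { "x": 0, "y": 0, "width": 12, "height": 4 }
}
```

Note the `query` object carries only `data_source`, `indexes`, and `query_string` — adding `sort` or `storage` here would trigger the 400 validation error described above.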

---

## Change

Shows the change in a metric value over a time period.
When assigning prefixes, use the domain discovery context.

The priority number comes from the ops review order: what do you look at first when paged at 3am? That's `0`.

### The B trap: transport metrics are not business metrics

The most common misclassification is putting transport-layer health metrics in the Business group. **gRPC error rate, HTTP error rate, and request throughput are `P`, not `B`**, even when an existing dashboard places them in the Business group, and even when the service's only interface is gRPC or HTTP.

Ask: **"Can a product manager interpret this without knowing what gRPC or HTTP is?"** If no, it's `P`.

| Looks like B | Actually | Why |
|---|---|---|
| gRPC error rate | `P0` | Transport layer — how the code communicates, not what it does |
| HTTP request rate | `P0` | Transport layer |
| gRPC apdex | `P0` | Protocol health score, not a business outcome |
| Oban job error rate | `P0` | Platform job processing |
| Event handler lag | `D0` | Technical domain process health |
| Order completion rate | `B0` | Customer action — a PM can interpret this |
| Checkout success rate | `B0` | Business outcome — directly maps to customer value |
| Payment processed rate | `B1` | Business transaction throughput |

**Rule**: If you cannot complete the sentence "Customers are affected because ___" using only the metric name, it is not `B`.

### Platform metric catalog

When you encounter metrics in a codebase or Datadog, use this to classify them correctly. This is not a list of widgets to add — it is a guide for recognising what layer a metric belongs to. Only include metrics that actually exist and are relevant to the service being observed.

The signals listed under each component type are examples of what tends to exist, not requirements. Every service is different.

#### Inbound protocol servers (HTTP, gRPC, GraphQL, WebSocket, etc.)

Any server that accepts requests or connections — regardless of protocol. Request rate, error rate, latency, and apdex are always `P` for these. They describe how the transport layer is performing, not what the business is doing.

| Signal | Prefix |
|--------|--------|
| Request / connection rate | `P0` |
| Error rate | `P0` |
| Latency percentiles | `P0` |
| Apdex / health score | `P1` |
| Breakdown by endpoint / operation | `P1` |

#### Outbound clients (HTTP, gRPC, RPC, external APIs)

Calls the service makes to other services or third-party APIs.

| Signal | Prefix |
|--------|--------|
| Call rate | `P1` |
| Error / timeout rate | `P0` |
| Latency | `P0` |

#### Message queue consumers

Any process consuming from a queue or stream (regardless of broker).

| Signal | Prefix |
|--------|--------|
| Processing rate | `P0` |
| Error / dead-letter rate | `P0` |
| Processing latency | `P0` |
| Queue depth / consumer lag | `P0` |

#### Background job processors

Deferred or scheduled work — any job queue or scheduler.

| Signal | Prefix |
|--------|--------|
| Execution rate | `P0` |
| Error / retry rate | `P0` |
| Latency | `P1` |
| Queue depth | `P0` |

#### Database and cache clients

The application's access layer — not the database engine or cache server (those are `I`).

| Signal | Prefix |
|--------|--------|
| Query / operation latency | `P0` |
| Connection pool wait time | `P0` |
| Error rate | `P0` |
| Cache hit / miss rate | `P1` |

#### Event-driven patterns (CQRS, event sourcing, pub/sub)

Command dispatch, aggregate execution, event handlers — when a service uses an event-driven architecture.

| Signal | Prefix | Notes |
|--------|--------|-------|
| Command dispatch rate and error rate | `P0` | |
| Aggregate execution latency | `P0` | |
| Event handler throughput and lag | `P0` | Show globally by handler name, not per handler — see anti-patterns in [layouts.md](layouts.md) |
| Write conflicts / retries | `P1` | |

---

## General Naming Rules