156 changes: 156 additions & 0 deletions nagios/README.md
@@ -105,6 +105,160 @@ The check watches the Nagios events log for log lines containing these strings,

The Nagios check does not include any service checks.

## Trigger on-call pages

Configure Nagios notification commands to call the [Datadog On-Call Paging API][11] directly, bypassing the Agent. The notification script described in this section creates a page on `PROBLEM` notifications and automatically resolves it on `RECOVERY`.

### How on-call pages work

- `CRITICAL`, `DOWN`, or `WARNING` notifications create a page targeting the configured On-Call team.
- `RECOVERY` notifications resolve the corresponding page.
- `UNKNOWN` notifications are ignored, as is any other state the script does not match (for example, a host `UNREACHABLE` state).
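
The routing rules above can be sketched as a small shell function. `page_action` is a hypothetical name used only for illustration; the notification script in the Setup section applies the same branching inline rather than through a helper like this.

```bash
# page_action is a hypothetical helper, not part of the documented script.
# It prints "create", "resolve", or "ignore" for a notification type and state.
page_action() {
  local notif_type="$1" state="$2"
  if [ "$notif_type" = "RECOVERY" ] || [ "$state" = "OK" ] || [ "$state" = "UP" ]; then
    echo "resolve"
  else
    case "$state" in
      CRITICAL|DOWN|WARNING) echo "create" ;;
      *) echo "ignore" ;;   # UNKNOWN and any unmatched state
    esac
  fi
}
```

For example, `page_action PROBLEM CRITICAL` prints `create`, and `page_action PROBLEM UNKNOWN` prints `ignore`.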

### Setup

#### Create the notification script

Create `/usr/local/nagios/libexec/notify_datadog_oncall.sh`:

```bash
#!/bin/bash
set -u

DD_API_KEY="<YOUR_DATADOG_API_KEY>"
DD_APP_KEY="<YOUR_DATADOG_APP_KEY>"
DD_SITE="datadoghq.com" # Change to your Datadog site

NOTIF_TYPE="${1}" # PROBLEM or RECOVERY
HOSTNAME="${2}"
SERVICEDESC="${3}"
STATE="${4}" # CRITICAL, WARNING, OK, UNKNOWN, UP, DOWN
ONCALL_TEAM="${5}" # Datadog On-Call team handle, for example, "ops"
OUTPUT="${6}"

# Map DD_SITE to On-Call API endpoint
case "$DD_SITE" in
  datadoghq.com)     ONCALL_URL="https://navy.oncall.datadoghq.com" ;;
  datadoghq.eu)      ONCALL_URL="https://beige.oncall.datadoghq.eu" ;;
  us3.datadoghq.com) ONCALL_URL="https://teal.oncall.datadoghq.com" ;;
  us5.datadoghq.com) ONCALL_URL="https://coral.oncall.datadoghq.com" ;;
  ap1.datadoghq.com) ONCALL_URL="https://saffron.oncall.datadoghq.com" ;;
  ap2.datadoghq.com) ONCALL_URL="https://lava.oncall.datadoghq.com" ;;
  ddog-gov.com)      ONCALL_URL="https://navy.oncall.datadoghq.com" ;;
  *) echo "Unknown DD_SITE: $DD_SITE" >&2; exit 1 ;;
esac

STATE_DIR="/var/tmp/nagios_dd_oncall"
mkdir -p "$STATE_DIR"
PAGE_FILE="${STATE_DIR}/${HOSTNAME}-${SERVICEDESC}"

# Escape special characters for safe JSON embedding
OUTPUT=$(printf '%s' "$OUTPUT" | sed 's/\\/\\\\/g; s/"/\\"/g')

if [ "$NOTIF_TYPE" = "RECOVERY" ] || [ "$STATE" = "OK" ] || [ "$STATE" = "UP" ]; then
  # Resolve existing page
  if [ -f "$PAGE_FILE" ]; then
    PAGE_ID=$(cat "$PAGE_FILE")
    curl -s -m 15 -X POST \
      "${ONCALL_URL}/api/v2/on-call/pages/${PAGE_ID}/resolve" \
      -H "DD-API-KEY: ${DD_API_KEY}" \
      -H "DD-APPLICATION-KEY: ${DD_APP_KEY}"
    rm -f "$PAGE_FILE"
  fi
elif [ "$STATE" = "CRITICAL" ] || [ "$STATE" = "DOWN" ] || [ "$STATE" = "WARNING" ]; then
  # Create page
  RESPONSE=$(curl -s -m 15 -X POST \
    "${ONCALL_URL}/api/v2/on-call/pages" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
    -H "Content-Type: application/json" \
    -d "{
      \"data\": {
        \"type\": \"pages\",
        \"attributes\": {
          \"title\": \"Nagios: ${HOSTNAME} / ${SERVICEDESC} is ${STATE}\",
          \"description\": \"${OUTPUT}\",
          \"urgency\": \"high\",
          \"tags\": [\"integration:nagios\", \"service:${SERVICEDESC}\", \"host:${HOSTNAME}\"],
          \"target\": {
            \"identifier\": \"${ONCALL_TEAM}\",
            \"type\": \"team_handle\"
          }
        }
      }
    }")

  # Save page ID for later resolution
  PAGE_ID=$(printf '%s' "$RESPONSE" | sed -n 's/.*"id":"\([^"]*\)".*/\1/p')
  if [ -n "$PAGE_ID" ]; then
    printf '%s' "$PAGE_ID" > "$PAGE_FILE"
  fi
fi
```
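
The `sed` expression in the script escapes only backslashes and double quotes, so plugin output that contains newlines or tabs could still produce invalid JSON. A minimal sketch of a broader escape, assuming Bash parameter expansion is available; `json_escape` is a hypothetical helper name and is not part of the script above:

```bash
# json_escape is a hypothetical helper: it escapes backslashes, double quotes,
# newlines, carriage returns, and tabs so the value can sit inside a JSON string.
# Other control characters are not handled in this sketch.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}      # backslash -> \\  (must run first)
  s=${s//\"/\\\"}      # double quote -> \"
  s=${s//$'\n'/\\n}    # newline -> \n
  s=${s//$'\r'/\\r}    # carriage return -> \r
  s=${s//$'\t'/\\t}    # tab -> \t
  printf '%s' "$s"
}
```

With a helper like this, the script could call `OUTPUT=$(json_escape "$OUTPUT")`, and apply the same escape to the host name and service description before they are embedded in the request body.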

Make the script executable:

```shell
sudo chmod 755 /usr/local/nagios/libexec/notify_datadog_oncall.sh
```

#### Define the Nagios commands

Add the following to `commands.cfg`. Use separate commands for service and host notifications so the correct Nagios macros are passed:

```nagios
define command {
    command_name    notify-datadog-oncall-service
    command_line    /usr/local/nagios/libexec/notify_datadog_oncall.sh "$NOTIFICATIONTYPE$" "$HOSTALIAS$" "$SERVICEDESC$" "$SERVICESTATE$" "$_CONTACTONCALL_TEAM$" "$SERVICEOUTPUT$"
}

define command {
    command_name    notify-datadog-oncall-host
    command_line    /usr/local/nagios/libexec/notify_datadog_oncall.sh "$NOTIFICATIONTYPE$" "$HOSTALIAS$" "Host" "$HOSTSTATE$" "$_CONTACTONCALL_TEAM$" "$HOSTOUTPUT$"
}
```

#### Create contacts with the On-Call team handle

The custom variable `_oncall_team` sets the Datadog On-Call team handle per contact. Nagios exposes it to the notification commands as the `$_CONTACTONCALL_TEAM$` macro. Add contacts to `contacts.cfg`:

```nagios
define contact {
    contact_name                     datadog-ops
    alias                            Ops Team On-Call
    service_notification_period      24x7
    host_notification_period         24x7
    service_notification_options     w,u,c,r
    host_notification_options        d,u,r
    service_notification_commands    notify-datadog-oncall-service
    host_notification_commands       notify-datadog-oncall-host
    _oncall_team                     ops
}
```

The `_oncall_team` value (for example, `ops`) must match the team handle configured in [Datadog On-Call][12].
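
The command definitions reference the contact's custom variable through a macro whose name Nagios derives from the variable name: the leading underscore is dropped, the rest is uppercased, and `_CONTACT` is prefixed. The sketch below only illustrates that naming rule; `contact_macro` is a hypothetical helper and not something Nagios provides.

```bash
# contact_macro is a hypothetical helper that shows how a contact custom
# variable name maps to the macro used in command_line definitions.
contact_macro() {
  local var="${1#_}"                                      # drop the leading underscore
  var=$(printf '%s' "$var" | tr '[:lower:]' '[:upper:]')  # uppercase the remainder
  printf '$_CONTACT%s$\n' "$var"
}

contact_macro _oncall_team   # prints $_CONTACTONCALL_TEAM$
```

If you rename the custom variable, the macro in both command definitions has to change to match.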

#### Assign the contact to services or hosts

```nagios
define service {
    use                     generic-service
    host_name               webserver-01
    service_description     HTTP_Service
    check_command           check_http
    contacts                datadog-ops
    notification_options    w,u,c,r
}
```

#### Reload Nagios

```shell
sudo systemctl reload nagios
```
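
It can help to validate the configuration before reloading, so that a typo in the new command or contact definitions does not interrupt monitoring. The sketch below assumes the Nagios binary is at `/usr/local/nagios/bin/nagios` and the main configuration file is at `/usr/local/nagios/etc/nagios.cfg`; both paths, the variable names, and the `safe_reload` function are assumptions, so adjust them to your installation.

```bash
# Assumed paths and names; override them to match your installation.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
RELOAD_CMD="${RELOAD_CMD:-sudo systemctl reload nagios}"

# safe_reload is a hypothetical helper: it runs the Nagios pre-flight check
# (nagios -v <config>) and reloads only when that check succeeds.
safe_reload() {
  if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null; then
    $RELOAD_CMD   # intentionally unquoted so the command and its arguments split
  else
    echo "Nagios configuration check failed; not reloading" >&2
    return 1
  fi
}
```

Run `safe_reload` after editing `commands.cfg` and `contacts.cfg`. If it reports a failure, run the `-v` check by hand to see the full list of configuration errors.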

Verify that pages appear under **On-Call > Pages** in Datadog.

## Troubleshooting

Need help? Contact [Datadog support][9].
@@ -123,3 +277,5 @@ Need help? Contact [Datadog support][9].
[8]: https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information
[9]: https://docs.datadoghq.com/help/
[10]: https://www.datadoghq.com/blog/nagios-monitoring
[11]: https://docs.datadoghq.com/api/latest/on-call-paging/
[12]: https://docs.datadoghq.com/service_management/on-call/
2 changes: 2 additions & 0 deletions prometheus/README.md
@@ -122,6 +122,8 @@ Send Prometheus Alertmanager alerts in the event stream. Natively, Alertmanager

<div class="alert alert-tip">
Setting <code>send_resolved: true</code> (the default value) enables Alertmanager to send notifications when alerts are resolved in Prometheus. This is particularly important when using the <code>oncall_team</code> parameter to ensure that pages are marked as resolved. Note that resolved notifications may be delayed until the next <code>group_interval</code>.
<br><br>
Datadog deduplicates and auto-resolves On-Call pages using the event <code>aggregation_key</code>. Firing alerts (<code>alert_type: error</code> or <code>warning</code>) with the same <code>aggregation_key</code> are grouped into a single page; a resolved alert (<code>alert_type: success</code>) with the same key automatically closes it.
</div>

3. Restart the Prometheus and Alertmanager services.