Operability problem
When a connectivity alert fires at 3 AM, the on-call SRE jumps to the logs and finds messages like:
2026/05/16 03:14:22 [host-a][eth0][...][10.0.0.5 > 10.0.0.1 >> 93.184.216.34] example.com: Failed: dial tcp 93.184.216.34:443: connect: connection refused
2026/05/16 03:14:22 example.com: Failed HTTP GET: Get "https://example.com/health": context deadline exceeded
The literal log copy is "Failed" or "Failed HTTP GET". To distinguish between the failure modes the operator has to:
- Recognize whether the trailing Go error string represents a DNS error, a TCP-level connect: connection refused (RST), an i/o timeout (network drop), an x509: error (TLS), an http: error (HTTP), etc.
- Cross-reference the source file to know which check stage produced the message.
Examples where the current copy is ambiguous:
- Dial failure (dialer.go:20) logs "Failed": the operator cannot tell from the copy alone whether this was the TCP dial, the UDP send, or something else. The "what failed" is buried in the surrounding context.
- HTTPS failure (https.go:13) logs "Failed HTTP GET" for any error from http.Get, including DNS failure on a redirect, TLS handshake failure (expired cert, name mismatch), connection reset mid-stream, or HTTP-protocol error. The operator cannot distinguish "the cert expired" from "the server RST'd us" without parsing the underlying error string.
- pinger.go:21 logs "Failed to ping <ip>" for both setup errors (insufficient privileges, IP family mismatch) and run errors. Each has a very different remediation.
Concrete operator confusion
"I see Failed: dial tcp 1.2.3.4:443: i/o timeout. Is this a firewall drop (silent timeout) or did the server crash and stop accepting (which would have RST'd)? The copy just says 'Failed'."
"I see Failed HTTP GET: Get "https://api/": x509: certificate has expired or is not yet valid. The dial succeeded but the message starts with 'Failed HTTP GET', not 'TLS handshake failed'. I almost paged the API team for a 500 — it was a cert renewal issue."
Suggested fix
Replace bare "Failed" with stage-specific copy that names the OSI layer / step that failed:
- dialer.go:20: "TCP dial failed" / "UDP send failed" (use dest.Protocol)
- https.go:13: "HTTP request failed", and on TLS errors specifically tag "TLS handshake failed" (the x509: and tls: prefixes in err.Error() are reliable enough to classify)
- pinger.go:14 vs pinger.go:21: keep them distinct but prefix with "ICMP setup failed" vs "ICMP probe failed" (already partly done, but the run-error case should also surface whether 0 packets were received vs an actual error).
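A minimal sketch of how the stage-specific copy above could be chosen, under stated assumptions. The function names (dialFailureCopy, httpFailureCopy, icmpFailureCopy), their arguments, the sampleErrors values, and the "DNS lookup failed", "TCP connection refused", and "HTTP request timed out" labels are hypothetical and not taken from the repository. It also classifies with errors.As / errors.Is on standard-library error types instead of matching prefixes in err.Error(); that is a substitution for illustration, not what the list above proposes, and it only recognizes certificate-verification errors as TLS failures.

```go
package main

import (
	"context"
	"crypto/x509"
	"errors"
	"fmt"
	"net"
	"net/url"
	"os"
	"syscall"
)

// dialFailureCopy picks copy for a failed dial. The string argument stands in
// for dest.Protocol; its real type and values are an assumption.
func dialFailureCopy(protocol string) string {
	switch protocol {
	case "tcp":
		return "TCP dial failed"
	case "udp":
		return "UDP send failed"
	default:
		return "Dial failed"
	}
}

// httpFailureCopy names the stage that most likely failed for an error
// returned by http.Get. Labels other than "TLS handshake failed" and
// "HTTP request failed" are illustrative additions.
func httpFailureCopy(err error) string {
	var (
		invalid   x509.CertificateInvalidError
		hostname  x509.HostnameError
		unknownCA x509.UnknownAuthorityError
		dnsErr    *net.DNSError
		netErr    net.Error
	)
	switch {
	case errors.As(err, &invalid), errors.As(err, &hostname), errors.As(err, &unknownCA):
		return "TLS handshake failed"
	case errors.As(err, &dnsErr):
		return "DNS lookup failed"
	case errors.Is(err, syscall.ECONNREFUSED):
		return "TCP connection refused"
	case errors.Is(err, context.DeadlineExceeded), errors.As(err, &netErr) && netErr.Timeout():
		return "HTTP request timed out"
	default:
		return "HTTP request failed"
	}
}

// icmpFailureCopy separates setup errors, run errors, and silent loss.
// The three arguments are hypothetical; the real pinger may expose them differently.
func icmpFailureCopy(setupErr, runErr error, received int) string {
	switch {
	case setupErr != nil:
		return "ICMP setup failed"
	case runErr != nil:
		return "ICMP probe failed"
	case received == 0:
		return "ICMP probe failed: 0 packets received"
	default:
		return ""
	}
}

// sampleErrors are hand-built stand-ins for what http.Get might return.
var sampleErrors = []error{
	&url.Error{Op: "Get", URL: "https://example.com/health",
		Err: x509.CertificateInvalidError{Reason: x509.Expired}},
	&url.Error{Op: "Get", URL: "https://example.com/health",
		Err: &net.OpError{Op: "dial", Net: "tcp",
			Err: os.NewSyscallError("connect", syscall.ECONNREFUSED)}},
	&url.Error{Op: "Get", URL: "https://example.com/health",
		Err: context.DeadlineExceeded},
	&url.Error{Op: "Get", URL: "https://example.com/health",
		Err: &net.DNSError{Err: "no such host", Name: "example.com", IsNotFound: true}},
	errors.New("http: server gave HTTP response to HTTPS client"),
}

func main() {
	fmt.Println(dialFailureCopy("tcp"))
	for _, err := range sampleErrors {
		// Stage-specific copy first, raw Go error after it for detail.
		fmt.Printf("%s: %v\n", httpFailureCopy(err), err)
	}
	fmt.Println(icmpFailureCopy(nil, nil, 0))
}
```

Keeping the raw error after the stage label preserves the detail operators already rely on, while the leading copy answers "which step failed" without a trip to the source file.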
This issue is operability-related and distinct from #21 (structured logs): #21 is about HOW logs are emitted, while this one is about WHAT the human-readable copy says.