Generic "Failed" log copy buries the failure mode under a Go error string #40

@dolph

Operability problem

When a connectivity alert fires at 3 AM, the on-call SRE jumps to the logs and finds messages like:

2026/05/16 03:14:22 [host-a][eth0][...][10.0.0.5 > 10.0.0.1 >> 93.184.216.34] example.com: Failed: dial tcp 93.184.216.34:443: connect: connection refused
2026/05/16 03:14:22 example.com: Failed HTTP GET: Get "https://example.com/health": context deadline exceeded

The literal log copy is just "Failed" or "Failed HTTP GET". To tell the failure modes apart, the operator has to:

  1. Recognize from the trailing Go error string whether it is a DNS error, a TCP-level connect: connection refused (RST), an i/o timeout (network drop), an x509: error (TLS), an http: error (HTTP), etc.
  2. Cross-reference the source file to know which check stage produced the message.

Examples where the current copy is ambiguous:

  • Dial failure (dialer.go:20) logs "Failed". The operator cannot tell from the copy alone whether this was the TCP dial, the UDP send, or something else; the "what failed" is buried in the surrounding context.
  • HTTPS failure (https.go:13) logs "Failed HTTP GET" for any error from http.Get, including DNS failure on a redirect, TLS handshake failure (expired cert, name mismatch), connection reset mid-stream, or HTTP-protocol error. There is no way for the operator to distinguish "the cert expired" from "the server RST'd us" without parsing the underlying error string.
  • pinger.go:21 logs "Failed to ping <ip>" for both setup errors (insufficient privileges, IP family mismatch) and run errors. The two kinds of error call for very different remediation.

Concrete operator confusion

"I see Failed: dial tcp 1.2.3.4:443: i/o timeout. Is this a firewall drop (silent timeout) or did the server crash and stop accepting (which would have RST'd)? The copy just says 'Failed'."

"I see Failed HTTP GET: Get "https://api/": x509: certificate has expired or is not yet valid. The dial succeeded but the message starts with 'Failed HTTP GET', not 'TLS handshake failed'. I almost paged the API team for a 500 — it was a cert renewal issue."

Suggested fix

Replace bare "Failed" with stage-specific copy that names the OSI layer / step that failed:

  • dialer.go:20: "TCP dial failed" / "UDP send failed" (use dest.Protocol)
  • https.go:13: "HTTP request failed", and on TLS errors specifically tag "TLS handshake failed" (the x509: and tls: prefixes in err.Error() are reliable enough to classify)
  • pinger.go:14 vs pinger.go:21: keep them distinct but prefix with "ICMP setup failed" vs "ICMP probe failed" (already partly done, but the run-error case should also surface whether 0 packets were received vs an actual error).

This is operability-related and distinct from #21 (structured logs, which is about HOW logs are emitted): this issue is about WHAT the human-readable copy says.
