Skip to content

"Failed to route" log line looks like an error but is non-fatal — wastes operator's time during incident triage #44

@dolph

Description

@dolph

Operability problem

In destinations.go:168-177:

route, err := GetRoute(ip)
if err != nil {
    LogDestinationError(dest, fmt.Sprintf("Failed to route to %s", ip.String()), err)
}

if dest.Protocol == "icmp" {
    reachable = reachable && Ping(route, dest, ip)
} else {
    reachable = reachable && Dial(route, dest, ip)
}

The route-lookup error is logged with the word "Failed" and the standard error-logging helper (LogDestinationError), but the code then proceeds to dial/ping anyway — and per the README, this is intentional:

"The routability of each address is evaluated. Failures in routability are non-fatal; that is, the attempt to determine a viable route is only useful for producing robust logging upon failure, but does not indicate a failure in the actual routability from the host."

So the log line lies: it announces a failure that is not a failure of the check. An on-call SRE reading the logs during an incident has no way to know this from the log line itself.

Concrete operator confusion

"I'm looking at last night's logs because we got paged. I see:

[host-a][10.0.0.5] api: Failed to route to 93.184.216.34: no route to host
[host-a][eth0][10.0.0.5 > 10.0.0.1 >> 93.184.216.34] api: Connected

The first line says 'Failed to route' with no route to host. The next line says 'Connected'. So... did it work? Did it not? Is one of these lying? I assume the first one is the page trigger and start investigating routing on host-a — wasting 20 minutes before I realize the route lookup failure is cosmetic and the actual dial succeeded."

The issue is exacerbated by:

  • Single-destination resolves can produce two IPs where one route lookup fails (non-fatal) and the other succeeds, and both end up in the log stream interleaved with the dial outcomes — making it ambiguous which "Failed" caused the overall check failure.
  • GetRoute is called via the routing library which re-parses the kernel routing table (GetRoute re-parses the entire kernel routing table on every check #30) — meaning these errors can be intermittent / load-related, not actually indicative of a real routing problem.

Suggested fix

Either:

  1. Log as a warning, not an error, with copy that signals it's diagnostic and not actionable: "route lookup failed (non-fatal, dial will still proceed): %v". This requires log levels (Logs are unstructured plain text without levels; GetLocalIPs runs per log line #21).
  2. Demote to debug-only (gated by a --verbose flag — see the companion issue about adding --verbose/doctor), since per the README's own admission, this is only useful "for producing robust logging upon failure."
  3. Suppress entirely when the subsequent dial succeeds — i.e., buffer the route-lookup error and only emit it if the dial also fails. (Costs complexity but produces the clearest operator experience.)

Bug-class: the log message materially misleads the operator about what just happened.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions