You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The route-lookup error is logged with the word "Failed" and the standard error-logging helper (LogDestinationError), but the code then proceeds to dial/ping anyway — and per the README, this is intentional:
"The routability of each address is evaluated. Failures in routability are non-fatal; that is, the attempt to determine a viable route is only useful for producing robust logging upon failure, but does not indicate a failure in the actual routability from the host."
So the log line lies: it announces a failure that is not a failure of the check. An on-call SRE reading the logs during an incident has no way to know this from the log line itself.
Concrete operator confusion
"I'm looking at last night's logs because we got paged. I see:
[host-a][10.0.0.5] api: Failed to route to 93.184.216.34: no route to host
[host-a][eth0][10.0.0.5 > 10.0.0.1 >> 93.184.216.34] api: Connected
The first line says 'Failed to route' with no route to host. The next line says 'Connected'. So... did it work? Did it not? Is one of these lying? I assume the first one is the page trigger and start investigating routing on host-a — wasting 20 minutes before I realize the route lookup failure is cosmetic and the actual dial succeeded."
The issue is exacerbated by:
Single-destination resolves can produce two IPs where one route lookup fails (non-fatal) and the other succeeds, and both end up in the log stream interleaved with the dial outcomes — making it ambiguous which "Failed" caused the overall check failure.
Demote to debug-only (gated by a --verbose flag — see the companion issue about adding --verbose/doctor), since per the README's own admission, this is only useful "for producing robust logging upon failure."
Suppress entirely when the subsequent dial succeeds — i.e., buffer the route-lookup error and only emit it if the dial also fails. (Costs complexity but produces the clearest operator experience.)
Bug-class: the log message materially misleads the operator about what just happened.
Operability problem
In
destinations.go:168-177:The route-lookup error is logged with the word "Failed" and the standard error-logging helper (
LogDestinationError), but the code then proceeds to dial/ping anyway — and per the README, this is intentional:So the log line lies: it announces a failure that is not a failure of the check. An on-call SRE reading the logs during an incident has no way to know this from the log line itself.
Concrete operator confusion
The issue is exacerbated by:
GetRouteis called via theroutinglibrary which re-parses the kernel routing table (GetRoutere-parses the entire kernel routing table on every check #30) — meaning these errors can be intermittent / load-related, not actually indicative of a real routing problem.Suggested fix
Either:
"route lookup failed (non-fatal, dial will still proceed): %v". This requires log levels (Logs are unstructured plain text without levels;GetLocalIPsruns per log line #21).--verboseflag — see the companion issue about adding--verbose/doctor), since per the README's own admission, this is only useful "for producing robust logging upon failure."Bug-class: the log message materially misleads the operator about what just happened.