Problem
The UnknownError result code accounts for 2,352,993 errors per 28 days (6,403 users) — making it the single largest error bucket in azd telemetry. These errors provide zero diagnostic signal because they lack any classification, error code, or service attribution.
Telemetry Evidence (rolling 28 days ending Mar 18, 2026)
| Result Code | Count | Users | % of All Failures |
|---|---|---|---|
| `UnknownError` | 2,352,993 | 6,403 | 65.2% |
| `auth.login_required` | 547,007 | 1,984 | 15.2% |
| `internal.errors_errorString` | 462,344 | 16,978 | 12.8% |
| All other codes | 249,656 | — | 6.8% |
69.5% of auth token failures are classified as UnknownError. This inflates the unknown bucket and makes telemetry dashboards unreliable for error analysis.
Where It Happens
`UnknownError` originates when errors reach `MapError()` in `cli/azd/internal/cmd/errors.go` but do not match any of the following:
- Typed error assertion (e.g., `*azcore.ResponseError`, `*auth.AuthFailedError`)
- Sentinel error check via `classifySentinel()`
- Network error check via `isNetworkError()`
- Generic fallback (which produces `internal.*` codes, not `UnknownError`)
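Because the generic fallback always yields an `internal.*` code, any error that actually passes through `MapError()` ends up with some classification. This means `UnknownError` is likely set before `MapError()` is called, possibly by middleware or the command framework itself when errors are not propagated through the telemetry path.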
Root Cause Hypotheses
- Errors bypassing `MapError()`: Some error paths may return errors before the telemetry middleware has a chance to classify them, resulting in the default `UnknownError` status.
- MSAL/credential chain errors: Token acquisition failures from the MSAL library may produce composite error types (e.g., `ChainedTokenCredentialError`) that are not caught by any type assertion in `MapError()`.
- HTTP transport errors: Network timeouts, TLS failures, or connection resets that do not match the `isNetworkError()` patterns.
- Expired cached tokens: The credential cache may return errors that are not `ReLoginRequiredError` or `AuthFailedError`, falling through without classification.
- `internal.errors_errorString` overlap: The 462K `errors_errorString` entries suggest widespread use of bare `errors.New()` without typed sentinels. Some of these may also contribute to `UnknownError` when they bypass `MapError()` entirely.
Proposed Investigation
Phase 1: Sampling (understand what is in the bucket)
- Add error type sampling: Temporarily log the Go error type chain (`errorType(err)`) for `UnknownError` events to surface the actual types hitting this bucket.
- Query Kusto for error patterns: Analyze the `UnknownError` entries for common `AzdErrorType` values, command paths, and execution environments to identify clusters.
- Trace the `UnknownError` origin: Find all code paths where the span status is set to `UnknownError` (or where `MapError()` is not called).
Phase 2: Classification (add handlers for top types)
- Add typed handlers: For the top error types found in Phase 1, add corresponding branches in `MapError()` with proper error codes and `ServiceName` attributes.
- Add test cases: Ensure each new handler has test coverage following the existing `Test_MapError` and `Test_ClassifySuggestionType_MatchesMapError` patterns.
Phase 3: Prevention (reduce errors_errorString)
- Audit hot paths: Identify the highest-volume `errors.New()` call sites in auth token, provision, and deploy code paths.
- Add typed sentinels: Replace bare errors with typed sentinels or structured error types.
- Expand test enforcement: The test suite already has `allowedCatchAll` enforcement (`errors_test.go` line 791). Expand this pattern to prevent regressions.
Expected Impact
- Telemetry clarity: Moving even 50% of `UnknownError` into proper categories would give roughly 1.2M errors per 28 days an actionable classification on the error analysis dashboard
- Actionable insights: Classified errors can trigger targeted fixes, UX improvements, and agent-specific optimizations
- Downstream improvements: Better classification feeds into the error suggestion pipeline (`ErrorWithSuggestion`), the agent error categorization (`classifyError()`), and the pre-flight validation work (#7234, "Feature: Pre-flight auth validation for provision/deploy/up to prevent downstream failures")
Related Issues
- #7233 "Telemetry: auth errors (login_required, not_logged_in) classified as 'unknown' error category" / PR #7235 "Fix auth error telemetry classification": Fixed auth error classification (610K errors moved from `unknown` to `aad`)
- #7234 "Feature: Pre-flight auth validation for provision/deploy/up to prevent downstream failures" / PR #7236 "Add auth pre-flight validation for agents": Auth pre-flight validation (prevents auth failures from reaching the error pipeline)
- #6796 "Improve error classification to reduce Unknown error bucket": Previous work to reduce Unknown error bucket (closed)
- #6576 "Follow up on unknown errors classification": Follow-up on unknown errors classification (closed)
- #5743 "tracing: better unknown error breakdowns": Better unknown error breakdowns (closed, added `errorType()`)