Investigate and classify UnknownError bucket (2.35M errors/28d) #7239

@spboyer

Problem

The UnknownError result code accounts for 2,352,993 errors per 28 days across 6,403 users, making it the single largest error bucket in azd telemetry. These errors carry no diagnostic signal: they have no classification, error code, or service attribution.

Telemetry Evidence (rolling 28 days ending Mar 18, 2026)

| Result Code | Count | Users | % of All Failures |
|---|---|---|---|
| UnknownError | 2,352,993 | 6,403 | 65.2% |
| auth.login_required | 547,007 | 1,984 | 15.2% |
| internal.errors_errorString | 462,344 | 16,978 | 12.8% |
| All other codes | 249,656 | | 6.8% |

69.5% of auth token failures are classified as UnknownError. This inflates the unknown bucket and makes telemetry dashboards unreliable for error analysis.

Where It Happens

Errors are classified by MapError() in cli/azd/internal/cmd/errors.go, which applies these checks:

  • Typed error assertion (e.g., *azcore.ResponseError, *auth.AuthFailedError)
  • Sentinel error check via classifySentinel()
  • Network error check via isNetworkError()
  • Generic fallback (which produces internal.* codes, not UnknownError)

Because the generic fallback yields an internal.* code for anything the earlier checks miss, an error that reaches MapError() should not come out as UnknownError. This means UnknownError is likely set before MapError() is called, possibly by middleware or the command framework itself when errors are not propagated through the telemetry path.

Root Cause Hypotheses

  1. Errors bypassing MapError() — Some error paths may return errors before the telemetry middleware has a chance to classify them, resulting in the default UnknownError status.

  2. MSAL/credential chain errors — Token acquisition failures from the MSAL library may produce composite error types (e.g., ChainedTokenCredentialError) that are not caught by any type assertion in MapError().

  3. HTTP transport errors — Network timeouts, TLS failures, or connection resets that do not match the isNetworkError() patterns.

  4. Expired cached tokens — The credential cache may return errors that are not ReLoginRequiredError or AuthFailedError, falling through without classification.

  5. internal.errors_errorString overlap — The 462K errors_errorString entries suggest widespread use of bare errors.New() without typed sentinels. Some of these may also contribute to UnknownError when they bypass MapError() entirely.

Proposed Investigation

Phase 1: Sampling (understand what is in the bucket)

  1. Add error type sampling — Temporarily log the Go error type chain (errorType(err)) for UnknownError events to surface the actual types hitting this bucket.
  2. Query Kusto for error patterns — Analyze the UnknownError entries for common AzdErrorType values, command paths, and execution environments to identify clusters.
  3. Trace the UnknownError origin — Find all code paths where the span status is set to UnknownError (or where MapError() is not called).

Phase 2: Classification (add handlers for top types)

  1. Add typed handlers — For the top error types found in Phase 1, add corresponding branches in MapError() with proper error codes and ServiceName attributes.
  2. Add test cases — Ensure each new handler has test coverage following the existing Test_MapError and Test_ClassifySuggestionType_MatchesMapError patterns.

Phase 3: Prevention (reduce errors_errorString)

  1. Audit hot paths — Identify the highest-volume errors.New() call sites in auth token, provision, and deploy code paths.
  2. Add typed sentinels — Replace bare errors with typed sentinels or structured error types.
  3. Expand test enforcement — The test suite already has allowedCatchAll enforcement (errors_test.go line 791). Expand this pattern to prevent regressions.

Expected Impact

  • Telemetry clarity: Moving even 50% of UnknownError (roughly 1.18M events per 28 days) into proper categories would make the error analysis dashboard far more useful
  • Actionable insights: Classified errors can trigger targeted fixes, UX improvements, and agent-specific optimizations
  • Downstream improvements: Better classification feeds into the error suggestion pipeline (ErrorWithSuggestion), the agent error categorization (classifyError()), and the pre-flight validation work (#7234, "Feature: Pre-flight auth validation for provision/deploy/up to prevent downstream failures")

Related Issues
