Skip to content

fix(router): map gRPC errors to proper HTTP codes (#31)#68

Merged
Bowei Du (bowei) merged 2 commits into
agent-substrate:mainfrom
mchmarny:issue-31-router-error-messages
Jun 1, 2026
Merged

fix(router): map gRPC errors to proper HTTP codes (#31)#68
Bowei Du (bowei) merged 2 commits into
agent-substrate:mainfrom
mchmarny:issue-31-router-error-messages

Conversation

@mchmarny
Copy link
Copy Markdown
Contributor

@mchmarny Mark Chmarny (mchmarny) commented May 23, 2026

Fixes #31.

The Envoy ext_proc handler wrapped resume errors with %w, so the gRPC encoding leaked into the response body, and every non-NotFound failure returned 500. The 404 path returned "not found" without indicating what was missing.

This change adds a small mapping layer that translates gRPC status codes to HTTP status codes with short, client-safe bodies:

gRPC code HTTP Body
NotFound 404 actor "<id>" not found
FailedPrecondition 503 actor "<id>" unavailable: <desc> *
Unavailable 503 actor "<id>" unavailable
DeadlineExceeded 504 actor "<id>" request timed out
PermissionDenied 403 actor "<id>" access denied
Unauthenticated 401 actor "<id>" authentication required
ResourceExhausted 429 actor "<id>" rate limited
anything else 500 error resuming actor "<id>"

* FailedPrecondition is the only code whose gRPC desc is preserved verbatim; it carries actionable client-facing context like "no free workers available" and is not security-sensitive. Other codes deliberately drop the desc to avoid leaking server-side detail.

The original error is preserved via Unwrap so existing slog.ErrorContext logging keeps full fidelity.

Two smaller follow-ons made in the same change:

  • ActorResumer no longer translates codes.NotFound into a router-layer sentinel. gRPC errors now propagate untouched so the HTTP boundary owns the mapping. The retry behavior is unchanged (the backoff already exits on any non-Aborted error).
  • The "bad worker IP" branch (returned actor with an unparseable IP) now responds with an opaque actor "<id>" routing failed instead of the previous body that echoed back the invalid IP. The full IP is still logged server-side. This is deliberate; the IP is internal infrastructure detail and shouldn't appear in a client response.

Before / after

Before:

HTTP/1.1 500 Internal Server Error
error resuming actor ctr6: rpc error: code = FailedPrecondition desc = no free workers available
HTTP/1.1 404 Not Found
not found

After:

HTTP/1.1 503 Service Unavailable
actor "ctr6" unavailable: no free workers available
HTTP/1.1 404 Not Found
invalid host "ctr6.example.com": invalid actor_id: must end with actors.resources.substrate.ate.dev, got "ctr6.example.com"

Test plan

  • go test -race ./cmd/atenet/internal/app/router/... — all pass
  • go vet ./... — clean
  • make verify-fmt — clean
  • hack/verify-all.sh (boilerplate, go-generate, gofmt, licenses, mod-tidy) — all pass
  • make build-atenet — succeeds
  • New tests: 9-case table for mapResumeError, plus constructor and contract tests in errors_test.go; existing extproc_test.go cases expanded to cover the new gRPC code mappings end-to-end through handleRequestHeaders.

cc Tim Hockin (@thockin), is this what you had in mind in #31? PTAL

mchmarny

This comment was marked as resolved.

The Envoy ext_proc handler previously wrapped resume errors with %w,
leaking the gRPC encoding ("rpc error: code = ... desc = ...") into the
response body, and routed every non-NotFound failure through 500. It
also returned a generic "not found" body that did not identify which
resource was missing.

This change introduces a small mapping layer that translates gRPC status
codes into appropriate HTTP status codes with short, client-safe bodies:

  NotFound           -> 404
  FailedPrecondition -> 503 (desc preserved, e.g. "no free workers
                             available", which is actionable and not
                             security-sensitive)
  Unavailable        -> 503
  DeadlineExceeded   -> 504
  PermissionDenied   -> 403
  Unauthenticated    -> 401
  ResourceExhausted  -> 429
  other              -> 500 (no detail leak)

The original error is preserved via Unwrap so existing slog.ErrorContext
logging keeps full fidelity.

Layering: ActorResumer no longer translates codes.NotFound into a
router-layer sentinel; gRPC errors propagate untouched so the HTTP
boundary owns the mapping.
Copy link
Copy Markdown
Collaborator

@bowei Bowei Du (bowei) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits. Thanks for the change!

Comment thread cmd/atenet/internal/app/router/errors.go Outdated
Comment thread cmd/atenet/internal/app/router/errors.go Outdated
Comment thread cmd/atenet/internal/app/router/errors.go Outdated
@BenTheElder Benjamin Elder (BenTheElder) added the kind/cleanup Small fixes that are not bugs, for example a typo in a code comment label May 29, 2026
Address review feedback from @bowei: return error interface rather than
the concrete *reqError pointer so callers cannot accidentally fall into
the typed-nil pitfall. Tests use errors.As to access struct fields.
@mchmarny
Copy link
Copy Markdown
Contributor Author

Some nits. Thanks for the change!

Bowei Du (@bowei), PTAL
All 3 should be resolved with d91a4a0

Copy link
Copy Markdown
Collaborator

@bowei Bowei Du (bowei) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@bowei Bowei Du (bowei) merged commit a8c6739 into agent-substrate:main Jun 1, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/cleanup Small fixes that are not bugs, for example a typo in a code comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Router error messages are not great

3 participants