feat: add observability, health endpoints, CLI tooling, and Dockerfile #30
tac0turtle merged 2 commits into main
Conversation
…e (Phase 3) Add production-readiness features: Prometheus metrics with nil-safe Recorder interface, /health and /health/ready endpoints, CLI commands (status, blob get/list, config validate/show), and a multi-stage Dockerfile. Fix Coordinator.Status() to track latestHeight and networkHeight accurately. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of Changes

Hello @tac0turtle, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the production readiness of the application by integrating comprehensive observability features, including Prometheus metrics and health endpoints. It also introduces a suite of new command-line interface tools for interacting with the indexer's data and configuration, alongside a Dockerfile for streamlined containerized deployment. A key improvement to the sync coordinator ensures more accurate tracking of network and local sync heights.
📝 Walkthrough

This PR introduces a comprehensive metrics and health monitoring infrastructure alongside new CLI capabilities. Changes include Prometheus-backed metrics collection, health/status endpoints, blob and config CLI commands, Docker multi-stage build configuration, and metrics integration across sync, storage, and notification layers.
Sequence Diagrams

```mermaid
sequenceDiagram
participant CLI as CLI User
participant StatusCmd as Status Command
participant RPCClient as RPC Client
participant HealthHandler as Health Handler
participant Store as SQLiteStore
participant Notifier as Notifier
CLI->>StatusCmd: Execute status command
StatusCmd->>RPCClient: Fetch /health via JSON-RPC
RPCClient->>HealthHandler: HTTP GET /health
HealthHandler->>Store: Query latest block (readiness)
HealthHandler->>Notifier: Get subscriber count
HealthHandler->>HealthHandler: Compute sync lag, uptime, status
HealthHandler-->>RPCClient: HealthStatus JSON
RPCClient-->>StatusCmd: Parsed response
StatusCmd->>StatusCmd: Format as table or JSON
StatusCmd-->>CLI: Display status
```

```mermaid
sequenceDiagram
participant Coordinator as Sync Coordinator
participant Metrics as Metrics Recorder
participant PromRec as Prometheus Recorder
participant Store as SQLiteStore
participant MetricsServer as Metrics HTTP Server
Coordinator->>Coordinator: Backfill/stream blobs & headers
Coordinator->>Metrics: SetLatestHeight(h)
Coordinator->>Metrics: IncBlobsProcessed(n)
Metrics->>PromRec: Record height gauge, blob counter
PromRec->>PromRec: Update apex_sync_latest_height, apex_sync_blobs_processed_total
Store->>Metrics: ObserveStoreQueryDuration("GetBlobs", duration)
Metrics->>PromRec: Record histogram
PromRec->>PromRec: Update apex_store_query_duration_seconds
MetricsServer->>PromRec: Scrape /metrics endpoint
PromRec-->>MetricsServer: Prometheus text format response
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Code Review
This pull request introduces significant production-readiness features, including Prometheus metrics, health endpoints, CLI tooling, and a multi-stage Dockerfile. The implementation is well-structured, particularly the metrics.Recorder interface for abstracting observability. My review has identified a few issues, including a critical race condition in the metrics recorder and a bug in the CLI's JSON output formatting. I've also included some suggestions for improving code clarity and robustness. Overall, this is a great addition to the project.
pkg/metrics/metrics.go (outdated)

```go
lastLatest  uint64
lastNetwork uint64
```
The fields lastLatest and lastNetwork are accessed concurrently by SetLatestHeight and SetNetworkHeight without synchronization, creating a race condition. Since the Recorder interface must be safe for concurrent use, you should add a mutex to PromRecorder and use it to protect these fields.
You'll need to add mu sync.Mutex to the PromRecorder struct, and then wrap the critical sections in SetLatestHeight and SetNetworkHeight with r.mu.Lock() and r.mu.Unlock().
```go
mu          sync.Mutex
lastLatest  uint64
lastNetwork uint64
```
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/api/notifier.go (1)
76-82: ⚠️ Potential issue | 🟠 Major

Race: len(n.subscribers) read after lock release in Subscribe.

Line 78 releases n.mu, but line 81 reads n.subscribers without holding the lock. Another goroutine calling Subscribe or Unsubscribe concurrently can mutate the map between the unlock and the SetActiveSubscriptions call, producing an inaccurate count and a data race. Compare with Unsubscribe (line 92), which correctly calls SetActiveSubscriptions while still holding the lock.

🔒 Proposed fix: move the metrics call inside the critical section

```diff
 n.mu.Lock()
 n.subscribers[id] = sub
+n.metrics.SetActiveSubscriptions(len(n.subscribers))
 n.mu.Unlock()

 n.log.Debug().Uint64("sub_id", id).Int("namespaces", len(namespaces)).Msg("new subscription")
-n.metrics.SetActiveSubscriptions(len(n.subscribers))

 return sub
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/api/notifier.go` around lines 76 - 82, In Subscribe, after unlocking n.mu you read len(n.subscribers) for n.metrics.SetActiveSubscriptions which can race with concurrent Subscribe/Unsubscribe; move the SetActiveSubscriptions call inside the critical section (i.e., while n.mu is still held) so the count is computed and reported atomically with the map mutation—mirror the pattern used in Unsubscribe by calling n.metrics.SetActiveSubscriptions(len(n.subscribers)) before n.mu.Unlock().
🧹 Nitpick comments (10)
cmd/apex/client.go (1)
66-80: Consider checking HTTP status before unmarshal to avoid cryptic errors.

When a reverse proxy or misconfigured server returns a non-JSON 4xx/5xx, the unmarshal at Line 78 produces a confusing error like "unmarshal response: invalid character '<' looking for beginning of value" instead of a clear "unexpected status 502".

♻️ Proposed fix

```diff
 resp, err := c.client.Do(req)
 if err != nil {
 	return nil, fmt.Errorf("send request: %w", err)
 }
 defer resp.Body.Close() //nolint:errcheck

+if resp.StatusCode < 200 || resp.StatusCode >= 300 {
+	body, _ := io.ReadAll(resp.Body)
+	return nil, fmt.Errorf("unexpected status %d: %s", resp.StatusCode, bytes.TrimSpace(body))
+}
+
 respBody, err := io.ReadAll(resp.Body)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/apex/client.go` around lines 66 - 80, Before attempting json.Unmarshal into jsonRPCResponse, check the HTTP response status from resp (returned by c.client.Do(req)); if resp.StatusCode is not in the 2xx range, return a clear error that includes the status code (and optionally the respBody text) instead of proceeding to unmarshal. Locate the block that reads respBody and unmarshals into jsonRPCResponse and add an early status check using resp.StatusCode (and resp.Status) to produce a descriptive error like "unexpected status 502" plus the response body when non-2xx is received.

Dockerfile (1)
18-20: Consider adding a HEALTHCHECK directive to leverage the new health endpoints.

The service now exposes /health and /health/ready on port 8080, but the image has no HEALTHCHECK. Container schedulers (Docker Compose, standalone Docker) won't surface instance health without it.

💡 Example HEALTHCHECK

```diff
 EXPOSE 8080 9090 9091

+HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+  CMD ["/apex", "status", "--rpc-addr", "localhost:8080"]
+
 ENTRYPOINT ["/apex", "start"]
```

Alternatively use a wget/curl-based check if the status CLI command adds startup overhead:

```dockerfile
# Not available in distroless; prefer the CLI form above or a custom health binary.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 18 - 20, Add a Docker HEALTHCHECK after the EXPOSE/ENTRYPOINT lines in the Dockerfile to probe the new /health or /health/ready endpoint on port 8080; for example invoke the app's CLI (ENTRYPOINT ["/apex", "start"] is used) or run a lightweight HTTP probe like curl/wget against http://localhost:8080/health/ready and return non-zero on failure so container schedulers can detect unhealthy instances. Ensure the HEALTHCHECK uses a sensible interval, timeout and retries (so transient startup delays don't mark the container unhealthy).

pkg/api/health_test.go (1)
84-103: Convert TestReadyEndpoint to table-driven test and add not-ready case.

The /health/ready endpoint returns HTTP 503 when the node is not ready (Initializing state or store is inaccessible), but the test only covers the happy path. Extend with a not-ready case to align with the table-driven pattern used in TestHealthEndpoint and catch regressions.

♻️ Proposed extension

```diff
-func TestReadyEndpoint(t *testing.T) {
-	sp := &mockStatusProvider{status: types.SyncStatus{
-		State:         types.Streaming,
-		LatestHeight:  100,
-		NetworkHeight: 100,
-	}}
-	notifier := NewNotifier(64, zerolog.Nop())
-	h := NewHealthHandler(sp, newMockStore(), notifier, "test")
-
-	mux := http.NewServeMux()
-	h.Register(mux)
-
-	req := httptest.NewRequest(http.MethodGet, "/health/ready", nil)
-	rec := httptest.NewRecorder()
-	mux.ServeHTTP(rec, req)
-
-	if rec.Code != http.StatusOK {
-		t.Errorf("status code = %d, want 200", rec.Code)
-	}
-}
+func TestReadyEndpoint(t *testing.T) {
+	tests := []struct {
+		name     string
+		state    types.SyncState
+		wantCode int
+	}{
+		{"streaming is ready", types.Streaming, http.StatusOK},
+		{"initializing is not ready", types.Initializing, http.StatusServiceUnavailable},
+	}
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			sp := &mockStatusProvider{status: types.SyncStatus{
+				State:         tt.state,
+				LatestHeight:  100,
+				NetworkHeight: 100,
+			}}
+			notifier := NewNotifier(64, zerolog.Nop())
+			h := NewHealthHandler(sp, newMockStore(), notifier, "test")
+			mux := http.NewServeMux()
+			h.Register(mux)
+
+			req := httptest.NewRequest(http.MethodGet, "/health/ready", nil)
+			rec := httptest.NewRecorder()
+			mux.ServeHTTP(rec, req)
+
+			if rec.Code != tt.wantCode {
+				t.Errorf("status code = %d, want %d", rec.Code, tt.wantCode)
+			}
+		})
+	}
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/api/health_test.go` around lines 84 - 103, TestReadyEndpoint only covers the happy path; convert it to a table-driven test with at least two cases: "ready" (mockStatusProvider returns types.SyncStatus with State=Streaming and LatestHeight==NetworkHeight, expecting HTTP 200) and "not ready" (e.g., State=Initializing or mismatched heights, expecting HTTP 503). For each case create the notifier and handler via NewNotifier and NewHealthHandler (use newMockStore() for ready and a failing store or status with non-Streaming for not-ready), register the handler on a ServeMux, issue a GET to "/health/ready" with httptest.NewRequest and assert rec.Code equals the expected status. Use the existing mockStatusProvider, NewHealthHandler, NewNotifier, and newMockStore/newFailingMockStore helpers to locate and modify the test.

cmd/apex/main.go (2)
48-49: rpc-addr and format as root-level persistent flags leak into unrelated subcommands.

These flags are only meaningful for the status and blob commands, but as PersistentFlags on root they appear in the help text of start, init, version, etc., and are silently accepted there. Consider registering them only on the subcommands that use them (or on a shared parent), as sketched below.
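A sketch of that per-command registration (the helper name and defaults are illustrative, not from the PR):

```go
import "github.com/spf13/cobra"

// addClientFlags attaches the client-side flags only to commands that
// actually read them, keeping them out of start/init/version help text.
func addClientFlags(cmd *cobra.Command) {
	cmd.Flags().String("rpc-addr", "localhost:8080", "indexer RPC address")
	cmd.Flags().String("format", "table", "output format (table|json)")
}

// Wiring, e.g. in the root command constructor:
//   addClientFlags(statusCmd)
//   addClientFlags(blobCmd)
```

🤖 Prompt for AI Agents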
Verify each finding against the current code and only fix it if needed. In `@cmd/apex/main.go` around lines 48 - 49, The root-level PersistentFlags registration for "rpc-addr" and "format" leaks into unrelated commands; remove the two lines that call root.PersistentFlags().String("rpc-addr", ...) and root.PersistentFlags().String("format", ...), and instead register these flags only on the commands that need them (e.g., statusCmd and blobCmd) using cmd.Flags().String(...) or create a helper like addRPCFlags(cmd *cobra.Command) and call it from status and blob command setup; ensure any existing references to viper or flag lookups use the flag on those commands.
253-278: Shared shutdown context may starve the metrics server shutdown.

httpSrv.Shutdown and metricsSrv.Shutdown share the same 5-second shutdownCtx. If the HTTP server's graceful drain consumes most of the deadline, the metrics server gets little or no time. For a metrics server this is low-risk, but you could use independent timeouts if needed, as sketched below.
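An independent-deadline variant might look like this (assuming httpSrv and metricsSrv are *http.Server values, as the comment implies):

```go
// Each server gets its own 5s drain budget instead of sharing one context.
httpCtx, cancelHTTP := context.WithTimeout(context.Background(), 5*time.Second)
defer cancelHTTP()
_ = httpSrv.Shutdown(httpCtx) // log the error per the app's conventions

metricsCtx, cancelMetrics := context.WithTimeout(context.Background(), 5*time.Second)
defer cancelMetrics()
_ = metricsSrv.Shutdown(metricsCtx)
```

🤖 Prompt for AI Agents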
Verify each finding against the current code and only fix it if needed. In `@cmd/apex/main.go` around lines 253 - 278, In gracefulShutdown, httpSrv.Shutdown and metricsSrv.Shutdown use the same 5s shutdownCtx which can let the HTTP shutdown exhaust the deadline and starve metrics; change this by creating independent contexts: keep shutdownCtx/shutdownCancel for httpSrv.Shutdown (context.WithTimeout(..., 5*time.Second)) and create a separate metricsCtx/metricsCancel (context.WithTimeout(..., 5*time.Second)) used only for metricsSrv.Shutdown (and defer its cancel), then call metricsSrv.Shutdown(metricsCtx) instead of sharing shutdownCtx; reference gracefulShutdown, httpSrv.Shutdown and metricsSrv.Shutdown when making the change.pkg/api/notifier.go (1)
pkg/api/notifier.go (1)

56-59: SetMetrics is not concurrency-safe.

SetMetrics writes n.metrics without holding n.mu, while Subscribe, Unsubscribe, and Publish read n.metrics under different lock states. In the current wiring (called once at startup before sync begins) this is fine, but the public API has no guard. Consider either documenting the "must call before use" constraint or protecting the write with n.mu, as sketched below.
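A guarded variant, as a sketch (field names from the comment; the Recorder type is assumed from this PR's metrics package):

```go
// SetMetrics installs the recorder under the same mutex that Subscribe,
// Unsubscribe, and Publish already hold when they read n.metrics.
func (n *Notifier) SetMetrics(m metrics.Recorder) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.metrics = m
}
```

🤖 Prompt for AI Agents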
Verify each finding against the current code and only fix it if needed. In `@pkg/api/notifier.go` around lines 56 - 59, SetMetrics currently writes n.metrics without synchronization while Subscribe, Unsubscribe, and Publish access it under n.mu; make SetMetrics concurrency-safe by acquiring the same mutex (n.mu) before assigning n.metrics (or alternatively document and enforce a "call before use" requirement), ensuring consistency with Subscribe/Unsubscribe/Publish which read n.metrics under lock.cmd/apex/status.go (1)
cmd/apex/status.go (1)

28-43: Unknown --format values silently fall through to JSON output.

If a user passes --format=yaml or a typo like --format=tabel, they'll get JSON without any indication. Consider validating the flag or at least documenting the supported values.

Example validation

```diff
+switch format {
+case "json", "table":
+default:
+	return fmt.Errorf("unsupported format %q (supported: json, table)", format)
+}
+
 client := newRPCClient(addr)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/apex/status.go` around lines 28 - 43, The code currently treats any non-"table" format as JSON, so unknown values silently fallback to JSON; update the validation in the command handler around the format variable (the branch that calls printStatusTable and the JSON encoding path) to explicitly accept only supported values (e.g., "table" and "json") and return a user-facing error for anything else (or document and normalize aliases like "yaml" if you intend to support them). Locate the format check that gates printStatusTable(&hs) and the JSON encoder (enc.Encode(out)), add a switch or if/else that validates format, call printStatusTable for "table", perform JSON output for "json" (or empty/default), and return fmt.Errorf("unsupported --format value: %q; valid values are: table, json") for unknown values.pkg/store/sqlite.go (1)
pkg/store/sqlite.go (1)

130-131: Instrumentation is added to some store methods but not all.

PutBlobs, GetBlobs, PutHeader, GetHeader, and GetSyncState are instrumented, but GetBlob, PutNamespace, GetNamespaces, and SetSyncState are not. If the omission is intentional (e.g., these are low-frequency or trivial), that's fine — but it may surprise operators when those ops don't appear in dashboards.

Also applies to: 169-170, 200-201, 214-215, 265-266
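For illustration, the same timing pattern applied to one of the uninstrumented methods (the GetBlob signature and queryBlob helper are placeholders, not the PR's actual code):

```go
func (s *SQLiteStore) GetBlob(ctx context.Context, id []byte) (*types.Blob, error) {
	// Matches the pattern already used by PutBlobs/GetBlobs/PutHeader/GetHeader.
	start := time.Now()
	defer func() {
		s.metrics.ObserveStoreQueryDuration("GetBlob", time.Since(start))
	}()

	return s.queryBlob(ctx, id) // placeholder for the existing query logic
}
```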
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/store/sqlite.go` around lines 130 - 131, Several store methods are missing the same timing instrumentation used elsewhere; add the same pattern to instrument GetBlob, PutNamespace, GetNamespaces, and SetSyncState (and any other methods at the referenced regions) so dashboards include them. For each missing method (e.g., function/method names GetBlob, PutNamespace, GetNamespaces, SetSyncState) insert the same start := time.Now() and defer func() { s.metrics.ObserveStoreQueryDuration("MethodName", time.Since(start)) }() at the top of the method, matching the existing pattern used in PutBlobs/GetBlobs/PutHeader/GetHeader/GetSyncState; repeat for the other spots called out in the comment so all store operations are consistently instrumented.pkg/metrics/metrics.go (2)
pkg/metrics/metrics.go (2)

12-13: Misleading "A nil Recorder is valid" comment.

A nil Recorder interface value panics on any method call; what's valid (and safe) is the Nop() helper. The comment should be updated to avoid confusion.

✏️ Proposed fix

```diff
-// Implementations must be safe for concurrent use. A nil Recorder is
-// valid — callers should use the package-level Nop() helper.
+// Implementations must be safe for concurrent use.
+// Callers that do not need real metrics should use the package-level Nop() helper.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/metrics/metrics.go` around lines 12 - 13, Update the misleading package comment for the Recorder interface: remove the sentence "A nil Recorder is valid — callers should use the package-level Nop() helper." and replace it with a clear note that a nil Recorder interface value will panic if its methods are called and that callers should use the package-level Nop() helper instead; keep the concurrency-safety remark intact and reference the Recorder interface and Nop() helper in the comment so readers know which symbols to use.
157-165: Hardcoded sync-state list is a silent maintenance trap.

If the sync package gains a new state, SetSyncState will never set it to 1 — the gauge silently stays at 0 for all states. Consider exporting the canonical state set as a var (or sharing a typed-string constant from the sync package) so this list has a single source of truth.

♻️ Illustrative refactor

```diff
+// SyncStates lists every valid sync state label.
+var SyncStates = []string{"initializing", "backfilling", "streaming"}
+
 func (r *PromRecorder) SetSyncState(state string) {
-	for _, s := range []string{"initializing", "backfilling", "streaming"} {
+	for _, s := range SyncStates {
 		if s == state {
 			r.syncState.WithLabelValues(s).Set(1)
 		} else {
 			r.syncState.WithLabelValues(s).Set(0)
 		}
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/metrics/metrics.go` around lines 157 - 165, Replace the hardcoded state slice in PromRecorder.SetSyncState with a single source of truth: reference an exported canonical state list (e.g., SyncStates []string) or a typed-string constant set from the sync package, then iterate over that exported list to set r.syncState.WithLabelValues(s).Set(1) only for the matching state and 0 for others; update SetSyncState to use the exported symbol (e.g., sync.SyncStates or package-level variable) so new states are automatically handled.
cmd/apex/blob_cmd.go (outdated)

```go
func printJSON(cmd *cobra.Command, raw json.RawMessage) error {
	format, _ := cmd.Flags().GetString("format")
	if format == "table" {
		// For blob data, table format just pretty-prints the JSON since blob
		// fields are dynamic.
		return prettyPrintJSON(raw)
	}
	return prettyPrintJSON(raw)
}
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Search for format flag registration in cmd/apex/
rg -n --type=go '"format"' cmd/apex/
```

Repository: evstack/apex
Length of output: 295
🏁 Script executed:

```bash
#!/bin/bash
# Check blob_cmd.go around the printJSON function (lines 94-102)
sed -n '90,110p' cmd/apex/blob_cmd.go

# Also check status.go to see how format is handled there
sed -n '15,35p' cmd/apex/status.go
```

Repository: evstack/apex
Length of output: 1160
Dead branch in printJSON — format == "table" path is identical to the default.
Both the "table" branch and the fallthrough call prettyPrintJSON(raw), making the flag check a no-op. Either remove the conditional or implement differentiated table output.
🔧 Proposed simplification (if table output is not yet planned)
```diff
 func printJSON(cmd *cobra.Command, raw json.RawMessage) error {
-	format, _ := cmd.Flags().GetString("format")
-	if format == "table" {
-		// For blob data, table format just pretty-prints the JSON since blob
-		// fields are dynamic.
-		return prettyPrintJSON(raw)
-	}
-	return prettyPrintJSON(raw)
+	return prettyPrintJSON(raw)
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func printJSON(cmd *cobra.Command, raw json.RawMessage) error {
	return prettyPrintJSON(raw)
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/apex/blob_cmd.go` around lines 94 - 102, The conditional in printJSON
(which checks cmd.Flags().GetString("format") == "table") is dead because both
branches call prettyPrintJSON(raw); fix by either removing the conditional
entirely and always calling prettyPrintJSON(raw), or implement a distinct table
rendering path (e.g., add a new function renderTableFromJSON and call it when
format == "table") while keeping the existing prettyPrintJSON(raw) as the
default; update references in printJSON accordingly.
Dockerfile

```dockerfile
FROM gcr.io/distroless/static-debian12:nonroot

COPY --from=builder /apex /apex

EXPOSE 8080 9090 9091

ENTRYPOINT ["/apex", "start"]
```
Add an explicit USER instruction to satisfy Trivy DS-0002 and make non-root intent self-documenting.
While distroless/static-debian12:nonroot sets UID 65532 in its image config, the Dockerfile itself contains no USER directive. Trivy DS-0002 flags this, and omitting an explicit USER can also cause surprises if the base image tag is ever changed to a variant that defaults to root.
🛡️ Proposed fix
```diff
 FROM gcr.io/distroless/static-debian12:nonroot

 COPY --from=builder /apex /apex

+USER 65532:65532
+
 EXPOSE 8080 9090 9091
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```dockerfile
FROM gcr.io/distroless/static-debian12:nonroot

COPY --from=builder /apex /apex

USER 65532:65532

EXPOSE 8080 9090 9091

ENTRYPOINT ["/apex", "start"]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Dockerfile` around lines 14 - 20, The Dockerfile lacks an explicit USER
directive which Trivy DS-0002 flags; add an explicit non-root user by appending
a USER instruction (use the base image's nonroot UID 65532) after the COPY
--from=builder /apex /apex step and before ENTRYPOINT so the non-root intent is
self-documenting and survives base-image changes (ensure /apex remains
executable by that UID).
pkg/api/health.go (outdated)

```go
_, err := h.store.GetSyncState(ctx)
storeOK := err == nil || err == store.ErrNotFound
```
Use errors.Is instead of == for sentinel error comparison.
err == store.ErrNotFound will break if GetSyncState ever wraps the sentinel (e.g., fmt.Errorf("...: %w", store.ErrNotFound)). The errors package is already imported in the store layer, and errors.Is is the idiomatic and future-proof comparison.
Proposed fix

```diff
 _, err := h.store.GetSyncState(ctx)
-storeOK := err == nil || err == store.ErrNotFound
+storeOK := err == nil || errors.Is(err, store.ErrNotFound)
```

You'll also need to add "errors" to the import block.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/api/health.go` around lines 94 - 95, Replace the sentinel equality check
with errors.Is when computing storeOK in the health handler: change the
expression using err == store.ErrNotFound to errors.Is(err, store.ErrNotFound)
(so the line with GetSyncState result uses err == nil || errors.Is(err,
store.ErrNotFound)); also add the "errors" import to the file's import block so
the errors.Is symbol is available.
pkg/metrics/metrics.go (outdated)

```go
// cached for lag calculation
lastLatest  uint64
lastNetwork uint64
```
Data race on lastLatest / lastNetwork violates the stated concurrency contract.
lastLatest and lastNetwork are plain uint64 fields. SetLatestHeight writes lastLatest and reads lastNetwork; SetNetworkHeight does the inverse — both without any synchronization. The Recorder interface documentation on line 12 explicitly promises concurrent safety, making this a data race that will be flagged by -race.
🔒 Proposed fix using `sync/atomic`

```diff
 import (
 	"runtime"
+	"sync/atomic"
 	"time"

 	"github.com/prometheus/client_golang/prometheus"
 	"github.com/prometheus/client_golang/prometheus/promauto"
 )

 	// cached for lag calculation
-	lastLatest  uint64
-	lastNetwork uint64
+	lastLatest  atomic.Uint64
+	lastNetwork atomic.Uint64
 }

 func (r *PromRecorder) SetLatestHeight(h uint64) {
 	r.latestHeight.Set(float64(h))
-	r.lastLatest = h
-	r.syncLag.Set(float64(r.lastNetwork) - float64(h))
+	r.lastLatest.Store(h)
+	r.syncLag.Set(float64(r.lastNetwork.Load()) - float64(h))
 }

 func (r *PromRecorder) SetNetworkHeight(h uint64) {
 	r.networkHeight.Set(float64(h))
-	r.lastNetwork = h
-	r.syncLag.Set(float64(h) - float64(r.lastLatest))
+	r.lastNetwork.Store(h)
+	r.syncLag.Set(float64(h) - float64(r.lastLatest.Load()))
 }
```

Also applies to: 167-177
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/metrics/metrics.go` around lines 69 - 71, The fields lastLatest and
lastNetwork on the Recorder must be made concurrency-safe: switch their accesses
to atomic operations instead of plain reads/writes. Import sync/atomic, keep the
fields as uint64, and replace any direct assignments/reads in SetLatestHeight
and SetNetworkHeight (and the other usages around the 167-177 block) with
atomic.StoreUint64/atomic.LoadUint64 (or atomic.AddUint64 if computing deltas)
so all reads and writes are atomic and the Recorder concurrency contract is
honored; ensure you update every location that touches lastLatest/lastNetwork.
pkg/metrics/metrics.go

```go
func (r *PromRecorder) SetLatestHeight(h uint64) {
	r.latestHeight.Set(float64(h))
	r.lastLatest = h
	r.syncLag.Set(float64(r.lastNetwork) - float64(h))
}

func (r *PromRecorder) SetNetworkHeight(h uint64) {
	r.networkHeight.Set(float64(h))
	r.lastNetwork = h
	r.syncLag.Set(float64(h) - float64(r.lastLatest))
}
```
syncLag goes negative before SetNetworkHeight is called.
Both lastLatest and lastNetwork start at 0. If SetLatestHeight is called before any network height is known (common at startup), the lag computes as 0 - h, yielding a negative gauge value. While a Prometheus Gauge doesn't panic on negative values, a negative "lag" is semantically incorrect and can trip alert thresholds unexpectedly.
🛠️ Proposed fix — clamp lag to zero

```diff
 func (r *PromRecorder) SetLatestHeight(h uint64) {
 	r.latestHeight.Set(float64(h))
-	r.lastLatest = h
-	r.syncLag.Set(float64(r.lastNetwork) - float64(h))
+	r.lastLatest.Store(h)
+	lag := float64(r.lastNetwork.Load()) - float64(h)
+	if lag < 0 {
+		lag = 0
+	}
+	r.syncLag.Set(lag)
 }

 func (r *PromRecorder) SetNetworkHeight(h uint64) {
 	r.networkHeight.Set(float64(h))
-	r.lastNetwork = h
-	r.syncLag.Set(float64(h) - float64(r.lastLatest))
+	r.lastNetwork.Store(h)
+	lag := float64(h) - float64(r.lastLatest.Load())
+	if lag < 0 {
+		lag = 0
+	}
+	r.syncLag.Set(lag)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/metrics/metrics.go` around lines 167 - 177, SetLatestHeight and
SetNetworkHeight compute syncLag directly from lastNetwork and lastLatest which
can produce negative values at startup; change the logic in
PromRecorder.SetLatestHeight and PromRecorder.SetNetworkHeight to clamp the
computed lag to zero (e.g., compute diff := float64(r.lastNetwork) - float64(h)
and if diff < 0 { diff = 0 } before calling r.syncLag.Set(diff)) so the Gauge
never reports a negative lag, and apply the same clamping pattern in
SetNetworkHeight when computing float64(h) - float64(r.lastLatest).
pkg/metrics/metrics.go

```go
func (r *PromRecorder) IncBlobsProcessed(n int) {
	r.blobsProcessed.Add(float64(n))
}

func (r *PromRecorder) IncHeadersProcessed(n int) {
	r.headersProcessed.Add(float64(n))
}
```
Counter.Add panics on negative n — the int parameter type allows callers to trigger this.
Counter.Add is documented to panic if the value is < 0. Both IncBlobsProcessed(n int) and IncHeadersProcessed(n int) cast the signed n directly to float64 and pass it to Add, meaning any negative argument causes a runtime panic.
🛡️ Proposed fix — guard against non-positive values

```diff
 func (r *PromRecorder) IncBlobsProcessed(n int) {
-	r.blobsProcessed.Add(float64(n))
+	if n > 0 {
+		r.blobsProcessed.Add(float64(n))
+	}
 }

 func (r *PromRecorder) IncHeadersProcessed(n int) {
-	r.headersProcessed.Add(float64(n))
+	if n > 0 {
+		r.headersProcessed.Add(float64(n))
+	}
 }
```
}Alternatively, change the Recorder interface signatures to uint to make the constraint part of the type contract.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/metrics/metrics.go` around lines 179 - 185, IncBlobsProcessed and
IncHeadersProcessed on PromRecorder call Counter.Add with a float64(conversion
of int n), which will panic if n < 0; guard these methods by checking n > 0
before calling r.blobsProcessed.Add / r.headersProcessed.Add (return early on n
<= 0), or alternatively update the Recorder interface signatures to use unsigned
types and propagate that change; ensure using the PromRecorder methods' names
(IncBlobsProcessed, IncHeadersProcessed) and the metric fields (blobsProcessed,
headersProcessed) when applying the fix.
Also address PR review feedback:

- Fix data race on PromRecorder lag fields using atomic.Uint64
- Clamp sync lag gauge to zero (avoid negative values at startup)
- Guard Counter.Add against negative n to prevent panic
- Fix JSON pretty-printing (unmarshal into any, not RawMessage)
- Remove dead branch in printJSON
- Use errors.Is for sentinel error comparison in health handler
- Add explicit USER directive in Dockerfile

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add production-readiness features: Prometheus metrics with nil-safe Recorder interface, /health and /health/ready endpoints, CLI commands (status, blob get/list, config validate/show), and a multi-stage Dockerfile. Fix Coordinator.Status() to track latestHeight and networkHeight accurately.
Overview
Summary by CodeRabbit
Release Notes
New Features
- /health and /health/ready endpoints for service status monitoring
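As a rough usage sketch of the new endpoints (the paths and 200/503 readiness semantics come from the review comments above; the localhost:8080 address is an assumption):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// /health reports status JSON; /health/ready returns 200 when the node
	// is ready and 503 otherwise, per the review discussion above.
	for _, path := range []string{"/health", "/health/ready"} {
		resp, err := http.Get("http://localhost:8080" + path)
		if err != nil {
			fmt.Println(path, "error:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Println(path, resp.StatusCode, string(body))
	}
}
```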