
feat: add observability, health endpoints, CLI tooling, and Dockerfile #30

Merged
tac0turtle merged 2 commits into main from feat/phase3-polish on Feb 19, 2026

Conversation

tac0turtle (Contributor) commented on Feb 19, 2026

Add production-readiness features: Prometheus metrics with nil-safe Recorder interface, /health and /health/ready endpoints, CLI commands (status, blob get/list, config validate/show), and a multi-stage Dockerfile. Fix Coordinator.Status() to track latestHeight and networkHeight accurately.
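None of the new code appears inline in this conversation, so here is a minimal sketch of the nil-safe recorder pattern the description refers to: an interface plus a no-op implementation returned by a Nop() helper. The method set is assembled from the review comments below; the exact interface in pkg/metrics/metrics.go may differ.

package metrics

import "time"

// Recorder abstracts metric emission so call sites never need nil checks.
// Implementations must be safe for concurrent use.
type Recorder interface {
	SetSyncState(state string)
	SetLatestHeight(h uint64)
	SetNetworkHeight(h uint64)
	IncBlobsProcessed(n int)
	IncHeadersProcessed(n int)
	ObserveStoreQueryDuration(op string, d time.Duration)
	SetActiveSubscriptions(n int)
}

// nopRecorder discards every observation.
type nopRecorder struct{}

func (nopRecorder) SetSyncState(string)                             {}
func (nopRecorder) SetLatestHeight(uint64)                          {}
func (nopRecorder) SetNetworkHeight(uint64)                         {}
func (nopRecorder) IncBlobsProcessed(int)                           {}
func (nopRecorder) IncHeadersProcessed(int)                         {}
func (nopRecorder) ObserveStoreQueryDuration(string, time.Duration) {}
func (nopRecorder) SetActiveSubscriptions(int)                      {}

// Nop returns a Recorder that components use when metrics are disabled.
func Nop() Recorder { return nopRecorder{} }

Components hold a Recorder field initialized to Nop(), so enabling Prometheus amounts to swapping in the PromRecorder discussed in the review below.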

Overview

Summary by CodeRabbit

Release Notes

  • New Features

    • Added blob command to query and list blobs via RPC with namespace filtering
    • Added status command to display sync health and metrics in table or JSON format
    • Added config command to validate and display configuration
    • Added /health and /health/ready endpoints for service status monitoring
    • Added Prometheus metrics support for monitoring sync progress, API performance, and store operations
  • Configuration

    • Added metrics configuration section with enabled flag and listen address

…e (Phase 3)

Add production-readiness features: Prometheus metrics with nil-safe
Recorder interface, /health and /health/ready endpoints, CLI commands
(status, blob get/list, config validate/show), and a multi-stage
Dockerfile. Fix Coordinator.Status() to track latestHeight and
networkHeight accurately.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello @tac0turtle, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the production readiness of the application by integrating comprehensive observability features, including Prometheus metrics and health endpoints. It also introduces a suite of new command-line interface tools for interacting with the indexer's data and configuration, alongside a Dockerfile for streamlined containerized deployment. A key improvement to the sync coordinator ensures more accurate tracking of network and local sync heights.
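As a companion to the health endpoints mentioned above, here is a hedged sketch of how /health and /health/ready are commonly wired with net/http. The type names follow the review comments in this thread (HealthHandler, StatusProvider, the 2-second readiness timeout); the JSON field names and the syncStateStore interface are assumptions.

package api

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

// SyncStatus mirrors the fields the health endpoints report (assumed shape).
type SyncStatus struct {
	State         string `json:"state"`
	LatestHeight  uint64 `json:"latest_height"`
	NetworkHeight uint64 `json:"network_height"`
}

// StatusProvider is implemented by the sync coordinator (assumed signature).
type StatusProvider interface {
	Status() SyncStatus
}

// syncStateStore is the minimal store surface the readiness check needs.
type syncStateStore interface {
	GetSyncState(ctx context.Context) (any, error)
}

type HealthHandler struct {
	status StatusProvider
	store  syncStateStore
}

func (h *HealthHandler) Register(mux *http.ServeMux) {
	mux.HandleFunc("/health", h.handleHealth)
	mux.HandleFunc("/health/ready", h.handleReady)
}

// /health reports sync state and heights as JSON.
func (h *HealthHandler) handleHealth(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(h.status.Status())
}

// /health/ready returns 503 until the store answers within a short timeout.
func (h *HealthHandler) handleReady(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()
	if _, err := h.store.GetSyncState(ctx); err != nil {
		http.Error(w, "not ready", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}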

Highlights

  • Observability Features: Introduced Prometheus metrics for sync state, API requests, store operations, and subscription events, along with '/health' and '/health/ready' HTTP endpoints for service status.
  • CLI Tooling: Added new 'apex' CLI commands for querying blobs ('blob get', 'blob list'), checking indexer status ('status'), and managing configuration ('config validate', 'config show').
  • Docker Support: Provided a multi-stage Dockerfile for building and deploying the application.
  • Coordinator Enhancements: Improved the 'Coordinator.Status()' method to accurately track 'latestHeight' and 'networkHeight' and integrated metrics reporting into its lifecycle.


Changelog
  • Dockerfile
    • Added a multi-stage Dockerfile for building and deploying the application.
  • cmd/apex/blob_cmd.go
    • Added new CLI commands for fetching and listing blobs.
  • cmd/apex/client.go
    • Added a new JSON-RPC client for CLI interactions.
  • cmd/apex/config_cmd.go
    • Added new CLI commands for validating and showing the application configuration.
  • cmd/apex/main.go
    • Integrated new CLI commands, set up Prometheus metrics, and mounted health endpoints.
  • cmd/apex/status.go
    • Added a new CLI command to display the indexer's sync status.
  • config/config.go
    • Extended the configuration structure to include Prometheus metrics settings.
  • config/load.go
    • Updated the default configuration and validation logic to support new metrics settings.
  • go.mod
    • Added 'github.com/prometheus/client_golang' dependency.
  • go.sum
    • Updated module checksums to reflect new dependencies.
  • pkg/api/health.go
    • Added new health status provider and HTTP handlers for '/health' and '/health/ready'.
  • pkg/api/health_test.go
    • Added unit tests for the new health endpoints.
  • pkg/api/notifier.go
    • Integrated metrics reporting for subscription activity, including active subscribers and dropped events.
  • pkg/metrics/metrics.go
    • Defined a 'Recorder' interface for metrics and implemented a Prometheus-specific recorder.
  • pkg/metrics/metrics_test.go
    • Added unit tests for the Prometheus metrics recorder.
  • pkg/metrics/server.go
    • Implemented an HTTP server to expose Prometheus metrics.
  • pkg/store/sqlite.go
    • Integrated metrics reporting for various database operations.
  • pkg/sync/coordinator.go
    • Enhanced the coordinator to track 'latestHeight' and 'networkHeight' and report sync metrics.
Activity
  • The pull request was created by tac0turtle.
  • No human activity (comments, reviews) has been recorded yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


coderabbitai bot commented Feb 19, 2026

📝 Walkthrough

This PR introduces a comprehensive metrics and health monitoring infrastructure alongside new CLI capabilities. Changes include Prometheus-backed metrics collection, health/status endpoints, blob and config CLI commands, Docker multi-stage build configuration, and metrics integration across sync, storage, and notification layers.
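The configuration side is small: per the summaries in this thread, the new section exposes Enabled and ListenAddr and only requires an address when metrics are enabled. A hedged sketch follows; the struct tags and the standalone Validate method are assumptions, since the real validation lives in config/load.go.

package config

import "errors"

// MetricsConfig toggles the Prometheus exporter.
type MetricsConfig struct {
	Enabled    bool   `toml:"enabled"`
	ListenAddr string `toml:"listen_addr"`
}

// Validate enforces the rule described in the walkthrough: a listen address
// is required only when metrics are enabled.
func (c MetricsConfig) Validate() error {
	if c.Enabled && c.ListenAddr == "" {
		return errors.New("metrics.listen_addr must be set when metrics are enabled")
	}
	return nil
}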

Changes

  • Docker Build Configuration (Dockerfile)
    Multi-stage build using golang:1.25-bookworm builder stage with static binary compilation and version embedding via git describe; final stage uses distroless image with exposed ports 8080, 9090, 9091.
  • Metrics Infrastructure (pkg/metrics/metrics.go, pkg/metrics/metrics_test.go, pkg/metrics/server.go)
    Introduces Recorder interface with no-op and Prometheus implementations; defines 12+ Prometheus metrics (sync state, heights, processing, API requests, store durations, subscriptions, events); provides NewServer for /metrics HTTP endpoint with graceful shutdown (a hedged server sketch follows below this list).
  • Configuration Updates (config/config.go, config/load.go, go.mod)
    Adds MetricsConfig with Enabled and ListenAddr fields; validation enforces non-empty listen address when enabled; updates dependencies with prometheus/client_golang v1.23.2 and related packages.
  • Health & Status API (pkg/api/health.go, pkg/api/health_test.go)
    Introduces HealthHandler with /health and /health/ready endpoints; exposes StatusProvider interface and HealthStatus struct; computes sync lag, uptime, subscriber count; readiness check queries store with 2s timeout.
  • Component Metrics Integration (pkg/sync/coordinator.go, pkg/store/sqlite.go, pkg/api/notifier.go)
    Adds metrics.Recorder field and WithMetrics option to Coordinator; tracks latestHeight/networkHeight; instruments PutBlobs/GetBlobs/GetHeader operations in SQLiteStore; tracks active subscriptions and dropped events in Notifier.
  • CLI Commands & RPC Client (cmd/apex/blob_cmd.go, cmd/apex/config_cmd.go, cmd/apex/status.go, cmd/apex/client.go)
    Implements blob get/list commands with hex validation; config validate/show commands; status command with table/JSON output; JSON-RPC over HTTP client with 10s timeout and health-check helper.
  • Main Application Orchestration (cmd/apex/main.go)
    Adds setupMetrics and gracefulShutdown functions; wires metrics to DB, notifier, coordinator; mounts health handler to HTTP mux; introduces statusCmd, blobCmd, configCmd; adds rpc-addr and format CLI flags; coordinates async metrics server startup and shutdown.
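As referenced in the Metrics Infrastructure entry above, here is a hedged sketch of a /metrics server with graceful shutdown built on promhttp. Only NewServer and Shutdown are named in this thread; Start, the timeouts, and the struct shape are assumptions.

package metrics

import (
	"context"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Server wraps an http.Server that exposes /metrics.
type Server struct {
	srv *http.Server
}

func NewServer(listenAddr string) *Server {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	return &Server{srv: &http.Server{
		Addr:              listenAddr,
		Handler:           mux,
		ReadHeaderTimeout: 5 * time.Second,
	}}
}

// Start blocks until the listener fails or Shutdown is called.
func (s *Server) Start() error {
	if err := s.srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		return err
	}
	return nil
}

// Shutdown drains in-flight scrapes until ctx expires.
func (s *Server) Shutdown(ctx context.Context) error {
	return s.srv.Shutdown(ctx)
}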

Sequence Diagrams

sequenceDiagram
    participant CLI as CLI User
    participant StatusCmd as Status Command
    participant RPCClient as RPC Client
    participant HealthHandler as Health Handler
    participant Store as SQLiteStore
    participant Notifier as Notifier
    
    CLI->>StatusCmd: Execute status command
    StatusCmd->>RPCClient: Fetch /health via JSON-RPC
    RPCClient->>HealthHandler: HTTP GET /health
    HealthHandler->>Store: Query latest block (readiness)
    HealthHandler->>Notifier: Get subscriber count
    HealthHandler->>HealthHandler: Compute sync lag, uptime, status
    HealthHandler-->>RPCClient: HealthStatus JSON
    RPCClient-->>StatusCmd: Parsed response
    StatusCmd->>StatusCmd: Format as table or JSON
    StatusCmd-->>CLI: Display status
sequenceDiagram
    participant Coordinator as Sync Coordinator
    participant Metrics as Metrics Recorder
    participant PromRec as Prometheus Recorder
    participant Store as SQLiteStore
    participant MetricsServer as Metrics HTTP Server
    
    Coordinator->>Coordinator: Backfill/stream blobs & headers
    Coordinator->>Metrics: SetLatestHeight(h)
    Coordinator->>Metrics: IncBlobsProcessed(n)
    Metrics->>PromRec: Record height gauge, blob counter
    PromRec->>PromRec: Update apex_sync_latest_height, apex_sync_blobs_processed_total
    Store->>Metrics: ObserveStoreQueryDuration("GetBlobs", duration)
    Metrics->>PromRec: Record histogram
    PromRec->>PromRec: Update apex_store_query_duration_seconds
    MetricsServer->>PromRec: Scrape /metrics endpoint
    PromRec-->>MetricsServer: Prometheus text format response
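The diagram above shows the coordinator pushing heights and counters into the recorder. The walkthrough says this is wired through a WithMetrics option on the Coordinator; below is a hedged sketch of that functional-option shape. The import path is inferred from the repository name, and the real NewCoordinator certainly takes more arguments.

package sync

import "github.com/evstack/apex/pkg/metrics" // import path assumed from the repo name

// Coordinator fields are trimmed to what the option touches.
type Coordinator struct {
	metrics       metrics.Recorder
	latestHeight  uint64
	networkHeight uint64
}

// Option configures a Coordinator (assumed to follow the functional-option pattern).
type Option func(*Coordinator)

// WithMetrics injects a Recorder; callers that skip it keep the no-op default.
func WithMetrics(r metrics.Recorder) Option {
	return func(c *Coordinator) {
		if r != nil {
			c.metrics = r
		}
	}
}

func NewCoordinator(opts ...Option) *Coordinator {
	c := &Coordinator{metrics: metrics.Nop()}
	for _, opt := range opts {
		opt(c)
	}
	return c
}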

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Metrics bloom like clover in spring,
Health checks hop from every service wing,
Prometheus gathers the tales we tell,
While status commands serve us well,
Observability—the rabbit's delight! 🌿

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 31.25%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly and clearly summarizes the primary changes: observability (metrics), health endpoints, CLI tooling, and Dockerfile. All four major components align with the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.


gemini-code-assist bot left a comment


Code Review

This pull request introduces significant production-readiness features, including Prometheus metrics, health endpoints, CLI tooling, and a multi-stage Dockerfile. The implementation is well-structured, particularly the metrics.Recorder interface for abstracting observability. My review has identified a few issues, including a critical race condition in the metrics recorder and a bug in the CLI's JSON output formatting. I've also included some suggestions for improving code clarity and robustness. Overall, this is a great addition to the project.

Comment on lines 70 to 71
	lastLatest  uint64
	lastNetwork uint64


critical

The fields lastLatest and lastNetwork are accessed concurrently by SetLatestHeight and SetNetworkHeight without synchronization, creating a race condition. Since the Recorder interface must be safe for concurrent use, you should add a mutex to PromRecorder and use it to protect these fields.

You'll need to add mu sync.Mutex to the PromRecorder struct, and then wrap the critical sections in SetLatestHeight and SetNetworkHeight with r.mu.Lock() and r.mu.Unlock().

	mu          sync.Mutex
	lastLatest  uint64
	lastNetwork uint64
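For illustration, here is a self-contained sketch of the guarded-cache pattern this suggestion describes; heightCache is a hypothetical stand-in for the relevant PromRecorder fields, and the returned lag would feed the syncLag gauge.

package metrics

import "sync"

// heightCache shows the mutex pattern suggested above for the cached heights.
type heightCache struct {
	mu          sync.Mutex
	lastLatest  uint64
	lastNetwork uint64
}

// setLatest stores the local height and returns the lag to report.
func (c *heightCache) setLatest(h uint64) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastLatest = h
	return float64(c.lastNetwork) - float64(h)
}

// setNetwork stores the network height and returns the lag to report.
func (c *heightCache) setNetwork(h uint64) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastNetwork = h
	return float64(h) - float64(c.lastLatest)
}

Note that CodeRabbit's comment further down proposes atomic.Uint64 for the same fields, which avoids the lock entirely.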


coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/api/notifier.go (1)

76-82: ⚠️ Potential issue | 🟠 Major

Race: len(n.subscribers) read after lock release in Subscribe.

Line 78 releases n.mu, but line 81 reads n.subscribers without holding the lock. Another goroutine calling Subscribe or Unsubscribe concurrently can mutate the map between the unlock and the SetActiveSubscriptions call, producing an inaccurate count and a data race.

Compare with Unsubscribe (line 92) which correctly calls SetActiveSubscriptions while still holding the lock.

🔒 Proposed fix: move the metrics call inside the critical section
 	n.mu.Lock()
 	n.subscribers[id] = sub
+	n.metrics.SetActiveSubscriptions(len(n.subscribers))
 	n.mu.Unlock()
 
 	n.log.Debug().Uint64("sub_id", id).Int("namespaces", len(namespaces)).Msg("new subscription")
-	n.metrics.SetActiveSubscriptions(len(n.subscribers))
 	return sub
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/api/notifier.go` around lines 76 - 82, In Subscribe, after unlocking n.mu
you read len(n.subscribers) for n.metrics.SetActiveSubscriptions which can race
with concurrent Subscribe/Unsubscribe; move the SetActiveSubscriptions call
inside the critical section (i.e., while n.mu is still held) so the count is
computed and reported atomically with the map mutation—mirror the pattern used
in Unsubscribe by calling n.metrics.SetActiveSubscriptions(len(n.subscribers))
before n.mu.Unlock().
🧹 Nitpick comments (10)
cmd/apex/client.go (1)

66-80: Consider checking HTTP status before unmarshal to avoid cryptic errors.

When a reverse proxy or misconfigured server returns a non-JSON 4xx/5xx, the unmarshal at Line 78 produces a confusing error like "unmarshal response: invalid character '<' looking for beginning of value" instead of a clear "unexpected status 502".

♻️ Proposed fix
 	resp, err := c.client.Do(req)
 	if err != nil {
 		return nil, fmt.Errorf("send request: %w", err)
 	}
 	defer resp.Body.Close() //nolint:errcheck
 
+	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
+		body, _ := io.ReadAll(resp.Body)
+		return nil, fmt.Errorf("unexpected status %d: %s", resp.StatusCode, bytes.TrimSpace(body))
+	}
+
 	respBody, err := io.ReadAll(resp.Body)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/apex/client.go` around lines 66 - 80, Before attempting json.Unmarshal
into jsonRPCResponse, check the HTTP response status from resp (returned by
c.client.Do(req)); if resp.StatusCode is not in the 2xx range, return a clear
error that includes the status code (and optionally the respBody text) instead
of proceeding to unmarshal. Locate the block that reads respBody and unmarshals
into jsonRPCResponse and add an early status check using resp.StatusCode (and
resp.Status) to produce a descriptive error like "unexpected status 502" plus
the response body when non-2xx is received.
Dockerfile (1)

18-20: Consider adding a HEALTHCHECK directive to leverage the new health endpoints.

The service now exposes /health and /health/ready on port 8080 but the image has no HEALTHCHECK. Container schedulers (Docker Compose, standalone Docker) won't surface instance health without it.

💡 Example HEALTHCHECK
 EXPOSE 8080 9090 9091
 
+HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+    CMD ["/apex", "status", "--rpc-addr", "localhost:8080"]
+
 ENTRYPOINT ["/apex", "start"]

Alternatively use a wget/curl-based check if the status CLI command adds startup overhead:

# Not available in distroless; prefer the CLI form above or a custom health binary.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 18 - 20, Add a Docker HEALTHCHECK after the
EXPOSE/ENTRYPOINT lines in the Dockerfile to probe the new /health or
/health/ready endpoint on port 8080; for example invoke the app's CLI
(ENTRYPOINT ["/apex", "start"] is used) or run a lightweight HTTP probe like
curl/wget against http://localhost:8080/health/ready and return non-zero on
failure so container schedulers can detect unhealthy instances. Ensure the
HEALTHCHECK uses a sensible interval, timeout and retries (so transient startup
delays don't mark the container unhealthy).
pkg/api/health_test.go (1)

84-103: Convert TestReadyEndpoint to table-driven test and add not-ready case.

The /health/ready endpoint returns HTTP 503 when the node is not ready (Initializing state or store is inaccessible), but the test only covers the happy path. Extend with a not-ready case to align with the table-driven pattern used in TestHealthEndpoint and catch regressions.

♻️ Proposed extension
-func TestReadyEndpoint(t *testing.T) {
-	sp := &mockStatusProvider{status: types.SyncStatus{
-		State:         types.Streaming,
-		LatestHeight:  100,
-		NetworkHeight: 100,
-	}}
-	notifier := NewNotifier(64, zerolog.Nop())
-	h := NewHealthHandler(sp, newMockStore(), notifier, "test")
-
-	mux := http.NewServeMux()
-	h.Register(mux)
-
-	req := httptest.NewRequest(http.MethodGet, "/health/ready", nil)
-	rec := httptest.NewRecorder()
-	mux.ServeHTTP(rec, req)
-
-	if rec.Code != http.StatusOK {
-		t.Errorf("status code = %d, want 200", rec.Code)
-	}
-}
+func TestReadyEndpoint(t *testing.T) {
+	tests := []struct {
+		name     string
+		state    types.SyncState
+		wantCode int
+	}{
+		{"streaming is ready", types.Streaming, http.StatusOK},
+		{"initializing is not ready", types.Initializing, http.StatusServiceUnavailable},
+	}
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			sp := &mockStatusProvider{status: types.SyncStatus{
+				State:         tt.state,
+				LatestHeight:  100,
+				NetworkHeight: 100,
+			}}
+			notifier := NewNotifier(64, zerolog.Nop())
+			h := NewHealthHandler(sp, newMockStore(), notifier, "test")
+			mux := http.NewServeMux()
+			h.Register(mux)
+
+			req := httptest.NewRequest(http.MethodGet, "/health/ready", nil)
+			rec := httptest.NewRecorder()
+			mux.ServeHTTP(rec, req)
+
+			if rec.Code != tt.wantCode {
+				t.Errorf("status code = %d, want %d", rec.Code, tt.wantCode)
+			}
+		})
+	}
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/api/health_test.go` around lines 84 - 103, TestReadyEndpoint only covers
the happy path; convert it to a table-driven test with at least two cases:
"ready" (mockStatusProvider returns types.SyncStatus with State=Streaming and
LatestHeight==NetworkHeight, expecting HTTP 200) and "not ready" (e.g.,
State=Initializing or mismatched heights, expecting HTTP 503). For each case
create the notifier and handler via NewNotifier and NewHealthHandler (use
newMockStore() for ready and a failing store or status with non-Streaming for
not-ready), register the handler on a ServeMux, issue a GET to "/health/ready"
with httptest.NewRequest and assert rec.Code equals the expected status. Use the
existing mockStatusProvider, NewHealthHandler, NewNotifier, and
newMockStore/newFailingMockStore helpers to locate and modify the test.
cmd/apex/main.go (2)

48-49: rpc-addr and format as root-level persistent flags leak into unrelated subcommands.

These flags are only meaningful for status and blob commands, but as PersistentFlags on root they appear in the help text of start, init, version, etc., and are silently accepted there. Consider registering them only on the subcommands that use them (or on a shared parent).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/apex/main.go` around lines 48 - 49, The root-level PersistentFlags
registration for "rpc-addr" and "format" leaks into unrelated commands; remove
the two lines that call root.PersistentFlags().String("rpc-addr", ...) and
root.PersistentFlags().String("format", ...), and instead register these flags
only on the commands that need them (e.g., statusCmd and blobCmd) using
cmd.Flags().String(...) or create a helper like addRPCFlags(cmd *cobra.Command)
and call it from status and blob command setup; ensure any existing references
to viper or flag lookups use the flag on those commands.
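A hedged sketch of the narrowing this comment asks for, registering the flags only on the commands that read them; addRPCFlags and the default values are hypothetical.

package main

import "github.com/spf13/cobra"

// addRPCFlags attaches the client-side flags to a single command.
func addRPCFlags(cmd *cobra.Command) {
	cmd.Flags().String("rpc-addr", "localhost:8080", "address of the running indexer RPC")
	cmd.Flags().String("format", "table", "output format: table or json")
}

// wireCommands shows the flags scoped to status and blob only.
func wireCommands(root, statusCmd, blobCmd *cobra.Command) {
	addRPCFlags(statusCmd)
	addRPCFlags(blobCmd)
	root.AddCommand(statusCmd, blobCmd)
}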

253-278: Shared shutdown context may starve the metrics server shutdown.

httpSrv.Shutdown and metricsSrv.Shutdown share the same 5-second shutdownCtx. If the HTTP server's graceful drain consumes most of the deadline, the metrics server gets little or no time. For a metrics server this is low-risk, but you could use independent timeouts if needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/apex/main.go` around lines 253 - 278, In gracefulShutdown,
httpSrv.Shutdown and metricsSrv.Shutdown use the same 5s shutdownCtx which can
let the HTTP shutdown exhaust the deadline and starve metrics; change this by
creating independent contexts: keep shutdownCtx/shutdownCancel for
httpSrv.Shutdown (context.WithTimeout(..., 5*time.Second)) and create a separate
metricsCtx/metricsCancel (context.WithTimeout(..., 5*time.Second)) used only for
metricsSrv.Shutdown (and defer its cancel), then call
metricsSrv.Shutdown(metricsCtx) instead of sharing shutdownCtx; reference
gracefulShutdown, httpSrv.Shutdown and metricsSrv.Shutdown when making the
change.
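A hedged sketch of the independent-timeout variant; the real gracefulShutdown in cmd/apex/main.go has a different signature and logger, so metricsShutdowner here is just a placeholder interface.

package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// metricsShutdowner is the minimal surface needed here; the real metrics
// server type lives in pkg/metrics.
type metricsShutdowner interface {
	Shutdown(ctx context.Context) error
}

func gracefulShutdown(httpSrv *http.Server, metricsSrv metricsShutdowner) {
	// Each server gets its own deadline so a slow HTTP drain cannot starve
	// the metrics shutdown.
	httpCtx, cancelHTTP := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancelHTTP()
	if err := httpSrv.Shutdown(httpCtx); err != nil {
		log.Printf("http shutdown: %v", err)
	}

	metricsCtx, cancelMetrics := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancelMetrics()
	if err := metricsSrv.Shutdown(metricsCtx); err != nil {
		log.Printf("metrics shutdown: %v", err)
	}
}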
pkg/api/notifier.go (1)

56-59: SetMetrics is not concurrency-safe.

SetMetrics writes n.metrics without holding n.mu, while Subscribe, Unsubscribe, and Publish read n.metrics under different lock states. In the current wiring (called once at startup before sync begins) this is fine, but the public API has no guard. Consider either documenting the "must call before use" constraint or protecting the write with n.mu.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/api/notifier.go` around lines 56 - 59, SetMetrics currently writes
n.metrics without synchronization while Subscribe, Unsubscribe, and Publish
access it under n.mu; make SetMetrics concurrency-safe by acquiring the same
mutex (n.mu) before assigning n.metrics (or alternatively document and enforce a
"call before use" requirement), ensuring consistency with
Subscribe/Unsubscribe/Publish which read n.metrics under lock.
cmd/apex/status.go (1)

28-43: Unknown --format values silently fall through to JSON output.

If a user passes --format=yaml or a typo like --format=tabel, they'll get JSON without any indication. Consider validating the flag or at least documenting the supported values.

Example validation
+			switch format {
+			case "json", "table":
+			default:
+				return fmt.Errorf("unsupported format %q (supported: json, table)", format)
+			}
+
 			client := newRPCClient(addr)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/apex/status.go` around lines 28 - 43, The code currently treats any
non-"table" format as JSON, so unknown values silently fallback to JSON; update
the validation in the command handler around the format variable (the branch
that calls printStatusTable and the JSON encoding path) to explicitly accept
only supported values (e.g., "table" and "json") and return a user-facing error
for anything else (or document and normalize aliases like "yaml" if you intend
to support them). Locate the format check that gates printStatusTable(&hs) and
the JSON encoder (enc.Encode(out)), add a switch or if/else that validates
format, call printStatusTable for "table", perform JSON output for "json" (or
empty/default), and return fmt.Errorf("unsupported --format value: %q; valid
values are: table, json") for unknown values.
pkg/store/sqlite.go (1)

130-131: Instrumentation is added to some store methods but not all.

PutBlobs, GetBlobs, PutHeader, GetHeader, and GetSyncState are instrumented, but GetBlob, PutNamespace, GetNamespaces, and SetSyncState are not. If the omission is intentional (e.g., these are low-frequency or trivial), that's fine — but it may surprise operators when those ops don't appear in dashboards.

Also applies to: 169-170, 200-201, 214-215, 265-266

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/store/sqlite.go` around lines 130 - 131, Several store methods are
missing the same timing instrumentation used elsewhere; add the same pattern to
instrument GetBlob, PutNamespace, GetNamespaces, and SetSyncState (and any other
methods at the referenced regions) so dashboards include them. For each missing
method (e.g., function/method names GetBlob, PutNamespace, GetNamespaces,
SetSyncState) insert the same start := time.Now() and defer func() {
s.metrics.ObserveStoreQueryDuration("MethodName", time.Since(start)) }() at the
top of the method, matching the existing pattern used in
PutBlobs/GetBlobs/PutHeader/GetHeader/GetSyncState; repeat for the other spots
called out in the comment so all store operations are consistently instrumented.
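For reference, here is the timing pattern the instrumented methods use and that this comment asks to extend, shown on a hypothetical GetBlob; the signature and struct fields are assumptions, and only ObserveStoreQueryDuration and the method names come from this thread.

package store

import (
	"context"
	"time"
)

// queryTimer is the minimal metrics surface this sketch needs.
type queryTimer interface {
	ObserveStoreQueryDuration(op string, d time.Duration)
}

type SQLiteStore struct {
	metrics queryTimer
	// db handle and other fields omitted
}

// GetBlob shows the timing pattern; the body is a placeholder.
func (s *SQLiteStore) GetBlob(ctx context.Context, height uint64, commitment []byte) ([]byte, error) {
	start := time.Now()
	defer func() {
		s.metrics.ObserveStoreQueryDuration("GetBlob", time.Since(start))
	}()
	// ... actual query elided ...
	return nil, nil
}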
pkg/metrics/metrics.go (2)

12-13: Misleading "A nil Recorder is valid" comment.

A nil Recorder interface value panics on any method call; what's valid (and safe) is the Nop() helper. The comment should be updated to avoid confusion.

✏️ Proposed fix
-// Implementations must be safe for concurrent use. A nil Recorder is
-// valid — callers should use the package-level Nop() helper.
+// Implementations must be safe for concurrent use.
+// Callers that do not need real metrics should use the package-level Nop() helper.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/metrics.go` around lines 12 - 13, Update the misleading package
comment for the Recorder interface: remove the sentence "A nil Recorder is valid
— callers should use the package-level Nop() helper." and replace it with a
clear note that a nil Recorder interface value will panic if its methods are
called and that callers should use the package-level Nop() helper instead; keep
the concurrency-safety remark intact and reference the Recorder interface and
Nop() helper in the comment so readers know which symbols to use.

157-165: Hardcoded sync-state list is a silent maintenance trap.

If the sync package gains a new state, SetSyncState will never set it to 1 — the gauge silently stays at 0 for all states. Consider exporting the canonical state set as a var (or sharing a typed-string constant from the sync package) so this list has a single source of truth.

♻️ Illustrative refactor
+// SyncStates lists every valid sync state label.
+var SyncStates = []string{"initializing", "backfilling", "streaming"}

 func (r *PromRecorder) SetSyncState(state string) {
-	for _, s := range []string{"initializing", "backfilling", "streaming"} {
+	for _, s := range SyncStates {
 		if s == state {
 			r.syncState.WithLabelValues(s).Set(1)
 		} else {
 			r.syncState.WithLabelValues(s).Set(0)
 		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/metrics.go` around lines 157 - 165, Replace the hardcoded state
slice in PromRecorder.SetSyncState with a single source of truth: reference an
exported canonical state list (e.g., SyncStates []string) or a typed-string
constant set from the sync package, then iterate over that exported list to set
r.syncState.WithLabelValues(s).Set(1) only for the matching state and 0 for
others; update SetSyncState to use the exported symbol (e.g., sync.SyncStates or
package-level variable) so new states are automatically handled.

Comment on lines 94 to 102
func printJSON(cmd *cobra.Command, raw json.RawMessage) error {
	format, _ := cmd.Flags().GetString("format")
	if format == "table" {
		// For blob data, table format just pretty-prints the JSON since blob
		// fields are dynamic.
		return prettyPrintJSON(raw)
	}
	return prettyPrintJSON(raw)
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for format flag registration in cmd/apex/
rg -n --type=go '"format"' cmd/apex/

Repository: evstack/apex

Length of output: 295


🏁 Script executed:

#!/bin/bash
# Check blob_cmd.go around the printJSON function (lines 94-102)
sed -n '90,110p' cmd/apex/blob_cmd.go

# Also check status.go to see how format is handled there
sed -n '15,35p' cmd/apex/status.go

Repository: evstack/apex

Length of output: 1160


Dead branch in printJSON: the format == "table" path is identical to the default.

Both the "table" branch and the fallthrough call prettyPrintJSON(raw), making the flag check a no-op. Either remove the conditional or implement differentiated table output.

🔧 Proposed simplification (if table output is not yet planned)
 func printJSON(cmd *cobra.Command, raw json.RawMessage) error {
-	format, _ := cmd.Flags().GetString("format")
-	if format == "table" {
-		// For blob data, table format just pretty-prints the JSON since blob
-		// fields are dynamic.
-		return prettyPrintJSON(raw)
-	}
-	return prettyPrintJSON(raw)
+	return prettyPrintJSON(raw)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-func printJSON(cmd *cobra.Command, raw json.RawMessage) error {
-	format, _ := cmd.Flags().GetString("format")
-	if format == "table" {
-		// For blob data, table format just pretty-prints the JSON since blob
-		// fields are dynamic.
-		return prettyPrintJSON(raw)
-	}
-	return prettyPrintJSON(raw)
-}
+func printJSON(cmd *cobra.Command, raw json.RawMessage) error {
+	return prettyPrintJSON(raw)
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/apex/blob_cmd.go` around lines 94 - 102, The conditional in printJSON
(which checks cmd.Flags().GetString("format") == "table") is dead because both
branches call prettyPrintJSON(raw); fix by either removing the conditional
entirely and always calling prettyPrintJSON(raw), or implement a distinct table
rendering path (e.g., add a new function renderTableFromJSON and call it when
format == "table") while keeping the existing prettyPrintJSON(raw) as the
default; update references in printJSON accordingly.
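The blob commands validate hex-encoded namespaces before calling the RPC client; here is a hedged sketch of that validation with encoding/hex (parseNamespace is hypothetical).

package main

import (
	"encoding/hex"
	"fmt"
)

// parseNamespace rejects arguments that are not valid, non-empty hex.
func parseNamespace(arg string) ([]byte, error) {
	ns, err := hex.DecodeString(arg)
	if err != nil {
		return nil, fmt.Errorf("namespace must be hex-encoded: %w", err)
	}
	if len(ns) == 0 {
		return nil, fmt.Errorf("namespace must not be empty")
	}
	return ns, nil
}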

Comment on lines 14 to 20
FROM gcr.io/distroless/static-debian12:nonroot

COPY --from=builder /apex /apex

EXPOSE 8080 9090 9091

ENTRYPOINT ["/apex", "start"]

⚠️ Potential issue | 🟡 Minor

Add an explicit USER instruction to satisfy Trivy DS-0002 and make non-root intent self-documenting.

While distroless/static-debian12:nonroot sets UID 65532 in its image config, the Dockerfile itself contains no USER directive. Trivy DS-0002 flags this, and omitting an explicit USER can also cause surprises if the base image tag is ever changed to a variant that defaults to root.

🛡️ Proposed fix
 FROM gcr.io/distroless/static-debian12:nonroot
 
 COPY --from=builder /apex /apex
 
+USER 65532:65532
+
 EXPOSE 8080 9090 9091
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 FROM gcr.io/distroless/static-debian12:nonroot

 COPY --from=builder /apex /apex

+USER 65532:65532
+
 EXPOSE 8080 9090 9091

 ENTRYPOINT ["/apex", "start"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 14 - 20, The Dockerfile lacks an explicit USER
directive which Trivy DS-0002 flags; add an explicit non-root user by appending
a USER instruction (use the base image's nonroot UID 65532) after the COPY
--from=builder /apex /apex step and before ENTRYPOINT so the non-root intent is
self-documenting and survives base-image changes (ensure /apex remains
executable by that UID).

Comment on lines 94 to 95
	_, err := h.store.GetSyncState(ctx)
	storeOK := err == nil || err == store.ErrNotFound

⚠️ Potential issue | 🟡 Minor

Use errors.Is instead of == for sentinel error comparison.

err == store.ErrNotFound will break if GetSyncState ever wraps the sentinel (e.g., fmt.Errorf("...: %w", store.ErrNotFound)). The errors package is already imported in the store layer, and errors.Is is the idiomatic and future-proof comparison.

Proposed fix
 	_, err := h.store.GetSyncState(ctx)
-	storeOK := err == nil || err == store.ErrNotFound
+	storeOK := err == nil || errors.Is(err, store.ErrNotFound)

You'll also need to add "errors" to the import block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/api/health.go` around lines 94 - 95, Replace the sentinel equality check
with errors.Is when computing storeOK in the health handler: change the
expression using err == store.ErrNotFound to errors.Is(err, store.ErrNotFound)
(so the line with GetSyncState result uses err == nil || errors.Is(err,
store.ErrNotFound)); also add the "errors" import to the file's import block so
the errors.Is symbol is available.

Comment on lines 69 to 71
	// cached for lag calculation
	lastLatest  uint64
	lastNetwork uint64

⚠️ Potential issue | 🔴 Critical

Data race on lastLatest / lastNetwork violates the stated concurrency contract.

lastLatest and lastNetwork are plain uint64 fields. SetLatestHeight writes lastLatest and reads lastNetwork; SetNetworkHeight does the inverse — both without any synchronization. The Recorder interface documentation on line 12 explicitly promises concurrent safety, making this a data race that will be flagged by -race.

🔒 Proposed fix using `sync/atomic`
 import (
 	"runtime"
+	"sync/atomic"
 	"time"

 	"github.com/prometheus/client_golang/prometheus"
 	"github.com/prometheus/client_golang/prometheus/promauto"
 )
 	// cached for lag calculation
-	lastLatest  uint64
-	lastNetwork uint64
+	lastLatest  atomic.Uint64
+	lastNetwork atomic.Uint64
 }
 func (r *PromRecorder) SetLatestHeight(h uint64) {
 	r.latestHeight.Set(float64(h))
-	r.lastLatest = h
-	r.syncLag.Set(float64(r.lastNetwork) - float64(h))
+	r.lastLatest.Store(h)
+	r.syncLag.Set(float64(r.lastNetwork.Load()) - float64(h))
 }

 func (r *PromRecorder) SetNetworkHeight(h uint64) {
 	r.networkHeight.Set(float64(h))
-	r.lastNetwork = h
-	r.syncLag.Set(float64(h) - float64(r.lastLatest))
+	r.lastNetwork.Store(h)
+	r.syncLag.Set(float64(h) - float64(r.lastLatest.Load()))
 }

Also applies to: 167-177

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/metrics.go` around lines 69 - 71, The fields lastLatest and
lastNetwork on the Recorder must be made concurrency-safe: switch their accesses
to atomic operations instead of plain reads/writes. Import sync/atomic, keep the
fields as uint64, and replace any direct assignments/reads in SetLatestHeight
and SetNetworkHeight (and the other usages around the 167-177 block) with
atomic.StoreUint64/atomic.LoadUint64 (or atomic.AddUint64 if computing deltas)
so all reads and writes are atomic and the Recorder concurrency contract is
honored; ensure you update every location that touches lastLatest/lastNetwork.

Comment on lines 167 to 177
func (r *PromRecorder) SetLatestHeight(h uint64) {
	r.latestHeight.Set(float64(h))
	r.lastLatest = h
	r.syncLag.Set(float64(r.lastNetwork) - float64(h))
}

func (r *PromRecorder) SetNetworkHeight(h uint64) {
	r.networkHeight.Set(float64(h))
	r.lastNetwork = h
	r.syncLag.Set(float64(h) - float64(r.lastLatest))
}

⚠️ Potential issue | 🟡 Minor

syncLag goes negative before SetNetworkHeight is called.

Both lastLatest and lastNetwork start at 0. If SetLatestHeight is called before any network height is known (common at startup), the lag computes as 0 - h, yielding a negative gauge value. While a Prometheus Gauge doesn't panic on negative values, a negative "lag" is semantically incorrect and can trip alert thresholds unexpectedly.

🛠️ Proposed fix — clamp lag to zero
 func (r *PromRecorder) SetLatestHeight(h uint64) {
 	r.latestHeight.Set(float64(h))
-	r.lastLatest = h
-	r.syncLag.Set(float64(r.lastNetwork) - float64(h))
+	r.lastLatest.Store(h)
+	lag := float64(r.lastNetwork.Load()) - float64(h)
+	if lag < 0 {
+		lag = 0
+	}
+	r.syncLag.Set(lag)
 }

 func (r *PromRecorder) SetNetworkHeight(h uint64) {
 	r.networkHeight.Set(float64(h))
-	r.lastNetwork = h
-	r.syncLag.Set(float64(h) - float64(r.lastLatest))
+	r.lastNetwork.Store(h)
+	lag := float64(h) - float64(r.lastLatest.Load())
+	if lag < 0 {
+		lag = 0
+	}
+	r.syncLag.Set(lag)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/metrics.go` around lines 167 - 177, SetLatestHeight and
SetNetworkHeight compute syncLag directly from lastNetwork and lastLatest which
can produce negative values at startup; change the logic in
PromRecorder.SetLatestHeight and PromRecorder.SetNetworkHeight to clamp the
computed lag to zero (e.g., compute diff := float64(r.lastNetwork) - float64(h)
and if diff < 0 { diff = 0 } before calling r.syncLag.Set(diff)) so the Gauge
never reports a negative lag, and apply the same clamping pattern in
SetNetworkHeight when computing float64(h) - float64(r.lastLatest).

Comment on lines 179 to 185
func (r *PromRecorder) IncBlobsProcessed(n int) {
	r.blobsProcessed.Add(float64(n))
}

func (r *PromRecorder) IncHeadersProcessed(n int) {
	r.headersProcessed.Add(float64(n))
}

⚠️ Potential issue | 🟠 Major

Counter.Add panics on negative n — the int parameter type allows callers to trigger this.

Counter.Add is documented to panic if the value is < 0. Both IncBlobsProcessed(n int) and IncHeadersProcessed(n int) cast the signed n directly to float64 and pass it to Add, meaning any negative argument causes a runtime panic.

🛡️ Proposed fix — guard against non-positive values
 func (r *PromRecorder) IncBlobsProcessed(n int) {
-	r.blobsProcessed.Add(float64(n))
+	if n > 0 {
+		r.blobsProcessed.Add(float64(n))
+	}
 }

 func (r *PromRecorder) IncHeadersProcessed(n int) {
-	r.headersProcessed.Add(float64(n))
+	if n > 0 {
+		r.headersProcessed.Add(float64(n))
+	}
 }

Alternatively, change the Recorder interface signatures to uint to make the constraint part of the type contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/metrics.go` around lines 179 - 185, IncBlobsProcessed and
IncHeadersProcessed on PromRecorder call Counter.Add with a float64(conversion
of int n), which will panic if n < 0; guard these methods by checking n > 0
before calling r.blobsProcessed.Add / r.headersProcessed.Add (return early on n
<= 0), or alternatively update the Recorder interface signatures to use unsigned
types and propagate that change; ensure using the PromRecorder methods' names
(IncBlobsProcessed, IncHeadersProcessed) and the metric fields (blobsProcessed,
headersProcessed) when applying the fix.

Also address PR review feedback:
- Fix data race on PromRecorder lag fields using atomic.Uint64
- Clamp sync lag gauge to zero (avoid negative values at startup)
- Guard Counter.Add against negative n to prevent panic
- Fix JSON pretty-printing (unmarshal into any, not RawMessage)
- Remove dead branch in printJSON
- Use errors.Is for sentinel error comparison in health handler
- Add explicit USER directive in Dockerfile

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
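The JSON pretty-printing fix listed above is easy to picture; here is a hedged sketch of a prettyPrintJSON that decodes into any before re-encoding with indentation. The real helper in cmd/apex may differ in how it writes output.

package main

import (
	"encoding/json"
	"fmt"
)

// prettyPrintJSON re-indents an RPC result by decoding into any first,
// as the follow-up commit describes, instead of passing the raw message through.
func prettyPrintJSON(raw json.RawMessage) error {
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return fmt.Errorf("decode result: %w", err)
	}
	out, err := json.MarshalIndent(v, "", "  ")
	if err != nil {
		return fmt.Errorf("format result: %w", err)
	}
	fmt.Println(string(out))
	return nil
}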
tac0turtle merged commit 4446bf3 into main on Feb 19, 2026
3 checks passed
tac0turtle deleted the feat/phase3-polish branch on February 19, 2026 at 15:20