Add do-obsd -> do-otelcol healthcheck with supervision loop#6
Conversation
- Enable healthcheck extension (v1) on `localhost:13134` in otelcol config
- Add `CheckHealth()` that probes the collector's HTTP health endpoint
- Add `Restart()` method (`systemctl restart do-otelcol.service`)
- Replace signal-wait in main with a supervision loop:
  - Polls health every 15s
  - After 3 consecutive failures, restarts with exponential backoff
  - Caps at 5 restart attempts before giving up
  - Resets all counters on recovery
- Add unit tests for CheckHealth, Restart, and connection refused

Made-with: Cursor
Pull request overview
This PR adds a local OpenTelemetry Collector health endpoint and a do-obsd supervision loop that polls it to detect “running but broken” collector states and restart do-otelcol when needed.
Changes:
- Enable the collector `health_check` extension on `localhost:13134` and wire it into the service config.
- Add `Collector.CheckHealth()` (HTTP 200 check w/ timeout) and `Collector.Restart()` (systemctl restart), with unit tests.
- Replace do-obsd's "wait for signal then stop" flow with a long-running supervision loop (15s polling, 3-failure threshold, exponential backoff, max restart budget).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packaging/config/otelcol-config.yaml | Enables health_check extension on a non-default port and registers it in service.extensions. |
| internal/collector/collector.go | Adds health endpoint constants, HTTP client setup (no redirects), and Restart() implementation. |
| internal/collector/health.go | Implements CheckHealth() with request timeout and connection reuse. |
| internal/collector/health_test.go | Adds unit tests for CheckHealth() (200, non-200, timeout, refused, redirect). |
| internal/collector/collector_test.go | Adds unit tests for Restart(). |
| cmd/do-obsd/main.go | Adds the supervision loop that polls health and restarts on repeated failures. |
```diff
-	sigs := make(chan os.Signal, 1)
-	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
-	<-sigs
-	slog.Info("stopping")
-	if err := col.Stop(); err != nil {
-		slog.Warn("stop collector failed", "err", err)
-	}
-	return nil
+	ticker := time.NewTicker(healthCheckInterval)
+	defer ticker.Stop()
+
+	ctx, cancel := context.WithCancel(context.Background())
+	defer cancel()
+
+	consecutiveFails := 0
+	restartCount := 0
+	backoff := initialBackoff
+
+	// shutdown cancels in-flight health checks and stops the collector.
+	// Returns nil so the process exits cleanly with code 0.
+	shutdown := func(sig os.Signal) error {
+		slog.Info("received signal, stopping", "signal", sig)
+		cancel()
+		if err := col.Stop(); err != nil {
+			slog.Warn("stop collector failed", "err", err)
+		}
+		return nil
+	}
+
+	for {
+		select {
+		case <-ticker.C:
+			if err := col.CheckHealth(ctx); err != nil {
+				consecutiveFails++
+				slog.Warn("health check failed",
```
SIGTERM/SIGINT handling can be delayed and can also race with the ticker because CheckHealth runs synchronously inside the select. If a signal arrives while CheckHealth is in-flight, the loop can’t read from sigs until the call returns (up to HealthTimeout), and when it does return both ticker.C and sigs may be ready so the select may pick another tick and even attempt a restart before shutdown. Consider switching to signal.NotifyContext(...) and selecting on ctx.Done() (and using that ctx for CheckHealth/backoff) so cancellation happens immediately on signal and the loop exits deterministically without performing restart work during shutdown.
```go
{
	name: "timeout",
	handler: func(w http.ResponseWriter, _ *http.Request) {
		time.Sleep(5 * time.Second)
		w.WriteHeader(http.StatusOK)
	},
	wantErr:   true,
	errSubstr: "context deadline exceeded",
},
```
The "timeout" test currently sleeps for 5s, which makes the suite take at least HealthTimeout (~3s) to run. To keep tests fast and deterministic, consider using a shorter parent context deadline (shorter than HealthTimeout) and only blocking the handler slightly longer than that (e.g., ~100–200ms) so the timeout path is exercised without multi-second sleeps.
Summary
- Enables the `health_check` extension (v1) in the otelcol config on `localhost:13134` (non-default port to avoid conflicts with customer collectors)
- Adds a supervision loop in do-obsd that polls the health endpoint every 15s, restarts do-otelcol via `systemctl restart` after 3 consecutive failures, with exponential backoff (5s-60s) and a max of 5 restarts
- Adds `CheckHealth()` and `Restart()` methods to the collector package with unit tests
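For reference, the enabled extension in `packaging/config/otelcol-config.yaml` likely looks something like this sketch, which assumes the standard `health_check` extension fields; the actual file may carry more settings:

```yaml
extensions:
  health_check:
    endpoint: "localhost:13134"  # non-default port to avoid customer collector conflicts

service:
  extensions: [health_check]
```

With this in place, a GET to `http://localhost:13134/` returns 200 while the collector pipeline is healthy, which is the signal `CheckHealth()` polls.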
Note
This will not be necessary once we implement the OpAMP-go SDK, since OpAMP will report collector health to the server (push-based ComponentHealth) and supervise the agent lifecycle, replacing this HTTP polling loop.
Tests