
Add do-obsd -> do-otelcol healthcheck with supervision loop #6

Open

jnaidu-do wants to merge 2 commits into main from jnaidu/IN-6898

Conversation

@jnaidu-do (Collaborator)

Summary

  • Enables the healthcheck extension (v1) in the otelcol config on localhost:13134, a non-default port chosen to avoid conflicts with customer-run collectors (see the config sketch after this list)

  • Adds a supervision loop in do-obsd that polls the health endpoint every 15s and restarts do-otelcol via systemctl restart after 3 consecutive failures, with exponential backoff (5s-60s) between attempts and a cap of 5 restarts

  • Adds CheckHealth() and Restart() methods to the collector package with unit tests
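For reference, a minimal sketch of what the enabled extension could look like in packaging/config/otelcol-config.yaml (only the extension name and port come from this PR; the surrounding keys follow the standard collector layout):

```yaml
extensions:
  health_check:
    # Non-default port to avoid conflicts with customer-run collectors.
    endpoint: "localhost:13134"

service:
  extensions: [health_check]
```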

Note

This will not be necessary once we implement the OpAMP-go SDK, since OpAMP will report collector health to the server (push-based ComponentHealth) and supervise the agent lifecycle, replacing this HTTP polling loop.

Tests

| Scenario | Result |
| --- | --- |
| Frozen process (SIGSTOP) | do-obsd detected 3 consecutive timeout failures and restarted otelcol after a 5s backoff; new PID confirmed. systemd did NOT detect this. |
| Process crash (kill -9) | Healthcheck returned connection refused. systemd also restarts in this case via Restart=on-failure. |
| Config tampering (extension removed) | otelcol ran, but the healthcheck port refused connections; do-obsd would detect this and restart. |
| Graceful stop | Healthcheck correctly showed the service as down. |
| Recovery | After the restart, do-obsd logged "collector healthy, resetting counters" and reset the failure/restart counters. |

- Enable healthcheck extension (v1) on localhost:13134 in otelcol config
- Add CheckHealth() that probes the collector's HTTP health endpoint (see the sketch after this list)
- Add Restart() method (systemctl restart do-otelcol.service)
- Replace signal-wait in main with a supervision loop:
  - Polls health every 15s
  - After 3 consecutive failures, restarts with exponential backoff
  - Caps at 5 restart attempts before giving up
  - Resets all counters on recovery
- Add unit tests for CheckHealth, Restart, and connection refused
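
As a rough sketch of the CheckHealth() shape described above (the healthURL and httpClient field names and the healthTimeout constant are assumptions, not taken from this PR):

```go
// CheckHealth probes the collector's HTTP health endpoint and reports
// non-200 responses and transport errors (refused, timeout) as failures.
func (c *Collector) CheckHealth(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, healthTimeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.healthURL, nil)
	if err != nil {
		return fmt.Errorf("build health request: %w", err)
	}
	resp, err := c.httpClient.Do(req)
	if err != nil {
		return fmt.Errorf("health request failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected health status: %d", resp.StatusCode)
	}
	return nil
}
```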

Made-with: Cursor
@jnaidu-do requested review from atelpis and Copilot and removed the request for Copilot · April 1, 2026 15:04

Copilot AI left a comment


Pull request overview

This PR adds a local OpenTelemetry Collector health endpoint and a do-obsd supervision loop that polls it to detect “running but broken” collector states and restart do-otelcol when needed.

Changes:

  • Enable the collector health_check extension on localhost:13134 and wire it into the service config.
  • Add Collector.CheckHealth() (HTTP 200 check w/ timeout) and Collector.Restart() (systemctl restart), with unit tests; a sketch of Restart() follows this list.
  • Replace do-obsd’s “wait for signal then stop” flow with a long-running supervision loop (15s polling, 3-failure threshold, exponential backoff, max restart budget).
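
A plausible shape for Restart() given the description above (hypothetical sketch; the actual implementation in internal/collector/collector.go may differ):

```go
// Restart bounces the managed collector via systemd; CombinedOutput
// captures stderr so a failed restart surfaces systemctl's message.
func (c *Collector) Restart(ctx context.Context) error {
	cmd := exec.CommandContext(ctx, "systemctl", "restart", "do-otelcol.service")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("systemctl restart do-otelcol: %w: %s", err, out)
	}
	return nil
}
```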

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packaging/config/otelcol-config.yaml | Enables the health_check extension on a non-default port and registers it in service.extensions. |
| internal/collector/collector.go | Adds health endpoint constants, HTTP client setup (no redirects), and the Restart() implementation. |
| internal/collector/health.go | Implements CheckHealth() with a request timeout and connection reuse. |
| internal/collector/health_test.go | Adds unit tests for CheckHealth() (200, non-200, timeout, refused, redirect). |
| internal/collector/collector_test.go | Adds unit tests for Restart(). |
| cmd/do-obsd/main.go | Adds the supervision loop that polls health and restarts on repeated failures. |


Comment thread cmd/do-obsd/main.go
Comment on lines 57 to +86
```diff
-	sigs := make(chan os.Signal, 1)
-	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
-	<-sigs
-
-	slog.Info("stopping")
+	ticker := time.NewTicker(healthCheckInterval)
+	defer ticker.Stop()
+
+	ctx, cancel := context.WithCancel(context.Background())
+	defer cancel()
+
+	consecutiveFails := 0
+	restartCount := 0
+	backoff := initialBackoff

-	if err := col.Stop(); err != nil {
-		slog.Warn("stop collector failed", "err", err)
-	}
+	// shutdown cancels in-flight health checks and stops the collector.
+	// Returns nil so the process exits cleanly with code 0.
+	shutdown := func(sig os.Signal) error {
+		slog.Info("received signal, stopping", "signal", sig)
+		cancel()
+		if err := col.Stop(); err != nil {
+			slog.Warn("stop collector failed", "err", err)
+		}
+		return nil
+	}

-	return nil
+	for {
+		select {
+		case <-ticker.C:
+			if err := col.CheckHealth(ctx); err != nil {
+				consecutiveFails++
+				slog.Warn("health check failed",
```

Copilot AI Apr 1, 2026


SIGTERM/SIGINT handling can be delayed and can also race with the ticker because CheckHealth runs synchronously inside the select. If a signal arrives while CheckHealth is in-flight, the loop can’t read from sigs until the call returns (up to HealthTimeout), and when it does return both ticker.C and sigs may be ready so the select may pick another tick and even attempt a restart before shutdown. Consider switching to signal.NotifyContext(...) and selecting on ctx.Done() (and using that ctx for CheckHealth/backoff) so cancellation happens immediately on signal and the loop exits deterministically without performing restart work during shutdown.
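
A sketch of the suggested restructuring (ticker, col, and the shutdown behavior come from the diff above; the exact wiring is an assumption):

```go
// signal.NotifyContext cancels ctx the moment SIGTERM/SIGINT arrives,
// so an in-flight CheckHealth(ctx) aborts immediately instead of
// running out its timeout, and no restart work happens after a signal.
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
defer stop()

for {
	select {
	case <-ctx.Done():
		slog.Info("received signal, stopping")
		if err := col.Stop(); err != nil {
			slog.Warn("stop collector failed", "err", err)
		}
		return nil
	case <-ticker.C:
		if err := col.CheckHealth(ctx); err != nil {
			// ...existing failure counting / backoff / restart logic...
		}
	}
}
```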

Comment on lines +43 to +50
name: "timeout",
handler: func(w http.ResponseWriter, _ *http.Request) {
time.Sleep(5 * time.Second)
w.WriteHeader(http.StatusOK)
},
wantErr: true,
errSubstr: "context deadline exceeded",
},

Copilot AI Apr 1, 2026


The "timeout" test currently sleeps for 5s, which makes the suite take at least HealthTimeout (~3s) to run. To keep tests fast and deterministic, consider using a shorter parent context deadline (shorter than HealthTimeout) and only blocking the handler slightly longer than that (e.g., ~100–200ms) so the timeout path is exercised without multi-second sleeps.
