
Add do-obsd -> do-otelcol healthcheck with supervision loop #6

Open

jnaidu-do wants to merge 2 commits into main from jnaidu/IN-6898

Conversation

@jnaidu-do (Collaborator)

Summary

  • Enables the healthcheck extension (v1) in the otelcol config on localhost:13134, a non-default port chosen to avoid conflicts with customer-run collectors (see the config sketch after this list)

  • Adds a supervision loop in do-obsd that polls the health endpoint every 15s and restarts do-otelcol via systemctl restart after 3 consecutive failures, with exponential backoff (5s-60s) between attempts and a cap of 5 restarts

  • Adds CheckHealth() and Restart() methods to the collector package with unit tests
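For reference, a minimal sketch of what the enabled extension could look like in packaging/config/otelcol-config.yaml (only the extension name and port come from this PR; the surrounding keys follow the standard collector layout):

```yaml
extensions:
  health_check:
    # Non-default port to avoid conflicts with customer-run collectors.
    endpoint: "localhost:13134"

service:
  extensions: [health_check]
```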

Note

This will not be necessary once we implement the OpAMP-go SDK, since OpAMP will report collector health to the server (push-based ComponentHealth) and supervise the agent lifecycle, replacing this HTTP polling loop.

Tests

| Scenario | Result |
| --- | --- |
| Frozen process (SIGSTOP) | do-obsd detected 3 consecutive timeout failures and restarted otelcol after a 5s backoff; new PID confirmed. systemd did NOT detect this. |
| Process crash (kill -9) | Healthcheck returned connection refused. systemd also restarts in this case via Restart=on-failure. |
| Config tampering (extension removed) | otelcol ran, but the healthcheck port refused connections; do-obsd would detect this and restart. |
| Graceful stop | Healthcheck correctly showed the service as down. |
| Recovery | After the restart, do-obsd logged "collector healthy, resetting counters" and reset the failure/restart counters. |

- Enable healthcheck extension (v1) on localhost:13134 in otelcol config
- Add CheckHealth() that probes the collector's HTTP health endpoint (see the sketch after this list)
- Add Restart() method (systemctl restart do-otelcol.service)
- Replace signal-wait in main with a supervision loop:
  - Polls health every 15s
  - After 3 consecutive failures, restarts with exponential backoff
  - Caps at 5 restart attempts before giving up
  - Resets all counters on recovery
- Add unit tests for CheckHealth, Restart, and connection refused
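
As a rough sketch of the CheckHealth() shape described above (the healthURL and httpClient field names and the healthTimeout constant are assumptions, not taken from this PR):

```go
// CheckHealth probes the collector's HTTP health endpoint and reports
// non-200 responses and transport errors (refused, timeout) as failures.
func (c *Collector) CheckHealth(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, healthTimeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.healthURL, nil)
	if err != nil {
		return fmt.Errorf("build health request: %w", err)
	}
	resp, err := c.httpClient.Do(req)
	if err != nil {
		return fmt.Errorf("health request failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected health status: %d", resp.StatusCode)
	}
	return nil
}
```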

Made-with: Cursor
@jnaidu-do requested review from atelpis and Copilot and removed the request for Copilot · April 1, 2026 15:04

Copilot AI left a comment


Pull request overview

This PR adds a local OpenTelemetry Collector health endpoint and a do-obsd supervision loop that polls it to detect “running but broken” collector states and restart do-otelcol when needed.

Changes:

  • Enable the collector health_check extension on localhost:13134 and wire it into the service config.
  • Add Collector.CheckHealth() (HTTP 200 check w/ timeout) and Collector.Restart() (systemctl restart), with unit tests; a sketch of Restart() follows this list.
  • Replace do-obsd’s “wait for signal then stop” flow with a long-running supervision loop (15s polling, 3-failure threshold, exponential backoff, max restart budget).
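
A plausible shape for Restart() given the description above (hypothetical sketch; the actual implementation in internal/collector/collector.go may differ):

```go
// Restart bounces the managed collector via systemd; CombinedOutput
// captures stderr so a failed restart surfaces systemctl's message.
func (c *Collector) Restart(ctx context.Context) error {
	cmd := exec.CommandContext(ctx, "systemctl", "restart", "do-otelcol.service")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("systemctl restart do-otelcol: %w: %s", err, out)
	}
	return nil
}
```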

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packaging/config/otelcol-config.yaml | Enables the health_check extension on a non-default port and registers it in service.extensions. |
| internal/collector/collector.go | Adds health endpoint constants, HTTP client setup (no redirects), and the Restart() implementation. |
| internal/collector/health.go | Implements CheckHealth() with a request timeout and connection reuse. |
| internal/collector/health_test.go | Adds unit tests for CheckHealth() (200, non-200, timeout, refused, redirect). |
| internal/collector/collector_test.go | Adds unit tests for Restart(). |
| cmd/do-obsd/main.go | Adds the supervision loop that polls health and restarts on repeated failures. |


Comment thread cmd/do-obsd/main.go
Comment on lines 57 to +86
```diff
-	sigs := make(chan os.Signal, 1)
-	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
-	<-sigs
-
-	slog.Info("stopping")
+	ticker := time.NewTicker(healthCheckInterval)
+	defer ticker.Stop()
+
+	ctx, cancel := context.WithCancel(context.Background())
+	defer cancel()
+
+	consecutiveFails := 0
+	restartCount := 0
+	backoff := initialBackoff

-	if err := col.Stop(); err != nil {
-		slog.Warn("stop collector failed", "err", err)
-	}
+	// shutdown cancels in-flight health checks and stops the collector.
+	// Returns nil so the process exits cleanly with code 0.
+	shutdown := func(sig os.Signal) error {
+		slog.Info("received signal, stopping", "signal", sig)
+		cancel()
+		if err := col.Stop(); err != nil {
+			slog.Warn("stop collector failed", "err", err)
+		}
+		return nil
+	}

-	return nil
+	for {
+		select {
+		case <-ticker.C:
+			if err := col.CheckHealth(ctx); err != nil {
+				consecutiveFails++
+				slog.Warn("health check failed",
```

Copilot AI Apr 1, 2026


SIGTERM/SIGINT handling can be delayed and can also race with the ticker because CheckHealth runs synchronously inside the select. If a signal arrives while CheckHealth is in-flight, the loop can’t read from sigs until the call returns (up to HealthTimeout), and when it does return both ticker.C and sigs may be ready so the select may pick another tick and even attempt a restart before shutdown. Consider switching to signal.NotifyContext(...) and selecting on ctx.Done() (and using that ctx for CheckHealth/backoff) so cancellation happens immediately on signal and the loop exits deterministically without performing restart work during shutdown.
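
A sketch of the suggested restructuring (ticker, col, and the shutdown behavior come from the diff above; the exact wiring is an assumption):

```go
// signal.NotifyContext cancels ctx the moment SIGTERM/SIGINT arrives,
// so an in-flight CheckHealth(ctx) aborts immediately instead of
// running out its timeout, and no restart work happens after a signal.
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
defer stop()

for {
	select {
	case <-ctx.Done():
		slog.Info("received signal, stopping")
		if err := col.Stop(); err != nil {
			slog.Warn("stop collector failed", "err", err)
		}
		return nil
	case <-ticker.C:
		if err := col.CheckHealth(ctx); err != nil {
			// ...existing failure counting / backoff / restart logic...
		}
	}
}
```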

Comment on lines +43 to +50
name: "timeout",
handler: func(w http.ResponseWriter, _ *http.Request) {
time.Sleep(5 * time.Second)
w.WriteHeader(http.StatusOK)
},
wantErr: true,
errSubstr: "context deadline exceeded",
},

Copilot AI Apr 1, 2026


The "timeout" test currently sleeps for 5s, which makes the suite take at least HealthTimeout (~3s) to run. To keep tests fast and deterministic, consider using a shorter parent context deadline (shorter than HealthTimeout) and only blocking the handler slightly longer than that (e.g., ~100–200ms) so the timeout path is exercised without multi-second sleeps.
