Skip to content

Add config-driven readiness check that fails deployment when an OSGi plugin does not start #35806

@wezell

Description

@wezell

Description

dotCMS starts OSGi (Felix) bundles asynchronously during boot. InitServlet submits OSGIUtil.initializeFramework() to a background thread (when START_CLIENT_OSGI_IN_SEPARATE_THREAD=true, the default), and Felix file-install then resolves and starts each .jar in the deploy folder in parallel. Because of this, InitServlet completing does not imply that plugins reached ACTIVE — a regressed plugin in a new image can produce a pod that appears "up" while the plugin is silently stuck in INSTALLED or RESOLVED.

In a rolling deployment (Kubernetes, etc.), this is a cutover hazard: the new pod passes readiness, traffic shifts to it, and a critical plugin is dark.

We need a configuration switch that prevents a new deployment from going live if an installed OSGi plugin has not successfully started.

Acceptance Criteria

  • New readiness health check (osgi-bundles) that walks the OSGi framework and requires every non-system, non-fragment bundle to be Bundle.ACTIVE
  • Configurable grace period (default 5 minutes) that begins when OSGIUtil.initializeFramework() completes — during the grace window the check reports UP even if bundles are still starting
  • After the grace window expires, any bundle not ACTIVE returns DOWN with the failing bundle's symbolic name and state in the message
  • Fragment bundles excluded (they legitimately never reach ACTIVE)
  • Optional required.bundles allow-list (CSV of symbolic names) for cases where only specific plugins are deployment-critical
  • Registered as a readiness check only — never liveness — so a broken plugin blocks the rollout but does not restart-loop the pod
  • Honors existing HealthCheckMode convention: PRODUCTION / MONITOR_MODE / DISABLED
  • Structured health data exposes bundlesByState, notActiveBundles, elapsedSinceInitMs, gracePeriodMs for ops visibility
  • Unit test coverage for the four key transitions: pre-init, within-grace, post-grace-with-failure, post-grace-all-active, plus the fragment-exclusion case and required-bundles missing case

Configuration

All keys are settable as DOT_-prefixed environment variables — dots and hyphens both map to underscores via Config.envKey().

Property key Env var Default Purpose
health.check.osgi-bundles.mode DOT_HEALTH_CHECK_OSGI_BUNDLES_MODE PRODUCTION Standard health-check safety mode (PRODUCTION / MONITOR_MODE / DISABLED)
health.check.osgi-bundles.grace.period.ms DOT_HEALTH_CHECK_OSGI_BUNDLES_GRACE_PERIOD_MS 300000 Time window after OSGi init in which bundles may still be starting
health.check.osgi-bundles.required.bundles DOT_HEALTH_CHECK_OSGI_BUNDLES_REQUIRED_BUNDLES (empty) Optional CSV of symbolic names; when empty, every non-system, non-fragment bundle is required

MONITOR_MODE is the recommended initial rollout path: the check logs and reports the failure for one or two releases without actually blocking pods, then ops flips to PRODUCTION once confident.

Why readiness, not hard fail

System.exit on init-thread failure is a worse contract than failing readiness:

  • Readiness failure: Kubernetes/the LB simply never sends traffic to the new pod, the old (working) pod keeps serving, and rollout halts on its own. No request loss.
  • Hard fail: forces a CrashLoopBackOff state, can trigger restart-storm alerting, and is harder to recover from cleanly.

Both achieve "do not cut over." Readiness is strictly better.

Out of scope

  • Detecting bundles that are ACTIVE but whose internal services are broken (a deeper plugin-specific contract — not enforceable from outside the bundle).

Priority

Medium

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status

    New

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions