Description
dotCMS starts OSGi (Felix) bundles asynchronously during boot. InitServlet submits OSGIUtil.initializeFramework() to a background thread (when START_CLIENT_OSGI_IN_SEPARATE_THREAD=true, the default), and Felix file-install then resolves and starts each .jar in the deploy folder in parallel. Because of this, InitServlet completing does not imply that plugins reached ACTIVE — a regressed plugin in a new image can produce a pod that appears "up" while the plugin is silently stuck in INSTALLED or RESOLVED.
In a rolling deployment (Kubernetes, etc.), this is a cutover hazard: the new pod passes readiness, traffic shifts to it, and a critical plugin is dark.
We need a configuration switch that prevents a new deployment from going live if an installed OSGi plugin has not successfully started.
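The hazard described above can be reproduced in miniature. The following is only an illustrative sketch in plain Java, not dotCMS code: `AsyncStartSketch`, `pluginState`, and `bundleMayStart` are hypothetical names, and the latch merely stands in for slow bundle resolution. It shows that the init call returns while the plugin is still in its initial state.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical model of "init returns before the plugin is ACTIVE". */
final class AsyncStartSketch {
    // Stand-in for a plugin's lifecycle state (assumption: strings instead of OSGi constants).
    final AtomicReference<String> pluginState = new AtomicReference<>("INSTALLED");
    // Stand-in for slow resolve/start work; released by the caller.
    final CountDownLatch bundleMayStart = new CountDownLatch(1);

    /** Hands the startup work to a background thread and returns immediately. */
    CompletableFuture<Void> init() {
        return CompletableFuture.runAsync(() -> {
            try {
                bundleMayStart.await();
                pluginState.set("ACTIVE");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    public static void main(String[] args) {
        AsyncStartSketch sketch = new AsyncStartSketch();
        CompletableFuture<Void> startup = sketch.init();
        // init() has returned, but the plugin has not started yet.
        System.out.println("after init: " + sketch.pluginState.get());
        sketch.bundleMayStart.countDown();
        startup.join();
        System.out.println("after startup: " + sketch.pluginState.get());
    }
}
```

If the background work never finishes (a regressed plugin), nothing in the init path notices, which is the gap a readiness check would close.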
Configuration
All keys are settable as DOT_-prefixed environment variables — dots and hyphens both map to underscores via Config.envKey().
| Property key | Env var | Default | Purpose |
|---|---|---|---|
| health.check.osgi-bundles.mode | DOT_HEALTH_CHECK_OSGI_BUNDLES_MODE | PRODUCTION | Standard health-check safety mode (PRODUCTION / MONITOR_MODE / DISABLED) |
| health.check.osgi-bundles.grace.period.ms | DOT_HEALTH_CHECK_OSGI_BUNDLES_GRACE_PERIOD_MS | 300000 | Time window after OSGi init in which bundles may still be starting |
| health.check.osgi-bundles.required.bundles | DOT_HEALTH_CHECK_OSGI_BUNDLES_REQUIRED_BUNDLES | (empty) | Optional CSV of symbolic names; when empty, every non-system, non-fragment bundle is required |
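The key-to-variable mapping can be sketched as follows. This is an assumption-laden illustration of the rule stated above (upper-case, dots and hyphens to underscores, DOT_ prefix), not the actual Config.envKey() implementation; `EnvKeySketch` is a hypothetical name.

```java
import java.util.Locale;

/** Illustrative only: mimics the mapping described in this issue, not the real Config.envKey(). */
final class EnvKeySketch {
    private static final String PREFIX = "DOT_";

    static String envKey(String propertyKey) {
        // Upper-case the key, then turn both dots and hyphens into underscores.
        String normalized = propertyKey
                .toUpperCase(Locale.ROOT)
                .replace('.', '_')
                .replace('-', '_');
        return PREFIX + normalized;
    }

    public static void main(String[] args) {
        // prints DOT_HEALTH_CHECK_OSGI_BUNDLES_GRACE_PERIOD_MS
        System.out.println(envKey("health.check.osgi-bundles.grace.period.ms"));
    }
}
```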
MONITOR_MODE is the recommended initial rollout path: the check logs and reports the failure for one or two releases without actually blocking pods, then ops flips to PRODUCTION once confident.
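One way to picture the three modes is as a gate between the check result and the readiness answer. The sketch below is a guess at the intended semantics based on the sentence above; `ModeSketch` and `ReadinessGateSketch` are hypothetical names, and the real HealthCheckMode handling in dotCMS may differ.

```java
/** Hypothetical stand-in for the health-check mode named in this issue. */
enum ModeSketch { PRODUCTION, MONITOR_MODE, DISABLED }

final class ReadinessGateSketch {
    /** Returns true when the pod should be reported ready. */
    static boolean reportReady(ModeSketch mode, boolean allRequiredBundlesActive) {
        switch (mode) {
            case DISABLED:
                // Assumption: a disabled check never influences readiness.
                return true;
            case MONITOR_MODE:
                // Assumption: the failure is logged and reported, but the pod is not blocked.
                if (!allRequiredBundlesActive) {
                    System.err.println("WARN osgi-bundles: required bundles are not ACTIVE (monitor only)");
                }
                return true;
            case PRODUCTION:
            default:
                // A failing check keeps the pod out of rotation.
                return allRequiredBundlesActive;
        }
    }
}
```

Under this reading, flipping from MONITOR_MODE to PRODUCTION changes only the return value for a failing check, so the logged failures seen during the trial period predict exactly which rollouts would have been blocked.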
Why readiness, not hard fail
System.exit on init-thread failure is a worse contract than failing readiness:
- Readiness failure: Kubernetes/the LB simply never sends traffic to the new pod, the old (working) pod keeps serving, and rollout halts on its own. No request loss.
- Hard fail: forces a CrashLoopBackOff state, can trigger restart-storm alerting, and is harder to recover from cleanly.
Both achieve "do not cut over." Readiness is strictly better.
Out of scope
- Detecting bundles that are ACTIVE but whose internal services are broken (a deeper, plugin-specific contract that cannot be enforced from outside the bundle).
Priority
Medium
Acceptance Criteria
- A health check (osgi-bundles) that walks the OSGi framework and requires every non-system, non-fragment bundle to be Bundle.ACTIVE
- A grace period that starts when OSGIUtil.initializeFramework() completes; during the grace window the check reports UP even if bundles are still starting
- After the grace window, a required bundle that is not ACTIVE returns DOWN with the failing bundle's symbolic name and state in the message
- […] ACTIVE)
- An optional required.bundles allow-list (CSV of symbolic names) for cases where only specific plugins are deployment-critical
- Follows the HealthCheckMode convention: PRODUCTION / MONITOR_MODE / DISABLED
- Reports bundlesByState, notActiveBundles, elapsedSinceInitMs, gracePeriodMs for ops visibility
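The evaluation logic in the acceptance criteria can be sketched roughly as below. This is a minimal model in plain Java under stated assumptions, not the proposed implementation: `BundleInfo` and `Result` are hypothetical stand-ins for the OSGi Bundle and the health-check result types, bundle id 0 is treated as the system bundle, and the state constants reuse the numeric values of org.osgi.framework.Bundle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class OsgiBundlesCheckSketch {
    // Same values as the org.osgi.framework.Bundle state constants.
    static final int INSTALLED = 0x02;
    static final int RESOLVED = 0x04;
    static final int STARTING = 0x08;
    static final int ACTIVE = 0x20;

    /** Hypothetical stand-in for an OSGi bundle; id 0 is the system bundle. */
    static final class BundleInfo {
        final long id;
        final String symbolicName;
        final int state;
        final boolean fragment;

        BundleInfo(long id, String symbolicName, int state, boolean fragment) {
            this.id = id;
            this.symbolicName = symbolicName;
            this.state = state;
            this.fragment = fragment;
        }
    }

    /** Outcome of one evaluation: UP or DOWN plus a human-readable message. */
    static final class Result {
        final boolean up;
        final String message;

        Result(boolean up, String message) {
            this.up = up;
            this.message = message;
        }
    }

    static String stateName(int state) {
        switch (state) {
            case INSTALLED: return "INSTALLED";
            case RESOLVED: return "RESOLVED";
            case STARTING: return "STARTING";
            case ACTIVE: return "ACTIVE";
            default: return "STATE_" + state;
        }
    }

    /** An empty requiredNames set means every non-system, non-fragment bundle is required. */
    static Result evaluate(List<BundleInfo> bundles, Set<String> requiredNames,
                           long elapsedSinceInitMs, long gracePeriodMs) {
        List<String> notActive = new ArrayList<>();
        for (BundleInfo b : bundles) {
            if (b.id == 0 || b.fragment) {
                continue; // skip the system bundle and fragments
            }
            if (!requiredNames.isEmpty() && !requiredNames.contains(b.symbolicName)) {
                continue; // not on the allow-list
            }
            if (b.state != ACTIVE) {
                notActive.add(b.symbolicName + "=" + stateName(b.state));
            }
        }
        if (notActive.isEmpty()) {
            return new Result(true, "all required bundles ACTIVE");
        }
        if (elapsedSinceInitMs < gracePeriodMs) {
            return new Result(true, "within grace period, still starting: " + notActive);
        }
        return new Result(false, "bundles not ACTIVE: " + notActive);
    }
}
```

Note that this sketch does not flag an allow-listed name that is missing from the framework entirely; whether that case should count as DOWN is left open here.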