
fix(telemetry): fix telemetry DB connection leak in MetricStatsCollector (#34480) #34490

spbolton wants to merge 2 commits into main from issue-34480-fixtelemetry-db-connection-leak-in
Conversation

spbolton (Contributor) commented on Feb 3, 2026

Description

Fixes a probable database connection leak in telemetry metrics collection where connections were not properly returned to the HikariCP pool. Based on code analysis and production evidence, the leak likely occurred during a race condition when metric queries timed out.

Probable Root Cause Analysis (Unconfirmed)

⚠️ Important: This root cause is based on code analysis and production evidence but has NOT been confirmed through replication. We cannot definitively prove this is the exact cause without being able to reproduce the issue.

Suspected mechanism: The leak likely occurred through a narrow race condition window (~milliseconds) when the following sequence aligned (a code sketch follows the list):

  1. A database metric query is executing (original code had no connection lifecycle wrapper)
  2. The query exceeds the timeout threshold (~1000ms)
  3. Timeout fires and cancel(true) immediately interrupts the thread
  4. The interrupt arrives while the thread is using or closing the database connection
  5. Connection cleanup fails due to the interruption, leaving the connection orphaned in the pool
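
A minimal sketch of the suspected vulnerable sequence on the collector side (the executor setup and variable names are illustrative, not the actual MetricStatsCollector code):

// Illustrative fragment: a timeout-triggered cancel(true) can deliver an
// interrupt while the task is still using or closing its DB connection.
Future<Number> future = executor.submit(metric::getValue);
try {
    return future.get(1, TimeUnit.SECONDS);  // ~1000ms threshold
} catch (TimeoutException e) {
    future.cancel(true);  // interrupt may land mid-cleanup, leaving the
                          // connection marked "active" in the HikariCP pool
    throw e;
}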

Original code pattern (no connection management):

// DBMetricType.getValue() - original implementation
return APILocator.getMetricsAPI().getValue(getSqlQuery());

This pattern had no explicit connection lifecycle management, making it vulnerable to interruption during query execution or cleanup.

Supporting evidence:

  • Production connections were stuck for 11+ days with queries that should complete in milliseconds
  • Only ~30 connections leaked over weeks, consistent with a rare race condition
  • Original code lacked explicit connection lifecycle management
  • Aggressive shutdownNow() immediately interrupted all running tasks
  • HikariCP pool showed connections as "active" but not actually executing queries

Confidence level: High probability based on code analysis, but unconfirmed without replication.

⚠️ Test Limitations

We have NOT been able to replicate this issue in testing. The race condition (if this is indeed the cause) is extremely timing-dependent and occurs in a narrow millisecond window. The integration test does NOT fail with the old code - it passes both before and after our fixes.

This is expected behavior because:

  • The suspected race condition requires precise timing alignment between timeout firing and query execution/cleanup
  • Production leaked only ~30 connections over weeks, indicating how rare the condition is (if our hypothesis is correct)
  • Our fixes are preventative - they eliminate the suspected race condition window entirely

The test validates that our fixes don't break existing functionality and that the timeout mechanism works correctly, but cannot reliably trigger the suspected race condition.

Changes

1. Database Connection Lifecycle Management (PRIMARY FIX)

File: dotCMS/src/main/java/com/dotcms/telemetry/collectors/DBMetricType.java

  • Wrapped getValue() calls in DbConnectionFactory.wrapConnection() to ensure proper connection lifecycle
  • Ensures connections opened for metrics are properly closed even if interrupted
  • This wrapper did not exist in the original code - adding it is the primary fix

Before:

return APILocator.getMetricsAPI().getValue(getSqlQuery());

After:

return DbConnectionFactory.wrapConnection(() ->
    APILocator.getMetricsAPI().getValue(getSqlQuery())
);
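
For intuition, a hypothetical sketch of the guarantee such a wrapper provides; this is NOT the actual DbConnectionFactory.wrapConnection implementation, and the method name and parameters here are illustrative:

import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical: run `work`, then release the connection unconditionally in
// finally, so an interrupt thrown out of `work` cannot orphan the connection.
static <T> T wrapConnectionSketch(final Callable<T> work, final Connection conn) throws Exception {
    try {
        return work.call();
    } finally {
        try {
            conn.close();  // hands the connection back to the HikariCP pool
        } catch (final SQLException ignored) {
            // best-effort close; never mask the primary exception
        }
    }
}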

2. Race Condition Prevention - Layer 1: Task Completion Check

File: dotCMS/src/main/java/com/dotcms/telemetry/collectors/MetricStatsCollector.java (line 187)

  • Added isDone() check before calling cancel(true) on timeout
  • Prevents interrupting tasks that have just completed, avoiding potential race condition
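
A sketch of the guarded cancellation (variable names are illustrative):

try {
    result = future.get(timeoutSeconds, TimeUnit.SECONDS);
} catch (TimeoutException e) {
    if (!future.isDone()) {   // skip tasks that completed in the race window
        future.cancel(true);  // only interrupt tasks that are still running
    }
}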

3. Race Condition Prevention - Layer 2: Graceful Shutdown

File: dotCMS/src/main/java/com/dotcms/telemetry/collectors/MetricStatsCollector.java (lines 246-264)

  • Replaced aggressive shutdownNow() with graceful shutdown() + awaitTermination(5s)
  • Gives running tasks time to complete cleanup before forcing interruption
  • Reduces likelihood of interrupting connection cleanup code

Before:

executor.shutdownNow();

After:

executor.shutdown();
if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
    executor.shutdownNow();
    // ... additional handling
}
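
This mirrors the standard two-phase shutdown recipe from the ExecutorService javadoc; a self-contained sketch (the logging and second wait are illustrative details):

static void shutdownGracefully(final ExecutorService executor) {
    executor.shutdown();  // stop accepting new tasks; let running ones finish
    try {
        if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
            executor.shutdownNow();  // only now interrupt the stragglers
            if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
                // pool did not terminate; log and move on
            }
        }
    } catch (final InterruptedException ie) {
        executor.shutdownNow();              // re-cancel if we were interrupted
        Thread.currentThread().interrupt();  // preserve the interrupt status
    }
}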

4. Race Condition Prevention - Layer 3: Resilient Finally Block

File: dotCMS/src/main/java/com/dotmarketing/db/DbConnectionFactory.java (lines 788-801)

  • Made finally block interrupt-resilient by clearing interrupt flag during cleanup
  • Ensures connection cleanup completes even if interrupted
  • Restores interrupt flag after cleanup to preserve thread state

This layer works in conjunction with the new wrapConnection() usage in #1 above.
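
A sketch of the interrupt-resilient cleanup pattern (the helper name is illustrative, not the actual DbConnectionFactory code):

import java.sql.Connection;
import java.sql.SQLException;

// Clear the interrupt flag before close() so cleanup cannot be aborted,
// then restore it so callers still observe the interruption.
static void closeResilient(final Connection connection) {
    final boolean wasInterrupted = Thread.interrupted();  // clears the flag
    try {
        if (connection != null && !connection.isClosed()) {
            connection.close();  // returns the connection to the HikariCP pool
        }
    } catch (final SQLException e) {
        // best-effort: cleanup must never throw
    } finally {
        if (wasInterrupted) {
            Thread.currentThread().interrupt();  // restore thread state
        }
    }
}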

5. Configurable Timeout

File: dotCMS/src/main/java/com/dotcms/telemetry/business/TimeoutConfig.java

  • Added TELEMETRY_METRIC_TIMEOUT_SECONDS configuration property
  • Default: 10 seconds (production), 1 second (tests)
  • Allows tuning for slow-running database metrics
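
Presumably the property is resolved through dotCMS's standard Config lookup; a one-line sketch (assuming Config.getIntProperty; the actual TimeoutConfig code may differ):

// Read the timeout with the 10-second production default.
final int timeoutSeconds = Config.getIntProperty("TELEMETRY_METRIC_TIMEOUT_SECONDS", 10);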

6. Integration Tests

Files:

  • dotcms-integration/src/test/java/com/dotcms/telemetry/collectors/MetricTimeoutTest.java (new)
  • dotcms-integration/src/test/java/com/dotcms/telemetry/collectors/SlowDatabaseTestMetric.java (new)

Test Coverage:

  • Validates timeout mechanism fires correctly
  • Tests database connection management using PostgreSQL pg_sleep(3)
  • Verifies connections are properly returned to pool after timeout
  • Checks for connection accumulation over multiple iterations
  • Uses cache bypass to ensure fresh query execution

Note: The test validates correct behavior but does NOT reproduce the suspected race condition leak.
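
A sketch of the kind of slow metric the test relies on (illustrative; the real SlowDatabaseTestMetric and the full DBMetricType contract may differ, though getSqlQuery() matches the snippets above):

// Hypothetical slow metric: pg_sleep(3) holds a live PostgreSQL connection
// for 3 seconds, well past the 1-second test timeout, so the timeout and
// connection-return paths run against a real pooled connection.
public class SlowDatabaseTestMetric implements DBMetricType {
    @Override
    public String getSqlQuery() {
        return "SELECT pg_sleep(3)";
    }
    // other DBMetricType members omitted for brevity
}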

Testing

Manual Testing

  • Verified no connection leaks in HikariCP pool monitoring
  • Tested with multiple concurrent metric collection cycles
  • Confirmed timeout configuration works correctly

Integration Tests

./mvnw verify -pl :dotcms-integration -Dcoreit.test.skip=false -Dit.test=MetricTimeoutTest

Test validates:

  1. Timeouts fire at configured threshold (1s in tests)
  2. Database connections are properly managed during timeout
  3. No connection accumulation over multiple timeout cycles
  4. Cache bypass ensures actual database queries are tested

Test Limitation: The test passes with both old and new code because the suspected race condition is too rare to trigger reliably in testing. The test serves to validate that our fixes don't break existing functionality and that the timeout mechanism works correctly.

Test Design Notes

  • Uses PostgreSQL's pg_sleep(3) to hold real database connections
  • Tests the actual DbConnectionFactory.wrapConnection() code path
  • Cache bypass parameter ensures fresh query execution on every test

Configuration

New configuration property in dotcms-config.properties:

# Telemetry metric timeout in seconds (default: 10)
TELEMETRY_METRIC_TIMEOUT_SECONDS=10

For tests, set to 1 second via system property:

telemetry.metric.timeout.seconds=1
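
For example, appended to the integration test command above (assuming the value is passed as an ordinary JVM system property):

./mvnw verify -pl :dotcms-integration -Dcoreit.test.skip=false -Dit.test=MetricTimeoutTest -Dtelemetry.metric.timeout.seconds=1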

Production Impact

  • Positive: Addresses the most probable cause of connection leaks based on code analysis
  • Performance: Negligible - adds at most a 5-second graceful-shutdown wait per collection cycle
  • Compatibility: Fully backward compatible
  • Configuration: Default 10s timeout suitable for most metrics
  • Monitoring: Will require production monitoring to confirm the leak is resolved

Defense-in-Depth Strategy

This fix implements a four-layer defense targeting the suspected race condition:

  1. Connection Lifecycle: Wrap all DB metric queries in proper connection management (primary fix)
  2. Prevention: Check isDone() before canceling to avoid racing with task completion
  3. Mitigation: Graceful shutdown gives tasks time to complete cleanup
  4. Recovery: Resilient finally block completes cleanup even if interrupted

All four layers work together to eliminate the suspected race condition window, which should prevent connection leaks if our hypothesis is correct.

Risk Assessment

Low Risk Change:

  • Fixes are defensive and preventative in nature
  • No behavior changes under normal operation
  • Test coverage validates no regression in timeout handling
  • Changes follow established patterns from HikariCP best practices
  • Production evidence strongly suggests the race condition as root cause (connections stuck for 11+ days with queries that should complete in milliseconds)
  • If hypothesis is incorrect: No harm done - the fixes improve connection handling regardless
  • If hypothesis is correct: Should eliminate the leak entirely

Next Steps

  • Deploy to production with monitoring
  • Verify connection leak is resolved over time
  • If leak persists, further investigation needed to identify alternative root causes

Closes #34480

🤖 Generated with Claude Code

spbolton force-pushed the issue-34480-fixtelemetry-db-connection-leak-in branch from 80a422f to 7db7edb on February 3, 2026 at 22:01
spbolton enabled auto-merge on February 4, 2026 at 17:46
spbolton added this pull request to the merge queue on Feb 5, 2026. Any commits made after this event will not be merged.

Development

Successfully merging this pull request may close these issues.

fix(telemetry): DB connection leak in MetricStatsCollector — executor thread connections never returned to HikariCP pool
