
fix(telemetry): fix telemetry DB connection leak in MetricStatsCollector (#34480) #34490

spbolton wants to merge 2 commits into main from issue-34480-fixtelemetry-db-connection-leak-in
Conversation

spbolton (Contributor) commented on Feb 3, 2026

Description

Fixes a probable database connection leak in telemetry metrics collection where connections were not properly returned to the HikariCP pool. Based on code analysis and production evidence, the leak likely occurred during a race condition when metric queries timed out.

Probable Root Cause Analysis (Unconfirmed)

⚠️ Important: This root cause is based on code analysis and production evidence but has NOT been confirmed through replication. We cannot definitively prove this is the exact cause without being able to reproduce the issue.

Suspected mechanism: The leak likely occurred through a narrow race condition window (~milliseconds) when the following sequence aligned (a code sketch follows the list):

  1. A database metric query is executing (original code had no connection lifecycle wrapper)
  2. The query exceeds the timeout threshold (~1000ms)
  3. Timeout fires and cancel(true) immediately interrupts the thread
  4. The interrupt arrives while the thread is using or closing the database connection
  5. Connection cleanup fails due to the interruption, leaving the connection orphaned in the pool
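
A minimal sketch of the suspected vulnerable sequence on the collector side (the executor setup and variable names are illustrative, not the actual MetricStatsCollector code):

// Illustrative fragment: a timeout-triggered cancel(true) can deliver an
// interrupt while the task is still using or closing its DB connection.
Future<Number> future = executor.submit(metric::getValue);
try {
    return future.get(1, TimeUnit.SECONDS);  // ~1000ms threshold
} catch (TimeoutException e) {
    future.cancel(true);  // interrupt may land mid-cleanup, leaving the
                          // connection marked "active" in the HikariCP pool
    throw e;
}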

Original code pattern (no connection management):

// DBMetricType.getValue() - original implementation
return APILocator.getMetricsAPI().getValue(getSqlQuery());

This pattern had no explicit connection lifecycle management, making it vulnerable to interruption during query execution or cleanup.

Supporting evidence:

  • Production connections were stuck for 11+ days with queries that should complete in milliseconds
  • Only ~30 connections leaked over weeks, consistent with a rare race condition
  • Original code lacked explicit connection lifecycle management
  • Aggressive shutdownNow() immediately interrupted all running tasks
  • HikariCP pool showed connections as "active" but not actually executing queries

Confidence level: High probability based on code analysis, but unconfirmed without replication.

⚠️ Test Limitations

We have NOT been able to replicate this issue in testing. The race condition (if this is indeed the cause) is extremely timing-dependent and occurs in a narrow millisecond window. The integration test does NOT fail with the old code - it passes both before and after our fixes.

This is expected behavior because:

  • The suspected race condition requires precise timing alignment between timeout firing and query execution/cleanup
  • Production leaked only ~30 connections over weeks, indicating how rare the condition is (if our hypothesis is correct)
  • Our fixes are preventative - they eliminate the suspected race condition window entirely

The test validates that our fixes don't break existing functionality and that the timeout mechanism works correctly, but cannot reliably trigger the suspected race condition.

Changes

1. Database Connection Lifecycle Management (PRIMARY FIX)

File: dotCMS/src/main/java/com/dotcms/telemetry/collectors/DBMetricType.java

  • Wrapped getValue() calls in DbConnectionFactory.wrapConnection() to ensure proper connection lifecycle
  • Ensures connections opened for metrics are properly closed even if interrupted
  • This wrapper did not exist in the original code - adding it is the primary fix

Before:

return APILocator.getMetricsAPI().getValue(getSqlQuery());

After:

return DbConnectionFactory.wrapConnection(() ->
    APILocator.getMetricsAPI().getValue(getSqlQuery())
);
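
For intuition, a hypothetical sketch of the guarantee such a wrapper provides; this is NOT the actual DbConnectionFactory.wrapConnection implementation, and the method name and parameters here are illustrative:

import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical: run `work`, then release the connection unconditionally in
// finally, so an interrupt thrown out of `work` cannot orphan the connection.
static <T> T wrapConnectionSketch(final Callable<T> work, final Connection conn) throws Exception {
    try {
        return work.call();
    } finally {
        try {
            conn.close();  // hands the connection back to the HikariCP pool
        } catch (final SQLException ignored) {
            // best-effort close; never mask the primary exception
        }
    }
}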

2. Race Condition Prevention - Layer 1: Task Completion Check

File: dotCMS/src/main/java/com/dotcms/telemetry/collectors/MetricStatsCollector.java (line 187)

  • Added isDone() check before calling cancel(true) on timeout
  • Prevents interrupting tasks that have just completed, avoiding potential race condition
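
A sketch of the guarded cancellation (variable names are illustrative):

try {
    result = future.get(timeoutSeconds, TimeUnit.SECONDS);
} catch (TimeoutException e) {
    if (!future.isDone()) {   // skip tasks that completed in the race window
        future.cancel(true);  // only interrupt tasks that are still running
    }
}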

3. Race Condition Prevention - Layer 2: Graceful Shutdown

File: dotCMS/src/main/java/com/dotcms/telemetry/collectors/MetricStatsCollector.java (lines 246-264)

  • Replaced aggressive shutdownNow() with graceful shutdown() + awaitTermination(5s)
  • Gives running tasks time to complete cleanup before forcing interruption
  • Reduces likelihood of interrupting connection cleanup code

Before:

executor.shutdownNow();

After:

executor.shutdown();
if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
    executor.shutdownNow();
    // ... additional handling
}
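
This mirrors the standard two-phase shutdown recipe from the ExecutorService javadoc; a self-contained sketch (the logging and second wait are illustrative details):

static void shutdownGracefully(final ExecutorService executor) {
    executor.shutdown();  // stop accepting new tasks; let running ones finish
    try {
        if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
            executor.shutdownNow();  // only now interrupt the stragglers
            if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
                // pool did not terminate; log and move on
            }
        }
    } catch (final InterruptedException ie) {
        executor.shutdownNow();              // re-cancel if we were interrupted
        Thread.currentThread().interrupt();  // preserve the interrupt status
    }
}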

4. Race Condition Prevention - Layer 3: Resilient Finally Block

File: dotCMS/src/main/java/com/dotmarketing/db/DbConnectionFactory.java (lines 788-801)

  • Made finally block interrupt-resilient by clearing interrupt flag during cleanup
  • Ensures connection cleanup completes even if interrupted
  • Restores interrupt flag after cleanup to preserve thread state

This layer works in conjunction with the new wrapConnection() usage in #1 above.
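
A sketch of the interrupt-resilient cleanup pattern (the helper name is illustrative, not the actual DbConnectionFactory code):

import java.sql.Connection;
import java.sql.SQLException;

// Clear the interrupt flag before close() so cleanup cannot be aborted,
// then restore it so callers still observe the interruption.
static void closeResilient(final Connection connection) {
    final boolean wasInterrupted = Thread.interrupted();  // clears the flag
    try {
        if (connection != null && !connection.isClosed()) {
            connection.close();  // returns the connection to the HikariCP pool
        }
    } catch (final SQLException e) {
        // best-effort: cleanup must never throw
    } finally {
        if (wasInterrupted) {
            Thread.currentThread().interrupt();  // restore thread state
        }
    }
}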

5. Configurable Timeout

File: dotCMS/src/main/java/com/dotcms/telemetry/business/TimeoutConfig.java

  • Added TELEMETRY_METRIC_TIMEOUT_SECONDS configuration property
  • Default: 10 seconds (production), 1 second (tests)
  • Allows tuning for slow-running database metrics
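
Presumably the property is resolved through dotCMS's standard Config lookup; a one-line sketch (assuming Config.getIntProperty; the actual TimeoutConfig code may differ):

// Read the timeout with the 10-second production default.
final int timeoutSeconds = Config.getIntProperty("TELEMETRY_METRIC_TIMEOUT_SECONDS", 10);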

6. Integration Tests

Files:

  • dotcms-integration/src/test/java/com/dotcms/telemetry/collectors/MetricTimeoutTest.java (new)
  • dotcms-integration/src/test/java/com/dotcms/telemetry/collectors/SlowDatabaseTestMetric.java (new)

Test Coverage:

  • Validates timeout mechanism fires correctly
  • Tests database connection management using PostgreSQL pg_sleep(3)
  • Verifies connections are properly returned to pool after timeout
  • Checks for connection accumulation over multiple iterations
  • Uses cache bypass to ensure fresh query execution

Note: The test validates correct behavior but does NOT reproduce the suspected race condition leak.
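
A sketch of the kind of slow metric the test relies on (illustrative; the real SlowDatabaseTestMetric and the full DBMetricType contract may differ, though getSqlQuery() matches the snippets above):

// Hypothetical slow metric: pg_sleep(3) holds a live PostgreSQL connection
// for 3 seconds, well past the 1-second test timeout, so the timeout and
// connection-return paths run against a real pooled connection.
public class SlowDatabaseTestMetric implements DBMetricType {
    @Override
    public String getSqlQuery() {
        return "SELECT pg_sleep(3)";
    }
    // other DBMetricType members omitted for brevity
}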

Testing

Manual Testing

  • Verified no connection leaks in HikariCP pool monitoring
  • Tested with multiple concurrent metric collection cycles
  • Confirmed timeout configuration works correctly

Integration Tests

./mvnw verify -pl :dotcms-integration -Dcoreit.test.skip=false -Dit.test=MetricTimeoutTest

Test validates:

  1. Timeouts fire at configured threshold (1s in tests)
  2. Database connections are properly managed during timeout
  3. No connection accumulation over multiple timeout cycles
  4. Cache bypass ensures actual database queries are tested

Test Limitation: The test passes with both old and new code because the suspected race condition is too rare to trigger reliably in testing. The test serves to validate that our fixes don't break existing functionality and that the timeout mechanism works correctly.

Test Design Notes

  • Uses PostgreSQL's pg_sleep(3) to hold real database connections
  • Tests the actual DbConnectionFactory.wrapConnection() code path
  • Cache bypass parameter ensures fresh query execution on every test

Configuration

New configuration property in dotcms-config.properties:

# Telemetry metric timeout in seconds (default: 10)
TELEMETRY_METRIC_TIMEOUT_SECONDS=10

For tests, set to 1 second via system property:

telemetry.metric.timeout.seconds=1
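
For example, appended to the integration test command above (assuming the value is passed as an ordinary JVM system property):

./mvnw verify -pl :dotcms-integration -Dcoreit.test.skip=false -Dit.test=MetricTimeoutTest -Dtelemetry.metric.timeout.seconds=1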

Production Impact

  • Positive: Addresses the most probable cause of connection leaks based on code analysis
  • Performance: Negligible - adds at most a 5-second graceful-shutdown wait per collection cycle
  • Compatibility: Fully backward compatible
  • Configuration: Default 10s timeout suitable for most metrics
  • Monitoring: Will require production monitoring to confirm the leak is resolved

Defense-in-Depth Strategy

This fix implements a four-layer defense targeting the suspected race condition:

  1. Connection Lifecycle: Wrap all DB metric queries in proper connection management (primary fix)
  2. Prevention: Check isDone() before canceling to avoid racing with task completion
  3. Mitigation: Graceful shutdown gives tasks time to complete cleanup
  4. Recovery: Resilient finally block completes cleanup even if interrupted

All four layers work together to eliminate the suspected race condition window, which should prevent connection leaks if our hypothesis is correct.

Risk Assessment

Low Risk Change:

  • Fixes are defensive and preventative in nature
  • No behavior changes under normal operation
  • Test coverage validates no regression in timeout handling
  • Changes follow established patterns from HikariCP best practices
  • Production evidence strongly suggests the race condition as root cause (connections stuck for 11+ days with queries that should complete in milliseconds)
  • If hypothesis is incorrect: No harm done - the fixes improve connection handling regardless
  • If hypothesis is correct: Should eliminate the leak entirely

Next Steps

  • Deploy to production with monitoring
  • Verify connection leak is resolved over time
  • If leak persists, further investigation needed to identify alternative root causes

Closes #34480

🤖 Generated with Claude Code

spbolton force-pushed the issue-34480-fixtelemetry-db-connection-leak-in branch from 80a422f to 7db7edb on February 3, 2026 at 22:01
spbolton enabled auto-merge on February 4, 2026 at 17:46
spbolton added this pull request to the merge queue on Feb 5, 2026. Any commits made after this event will not be merged.

Development

Successfully merging this pull request may close these issues.

fix(telemetry): DB connection leak in MetricStatsCollector — executor thread connections never returned to HikariCP pool
