fix(telemetry): fixtelemetry DB connection leak in MetricStatsColl (#34480)#34490
Queued
fix(telemetry): fixtelemetry DB connection leak in MetricStatsColl (#34480)#34490
Conversation
1 task
1 task
06267b7 to
80a422f
Compare
80a422f to
7db7edb
Compare
dario-daza
approved these changes
Feb 4, 2026
erickgonzalez
requested changes
Feb 4, 2026
erickgonzalez
approved these changes
Feb 5, 2026
Any commits made after this event will not be merged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fixes a probable database connection leak in telemetry metrics collection where connections were not properly returned to the HikariCP pool. Based on code analysis and production evidence, the leak likely occurred during a race condition when metric queries timed out.
Probable Root Cause Analysis (Unconfirmed)
Suspected mechanism: The leak likely occurred through a narrow race condition window (~milliseconds) when:
cancel(true)immediately interrupts the threadOriginal code pattern (no connection management):
This pattern had no explicit connection lifecycle management, making it vulnerable to interruption during query execution or cleanup.
Supporting evidence:
shutdownNow()immediately interrupted all running tasksConfidence level: High probability based on code analysis, but unconfirmed without replication.
We have NOT been able to replicate this issue in testing. The race condition (if this is indeed the cause) is extremely timing-dependent and occurs in a narrow millisecond window. The integration test does NOT fail with the old code - it passes both before and after our fixes.
This is expected behavior because:
The test validates that our fixes don't break existing functionality and that the timeout mechanism works correctly, but cannot reliably trigger the suspected race condition.
Changes
1. Database Connection Lifecycle Management (PRIMARY FIX)
File:
dotCMS/src/main/java/com/dotcms/telemetry/collectors/DBMetricType.javagetValue()calls inDbConnectionFactory.wrapConnection()to ensure proper connection lifecycleBefore:
After:
2. Race Condition Prevention - Layer 1: Task Completion Check
File:
dotCMS/src/main/java/com/dotcms/telemetry/collectors/MetricStatsCollector.java(line 187)isDone()check before callingcancel(true)on timeout3. Race Condition Prevention - Layer 2: Graceful Shutdown
File:
dotCMS/src/main/java/com/dotcms/telemetry/collectors/MetricStatsCollector.java(lines 246-264)shutdownNow()with gracefulshutdown()+awaitTermination(5s)Before:
After:
4. Race Condition Prevention - Layer 3: Resilient Finally Block
File:
dotCMS/src/main/java/com/dotmarketing/db/DbConnectionFactory.java(lines 788-801)This layer works in conjunction with the new
wrapConnection()usage in #1 above.5. Configurable Timeout
File:
dotCMS/src/main/java/com/dotcms/telemetry/business/TimeoutConfig.javaTELEMETRY_METRIC_TIMEOUT_SECONDSconfiguration property6. Integration Tests
Files:
dotcms-integration/src/test/java/com/dotcms/telemetry/collectors/MetricTimeoutTest.java(new)dotcms-integration/src/test/java/com/dotcms/telemetry/collectors/SlowDatabaseTestMetric.java(new)Test Coverage:
pg_sleep(3)Note: The test validates correct behavior but does NOT reproduce the suspected race condition leak.
Testing
Manual Testing
Integration Tests
Test validates:
Test Limitation: The test passes with both old and new code because the suspected race condition is too rare to trigger reliably in testing. The test serves to validate that our fixes don't break existing functionality and that the timeout mechanism works correctly.
Test Design Notes
pg_sleep(3)to hold real database connectionsDbConnectionFactory.wrapConnection()code pathConfiguration
New configuration property in
dotcms-config.properties:For tests, set to 1 second via system property:
telemetry.metric.timeout.seconds=1Production Impact
Defense-in-Depth Strategy
This fix implements a four-layer defense targeting the suspected race condition:
isDone()before canceling to avoid racing with task completionAll four layers work together to eliminate the suspected race condition window, which should prevent connection leaks if our hypothesis is correct.
Risk Assessment
Low Risk Change:
Next Steps
Closes #34480
🤖 Generated with Claude Code