Describe the bug
Under concurrent connection open/close for the same Databricks workspace host, threads can block for 60+ seconds inside TelemetryClientFactory.getTelemetryClient. The root cause is a classic "blocking I/O inside ConcurrentHashMap.compute" anti-pattern:
- When a connection is closed, TelemetryClientFactory.closeTelemetryClient calls telemetryClientHolders.computeIfPresent(key, ...). Inside the mapping function, it invokes TelemetryClient.close(), which in turn calls flush(true).get(), a synchronous FutureTask.get() that waits for a network flush to the /telemetry endpoint to complete.
- While that flush is in flight, the CHM bucket lock for the host key is held (this is documented ConcurrentHashMap behavior: compute / computeIfPresent take a bucket-level lock for the duration of the lambda).
- Any concurrent getTelemetryClient(connectionContext) call for the same key enters telemetryClientHolders.compute(key, ...) and blocks waiting for the same bucket lock.
Per the ConcurrentHashMap.compute javadoc, the computation "should be short and simple, and must not attempt to update any other mappings of this map." Performing synchronous network I/O inside the mapping function goes directly against that guidance.
When the telemetry endpoint is slow, transiently backlogged, or otherwise stalled, the lock can be held for the driver's socket timeout (default TelemetrySocketTimeout=5s, but observed 60–95s in practice when the endpoint is degraded), causing observable request hangs on every thread that is opening a new connection to the same host during that window.
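The blocking behavior described above can be shown without the driver at all. The following is a minimal sketch, not driver code: the class name ChmBlockingDemo, the "host" key, and the sleep that stands in for flush(true).get() are all assumptions made for illustration. It measures how long a compute() call waits when another thread is still inside a computeIfPresent() lambda for the same key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class ChmBlockingDemo {

    /** Returns how long (ms) compute() waited behind a slow computeIfPresent() on the same key. */
    static long measureBlockedMillis(long holdMillis) {
        ConcurrentHashMap<String, String> holders = new ConcurrentHashMap<>();
        holders.put("host", "client");
        CountDownLatch insideLambda = new CountDownLatch(1);

        // "Closer" thread: plays the role of closeTelemetryClient.
        Thread closer = new Thread(() ->
            holders.computeIfPresent("host", (k, v) -> {
                insideLambda.countDown(); // signal that the bin lock is now held
                sleep(holdMillis);        // stand-in for the blocking flush (assumption)
                return null;              // remove the mapping, like the real code
            }));
        closer.start();

        try {
            insideLambda.await();
            // "Opener": plays the role of getTelemetryClient on the same key.
            long start = System.nanoTime();
            holders.compute("host", (k, v) -> v == null ? "new client" : v);
            long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            closer.join();
            return waited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("compute() waited ~" + measureBlockedMillis(500) + " ms");
    }
}
```

The wait reported by the opener should track the time the closer spends inside its lambda, which is the same relationship seen between the telemetry flush duration and the getConnection() stall.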
To Reproduce
- Configure a JDBC client to open and close connections to the same Databricks workspace URL at moderate concurrency (e.g., 50+ parallel threads).
- Ensure telemetry is enabled (default in recent versions: EnableTelemetry=1, subject to server-side feature flag).
- Introduce latency on the /telemetry endpoint. The easiest way is to point the driver at a proxy that delays responses by 30–60 seconds, or to run the repro at a time when the telemetry endpoint is under transient load.
- Observe that some DataSource.getConnection() calls stall for tens of seconds while other threads are concurrently closing connections to the same host. Thread dumps show the blocked threads stuck in ConcurrentHashMap.compute called from TelemetryClientFactory.getTelemetryClient.
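A small harness makes the stall easier to spot than reading thread dumps. The sketch below is one possible shape for it, under stated assumptions: OpenCloseLatencyProbe and its OpenClose interface are hypothetical names, and the action passed in is whatever "open a connection, then close it" means in the client under test (for example, a try-with-resources around DataSource.getConnection()). It reports the worst single open/close latency seen across all threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class OpenCloseLatencyProbe {

    /** Hypothetical stand-in for "open a connection, then close it". */
    interface OpenClose {
        void run() throws Exception;
    }

    /** Runs the action concurrently and returns the worst single-call latency in milliseconds. */
    static long maxLatencyMillis(int threads, int iterationsPerThread, OpenClose action) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> {
                    long worstNanos = 0;
                    for (int i = 0; i < iterationsPerThread; i++) {
                        long start = System.nanoTime();
                        action.run();
                        worstNanos = Math.max(worstNanos, System.nanoTime() - start);
                    }
                    return worstNanos;
                }));
            }
            long worstNanos = 0;
            for (Future<Long> result : results) {
                worstNanos = Math.max(worstNanos, result.get());
            }
            return TimeUnit.NANOSECONDS.toMillis(worstNanos);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With a real DataSource, a call such as maxLatencyMillis(50, 100, () -> { try (java.sql.Connection c = dataSource.getConnection()) { } }) would be the natural usage. A worst-case latency in the tens of seconds while the telemetry endpoint is delayed would be consistent with the behavior reported here.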
Expected behavior
Opening a connection should not block waiting for another connection's telemetry flush to finish. TelemetryClient.close() (which performs a synchronous network flush) should not be invoked while holding a ConcurrentHashMap bucket lock.
Screenshots
N/A
Client side logs
Stack trace of a blocked thread pair (class names are from the driver, other frames redacted):
Thread "worker-A" is BLOCKED because it is waiting for the lock held by "worker-B".
at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1931)
at com.databricks.jdbc.telemetry.TelemetryClientFactory.getTelemetryClient(TelemetryClientFactory.java:86)
at com.databricks.jdbc.telemetry.TelemetryHelper.exportTelemetryEvent(TelemetryHelper.java:130)
at com.databricks.jdbc.telemetry.TelemetryHelper.exportTelemetryLog(TelemetryHelper.java:82)
at com.databricks.jdbc.telemetry.latency.TelemetryCollector.recordOperationLatency(TelemetryCollector.java:79)
at com.databricks.jdbc.telemetry.latency.DatabricksMetricsTimedProcessor$TimedInvocationHandler.invoke(DatabricksMetricsTimedProcessor.java:82)
at jdk.proxy2.$Proxy165.createSession(Unknown Source)
at com.databricks.jdbc.api.impl.DatabricksSession.open(DatabricksSession.java:163)
at com.databricks.jdbc.api.impl.DatabricksConnection.open(DatabricksConnection.java:67)
at com.databricks.client.jdbc.Driver.connect(Driver.java:63)
at com.databricks.client.jdbc.DataSource.getConnection(DataSource.java:53)
... (application frames)
Caused by: Thread "worker-B" is in [WAITING] state holding the CHM bucket lock.
at jdk.internal.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:447)
at java.util.concurrent.FutureTask.get(FutureTask.java:190)
at com.databricks.jdbc.telemetry.TelemetryClient.close(TelemetryClient.java:115)
at com.databricks.jdbc.telemetry.TelemetryClientFactory.closeTelemetryClient(TelemetryClientFactory.java:242)
at com.databricks.jdbc.telemetry.TelemetryClientFactory.lambda$closeTelemetryClient$2(TelemetryClientFactory.java:150)
at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1828)
at com.databricks.jdbc.telemetry.TelemetryClientFactory.closeTelemetryClient(TelemetryClientFactory.java:145)
at com.databricks.jdbc.api.impl.DatabricksConnection.close(DatabricksConnection.java:422)
... (application frames)
Client Environment
- OS: Linux
- Java version: Java 17
- Java vendor: OpenJDK
- Driver Version: 3.3.1 (also reproduces on 3.3.3; same code on main as of 2026-05-07)
- BI Tool (if used): N/A, custom JDBC client
- BI Tool version (if applicable): N/A
Additional context
The bug appears unfixed across multiple versions. I diffed TelemetryClient.java and TelemetryClientFactory.java across:
- v3.3.1
- v3.3.3
- main (as of 2026-05-07)
All three are byte-for-byte identical in the relevant section. The v3.3.3 changelog only lists a Maven POM fix; no telemetry-related change.
The offending code path:
TelemetryClientFactory.closeTelemetryClient (the outer lock holder):
    public void closeTelemetryClient(IDatabricksConnectionContext connectionContext) {
      String key = TelemetryHelper.keyOf(connectionContext);
      String connectionUuid = connectionContext.getConnectionUuid();
      telemetryClientHolders.computeIfPresent(
          key,
          (k, holder) -> {
            holder.connectionUuids.remove(connectionUuid);
            if (holder.connectionUuids.isEmpty()) {
              closeTelemetryClient(holder.client, "telemetry client"); // ← blocking I/O inside CHM compute
              return null;
            }
            return holder;
          });
      ...
    }
TelemetryClient.close (the blocking operation):
    @Override
    public void close() {
      TelemetryCollector collector = TelemetryCollectorManager.getInstance().getOrCreateCollector(context);
      collector.exportAllPendingTelemetryDetails();
      try {
        flush(true).get(); // ← FutureTask.get() blocks until the network flush completes
      } catch (Exception e) {
        LOGGER.trace(...);
      }
      ...
    }
Suggested fixes (any of these would resolve it):
1. Remove-then-close: extract the holder from the map inside computeIfPresent, then close the client outside the lambda so the network I/O runs without holding the bucket lock:

    final TelemetryClientHolder[] toClose = new TelemetryClientHolder[1];
    telemetryClientHolders.computeIfPresent(key, (k, holder) -> {
      holder.connectionUuids.remove(connectionUuid);
      if (holder.connectionUuids.isEmpty()) {
        toClose[0] = holder;
        return null;
      }
      return holder;
    });
    if (toClose[0] != null) {
      closeTelemetryClient(toClose[0].client, "telemetry client"); // outside the lock
    }

2. Make TelemetryClient.close() truly asynchronous: don't call .get() on the flush future. Let the flush complete on the existing executor and just schedule task cancellation for flushTask.
3. Bound the flush with a timeout: replace flush(true).get() with flush(true).get(N, TimeUnit.SECONDS) where N is very small (e.g., 2 s, configurable). Lowest-risk minimal patch; the bucket lock is never held longer than N.
Option 1 is the cleanest, because it eliminates the fundamental anti-pattern rather than just bounding its damage.
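For the timeout-bounded variant, the waiting logic could look roughly like the sketch below. This is an illustration only, not the driver's code: BoundedFlush and awaitFlush are hypothetical names, and cancelling the flush on timeout is an assumed design choice (the alternative is to let it finish in the background, closer to the asynchronous option).

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedFlush {

    /** Waits for the flush for at most the given time; returns true only if it completed normally. */
    static boolean awaitFlush(Future<?> flush, long timeout, TimeUnit unit) {
        try {
            flush.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            // Assumption: give up on the flush rather than keep the caller waiting.
            flush.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        } catch (ExecutionException e) {
            return false; // the flush itself failed; close() can still proceed
        }
    }
}
```

Inside TelemetryClient.close(), a call such as awaitFlush(flush(true), 2, TimeUnit.SECONDS) would take the place of flush(true).get(). This caps how long the caller waits, but if close() is still invoked from within the computeIfPresent lambda, other threads can still be blocked for up to that bound.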
Workaround for users experiencing this:
Setting telemetryLogLevel=OFF on the JDBC URL (or as a DataSource property) causes TelemetryHelper.isTelemetryAllowedForConnection to return false at its first check, routing all getTelemetryClient calls to NoopTelemetryClient.getInstance() and skipping the ConcurrentHashMap.compute path entirely. This avoids the stall (strictly a long lock wait rather than a deadlock) but also disables all driver telemetry to the Databricks side.