[BUG] Deadlock in TelemetryClientFactory between concurrent connection open and close on the same host #1442

@haotian-wang

Describe the bug

Under concurrent connection open/close for the same Databricks workspace host, threads can block for 60+ seconds inside TelemetryClientFactory.getTelemetryClient. The root cause is a classic "blocking I/O inside ConcurrentHashMap.compute" anti-pattern:

  1. When a connection is closed, TelemetryClientFactory.closeTelemetryClient calls telemetryClientHolders.computeIfPresent(key, ...). Inside the mapping function, it invokes TelemetryClient.close(), which in turn calls flush(true).get() — a synchronous FutureTask.get() waiting for a network flush to the /telemetry endpoint to complete.
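  2. While that flush is in flight, the CHM bucket lock for the host key is held. compute / computeIfPresent run the mapping function atomically, and the ConcurrentHashMap javadoc warns that other update operations on the map may be blocked while the computation is in progress. In OpenJDK the bucket-level lock is a monitor on the bin's head node, held for the duration of the lambda.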
  3. Any concurrent getTelemetryClient(connectionContext) call for the same key enters telemetryClientHolders.compute(key, ...) and blocks waiting for the same bucket lock.

Per the ConcurrentHashMap.compute javadoc, the computation "should be short and simple" and must not attempt to update any other mappings of the map. Performing synchronous network I/O inside the mapping function goes directly against that guidance.
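
The three steps above can be reproduced without the driver. The following is a minimal sketch, not driver code: the class name, the "host" key, and the Thread.sleep standing in for flush(true).get() are all my own placeholders. It measures how long a compute() call on the same key waits while another thread is still inside computeIfPresent.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Stand-alone illustration; none of these names come from the driver. */
class ChmBlockingDemo {

    /**
     * Keeps a mapping function running for holdMillis inside computeIfPresent and
     * returns how long a concurrent compute() on the same key had to wait (ms).
     */
    static long measureWaitMillis(long holdMillis) {
        ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
        map.put("host", "client");
        CountDownLatch insideLambda = new CountDownLatch(1);

        // "Closer": plays the role of closeTelemetryClient doing slow I/O in the lambda.
        Thread closer = new Thread(() -> map.computeIfPresent("host", (k, v) -> {
            insideLambda.countDown();
            try {
                Thread.sleep(holdMillis); // placeholder for flush(true).get()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return null; // drop the mapping, as happens for the last connection
        }));
        closer.start();

        try {
            insideLambda.await(); // the closer is now inside its mapping function

            // "Opener": plays the role of getTelemetryClient for the same key.
            long start = System.nanoTime();
            map.compute("host", (k, v) -> v == null ? "new-client" : v);
            long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

            closer.join();
            return waited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("opener waited ~" + measureWaitMillis(500) + " ms");
    }
}
```

The opener's wait tracks the hold time almost one-to-one, which matches the hangs seen when the telemetry flush is slow.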

When the telemetry endpoint is slow, transiently backlogged, or otherwise stalled, the lock is held for as long as the flush takes. Nominally that is bounded by the driver's socket timeout (default TelemetrySocketTimeout=5s), but holds of 60–95s were observed in practice with a degraded endpoint. Every thread that opens a new connection to the same host during that window hangs until the flush returns.

To Reproduce

  1. Configure a JDBC client to open and close connections to the same Databricks workspace URL at moderate concurrency (e.g., 50+ parallel threads).
  2. Ensure telemetry is enabled (default in recent versions: EnableTelemetry=1, subject to server-side feature flag).
  3. Introduce latency on the /telemetry endpoint. The easiest way is to point the driver at a proxy that delays responses by 30–60 seconds; alternatively, run the repro while the telemetry endpoint is under transient load.
  4. Observe that some DataSource.getConnection() calls stall for tens of seconds while other threads are concurrently closing connections to the same host. Thread dumps show the blocked threads stuck in ConcurrentHashMap.compute called from TelemetryClientFactory.getTelemetryClient.
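
For steps 1 and 4, a small load harness along these lines may help. It is only a sketch: OpenCloseLoad and maxOpenMillis are hypothetical names, and the opener is passed in as a Callable so the harness itself has no dependency on the driver. It runs open/close cycles on several threads and reports the slowest open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical harness; not part of the driver. */
class OpenCloseLoad {

    /** Runs open/close cycles on `threads` threads and returns the slowest open in ms. */
    static long maxOpenMillis(Callable<? extends AutoCloseable> opener,
                              int threads, int cyclesPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> {
                    long worstNanos = 0;
                    for (int i = 0; i < cyclesPerThread; i++) {
                        long start = System.nanoTime();
                        // Only the open is timed; close runs when the try block exits.
                        try (AutoCloseable conn = opener.call()) {
                            worstNanos = Math.max(worstNanos, System.nanoTime() - start);
                        }
                    }
                    return worstNanos;
                }));
            }
            long maxNanos = 0;
            for (Future<Long> result : results) {
                maxNanos = Math.max(maxNanos, result.get());
            }
            return TimeUnit.NANOSECONDS.toMillis(maxNanos);
        } catch (Exception e) {
            throw new IllegalStateException("load run failed", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With a real DataSource, the opener would be something like `() -> dataSource.getConnection()` and the thread count 50 or more, as in step 1. A maximum far above the usual connect time while the telemetry endpoint is delayed is the symptom described in step 4.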

Expected behavior

Opening a connection should not block waiting for another connection's telemetry flush to finish. TelemetryClient.close() (which performs a synchronous network flush) should not be invoked while holding a ConcurrentHashMap bucket lock.

Screenshots

N/A

Client side logs

Stack trace of a blocked thread pair (class names are from the driver, other frames redacted):

Thread "worker-A" is BLOCKED because it is waiting for the lock held by "worker-B".
	at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1931)
	at com.databricks.jdbc.telemetry.TelemetryClientFactory.getTelemetryClient(TelemetryClientFactory.java:86)
	at com.databricks.jdbc.telemetry.TelemetryHelper.exportTelemetryEvent(TelemetryHelper.java:130)
	at com.databricks.jdbc.telemetry.TelemetryHelper.exportTelemetryLog(TelemetryHelper.java:82)
	at com.databricks.jdbc.telemetry.latency.TelemetryCollector.recordOperationLatency(TelemetryCollector.java:79)
	at com.databricks.jdbc.telemetry.latency.DatabricksMetricsTimedProcessor$TimedInvocationHandler.invoke(DatabricksMetricsTimedProcessor.java:82)
	at jdk.proxy2.$Proxy165.createSession(Unknown Source)
	at com.databricks.jdbc.api.impl.DatabricksSession.open(DatabricksSession.java:163)
	at com.databricks.jdbc.api.impl.DatabricksConnection.open(DatabricksConnection.java:67)
	at com.databricks.client.jdbc.Driver.connect(Driver.java:63)
	at com.databricks.client.jdbc.DataSource.getConnection(DataSource.java:53)
	... (application frames)

Caused by: Thread "worker-B" is in [WAITING] state holding the CHM bucket lock.
	at jdk.internal.misc.Unsafe.park(Native Method)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:447)
	at java.util.concurrent.FutureTask.get(FutureTask.java:190)
	at com.databricks.jdbc.telemetry.TelemetryClient.close(TelemetryClient.java:115)
	at com.databricks.jdbc.telemetry.TelemetryClientFactory.closeTelemetryClient(TelemetryClientFactory.java:242)
	at com.databricks.jdbc.telemetry.TelemetryClientFactory.lambda$closeTelemetryClient$2(TelemetryClientFactory.java:150)
	at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1828)
	at com.databricks.jdbc.telemetry.TelemetryClientFactory.closeTelemetryClient(TelemetryClientFactory.java:145)
	at com.databricks.jdbc.api.impl.DatabricksConnection.close(DatabricksConnection.java:422)
	... (application frames)
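
If taking an external thread dump is inconvenient, the same pattern can be spotted from inside the JVM. The sketch below uses the standard java.lang.management API; the class and method names are hypothetical. It lists threads that are BLOCKED with a ConcurrentHashMap.compute* frame on top of the stack, together with the thread that owns the lock.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper; not part of the driver. */
class BlockedInComputeScan {

    /** Returns "blockedThread <- lockOwner" for threads blocked in ConcurrentHashMap.compute*. */
    static List<String> blockedInChmCompute() {
        List<String> hits = new ArrayList<>();
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info == null || info.getThreadState() != Thread.State.BLOCKED) {
                continue;
            }
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length == 0) {
                continue;
            }
            StackTraceElement top = stack[0];
            boolean inCompute =
                top.getClassName().equals("java.util.concurrent.ConcurrentHashMap")
                    && top.getMethodName().startsWith("compute");
            if (inCompute) {
                hits.add(info.getThreadName() + " <- " + info.getLockOwnerName());
            }
        }
        return hits;
    }
}
```

Called periodically from a watchdog thread, a non-empty result during connection churn would correspond to the worker-A / worker-B pair shown above.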

Client Environment

  • OS: Linux
  • Java version: Java 17
  • Java vendor: OpenJDK
  • Driver Version: 3.3.1 (also reproduces on 3.3.3; same code on main as of 2026-05-07)
  • BI Tool (if used): N/A, custom JDBC client
  • BI Tool version (if applicable): N/A

Additional context

The bug appears unfixed across multiple versions. I diffed TelemetryClient.java and TelemetryClientFactory.java across:

All three are byte-for-byte identical in the relevant section. The v3.3.3 changelog only lists a Maven POM fix; no telemetry-related change.

The offending code path:

TelemetryClientFactory.closeTelemetryClient (the outer lock holder):

public void closeTelemetryClient(IDatabricksConnectionContext connectionContext) {
    String key = TelemetryHelper.keyOf(connectionContext);
    String connectionUuid = connectionContext.getConnectionUuid();
    telemetryClientHolders.computeIfPresent(
        key,
        (k, holder) -> {
            holder.connectionUuids.remove(connectionUuid);
            if (holder.connectionUuids.isEmpty()) {
                closeTelemetryClient(holder.client, "telemetry client");  // ← blocking I/O inside CHM compute
                return null;
            }
            return holder;
        });
    ...
}

TelemetryClient.close (the blocking operation):

@Override
public void close() {
    TelemetryCollector collector = TelemetryCollectorManager.getInstance().getOrCreateCollector(context);
    collector.exportAllPendingTelemetryDetails();
    try {
        flush(true).get();  // ← FutureTask.get() blocks until the network flush completes
    } catch (Exception e) {
        LOGGER.trace(...);
    }
    ...
}

Suggested fixes (any of these would resolve it):

  1. Remove-then-close — extract the holder from the map inside computeIfPresent, then close the client outside the lambda so the network I/O runs without holding the bucket lock:

    final TelemetryClientHolder[] toClose = new TelemetryClientHolder[1];
    telemetryClientHolders.computeIfPresent(key, (k, holder) -> {
        holder.connectionUuids.remove(connectionUuid);
        if (holder.connectionUuids.isEmpty()) {
            toClose[0] = holder;
            return null;
        }
        return holder;
    });
    if (toClose[0] != null) {
        closeTelemetryClient(toClose[0].client, "telemetry client");  // outside the lock
    }

  2. Make TelemetryClient.close() truly asynchronous: don't call .get() on the flush future. Let the flush complete on the existing executor and just schedule task cancellation for flushTask.

  3. Bound the flush with a timeout — replace flush(true).get() with flush(true).get(N, TimeUnit.SECONDS) where N is very small (e.g., 2 s, configurable). Lowest-risk minimal patch; the bucket lock is never held longer than N.
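
A bounded wait for option 3 could look roughly like the sketch below. BoundedFlush and awaitFlush are hypothetical names, and the code only assumes that flush(true) returns something usable as a java.util.concurrent.Future, which the FutureTask.get frame in the stack trace suggests.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper; not part of the driver. */
class BoundedFlush {

    /** Waits at most timeoutMillis; returns true only if the flush completed successfully in time. */
    static boolean awaitFlush(Future<?> flush, long timeoutMillis) {
        try {
            flush.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Stop waiting; the flush may still finish later on its own executor.
            return false;
        } catch (ExecutionException e) {
            // The flush itself failed; the caller should carry on closing.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

In close(), the call would become something like `awaitFlush(flush(true), timeoutMillis)`, with the timeout read from configuration. This caps the wait on the flush but still performs it under the lock, so openers can be delayed by up to the timeout.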

Option 1 is the cleanest, because it eliminates the fundamental anti-pattern rather than just bounding its damage.
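
To make option 1 concrete outside the driver, here is a self-contained model of the remove-then-close pattern. Holder, register, and release are hypothetical stand-ins for the factory's holder type and its get/close methods, and AutoCloseable stands in for the telemetry client. The point it illustrates is that close() runs only after the mapping has been removed and the map lock released.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

/** Simplified model of the factory; all names here are illustrative. */
class RemoveThenClose {

    /** Stand-in for the driver's holder: one client shared by several connections. */
    static final class Holder {
        final AutoCloseable client;
        final Set<String> connectionUuids = ConcurrentHashMap.newKeySet();

        Holder(AutoCloseable client) {
            this.client = client;
        }
    }

    final ConcurrentHashMap<String, Holder> holders = new ConcurrentHashMap<>();

    /** Models getTelemetryClient: short, non-blocking work inside compute. */
    void register(String key, String connectionUuid, AutoCloseable clientIfAbsent) {
        holders.compute(key, (k, existing) -> {
            Holder holder = (existing != null) ? existing : new Holder(clientIfAbsent);
            holder.connectionUuids.add(connectionUuid);
            return holder;
        });
    }

    /** Models closeTelemetryClient: decide under the lock, close outside it. */
    void release(String key, String connectionUuid) {
        AtomicReference<Holder> toClose = new AtomicReference<>();
        holders.computeIfPresent(key, (k, holder) -> {
            holder.connectionUuids.remove(connectionUuid);
            if (holder.connectionUuids.isEmpty()) {
                toClose.set(holder); // remember it, but do not close here
                return null;         // removes the mapping
            }
            return holder;
        });
        Holder holder = toClose.get();
        if (holder != null) {
            try {
                holder.client.close(); // slow I/O happens here, with no map lock held
            } catch (Exception e) {
                // A real implementation would log this; closing must not fail the caller.
            }
        }
    }
}
```

One consequence of this design is that a connection opened while the old client is still flushing gets a fresh client for the same key instead of waiting, which is the behavior asked for under "Expected behavior".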

Workaround for users experiencing this:

Setting telemetryLogLevel=OFF on the JDBC URL (or as a DataSource property) causes TelemetryHelper.isTelemetryAllowedForConnection to return false at its first check, routing all getTelemetryClient calls to NoopTelemetryClient.getInstance() and skipping the ConcurrentHashMap.compute path entirely. This avoids the deadlock but also disables all driver telemetry to the Databricks side.
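
As a small illustration of the URL form of the workaround, the helper below appends the property to a semicolon-delimited JDBC URL. The helper name is hypothetical, and the host and httpPath in the usage note are placeholders; only telemetryLogLevel=OFF comes from the description above.

```java
/** Hypothetical helper; not part of the driver. */
class TelemetryOffUrl {

    /** Appends telemetryLogLevel=OFF to a semicolon-delimited JDBC URL. */
    static String withTelemetryOff(String jdbcUrl) {
        // Avoid a double separator when the URL already ends with ';'.
        String separator = jdbcUrl.endsWith(";") ? "" : ";";
        return jdbcUrl + separator + "telemetryLogLevel=OFF";
    }
}
```

For example, `withTelemetryOff("jdbc:databricks://example-host:443;httpPath=/placeholder")` yields the same URL with `;telemetryLogLevel=OFF` appended. The sketch does not check whether the property is already present in the URL.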
