[BUG] Deadlock in TelemetryClientFactory between concurrent connection open and close on the same host #1442

**Describe the bug**

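Under concurrent connection open/close for the same Databricks workspace host, threads can block for 60+ seconds inside `TelemetryClientFactory.getTelemetryClient`. Strictly speaking this is a prolonged stall rather than a true deadlock, because the blocked threads resume once the flush returns. The root cause is the classic "blocking I/O inside `ConcurrentHashMap.compute`" anti-pattern: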

1. When a connection is closed, `TelemetryClientFactory.closeTelemetryClient` calls `telemetryClientHolders.computeIfPresent(key, ...)`. Inside the mapping function, it invokes `TelemetryClient.close()`, which in turn calls `flush(true).get()` — a **synchronous `FutureTask.get()`** waiting for a network flush to the `/telemetry` endpoint to complete.
2. While that flush is in flight, the CHM bin (bucket) lock for the host key is held. `compute` and `computeIfPresent` run the mapping function while holding that lock, and the `ConcurrentHashMap` javadoc warns that update operations by other threads may be blocked while the computation is in progress.
3. Any concurrent `getTelemetryClient(connectionContext)` call for the same key enters `telemetryClientHolders.compute(key, ...)` and blocks waiting for the same bucket lock.

Per the [`ConcurrentHashMap.compute` javadoc](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/ConcurrentHashMap.html#compute(K,java.util.function.BiFunction)), "Some attempted update operations on this map by other threads may be blocked while computation is in progress, so the computation should be short and simple." Performing synchronous network I/O inside the mapping function goes directly against that guidance.
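The blocking behavior described above can be reproduced without the driver. The following is a minimal sketch, not driver code: `ChmContentionDemo` and its method names are hypothetical, and `Thread.sleep` stands in for the blocking `flush(true).get()` call.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Illustrative only: a slow computeIfPresent stalls compute on the same key. */
final class ChmContentionDemo {

    /**
     * Returns how long (in ms) an "opener" thread waited in compute while a
     * "closer" thread slept for holdMillis inside computeIfPresent.
     */
    static long measureOpenerWaitMillis(long holdMillis) {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        map.put("host", 1);
        CountDownLatch insideLambda = new CountDownLatch(1);

        Thread closer = new Thread(() ->
            map.computeIfPresent("host", (k, v) -> {
                insideLambda.countDown();   // the bin lock is already held here
                sleepQuietly(holdMillis);   // stand-in for flush(true).get()
                return null;                // remove the mapping, as the close path does
            }), "closer");
        closer.start();

        try {
            insideLambda.await();           // make sure the closer holds the lock first
            long start = System.nanoTime();
            // Stand-in for getTelemetryClient: same key, so it waits for the closer.
            map.compute("host", (k, v) -> v == null ? 1 : v + 1);
            long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            closer.join();
            return waited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("opener waited ~" + measureOpenerWaitMillis(500) + " ms");
    }
}
```

The opener's wait should track the closer's hold time closely, which is the same relationship the thread dumps show between the flush duration and the `getConnection()` stall.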

When the telemetry endpoint is slow, transiently backlogged, or otherwise stalled, the lock is held until the flush returns or times out. The nominal bound is the driver's socket timeout (default `TelemetrySocketTimeout=5s`), but holds of 60–95 s were observed in practice when the endpoint was degraded. Every thread that opens a new connection to the same host during that window hangs until the lock is released.

**To Reproduce**

1. Configure a JDBC client to open and close connections to the same Databricks workspace URL at moderate concurrency (e.g., 50+ parallel threads).
2. Ensure telemetry is enabled (default in recent versions: `EnableTelemetry=1`, subject to server-side feature flag).
3. Introduce latency on the `/telemetry` endpoint. The easiest way is to point the driver at a proxy that delays responses by 30–60 seconds; alternatively, run the repro at a time when the telemetry endpoint is under transient load.
4. Observe that some `DataSource.getConnection()` calls stall for tens of seconds while other threads are concurrently closing connections to the same host. Thread dumps show the blocked threads stuck in `ConcurrentHashMap.compute` called from `TelemetryClientFactory.getTelemetryClient`.
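If taking external thread dumps is inconvenient, the same symptom can be checked from inside the JVM. This is a rough sketch under stated assumptions: `BlockedInComputeScanner` is a hypothetical helper, not part of the driver, and it is best-effort because a thread's state and its stack trace are sampled at slightly different instants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative helper: list threads blocked inside ConcurrentHashMap.compute*. */
final class BlockedInComputeScanner {

    /** True if the thread is BLOCKED and any frame is a ConcurrentHashMap.compute* method. */
    static boolean isStuckInCompute(Thread.State state, StackTraceElement[] stack) {
        if (state != Thread.State.BLOCKED) {
            return false;
        }
        for (StackTraceElement frame : stack) {
            if (frame.getClassName().equals("java.util.concurrent.ConcurrentHashMap")
                    && frame.getMethodName().startsWith("compute")) {
                return true;
            }
        }
        return false;
    }

    /** Names of all live threads that currently match isStuckInCompute. */
    static List<String> findBlockedThreads() {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            if (isStuckInCompute(e.getKey().getState(), e.getValue())) {
                names.add(e.getKey().getName());
            }
        }
        return names;
    }
}
```

Calling `findBlockedThreads()` periodically from a monitoring thread during the repro should list the stalled `getConnection()` callers while a slow close is in progress.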

**Expected behavior**

Opening a connection should not block waiting for another connection's telemetry flush to finish. `TelemetryClient.close()` (which performs a synchronous network flush) should not be invoked while holding a `ConcurrentHashMap` bucket lock.

**Screenshots**

N/A

**Client side logs**

Stack trace of a blocked thread pair (class names are from the driver, other frames redacted):

```
Thread "worker-A" is BLOCKED because it is waiting for the lock held by "worker-B".
	at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1931)
	at com.databricks.jdbc.telemetry.TelemetryClientFactory.getTelemetryClient(TelemetryClientFactory.java:86)
	at com.databricks.jdbc.telemetry.TelemetryHelper.exportTelemetryEvent(TelemetryHelper.java:130)
	at com.databricks.jdbc.telemetry.TelemetryHelper.exportTelemetryLog(TelemetryHelper.java:82)
	at com.databricks.jdbc.telemetry.latency.TelemetryCollector.recordOperationLatency(TelemetryCollector.java:79)
	at com.databricks.jdbc.telemetry.latency.DatabricksMetricsTimedProcessor$TimedInvocationHandler.invoke(DatabricksMetricsTimedProcessor.java:82)
	at jdk.proxy2.$Proxy165.createSession(Unknown Source)
	at com.databricks.jdbc.api.impl.DatabricksSession.open(DatabricksSession.java:163)
	at com.databricks.jdbc.api.impl.DatabricksConnection.open(DatabricksConnection.java:67)
	at com.databricks.client.jdbc.Driver.connect(Driver.java:63)
	at com.databricks.client.jdbc.DataSource.getConnection(DataSource.java:53)
	... (application frames)

Caused by: Thread "worker-B" is in [WAITING] state holding the CHM bucket lock.
	at jdk.internal.misc.Unsafe.park(Native Method)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:447)
	at java.util.concurrent.FutureTask.get(FutureTask.java:190)
	at com.databricks.jdbc.telemetry.TelemetryClient.close(TelemetryClient.java:115)
	at com.databricks.jdbc.telemetry.TelemetryClientFactory.closeTelemetryClient(TelemetryClientFactory.java:242)
	at com.databricks.jdbc.telemetry.TelemetryClientFactory.lambda$closeTelemetryClient$2(TelemetryClientFactory.java:150)
	at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1828)
	at com.databricks.jdbc.telemetry.TelemetryClientFactory.closeTelemetryClient(TelemetryClientFactory.java:145)
	at com.databricks.jdbc.api.impl.DatabricksConnection.close(DatabricksConnection.java:422)
	... (application frames)
```

**Client Environment**
- OS: Linux
- Java version: Java 17
- Java vendor: OpenJDK
- Driver Version: 3.3.1 (also reproduces on 3.3.3; same code on `main` as of 2026-05-07)
- BI Tool (if used): N/A, custom JDBC client
- BI Tool version (if applicable): N/A

**Additional context**

**The bug appears unfixed across multiple versions.** I diffed `TelemetryClient.java` and `TelemetryClientFactory.java` across:

- [`v3.3.1`](https://github.com/databricks/databricks-jdbc/blob/v3.3.1/src/main/java/com/databricks/jdbc/telemetry/TelemetryClientFactory.java)
- [`v3.3.3`](https://github.com/databricks/databricks-jdbc/blob/v3.3.3/src/main/java/com/databricks/jdbc/telemetry/TelemetryClientFactory.java)
- [`main`](https://github.com/databricks/databricks-jdbc/blob/main/src/main/java/com/databricks/jdbc/telemetry/TelemetryClientFactory.java) (as of 2026-05-07)

All three are byte-for-byte identical in the relevant section. The v3.3.3 changelog only lists a Maven POM fix; no telemetry-related change.

**The offending code path:**

`TelemetryClientFactory.closeTelemetryClient` (the outer lock holder):
```java
public void closeTelemetryClient(IDatabricksConnectionContext connectionContext) {
    String key = TelemetryHelper.keyOf(connectionContext);
    String connectionUuid = connectionContext.getConnectionUuid();
    telemetryClientHolders.computeIfPresent(
        key,
        (k, holder) -> {
            holder.connectionUuids.remove(connectionUuid);
            if (holder.connectionUuids.isEmpty()) {
                closeTelemetryClient(holder.client, "telemetry client");  // ← blocking I/O inside CHM compute
                return null;
            }
            return holder;
        });
    ...
}
```

`TelemetryClient.close` (the blocking operation):
```java
@Override
public void close() {
    TelemetryCollector collector = TelemetryCollectorManager.getInstance().getOrCreateCollector(context);
    collector.exportAllPendingTelemetryDetails();
    try {
        flush(true).get();  // ← FutureTask.get() blocks until the network flush completes
    } catch (Exception e) {
        LOGGER.trace(...);
    }
    ...
}
```

**Suggested fixes (any of these would resolve it):**

1. **Remove-then-close** — extract the holder from the map inside `computeIfPresent`, then close the client outside the lambda so the network I/O runs without holding the bucket lock:
   ```java
   final TelemetryClientHolder[] toClose = new TelemetryClientHolder[1];
   telemetryClientHolders.computeIfPresent(key, (k, holder) -> {
       holder.connectionUuids.remove(connectionUuid);
       if (holder.connectionUuids.isEmpty()) {
           toClose[0] = holder;
           return null;
       }
       return holder;
   });
   if (toClose[0] != null) {
       closeTelemetryClient(toClose[0].client, "telemetry client");  // outside the lock
   }
   ```

2. **Make `TelemetryClient.close()` truly asynchronous** — don't call `.get()` on the flush future. Let the flush complete on the existing executor and just schedule task cancellation for `flushTask`.

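3. **Bound the flush with a timeout**: replace `flush(true).get()` with `flush(true).get(N, TimeUnit.SECONDS)`, where `N` is very small (e.g., 2 s, configurable). This is the lowest-risk minimal patch: the flush wait can no longer hold the bucket lock for more than `N`, and the resulting `TimeoutException` is already covered by the existing `catch (Exception e)` in `TelemetryClient.close`.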
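For the bounded-wait option, the core of the change is the timed overload of `Future.get`. A minimal sketch, assuming the flush is exposed as a `Future`; `BoundedFlush` and `awaitFlush` are hypothetical names, not driver APIs:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative only: wait for a flush, but never longer than a fixed bound. */
final class BoundedFlush {

    /** Returns true only if the flush completed successfully within timeoutMillis. */
    static boolean awaitFlush(Future<?> flush, long timeoutMillis) {
        try {
            flush.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The flush keeps running on its executor; the caller just stops waiting.
            return false;
        } catch (ExecutionException e) {
            // The flush itself failed.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Note that this only caps the damage: the wait still happens under the bucket lock, so concurrent opens can still stall for up to the timeout.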

Option 1 is the cleanest, because it eliminates the fundamental anti-pattern rather than just bounding its damage.
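To illustrate why remove-then-close removes the stall, here is a self-contained sketch of the pattern. `RemoveThenCloseDemo`, `Holder`, and `SlowClient` are hypothetical stand-ins for the driver's factory, holder, and telemetry client; the real classes may differ.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative registry: the slow close runs after the map lock is released. */
final class RemoveThenCloseDemo {

    interface SlowClient { void close(); }

    static final class Holder {
        final SlowClient client;
        final Set<String> connectionUuids = ConcurrentHashMap.newKeySet();
        Holder(SlowClient client) { this.client = client; }
    }

    private final ConcurrentHashMap<String, Holder> holders = new ConcurrentHashMap<>();

    void open(String key, String uuid, SlowClient client) {
        holders.compute(key, (k, h) -> {
            Holder holder = (h != null) ? h : new Holder(client);
            holder.connectionUuids.add(uuid);
            return holder;
        });
    }

    void close(String key, String uuid) {
        AtomicReference<Holder> toClose = new AtomicReference<>();
        holders.computeIfPresent(key, (k, h) -> {
            h.connectionUuids.remove(uuid);
            if (h.connectionUuids.isEmpty()) {
                toClose.set(h);   // only record the holder; no I/O under the bin lock
                return null;
            }
            return h;
        });
        Holder h = toClose.get();
        if (h != null) {
            h.client.close();     // slow flush runs after the lock is released
        }
    }

    /** Returns how long (in ms) open() waited while another thread was in a slow close(). */
    static long measureOpenWaitMillis(long closeMillis) {
        RemoveThenCloseDemo registry = new RemoveThenCloseDemo();
        CountDownLatch closing = new CountDownLatch(1);
        SlowClient slow = () -> {
            closing.countDown();
            try {
                Thread.sleep(closeMillis);   // stand-in for the network flush
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        registry.open("host", "conn-1", slow);
        Thread closer = new Thread(() -> registry.close("host", "conn-1"), "closer");
        closer.start();
        try {
            closing.await();                 // the slow close is now in progress
            long start = System.nanoTime();
            registry.open("host", "conn-2", () -> { });
            long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            closer.join();
            return waited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

One design consequence worth noting: with this pattern a new client for the same host can be created while the old one is still flushing, so the two may briefly coexist.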

**Workaround for users experiencing this:**

Setting `telemetryLogLevel=OFF` on the JDBC URL (or as a DataSource property) causes `TelemetryHelper.isTelemetryAllowedForConnection` to return `false` at its first check, routing all `getTelemetryClient` calls to `NoopTelemetryClient.getInstance()` and skipping the `ConcurrentHashMap.compute` path entirely. This avoids the deadlock but also disables all driver telemetry to the Databricks side.
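A small sketch of applying the workaround from client code. The property name and value come from the description above; the helper names are hypothetical, and the sketch assumes the JDBC URL uses `;`-separated `key=value` pairs (for example `jdbc:databricks://<host>:443;httpPath=<path>`).

```java
import java.util.Properties;

/** Illustrative only: two ways to pass the workaround property to the driver. */
final class TelemetryOffConfig {

    /** Appends telemetryLogLevel=OFF to a URL that uses ';'-separated properties. */
    static String withTelemetryOff(String jdbcUrl) {
        String separator = jdbcUrl.endsWith(";") ? "" : ";";
        return jdbcUrl + separator + "telemetryLogLevel=OFF";
    }

    /** Properties object suitable for DriverManager.getConnection(url, props). */
    static Properties telemetryOffProperties() {
        Properties props = new Properties();
        props.setProperty("telemetryLogLevel", "OFF");
        return props;
    }
}
```

Either form should be enough on its own; there is no need to set the property in both places.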
