
Commit 7e04c7c

docs: telemetry user guide, design spec, and test summary (#364)
* docs: add telemetry design, test summary, and user guide

  Rebases the docs PR onto main now that #327 has landed. Drops all code changes from the diff; keeps only the new docs (docs/TELEMETRY.md, spec/telemetry-design.md, spec/telemetry-test-completion-summary.md) and swaps README's telemetry section to a short pointer to docs/TELEMETRY.md.

  Co-authored-by: Isaac

* docs: shorten README telemetry section

  Replace the verbose bulleted overview with a tight three-sentence blurb that names what's collected, where the opt-outs live, and points readers to docs/TELEMETRY.md for everything else.

  Co-authored-by: Isaac

* docs: rewrite telemetry guides to be terse and accurate

  - docs/TELEMETRY.md: 717 → 123 lines. Drop duplicate privacy sections, long examples, architecture deep-dive. Fix the inaccurate "disabled by default" claim. Cross-reference IDBSQLClient.ts JSDoc for defaults so the doc can't drift.
  - spec/telemetry-design.md: 2538 → 125 lines. Drop the implementation checklist, exhaustive proto field table, and inline class bodies that mirrored lib/telemetry/*.ts. Keep architecture diagram, per-component responsibilities, export lifecycle, privacy/error/shutdown invariants.
  - spec/telemetry-test-completion-summary.md: 648 → 47 lines. Collapse per-component prose into a single table, drop verbose exit-criteria-verification and quality-metrics sections.

  Co-authored-by: Isaac

* Delete spec/telemetry-test-completion-summary.md

* docs: prettier-format docs/TELEMETRY.md

  Align the configuration-options table columns to satisfy prettier's markdown table formatter. Pure formatting; no content change.

  Co-authored-by: Isaac
1 parent 5f1728a commit 7e04c7c

3 files changed

Lines changed: 257 additions & 82 deletions

File tree

README.md

Lines changed: 9 additions & 82 deletions
@@ -53,88 +53,15 @@ client
5353

5454
## Telemetry
5555

56-
Starting with version 1.13, the driver collects telemetry — connection,
57-
statement, and CloudFetch chunk metrics, plus error events with redacted
58-
stack traces — to help Databricks improve driver performance and
59-
reliability. **Telemetry is enabled by default and gated by a server-side
60-
feature flag**: events are emitted only when the workspace's feature flag
61-
is on. No SQL text, parameter values, or row data are ever included.
62-
63-
### What's collected
64-
65-
- Connection lifecycle (`CREATE_SESSION`, `DELETE_SESSION`) with latency.
66-
- Statement lifecycle (`STATEMENT_START`, `STATEMENT_COMPLETE`) with
67-
execution latency, operation type, and result format.
68-
- CloudFetch chunk timings and byte counts.
69-
- Error events with redacted stack traces (Bearer/JWT tokens, OAuth
70-
secrets, home-directory paths, and Databricks PATs are stripped before
71-
emission).
72-
73-
See `TelemetryEvent` and `TelemetryMetric` in the package exports for the
74-
exact payload shapes.
75-
76-
### Multi-tenant SaaS deployments — read this before enabling telemetry
77-
78-
The telemetry layer shares one per-host `TelemetryClient` across every
79-
`DBSQLClient` connected to the same Databricks workspace host. The
80-
authenticated export path uses the **first-registered** client's auth
81-
provider, User-Agent, and `telemetryAuthenticatedExport` value — these
82-
fields are snapshotted at the host singleton and are **not** per-tenant.
83-
84-
If you are operating a SaaS layer that fronts multiple tenants against the
85-
same Databricks workspace host with a shared driver process, telemetry from
86-
tenant B's queries can be POSTed under tenant A's auth headers, with
87-
tenant A's `userAgentEntry`. A tenant B that explicitly set
88-
`telemetryAuthenticatedExport: false` will still ride tenant A's
89-
authenticated pipeline.
90-
91-
> **Recommendation for multi-tenant deployments**: set
92-
> `telemetryEnabled: false` on all `DBSQLClient` instances, or partition
93-
> by Databricks workspace host so each tenant owns its own
94-
> `TelemetryClient`. Subsequent registrants with diverging auth/UA values
95-
> emit a warn-level log so the leak is at least visible.
96-
97-
### Opting out
98-
99-
Three independent ways to disable telemetry, in order of precedence:
100-
101-
1. **Environment variable** — set `DATABRICKS_TELEMETRY_DISABLED` to one
102-
of `1`, `true`, `yes`, or `on` (case-insensitive). Other values
103-
(empty, `0`, `false`, `off`, `no`) are ignored, leaving the runtime
104-
config in charge.
105-
2. **Programmatic** — pass `telemetryEnabled: false` to `connect()`:
106-
```javascript
107-
await client.connect({
108-
host,
109-
path,
110-
token,
111-
telemetryEnabled: false,
112-
});
113-
```
114-
3. **Server-side** — Databricks-managed feature flag; if disabled for
115-
your workspace, the driver does not emit telemetry regardless of
116-
client config.
117-
118-
### Tuning
119-
120-
If you keep telemetry on, the following knobs are available on
121-
`ConnectionOptions` (see JSDoc on `IDBSQLClient.ts` for defaults and
122-
units):
123-
124-
- `telemetryAuthenticatedExport` — set to `false` to ship reduced
125-
payloads (no statement/session correlation IDs, generic User-Agent)
126-
via the unauthenticated endpoint.
127-
- `telemetryBatchSize`, `telemetryFlushIntervalMs`, `telemetryMaxRetries`
128-
— batching and retry tuning.
129-
- `telemetryCircuitBreakerThreshold`, `telemetryCircuitBreakerTimeout`
130-
circuit-breaker tuning for the export endpoint.
131-
- `telemetryCloseTimeoutMs` — bound on `await client.close()` waiting for
132-
the final flush.
133-
134-
> **Note for short-lived processes**: always `await client.close()`
135-
> before `process.exit(0)` so the final batch is flushed. Without an
136-
> explicit close, the periodic flush timer is `unref()`'d to avoid
137-
> holding the event loop open, so any unflushed events are dropped.
56+
The driver emits connection, statement, and CloudFetch metrics plus
57+
redacted error events to help Databricks improve driver reliability. No
58+
SQL text, parameter values, or row data is ever collected. Emission is
59+
gated by a server-side feature flag and can be disabled per-connection
60+
with `telemetryEnabled: false` or globally with the
61+
`DATABRICKS_TELEMETRY_DISABLED` env var.
62+
63+
See [docs/TELEMETRY.md](docs/TELEMETRY.md) for the full event payloads,
64+
tuning knobs, multi-tenant guidance, and troubleshooting.
13865

13966
## Run Tests
14067

docs/TELEMETRY.md

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
1+
# Telemetry
2+
3+
The driver emits anonymous usage and performance metrics to Databricks to help track driver
4+
adoption, identify performance regressions, and prioritize fixes. Telemetry is **enabled by
5+
default** and is additionally gated by a per-workspace server-side feature flag, so events are
6+
only exported when the workspace has telemetry turned on. No SQL text, parameter values, row
7+
data, table/column names, credentials, or IP addresses are ever collected.
8+
9+
## What's collected
10+
11+
Events are batched per host and exported to the Databricks control plane over HTTPS using the
12+
same auth as your queries.
13+
14+
- **Connection** (`connection.open`): driver version and name, Node.js version, OS platform/
15+
version, and boolean feature toggles (CloudFetch, LZ4, Arrow, direct results) plus numeric
16+
configs (socket timeout, retry max, CloudFetch concurrency).
17+
- **Statement** (`statement.start` / `statement.complete`): randomly generated statement and
18+
session UUIDs, operation type (e.g. `SELECT`), latency, result format, poll count, chunk
19+
count, bytes downloaded.
20+
- **CloudFetch chunk** (`cloudfetch.chunk`): chunk index, download latency, byte size,
21+
compressed flag.
22+
- **Error**: error class name, sanitized message (no PII), HTTP status, terminal-vs-retryable
23+
flag. Stack traces are not transmitted.
24+
25+
Correlation IDs (session ID, statement ID) are random UUIDs and are not tied to user identity.
26+
Workspace ID is included for aggregation.
27+
28+
## Configuration
29+
30+
Options are passed to `new DBSQLClient({...})` (and can be overridden per `connect()` call).
31+
See the JSDoc on `IDBSQLClientConnectionOptions` in
32+
[`lib/contracts/IDBSQLClient.ts`](../lib/contracts/IDBSQLClient.ts) for the authoritative
33+
defaults and full descriptions.
34+
35+
| Option | Purpose |
36+
| ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
37+
| `telemetryEnabled` | Master switch. `false` is a hard opt-out; `true` requests telemetry (still subject to the server flag). |
38+
| `telemetryAuthenticatedExport` | When `true`, exports go to the authenticated `/telemetry-ext` endpoint with full event context. When `false`, only error names go to the unauthenticated endpoint. |
39+
| `telemetryBatchSize` | Events accumulated before a flush. |
40+
| `telemetryFlushIntervalMs` | Periodic flush interval. |
41+
| `telemetryMaxRetries` | Retries per failed export. |
42+
| `telemetryCircuitBreakerThreshold` | Consecutive failures before the per-host breaker opens. |
43+
| `telemetryCircuitBreakerTimeout` | How long the breaker stays open before re-probing. |
44+
| `telemetryCloseTimeoutMs` | Upper bound on the final flush during `client.close()`. |
45+
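To make the batching options in the table above more concrete, here is a minimal sketch of how a size-triggered flush, a periodic flush, and a final flush on close might fit together. It is not the driver's implementation: `MetricBuffer`, `exportBatch`, and the constructor option names are hypothetical, chosen only to mirror `telemetryBatchSize` and `telemetryFlushIntervalMs`. The `unref()` call is an assumption about how a background timer would typically be kept from holding the Node.js event loop open.

```javascript
// Illustrative sketch only: MetricBuffer and exportBatch are hypothetical names,
// not part of @databricks/sql.
class MetricBuffer {
  constructor({ batchSize, flushIntervalMs, exportBatch }) {
    this.batchSize = batchSize;
    this.exportBatch = exportBatch; // callback that ships one batch of events
    this.pending = [];
    // Periodic flush. unref() stops this timer from keeping the process alive,
    // which is why an explicit close() is needed to ship the last batch.
    this.timer = setInterval(() => this.flush(), flushIntervalMs);
    this.timer.unref();
  }

  // Buffers one event and flushes as soon as the batch is full.
  add(metric) {
    this.pending.push(metric);
    if (this.pending.length >= this.batchSize) this.flush();
  }

  // Hands the buffered events to the exporter and swallows any export error,
  // so a telemetry failure never reaches the caller.
  flush() {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    try {
      this.exportBatch(batch);
    } catch (err) {
      // Intentionally ignored in this sketch.
    }
  }

  // Stops the timer and ships whatever is still buffered.
  close() {
    clearInterval(this.timer);
    this.flush();
  }
}
```

In this sketch, events added after the last full batch stay in memory until the timer fires or `close()` is called, which illustrates why a short-lived process that exits without closing would lose its final batch.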
46+
### Basic example
47+
48+
```javascript
49+
const { DBSQLClient } = require('@databricks/sql');
50+
51+
const client = new DBSQLClient();
52+
await client.connect({
53+
host: '********.databricks.com',
54+
path: '/sql/2.0/warehouses/****************',
55+
token: 'dapi********************************',
56+
});
57+
```
58+
59+
### Disabling telemetry
60+
61+
```javascript
62+
const client = new DBSQLClient({ telemetryEnabled: false });
63+
```
64+
65+
## Opt-out
66+
67+
Three independent ways to disable, in order of precedence (first match wins):
68+
69+
1. **Environment variable**: `DATABRICKS_TELEMETRY_DISABLED` set to `1`, `true`, `yes`, or
70+
`on` (case-insensitive) disables telemetry process-wide regardless of any other setting.
71+
2. **Programmatic**: `telemetryEnabled: false` in `DBSQLClient` or `connect()` options is a
72+
hard opt-out for that client.
73+
3. **Server feature flag**: If the workspace's server-side flag is off, no events are exported
74+
even when the client requests them.
75+
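The precedence rules above can be sketched as a small decision function. This is an illustration of the documented order, not the driver's code: `envDisablesTelemetry` and `resolveTelemetry` are hypothetical helpers. The sketch also assumes that the environment value is compared without trimming whitespace and that an unset `telemetryEnabled` means "on", in line with the enabled-by-default behaviour described at the top of this guide.

```javascript
// Illustrative sketch only: these helpers are hypothetical and are not exported
// by @databricks/sql.
const DISABLING_VALUES = new Set(['1', 'true', 'yes', 'on']);

// True when DATABRICKS_TELEMETRY_DISABLED holds one of the documented values.
// Anything else (unset, empty, '0', 'false', ...) leaves the decision to the
// remaining checks.
function envDisablesTelemetry(value) {
  return typeof value === 'string' && DISABLING_VALUES.has(value.toLowerCase());
}

// Walks the three opt-outs in the documented order; the first match wins.
function resolveTelemetry({ envValue, telemetryEnabled, serverFlagOn }) {
  if (envDisablesTelemetry(envValue)) return { enabled: false, reason: 'env' };
  if (telemetryEnabled === false) return { enabled: false, reason: 'option' };
  if (!serverFlagOn) return { enabled: false, reason: 'server-flag' };
  return { enabled: true, reason: null };
}

// Example: the environment variable overrides an explicit telemetryEnabled: true.
const decision = resolveTelemetry({
  envValue: process.env.DATABRICKS_TELEMETRY_DISABLED,
  telemetryEnabled: true,
  serverFlagOn: true,
});
console.log(decision);
```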
76+
## Multi-tenant / SaaS warning
77+
78+
The driver maintains a singleton telemetry client per host (shared across all `DBSQLClient`
79+
instances pointing at the same workspace) to batch events and avoid rate limits. In a
80+
multi-tenant process where multiple tenants connect to the same host with different
81+
credentials, events buffered for tenant A may be flushed using whichever connection happens to
82+
own the authenticated export at the time. Tenant B's auth headers could carry tenant A's
83+
telemetry payload.
84+
85+
If you run a multi-tenant SaaS that proxies queries from distinct end-customers through one
86+
Node process to the same Databricks host, set `telemetryEnabled: false` (or
87+
`telemetryAuthenticatedExport: false`) to prevent cross-tenant attribution in telemetry.
88+
89+
## Troubleshooting
90+
91+
- **No events visible**: confirm `telemetryEnabled` is not `false`, `DATABRICKS_TELEMETRY_DISABLED`
92+
is unset, and the workspace feature flag is on. Look for the debug log
93+
`Telemetry disabled via feature flag`.
94+
- **Events suddenly stop**: the per-host circuit breaker has likely opened after repeated
95+
export failures. Look for `Circuit breaker transitioned to OPEN`; it re-probes automatically
96+
after `telemetryCircuitBreakerTimeout` (default 60s).
97+
- **Buffer pressure / dropped metrics**: check `client.getTelemetryStats().droppedMetrics`. If
98+
it climbs, increase `telemetryMaxPendingMetrics` or lower `telemetryFlushIntervalMs`.
99+
- **Shutdown delay**: `client.close()` waits up to `telemetryCloseTimeoutMs` (default 2s) for
100+
the final flush. Lower it if shutdown latency matters more than the last batch.
101+
- **Telemetry failures impacting the app**: they shouldn't. Exceptions are caught and logged
102+
at debug only; the driver continues regardless. File an issue if you see otherwise.
103+
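For readers unfamiliar with the circuit-breaker behaviour mentioned in the troubleshooting list, the following is a minimal sketch of the general pattern: consecutive failures open the breaker, and after a timeout it lets export attempts through again as probes. It is a generic illustration, not the driver's state machine. The class, its method names, and the `HALF_OPEN` state label are assumptions; the threshold of 3 in the usage lines is illustrative, and only the 60-second timeout comes from the default quoted above.

```javascript
// Illustrative sketch only: a generic circuit breaker, not the driver's code.
class CircuitBreaker {
  constructor({ failureThreshold, openTimeoutMs, now = Date.now }) {
    this.failureThreshold = failureThreshold; // cf. telemetryCircuitBreakerThreshold
    this.openTimeoutMs = openTimeoutMs; // cf. telemetryCircuitBreakerTimeout
    this.now = now; // injectable clock, which makes the sketch easy to test
    this.state = 'CLOSED';
    this.consecutiveFailures = 0;
    this.openedAt = 0;
  }

  // Returns true when an export attempt is allowed right now.
  allowRequest() {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.openTimeoutMs) {
      this.state = 'HALF_OPEN'; // timeout elapsed: allow probing again
    }
    return this.state !== 'OPEN';
  }

  // A successful export closes the breaker and resets the failure count.
  recordSuccess() {
    this.state = 'CLOSED';
    this.consecutiveFailures = 0;
  }

  // A failed probe, or enough consecutive failures, (re)opens the breaker.
  recordFailure() {
    this.consecutiveFailures += 1;
    if (this.state === 'HALF_OPEN' || this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }
}

// Usage: three consecutive failures open the breaker for 60 seconds.
const breaker = new CircuitBreaker({ failureThreshold: 3, openTimeoutMs: 60000 });
breaker.recordFailure();
breaker.recordFailure();
breaker.recordFailure();
console.log(breaker.state, breaker.allowRequest());
```

While the breaker is open, a caller would skip the export entirely, which matches the "events suddenly stop" symptom described in the list.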
104+
## FAQ
105+
106+
**Does telemetry affect query performance?** Event emission is non-blocking and exports are
107+
batched on a background timer. Overhead is well under 1% of query time in typical workloads.
108+
109+
**Can I see what's being sent?** Yes, enable debug-level logging on the driver's logger.
110+
Every export and circuit-breaker transition is logged.
111+
112+
**Where does the data go?** To `/api/2.0/sql/telemetry-ext` (authenticated) or
113+
`/api/2.0/sql/telemetry-unauth` on the same Databricks host you're connected to. It stays in
114+
the same regional control plane as your queries.
115+
116+
**Can I route telemetry to my own backend?** Not via configuration. Disable it and instrument
117+
your application using your own logger/metrics.
118+
119+
**Can I disable telemetry for a single query?** No, the granularity is per-connection. Open a
120+
separate `DBSQLClient` with `telemetryEnabled: false` for the queries you want excluded.
121+
122+
For implementation details (per-host management, circuit breaker state machine, exception
123+
handling policy), see [`spec/telemetry-design.md`](../spec/telemetry-design.md).
