Skip to content

Extract SPOG org-id from cluster httpPath for non-Thrift requests#367

Open
msrathore-db wants to merge 2 commits into
mainfrom
fix-spog-org-id-extract-cluster-path
Open

Extract SPOG org-id from cluster httpPath for non-Thrift requests#367
msrathore-db wants to merge 2 commits into
mainfrom
fix-spog-org-id-extract-cluster-path

Conversation

@msrathore-db
Copy link
Copy Markdown
Contributor

Summary

  • Fixes silent telemetry loss on SPOG (custom-URL) hosts when connecting to an all-purpose cluster via an httpPath like sql/protocolv1/o/<workspace-id>/<cluster-id>.
  • extractSpogHeaders now extracts the workspace ID from the /o/<wsid>/ path segment as a fallback when ?o=<wsid> is not present, and emits an x-databricks-org-id header so the wrapped telemetryClient / feature-flag client can route correctly on SPOG.
  • Priority order preserved: ?o= in httpPath/sql/protocolv1/o/<wsid>/ path segment. Caller-set request headers still win, including mixed-case X-Databricks-Org-Id.

Why

On a SPOG host the workspace identity has to be in either the URL or the x-databricks-org-id header for PoPP to route a request to the correct workspace. For all-purpose cluster Thrift this is free — the workspace ID is in the /o/<wsid>/ segment of httpPath, so PoPP routes Thrift via routing_reason=workspace-id and the session opens fine without an explicit ?o=.

The telemetry and feature-flag transports built by connector.Connect wrap c.client with withSpogHeaders only when extractSpogHeaders(c.cfg.HTTPPath) returns a non-nil map. Before this PR that only happened when ?o= was present, so cluster URLs without ?o= produced no x-databricks-org-id header, PoPP fell back to default (account) routing on /telemetry-ext, and the responses were 303 redirects to /login — silently dropping telemetry on SPOG.

What changes

connector.go

  • New clusterPathOrgIDPattern regex ((?:^|/)sql/protocolv1/o/(\d+)/[^/?]+) compiled once at package init.
  • extractSpogHeaders checks ?o= first (existing behavior), then falls back to the cluster path segment, then returns nil. The malformed-query-string path now also falls through to path inspection instead of returning early. Log messages indicate which source produced the workspace ID.
  • Updated the connection comment to describe both query-param and cluster-path extraction.
  • regexp added to imports.

connector_spog_test.go

  • Four new table entries under TestExtractSpogHeaders:
    • Cluster path without ?o= → header extracted from /o/<wsid>/
    • Cluster path with leading / → same
    • Cluster path with ?o= → query-param value wins
    • Warehouse path without ?o= → still nil (regression guard: the new regex must not match warehouse paths)
  • New transport regression test proving a mixed-case caller-set X-Databricks-Org-Id is not overwritten by the SPOG wrapper.

Test plan

  • go test -run 'TestExtractSpogHeaders|TestHeaderInjectingTransport' -v . — focused SPOG extraction and transport tests pass.
  • go test -short . — root package suite passes.
  • Behavior validated on the OSS JDBC equivalent fix against Prod SPOG (peco.azuredatabricks.net) all-purpose cluster: telemetry POST /telemetry-ext flipped from HTTP 303 → /login to HTTP 200 OK once the path-segment extraction populated x-databricks-org-id. Same shape of fix here.

Out of scope

This pull request and its description were written with assistance from Claude Code.

For all-purpose-compute Thrift connections on SPOG (custom-URL) hosts
httpPath is /sql/protocolv1/o/<workspace-id>/<cluster-id> and the
workspace ID is encoded in the path itself. PoPP routes the Thrift
request correctly off the /o/<wsid>/ segment, so the connection succeeds
without an explicit ?o= query parameter.

Other clients on the same driver (telemetry pushes to /telemetry-ext,
feature-flag fetches, ...) hit different paths that don't carry the
workspace ID. Previously extractSpogHeaders only looked at ?o= in
httpPath, so the x-databricks-org-id header was never set for cluster
URLs without ?o=. On SPOG hosts PoPP then had no workspace context for
these requests and redirected them to /login, silently dropping telemetry.

Extend extractSpogHeaders to also extract the workspace ID from the
cluster path segment via clusterPathOrgIDPattern as a fallback when
?o= is absent. Priority order is preserved: ?o= query param wins over
the path-segment match. Adds four new test cases — cluster path
without ?o=, leading-slash variant, ?o= wins precedence, and a
warehouse-path regression guard so the new regex does not match
warehouse paths.

Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
@gopalldb
Copy link
Copy Markdown
Collaborator

gopalldb commented Jun 1, 2026

🟠 P1 (Important — should fix)

?o= query value is injected into the header without digit validation — connector.go:172
The query-param branch accepts any non-empty string as the org-id (the added test o=first even codifies accepting a non-numeric value), while the new path branch enforces \d+. The two sources now have
asymmetric validation for the same header. A malformed ?o= silently produces a bad routing header; a CRLF value breaks the entire telemetry/FF request (verified). This is pre-existing behavior (from #347),
but this PR — which adds the second, validated source — is the natural place to make them consistent.
→ Fix: validate ?o= against ^\d+$ before use, falling through to path inspection on failure. This also hardens precedence: a junk ?o= would no longer mask a valid cluster-path org-id.

🟡 P2 (Minor)

  • Regex prefix anchor allows nested matches — connector.go:142, (?:^|/)sql/protocolv1/o/(\d+)/[^/?]+. Confirmed evil/sql/protocolv1/o/999/x matches → 999 (while xsql/... correctly doesn't). Only P2 because
    HTTPPath is operator-supplied trusted config and the value is digits-only. → anchor to ^/?sql/protocolv1/o/... if you want strict matching. The [^/?]+ cluster-segment requirement is correct — verified
    /sql/protocolv1/o/123 and /sql/protocolv1/o/123/ (empty cluster) both correctly fail to match. RE2 → no ReDoS.
  • Missing test coverage — added tests cover leading-slash, precedence, and the warehouse regression, but not: the nested-prefix false positive, a warehouse path incidentally containing /sql/protocolv1/o/,
    non-numeric/CRLF ?o= behavior, or the missing-cluster-segment case returning nil. Add these to lock down anchoring + validation (especially alongside the P1 fix).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants