
Route downloads through storage proxy builder #1329

Merged
parthban-db merged 1 commit into main from parthban-db/stack/refactor-files-client-6
Mar 13, 2026

Conversation


@parthban-db parthban-db commented Mar 12, 2026

🥞 Stacked PR

Use this link to review incremental changes.


Summary

Routes file downloads through the request builder abstraction, so downloads go directly to the storage proxy when available instead of fetching presigned URLs from the coordination API.

Why

The upload path already goes through the builder (_create_request_builder()), which routes to the storage proxy when available. But the download path still called the create-download-url coordination API directly, bypassing the proxy. This meant downloads still required an extra round-trip to the control plane even when the proxy was available on the data plane.

What changed

Interface changes

None.

Behavioral changes

  • When the storage proxy is enabled and available, download() and download_to() now issue a GET directly to the proxy (http://storage-proxy.databricks.com/api/2.0/fs/files{path}) instead of calling create-download-url to get a presigned URL first. This eliminates one round-trip per download.
  • When the proxy is not available, behavior is unchanged — presigned URLs are fetched as before.

Internal changes

  • _PresignedUrlRequestBuilder.build_download_url() — new method that calls create-download-url and returns a CreateDownloadUrlResponse. Extracts the logic that was previously inline in _create_download_url.
  • _StorageProxyRequestBuilder.build_download_url() — new method that constructs a direct proxy download URL without calling any coordination API.
  • _create_download_url() — now delegates to _create_request_builder().build_download_url() instead of calling the coordination API inline.
  • PresignedUrlDownloadTestCase — gains use_storage_proxy parameter and _expected_hostname property. Mock handlers added for proxy probe (SCIM Me + HEAD), direct proxy download, and assertion that create-download-url is never called when proxy is active.

How is this tested?

  • 115 unit tests pass (113 existing + 2 new storage proxy download tests).
  • New test cases:
    • Sequential download via proxy: verifies download GET goes to the proxy hostname, create-download-url is not called.
    • Parallel download via proxy: same, with Range headers for parallel chunks.
  • E2E validated on AWS (staging) and GCP (staging) with 200 MB files. Raw urllib3 connection pool logs confirm:
    • http://storage-proxy.databricks.com:80 "GET /api/2.0/fs/files{path} HTTP/1.1" 200 209715200 — download went directly to the proxy.
    • Zero calls to create-download-url — coordination API bypassed.
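
Mechanically, "parallel download with Range headers" means splitting the file into byte ranges and fetching them concurrently. A standalone sketch, where the 8 MiB chunk size and helper name are illustrative rather than the SDK's actual values:

```python
def range_headers(size: int, chunk: int = 8 * 1024 * 1024):
    """Yield the Range header each parallel download worker would send."""
    for start in range(0, size, chunk):
        end = min(start + chunk, size) - 1  # Range bounds are inclusive
        yield {"Range": f"bytes={start}-{end}"}


headers = list(range_headers(20 * 1024 * 1024))
# A 20 MiB file with 8 MiB chunks yields three ranges:
#   bytes=0-8388607, bytes=8388608-16777215, bytes=16777216-20971519
```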

NO_CHANGELOG=true

@parthban-db parthban-db marked this pull request as ready for review March 12, 2026 23:40
github-merge-queue bot pushed a commit that referenced this pull request Mar 13, 2026
## 🥞 Stacked PR
Use this [link](https://github.com/databricks/databricks-sdk-py/pull/1319/files) to review incremental changes.

- [**stack/refactor-files-client-4**](#1319) [[Files changed](https://github.com/databricks/databricks-sdk-py/pull/1319/files)]
- [stack/refactor-files-client-5](#1324) [[Files changed](https://github.com/databricks/databricks-sdk-py/pull/1324/files/85824ec07a796f348263077a05c5339a190f1131..14e13676c5e12adff8f68d5f30aca295a423acdd)]
- [stack/refactor-files-client-6](#1329) [[Files changed](https://github.com/databricks/databricks-sdk-py/pull/1329/files/14e13676c5e12adff8f68d5f30aca295a423acdd..80c211ffa59afea71f5d01697d95381e0d126865)]

---------
## Summary

Adds optional storage-proxy routing for file uploads, gated behind the
`experimental_files_ext_enable_storage_proxy` config flag. When enabled
and the proxy is reachable, uploads bypass the presigned URL
coordination APIs and go directly to the storage proxy.

## Why

When running inside a Databricks cluster or notebook, a storage proxy is
available on the data plane. It can handle file upload operations
directly.

The feature is disabled by default and marked experimental, so there is
no impact on existing users.

## What changed

### Interface changes

One new config field:
- **`experimental_files_ext_enable_storage_proxy: bool = False`** —
enables the storage proxy path.

### Behavioral changes

- When the flag is enabled and the probe succeeds, multipart uploads
(AWS/Azure) and resumable uploads (GCP) go directly to the proxy instead
of fetching presigned URLs from coordination APIs.
- The proxy requires SDK authentication on every request, so an
authenticated session (`session.auth` callback) is used instead of the
unauthenticated session used for presigned URL uploads.
- `build_abort_url` was added to `_PresignedUrlRequestBuilder`,
extracting inline abort URL logic from `_abort_multipart_upload`.

### Internal changes

- **`_StorageProxyRequestBuilder`** — new class with the same method
signatures as `_PresignedUrlRequestBuilder` (`build_upload_part_urls`,
`build_resumable_upload_url`, `build_abort_url`). Constructs URLs
directly to the proxy endpoint instead of calling coordination APIs.
- **`_create_request_builder()`** — factory method that returns the
proxy builder when the proxy is available, otherwise the presigned URL
builder.
- **`_probe_storage_proxy()`** — GET to the proxy ping endpoint with SDK
auth. Result is cached.
- **`_create_storage_proxy_session()`** — cached `requests.Session` with
`session.auth` callback for SDK credentials. Protected by a lock for
thread safety.
- **`_cloud_provider_session()`** — updated to return the authenticated
proxy session when the proxy is active. Session creation protected by a
lock for thread safety.
- **`_get_hostname()`** — updated to return the proxy hostname when the
flag is enabled and the probe succeeds.
- **`_abort_multipart_upload`** — now uses the builder instead of inline
`create-abort-upload-url` API call.
- Test infrastructure: `UploadTestCase` gains `use_storage_proxy`
parameter. Storage proxy test cases added to existing
`MultipartUploadTestCase` and `ResumableUploadTestCase` parametrized
lists. Presigned URL coordination handlers assert they are never called
when storage proxy is active.

## How is this tested?

- Unit tests.
- E2E validated on two cloud environments with 200 MB file uploads:
  - **AWS** (AWS staging): multipart upload via proxy, 20 parallel parts, hash verified.
  - **GCP** (GCP staging): resumable upload via proxy, 20 sequential chunks, hash verified.
- Both environments confirmed: `_StorageProxyRequestBuilder` used, no
`create-upload-part-urls`/`create-resumable-upload-url` calls,
`session.auth` callback active, uploaded and downloaded hashes match.


NO_CHANGELOG=true
@parthban-db parthban-db force-pushed the parthban-db/stack/refactor-files-client-6 branch from 80c211f to 8d52ace on March 13, 2026 09:58
@parthban-db parthban-db requested a review from Divyansh-db March 13, 2026 09:58
github-merge-queue bot pushed a commit that referenced this pull request Mar 13, 2026
## 🥞 Stacked PR
Use this [link](https://github.com/databricks/databricks-sdk-py/pull/1324/files) to review incremental changes.

- [**stack/refactor-files-client-5**](#1324) [[Files changed](https://github.com/databricks/databricks-sdk-py/pull/1324/files)]
- [stack/refactor-files-client-6](#1329) [[Files changed](https://github.com/databricks/databricks-sdk-py/pull/1329/files/214368bb27264d3f66156799b719bf75adac6911..8d52ace6e5b7bf1ce9a712345a9938fb3eb63821)]

---------
## Summary

Adds workspace ID validation to the storage proxy probe so the proxy is
only used when it can serve the target workspace. Prevents false
positives when a `WorkspaceClient` is configured to point at a different
workspace than the one hosting the data plane.

## Why

The previous probe (`GET /DatabricksInternal/Probes/ping`) only checked
if the proxy was reachable. It didn't verify whether the proxy could
serve the specific workspace the client is targeting. This meant a
`WorkspaceClient` configured with a different workspace's host and token
could incorrectly route uploads through the local proxy, which cannot
serve that workspace.

## What changed

### Interface changes

None.

### Behavioral changes

- The probe now resolves the workspace ID via the SCIM Me endpoint (`GET
/api/2.0/preview/scim/v2/Me`) and passes it as `?ew=<workspace_id>` to
the probe URL. The proxy returns 200 only if it can serve that
workspace, 403 otherwise.
- If the workspace ID cannot be resolved (e.g., auth fails on the target
host), the probe is skipped and the SDK falls back to presigned URLs.

### Internal changes

- **`_resolve_workspace_id()`** — new method that calls SCIM Me to get
the workspace ID from the `X-Databricks-Org-Id` response header.
- **`_probe_storage_proxy()`** — updated to resolve workspace ID first,
skip if unavailable, use HEAD method, and pass `ew` parameter.
- Test mock updated to handle the SCIM Me endpoint and assert the `ew`
parameter is present in the probe URL.
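
The probe flow described above (resolve workspace ID, skip if unresolvable, HEAD with `ew`, treat 403 as "cannot serve") can be sketched with the HTTP calls stubbed out as injected callables. Everything except the endpoint paths and status-code semantics, which come from this PR description, is an illustrative assumption:

```python
from typing import Callable, Optional


def probe_storage_proxy(
    resolve_workspace_id: Callable[[], Optional[str]],
    head_probe: Callable[[str], int],
) -> bool:
    ws_id = resolve_workspace_id()  # SCIM Me -> X-Databricks-Org-Id header
    if ws_id is None:
        # Can't verify the target workspace (e.g. auth failed):
        # skip the probe and fall back to presigned URLs.
        return False
    status = head_probe(f"/DatabricksInternal/Probes/ping?ew={ws_id}")
    return status == 200  # 403 means the proxy can't serve this workspace
```

The cross-workspace E2E case below exercises exactly the 403 branch: the local proxy rejects the foreign workspace ID and the client falls back to presigned URLs.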

## How is this tested?

- Unit tests
- E2E validated on 5 paths:
  - **AWS proxy (on DP)**: SUCCESS — probe HEAD with `ew` returns 200, `_StorageProxyRequestBuilder` used, 200 MB hash verified.
  - **GCP proxy (on DP)**: SUCCESS — same.
  - **AWS presigned (off DP)**: SUCCESS — flag disabled, `_PresignedUrlRequestBuilder` used, 200 MB hash verified.
  - **GCP presigned (off DP)**: SUCCESS — same.
  - **Cross-workspace (AWS DP, GCP token/host)**: SUCCESS — SCIM Me resolves GCP workspace ID, probe sends `ew=<gcp_id>` to local AWS proxy, proxy returns 403, correctly falls back to presigned URLs.

NO_CHANGELOG=true
@parthban-db parthban-db force-pushed the parthban-db/stack/refactor-files-client-6 branch from 8d52ace to 9c1fa72 on March 13, 2026 13:29
@github-actions

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/sdk-py

Inputs:

  • PR number: 1329
  • Commit SHA: 9c1fa72914d5daf8a34cb3a21cc2a9624d1f5d45

Checks will be approved automatically on success.

@parthban-db parthban-db added this pull request to the merge queue Mar 13, 2026
Merged via the queue into main with commit 09a313a Mar 13, 2026
17 of 18 checks passed
@parthban-db parthban-db deleted the parthban-db/stack/refactor-files-client-6 branch March 13, 2026 16:08