Route downloads through storage proxy builder #1329
Merged
parthban-db merged 1 commit into main on Mar 13, 2026
Conversation
github-merge-queue bot pushed a commit that referenced this pull request on Mar 13, 2026:
## 🥞 Stacked PR

Use this [link](https://github.com/databricks/databricks-sdk-py/pull/1319/files) to review incremental changes.

- [**stack/refactor-files-client-4**](#1319) [[Files changed](https://github.com/databricks/databricks-sdk-py/pull/1319/files)]
- [stack/refactor-files-client-5](#1324) [[Files changed](https://github.com/databricks/databricks-sdk-py/pull/1324/files/85824ec07a796f348263077a05c5339a190f1131..14e13676c5e12adff8f68d5f30aca295a423acdd)]
- [stack/refactor-files-client-6](#1329) [[Files changed](https://github.com/databricks/databricks-sdk-py/pull/1329/files/14e13676c5e12adff8f68d5f30aca295a423acdd..80c211ffa59afea71f5d01697d95381e0d126865)]

---

## Summary

Adds optional storage-proxy routing for file uploads, gated behind the `experimental_files_ext_enable_storage_proxy` config flag. When enabled and the proxy is reachable, uploads bypass the presigned URL coordination APIs and go directly to the storage proxy.

## Why

When running inside a Databricks cluster or notebook, a storage proxy is available on the data plane and can handle file upload operations directly. The feature is disabled by default and marked experimental, so there is no impact on existing users.

## What changed

### Interface changes

One new config field:

- **`experimental_files_ext_enable_storage_proxy: bool = False`** — enables the storage proxy path.

### Behavioral changes

- When the flag is enabled and the probe succeeds, multipart uploads (AWS/Azure) and resumable uploads (GCP) go directly to the proxy instead of fetching presigned URLs from coordination APIs.
- The proxy requires SDK authentication on every request, so an authenticated session (`session.auth` callback) is used instead of the unauthenticated session used for presigned URL uploads.
- `build_abort_url` was added to `_PresignedUrlRequestBuilder`, extracting inline abort URL logic from `_abort_multipart_upload`.
### Internal changes

- **`_StorageProxyRequestBuilder`** — new class with the same method signatures as `_PresignedUrlRequestBuilder` (`build_upload_part_urls`, `build_resumable_upload_url`, `build_abort_url`). Constructs URLs directly to the proxy endpoint instead of calling coordination APIs.
- **`_create_request_builder()`** — factory method that returns the proxy builder when the proxy is available, otherwise the presigned URL builder.
- **`_probe_storage_proxy()`** — GET to the proxy ping endpoint with SDK auth. Result is cached.
- **`_create_storage_proxy_session()`** — cached `requests.Session` with `session.auth` callback for SDK credentials. Protected by a lock for thread safety.
- **`_cloud_provider_session()`** — updated to return the authenticated proxy session when the proxy is active. Session creation protected by a lock for thread safety.
- **`_get_hostname()`** — updated to return the proxy hostname when the flag is enabled and the probe succeeds.
- **`_abort_multipart_upload`** — now uses the builder instead of an inline `create-abort-upload-url` API call.
- Test infrastructure: `UploadTestCase` gains a `use_storage_proxy` parameter. Storage proxy test cases added to the existing `MultipartUploadTestCase` and `ResumableUploadTestCase` parametrized lists. Presigned URL coordination handlers assert they are never called when the storage proxy is active.

## How is this tested?

- Unit tests.
- E2E validated on two cloud environments with 200 MB file uploads:
  - **AWS** (AWS staging): multipart upload via proxy, 20 parallel parts, hash verified.
  - **GCP** (GCP staging): resumable upload via proxy, 20 sequential chunks, hash verified.
- Both environments confirmed: `_StorageProxyRequestBuilder` used, no `create-upload-part-urls`/`create-resumable-upload-url` calls, `session.auth` callback active, uploaded and downloaded hashes match.

NO_CHANGELOG=true
Force-pushed from 80c211f to 8d52ace
github-merge-queue bot pushed a commit that referenced this pull request on Mar 13, 2026:
## 🥞 Stacked PR

Use this [link](https://github.com/databricks/databricks-sdk-py/pull/1324/files) to review incremental changes.

- [**stack/refactor-files-client-5**](#1324) [[Files changed](https://github.com/databricks/databricks-sdk-py/pull/1324/files)]
- [stack/refactor-files-client-6](#1329) [[Files changed](https://github.com/databricks/databricks-sdk-py/pull/1329/files/214368bb27264d3f66156799b719bf75adac6911..8d52ace6e5b7bf1ce9a712345a9938fb3eb63821)]

---

## Summary

Adds workspace ID validation to the storage proxy probe so the proxy is only used when it can serve the target workspace. Prevents false positives when a `WorkspaceClient` is configured to point at a different workspace than the one hosting the data plane.

## Why

The previous probe (`GET /DatabricksInternal/Probes/ping`) only checked if the proxy was reachable. It didn't verify whether the proxy could serve the specific workspace the client is targeting. This meant a `WorkspaceClient` configured with a different workspace's host and token could incorrectly route uploads through the local proxy, which cannot serve that workspace.

## What changed

### Interface changes

None.

### Behavioral changes

- The probe now resolves the workspace ID via the SCIM Me endpoint (`GET /api/2.0/preview/scim/v2/Me`) and passes it as `?ew=<workspace_id>` to the probe URL. The proxy returns 200 only if it can serve that workspace, 403 otherwise.
- If the workspace ID cannot be resolved (e.g., auth fails on the target host), the probe is skipped and the SDK falls back to presigned URLs.

### Internal changes

- **`_resolve_workspace_id()`** — new method that calls SCIM Me to get the workspace ID from the `X-Databricks-Org-Id` response header.
- **`_probe_storage_proxy()`** — updated to resolve the workspace ID first, skip if unavailable, use the HEAD method, and pass the `ew` parameter.
- Test mock updated to handle the SCIM Me endpoint and assert the `ew` parameter is present in the probe URL.

## How is this tested?
- Unit tests
- E2E validated on 5 paths:
  - **AWS proxy (on DP)**: SUCCESS — probe HEAD with `ew` returns 200, `_StorageProxyRequestBuilder` used, 200 MB hash verified.
  - **GCP proxy (on DP)**: SUCCESS — same.
  - **AWS presigned (off DP)**: SUCCESS — flag disabled, `_PresignedUrlRequestBuilder` used, 200 MB hash verified.
  - **GCP presigned (off DP)**: SUCCESS — same.
  - **Cross-workspace (AWS DP, GCP token/host)**: SUCCESS — SCIM Me resolves the GCP workspace ID, probe sends `ew=<gcp_id>` to the local AWS proxy, proxy returns 403, correctly falls back to presigned URLs.

NO_CHANGELOG=true
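The probe flow described above can be sketched as two small helpers. The helper names (`resolve_workspace_id`, `build_probe_url`) are hypothetical; only the `X-Databricks-Org-Id` header, the `/DatabricksInternal/Probes/ping` path, and the `ew` query parameter come from the PR description:

```python
# Illustrative sketch of the workspace-aware probe, not the SDK's code.
from typing import Optional
from urllib.parse import urlencode

PROBE_PATH = "/DatabricksInternal/Probes/ping"  # from the PR description


def resolve_workspace_id(scim_me_headers: dict) -> Optional[str]:
    # The workspace ID is read from the X-Databricks-Org-Id response
    # header of GET /api/2.0/preview/scim/v2/Me. Returning None models
    # the "cannot resolve, skip the probe" fallback.
    return scim_me_headers.get("X-Databricks-Org-Id")


def build_probe_url(proxy_host: str, workspace_id: str) -> str:
    # HEAD <proxy>/DatabricksInternal/Probes/ping?ew=<workspace_id>
    # The proxy answers 200 only if it can serve that workspace, else 403.
    return f"{proxy_host}{PROBE_PATH}?{urlencode({'ew': workspace_id})}"
```

This is why the cross-workspace E2E case above works: the `ew` value resolved from the GCP workspace does not match what the local AWS proxy can serve, so the probe gets a 403 and the SDK falls back to presigned URLs.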
Force-pushed from 8d52ace to 9c1fa72
Divyansh-db approved these changes on Mar 13, 2026
## 🥞 Stacked PR

Use this link to review incremental changes.
## Summary
Routes file downloads through the request builder abstraction, so downloads go directly to the storage proxy when available instead of fetching presigned URLs from the coordination API.
## Why
The upload path already goes through the builder (`_create_request_builder()`), which routes to the storage proxy when available. But the download path still called the `create-download-url` coordination API directly, bypassing the proxy. This meant downloads still required an extra round-trip to the control plane even when the proxy was available on the data plane.

## What changed
### Interface changes

None.

### Behavioral changes
- `download()` and `download_to()` now GET directly from the proxy (`http://storage-proxy.databricks.com/api/2.0/fs/files{path}`) instead of calling `create-download-url` to get a presigned URL first. This eliminates one round-trip for each download.

### Internal changes
- **`_PresignedUrlRequestBuilder.build_download_url()`** — new method that calls `create-download-url` and returns a `CreateDownloadUrlResponse`. Extracts the logic that was previously inline in `_create_download_url`.
- **`_StorageProxyRequestBuilder.build_download_url()`** — new method that constructs a direct proxy download URL without calling any coordination API.
- **`_create_download_url()`** — now delegates to `_create_request_builder().build_download_url()` instead of calling the coordination API inline.
- **`PresignedUrlDownloadTestCase`** — gains a `use_storage_proxy` parameter and an `_expected_hostname` property. Mock handlers added for the proxy probe (SCIM Me + HEAD), the direct proxy download, and an assertion that `create-download-url` is never called when the proxy is active.

## How is this tested?
- Unit tests assert `create-download-url` is not called.
- E2E: `GET http://storage-proxy.databricks.com:80 /api/2.0/fs/files{path} HTTP/1.1" 200 209715200` — the download went directly to the proxy.
- No calls to `create-download-url` — the coordination API was bypassed.

NO_CHANGELOG=true
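The two download paths described above can be sketched as a single URL-building step. `build_download_url` here is a hypothetical stand-in; only the proxy route shape (`/api/2.0/fs/files{path}`) and the presigned fallback via `create-download-url` come from the PR description:

```python
# Illustrative sketch of the download-URL selection, not the SDK's code.

def build_download_url(use_proxy: bool, proxy_host: str, file_path: str,
                       presigned_url: str) -> str:
    # Proxy path: GET straight from the data-plane proxy, no coordination
    # API round-trip needed.
    if use_proxy:
        return f"{proxy_host}/api/2.0/fs/files{file_path}"
    # Fallback path: the URL would come from the create-download-url
    # coordination API; a pre-fetched value stands in for that call here.
    return presigned_url
```

The round-trip saving claimed above falls out directly: the proxy branch is pure string construction, while the presigned branch requires one extra control-plane call before the download can start.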