Skip to content

refactor(amber): stop hardcoding S3 in REST catalog init#4988

Open
mengw15 wants to merge 3 commits intoapache:mainfrom
mengw15:refactor/remove-hardcoded-s3-from-rest-catalog
Open

refactor(amber): stop hardcoding S3 in REST catalog init#4988
mengw15 wants to merge 3 commits intoapache:mainfrom
mengw15:refactor/remove-hardcoded-s3-from-rest-catalog

Conversation

@mengw15
Copy link
Copy Markdown
Contributor

@mengw15 mengw15 commented May 8, 2026

What changes were proposed in this PR?

Stop hardcoding s3.endpoint, s3.region, s3.path-style-access, s3.access-key-id and s3.secret-access-key at REST-catalog init in both IcebergUtil.createRestCatalog (Scala) and iceberg_utils.create_rest_catalog (Python). Both helpers now pass only warehouse + catalog uri (and on the Scala side the FileIO impl hint).

Why: When a Lakekeeper warehouse is created, its S3 settings (endpoint, region, credentials, path-style) are registered against that warehouse on the server. At catalog init the client only needs warehouse + uri — Lakekeeper resolves the S3 config from the warehouse record and serves it back. The hardcoded StorageConfig.s3* values on the client were redundant, and forcing them everywhere also pinned every warehouse to the single system bucket. Removing them lets each warehouse own its own storage settings.

StorageConfig.s3* itself is kept — pytexera/storage/large_binary_manager.py still uses it for the non-Iceberg texera-large-binaries bucket (R UDF large-binary support), which is out of scope.

Any related issues, documentation, discussions?

Closes #4987

How was this PR tested?

  • sbt "WorkflowCore/compile" — passes; verifies no other Scala caller depends on the removed properties.
  • Python edits parse cleanly via ast.parse; the only caller (iceberg_catalog_instance.py) is updated to match the new create_rest_catalog signature.

End-to-end verification (warehouse with its own S3 settings → REST catalog opened with only warehouse + uri → table round-trip) requires a running Lakekeeper, which CI doesn't have today. #4276 (draft) wires Lakekeeper into CI; once that lands I'll add the integration test on top of it.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

@github-actions github-actions Bot added python refactor Refactor the code common labels May 8, 2026
@mengw15 mengw15 requested a review from bobbai00 May 8, 2026 20:50
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 8, 2026

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.72%. Comparing base (deb0e2c) to head (754c29a).

Files with missing lines Patch % Lines
...ala/org/apache/texera/amber/util/IcebergUtil.scala 0.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4988      +/-   ##
============================================
- Coverage     42.72%   42.72%   -0.01%     
+ Complexity     2185     2184       -1     
============================================
  Files          1031     1031              
  Lines         38152    38146       -6     
  Branches       4004     4004              
============================================
- Hits          16301    16298       -3     
+ Misses        20831    20827       -4     
- Partials       1020     1021       +1     
Flag Coverage Δ *Carryforward flag
access-control-service 39.53% <ø> (ø)
agent-service 33.72% <ø> (ø) Carriedforward from 0609d3d
amber 43.21% <0.00%> (-0.01%) ⬇️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 0.00% <ø> (ø)
file-service 32.18% <ø> (ø)
frontend 33.08% <ø> (ø) Carriedforward from 0609d3d
python 88.90% <ø> (ø)
workflow-compiling-service 47.72% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Yicong-Huang
Copy link
Copy Markdown
Contributor

Thanks @mengw15 can you make the description a bit concise about why this change is needed? Also please add tests to confirm this change works.

@mengw15
Copy link
Copy Markdown
Contributor Author

mengw15 commented May 9, 2026

Thanks @mengw15 can you make the description a bit concise about why this change is needed? Also please add tests to confirm this change works.

Thanks for the questions.

Why the change is needed. When a Lakekeeper warehouse is created, the S3 settings (endpoint, region, credentials, path-style, etc.) are already registered against that warehouse on the server side. At REST-catalog init the client only needs the warehouse identifier and uri — Lakekeeper resolves and serves the S3 config from the warehouse record. The previously hardcoded s3.* properties from StorageConfig were therefore redundant on the client; deleting them lets each warehouse own its own storage settings instead of all warehouses being forced onto the system bucket. I'll tighten the PR description to say just this.

About tests. End-to-end verification needs a running Lakekeeper, which CI doesn't have yet. #4276 (draft) adds Lakekeeper to CI; once that lands I'll layer an integration test on top of it that creates a warehouse with its own S3 settings, opens a REST catalog with only warehouse + uri, and round-trips a table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Iceberg REST catalog should not hardcode S3 properties at init

3 participants