
feat(ingestion): Databricks E2E CLI tests + tableDiff PAT extraction fix #27963

Open
ulixius9 wants to merge 4 commits into main from feat/databricks-e2e-tests

Conversation


@ulixius9 ulixius9 commented May 7, 2026

Summary

  • Add a comprehensive Databricks E2E CLI test suite (ingestion/tests/cli_e2e/test_cli_databricks.py) mirroring the Snowflake one — vanilla ingestion, profiler, auto-classification, lineage, schema/table filters, mark-deleted, data quality, complex column types (STRUCT/ARRAY/MAP), external table lineage, plus a combined Databricks-features test (table tag, column tag, FK informational constraint, liquid clustering, view, materialized view).
  • Fix a BaseTableParameter._get_service_connection_config lookup bug in databricks_base.py that was reading connection.token (flat) when the schema actually nests it under connection.authType.token. The empty-token URL silently fell back to OAuth U2M browser auth in databricks-sql-connector, hanging non-interactive tableDiff runs. The fix reads from authType.token first, falls back to the flat path for backwards compat, and raises a clear ValueError rather than producing a hang-on-browser URL when no PAT is found.
  • Relax the racy sink-record-count assertion in common/test_cli_db.py (used by every connector E2E test). The metadata-rest sink batches PUTs and final-flushes on workflow close, but Status is logged before that flush — so sink_status.records understates the true count. Replaced with source records + zero sink failures + (in the Databricks override) an end-to-end retrieve_table API check that proves data actually landed. Should also un-flake other connectors that hit the same assertion.
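The nested-lookup fix described in the second bullet can be sketched as follows (a minimal sketch: `extract_pat_token` and the attribute walk are illustrative, not the PR's exact code):

```python
def extract_pat_token(connection) -> str:
    """Prefer the nested connection.authType.token, fall back to the
    legacy flat connection.token, and fail fast when neither has a PAT."""
    auth_type = getattr(connection, "authType", None)
    token = getattr(auth_type, "token", None) if auth_type is not None else None
    if token is None:
        # Backwards compatibility: older configs kept the PAT flat on the connection.
        token = getattr(connection, "token", None)
    if not token:
        # An empty token would build `databricks://:@host/...`, which
        # databricks-sql-connector answers with OAuth U2M browser auth,
        # hanging non-interactive tableDiff runs.
        raise ValueError("No Databricks PAT found under authType.token or token")
    return token
```

Raising here trades a silent hang for an immediate, diagnosable failure in CI.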

Files changed

File | What
ingestion/src/metadata/ingestion/source/database/common/data_diff/databricks_base.py | Fix nested authType.token lookup; raise on missing PAT instead of silent OAuth fallback
ingestion/tests/cli_e2e/common/test_cli_db.py | Drop racy sink count assertion; document the batching reason
ingestion/tests/unit/topology/database/test_databricks_migration.py | New unit tests for nested-authType lookup + missing-token error
ingestion/tests/cli_e2e/test_cli_databricks.py | New: 16-test Databricks E2E suite
ingestion/tests/cli_e2e/database/databricks/databricks.yaml | New: connector config (env-var driven)

Test plan

  • pytest ingestion/tests/unit/topology/database/test_databricks_migration.py — all 8 unit tests pass (3 new + 5 existing)
  • pytest ingestion/tests/cli_e2e/test_cli_databricks.py::DatabricksCliTest::test_vanilla_ingestion — passes against a real Databricks Unity Catalog workspace
  • pytest ingestion/tests/cli_e2e/test_cli_databricks.py::DatabricksCliTest::test_data_quality — XPASS after the databricks_base.py fix (previously hung opening a browser)
  • pytest ingestion/tests/cli_e2e/test_cli_databricks.py::DatabricksCliTest::test_complex_column_types — passes
  • pytest ingestion/tests/cli_e2e/test_cli_databricks.py::DatabricksCliTest::test_databricks_features_ingestion — passes (FK assertion soft-warns due to a separate Databricks-dialect FK-extraction gap, documented inline)
  • pytest --collect-only lists all 16 tests
  • nox --no-venv -s static-checks — no new errors above baseline (verified by stashing changes; baseline output is byte-identical)

Pre-reqs to run the E2E suite locally

export E2E_DATABRICKS_HOST_PORT=<workspace>.cloud.databricks.com:443
export E2E_DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/<id>
export E2E_DATABRICKS_TOKEN=<PAT>
export E2E_DATABRICKS_CATALOG=e2e_db   # or any UC catalog the principal owns
pip install pytest-order pytest-timeout
pytest ingestion/tests/cli_e2e/test_cli_databricks.py --timeout=900

Catalog must already exist in UC (CREATE CATALOG requires metastore-admin). Schemas and tables are created/torn down by setUp/tearDownClass.

Known gaps (out of scope for this PR)

  1. FK informational constraints — Databricks SQLAlchemy dialect doesn't extract them from information_schema.referential_constraints. The features test's FK assertion soft-warns. Worth a follow-up to add get_foreign_keys on DatabricksDialect.
  2. External-table lineage test — skips unless E2E_DATABRICKS_EXTERNAL_LOCATION is set to a UC external location the principal can write to.
  3. Materialized view branch — only asserts when the warehouse supports MV (DLT/serverless); silently skipped otherwise.
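For the first gap, a follow-up `get_foreign_keys` on the dialect would roughly query the two standard information_schema views. The SQL below is a hypothetical sketch: view and column names follow ANSI information_schema conventions and are assumptions here, not a verified Databricks dialect implementation.

```python
def build_foreign_keys_query(schema: str, table: str) -> str:
    """Hypothetical query a DatabricksDialect.get_foreign_keys override
    could issue; identifiers are assumptions, not verified UC schema."""
    return f"""
        SELECT
            rc.constraint_name,
            kcu.column_name,
            rc.unique_constraint_name
        FROM information_schema.referential_constraints AS rc
        JOIN information_schema.key_column_usage AS kcu
          ON kcu.constraint_name = rc.constraint_name
        WHERE kcu.table_schema = '{schema}'
          AND kcu.table_name = '{table}'
    """
```

A real implementation would bind parameters through SQLAlchemy rather than interpolating strings.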

🤖 Generated with Claude Code


Summary by Gitar

  • Improved test connection:
    • Replaced the simple catalog lookup with select_test_catalog to automatically skip __databricks_internal and federated (FOREIGN) catalogs.
    • This prevents connection probe failures caused by pushing down information_schema queries to external sources with stale credentials.
    • Added unit tests in test_unity_catalog_connection.py covering manual catalog overrides and filtering logic for various catalog types.
  • Refined PAT extraction:
    • Updated _extract_pat_token to explicitly handle None and improve string conversion logic for Databricks PATs.
    • Added stricter validation to ensure empty tokens raise a ValueError rather than triggering silent OAuth fallback.


…AT fix

Add a Databricks E2E test suite mirroring the Snowflake one (vanilla
ingestion, profiler, auto-classification, lineage, filters, mark-deleted,
data quality, complex types, external table lineage, plus a combined
features test for tags, FK, clustering, view, and materialized view).

Fix a pre-existing bug in BaseTableParameter._get_service_connection_config
where the token lookup walked `connection.token` (flat) but the schema
has it nested under `connection.authType.token`. The empty-token URL
silently triggered OAuth U2M browser fallback in databricks-sql-connector,
hanging non-interactive runs. The fix reads from authType.token first,
falls back to the legacy flat path for backwards compatibility, and
raises a clear ValueError when no PAT is found rather than producing a
URL that hangs on a browser prompt.

Relax the racy sink-record-count assertion in
common/test_cli_db.py: the metadata-rest sink batches PUTs and
final-flushes on workflow close — the Status object is logged before
that flush so sink_status.records understates the true count. Replaced
with source records + zero sink failures + (in the Databricks override)
an end-to-end retrieve_table API check that proves data actually landed.
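A rough shape of the relaxed check (the status field names mirror the description above but are assumptions about the actual test helpers):

```python
from dataclasses import dataclass, field


@dataclass
class StepStatus:
    records: list = field(default_factory=list)
    failures: list = field(default_factory=list)


def assert_ingestion_ok(source_status: StepStatus, sink_status: StepStatus) -> None:
    # sink_status.records is logged before the metadata-rest sink's final
    # flush, so it can understate the true count; assert on what is
    # reliable instead of comparing source count == sink count.
    assert len(source_status.records) > 0, "source read no records"
    assert not sink_status.failures, f"sink reported failures: {sink_status.failures}"
    # Connector overrides can additionally hit the retrieve-table API to
    # prove the entities actually landed end-to-end.
```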

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 7, 2026 10:56
@ulixius9 ulixius9 requested a review from a team as a code owner May 7, 2026 10:56
@github-actions github-actions Bot added the `Ingestion` and `safe to test` (run secure GitHub workflows on PRs) labels May 7, 2026
ulixius9 and others added 2 commits May 7, 2026 16:29
The test-connection step previously walked `catalogs.list()` and picked
the first catalog (excluding `__databricks_internal`). When that first
catalog was a foreign / federated catalog (Lakehouse Federation against
an external Postgres / MySQL / etc.), the `information_schema.*_tags`
queries get pushed down to the source DB and fail on stale or absent
credentials — causing a healthy Unity Catalog workspace to report a
broken test connection.

Extract the catalog selection into `select_test_catalog`:
- Honor `configured_catalog` from the service config when set (the user
  has explicitly chosen the catalog and may have foreign creds wired up).
- Otherwise skip both `__databricks_internal` and any catalog whose
  `catalog_type` contains `FOREIGN`, picking the first remaining catalog.

Add unit tests covering: configured-catalog override, foreign-catalog
honoring when explicitly pinned, foreign-catalog skipping when auto-
selecting, and the no-internal-catalog fallback.
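The selection rule above can be sketched as (a minimal sketch; catalog objects are assumed to expose `.name` and `.catalog_type` like the SDK's catalog-info objects):

```python
from typing import Optional


def select_test_catalog(catalogs, configured_catalog: Optional[str] = None) -> Optional[str]:
    """Pick a catalog for the connection probe: honor an explicit choice,
    otherwise skip internal and federated (FOREIGN) catalogs."""
    if configured_catalog:
        # Explicitly pinned by the user, who may have foreign creds wired up.
        return configured_catalog
    for catalog in catalogs:
        if catalog.name == "__databricks_internal":
            continue
        if "FOREIGN" in (catalog.catalog_type or ""):
            continue
        return catalog.name
    return None
```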

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… extraction

Code review surfaced that the previous guard only checked `if token is
None`. An empty-string token (set via `token: ""` in YAML or an empty
`$E2E_DATABRICKS_TOKEN` env var) would slip past the None check, produce
a `databricks://:@host/...` URL, and silently trigger OAuth U2M browser
fallback in `databricks-sql-connector` — exactly the failure mode the
docstring warned against.

Resolve the value first (handling SecretStr, None, and any object that
stringifies to ""), then validate. This catches:

- `token = None` (missing attribute)
- `token = ""` (empty plain string)
- `token = SecretStr("")` (empty SecretStr / Pydantic field)

with a single `if not token_value` check.

Adds two unit tests covering the empty-string and empty-SecretStr cases
to lock in the behavior.
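The three cases can be exercised with a minimal stand-in for Pydantic's SecretStr (a sketch for illustration; only `get_secret_value` is mimicked):

```python
class FakeSecretStr:
    """Minimal stand-in for pydantic.SecretStr, for illustration only."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        return self._value


def resolve_token(token) -> str:
    # Resolve to a bare string first, then validate with a single
    # falsiness check, so None, "", and SecretStr("") all fail alike.
    if token is None:
        token_value = ""
    elif hasattr(token, "get_secret_value"):
        token_value = token.get_secret_value()
    else:
        token_value = str(token)
    if not token_value:
        raise ValueError("Databricks PAT is missing or empty")
    return token_value
```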

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR strengthens Databricks ingestion and validation by adding a Databricks-focused CLI E2E suite, fixing PAT extraction for Databricks data-diff URL construction (to avoid OAuth browser fallback hangs), and reducing flakiness in connector E2E assertions by removing a racy metadata-rest sink record-count check.

Changes:

  • Add a comprehensive Databricks CLI E2E test suite plus Databricks connector YAML configuration.
  • Fix Databricks data-diff connection URL building to read the PAT from authType.token (with legacy fallback) and raise when missing.
  • Relax CLI E2E assertions by removing sink record-count checks that can under-report due to batched flush timing, and add stronger Databricks-specific end-to-end API verification.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

File | Description
ingestion/src/metadata/ingestion/source/database/common/data_diff/databricks_base.py | Fix PAT extraction path for Databricks data-diff URL generation and introduce explicit error on missing PAT.
ingestion/tests/cli_e2e/common/test_cli_db.py | Remove racy sink record-count assertions; document metadata-rest batching/flush timing.
ingestion/tests/unit/topology/database/test_databricks_migration.py | Add unit coverage for nested authType.token extraction and missing-token error behavior.
ingestion/tests/cli_e2e/test_cli_databricks.py | Introduce a Databricks CLI E2E suite covering ingestion, profiler, classification, lineage, data quality, and feature ingestion scenarios.
ingestion/tests/cli_e2e/database/databricks/databricks.yaml | Add Databricks E2E workflow config driven by environment variables (host/httpPath/token/catalog).

Comment on lines +56 to +62:

    # Resolve to the bare string before validating: an empty-string token
    # (e.g. `$E2E_DATABRICKS_TOKEN` set but empty, or `token: ""` in YAML)
    # would otherwise build a URL like `databricks://:@host/...` and the
    # SQL driver would fall back to OAuth U2M, opening a browser. Validate
    # the resolved value so we fail fast in non-interactive environments.
    token_value = (
        token.get_secret_value()

Comment on lines +128 to +129:

    def test_empty_string_token_raises(self):

github-actions Bot commented May 7, 2026

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

CI surfaced two checks failing on the previous commit:

1. Static checks: basedpyright flagged
   `databricks_base.py:62: "get_secret_value" is not a known attribute of "None"`.
   The conditional expression `token.get_secret_value() if hasattr(...)
   else str(token)` could not be narrowed by basedpyright — it could not
   infer that `hasattr` implies `token is not None`.

2. py-checkstyle: `ruff format` wanted to inline the same multi-line
   conditional, but inlining it would exceed the line-length limit and
   bounce against ruff again.

Restructure the resolution into explicit branches:

    if token is None:
        token_value = ""
    elif hasattr(token, "get_secret_value"):
        token_value = token.get_secret_value()
    else:
        token_value = str(token)

This narrows `token` cleanly for basedpyright in the elif branch, and
ruff is satisfied with the natural block layout.

Also tighten `test_token_nested_under_authtype`: replace the
`assert "..." in result` with `assert isinstance(result, str)` followed
by an exact-equality check. The `in` operator on a `str | dict | None`
union was a basedpyright operator-issue, and the equality check below
already implies the substring assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gitar-bot Bot commented May 7, 2026

Code Review ✅ Approved: 1 resolved / 1 finding

Implements comprehensive Databricks E2E CLI tests and fixes a critical bug in PAT extraction that caused silent OAuth hangs. Resolves the empty-string token authentication issue; no other issues found.

✅ 1 resolved
Bug: Empty-string token bypasses ValueError, causes silent OAuth hang

📄 ingestion/src/metadata/ingestion/source/database/common/data_diff/databricks_base.py:54-62
The _extract_pat_token method only checks if token is None before returning. If the token attribute exists but is an empty string (or a SecretStr wrapping an empty string), get_secret_value() / str(token) will return "", producing the exact empty-token URL the docstring warns about — silently falling back to OAuth U2M and hanging non-interactive runs.

This can happen when the YAML has token: "" or when the env var $E2E_DATABRICKS_TOKEN is set but empty.



sonarqubecloud Bot commented May 7, 2026


github-actions Bot commented May 7, 2026

🟡 Playwright Results — all passed (14 flaky)

✅ 4014 passed · ❌ 0 failed · 🟡 14 flaky · ⏭️ 86 skipped

Shard | Passed | Failed | Flaky | Skipped
✅ Shard 1 | 299 | 0 | 0 | 4
🟡 Shard 2 | 750 | 0 | 5 | 8
🟡 Shard 3 | 755 | 0 | 4 | 7
🟡 Shard 4 | 787 | 0 | 3 | 18
✅ Shard 5 | 687 | 0 | 0 | 41
🟡 Shard 6 | 736 | 0 | 2 | 8
🟡 14 flaky test(s) (passed on retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/BulkImport.spec.ts › Keyboard Delete selection (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
  • Features/IncidentManager.spec.ts › Complete Incident lifecycle with table owner (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/AddRoleAndAssignToUser.spec.ts › Verify assigned role to new user (shard 3, 1 retry)
  • Flow/ObservabilityAlerts.spec.ts › Table alert (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Set table-cp custom property on column and verify in UI (shard 4, 1 retry)
  • Pages/DataContractInheritance.spec.ts › Delete Asset Contract - Falls back to showing inherited contract from Data Product (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

