
feat(ingestion): Databricks E2E CLI tests + tableDiff PAT extraction fix #27963

Open
ulixius9 wants to merge 4 commits into main from feat/databricks-e2e-tests

Conversation


@ulixius9 ulixius9 commented May 7, 2026

Summary

  • Add a comprehensive Databricks E2E CLI test suite (ingestion/tests/cli_e2e/test_cli_databricks.py) mirroring the Snowflake one — vanilla ingestion, profiler, auto-classification, lineage, schema/table filters, mark-deleted, data quality, complex column types (STRUCT/ARRAY/MAP), external table lineage, plus a combined Databricks-features test (table tag, column tag, FK informational constraint, liquid clustering, view, materialized view).
  • Fix a BaseTableParameter._get_service_connection_config lookup bug in databricks_base.py that was reading connection.token (flat) when the schema actually nests it under connection.authType.token. The empty-token URL silently fell back to OAuth U2M browser auth in databricks-sql-connector, hanging non-interactive tableDiff runs. The fix reads from authType.token first, falls back to the flat path for backwards compat, and raises a clear ValueError rather than producing a hang-on-browser URL when no PAT is found.
  • Relax the racy sink-record-count assertion in common/test_cli_db.py (used by every connector E2E test). The metadata-rest sink batches PUTs and final-flushes on workflow close, but Status is logged before that flush — so sink_status.records understates the true count. Replaced with source records + zero sink failures + (in the Databricks override) an end-to-end retrieve_table API check that proves data actually landed. Should also un-flake other connectors that hit the same assertion.
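The nested-lookup fix described in the second bullet can be sketched as follows (a minimal sketch: `extract_pat_token` and the attribute walk are illustrative, not the PR's exact code):

```python
def extract_pat_token(connection) -> str:
    """Prefer the nested connection.authType.token, fall back to the
    legacy flat connection.token, and fail fast when neither has a PAT."""
    auth_type = getattr(connection, "authType", None)
    token = getattr(auth_type, "token", None) if auth_type is not None else None
    if token is None:
        # Backwards compatibility: older configs kept the PAT flat on the connection.
        token = getattr(connection, "token", None)
    if not token:
        # An empty token would build `databricks://:@host/...`, which
        # databricks-sql-connector answers with OAuth U2M browser auth,
        # hanging non-interactive tableDiff runs.
        raise ValueError("No Databricks PAT found under authType.token or token")
    return token
```

Raising here trades a silent hang for an immediate, diagnosable failure in CI.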

Files changed

File | What
ingestion/src/metadata/ingestion/source/database/common/data_diff/databricks_base.py | Fix nested authType.token lookup; raise on missing PAT instead of silent OAuth fallback
ingestion/tests/cli_e2e/common/test_cli_db.py | Drop racy sink count assertion; document the batching reason
ingestion/tests/unit/topology/database/test_databricks_migration.py | New unit tests for nested-authType lookup + missing-token error
ingestion/tests/cli_e2e/test_cli_databricks.py | New: 16-test Databricks E2E suite
ingestion/tests/cli_e2e/database/databricks/databricks.yaml | New: connector config (env-var driven)

Test plan

  • pytest ingestion/tests/unit/topology/database/test_databricks_migration.py — all 8 unit tests pass (3 new + 5 existing)
  • pytest ingestion/tests/cli_e2e/test_cli_databricks.py::DatabricksCliTest::test_vanilla_ingestion — passes against a real Databricks Unity Catalog workspace
  • pytest ingestion/tests/cli_e2e/test_cli_databricks.py::DatabricksCliTest::test_data_quality — XPASS after the databricks_base.py fix (previously hung opening a browser)
  • pytest ingestion/tests/cli_e2e/test_cli_databricks.py::DatabricksCliTest::test_complex_column_types — passes
  • pytest ingestion/tests/cli_e2e/test_cli_databricks.py::DatabricksCliTest::test_databricks_features_ingestion — passes (FK assertion soft-warns due to a separate Databricks-dialect FK-extraction gap, documented inline)
  • pytest --collect-only lists all 16 tests
  • nox --no-venv -s static-checks — no new errors above baseline (verified by stashing changes; baseline output is byte-identical)

Pre-reqs to run the E2E suite locally

export E2E_DATABRICKS_HOST_PORT=<workspace>.cloud.databricks.com:443
export E2E_DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/<id>
export E2E_DATABRICKS_TOKEN=<PAT>
export E2E_DATABRICKS_CATALOG=e2e_db   # or any UC catalog the principal owns
pip install pytest-order pytest-timeout
pytest ingestion/tests/cli_e2e/test_cli_databricks.py --timeout=900

Catalog must already exist in UC (CREATE CATALOG requires metastore-admin). Schemas and tables are created/torn down by setUp/tearDownClass.

Known gaps (out of scope for this PR)

  1. FK informational constraints — Databricks SQLAlchemy dialect doesn't extract them from information_schema.referential_constraints. The features test's FK assertion soft-warns. Worth a follow-up to add get_foreign_keys on DatabricksDialect.
  2. External-table lineage test — skips unless E2E_DATABRICKS_EXTERNAL_LOCATION is set to a UC external location the principal can write to.
  3. Materialized view branch — only asserts when the warehouse supports MV (DLT/serverless); silently skipped otherwise.
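For the first gap, a follow-up `get_foreign_keys` on the dialect would roughly query the two standard information_schema views. The SQL below is a hypothetical sketch: view and column names follow ANSI information_schema conventions and are assumptions here, not a verified Databricks dialect implementation.

```python
def build_foreign_keys_query(schema: str, table: str) -> str:
    """Hypothetical query a DatabricksDialect.get_foreign_keys override
    could issue; identifiers are assumptions, not verified UC schema."""
    return f"""
        SELECT
            rc.constraint_name,
            kcu.column_name,
            rc.unique_constraint_name
        FROM information_schema.referential_constraints AS rc
        JOIN information_schema.key_column_usage AS kcu
          ON kcu.constraint_name = rc.constraint_name
        WHERE kcu.table_schema = '{schema}'
          AND kcu.table_name = '{table}'
    """
```

A real implementation would bind parameters through SQLAlchemy rather than interpolating strings.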

🤖 Generated with Claude Code


Summary by Gitar

  • Improved test connection:
    • Replaced the simple catalog lookup with select_test_catalog to automatically skip __databricks_internal and federated (FOREIGN) catalogs.
    • This prevents connection probe failures caused by pushing down information_schema queries to external sources with stale credentials.
    • Added unit tests in test_unity_catalog_connection.py covering manual catalog overrides and filtering logic for various catalog types.
  • Refined PAT extraction:
    • Updated _extract_pat_token to explicitly handle None and improve string conversion logic for Databricks PATs.
    • Added stricter validation to ensure empty tokens raise a ValueError rather than triggering silent OAuth fallback.


…AT fix

Add a Databricks E2E test suite mirroring the Snowflake one (vanilla
ingestion, profiler, auto-classification, lineage, filters, mark-deleted,
data quality, complex types, external table lineage, plus a combined
features test for tags, FK, clustering, view, and materialized view).

Fix a pre-existing bug in BaseTableParameter._get_service_connection_config
where the token lookup walked `connection.token` (flat) but the schema
has it nested under `connection.authType.token`. The empty-token URL
silently triggered OAuth U2M browser fallback in databricks-sql-connector,
hanging non-interactive runs. The fix reads from authType.token first,
falls back to the legacy flat path for backwards compatibility, and
raises a clear ValueError when no PAT is found rather than producing a
URL that hangs on a browser prompt.

Relax the racy sink-record-count assertion in
common/test_cli_db.py: the metadata-rest sink batches PUTs and
final-flushes on workflow close — the Status object is logged before
that flush so sink_status.records understates the true count. Replaced
with source records + zero sink failures + (in the Databricks override)
an end-to-end retrieve_table API check that proves data actually landed.
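A rough shape of the relaxed check (the status field names mirror the description above but are assumptions about the actual test helpers):

```python
from dataclasses import dataclass, field


@dataclass
class StepStatus:
    records: list = field(default_factory=list)
    failures: list = field(default_factory=list)


def assert_ingestion_ok(source_status: StepStatus, sink_status: StepStatus) -> None:
    # sink_status.records is logged before the metadata-rest sink's final
    # flush, so it can understate the true count; assert on what is
    # reliable instead of comparing source count == sink count.
    assert len(source_status.records) > 0, "source read no records"
    assert not sink_status.failures, f"sink reported failures: {sink_status.failures}"
    # Connector overrides can additionally hit the retrieve-table API to
    # prove the entities actually landed end-to-end.
```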

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 7, 2026 10:56
@ulixius9 ulixius9 requested a review from a team as a code owner May 7, 2026 10:56
@github-actions github-actions Bot added the `Ingestion` and `safe to test` (run secure GitHub workflows on PRs) labels May 7, 2026
ulixius9 and others added 2 commits May 7, 2026 16:29
The test-connection step previously walked `catalogs.list()` and picked
the first catalog (excluding `__databricks_internal`). When that first
catalog was a foreign / federated catalog (Lakehouse Federation against
an external Postgres / MySQL / etc.), the `information_schema.*_tags`
queries get pushed down to the source DB and fail on stale or absent
credentials — causing a healthy Unity Catalog workspace to report a
broken test connection.

Extract the catalog selection into `select_test_catalog`:
- Honor `configured_catalog` from the service config when set (the user
  has explicitly chosen the catalog and may have foreign creds wired up).
- Otherwise skip both `__databricks_internal` and any catalog whose
  `catalog_type` contains `FOREIGN`, picking the first remaining catalog.

Add unit tests covering: configured-catalog override, foreign-catalog
honoring when explicitly pinned, foreign-catalog skipping when auto-
selecting, and the no-internal-catalog fallback.
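The selection rule above can be sketched as (a minimal sketch; catalog objects are assumed to expose `.name` and `.catalog_type` like the SDK's catalog-info objects):

```python
from typing import Optional


def select_test_catalog(catalogs, configured_catalog: Optional[str] = None) -> Optional[str]:
    """Pick a catalog for the connection probe: honor an explicit choice,
    otherwise skip internal and federated (FOREIGN) catalogs."""
    if configured_catalog:
        # Explicitly pinned by the user, who may have foreign creds wired up.
        return configured_catalog
    for catalog in catalogs:
        if catalog.name == "__databricks_internal":
            continue
        if "FOREIGN" in (catalog.catalog_type or ""):
            continue
        return catalog.name
    return None
```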

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… extraction

Code review surfaced that the previous guard only checked `if token is
None`. An empty-string token (set via `token: ""` in YAML or an empty
`$E2E_DATABRICKS_TOKEN` env var) would slip past the None check, produce
a `databricks://:@host/...` URL, and silently trigger OAuth U2M browser
fallback in `databricks-sql-connector` — exactly the failure mode the
docstring warned against.

Resolve the value first (handling SecretStr, None, and any object that
stringifies to ""), then validate. This catches:

- `token = None` (missing attribute)
- `token = ""` (empty plain string)
- `token = SecretStr("")` (empty SecretStr / Pydantic field)

with a single `if not token_value` check.

Adds two unit tests covering the empty-string and empty-SecretStr cases
to lock in the behavior.
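The three cases can be exercised with a minimal stand-in for Pydantic's SecretStr (a sketch for illustration; only `get_secret_value` is mimicked):

```python
class FakeSecretStr:
    """Minimal stand-in for pydantic.SecretStr, for illustration only."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        return self._value


def resolve_token(token) -> str:
    # Resolve to a bare string first, then validate with a single
    # falsiness check, so None, "", and SecretStr("") all fail alike.
    if token is None:
        token_value = ""
    elif hasattr(token, "get_secret_value"):
        token_value = token.get_secret_value()
    else:
        token_value = str(token)
    if not token_value:
        raise ValueError("Databricks PAT is missing or empty")
    return token_value
```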

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR strengthens Databricks ingestion and validation by adding a Databricks-focused CLI E2E suite, fixing PAT extraction for Databricks data-diff URL construction (to avoid OAuth browser fallback hangs), and reducing flakiness in connector E2E assertions by removing a racy metadata-rest sink record-count check.

Changes:

  • Add a comprehensive Databricks CLI E2E test suite plus Databricks connector YAML configuration.
  • Fix Databricks data-diff connection URL building to read the PAT from authType.token (with legacy fallback) and raise when missing.
  • Relax CLI E2E assertions by removing sink record-count checks that can under-report due to batched flush timing, and add stronger Databricks-specific end-to-end API verification.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

File | Description
ingestion/src/metadata/ingestion/source/database/common/data_diff/databricks_base.py | Fix PAT extraction path for Databricks data-diff URL generation and introduce explicit error on missing PAT.
ingestion/tests/cli_e2e/common/test_cli_db.py | Remove racy sink record-count assertions; document metadata-rest batching/flush timing.
ingestion/tests/unit/topology/database/test_databricks_migration.py | Add unit coverage for nested authType.token extraction and missing-token error behavior.
ingestion/tests/cli_e2e/test_cli_databricks.py | Introduce a Databricks CLI E2E suite covering ingestion, profiler, classification, lineage, data quality, and feature ingestion scenarios.
ingestion/tests/cli_e2e/database/databricks/databricks.yaml | Add Databricks E2E workflow config driven by environment variables (host/httpPath/token/catalog).

Comment on lines +56 to +62:

    # Resolve to the bare string before validating: an empty-string token
    # (e.g. `$E2E_DATABRICKS_TOKEN` set but empty, or `token: ""` in YAML)
    # would otherwise build a URL like `databricks://:@host/...` and the
    # SQL driver would fall back to OAuth U2M, opening a browser. Validate
    # the resolved value so we fail fast in non-interactive environments.
    token_value = (
        token.get_secret_value()

Comment on lines +128 to +129:

    def test_empty_string_token_raises(self):

github-actions Bot commented May 7, 2026

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

CI surfaced two checks failing on the previous commit:

1. Static checks: basedpyright flagged
   `databricks_base.py:62: "get_secret_value" is not a known attribute of "None"`.
   The conditional expression `token.get_secret_value() if hasattr(...)
   else str(token)` could not be narrowed by basedpyright — it could not
   infer that `hasattr` implies `token is not None`.

2. py-checkstyle: `ruff format` wanted to inline the same multi-line
   conditional, but inlining it would exceed the line-length limit and
   bounce against ruff again.

Restructure the resolution into explicit branches:

    if token is None:
        token_value = ""
    elif hasattr(token, "get_secret_value"):
        token_value = token.get_secret_value()
    else:
        token_value = str(token)

This narrows `token` cleanly for basedpyright in the elif branch, and
ruff is satisfied with the natural block layout.

Also tighten `test_token_nested_under_authtype`: replace the
`assert "..." in result` with `assert isinstance(result, str)` followed
by an exact-equality check. The `in` operator on a `str | dict | None`
union was a basedpyright operator-issue, and the equality check below
already implies the substring assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gitar-bot Bot commented May 7, 2026

Code Review ✅ Approved: 1 resolved / 1 finding

Implements comprehensive Databricks E2E CLI tests and fixes a critical bug in PAT extraction that caused silent OAuth hangs. Resolves the empty-string token authentication issue; no other issues found.

✅ 1 resolved
Bug: Empty-string token bypasses ValueError, causes silent OAuth hang

📄 ingestion/src/metadata/ingestion/source/database/common/data_diff/databricks_base.py:54-62
The _extract_pat_token method only checks if token is None before returning. If the token attribute exists but is an empty string (or a SecretStr wrapping an empty string), get_secret_value() / str(token) will return "", producing the exact empty-token URL the docstring warns about — silently falling back to OAuth U2M and hanging non-interactive runs.

This can happen when the YAML has token: "" or when the env var $E2E_DATABRICKS_TOKEN is set but empty.



sonarqubecloud Bot commented May 7, 2026


github-actions Bot commented May 7, 2026

🟡 Playwright Results — all passed (14 flaky)

✅ 4014 passed · ❌ 0 failed · 🟡 14 flaky · ⏭️ 86 skipped

Shard | Passed | Failed | Flaky | Skipped
✅ Shard 1 | 299 | 0 | 0 | 4
🟡 Shard 2 | 750 | 0 | 5 | 8
🟡 Shard 3 | 755 | 0 | 4 | 7
🟡 Shard 4 | 787 | 0 | 3 | 18
✅ Shard 5 | 687 | 0 | 0 | 41
🟡 Shard 6 | 736 | 0 | 2 | 8
🟡 14 flaky test(s) (passed on retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/BulkImport.spec.ts › Keyboard Delete selection (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
  • Features/IncidentManager.spec.ts › Complete Incident lifecycle with table owner (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/AddRoleAndAssignToUser.spec.ts › Verify assigned role to new user (shard 3, 1 retry)
  • Flow/ObservabilityAlerts.spec.ts › Table alert (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Set table-cp custom property on column and verify in UI (shard 4, 1 retry)
  • Pages/DataContractInheritance.spec.ts › Delete Asset Contract - Falls back to showing inherited contract from Data Product (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

