
Fixes #26299 make compressed dag ingestion available #27984

Open
varun-lakhyani wants to merge 3 commits into open-metadata:main from varun-lakhyani:airflow-compression


@varun-lakhyani varun-lakhyani commented May 8, 2026

Describe your changes:

Fixes #26299
The issue appears to be that the DAGs are stored compressed, which was not supported. If that is the case, they will now be ingested.

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

High-level design:

N/A — small change.

Tests:

Use cases covered

Unit tests

Backend integration tests

Ingestion integration tests

Playwright (UI) tests

Manual testing performed

UI screen recording / screenshots:

Not applicable.

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • My PR is linked to a GitHub issue via Fixes #<issue-number> above.
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
  • For UI changes: I attached a screen recording and/or screenshots above.
  • I have added tests (unit / integration / Playwright as applicable) and listed them above.

Summary by Gitar

  • Test suite updates:
    • Replaced the specific Airflow 3.x test case with a more generic test_compressed_dag_falls_back_to_dag_id_query to validate fallback logic when data is NULL.
    • Streamlined test_airflow2_returns_tasks_when_data_is_valid and test_airflow2_returns_none_when_table_empty by removing unnecessary IS_AIRFLOW_3 mock patches.

This will update automatically on new commits.

@varun-lakhyani varun-lakhyani requested a review from a team as a code owner May 8, 2026 08:40
@varun-lakhyani varun-lakhyani added the safe to test Add this label to run secure Github workflows on PRs label May 8, 2026
Comment thread ingestion/src/metadata/ingestion/source/pipeline/airflow/metadata.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/pipeline/airflow/metadata.py Outdated

gitar-bot Bot commented May 8, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Enables compressed DAG ingestion by implementing zlib decompression in the Airflow source. Both findings, the N+1 query and the potential decompression failure on non-zlib data, are resolved.

✅ 2 resolved
Performance: N+1 query: extra DB round-trip per compressed DAG

📄 ingestion/src/metadata/ingestion/source/pipeline/airflow/metadata.py:477-491 📄 ingestion/src/metadata/ingestion/source/pipeline/airflow/metadata.py:603
For every DAG whose _data column is NULL (i.e., every DAG when COMPRESS_SERIALIZED_DAGS is enabled), _resolve_dag_data issues an additional SELECT _data_compressed … query. In environments with hundreds or thousands of DAGs this means hundreds of extra round-trips to the metadata DB.

A more efficient approach would be to include _data_compressed in the original batch query (alongside dag_id, json_data_column, fileloc) so both the uncompressed and compressed columns are fetched in a single pass, and _resolve_dag_data can simply decompress in-memory without hitting the DB again.
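The single-pass approach suggested above could look roughly like the sketch below. It assumes the batch query already returns both columns for each row and that compressed values are zlib-compressed UTF-8 JSON; the function name `resolve_dag_data` and the row layout are hypothetical and do not come from the PR.

```python
import json
import zlib
from typing import Any, Optional


def resolve_dag_data(data: Optional[Any], data_compressed: Optional[bytes]) -> Optional[dict]:
    """Return the serialized DAG as a dict without any extra DB round-trip.

    Hypothetical helper: prefers the uncompressed column and falls back to
    decompressing the compressed column in memory.
    """
    if data is not None:
        # Some drivers return JSON columns as text, others as parsed dicts.
        return json.loads(data) if isinstance(data, (str, bytes)) else data
    if data_compressed is not None:
        # Assumption: the blob is zlib-compressed UTF-8 JSON.
        return json.loads(zlib.decompress(data_compressed))
    return None


# Rows shaped as one batch query might return them:
# (dag_id, data, data_compressed). The values are made up for illustration.
rows = [
    ("plain_dag", {"dag": {"tasks": []}}, None),
    ("compressed_dag", None, zlib.compress(json.dumps({"dag": {"tasks": ["t1"]}}).encode("utf-8"))),
    ("empty_dag", None, None),
]

resolved = {dag_id: resolve_dag_data(data, blob) for dag_id, data, blob in rows}
```

Because every row carries both columns, the number of queries stays constant regardless of how many DAGs are compressed.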

Edge Case: zlib.decompress may raise on non-zlib data (e.g. zstd)

📄 ingestion/src/metadata/ingestion/source/pipeline/airflow/metadata.py:491
Airflow's COMPRESS_SERIALIZED_DAGS option may use different compression algorithms depending on version/config (zlib is common, but zstd is also possible in newer Airflow). zlib.decompress will raise zlib.error on data compressed with a different algorithm. While the outer except Exception block (line 628) will catch this and log a warning, the error message won't clearly indicate a compression-format mismatch, making debugging harder.

Consider wrapping the decompression with a more informative error message, or detecting the compression format (e.g., checking magic bytes).
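One way to act on this suggestion is sketched below: inspect the leading bytes before decompressing and raise an error that names the DAG and the detected format. The helper names (`guess_format`, `decompress_dag_blob`) are hypothetical; the magic numbers are the published zstd frame and gzip prefixes, and the zlib check follows the two-byte header rule from RFC 1950.

```python
import zlib

ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # zstd frame magic number
GZIP_MAGIC = b"\x1f\x8b"          # gzip member header


def guess_format(blob: bytes) -> str:
    """Best-effort guess of the compression format from the leading bytes."""
    if blob.startswith(ZSTD_MAGIC):
        return "zstd"
    if blob.startswith(GZIP_MAGIC):
        return "gzip"
    if len(blob) >= 2:
        cmf, flg = blob[0], blob[1]
        # zlib header: compression method 8 (deflate) and a checksum
        # such that the 16-bit header is a multiple of 31.
        if (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0:
            return "zlib"
    return "unknown"


def decompress_dag_blob(dag_id: str, blob: bytes) -> bytes:
    """Decompress a zlib blob, failing with a message that names the cause."""
    fmt = guess_format(blob)
    if fmt != "zlib":
        raise ValueError(
            f"DAG {dag_id!r}: expected zlib-compressed data but found {fmt} format"
        )
    try:
        return zlib.decompress(blob)
    except zlib.error as exc:
        raise ValueError(f"DAG {dag_id!r}: zlib decompression failed: {exc}") from exc
```

With this shape, a format mismatch surfaces in the warning log as a compression-format problem rather than as a bare `zlib.error`.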



sonarqubecloud Bot commented May 8, 2026


github-actions Bot commented May 8, 2026

🟡 Playwright Results — all passed (17 flaky)

✅ 3981 passed · ❌ 0 failed · 🟡 17 flaky · ⏭️ 86 skipped

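| Shard | Passed | Failed | Flaky | Skipped |
|---|---|---|---|---|
| 🟡 Shard 1 | 298 | 0 | 1 | 4 |
| 🟡 Shard 2 | 749 | 0 | 5 | 8 |
| 🟡 Shard 3 | 743 | 0 | 3 | 7 |
| ✅ Shard 4 | 775 | 0 | 0 | 18 |
| ✅ Shard 5 | 687 | 0 | 0 | 41 |
| 🟡 Shard 6 | 729 | 0 | 8 | 8 |
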
🟡 17 flaky test(s) (passed on retry)
  • Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1, 2 retries)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/ColumnBulkOperations.spec.ts › should filter by metadata status and verify API param (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 2 retries)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Workflows/WorkflowOssRestrictions.spec.ts › delete-node-button absent in node config sidebar (structural edit blocked) (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/GlossaryImportExport.spec.ts › Glossary CSV import preserves typed relations (shard 6, 1 retry)
  • Pages/InputOutputPorts.spec.ts › Cancel port removal (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Create team with domain and verify visibility of inherited domain in user profile after team removal (shard 6, 1 retry)
  • Pages/Users.spec.ts › Create and Delete user (shard 6, 1 retry)
  • Pages/Users.spec.ts › User Performance across different entities pages (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace



Development

Successfully merging this pull request may close these issues.

Support Airflow 3.1.7 metadata ingestion OMD 1.12

1 participant