Ensure DuckLake metadata indexes on startup #330
Merged
fuziontech merged 3 commits into main, Mar 19, 2026
Conversation
DuckLake does not create indexes on its PostgreSQL metadata tables, which causes severe performance issues at scale. DuckDB's postgres scanner uses COPY with ctid range batches and pushes down filters, but without indexes PostgreSQL must do sequential scans within each batch.

On first DuckLake attachment, connect directly to the PostgreSQL metadata store and create indexes on the most critical tables:

- ducklake_file_column_stats (table_id, column_id) — the biggest win; this table can grow to millions of rows and is queried on every DuckLake query for filter pushdown
- ducklake_tag, ducklake_column_tag — correlated subqueries in catalog loading
- ducklake_table, ducklake_column — base JOIN for catalog loading
- ducklake_data_file, ducklake_delete_file — file list queries
- ducklake_table_stats, ducklake_table_column_stats — stats queries

Uses sync.Once to run at most once per process. Non-fatal on failure.

See: duckdb/ducklake#859

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use atomic.Bool + mutex instead of sync.Once so transient failures (network blip, DNS issue) can be retried on subsequent connections
- Run in a goroutine so it doesn't block the DuckLake semaphore or delay connection setup
- Increase timeout from 30s to 5min for first-run CREATE INDEX on large tables (e.g., 1.2 GB ducklake_file_column_stats)
- Guard non-PostgreSQL metadata stores (skip if no "postgres:" prefix)
- Only mark done when all indexes succeed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
On first DuckLake attachment, connect directly to the PostgreSQL metadata store and create indexes that dramatically improve query planning performance. Without these indexes, DuckLake queries have a fixed ~1-2s overhead due to sequential scans on large metadata tables.
Context
DuckLake does not create indexes on its PostgreSQL metadata tables. DuckDB's postgres scanner uses COPY with `ctid` range batches and does push down `table_id`/`column_id` filters, but without indexes PostgreSQL must sequentially scan within each batch. For a catalog with ~60K data files, `ducklake_file_column_stats` grows to 6.7M rows (1.2 GB), causing every query to spend ~2s scanning this table. With a `(table_id, column_id)` index, the same queries drop from ~2s to ~1.25s. The remaining overhead is from the ctid batching pattern itself (tracked upstream: duckdb/ducklake#859).

Changes
- New `ensureDuckLakeMetadataIndexes()` function that connects directly to the PostgreSQL metadata store via `pgx` and creates `IF NOT EXISTS` indexes on all DuckLake metadata tables
- Called from `AttachDuckLake()` after successful attachment
- Uses `sync.Once` to run at most once per process
- Adds the `pgx/stdlib` driver import for direct PostgreSQL connections

Indexes created
- `ducklake_file_column_stats (table_id, column_id)`
- `ducklake_tag (object_id, begin_snapshot, end_snapshot)`
- `ducklake_column_tag (table_id, column_id, begin_snapshot, end_snapshot)`
- `ducklake_table (begin_snapshot, end_snapshot)`
- `ducklake_column (table_id, begin_snapshot, end_snapshot, column_order)`
- `ducklake_data_file (table_id, begin_snapshot, end_snapshot)`
- `ducklake_delete_file (table_id, begin_snapshot, end_snapshot)`
- `ducklake_table_stats (table_id)`
- `ducklake_table_column_stats (table_id)`

Test plan
- … (`slog` output)
- … `ducklake_file_column_stats`

🤖 Generated with Claude Code