
Ensure DuckLake metadata indexes on startup #330

Merged
fuziontech merged 3 commits into main from james/ducklake-metadata-indexes
Mar 19, 2026
Conversation

@fuziontech
Member

Summary

On first DuckLake attachment, connect directly to the PostgreSQL metadata store and create indexes that dramatically improve query planning performance. Without these indexes, DuckLake queries have a fixed ~1-2s overhead due to sequential scans on large metadata tables.

Context

DuckLake does not create indexes on its PostgreSQL metadata tables. DuckDB's postgres scanner uses COPY with ctid range batches and does push down table_id/column_id filters, but without indexes PostgreSQL must sequentially scan within each batch. For a catalog with ~60K data files, ducklake_file_column_stats grows to 6.7M rows (1.2 GB), causing every query to spend ~2s scanning this table.

With a (table_id, column_id) index, the same queries drop from ~2s to ~1.25s. The remaining overhead is from the ctid batching pattern itself (tracked upstream: duckdb/ducklake#859).

Changes

  • Added ensureDuckLakeMetadataIndexes() function that connects directly to the PostgreSQL metadata store via pgx and creates IF NOT EXISTS indexes on all DuckLake metadata tables
  • Called from AttachDuckLake() after successful attachment
  • Uses an atomic done flag guarded by a mutex (rather than sync.Once) so the work runs at most once per process, but transient failures can be retried on a later connection
  • Non-fatal: logs warnings on failure but doesn't break startup
  • Added pgx/stdlib driver import for direct PostgreSQL connections

Indexes created

| Table | Index columns | Why |
| --- | --- | --- |
| ducklake_file_column_stats | (table_id, column_id) | Queried on every DuckLake query for filter pushdown; often millions of rows |
| ducklake_tag | (object_id, begin_snapshot, end_snapshot) | Correlated subqueries in catalog loading |
| ducklake_column_tag | (table_id, column_id, begin_snapshot, end_snapshot) | Correlated subqueries in catalog loading |
| ducklake_table | (begin_snapshot, end_snapshot) | Snapshot filtering in catalog loading |
| ducklake_column | (table_id, begin_snapshot, end_snapshot, column_order) | JOIN + ordering in catalog loading |
| ducklake_data_file | (table_id, begin_snapshot, end_snapshot) | File list queries |
| ducklake_delete_file | (table_id, begin_snapshot, end_snapshot) | Delete file queries |
| ducklake_table_stats | (table_id) | Stats queries |
| ducklake_table_column_stats | (table_id) | Stats queries |
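The DDL implied by this table can be generated with a small helper like the one below. The idx_* index names are an assumption for illustration; the PR may name them differently.

```go
package main

import "fmt"

// duckLakeIndexDDL builds CREATE INDEX IF NOT EXISTS statements for the
// DuckLake metadata tables listed above. Index names are illustrative.
func duckLakeIndexDDL() []string {
	specs := [][2]string{
		{"ducklake_file_column_stats", "table_id, column_id"},
		{"ducklake_tag", "object_id, begin_snapshot, end_snapshot"},
		{"ducklake_column_tag", "table_id, column_id, begin_snapshot, end_snapshot"},
		{"ducklake_table", "begin_snapshot, end_snapshot"},
		{"ducklake_column", "table_id, begin_snapshot, end_snapshot, column_order"},
		{"ducklake_data_file", "table_id, begin_snapshot, end_snapshot"},
		{"ducklake_delete_file", "table_id, begin_snapshot, end_snapshot"},
		{"ducklake_table_stats", "table_id"},
		{"ducklake_table_column_stats", "table_id"},
	}
	ddl := make([]string, 0, len(specs))
	for _, s := range specs {
		ddl = append(ddl, fmt.Sprintf(
			"CREATE INDEX IF NOT EXISTS idx_%s ON %s (%s)", s[0], s[0], s[1]))
	}
	return ddl
}

func main() {
	for _, stmt := range duckLakeIndexDDL() {
		fmt.Println(stmt)
	}
}
```

IF NOT EXISTS makes the statements idempotent, which is what lets the function run unconditionally on every first attachment without tracking prior state.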

Test plan

  • Deploy to a duckgres instance with DuckLake PostgreSQL metadata store
  • Verify indexes are created on first startup (check slog output)
  • Verify no errors on subsequent restarts (IF NOT EXISTS handles idempotency)
  • Verify DuckLake queries still work normally
  • Benchmark: query latency should drop from ~2s to ~1.25s on catalogs with large ducklake_file_column_stats

🤖 Generated with Claude Code

fuziontech and others added 3 commits March 19, 2026 11:20
DuckLake does not create indexes on its PostgreSQL metadata tables,
which causes severe performance issues at scale. DuckDB's postgres
scanner uses COPY with ctid range batches and pushes down filters, but
without indexes PostgreSQL must do sequential scans within each batch.

On first DuckLake attachment, connect directly to the PostgreSQL
metadata store and create indexes on the most critical tables:

- ducklake_file_column_stats (table_id, column_id) — the biggest win;
  this table can grow to millions of rows and is queried on every
  DuckLake query for filter pushdown
- ducklake_tag, ducklake_column_tag — correlated subqueries in catalog
  loading
- ducklake_table, ducklake_column — base JOIN for catalog loading
- ducklake_data_file, ducklake_delete_file — file list queries
- ducklake_table_stats, ducklake_table_column_stats — stats queries

Uses sync.Once to run at most once per process. Non-fatal on failure.

See: duckdb/ducklake#859

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use atomic.Bool + mutex instead of sync.Once so transient failures
  (network blip, DNS issue) can be retried on subsequent connections
- Run in a goroutine so it doesn't block the DuckLake semaphore or
  delay connection setup
- Increase timeout from 30s to 5min for first-run CREATE INDEX on
  large tables (e.g., 1.2 GB ducklake_file_column_stats)
- Guard non-PostgreSQL metadata stores (skip if no "postgres:" prefix)
- Only mark done when all indexes succeed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@fuziontech fuziontech enabled auto-merge (squash) March 19, 2026 18:32
@fuziontech fuziontech merged commit f1d2a3a into main Mar 19, 2026
17 checks passed
@fuziontech fuziontech deleted the james/ducklake-metadata-indexes branch March 19, 2026 18:33
