Iceberg: 3-second cold-cache delay walking manifest files on swarm clusters #1740

Summary

Cold-cache Iceberg queries on swarm clusters spend multiple seconds walking the manifest list serially before any data work begins. Swarm replicas sit idle the entire time, waiting on StorageClusterTaskDistributor while the initiator fetches and parses each manifest file one at a time.

Reported on Alexander Zaitsev's test server. Reproduced on Antalya 25.8.

Reproduction

  • Any Iceberg table with N ≥ 20 manifest files.
  • Cold IcebergMetadataFilesCache (server start, snapshot rotation, or eviction).
  • object_storage_cluster = 'swarm'.
  • A query with predicates that prune all data files.

SELECT qname, count(*)
FROM ice.<table>
WHERE eventDate BETWEEN '<start>' AND '<end>'
  AND <partition_predicate>
GROUP BY qname
ORDER BY count() DESC
LIMIT 5000
SETTINGS object_storage_cluster = 'swarm', max_threads = 16

Observed behavior

Trimmed log excerpt with send_logs_level='test':

T+0.000s  executeQuery: query received on initiator
T+0.146s  IcebergMetadata: read metadata.json
T+0.187s  read snap-*.avro manifest list
T+0.227s  dispatched sub-queries to 4 swarm replicas
T+0.339s  Aggregator: Adjusting memory limit ... [last log line on initiator]

  ↑ 3.29-second silence ↓

T+3.628s  StorageClusterTaskDistributor: Iterator is exhausted
T+3.661s  query completes

The 3.29 s gap accounts for nearly the entire wall clock of the query, and all of it is serial manifest fetching.

Root cause

SingleThreadIcebergKeysIterator::next() (in src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergIterator.cpp) calls Iceberg::getManifestFile() once per manifest_list_entry, single-threaded, on the producer thread. Each call is one S3 GET + Avro parse — typically 50–100 ms RTT cold. With N manifests the cost is N × RTT.
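
As a rough sanity check, assuming for illustration ~40 manifests at ~80 ms each (the report only pins N ≥ 20 and 50–100 ms RTT):

  T_cold ≈ N × RTT ≈ 40 × 80 ms ≈ 3.2 s

which lines up with the 3.29 s gap in the log.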

IcebergIterator::IcebergIterator() also drains deletes_iterator synchronously on the constructor thread, hitting the same bottleneck for delete manifests. Both serial patterns are sketched below.
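
A minimal sketch of the serial shape, with simplified, hypothetical stand-in types (the real iterator in IcebergIterator.cpp is more involved):

// Illustrative only: simplified stand-ins, not the actual ClickHouse classes.
#include <chrono>
#include <optional>
#include <string>
#include <thread>
#include <vector>

struct ManifestFile {};  // stands in for a parsed Avro manifest

// Stand-in for one S3 GET + Avro parse: ~50-100 ms when cold.
static ManifestFile fetchAndParseManifest(const std::string & /*path*/)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(80));
    return {};
}

// The problem shape: next() resolves one manifest per call on the producer
// thread, so N manifests cost N * RTT before a single data-file key reaches
// the swarm replicas. The constructor-side drain of the deletes iterator
// has the same one-at-a-time structure.
class SerialKeysIterator
{
public:
    explicit SerialKeysIterator(std::vector<std::string> paths_) : paths(std::move(paths_)) {}

    std::optional<ManifestFile> next()
    {
        if (pos == paths.size())
            return std::nullopt;
        return fetchAndParseManifest(paths[pos++]);  // blocking GET + parse
    }

private:
    std::vector<std::string> paths;
    size_t pos = 0;
};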

IcebergMetadataFilesCache hides this on warm runs, but every first query after a cache miss pays the full cost.

The pathology is amplified on swarm because the replicas sit idle while the initiator walks metadata serially: at max_threads = 16 across 4 nodes, that is 64 cores' worth of wasted capacity.

Confirmed by:

  • Same query on warm cache completes in <100 ms.
  • system.iceberg_metadata_log for the slow query shows N ManifestFileMetadata entries — one per serial fetch.
  • No S3 throttling, no aggregation work, no other load on the cluster.

Fix

Upstream PR submitted: <link to your ClickHouse/ClickHouse PR>

Parallelizes manifest fetching via the IO thread pool, gated by a new setting, iceberg_metadata_files_parallel_loading_threads (default 8, range 1–64). Setting it to 1 reproduces the serial behavior.
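
The shape of the change, sketched with std::async standing in for ClickHouse's IO thread pool (illustrative code, not the PR's; only the setting name comes from the PR):

// Illustrative sketch of parallel manifest loading -- not the PR's actual code.
#include <algorithm>
#include <atomic>
#include <future>
#include <string>
#include <vector>

struct ManifestFile {};

// Stub: in reality one S3 GET + Avro parse.
static ManifestFile fetchAndParseManifest(const std::string & /*path*/) { return {}; }

static std::vector<ManifestFile> fetchManifestsParallel(
    const std::vector<std::string> & paths,
    size_t threads)  // mirrors iceberg_metadata_files_parallel_loading_threads
{
    std::vector<ManifestFile> result(paths.size());
    std::atomic<size_t> next{0};

    const size_t workers = std::min<size_t>(threads, paths.size());
    std::vector<std::future<void>> futures;
    futures.reserve(workers);

    for (size_t w = 0; w < workers; ++w)
        futures.push_back(std::async(std::launch::async, [&]
        {
            // Each worker claims the next unfetched index; threads = 1
            // degenerates to the old serial walk.
            for (size_t i = next.fetch_add(1); i < paths.size(); i = next.fetch_add(1))
                result[i] = fetchAndParseManifest(paths[i]);
        }));

    for (auto & f : futures)
        f.get();  // rethrow any worker exception on the caller thread

    return result;
}

Wall clock then drops from N × RTT to roughly ⌈N / threads⌉ × RTT, so at the default of 8 threads the 3.29 s gap above would shrink to roughly 0.4 s, assuming uniform latency.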

Will be cherry-picked into Antalya 25.8 — separate PR to follow.

Affects

  • Antalya 25.8 (confirmed)
  • Likely all Antalya versions; not Antalya-specific (also present in upstream master)

Workaround

  • Keep use_iceberg_metadata_files_cache = 1 (the default).
  • Run a warm-up query after server start so the cache is populated before user traffic.
  • Reduce manifest count via writer-side compaction (REWRITE_MANIFESTS in Spark/PyIceberg).
