Iceberg: 3-second cold-cache delay walking manifest files on swarm clusters #1740

Summary

Cold-cache Iceberg queries on swarm clusters spend multiple seconds walking the manifest list serially before any data work begins. Swarm replicas sit idle the entire time, waiting on StorageClusterTaskDistributor while the initiator fetches and parses each manifest file one at a time.

Reported on Alexander Zaitsev's test server. Reproduced on Antalya 25.8.

Reproduction

  • Any Iceberg table with N ≥ 20 manifest files.
  • Cold IcebergMetadataFilesCache (server start, snapshot rotation, or eviction).
  • object_storage_cluster = 'swarm'.
  • A query with predicates that prune all data files.

SELECT qname, count(*)
FROM ice.<table>
WHERE eventDate BETWEEN '<start>' AND '<end>'
  AND <partition_predicate>
GROUP BY qname
ORDER BY count() DESC
LIMIT 5000
SETTINGS object_storage_cluster = 'swarm', max_threads = 16

Observed behavior

Trimmed log excerpt with send_logs_level='test':

T+0.000s  executeQuery: query received on initiator
T+0.146s  IcebergMetadata: read metadata.json
T+0.187s  read snap-*.avro manifest list
T+0.227s  dispatched sub-queries to 4 swarm replicas
T+0.339s  Aggregator: Adjusting memory limit ... [last log line on initiator]

  ↑ 3.29-second silence ↓

T+3.628s  StorageClusterTaskDistributor: Iterator is exhausted
T+3.661s  query completes

The 3.29 s gap accounts for nearly the entire wall clock of the query, and all of it is serial manifest fetching.

Root cause

SingleThreadIcebergKeysIterator::next() (in src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergIterator.cpp) calls Iceberg::getManifestFile() once per manifest_list_entry, single-threaded, on the producer thread. Each call is one S3 GET + Avro parse — typically 50–100 ms RTT cold. With N manifests the cost is N × RTT.
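
As a rough sanity check, assuming for illustration ~40 manifests at ~80 ms each (the report only pins N ≥ 20 and 50–100 ms RTT):

  T_cold ≈ N × RTT ≈ 40 × 80 ms ≈ 3.2 s

which lines up with the 3.29 s gap in the log.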

IcebergIterator::IcebergIterator() also drains deletes_iterator synchronously on the constructor thread, hitting the same bottleneck for delete manifests. Both serial patterns are sketched below.
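
A minimal sketch of the serial shape, with simplified, hypothetical stand-in types (the real iterator in IcebergIterator.cpp is more involved):

// Illustrative only: simplified stand-ins, not the actual ClickHouse classes.
#include <chrono>
#include <optional>
#include <string>
#include <thread>
#include <vector>

struct ManifestFile {};  // stands in for a parsed Avro manifest

// Stand-in for one S3 GET + Avro parse: ~50-100 ms when cold.
static ManifestFile fetchAndParseManifest(const std::string & /*path*/)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(80));
    return {};
}

// The problem shape: next() resolves one manifest per call on the producer
// thread, so N manifests cost N * RTT before a single data-file key reaches
// the swarm replicas. The constructor-side drain of the deletes iterator
// has the same one-at-a-time structure.
class SerialKeysIterator
{
public:
    explicit SerialKeysIterator(std::vector<std::string> paths_) : paths(std::move(paths_)) {}

    std::optional<ManifestFile> next()
    {
        if (pos == paths.size())
            return std::nullopt;
        return fetchAndParseManifest(paths[pos++]);  // blocking GET + parse
    }

private:
    std::vector<std::string> paths;
    size_t pos = 0;
};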

IcebergMetadataFilesCache hides this on warm runs, but every first query after a cache miss pays the full cost.

The pathology is amplified on swarm because the replicas sit idle while the initiator walks metadata serially: at max_threads = 16 across 4 nodes, that is 64 cores' worth of wasted capacity.

Confirmed by:

  • Same query on warm cache completes in <100 ms.
  • system.iceberg_metadata_log for the slow query shows N ManifestFileMetadata entries — one per serial fetch.
  • No S3 throttling, no aggregation work, no other load on the cluster.

Fix

Upstream PR submitted: <link to your ClickHouse/ClickHouse PR>

Parallelizes manifest fetching via the IO thread pool, gated by a new setting, iceberg_metadata_files_parallel_loading_threads (default 8, range 1–64). Setting it to 1 reproduces the serial behavior.
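
The shape of the change, sketched with std::async standing in for ClickHouse's IO thread pool (illustrative code, not the PR's; only the setting name comes from the PR):

// Illustrative sketch of parallel manifest loading -- not the PR's actual code.
#include <algorithm>
#include <atomic>
#include <future>
#include <string>
#include <vector>

struct ManifestFile {};

// Stub: in reality one S3 GET + Avro parse.
static ManifestFile fetchAndParseManifest(const std::string & /*path*/) { return {}; }

static std::vector<ManifestFile> fetchManifestsParallel(
    const std::vector<std::string> & paths,
    size_t threads)  // mirrors iceberg_metadata_files_parallel_loading_threads
{
    std::vector<ManifestFile> result(paths.size());
    std::atomic<size_t> next{0};

    const size_t workers = std::min<size_t>(threads, paths.size());
    std::vector<std::future<void>> futures;
    futures.reserve(workers);

    for (size_t w = 0; w < workers; ++w)
        futures.push_back(std::async(std::launch::async, [&]
        {
            // Each worker claims the next unfetched index; threads = 1
            // degenerates to the old serial walk.
            for (size_t i = next.fetch_add(1); i < paths.size(); i = next.fetch_add(1))
                result[i] = fetchAndParseManifest(paths[i]);
        }));

    for (auto & f : futures)
        f.get();  // rethrow any worker exception on the caller thread

    return result;
}

Wall clock then drops from N × RTT to roughly ⌈N / threads⌉ × RTT, so at the default of 8 threads the 3.29 s gap above would shrink to roughly 0.4 s, assuming uniform latency.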

Will be cherry-picked into Antalya 25.8 — separate PR to follow.

Affects

  • Antalya 25.8 (confirmed)
  • Likely all Antalya versions; not Antalya-specific (also present in upstream master)

Workaround

  • Keep use_iceberg_metadata_files_cache = 1 (the default).
  • Run a warm-up query after server start so the cache is populated before user traffic.
  • Reduce manifest count via writer-side compaction (REWRITE_MANIFESTS in Spark/PyIceberg).
