
Iceberg: parallelize manifest file loading to fix 3s cold-cache bottleneck #1753

Open

il9ue wants to merge 1 commit into Altinity:antalya-25.8 from il9ue:antalya-25.8/iceberg-parallel-manifest-loading

Conversation


@il9ue il9ue commented May 7, 2026

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Parallelize Iceberg manifest file loading during query planning. Cold-cache queries against Iceberg tables with many manifest files (especially on swarm clusters where replicas sit idle while the initiator walks metadata) previously suffered multi-second delays from sequential getManifestFile calls. New setting iceberg_metadata_files_parallel_loading_threads (default 8, range 1–64) controls the parallelism; set to 1 to restore the previous serial behavior. Antalya 25.8 port of upstream ClickHouse#103714. Closes Altinity/ClickHouse#1740.


Description

Cherry-pick of upstream PR ClickHouse/ClickHouse#103714 onto Antalya 25.8 (base tag v25.8.22.20001.altinityantalya).

Why port early

The bug is most acute on Antalya swarm clusters because swarm replicas sit idle on StorageClusterTaskDistributor while the initiator walks manifests serially — max_threads=16 × N replicas of compute is wasted during the metadata walk. A user-reported case on Alexander Zaitsev's test server showed a 3.3-second gap on an object_storage_cluster='swarm' query where every data file was pruned by partition + min/max predicates; the entire wall-clock time was the initiator's serial metadata walk via Iceberg::getManifestFile.

See the upstream PR for full RCA, implementation details, and thread-safety analysis.

Conflicts resolved during cherry-pick

  • src/Common/ProfileEvents.cpp: added IcebergManifestFilesParallelFetchMicroseconds; dropped master-only ParquetMetadata* events not present in Antalya 25.8.
  • src/Common/setThreadName.h: kept Antalya HEAD version. Antalya 25.8 lacks the ThreadName enum that master added, so the thread name is set via string literal "IcebergMFetch" directly in the initParallelPrefetch lambda.
  • src/Core/Settings.cpp: added iceberg_metadata_files_parallel_loading_threads; dropped master-only iceberg_metadata_staleness_ms and use_parquet_metadata_cache settings.
  • src/Core/SettingsChangesHistory.cpp: auto-merged into the 25.8 section (correct for Antalya).
  • IcebergIterator.{h,cpp}: adapted parallel prefetch to Antalya API (ManifestFilePtr futures, Antalya getManifestFile signature, Antalya pruning/logging logic preserved on the serial path); merged the two destructors so that pre-existing ProfileEvents::increment calls and the new "wait for in-flight futures" cleanup both run.
  • Integration test path: moved test file from tests/integration/test_storage_iceberg_no_spark/ (master) to tests/integration/test_storage_iceberg/ (Antalya 25.8 directory rename); updated fixture name and the allow_experimental_insert_into_iceberg setting reference to match Antalya conventions.

Tested

  • Cherry-pick applied cleanly after manual conflict resolution; commit on antalya-25.8/iceberg-parallel-manifest-loading.
  • Local build not verified on this branch: dev environment is on clang 21 while Antalya 25.8 requires clang 19 (per cmake/tools.cmake). Antalya CI on this PR will validate the build with the correct toolchain.
  • The same patch builds and the stateless test 04075_iceberg_parallel_manifest_loading passes on the upstream branch (feature/iceberg-parallel-manifest-loading, clang 21, RelWithDebInfo).

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

Iceberg: parallelize manifest file loading to fix 3s cold-cache bottleneck

< Root cause >
`SingleThreadIcebergKeysIterator::next()` calls `Iceberg::getManifestFile`
for each `ManifestFileCacheKey` in `manifest_list_entries` one at a time.
On a cold `IcebergMetadataFilesCache` (server start, snapshot rotation, or
cache eviction) each call blocks on an S3 GET + Avro parse — roughly 50-100 ms
per manifest at typical S3 RTT. With 30 manifests this accumulates to 1.5-3 s
of serial latency before the first file is handed to the query executor.
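
For a sense of scale, here is a toy, standalone cost model of that serial walk
(purely illustrative C++, none of it is ClickHouse code; the sleep stands in
for one S3 GET + Avro parse at the latency quoted above):

```cpp
// Toy model of the serial metadata walk: each iteration pays one simulated S3
// round trip, so 30 manifests at ~80 ms land near 2.4 s before any data file
// is yielded to the executor.
#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    constexpr int manifest_count = 30;
    constexpr auto simulated_round_trip = std::chrono::milliseconds(80);

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < manifest_count; ++i)
        std::this_thread::sleep_for(simulated_round_trip);   // one blocking manifest fetch
    const auto total = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);

    std::cout << "serial metadata walk: " << total.count() << " ms\n";   // ~2400 ms
}
```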

The deletes drain in `IcebergIterator::IcebergIterator()` suffers the same
problem: it calls `deletes_iterator.next()` in a tight loop, each iteration
waiting on a serial manifest fetch.

`IcebergMetadataFilesCache` was warm on repeated queries (hiding the issue),
but every first query after server start or snapshot change paid the full cost.

< Fix >
Add `iceberg_metadata_files_parallel_loading_threads` setting (default 8,
range 1-64). When > 1, `SingleThreadIcebergKeysIterator` fans out all
`getManifestFile` calls to `getIOThreadPool()` on the first `next()` call
(lazy init), captures one `std::promise`/`std::future` per matching manifest
entry, and then consumes futures in `manifest_list_entries` order in subsequent
`next()` calls.
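
A minimal standalone sketch of that fan-out / consume-in-order shape: std::async
stands in for scheduling onto `getIOThreadPool()`, and the struct, alias, and
`fetchManifest()` are placeholders rather than the real `ManifestFilePtr`,
`ManifestFileCacheKey`, or Antalya `getManifestFile` signatures. The actual
worker lambda also names its thread "IcebergMFetch" (see the conflict notes
above).

```cpp
// Standalone sketch of the fan-out / consume-in-order pattern; not the Antalya diff.
#include <future>
#include <iostream>
#include <string>
#include <vector>

struct ManifestFileStub { std::string path; };   // stand-in for ManifestFilePtr
using ManifestCacheKey = std::string;            // stand-in for ManifestFileCacheKey

ManifestFileStub fetchManifest(ManifestCacheKey key)
{
    // Real code: cold-cache S3 GET + Avro parse, warm-cache hit in IcebergMetadataFilesCache.
    return ManifestFileStub{std::move(key)};
}

int main()
{
    std::vector<ManifestCacheKey> manifest_list_entries{"m1.avro", "m2.avro", "m3.avro"};

    // Lazy init on the first next(): submit one future per matching entry, in index order.
    std::vector<std::future<ManifestFileStub>> futures;
    futures.reserve(manifest_list_entries.size());
    for (const auto & key : manifest_list_entries)
        futures.push_back(std::async(std::launch::async, fetchManifest, key));

    // Subsequent next() calls consume futures in the same order, so downstream
    // pruning and the previous_entry_schema warning logic see entries as before.
    for (auto & fut : futures)
        std::cout << fut.get().path << '\n';
}
```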

Thread-safety analysis:
- `CacheBase::getOrSet` has singleflight protection (one thread calls
  `load_func`, others wait), so N parallel callers for the same key produce
  exactly one S3 GET (sketched after this list).
- `IcebergSchemaProcessor::addIcebergTableSchema` uses an exclusive lock
  (already thread-safe for concurrent callers).
- `AvroForIcebergDeserializer` uses `SharedMutex` on mutable cache fields
  (already thread-safe).
- `ManifestFileIterator::create()` is called serially in `next()` after the
  future resolves — no new concurrency.
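
To make the singleflight point concrete, here is a sketch of the property the
parallel callers rely on; it is not ClickHouse's actual `CacheBase`, just the
`getOrSet` semantics described in the first bullet above:

```cpp
// Singleflight sketch: only the first caller for a key runs load_func; every
// other concurrent caller waits on the same shared future.
#include <functional>
#include <future>
#include <map>
#include <mutex>
#include <string>

template <typename Value>
class SingleFlightCache
{
public:
    Value getOrSet(const std::string & key, std::function<Value()> load_func)
    {
        std::shared_future<Value> loader;
        {
            std::lock_guard<std::mutex> lock(mutex);
            auto it = loads.find(key);
            if (it == loads.end())
                it = loads.emplace(key, std::async(std::launch::async, std::move(load_func)).share()).first;
            loader = it->second;   // each caller waits on its own copy of the shared future
        }
        return loader.get();       // N parallel callers, exactly one S3 GET per key
    }

private:
    std::mutex mutex;
    std::map<std::string, std::shared_future<Value>> loads;
};
```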

Ordering invariant preserved: futures are submitted and consumed in
`manifest_list_entries` index order, so the `previous_entry_schema` warning
logic and downstream pruning remain correct.

Cancellation: the lambda captures all dependencies by value (`shared_ptr`s).
The destructor waits for in-flight futures so no use-after-free is possible
even if the iterator is destroyed mid-query.
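
A stripped-down sketch of that cleanup shape (placeholder class name and
payload type, not the merged Antalya destructor):

```cpp
// The destructor drains pending work so worker lambdas (which capture their
// inputs by value, shared_ptrs in the real patch) never outlive the iterator,
// even if the query is cancelled mid-iteration.
#include <future>
#include <vector>

class PrefetchingIteratorSketch
{
public:
    void prefetch(int key)
    {
        pending.push_back(std::async(std::launch::async, [key] { return key; }));
    }

    ~PrefetchingIteratorSketch()
    {
        for (auto & fut : pending)
            if (fut.valid())
                fut.wait();   // block until every in-flight fetch has finished
    }

private:
    std::vector<std::future<int>> pending;
};

int main()
{
    PrefetchingIteratorSketch it;
    it.prefetch(1);
    it.prefetch(2);
}   // destructor waits here; nothing the lambdas touch can dangle
```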

Setting = 1 bypasses the parallel path entirely and falls through to the
original serial loop — a safe fallback and the CI test mode.

The deletes drain in `IcebergIterator::IcebergIterator()` automatically
benefits: the first `deletes_iterator.next()` triggers `initParallelPrefetch`,
then the drain loop consumes pre-fetched futures, so delete manifests are also
fetched in parallel.

New observable signals:
- `IcebergManifestFilesParallelFetchMicroseconds`: per-future wait time
  accumulated across all `fut.get()` calls in a query. On a warm cache this
  is near-zero; on a cold cache it approaches (max single manifest RTT) instead
  of (sum of all manifest RTTs) — directly measuring the parallelism benefit.
- `ThreadName::ICEBERG_MANIFEST_FETCH` ("IcebergMFetch") on pool threads.

< Tests >
- `04075_iceberg_parallel_manifest_loading.sh`: creates 30-manifest local
  Iceberg table via 30 separate inserts, queries with threads=1 and threads=16,
  asserts identical count (30), sum (465), and full row content.
- `test_storage_iceberg_no_spark/test_parallel_manifest_loading.py`: same
  correctness check in the integration harness; second test verifies
  `IcebergManifestFilesParallelFetchMicroseconds` > 0 when threads=8.