Iceberg: parallelize manifest file loading to fix 3s cold-cache bottleneck #1753
Open
il9ue wants to merge 1 commit into Altinity:antalya-25.8 from
< Root cause >
`SingleThreadIcebergKeysIterator::next()` calls `Iceberg::getManifestFile`
for each `ManifestFileCacheKey` in `manifest_list_entries` one at a time.
On a cold `IcebergMetadataFilesCache` (server start, snapshot rotation, or
cache eviction) each call blocks on an S3 GET + Avro parse — roughly 50-100 ms
per manifest at typical S3 RTT. With 30 manifests this accumulates to 1.5-3 s
of serial latency before the first file is handed to the query executor.
The delete-files drain loop in `IcebergIterator::IcebergIterator()` suffers the
same problem: it calls `deletes_iterator.next()` in a tight loop, with each
iteration waiting on a serial manifest fetch.
`IcebergMetadataFilesCache` was warm on repeated queries (hiding the issue),
but every first query after server start or snapshot change paid the full cost.
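For intuition, here is a minimal standalone model of the serial cold path (the
names and timings are illustrative stand-ins, not the actual ClickHouse code):

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <vector>

struct ManifestFile { std::string path; };

static ManifestFile fetchManifest(const std::string & key)
{
    /// Stand-in for the blocking S3 GET + Avro parse (~50-100 ms when cold).
    std::this_thread::sleep_for(std::chrono::milliseconds(75));
    return ManifestFile{key};
}

int main()
{
    std::vector<std::string> manifest_list_entries(30, "s3://bucket/m.avro");
    for (const auto & key : manifest_list_entries)
        fetchManifest(key);    /// 30 x ~75 ms ≈ 2.25 s before any data file is yielded
}
```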
< Fix >
Add `iceberg_metadata_files_parallel_loading_threads` setting (default 8,
range 1-64). When > 1, `SingleThreadIcebergKeysIterator` fans out all
`getManifestFile` calls to `getIOThreadPool()` on the first `next()` call
(lazy init), captures one `std::promise`/`std::future` per matching manifest
entry, and then consumes futures in `manifest_list_entries` order in subsequent
`next()` calls.
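A minimal standalone sketch of this fan-out/ordered-consume shape (simplified
types; `std::async` stands in for ClickHouse's `getIOThreadPool()`, so this is
a model of the behavior, not the actual implementation):

```cpp
#include <cstddef>
#include <future>
#include <string>
#include <utility>
#include <vector>

struct ManifestFile { std::string path; };

/// Stand-in for Iceberg::getManifestFile (cold path: S3 GET + Avro parse).
static ManifestFile getManifestFile(const std::string & key)
{
    return ManifestFile{key};
}

class KeysIterator
{
public:
    explicit KeysIterator(std::vector<std::string> keys_) : keys(std::move(keys_)) {}

    ~KeysIterator()
    {
        /// Wait for in-flight fetches so destroying the iterator mid-query
        /// cannot leave a pool thread touching freed state.
        for (auto & fut : futures)
            if (fut.valid())
                fut.wait();
    }

    /// Caller stops after keys.size() calls.
    ManifestFile next()
    {
        if (futures.empty())                       /// lazy init on the first call
            for (const auto & key : keys)          /// fan out every fetch at once
                futures.push_back(std::async(std::launch::async, getManifestFile, key));
        return futures[pos++].get();               /// consume in submission order
    }

private:
    std::vector<std::string> keys;
    std::vector<std::future<ManifestFile>> futures;
    std::size_t pos = 0;
};
```

On a cold cache the waits overlap, so the total time spent in `next()`
approaches the slowest single fetch rather than the sum over all manifests.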
Thread-safety analysis:
- `CacheBase::getOrSet` has singleflight protection (one thread calls
`load_func`, others wait), so N parallel callers for the same key produce
exactly one S3 GET; see the sketch after this list.
- `IcebergSchemaProcessor::addIcebergTableSchema` uses an exclusive lock
(already thread-safe for concurrent callers).
- `AvroForIcebergDeserializer` uses `SharedMutex` on mutable cache fields
(already thread-safe).
- `ManifestFileIterator::create()` is called serially in `next()` after the
future resolves — no new concurrency.
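For reference, a minimal sketch of the singleflight behavior the first bullet
relies on (this models the semantics assumed of `CacheBase::getOrSet`, not its
actual implementation; error propagation and eviction are omitted):

```cpp
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

template <typename Value>
class SingleflightCache
{
public:
    using ValuePtr = std::shared_ptr<Value>;

    /// Concurrent callers for the same key all wait on one shared future,
    /// so load_func (the S3 GET + Avro parse) runs exactly once per key.
    ValuePtr getOrSet(const std::string & key, const std::function<ValuePtr()> & load_func)
    {
        std::promise<ValuePtr> promise;
        std::shared_future<ValuePtr> future = promise.get_future().share();
        bool is_loader = false;
        {
            std::lock_guard lock(mutex);
            auto [it, inserted] = entries.try_emplace(key, future);
            is_loader = inserted;
            future = it->second;              /// losers adopt the winner's future
        }
        if (is_loader)
            promise.set_value(load_func());   /// load exactly once, outside the lock
        return future.get();
    }

private:
    std::mutex mutex;
    std::unordered_map<std::string, std::shared_future<ValuePtr>> entries;
};
```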
Ordering invariant preserved: futures are submitted and consumed in
`manifest_list_entries` index order, so the `previous_entry_schema` warning
logic and downstream pruning remain correct.
Cancellation: the lambda captures all dependencies by value (`shared_ptr`s).
The destructor waits for in-flight futures so no use-after-free is possible
even if the iterator is destroyed mid-query.
Setting it to 1 bypasses the parallel path entirely and falls back to the
original serial loop — a safe fallback and the CI test mode.
The delete-files drain loop in `IcebergIterator::IcebergIterator()` benefits
automatically: the first `deletes_iterator.next()` triggers `initParallelPrefetch`,
and the drain loop then consumes pre-fetched futures, so delete manifests are also
fetched in parallel.
New observable signals:
- `IcebergManifestFilesParallelFetchMicroseconds`: per-future wait time
accumulated across all `fut.get()` calls in a query (sketched below). On a warm cache this
is near-zero; on a cold cache it approaches (max single manifest RTT) instead
of (sum of all manifest RTTs) — directly measuring the parallelism benefit.
- Pool threads are named "IcebergMFetch" (upstream sets this via the
`ThreadName::ICEBERG_MANIFEST_FETCH` enum; this port sets the string literal
directly, see the conflicts section below).
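A sketch of how such a per-future wait counter can be derived (the helper and
variable names here are assumptions for illustration, not the actual
ProfileEvents plumbing):

```cpp
#include <chrono>
#include <cstdint>
#include <future>

/// Accumulated across all fut.get() calls in a query; reported as
/// IcebergManifestFilesParallelFetchMicroseconds.
static std::uint64_t parallel_fetch_microseconds = 0;

template <typename T>
T timedGet(std::future<T> & fut)
{
    const auto start = std::chrono::steady_clock::now();
    T value = fut.get();    /// near-zero on a warm cache; up to one RTT when cold
    parallel_fetch_microseconds += std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - start).count();
    return value;
}
```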
< Tests >
- `04075_iceberg_parallel_manifest_loading.sh`: creates a 30-manifest local
Iceberg table via 30 separate inserts, queries it with
`iceberg_metadata_files_parallel_loading_threads` set to 1 and to 16, and
asserts identical count (30), sum (465), and full row content.
- `test_storage_iceberg_no_spark/test_parallel_manifest_loading.py`: same
correctness check in the integration harness; second test verifies
`IcebergManifestFilesParallelFetchMicroseconds` > 0 when threads=8.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Parallelize Iceberg manifest file loading during query planning. Cold-cache queries against Iceberg tables with many manifest files (especially on swarm clusters where replicas sit idle while the initiator walks metadata) previously suffered multi-second delays from sequential `getManifestFile` calls. New setting `iceberg_metadata_files_parallel_loading_threads` (default 8, range 1–64) controls the parallelism; set to 1 to restore the previous serial behavior. Antalya 25.8 port of upstream ClickHouse#103714. Closes Altinity/ClickHouse#1740.
Description
Cherry-pick of upstream PR ClickHouse/ClickHouse#103714 onto Antalya 25.8 (base tag
`v25.8.22.20001.altinityantalya`).
Why port early
The bug is most acute on Antalya swarm clusters because swarm replicas sit idle on
`StorageClusterTaskDistributor` while the initiator walks manifests serially —
`max_threads=16` × N replicas of compute is wasted during the metadata walk. A
user-reported case on Alexander Zaitsev's test server showed a 3.3-second gap on an
`object_storage_cluster='swarm'` query where every data file was pruned by
partition + min/max predicates; the entire wall-clock time was the initiator's
serial metadata walk via `Iceberg::getManifestFile`. See the upstream PR for full
RCA, implementation details, and thread-safety analysis.
Conflicts resolved during cherry-pick
- `src/Common/ProfileEvents.cpp`: added `IcebergManifestFilesParallelFetchMicroseconds`; dropped master-only `ParquetMetadata*` events not present in Antalya 25.8.
- `src/Common/setThreadName.h`: kept Antalya HEAD version. Antalya 25.8 lacks the `ThreadName` enum that master added, so the thread name is set via string literal `"IcebergMFetch"` directly in the `initParallelPrefetch` lambda.
- `src/Core/Settings.cpp`: added `iceberg_metadata_files_parallel_loading_threads`; dropped master-only `iceberg_metadata_staleness_ms` and `use_parquet_metadata_cache` settings.
- `src/Core/SettingsChangesHistory.cpp`: auto-merged into the `25.8` section (correct for Antalya).
- `IcebergIterator.{h,cpp}`: adapted parallel prefetch to the Antalya API (`ManifestFilePtr` futures, Antalya `getManifestFile` signature, Antalya pruning/logging logic preserved on the serial path); merged the two destructors so that pre-existing `ProfileEvents::increment` calls and the new "wait for in-flight futures" cleanup both run.
- `tests/integration/test_storage_iceberg_no_spark/` (master) to `tests/integration/test_storage_iceberg/` (Antalya 25.8 directory rename); updated fixture name and the `allow_experimental_insert_into_iceberg` setting reference to match Antalya conventions.
Tested
- `antalya-25.8/iceberg-parallel-manifest-loading` … (`cmake/tools.cmake`). Antalya CI on this PR will validate the build with the correct toolchain.
- `04075_iceberg_parallel_manifest_loading` passes on the upstream branch (`feature/iceberg-parallel-manifest-loading`, clang 21, RelWithDebInfo).
CI/CD Options
Exclude tests:
Regression jobs to run: