
Iceberg: parallelize manifest file loading to fix 3s cold-cache bottleneck #1753

Open

il9ue wants to merge 1 commit into Altinity:antalya-25.8 from il9ue:antalya-25.8/iceberg-parallel-manifest-loading

Conversation


@il9ue il9ue commented May 7, 2026

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Parallelize Iceberg manifest file loading during query planning. Cold-cache queries against Iceberg tables with many manifest files (especially on swarm clusters where replicas sit idle while the initiator walks metadata) previously suffered multi-second delays from sequential getManifestFile calls. New setting iceberg_metadata_files_parallel_loading_threads (default 8, range 1–64) controls the parallelism; set to 1 to restore the previous serial behavior. Antalya 25.8 port of upstream ClickHouse#103714. Closes Altinity/ClickHouse#1740.


Description

Cherry-pick of upstream PR ClickHouse/ClickHouse#103714 onto Antalya 25.8 (base tag v25.8.22.20001.altinityantalya).

Why port early

The bug is most acute on Antalya swarm clusters because swarm replicas sit idle on StorageClusterTaskDistributor while the initiator walks manifests serially — max_threads=16 × N replicas of compute is wasted during the metadata walk. A user-reported case on Alexander Zaitsev's test server showed a 3.3-second gap on an object_storage_cluster='swarm' query where every data file was pruned by partition + min/max predicates; the entire wall-clock time was the initiator's serial metadata walk via Iceberg::getManifestFile.

See the upstream PR for full RCA, implementation details, and thread-safety analysis.

Conflicts resolved during cherry-pick

  • src/Common/ProfileEvents.cpp: added IcebergManifestFilesParallelFetchMicroseconds; dropped master-only ParquetMetadata* events not present in Antalya 25.8.
  • src/Common/setThreadName.h: kept Antalya HEAD version. Antalya 25.8 lacks the ThreadName enum that master added, so the thread name is set via string literal "IcebergMFetch" directly in the initParallelPrefetch lambda.
  • src/Core/Settings.cpp: added iceberg_metadata_files_parallel_loading_threads; dropped master-only iceberg_metadata_staleness_ms and use_parquet_metadata_cache settings.
  • src/Core/SettingsChangesHistory.cpp: auto-merged into the 25.8 section (correct for Antalya).
  • IcebergIterator.{h,cpp}: adapted parallel prefetch to Antalya API (ManifestFilePtr futures, Antalya getManifestFile signature, Antalya pruning/logging logic preserved on the serial path); merged the two destructors so that pre-existing ProfileEvents::increment calls and the new "wait for in-flight futures" cleanup both run.
  • Integration test path: moved test file from tests/integration/test_storage_iceberg_no_spark/ (master) to tests/integration/test_storage_iceberg/ (Antalya 25.8 directory rename); updated fixture name and the allow_experimental_insert_into_iceberg setting reference to match Antalya conventions.

Tested

  • Cherry-pick applied cleanly after manual conflict resolution; commit on antalya-25.8/iceberg-parallel-manifest-loading.
  • Local build not verified on this branch: dev environment is on clang 21 while Antalya 25.8 requires clang 19 (per cmake/tools.cmake). Antalya CI on this PR will validate the build with the correct toolchain.
  • The same patch builds and the stateless test 04075_iceberg_parallel_manifest_loading passes on the upstream branch (feature/iceberg-parallel-manifest-loading, clang 21, RelWithDebInfo).

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

Iceberg: parallelize manifest file loading to fix 3s cold-cache bottleneck

< Root cause >
`SingleThreadIcebergKeysIterator::next()` calls `Iceberg::getManifestFile`
for each `ManifestFileCacheKey` in `manifest_list_entries` one at a time.
On a cold `IcebergMetadataFilesCache` (server start, snapshot rotation, or
cache eviction) each call blocks on an S3 GET + Avro parse — roughly 50-100 ms
per manifest at typical S3 RTT. With 30 manifests this accumulates to 1.5-3 s
of serial latency before the first file is handed to the query executor.
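
For a sense of scale, here is a toy, standalone cost model of that serial walk
(purely illustrative C++, none of it is ClickHouse code; the sleep stands in
for one S3 GET + Avro parse at the latency quoted above):

```cpp
// Toy model of the serial metadata walk: each iteration pays one simulated S3
// round trip, so 30 manifests at ~80 ms land near 2.4 s before any data file
// is yielded to the executor.
#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    constexpr int manifest_count = 30;
    constexpr auto simulated_round_trip = std::chrono::milliseconds(80);

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < manifest_count; ++i)
        std::this_thread::sleep_for(simulated_round_trip);   // one blocking manifest fetch
    const auto total = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);

    std::cout << "serial metadata walk: " << total.count() << " ms\n";   // ~2400 ms
}
```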

The deletes drain in `IcebergIterator::IcebergIterator()` suffers the same
problem: it calls `deletes_iterator.next()` in a tight loop, each iteration
waiting on a serial manifest fetch.

`IcebergMetadataFilesCache` was warm on repeated queries (hiding the issue),
but every first query after server start or snapshot change paid the full cost.

< Fix >
Add `iceberg_metadata_files_parallel_loading_threads` setting (default 8,
range 1-64). When > 1, `SingleThreadIcebergKeysIterator` fans out all
`getManifestFile` calls to `getIOThreadPool()` on the first `next()` call
(lazy init), captures one `std::promise`/`std::future` per matching manifest
entry, and then consumes futures in `manifest_list_entries` order in subsequent
`next()` calls.
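
A minimal standalone sketch of that fan-out / consume-in-order shape: std::async
stands in for scheduling onto `getIOThreadPool()`, and the struct, alias, and
`fetchManifest()` are placeholders rather than the real `ManifestFilePtr`,
`ManifestFileCacheKey`, or Antalya `getManifestFile` signatures. The actual
worker lambda also names its thread "IcebergMFetch" (see the conflict notes
above).

```cpp
// Standalone sketch of the fan-out / consume-in-order pattern; not the Antalya diff.
#include <future>
#include <iostream>
#include <string>
#include <vector>

struct ManifestFileStub { std::string path; };   // stand-in for ManifestFilePtr
using ManifestCacheKey = std::string;            // stand-in for ManifestFileCacheKey

ManifestFileStub fetchManifest(ManifestCacheKey key)
{
    // Real code: cold-cache S3 GET + Avro parse, warm-cache hit in IcebergMetadataFilesCache.
    return ManifestFileStub{std::move(key)};
}

int main()
{
    std::vector<ManifestCacheKey> manifest_list_entries{"m1.avro", "m2.avro", "m3.avro"};

    // Lazy init on the first next(): submit one future per matching entry, in index order.
    std::vector<std::future<ManifestFileStub>> futures;
    futures.reserve(manifest_list_entries.size());
    for (const auto & key : manifest_list_entries)
        futures.push_back(std::async(std::launch::async, fetchManifest, key));

    // Subsequent next() calls consume futures in the same order, so downstream
    // pruning and the previous_entry_schema warning logic see entries as before.
    for (auto & fut : futures)
        std::cout << fut.get().path << '\n';
}
```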

Thread-safety analysis:
- `CacheBase::getOrSet` has singleflight protection (one thread calls
  `load_func`, others wait), so N parallel callers for the same key produce
  exactly one S3 GET (sketched after this list).
- `IcebergSchemaProcessor::addIcebergTableSchema` uses an exclusive lock
  (already thread-safe for concurrent callers).
- `AvroForIcebergDeserializer` uses `SharedMutex` on mutable cache fields
  (already thread-safe).
- `ManifestFileIterator::create()` is called serially in `next()` after the
  future resolves — no new concurrency.
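
To make the singleflight point concrete, here is a sketch of the property the
parallel callers rely on; it is not ClickHouse's actual `CacheBase`, just the
`getOrSet` semantics described in the first bullet above:

```cpp
// Singleflight sketch: only the first caller for a key runs load_func; every
// other concurrent caller waits on the same shared future.
#include <functional>
#include <future>
#include <map>
#include <mutex>
#include <string>

template <typename Value>
class SingleFlightCache
{
public:
    Value getOrSet(const std::string & key, std::function<Value()> load_func)
    {
        std::shared_future<Value> loader;
        {
            std::lock_guard<std::mutex> lock(mutex);
            auto it = loads.find(key);
            if (it == loads.end())
                it = loads.emplace(key, std::async(std::launch::async, std::move(load_func)).share()).first;
            loader = it->second;   // each caller waits on its own copy of the shared future
        }
        return loader.get();       // N parallel callers, exactly one S3 GET per key
    }

private:
    std::mutex mutex;
    std::map<std::string, std::shared_future<Value>> loads;
};
```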

Ordering invariant preserved: futures are submitted and consumed in
`manifest_list_entries` index order, so the `previous_entry_schema` warning
logic and downstream pruning remain correct.

Cancellation: the lambda captures all dependencies by value (`shared_ptr`s).
The destructor waits for in-flight futures so no use-after-free is possible
even if the iterator is destroyed mid-query.
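
A stripped-down sketch of that cleanup shape (placeholder class name and
payload type, not the merged Antalya destructor):

```cpp
// The destructor drains pending work so worker lambdas (which capture their
// inputs by value, shared_ptrs in the real patch) never outlive the iterator,
// even if the query is cancelled mid-iteration.
#include <future>
#include <vector>

class PrefetchingIteratorSketch
{
public:
    void prefetch(int key)
    {
        pending.push_back(std::async(std::launch::async, [key] { return key; }));
    }

    ~PrefetchingIteratorSketch()
    {
        for (auto & fut : pending)
            if (fut.valid())
                fut.wait();   // block until every in-flight fetch has finished
    }

private:
    std::vector<std::future<int>> pending;
};

int main()
{
    PrefetchingIteratorSketch it;
    it.prefetch(1);
    it.prefetch(2);
}   // destructor waits here; nothing the lambdas touch can dangle
```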

Setting = 1 bypasses the parallel path entirely and falls through to the
original serial loop — a safe fallback and the CI test mode.

The deletes drain in `IcebergIterator::IcebergIterator()` automatically
benefits: the first `deletes_iterator.next()` triggers `initParallelPrefetch`,
then the drain loop consumes pre-fetched futures, so delete manifests are also
fetched in parallel.

New observable signals:
- `IcebergManifestFilesParallelFetchMicroseconds`: per-future wait time
  accumulated across all `fut.get()` calls in a query. On a warm cache this
  is near-zero; on a cold cache it approaches (max single manifest RTT) instead
  of (sum of all manifest RTTs) — directly measuring the parallelism benefit.
- `ThreadName::ICEBERG_MANIFEST_FETCH` ("IcebergMFetch") on pool threads.

< Tests >
- `04075_iceberg_parallel_manifest_loading.sh`: creates 30-manifest local
  Iceberg table via 30 separate inserts, queries with threads=1 and threads=16,
  asserts identical count (30), sum (465), and full row content.
- `test_storage_iceberg_no_spark/test_parallel_manifest_loading.py`: same
  correctness check in the integration harness; second test verifies
  `IcebergManifestFilesParallelFetchMicroseconds` > 0 when threads=8.