Skip to content

Add Iceberg sync tests and fix deletion detection (#267) and no_action (#268)#263

Closed
tlee732 wants to merge 4 commits into
indextables:mainfrom
tlee732:test/iceberg-sync-parity
Closed

Add Iceberg sync tests and fix deletion detection (#267) and no_action (#268)#263
tlee732 wants to merge 4 commits into
indextables:mainfrom
tlee732:test/iceberg-sync-parity

Conversation

@tlee732
Copy link
Copy Markdown
Contributor

@tlee732 tlee732 commented Mar 26, 2026

Summary

Tests Iceberg companion sync behavior that was previously only tested for Delta (40 tests total). Delta has ~95 sync tests; Iceberg had ~18 before this PR.

What's tested

Incremental sync after data mutations (10 tests) — Verifies companion correctly handles Iceberg table changes: file appends across multiple snapshots, file deletions (equivalent to Delta DELETE), file overwrites (equivalent to OPTIMIZE/compaction), mixed operations, large snapshot gaps, and partitioned table mutations. Each test verifies exact row counts and split invalidation/creation.

WHERE clause partition filtering (17 tests) — Verifies BUILD COMPANION with WHERE clause correctly filters Iceberg partitions: equality (region = 'us-east'), IN, !=, range (>=), strict greater-than, compound AND, OR, and BETWEEN-equivalent. Also tests scoped invalidation (splits outside WHERE range preserved) and INVALIDATE ALL PARTITIONS.

Metadata tracking (3 tests) — Verifies companion records correct Iceberg snapshot IDs and returns no_action when source hasn't changed.

Bug fixes included

This PR also fixes the 4 bugs that the tests exposed:

Bug Fix Issue
Re-sync returns "success" instead of "no_action" Read lastSyncedVersion from companion metadata for non-streaming Iceberg syncs #268
Deleted files not detected by anti-join Replace incremental manifest-diff with full-scan fallback when snapshot changes #267
Invalidation-only syncs silently skip commit Commit remove actions directly when no new files to index #267

Test plan

export JAVA_HOME=/opt/homebrew/opt/openjdk@11
mvn test-compile scalatest:test -DwildcardSuites='io.indextables.spark.sync.IcebergMutationSyncTest,io.indextables.spark.sync.IcebergDistributedSyncTest,io.indextables.spark.sync.IcebergWhereClauseExtendedTest'

🤖 Generated with Claude Code

Closes the 5:1 Delta-to-Iceberg test gap for sync operations:
- IcebergMutationSyncTest: 10 tests — append/delete/overwrite/mixed ops, partitioned tables
- IcebergDistributedSyncTest: 12 tests — WHERE clause (=, IN, !=, range, AND, OR), scoped invalidation
- IcebergWhereClauseExtendedTest: 5 tests — compound AND, OR, strict >, BETWEEN, range filters
- IcebergSchemaEvolutionSyncTest: 4 tests — add/drop/rename column, new column indexing

7 tests intentionally fail, catching real Iceberg bugs:
- Scoped invalidation doesn't detect deleted files within WHERE range
- Sync returns "success" instead of "no_action" for unchanged snapshots
- Sync after add/drop/rename column returns "error" (schema evolution not handled)
- Partitioned delete not detected by anti-join

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tlee732 tlee732 changed the title Add 44 Iceberg sync parity tests (F1-F4) Add tests for Iceberg incremental sync after mutations, WHERE clause filtering, and schema evolution Mar 26, 2026
@schenksj
Copy link
Copy Markdown
Collaborator

The system doesn't currently support schema evolution. We'd need to add that support before merging these.

@schenksj schenksj marked this pull request as draft March 27, 2026 01:06
tlee732 and others added 3 commits March 29, 2026 14:09
Schema evolution support doesn't exist yet (indextables#269), so these tests
shouldn't be part of this PR. They can be reintroduced alongside
the feature implementation.

Addresses reviewer feedback from @schenksj.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ndextables#268)

Three changes:

1. Read lastSyncedVersion from companion metadata for non-streaming
   Iceberg syncs, enabling the snapshot ID comparison that returns
   "no_action" when the source hasn't changed. (fixes indextables#268)

2. Replace the incremental manifest-diff path with a full-scan
   fallback when the Iceberg snapshot has changed. The manifest-diff
   only detected new files — it never populated removedSourcePaths,
   so file deletions were invisible. The full-scan anti-join already
   handles both new and deleted files correctly. (fixes indextables#267)

3. Handle invalidation-only syncs (splits to remove, no new files).
   Previously, remove actions were only committed during batch
   processing — if there were no files to index, no batch ran and
   the stale splits were never removed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace hand-built Row in invalidation-only path with buildResultRow
- Hoist val externalStorageRoot before step 8b to eliminate duplicate
  icebergStorageRoot.orElse(resolvedStorageLocation) expression
- Trim verbose comments: effectiveFromSnapshot (4→1 line), remove
  comment duplicating log message, simplify scanner block comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tlee732
Copy link
Copy Markdown
Contributor Author

tlee732 commented Mar 29, 2026

Addressed — schema evolution tests removed. The PR now contains only tests for functionality the system already supports (mutations, WHERE clauses, metadata tracking).

While writing these tests, they exposed 3 real bugs in the Iceberg sync path which are now also fixed in this PR:

  1. Iceberg companion sync returns 'success' instead of 'no_action' for unchanged snapshots #268: Re-sync with no changes returned "success" instead of "no_action" — non-streaming Iceberg syncs now read lastSyncedVersion from companion metadata to enable the snapshot comparison fast-path.

  2. Iceberg scoped invalidation does not detect deleted files #267: File deletions not detected — the incremental manifest-diff path only tracked new files and never populated removedSourcePaths. Replaced with a full-scan fallback when the snapshot changes, letting the existing anti-join handle both new and deleted files.

  3. Iceberg scoped invalidation does not detect deleted files #267: Invalidation-only syncs (files deleted, nothing new to index) silently skipped the commit — the remove actions were only committed during batch processing, so an empty batch meant stale splits were never removed.

All 27 tests pass (0 failures). The production changes are in SyncToExternalCommand.scala and DistributedSourceScanner.scala.

@tlee732 tlee732 changed the title Add tests for Iceberg incremental sync after mutations, WHERE clause filtering, and schema evolution Add Iceberg sync tests and fix deletion detection (#267) and no_action (#268) Mar 29, 2026
@tlee732 tlee732 marked this pull request as ready for review March 30, 2026 05:47
Copy link
Copy Markdown
Collaborator

@schenksj schenksj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overview

Solid PR — the three bug fixes are directionally correct and the new tests provide meaningful coverage. Key findings below, ranked by severity.


Bug Fix 1: Re-sync no_action (non-streaming Iceberg)

Correct fix for the distributed path. The effectiveFromSnapshot = fromSnapshot.orElse(readLastSyncedVersionFromLog(...)) change correctly avoids re-indexing already-synced files.

Gap — non-distributed reader path not fixed. The fix is only applied in the case "iceberg" distributed block (~line 321). The non-distributed path at ~line 240 still passes the original fromSnapshot (which is None on a manual re-sync) to IcebergSourceReader. If spark.indextables.companion.sync.distributed.enabled=false, this bug persists. Either fix the reader path or document this gap.

Minor: readLastSyncedVersionFromLog opens the transaction log on every non-streaming Iceberg sync, even for a first-ever sync when no log exists. The method handles this gracefully (returns None), but it's unnecessary filesystem I/O. Low priority.


Bug Fix 2: Deleted files not detected (manifest-diff removal)

Correct trade-off. The manifest-diff path was genuinely broken for deletions (it only detected new manifests, missing file removals from existing manifests). Falling back to the full-scan anti-join is semantically correct.

High-severity operational concern: The manifest-diff path was the streaming fast-path. Removing it means every streaming micro-batch where the snapshot has changed now triggers a full distributed scan. For large tables with frequent small appends this is potentially orders-of-magnitude slower. Please:

  1. Add a release note / migration note for streaming Iceberg users.
  2. Consider whether Iceberg's IncrementalAppendScan (which tracks both adds and deletes incrementally) can replace the old approach.

maxIncrementalManifests config is now silently dead. Any user who set spark.indextables.companion.sync.iceberg.maxIncrementalManifests in their config will see no error but no effect. This should be documented as deprecated or removed.

Code clarity: In the case Some(fsnap) block, the fall-through from the else branch to the full-scan code below is implicit. Add a // fall through to full scan comment to the else branch body for readability.


Bug Fix 3: Invalidation-only syncs skip commit

Clearly correct. The commit path was only reachable via dispatchSyncTasksBatched, so deletion-only syncs (no new files to index) never committed their remove actions. The new early-return block is clean and follows existing patterns.

Minor: The invalidation-only path skips streaming metrics emission (no Spark job is run). This is acceptable but worth documenting.


Test Quality

Potential Test Bug (High)

IcebergMutationSyncTest at the snapshot_meta test calls TransactionLogFactory.create(new Path(indexPath), spark) without passing allowDirectUsage -> true. Every other test in this codebase that directly instantiates TransactionLog requires this flag. This test may fail with an "allowDirectUsage" exception. Please verify.

// Known bug: comment in IcebergDistributedSyncTest

There's a // Known bug: comment on a test that appears to be testing the fixed behavior. If the fix in this PR makes the test pass, remove the comment. If the test is still expected to fail, convert it to pendingUntilFixed.

Misleading comment on splits count

IcebergMutationSyncTest line ~925: row2.getInt(6) shouldBe 2 // re-indexes existing + 1 new file. With the manifest-diff path removed, the full-scan anti-join runs. The assertion value may need adjustment and the comment ("re-indexes existing + 1 new file") implies incremental semantics that no longer apply.

Code duplication

All three new test files have near-identical beforeAll() SparkSession setup (~22 lines) and an identical private flushCaches() method. This is a 3x maintenance risk — extract to a shared IcebergSyncTestBase trait.

Missing coverage

  1. IcebergWhereClauseExtendedTest has no mutation/re-sync tests. It only tests initial sync with WHERE filters. The scoped-invalidation bug fix is most relevant during incremental re-sync — add at least one test that deletes a partitioned file and verifies scoped re-sync with a WHERE clause.
  2. Empty table first sync — no test for syncing an Iceberg table with no snapshots yet.
  3. Combined Bug 2 + Bug 3 — no test for INVALIDATE ALL PARTITIONS combined with deletion-only (no new files), which exercises both fixes together.

Summary Table

Finding Severity
Non-distributed reader path not fixed for Bug 1 Medium
Streaming perf regression (full scan on every incremental cycle) High
maxIncrementalManifests config is silently dead Medium
TransactionLogFactory.create missing allowDirectUsage flag (potential test failure) High
// Known bug: comment likely stale after this fix Medium
Misleading splits count comment after full-scan change Low
Duplicate beforeAll / flushCaches across 3 test files Low
Missing WHERE + mutation re-sync tests Low
fromSnapshotId fall-through needs comment Low

🤖 Generated with Claude Code

Copy link
Copy Markdown
Collaborator

@schenksj schenksj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address the findings in the review added on the conversation. The biggest issue (I consider a major blocker) is that this remove incremental/steaming synchronization which would be a massive performance regression. We can't promote this if it breaks incremental sync.

@schenksj schenksj marked this pull request as draft March 30, 2026 13:51
@schenksj
Copy link
Copy Markdown
Collaborator

also, please resolve merge conflicts

@tlee732
Copy link
Copy Markdown
Contributor Author

tlee732 commented Apr 1, 2026

Superseded by #286, which combines this PR with #264 and addresses all reviewer feedback (streaming resume separation, enhanced manifest-diff with deletion detection, invalidation-only commit path).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants