[GOBBLIN-ICEBERG] Add configurable partition filter with hourly look…#4171

Open
debabhishek53 wants to merge 8 commits into apache:master from debabhishek53:master

@debabhishek53
This PR makes Iceberg partition filtering fully configurable and reusable across all copy flows.

Previously, the partition filter was hardcoded to append -00 for hourly tables and supported only the yyyy-MM-dd and yyyy-MM-dd-HH patterns with a fixed daily lookback. This change generalizes the entire mechanism.

New configs:

  • iceberg.partition.value.format — accept any DateTimeFormatter pattern (yyyy-MM-dd-HH, dd-MM-yyyy-HH, yyyyMMdd, etc.) so tables with non-standard partition naming just work out of the box
  • iceberg.partition.hour (0–23) — explicitly control which hour is embedded in daily partition values instead of always defaulting to 00
  • iceberg.lookback.hours — hour-level lookback granularity that naturally crosses midnight and month boundaries; takes precedence over iceberg.lookback.days when set
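
To illustrate how a configurable pattern and an explicit hour might combine into a partition value, here is a minimal sketch using only java.time. The class name `PartitionValueFormatSketch` and the method `partitionValue` are hypothetical and are not taken from this PR; the sketch only assumes that the configured pattern is handed to `DateTimeFormatter.ofPattern` and that the configured hour is embedded into the date before formatting.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration only; not the PR's actual code.
final class PartitionValueFormatSketch {

  /**
   * Formats a date with the given hour embedded, using any DateTimeFormatter pattern.
   * Patterns without an hour field (e.g. yyyyMMdd) simply ignore the hour.
   */
  static String partitionValue(LocalDate date, int hour, String pattern) {
    if (hour < 0 || hour > 23) {
      throw new IllegalArgumentException("hour must be in [0, 23], got " + hour);
    }
    LocalDateTime dateTime = date.atTime(hour, 0);
    return DateTimeFormatter.ofPattern(pattern).format(dateTime);
  }
}
```

Under these assumptions, the same date and hour produce differently shaped values depending only on the pattern, for example `partitionValue(LocalDate.of(2026, 3, 9), 5, "dd-MM-yyyy-HH")` yields `09-03-2026-05`.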

New utility: IcebergPartitionFilterGenerator

Extracted a pure, config-agnostic utility class with three public static methods:

  • forDays(...) — N daily partition values, most-recent first
  • forHours(...) — N hourly partition values with natural day/month boundary crossing
  • buildOrExpression(...) — builds an Iceberg OR expression from any pre-computed value list
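
The three methods above could look roughly like the following sketch. The PR elides the real signatures (`forDays(...)`, `forHours(...)`, `buildOrExpression(...)`), so the parameter lists here are assumptions, and `describeOr` is a hypothetical string-based stand-in for the Iceberg OR expression so that the example stays free of the Iceberg dependency. The column name `datepartition` in the usage note is likewise illustrative.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; signatures are assumed, not copied from the PR.
final class PartitionFilterGeneratorSketch {

  /** Returns `days` partition values, most recent first, stepping back one day at a time. */
  static List<String> forDays(LocalDateTime anchor, int days, DateTimeFormatter formatter) {
    requirePositive(days, "days");
    List<String> values = new ArrayList<>(days);
    for (int i = 0; i < days; i++) {
      values.add(formatter.format(anchor.minusDays(i)));
    }
    return Collections.unmodifiableList(values);
  }

  /** Returns `hours` partition values, most recent first; minusHours handles day/month rollover. */
  static List<String> forHours(LocalDateTime anchor, int hours, DateTimeFormatter formatter) {
    requirePositive(hours, "hours");
    List<String> values = new ArrayList<>(hours);
    for (int i = 0; i < hours; i++) {
      values.add(formatter.format(anchor.minusHours(i)));
    }
    return Collections.unmodifiableList(values);
  }

  /** String stand-in for an OR filter: col = 'v1' OR col = 'v2' ... */
  static String describeOr(String column, List<String> values) {
    return values.stream()
        .map(v -> column + " = '" + v + "'")
        .collect(Collectors.joining(" OR "));
  }

  private static void requirePositive(int n, String name) {
    if (n <= 0) {
      throw new IllegalArgumentException(name + " must be positive, got " + n);
    }
  }
}
```

With an anchor of 2026-03-01 01:00, a three-hour lookback, and the pattern yyyy-MM-dd-HH, `forHours` in this sketch yields 2026-03-01-01, 2026-03-01-00, and 2026-02-28-23, which is the boundary-crossing behavior the description refers to. Passing that list to `describeOr("datepartition", values)` joins one equality clause per value with OR.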

This utility is fully decoupled from Gobblin config so it can be called from any flow, not just the copy pipeline.

Backward compatible: When iceberg.partition.value.format is absent, the legacy iceberg.hourly.partition.enabled path runs unchanged.

Test Plan

  • 21 new unit tests in IcebergPartitionFilterGeneratorTest — 100% instruction/branch/line coverage
  • 19 new tests in IcebergSourceTest covering all new config paths: custom formats, reversed date patterns, hour override, hourly lookback, boundary crossing
  • All tests in gobblin-data-management pass

…back support

- Introduce IcebergPartitionFilterGenerator utility for reusable Iceberg OR expression building (forDays, forHours, buildOrExpression)
- Add iceberg.partition.value.format for arbitrary DateTimeFormatter patterns
- Add iceberg.partition.hour (0-23) for explicit hour control in daily partitions
- Add iceberg.lookback.hours for hour-level granularity, takes precedence over iceberg.lookback.days when set
Copilot AI left a comment

Pull request overview

This PR generalizes Iceberg partition filtering in IcebergSource by introducing configurable partition value formatting, hour control, and hour-granularity lookback, and by extracting reusable partition filter generation logic into a new utility.

Changes:

  • Add iceberg.partition.value.format, iceberg.partition.hour, and iceberg.lookback.hours support (with lookback.hours taking precedence when > 0).
  • Refactor IcebergSource partition filter generation to use the new IcebergPartitionFilterGenerator utility.
  • Add extensive unit test coverage for the new generator utility and the new/legacy config paths in IcebergSource.
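
The precedence rules summarized above can be sketched as a small resolver over `java.util.Properties`. This is an assumption-laden illustration, not the logic in IcebergSource: the class `LookbackModeSketch`, the `Mode` enum, and the default of 0 for an unset iceberg.lookback.hours are all hypothetical, and the sketch assumes the hourly lookback is only consulted once iceberg.partition.value.format is present.

```java
import java.util.Properties;

// Hypothetical illustration only; IcebergSource's actual branching may differ.
final class LookbackModeSketch {

  enum Mode { LEGACY, HOURLY_LOOKBACK, DAILY_LOOKBACK }

  static Mode resolve(Properties props) {
    String format = props.getProperty("iceberg.partition.value.format");
    if (format == null || format.trim().isEmpty()) {
      // No custom format configured: the legacy iceberg.hourly.partition.enabled path applies.
      return Mode.LEGACY;
    }
    // Assumed default of 0 means "not set"; a positive value wins over iceberg.lookback.days.
    int lookbackHours = Integer.parseInt(props.getProperty("iceberg.lookback.hours", "0").trim());
    return lookbackHours > 0 ? Mode.HOURLY_LOOKBACK : Mode.DAILY_LOOKBACK;
  }
}
```

In this sketch, setting both iceberg.lookback.hours=6 and iceberg.lookback.days=3 alongside a format resolves to the hourly mode, while omitting the format resolves to the legacy mode regardless of the other keys.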

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergSource.java | Adds new configs and refactors partition filter generation to use a formatter + generator, including hourly lookback precedence. |
| gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionFilterGenerator.java | Introduces a config-agnostic utility to generate partition value lists and OR filter expressions for day/hour lookbacks. |
| gobblin-data-management/src/test/java/org/apache/gobblin/data/management/copy/iceberg/IcebergSourceTest.java | Adds tests covering custom format patterns, hour override, hourly lookback behavior, and precedence rules. |
| gobblin-data-management/src/test/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionFilterGeneratorTest.java | Adds unit tests for generator behavior, immutability guarantees, and expression generation. |

debabhishek53 and others added 5 commits March 9, 2026 14:14
…management/copy/iceberg/IcebergSource.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…management/copy/iceberg/IcebergPartitionFilterGeneratorTest.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Member

@Blazer-007 Blazer-007 left a comment

Let's have a proper, consistent format, i.e., if a filter date is provided, it should be in the format of the specified pattern as well.
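
One possible reading of this suggestion is to parse the supplied filter date with the same pattern used to format partition values, so that a mismatched input fails fast. The sketch below is only an illustration of that idea: `FilterDateParsingSketch`, `parseFilterDate`, and the `defaultHour` parameter are hypothetical, and nothing here reflects how the PR actually reads the filter date.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;

// Hypothetical illustration only; not code from the PR or from Gobblin.
final class FilterDateParsingSketch {

  /**
   * Parses filterDate with the configured pattern. Date-only patterns fall back to defaultHour.
   * Throws DateTimeParseException when the text does not match the pattern.
   */
  static LocalDateTime parseFilterDate(String filterDate, String pattern, int defaultHour) {
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);
    // parseBest tries each target type in order and returns the first that resolves.
    TemporalAccessor parsed = formatter.parseBest(filterDate, LocalDateTime::from, LocalDate::from);
    if (parsed instanceof LocalDateTime) {
      return (LocalDateTime) parsed;
    }
    return ((LocalDate) parsed).atTime(defaultHour, 0);
  }
}
```

With this approach, a filter date of 2026-03-09 combined with the pattern yyyy-MM-dd-HH is rejected with a `DateTimeParseException` rather than silently reinterpreted.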


3 participants