[FLINK-38830][runtime] Handle duplicate AddColumnEvent gracefully to prevent pipeline crashes #4279
Conversation
Pull request overview
Fixes Flink CDC pipeline crashes caused by duplicate AddColumnEvents (e.g., emitted by gh-ost/pt-osc during online schema migrations) by filtering redundant column-add operations and making add-column schema application idempotent.
Changes:
- Add `SchemaUtils.filterRedundantAddColumns(...)` to drop already-existing columns and intra-event duplicates (returning an empty `Optional` when the event is fully redundant).
- Make `SchemaUtils` add-column application skip already-existing column names instead of throwing.
- Apply redundant-add filtering in both `PreTransformOperator` and `PostTransformOperator`, and add unit/integration tests for duplicate scenarios.
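The filtering behavior described in the changes above can be sketched in isolation. The snippet below is a simplified stand-in, not the actual Flink CDC API: it uses plain strings for column names instead of `Schema` and `AddColumnEvent.ColumnWithPosition`, and the class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class RedundantAddColumnFilter {

    /**
     * Drops columns that already exist in the current schema as well as
     * duplicates within the event itself. Returns Optional.empty() when
     * every requested column is redundant, so callers can treat the
     * whole event as a no-op.
     */
    static Optional<List<String>> filterRedundantAddColumns(
            Set<String> existingColumns, List<String> columnsToAdd) {
        Set<String> seen = new LinkedHashSet<>(existingColumns);
        List<String> kept = new ArrayList<>();
        for (String column : columnsToAdd) {
            if (seen.add(column)) { // add() is false if already in schema or earlier in this event
                kept.add(column);
            }
        }
        return kept.isEmpty() ? Optional.empty() : Optional.of(kept);
    }

    public static void main(String[] args) {
        Set<String> schema = new LinkedHashSet<>(List.of("id", "name", "age"));
        // "extras" is new; "age" already exists; the second "extras" is an intra-event duplicate.
        System.out.println(filterRedundantAddColumns(schema, List.of("age", "extras", "extras")));
        // A fully redundant event collapses to a no-op.
        System.out.println(filterRedundantAddColumns(schema, List.of("id", "name")));
    }
}
```

Both redundancy cases from the PR (column already in the schema, and column repeated within one event) reduce to a single `Set.add(...)` check per column.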
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| flink-cdc-common/src/main/java/org/apache/flink/cdc/common/utils/SchemaUtils.java | Adds redundant AddColumn filtering utility; makes add-column application idempotent. |
| flink-cdc-runtime/src/main/java/org/apache/flink/cdc/runtime/operators/transform/PreTransformOperator.java | Filters redundant AddColumnEvents before schema application (but introduces an early-return risk). |
| flink-cdc-runtime/src/main/java/org/apache/flink/cdc/runtime/operators/transform/PostTransformOperator.java | Filters redundant AddColumnEvents before applying schema changes in post-transform. |
| flink-cdc-common/src/test/java/org/apache/flink/cdc/common/utils/SchemaUtilsTest.java | Adds unit tests for redundant filtering and idempotent add-column application. |
| flink-cdc-runtime/src/test/java/org/apache/flink/cdc/runtime/operators/transform/TransformOperatorWithSchemaEvolveTest.java | Adds integration tests for duplicate/partial-duplicate add-column events through transform operators. |
Comments suppressed due to low confidence (2)
flink-cdc-runtime/src/test/java/org/apache/flink/cdc/runtime/operators/transform/TransformOperatorWithSchemaEvolveTest.java:928
- This Javadoc says the duplicate event is "sent directly to the post-transform operator", but the test uses `evolveFromSource(...)`, which exercises both operators in the normal pipeline. Either adjust the wording to match what is actually being tested, or extend the harness/test case to inject schema-change events directly into `PostTransformOperator` if that behavior is important to cover.
/**
* This case tests that duplicate AddColumnEvents are handled gracefully by
* PostTransformOperator. When the same column is added twice, the second event should be
* filtered out even when sent directly to the post-transform operator.
*/
@Test
void testDuplicateAddColumnEventPostTransform() throws Exception {
TableId tableId = TableId.tableId("my_company", "my_branch", "data_changes");
TransformWithSchemaEvolveTestCase.of(
tableId,
"*, id + age as computed",
"name <> 'Alice'",
flink-cdc-runtime/src/test/java/org/apache/flink/cdc/runtime/operators/transform/TransformOperatorWithSchemaEvolveTest.java:852
- The PR description lists integration tests named `testDuplicateAddColumnEventsInPreTransform`/`testDuplicateAddColumnEventsInPostTransform`, but the actual test method name here is `testDuplicateAddColumnEventPreTransform`. If the listed names are meant to be accurate for verification automation or documentation, please align either the method names or the description.
@Test
void testDuplicateAddColumnEventPreTransform() throws Exception {
TableId tableId = TableId.tableId("my_company", "my_branch", "data_changes");
TransformWithSchemaEvolveTestCase.of(
tableId,
                                null))))
                .expectNothingInPreTransformed()
                .expectNothingInPostTransformed()
                .runTests("duplicate add extras column should be filtered")
These duplicate-AddColumnEvent tests stop immediately after asserting the redundant event is filtered. To prevent regressions like losing internal transform processors/state, it would be useful to also process at least one subsequent DataChangeEvent after the duplicate schema event and assert the pipeline continues to transform records correctly (i.e., no operator crash / missing schema view).
Suggested change:

                .runTests("duplicate add extras column should be filtered")
                // After the duplicate AddColumnEvent has been filtered, process a subsequent
                // DataChangeEvent to verify that the pipeline continues to transform records
                // correctly (i.e., no operator crash or missing schema view).
                // The concrete DataChangeEvent and its expected transformed output should follow
                // the same construction patterns as other tests in this class.
                // For example (pseudocode, to be aligned with existing helpers in this test):
                // .evolveFromSource(
                //         someInsertOrUpdateDataChangeEventFor(tableId, id, name, age, extras))
                // .expectInPreTransformed(expectedPreTransformEvent)
                // .expectInPostTransformed(expectedPostTransformEvent)
                // .runTests("data change after duplicate add extras column is still processed")
In the new commit, both testDuplicateAddColumnEventPreTransform() and testDuplicateAddColumnEventPostTransform() send a DataChangeEvent after the duplicate AddColumnEvent and assert that it is processed correctly through the transform pipeline.
        // Filter out redundant AddColumnEvent columns that already exist in the schema
        // to handle duplicate events from tools like gh-ost online schema migrations
        if (event instanceof AddColumnEvent) {
            AddColumnEvent addColumnEvent = (AddColumnEvent) event;
            Schema currentSchema = tableChangeInfo.getSourceSchema();
            Optional<AddColumnEvent> filtered =
                    SchemaUtils.filterRedundantAddColumns(currentSchema, addColumnEvent);
            if (!filtered.isPresent()) {
                LOG.debug(
                        "Skipping fully redundant AddColumnEvent for table {} "
                                + "- all columns already exist",
                        tableId);
                return Optional.empty();
            }
            event = filtered.get();
        }
In processEvent, preTransformProcessorMap.remove(tableId) is executed for every SchemaChangeEvent. With the new early-return here, a fully-redundant AddColumnEvent will return Optional.empty() before cachePreTransformProcessor(...) is called, so the processor map entry is never rebuilt. This can cause subsequent DataChangeEvents for the table to fail the processor != null check and crash the pipeline. Ensure the processor is re-cached even when the AddColumnEvent is filtered to a no-op (or avoid removing the processor map entry in the redundant case).
Removed preTransformProcessorMap.remove(tableId) before cacheChangeSchema(). Now the processor stays intact on the filtered path and gets overwritten via cachePreTransformProcessor() on the non-filtered path
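The eviction-ordering bug and its fix can be modeled with a minimal sketch. All names below are illustrative stand-ins, not the actual Flink CDC operator API; the point is only that the per-table processor must not be evicted before we know whether the AddColumnEvent survives filtering.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class ProcessorCacheOrdering {
    final Map<String, String> processorMap = new HashMap<>();
    final Map<String, Set<String>> schemas = new HashMap<>();

    Optional<Set<String>> processAddColumn(String tableId, Set<String> addedColumns) {
        Set<String> schema = schemas.computeIfAbsent(tableId, t -> new HashSet<>());
        // NOTE: no processorMap.remove(tableId) up front; the buggy version
        // evicted the processor unconditionally before this point.
        Set<String> kept = new HashSet<>(addedColumns);
        kept.removeAll(schema);
        if (kept.isEmpty()) {
            // Fully redundant: the early return leaves the cached processor
            // intact, so later DataChangeEvents for this table still find it.
            return Optional.empty();
        }
        schema.addAll(kept);
        // Non-filtered path: rebuild (overwrite) the processor for the new schema.
        processorMap.put(tableId, "processor@" + schema.size() + "cols");
        return Optional.of(kept);
    }

    public static void main(String[] args) {
        ProcessorCacheOrdering op = new ProcessorCacheOrdering();
        op.processAddColumn("t1", Set.of("extras")); // applied, processor cached
        op.processAddColumn("t1", Set.of("extras")); // duplicate, filtered to a no-op
        // The processor survives the filtered event, so a subsequent
        // DataChangeEvent would still find it.
        System.out.println(op.processorMap.containsKey("t1")); // true
    }
}
```

Had the eviction happened before the redundancy check, the second call would have left `processorMap` empty for `t1`, which is exactly the crash scenario the reviewer describes.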
        Set<String> existingColumnNames =
                columns.stream()
                        .map(Column::getName)
                        .collect(Collectors.toCollection(HashSet::new));
        for (AddColumnEvent.ColumnWithPosition columnWithPosition : event.getAddedColumns()) {
            // Skip columns that already exist in the schema to handle duplicate AddColumnEvents
            // (e.g., from gh-ost online schema migrations)
            if (existingColumnNames.contains(columnWithPosition.getAddColumn().getName())) {
                continue;
            }
applyAddColumnEvent now silently skips adding a column when the name already exists. This can mask upstream inconsistencies (e.g., same column name but different type/comment/default), leaving the schema potentially out of sync with the source without any signal. Consider validating that the existing column definition matches the incoming addColumn (and throw or at least log/warn when it differs) so only true duplicates are treated as idempotent.
Added LOG field to SchemaUtils. When skipping a duplicate column, compares DataType of existing vs incoming column and logs at WARN level if they differ
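The resulting skip-with-warning behavior can be sketched as follows. This is a simplified stand-in using plain strings for column types; the real code compares Flink CDC `DataType`s and logs through an SLF4J `LOG` field rather than `System.err`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IdempotentAddColumn {

    /** Adds a column if absent; skips true duplicates silently and warns on type mismatch. */
    static void applyAddColumn(Map<String, String> schema, String name, String type) {
        String existingType = schema.get(name);
        if (existingType != null) {
            if (!existingType.equals(type)) {
                // Mismatched duplicate: keep the existing definition, but surface
                // the inconsistency instead of failing the pipeline.
                System.err.printf(
                        "Skipping duplicate column '%s' but types differ: existing=%s, incoming=%s%n",
                        name, existingType, type);
            }
            return; // duplicate (matching or not): skip, do not throw
        }
        schema.put(name, type);
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>(Map.of("id", "BIGINT"));
        applyAddColumn(schema, "extras", "STRING"); // added
        applyAddColumn(schema, "extras", "STRING"); // silently skipped (true duplicate)
        applyAddColumn(schema, "extras", "INT");    // skipped with a warning
        System.out.println(schema);
    }
}
```

The existing definition always wins; the warning is the only signal when the incoming duplicate disagrees, which matches the discussion further down in this thread.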
…dation, add type-mismatch warnings, extract shared filter helper
 * Shared utility methods for filtering schema change events in both {@link PreTransformOperator}
 * and {@link PostTransformOperator}.
 */
class TransformSchemaChangeUtils {
Is it worth creating a new Utils class for merely such a method? Putting it in SchemaUtils makes more sense.
++. Moved the method to SchemaUtils
     * @return the filtered event, or {@link Optional#empty()} if the event is fully redundant
     */
    static Optional<SchemaChangeEvent> filterDuplicateAddColumns(
            Schema currentSchema, SchemaChangeEvent event, Logger log) {
It's unusual to pass the Logger as an argument.
Avoided this pattern in the new commit
 * after duplicate filtering, with the computed column correctly evaluated.
 */
@Test
void testDuplicateAddColumnEventPreTransform() throws Exception {
Please also add an ITCase or E2e test in flink-cdc-composer.
                    if (!existingColumn
                            .getType()
                            .equals(incomingColumn.getType())) {
What will happen if the existing column definition and the incoming definition are incompatible? Will there be coercions or implicit casting?
No coercions or implicit casting are performed. When a duplicate column is detected by name, the existing column definition is preserved as-is and the incoming duplicate is skipped. If the types differ, a WARN log is emitted: "Skipping duplicate column '{}' for table {} but types differ: existing={}, incoming={}". This is intentional for the gh-ost use case. Duplicate AddColumnEvents from online schema migration tools should have matching types. A type mismatch indicates a potential upstream inconsistency that should be investigated, but we don't want to crash the pipeline over it.
… to SchemaUtils, document type-mismatch behavior
…iltering through pipeline
What is the purpose of the change

Flink CDC pipelines crash with `IllegalArgumentException: Field names must be unique` when duplicate `AddColumnEvent`s are received for the same column. This commonly occurs during online schema migrations with tools like gh-ost, which perform shadow-table operations that can produce duplicate schema change events.

Root cause: Neither `SchemaUtils.applySchemaChange()` nor the transform operators check whether a column already exists before applying an `AddColumnEvent`, causing the schema to contain duplicate field names and triggering the exception in `Schema.Builder`.

Related JIRA: FLINK-38830
Brief change log

This PR implements a two-layer defense strategy:

Proactive filtering (Layer 1): New `SchemaUtils.filterRedundantAddColumns()` utility method that filters out `AddColumnEvent.ColumnWithPosition` entries where the column already exists in the current schema, and also deduplicates columns within the same event. This is called in both `PreTransformOperator` and `PostTransformOperator` before applying schema changes, converting fully-redundant events to no-ops.

Idempotent safety net (Layer 2): `SchemaUtils.applyAddColumnEvent()` is made idempotent: if a column with the same name already exists, the add is silently skipped instead of throwing an exception. This catches any edge cases not handled by Layer 1.

Files changed:
- flink-cdc-common/.../utils/SchemaUtils.java: added `filterRedundantAddColumns()` and made `applyAddColumnEvent()` idempotent
- flink-cdc-runtime/.../operators/transform/PreTransformOperator.java: calls the filter before applying schema changes
- flink-cdc-runtime/.../operators/transform/PostTransformOperator.java: calls the filter before applying schema changes
- flink-cdc-common/.../utils/SchemaUtilsTest.java: 8 new unit tests covering all duplicate scenarios
- flink-cdc-runtime/.../transform/TransformOperatorWithSchemaEvolveTest.java: 2 new integration tests

Verifying this change
This change adds 10 new tests (917 total tests, 0 failures):

8 unit tests in `SchemaUtilsTest`:
- testFilterRedundantAddColumns_noDuplicates
- testFilterRedundantAddColumns_allDuplicates
- testFilterRedundantAddColumns_partialDuplicates
- testFilterRedundantAddColumns_emptyEvent
- testFilterRedundantAddColumns_withPositions
- testFilterRedundantAddColumns_intraEventDuplicates
- testApplyAddColumnEvent_idempotent
- testApplyAddColumnEvent_allDuplicates

2 integration tests in `TransformOperatorWithSchemaEvolveTest`:
- testDuplicateAddColumnEventsInPreTransform
- testDuplicateAddColumnEventsInPostTransform

Does this pull request potentially affect one of the following parts
- `@Public(Evolving)`: no (a `Set.contains()` check per column in `AddColumnEvent`s only)

Documentation