FLINK-38334: Fix MySQL CDC source stuck in INITIAL_ASSIGNING#4278

Merged: lvyanquan merged 1 commit into apache:master from morozov:FLINK-38334 on Mar 5, 2026

morozov (Contributor) commented Feb 15, 2026

When a table is excluded from configuration after a restart from savepoint, the MySQL CDC source could get stuck in the INITIAL_ASSIGNING state. This happened because table exclusion cleanup was only performed when isAssigningFinished() was true, but the assigner couldn't finish because excluded table splits were never reported as finished.

The fix separates two concerns in captureNewlyAddedTables():

  • Adding new tables: should only happen when isAssigningFinished()
  • Removing excluded tables: must happen regardless of assigner status
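
The separation described above can be sketched with a toy model. This is not the code of `MySqlSnapshotSplitAssigner`: the class `ToySnapshotAssigner`, its fields, and both `captureTables...` methods are hypothetical names, and "assigning is finished" is simplified to "no assigned split is still unfinished".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only; all names here are hypothetical, not the real Flink CDC classes. */
class ToySnapshotAssigner {
    // Owning table of every split that was assigned but not yet reported as finished.
    private final List<String> unfinishedSplits = new ArrayList<>();
    private final Set<String> knownTables = new HashSet<>();

    ToySnapshotAssigner(Set<String> restoredTables, List<String> restoredUnfinishedSplits) {
        knownTables.addAll(restoredTables);
        unfinishedSplits.addAll(restoredUnfinishedSplits);
    }

    // Simplifying assumption: assigning is finished once nothing is pending.
    boolean isAssigningFinished() {
        return unfinishedSplits.isEmpty();
    }

    /** Old behavior: cleanup and discovery are both gated on isAssigningFinished(). */
    void captureTablesBeforeFix(Set<String> configuredTables) {
        if (isAssigningFinished()) {
            removeExcluded(configuredTables);
            addNew(configuredTables);
        }
    }

    /** New behavior: cleanup always runs; only discovery stays gated. */
    void captureTablesAfterFix(Set<String> configuredTables) {
        removeExcluded(configuredTables);
        if (isAssigningFinished()) {
            addNew(configuredTables);
        }
    }

    // Drops excluded tables together with their pending splits.
    private void removeExcluded(Set<String> configuredTables) {
        knownTables.retainAll(configuredTables);
        unfinishedSplits.removeIf(table -> !configuredTables.contains(table));
    }

    private void addNew(Set<String> configuredTables) {
        knownTables.addAll(configuredTables);
    }

    Set<String> knownTables() {
        return knownTables;
    }
}
```

In this model, restoring with a pending split for a table that is no longer configured leaves `captureTablesBeforeFix` unable to ever run its cleanup, because the only thing keeping `isAssigningFinished()` false is the split that cleanup would have removed. `captureTablesAfterFix` breaks that cycle by running the cleanup first.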

Added integration test TableExclusionDuringSnapshotIT that reproduces the issue by using a blocking hook to take a savepoint during INITIAL_ASSIGNING phase, then restarting with a table excluded from configuration.
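
The hook used by `TableExclusionDuringSnapshotIT` is not shown in this conversation. As a rough illustration of the general "blocking hook" pattern only, the sketch below pauses a worker thread at a known point so the test thread can act while the worker is still mid-phase. `BlockingHook`, `BlockingHookDemo`, and the `phase` variable are hypothetical; reading `phase` stands in for taking the savepoint.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper: lets a test pause a worker at a chosen point. */
class BlockingHook {
    private final CountDownLatch reached = new CountDownLatch(1);
    private final CountDownLatch released = new CountDownLatch(1);

    /** Called by the worker; signals arrival, then blocks until release(). */
    void onReached() throws InterruptedException {
        reached.countDown();
        released.await();
    }

    /** Called by the test; returns false if the worker did not arrive in time. */
    boolean awaitReached(long timeout, TimeUnit unit) throws InterruptedException {
        return reached.await(timeout, unit);
    }

    /** Called by the test; lets the blocked worker continue. */
    void release() {
        released.countDown();
    }
}

class BlockingHookDemo {
    /** Returns the phase the test thread observed while the worker was blocked. */
    static String run() {
        BlockingHook hook = new BlockingHook();
        AtomicReference<String> phase = new AtomicReference<>("INITIAL_ASSIGNING");
        Thread worker = new Thread(() -> {
            try {
                hook.onReached();        // worker pauses here, still mid-phase
                phase.set("FINISHED");   // only reached after release()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        try {
            if (!hook.awaitReached(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("worker never reached the hook");
            }
            String observed = phase.get(); // stand-in for taking the savepoint
            hook.release();
            worker.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```

Two latches are used so that the test never has to sleep and guess: one tells the test that the worker has arrived, the other holds the worker until the test is done.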

morozov (Contributor, Author) commented Feb 16, 2026

@leonardBang could you approve running the workflow?

morozov (Contributor, Author) commented Feb 17, 2026

The tests are failing due to a Docker API incompatibility issue. It looks like you guys are working on it in #4275.

yuxiqian (Member) commented Mar 3, 2026

Thanks for the patch. Could you please rebase with master as #4275 has been merged?

Gently pinging @lvyanquan, as you've investigated a similar problem.

morozov (Contributor, Author) commented Mar 3, 2026

@yuxiqian, done. Please approve running the workflow.

Copilot AI (Contributor) left a comment

Pull request overview

Fixes a restore-time deadlock in the MySQL CDC snapshot enumerator where excluded tables could leave “assigned but never finished” splits in state, causing the job to remain stuck in INITIAL_ASSIGNING after restarting from a savepoint.

Changes:

  • Adjust MySqlSnapshotSplitAssigner#captureNewlyAddedTables() to always clean up removed/excluded tables, while only adding newly discovered tables once assigning is finished.
  • Add a new integration test (TableExclusionDuringSnapshotIT) plus its DDL to reproduce the issue by taking a savepoint during INITIAL_ASSIGNING and restarting with a table excluded.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `.../src/main/java/.../MySqlSnapshotSplitAssigner.java` | Separates “remove excluded tables” from “add newly added tables” to prevent restore-time deadlock. |
| `.../src/test/java/.../TableExclusionDuringSnapshotIT.java` | New IT that reproduces the stuck INITIAL_ASSIGNING scenario via savepoint + restart with excluded table. |
| `.../src/test/resources/ddl/table_exclusion_snapshot.sql` | DDL for the new IT database/tables. |

lvyanquan (Contributor) left a comment

+1.

lvyanquan merged commit c05af15 into apache:master on Mar 5, 2026
56 of 59 checks passed
morozov deleted the FLINK-38334 branch on March 5, 2026 14:32

morozov (Contributor, Author) commented Mar 5, 2026

@lvyanquan thank you for the review. While you're at it, do you mind taking another look at #4087? It fixes another "source stuck" situation. We've been running this patch in production since August w/o any issues.

ThorneANN pushed a commit to ThorneANN/flink-cdc that referenced this pull request Mar 6, 2026
…pache#4278)
suhwan-cheon pushed a commit to suhwan-cheon/flink-cdc that referenced this pull request Mar 9, 2026
…pache#4278)
ThorneANN pushed a commit to ThorneANN/flink-cdc that referenced this pull request Mar 19, 2026
…pache#4278)