FLINK-38334: Fix MySQL CDC source stuck in INITIAL_ASSIGNING#4278
FLINK-38334: Fix MySQL CDC source stuck in INITIAL_ASSIGNING#4278lvyanquan merged 1 commit intoapache:masterfrom
Conversation
|
@leonardBang could you approve running the workflow? |
|
The tests are failing due a Docker API incompatibility issue. It looks like you guys are working on it in #4275. |
|
Thanks for the patch. Could you please rebase with Gently ping @lvyanquan as you've investigated similar problem. |
When a table is excluded from configuration after a restart from savepoint, the MySQL CDC source could get stuck in the INITIAL_ASSIGNING state. This happened because table exclusion cleanup was only performed when isAssigningFinished() was true, but the assigner couldn't finish because excluded table splits were never reported as finished. The fix separates two concerns in captureNewlyAddedTables(): - Adding new tables: should only happen when isAssigningFinished() - Removing excluded tables: must happen regardless of assigner status Added integration test TableExclusionDuringSnapshotIT that reproduces the issue by using a blocking hook to take a savepoint during INITIAL_ASSIGNING phase, then restarting with a table excluded from configuration.
|
@yuxiqian, done. Please approve running the workflow. |
There was a problem hiding this comment.
Pull request overview
Fixes a restore-time deadlock in the MySQL CDC snapshot enumerator where excluded tables could leave “assigned but never finished” splits in state, causing the job to remain stuck in INITIAL_ASSIGNING after restarting from a savepoint.
Changes:
- Adjust
MySqlSnapshotSplitAssigner#captureNewlyAddedTables()to always clean up removed/excluded tables, while only adding newly discovered tables once assigning is finished. - Add a new integration test (
TableExclusionDuringSnapshotIT) plus its DDL to reproduce the issue by taking a savepoint duringINITIAL_ASSIGNINGand restarting with a table excluded.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
.../src/main/java/.../MySqlSnapshotSplitAssigner.java |
Separates “remove excluded tables” from “add newly added tables” to prevent restore-time deadlock. |
.../src/test/java/.../TableExclusionDuringSnapshotIT.java |
New IT that reproduces the stuck INITIAL_ASSIGNING scenario via savepoint + restart with excluded table. |
.../src/test/resources/ddl/table_exclusion_snapshot.sql |
DDL for the new IT database/tables. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
...c/test/java/org/apache/flink/cdc/connectors/mysql/source/TableExclusionDuringSnapshotIT.java
Show resolved
Hide resolved
...c/test/java/org/apache/flink/cdc/connectors/mysql/source/TableExclusionDuringSnapshotIT.java
Show resolved
Hide resolved
|
@lvyanquan thank you for the review. While you're at it, do you mind taking another look at #4087? It fixes another "source stuck" situation. We've been running this patch in production since August w/o any issues. |
…pache#4278) When a table is excluded from configuration after a restart from savepoint, the MySQL CDC source could get stuck in the INITIAL_ASSIGNING state. This happened because table exclusion cleanup was only performed when isAssigningFinished() was true, but the assigner couldn't finish because excluded table splits were never reported as finished. The fix separates two concerns in captureNewlyAddedTables(): - Adding new tables: should only happen when isAssigningFinished() - Removing excluded tables: must happen regardless of assigner status Added integration test TableExclusionDuringSnapshotIT that reproduces the issue by using a blocking hook to take a savepoint during INITIAL_ASSIGNING phase, then restarting with a table excluded from configuration.
…pache#4278) When a table is excluded from configuration after a restart from savepoint, the MySQL CDC source could get stuck in the INITIAL_ASSIGNING state. This happened because table exclusion cleanup was only performed when isAssigningFinished() was true, but the assigner couldn't finish because excluded table splits were never reported as finished. The fix separates two concerns in captureNewlyAddedTables(): - Adding new tables: should only happen when isAssigningFinished() - Removing excluded tables: must happen regardless of assigner status Added integration test TableExclusionDuringSnapshotIT that reproduces the issue by using a blocking hook to take a savepoint during INITIAL_ASSIGNING phase, then restarting with a table excluded from configuration.
…pache#4278) When a table is excluded from configuration after a restart from savepoint, the MySQL CDC source could get stuck in the INITIAL_ASSIGNING state. This happened because table exclusion cleanup was only performed when isAssigningFinished() was true, but the assigner couldn't finish because excluded table splits were never reported as finished. The fix separates two concerns in captureNewlyAddedTables(): - Adding new tables: should only happen when isAssigningFinished() - Removing excluded tables: must happen regardless of assigner status Added integration test TableExclusionDuringSnapshotIT that reproduces the issue by using a blocking hook to take a savepoint during INITIAL_ASSIGNING phase, then restarting with a table excluded from configuration.
When a table is excluded from configuration after a restart from savepoint, the MySQL CDC source could get stuck in the
INITIAL_ASSIGNINGstate. This happened because table exclusion cleanup was only performed whenisAssigningFinished()wastrue, but the assigner couldn't finish because excluded table splits were never reported as finished.The fix separates two concerns in
captureNewlyAddedTables():isAssigningFinished()Added integration test
TableExclusionDuringSnapshotITthat reproduces the issue by using a blocking hook to take a savepoint duringINITIAL_ASSIGNINGphase, then restarting with a table excluded from configuration.