Restore reminder clearing on DagActionStore DELETE via wildcard match#4193
Open
agam-99 wants to merge 1 commit into
Open
Restore reminder clearing on DagActionStore DELETE via wildcard match#4193agam-99 wants to merge 1 commit into
agam-99 wants to merge 1 commit into
Conversation
GOBBLIN-2090 (apache#3973) added "delete deadline triggers in all hosts" on the DagActionStore DELETE path, relying on the change monitor's broadcast Kafka consumer group (UUID-suffixed group id) so every GaaS instance independently clears its own in-memory Quartz scheduler. GOBBLIN-2016 (apache#3995) then added eventTimeMillis to the Quartz job key to fix a separate same-DagAction multi-attempt collision, and in the same PR commented out the DELETE-side clear with a TODO -- because eventTimeMillis is per-lease-attempt consensus state computed at runtime by MysqlMultiActiveLeaseArbiter and is not carried on the DagActionStoreChangeEvent, so the exact-match unschedule call could no longer construct the right key. Approach: rather than plumb a value that structurally cannot exist on the change event, restore the GOBBLIN-2090 cross-host clearing by prefix-matching on the five DagAction fields shared by every reminder for a given action regardless of eventTimeMillis. The Kafka broadcast still drives cross-instance fan-out; each instance scans its own local RAMJobStore and deletes matching keys via GroupMatcher.jobGroupEquals + JobKey name-prefix filter. Scope matches GOBBLIN-2090: clearing is invoked only for ENFORCE_JOB_START_DEADLINE and ENFORCE_FLOW_FINISH_DEADLINE DELETEs. KILL/RESUME retry reminders continue to no-op via the lease arbiter on fire, as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
DagManagementDagActionStoreChangeMonitor.handleDagActionhas had this TODO since GOBBLIN-2016 / #3995 (Jul 2024):Background. GOBBLIN-2090 / #3973 (Jun 2024, titled literally "delete deadline triggers in all hosts") introduced this clearing path so that when a deadline
DagActionrow is deleted fromdag_action, every GaaS instance independently clears its in-memory Quartz reminder for that action. Cross-host coverage falls out of theDagActionStoreChangeMonitorconsumer group id (DAG_ACTION_CHANGE_MONITOR_PREFIX + UUID.randomUUID()) — every instance is in its own group, so every instance receives every CDC event and runs the same handler against its ownRAMJobStore.What broke. Three weeks later, GOBBLIN-2016 / #3995 added
eventTimeMillisto the Quartz job-key name to disambiguate concurrent lease attempts (KILL/RESUME issued in quick succession, retry reminders firing while the original lease is still valid, etc.):eventTimeMillisis per-lease-attempt consensus state computed byMysqlMultiActiveLeaseArbiter. It is not persisted on thedag_actionrow and not carried onDagActionStoreChangeEvent. Structurally, no plumbing change on the CDC event can recover it at DELETE time — there can be multipleeventTimeMillisvalues per DagAction (one per attempt), so there is no single value the producer could put on the event. The author of #3995 punted with the TODO above and accepted "let them fire and no-op via the lease arbiter."Change. Restore the GOBBLIN-2090 cross-host clearing semantics by prefix-matching on the five DagAction fields (everything the DELETE event has) rather than constructing an exact key:
DagActionReminderScheduler#unscheduleRemindersForDagAction(DagAction, boolean isDeadlineReminder)that:JobKeys in the relevant group viaScheduler.getJobKeys(GroupMatcher.jobGroupEquals(group)).JobKey.getName().startsWith(prefix)where the prefix is the first five segments + trailing.(helpercreateDagActionKeyNamePrefix).Scheduler.deleteJobs(...)on matches in bulk.@VisibleForTestinghelpercreateDagActionKeyNamePrefix(DagAction)that mirrors the first five segments ofcreateDagActionReminderKeyand ensures the trailing separator so prefix matches cannot span unrelated keys.DagManagementDagActionStoreChangeMonitor.handleDagAction("DELETE", ...)is uncommented and calls the new method twice — once forDeadlineReminderKeyGroup, once forRetryReminderKeyGroup. Scope is unchanged from GOBBLIN-2090: onlyENFORCE_JOB_START_DEADLINEandENFORCE_FLOW_FINISH_DEADLINEDELETEs trigger the clear. KILL/RESUME retry reminders continue to no-op via the lease arbiter on fire, matching prior behavior.Why prefix-match is correct here.
createDagActionReminderKeyalready notes: "Applicable only for KILL and RESUME actions; duplication for other actions is an error." For deadline actions the prefix-match set size is ≤1 in normal operation, so the wildcard is semantically equivalent to exact-match for the cases we clear.eventTimeMillisvalues ended up in the local scheduler.Edge cases considered.
NoLongerLeasingStatus. Strictly no worse than the status quo "let them fire" behavior.flowGroup/flowName— would create ambiguity between e.g.flowGroup="a.b", flowName="c"andflowGroup="a", flowName="b.c". This ambiguity is pre-existing in the key construction (same risk for the exact-matchunscheduleReminderJob) and is out of scope for this PR; in practice these identifiers are alphanumeric.Tests
Four new TestNG cases in
DagActionReminderSchedulerTest:testCreateDagActionKeyNamePrefixSharedAcrossEventTimes— asserts the prefix is shared by everycreateDagActionReminderKeyoutput for the same DagAction across distincteventTimeMillisvalues, and that the prefix ends in.so it cannot match unrelated keys.testUnscheduleRemindersForDagActionClearsAllEventTimes— schedules two deadline reminders for the same DagAction at distincteventTimeMillis(simulating multiple lease attempts), invokes the wildcard clear, asserts both keys are gone and the returned count is 2.testUnscheduleRemindersForDagActionScopedToGroup— schedules a deadline-group and a retry-group reminder; asserts clearing one group does not touch the other.testUnscheduleRemindersForDagActionNoopWhenNoneScheduled— asserts the wildcard clear returns 0 (and does not throw) when nothing is scheduled for the DagAction.Existing
testRemindersForMultipleFlowExecutionscontinues to cover the exact-matchunscheduleReminderJobpath, which is unchanged.Test update in
DagManagementDagActionStoreChangeMonitorTest.testProcessMessageWithDelete: the prior assertion block was inside the TODO/* ... */and could never trigger. It now verifies the DELETE handler invokesunscheduleRemindersForDagAction(dagAction, true)andunscheduleRemindersForDagAction(dagAction, false)exactly once each for a deadline-type DELETE.All affected unit tests pass locally:
Commits