Skip to content

fix(scheduler): guard grouped resume reapply#199

Draft
aparajon wants to merge 19 commits into
mainfrom
armand/grouped-resume-final-check
Draft

fix(scheduler): guard grouped resume reapply#199
aparajon wants to merge 19 commits into
mainfrom
armand/grouped-resume-final-check

Conversation

@aparajon
Copy link
Copy Markdown
Collaborator

@aparajon aparajon commented May 30, 2026

Why

Grouped resume can be retried after the target schema has already advanced, and remote stop→start can race with stale stop reconciliation. SchemaBot should verify there is still work left before reapplying and keep an accepted start request from being undone by an older stopped remote progress report.

What

Grouped resume final-check

  • Re-plan grouped/atomic resume immediately before engine apply
  • Complete the apply without reapplying when the final schema check has no remaining changes
  • Complete resolved start/stop control requests on the no-reapply resume path
flowchart LR
  A[Grouped resume] --> B[Final schema check]
  B --> C{Changes remain?}
  C -- No --> D[Complete apply without reapplying]
  C -- Yes --> E[Call engine Apply]
Loading

gRPC stop/start reconciliation

  • Allow /start to proceed when a pending remote stop request has already reached stopped remotely
  • Make local apply cancellation generation-owned so a late stop cannot cancel or clear a newer resume
  • Let a scheduler-owned gRPC start keep the accepted start request pending while it tolerates brief previous-stopped remote progress, bounded by a grace window
  • Strengthen the stop/start e2e with enough rows to reliably observe in-flight work

/start is a separate operator/caller request after the stop path has recorded stopped state; SchemaBot does not automatically start after completing a stop request.

Common setup: an operator requests /stop; SchemaBot records the stop request, the scheduler stops remote Tern, and stopped progress is stored. /start is a later, separate operator request.

Start succeeds

Step Actor What happens User-visible result
1 Operator / CLI Requests /start CLI receives start request accepted
2 API Records a durable start request Start work can be recovered by the scheduler
3 Scheduler Calls remote Tern Start Remote Tern acknowledges the request
4 Scheduler Polls remote progress Remote reports running or completed progress
5 Storage / observer Stores progress and completes the start request CLI watch / PR comment / check show running or completed

End state: apply is running or completed; the start request is completed.

Start RPC fails or times out

Step Actor What happens User-visible result
1 Operator / CLI Requests /start CLI receives start request accepted
2 API Records a durable start request Start work can be recovered by the scheduler
3 Scheduler Calls remote Tern Start Remote call errors or times out
4 Scheduler Stores stopped progress, warning apply log, and error message CLI watch / PR comment show stopped with the error
5 Storage Marks the start request failed Operator can issue a new /start after investigating

End state: apply remains stopped; the error is visible; the start request is failed, so the scheduler does not retry forever.

Start accepted, but remote still reports stopped

Step Actor What happens User-visible result
1 Operator / CLI Requests /start CLI receives start request accepted
2 API Records a durable start request Start work can be recovered by the scheduler
3 Scheduler Calls remote Tern Start Remote Tern acknowledges the request
4 Scheduler Polls remote progress during the bounded grace window Previous stopped state does not cancel the start request
5a Remote Tern Reports running or completed before grace expires Storage adopts progress and completes the start request
5b Remote Tern Still reports stopped when grace expires Storage records stopped progress, warning apply log, and error message

End state: either running/completed with the start request completed, or stopped with a visible error and the start request failed, so the scheduler does not retry forever.

Risk Assessment

Medium — scoped to grouped resume dispatch and immediate remote stop→start reconciliation. The unchanged path still calls the engine when the re-plan finds active table changes, stop priority remains intact, and start only proceeds after stored or remote state proves the stop has resolved.

Generated with Amp

Copilot AI review requested due to automatic review settings May 30, 2026 15:55
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR guards grouped resume so SchemaBot re-plans immediately before reapplying and completes the apply without invoking the engine when no schema changes remain.

Changes:

  • Adds a final grouped-resume schema check before engine apply.
  • Short-circuits grouped resume to mark the apply completed when all tasks are already done.
  • Extends local tests and fake engine behavior to cover the no-remaining-work path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
pkg/tern/local_control_resume.go Adds final grouped resume re-plan, engine drain, and completed short-circuit path.
pkg/tern/local_client_test.go Updates fake control engine and adds coverage for the new short-circuit behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/tern/local_control_resume.go
Comment thread pkg/tern/local_control_resume.go
@aparajon aparajon force-pushed the armand/grouped-resume-final-check branch from ecc622a to 9516df2 Compare May 30, 2026 16:12
@aparajon aparajon changed the title Guard grouped resume reapply fix(scheduler): guard grouped resume reapply May 30, 2026
@aparajon aparajon force-pushed the armand/grouped-resume-final-check branch from 9516df2 to a775594 Compare May 30, 2026 16:22
@aparajon aparajon force-pushed the armand/grouped-resume-final-check branch from a775594 to b47f8a8 Compare May 30, 2026 16:58
aparajon and others added 18 commits May 30, 2026 13:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants