
fix workflow system so scheduled triggers run on latest code#2706

Open
shagun-singh-inkeep wants to merge 21 commits into main from feat/manage-table-cron-dispatcher-v2

Conversation

@shagun-singh-inkeep
Collaborator

No description provided.

@changeset-bot

changeset-bot bot commented Mar 16, 2026

🦋 Changeset detected

Latest commit: feb29ce

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 10 packages
Name | Type
@inkeep/agents-core | Patch
@inkeep/agents-api | Patch
@inkeep/agents-manage-ui | Patch
@inkeep/agents-cli | Patch
@inkeep/agents-sdk | Patch
@inkeep/agents-work-apps | Patch
@inkeep/ai-sdk-provider | Patch
@inkeep/create-agents | Patch
@inkeep/agents-email | Patch
@inkeep/agents-mcp | Patch


@vercel

vercel bot commented Mar 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agents-api Ready Ready Preview, Comment Mar 16, 2026 8:00pm
agents-docs Ready Ready Preview, Comment Mar 16, 2026 8:00pm
agents-manage-ui Ready Ready Preview, Comment Mar 16, 2026 8:00pm


@pullfrog
Contributor

pullfrog bot commented Mar 16, 2026

TL;DR — Replaces the per-trigger daisy-chaining workflow architecture with a single centralized scheduler workflow that polls every 60 seconds, dispatches due triggers as independent one-shot workflows, and stores next_run_at directly on the scheduled_triggers manage table. A post-deploy CI step restarts the scheduler on the latest Vercel deployment so scheduled triggers always run current code.

Key changes

  • schedulerWorkflow — New long-lived singleton workflow that ticks every 60 s, queries all projects for due triggers, and dispatches one-shot execution workflows.
  • triggerDispatcher — New centralized dispatcher that scans next_run_at across all project branches, atomically advances the timestamp (with rollback on failure), and starts scheduledTriggerRunnerWorkflow per trigger.
  • next_run_at on scheduled_triggers — New column on the manage schema that replaces the old scheduled_workflows lookup; routes compute it on create/update.
  • scheduler_state table — New runtime singleton table tracking the current scheduler workflow run ID and deployment ID for supersession detection.
  • scheduledTriggerRunnerWorkflow simplification — Reduced from a daisy-chaining loop (sleep → execute → chain) to a stateless one-shot (check → create invocation → execute with retries → done).
  • restartScheduler deploy hook — New POST /api/deploy/restart-scheduler endpoint called by CI after Vercel deploy to start a fresh scheduler workflow on the new deployment.
  • ScheduledTriggerService gutted — Removed onTriggerCreated, onTriggerDeleted, startScheduledTriggerWorkflow, restartScheduledTriggerWorkflow, and signalStopScheduledTriggerWorkflow; only onTriggerUpdated (for invocation cancellation on re-enable/reschedule) survives.
  • Route handlers — agentFull, projectFull, and scheduledTriggers routes no longer call onTriggerCreated/onTriggerDeleted; create/update routes set nextRunAt inline via computeNextRunAt.
  • Data reconciliation — check function simplified from workflow-status verification to a nextRunAt-presence check on enabled triggers; onCreated/onDeleted handlers removed.

Summary | 34 files | 9 commits | base: main ← feat/manage-table-cron-dispatcher-v2


Centralized scheduler workflow and trigger dispatcher

Before: Each scheduled trigger got its own long-lived workflow that slept until the next cron tick, executed the agent, then daisy-chained to a new workflow for the next iteration. Workflow liveness was tracked in a scheduled_workflows manage table, with adoption and supersession logic.
After: A single schedulerWorkflow runs as a global singleton, ticks every 60 seconds, and delegates to dispatchDueTriggers() which queries next_run_at <= now() across all project branches and starts independent one-shot scheduledTriggerRunnerWorkflow instances.

The dispatcher atomically advances next_run_at before starting the workflow. If the workflow start fails, it rolls back the timestamp. One-time triggers (runAt without cronExpression) set enabled = false and next_run_at = null after dispatch.
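The advance-then-rollback flow described above can be sketched as follows. This is an illustrative sketch only — Trigger, Deps, and dispatchTrigger are assumed names, not the PR's actual identifiers:

```typescript
interface Trigger {
  id: string;
  cronExpression: string | null;
  runAt: string | null;     // set for one-time triggers
  nextRunAt: string | null;
  enabled: boolean;
}

interface Deps {
  advance: (id: string, nextRunAt: string | null, enabled?: boolean) => Promise<void>;
  rollback: (id: string, prevNextRunAt: string | null, prevEnabled: boolean) => Promise<void>;
  startRunner: (id: string, scheduledFor: string) => Promise<void>;
  computeNext: (t: Trigger) => string | null;
}

async function dispatchTrigger(t: Trigger, deps: Deps): Promise<boolean> {
  const prev = t.nextRunAt;
  const isOneTime = t.runAt !== null && t.cronExpression === null;
  const next = isOneTime ? null : deps.computeNext(t);
  // Advance first so a concurrent tick won't re-dispatch the same slot;
  // one-time triggers are disabled as part of the same write.
  await deps.advance(t.id, next, isOneTime ? false : undefined);
  try {
    await deps.startRunner(t.id, prev ?? new Date().toISOString());
    return true;
  } catch {
    // Workflow start failed: restore the timestamp so the next tick retries.
    await deps.rollback(t.id, prev, t.enabled);
    return false;
  }
}
```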

How does supersession work on deploy? The scheduler registers its workflow run ID in a `scheduler_state` singleton row. On each tick it checks whether it's still the active scheduler. After a Vercel deploy, CI calls `POST /api/deploy/restart-scheduler` which starts a new scheduler workflow and updates `scheduler_state` — the old scheduler sees the mismatch on its next tick and exits gracefully.
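A minimal sketch of the per-tick supersession check, assuming a `scheduler_state` row shaped like the description above (the SchedulerState interface and isSuperseded name are illustrative):

```typescript
interface SchedulerState {
  currentRunId: string;
  deploymentId: string;
}

// Each tick, the running scheduler compares its own run ID against the
// singleton row. A mismatch means a newer scheduler (e.g. started by the
// post-deploy CI hook) has taken over, so this one should exit gracefully.
function isSuperseded(myRunId: string, state: SchedulerState | null): boolean {
  // A missing row is treated as "not superseded": the scheduler simply
  // re-registers itself on its next tick.
  if (!state) return false;
  return state.currentRunId !== myRunId;
}
```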

schedulerWorkflow.ts · triggerDispatcher.ts · SchedulerService.ts · schedulerSteps.ts


next_run_at column and manage-table driven scheduling

Before: Scheduling state lived in a separate scheduled_workflows table linking triggers to workflow run IDs. The workflow itself computed the next execution time internally via calculateNextExecutionStep and slept until then.
After: next_run_at is a first-class timestamptz column on scheduled_triggers. Route handlers compute it on create/update using computeNextRunAt(), and the dispatcher advances it after each dispatch.

The computeNextRunAt utility handles both cron and one-time triggers, accepting an optional lastScheduledFor to base the next tick relative to the previous scheduled time rather than wall-clock time.
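A simplified sketch of that contract. The real utility presumably delegates to a cron library, so the cron step is injected here to keep the one-time vs. cron branching and the lastScheduledFor base visible; all names and signatures are illustrative, not the PR's actual API:

```typescript
interface ScheduleInput {
  cronExpression?: string | null;
  runAt?: string | null;      // set for one-time triggers
  lastScheduledFor?: string;  // base the next tick off the prior slot
}

function computeNextRunAt(
  input: ScheduleInput,
  nextCronOccurrence: (expr: string, after: Date) => Date
): string | null {
  if (input.runAt && !input.cronExpression) {
    // One-time: fires once at runAt; null once that moment has passed.
    return new Date(input.runAt) > new Date() ? input.runAt : null;
  }
  if (input.cronExpression) {
    // Basing the next occurrence on lastScheduledFor rather than wall-clock
    // time prevents drift when the dispatcher runs late.
    const base = input.lastScheduledFor ? new Date(input.lastScheduledFor) : new Date();
    return nextCronOccurrence(input.cronExpression, base).toISOString();
  }
  return null; // neither cron nor one-time: nothing to schedule
}
```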

Migration | DB | Change
0013_lumpy_apocalypse.sql | manage (Doltgres) | ALTER TABLE scheduled_triggers ADD COLUMN next_run_at timestamptz
0023_broad_sharon_ventura.sql | runtime (Postgres) | CREATE TABLE scheduler_state (singleton)

manage-schema.ts · computeNextRunAt.ts · scheduledTriggers.ts (DAL)


One-shot scheduledTriggerRunnerWorkflow

Before: The runner was a long-lived daisy-chaining workflow: check trigger → compute next time → create invocation → sleep → post-sleep re-check → execute → chain to next iteration. It handled adoption, supersession, and cancellation mid-sleep.
After: The runner is a stateless one-shot: check trigger enabled → create idempotent invocation → retry loop with cancellation checks → mark completed or failed → exit.

All scheduling concerns (when to run, what's due) are now the dispatcher's responsibility. The runner only needs a TriggerPayload with scheduledFor and focuses purely on execution.
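The one-shot flow can be sketched roughly as below. The signatures are assumptions; only the step order and the sched_{triggerId}_{scheduledFor} idempotency key come from this PR:

```typescript
interface RunnerDeps {
  isEnabled: (triggerId: string) => Promise<boolean>;
  createInvocation: (key: string) => Promise<void>; // idempotent upsert
  execute: () => Promise<void>;
  markCompleted: () => Promise<void>;
  markFailed: (err: unknown) => Promise<void>;
}

async function runOnce(
  triggerId: string,
  scheduledFor: string,
  deps: RunnerDeps,
  maxAttempts = 3
): Promise<'skipped' | 'completed' | 'failed'> {
  if (!(await deps.isEnabled(triggerId))) return 'skipped';
  // The key ties the invocation to one trigger + one slot, so a
  // re-dispatched run for the same slot cannot execute twice.
  await deps.createInvocation(`sched_${triggerId}_${scheduledFor}`);
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await deps.execute();
      await deps.markCompleted();
      return 'completed';
    } catch (err) {
      if (attempt === maxAttempts) {
        await deps.markFailed(err);
        return 'failed';
      }
    }
  }
  return 'failed'; // unreachable; satisfies the type checker
}
```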

scheduledTriggerRunner.ts · scheduledTriggerSteps.ts


Deploy restart endpoint and CI integration

Before: No mechanism to move scheduled trigger workflows to a new deployment; they continued running on the old instance until they naturally chained.
After: POST /api/deploy/restart-scheduler starts a fresh scheduler workflow, superseding the old one. The Vercel production workflow calls this after promote + deploy.

The endpoint uses constant-time comparison of INKEEP_AGENTS_RUN_API_BYPASS_SECRET for auth and is registered with noAuth() / security: [] in the OpenAPI spec.

restartScheduler.ts · vercel-production.yml

Pullfrog | View workflow run | Using Claude Code | Triggered by Pullfrog | pullfrog.com

Contributor

@pullfrog pullfrog bot left a comment


Significant architectural improvement — moving from per-trigger daisy-chaining workflows to a centralized scheduler with next_run_at on the manage table is a cleaner model. The supersession mechanism and one-shot runner design are well thought out.

There are a few issues to address before merging, roughly in priority order:

  1. Reconciliation check is broken — listEnabledScheduledTriggers only selects { id, name }, so (t as any).nextRunAt is always undefined and every enabled trigger will be flagged as missing.
  2. agentFull and projectFull create paths don't set nextRunAt — triggers created through these bulk routes will sit dormant until reconciliation detects them.
  3. Crash between advance-and-dispatch loses one-time triggers — if the process dies after advanceScheduledTriggerNextRunAt commits but before start(workflow) executes, one-time triggers are permanently disabled with no execution.
  4. as any casts — nextRunAt is omitted from ScheduledTriggerInsertSchema but passed at the call site, causing multiple as any casts. Clean fix: make nextRunAt an accepted (optional) field in the insert schema.
  5. Security nits on the deploy endpoint — timing-safe comparison leaks secret length; error responses expose err.message.



const missingWorkflows = enabledTriggers
.filter((t) => !workflowsByTriggerId.has(t.id))
.filter((t) => !(t as any).nextRunAt)
Contributor


Bug: listEnabledScheduledTriggers (in audit-queries.ts) only selects { id, name }, so nextRunAt is never present on the returned objects. This means !(t as any).nextRunAt is always true, and every enabled trigger will be reported as missing.

Fix: add nextRunAt: scheduledTriggers.nextRunAt to the select() in listEnabledScheduledTriggers, then remove this as any cast.

Comment on lines +27 to +30
orphanedWorkflows: [],
staleWorkflows: [],
deadWorkflows: [],
verificationFailures: [],
Contributor


These four fields (orphanedWorkflows, staleWorkflows, deadWorkflows, verificationFailures) are now hardcoded empty arrays. Consider updating ScheduledTriggerAuditResult to remove or mark them optional — returning dead-letter fields that can never be populated adds noise.

Comment on lines +397 to +398
nextRunAt,
} as any);
Contributor


The as any cast is needed because nextRunAt is omitted from ScheduledTriggerInsertSchemaBase. Since the create route now always computes and passes nextRunAt, the insert type should accept it.

Fix: remove nextRunAt from the .omit() in ScheduledTriggerInsertSchemaBase (or add it as .optional()) so the DAL function accepts it without a cast.

Comment on lines +565 to +582
const mergedEnabled = body.enabled !== undefined ? body.enabled : existing.enabled;
const enabledChanged = body.enabled !== undefined && body.enabled !== existing.enabled;

let nextRunAt: string | null | undefined;
if (!mergedEnabled) {
nextRunAt = null;
} else if (scheduleChanged || enabledChanged) {
const mergedCron =
body.cronExpression !== undefined ? body.cronExpression : existing.cronExpression;
const mergedTimezone =
body.cronTimezone !== undefined ? body.cronTimezone : existing.cronTimezone;
const mergedRunAt = body.runAt !== undefined ? body.runAt : existing.runAt;
nextRunAt = computeNextRunAt({
cronExpression: mergedCron,
cronTimezone: mergedTimezone,
runAt: mergedRunAt,
});
}
Contributor


nextRunAt is only recomputed when scheduleChanged || enabledChanged. If only payload or messageTemplate changes, nextRunAt is left unchanged — that's correct.

However, the enabled → disabled transition sets nextRunAt = null here, but onTriggerUpdated no longer cancels pending invocations for that case (the old Case 2 was removed from ScheduledTriggerService.ts). Already-queued pending or running invocations will continue executing even though the user disabled the trigger. Consider adding cancelPendingInvocationsForTrigger to the enabled→disabled path, either here or in onTriggerUpdated.
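A sketch of the suggested enabled→disabled handling. cancelPendingInvocationsForTrigger is the name proposed above; the wrapper function and its signature are hypothetical:

```typescript
// On an enabled → disabled transition, clearing next_run_at only stops
// future dispatches; already-queued invocations must be cancelled too.
async function onDisableTransition(
  triggerId: string,
  wasEnabled: boolean,
  nowEnabled: boolean,
  cancelPendingInvocationsForTrigger: (id: string) => Promise<number>
): Promise<number> {
  if (wasEnabled && !nowEnabled) {
    // Returns the number of pending/running invocations cancelled.
    return cancelPendingInvocationsForTrigger(triggerId);
  }
  return 0; // no transition: nothing to cancel
}
```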

Comment on lines +78 to +89
await withRef(
manageDbPool,
resolvedRef,
(db) =>
advanceScheduledTriggerNextRunAt(db)({
scopes: { tenantId, projectId, agentId },
scheduledTriggerId,
nextRunAt,
enabled: isOneTime ? false : undefined,
}),
{ commit: true, commitMessage: `Advance next_run_at for trigger ${scheduledTriggerId}` }
);
Contributor


Risk: crash-window between advance and dispatch. If the process dies after this advanceScheduledTriggerNextRunAt commit but before start(scheduledTriggerRunnerWorkflow) on line 100, nextRunAt is already advanced but no workflow was started. The rollback on line 102 only runs if start() throws, not on a process crash.

  • Cron triggers: miss one execution (next tick computes a new nextRunAt) — acceptable.
  • One-time triggers: permanently disabled (enabled=false, nextRunAt=null) with no execution — data loss.

Consider reversing the order: start the workflow first (idempotent via createInvocationIdempotentStep), then advance nextRunAt. If advance fails, the trigger is picked up again next tick; the idempotency key prevents double-execution.

Comment on lines +18 to +21
await upsertSchedulerState(runDbClient)({
currentRunId: run.runId,
deploymentId: getDeploymentId(),
});
Contributor


Minor race: this upsertSchedulerState call is redundant with registerSchedulerStep inside the workflow. Both write currentRunId to the same singleton row. If the deploy endpoint is called twice in quick succession (e.g. CI retry), you can get:

  1. Call 1: start() → run C
  2. Call 2: start() → run D
  3. Call 1: upsertSchedulerState(C)
  4. Call 2: upsertSchedulerState(D)
  5. Workflow C: registerSchedulerStep(C) — overwrites D

Now both C and D think they're current. Consider removing this outer upsertSchedulerState and relying solely on registerSchedulerStep inside the workflow.

Comment on lines +11 to +15
function constantTimeEqual(a: string, b: string): boolean {
const bufA = Buffer.from(a);
const bufB = Buffer.from(b);
if (bufA.length !== bufB.length) return false;
return timingSafeEqual(bufA, bufB);
Contributor


The early return false on length mismatch exits faster than timingSafeEqual, letting an attacker binary-search the secret's byte length via response timing. Low-severity in practice (network jitter dominates), but cheap to fix:

import { createHash, timingSafeEqual } from 'node:crypto';

function constantTimeEqual(a: string, b: string): boolean {
  const hash = (s: string) => createHash('sha256').update(s).digest();
  return timingSafeEqual(hash(a), hash(b));
}

{ error: err instanceof Error ? err.message : String(err) },
'Failed to restart scheduler workflow'
);
return c.json({ error: err instanceof Error ? err.message : 'Internal error' }, 500);
Contributor


err.message is returned to the caller, potentially leaking internal details (DB connection strings, file paths). Since the caller is CI, return a generic message and rely on the server-side log (line 53) for debugging:

return c.json({ error: 'Failed to restart scheduler workflow' }, 500);

Comment on lines +228 to +232
const rows = await db.execute(
sql`SELECT id, tenant_id, project_id, agent_id,
cron_expression, cron_timezone, run_at,
next_run_at, enabled
FROM scheduled_triggers AS OF ${sql.raw(`'${branchName}'`)}
Contributor


branchName is interpolated unescaped into the query via sql.raw. It's constructed from tenantId/projectId via getProjectScopedRef (simple concatenation, no sanitization). The values come from the runtime DB so they're trusted internal data — not a regression since this pattern exists elsewhere in the dolt module — but worth hardening. Consider a shared helper that validates branch names for AS OF clauses (e.g., reject values containing ').
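One possible shape for such a helper (hypothetical — not in the PR), which rejects rather than escapes suspicious branch names before they reach an AS OF clause:

```typescript
// Dolt branch names here are tenant/project-derived, so a conservative
// allow-list is enough: alphanumerics, underscore, hyphen, and slash.
// Quotes, backslashes, and whitespace are rejected outright instead of
// being escaped into the quoted AS OF literal.
function assertSafeBranchName(branch: string): string {
  if (!/^[A-Za-z0-9_\/-]+$/.test(branch)) {
    throw new Error(`Unsafe branch name for AS OF clause: ${JSON.stringify(branch)}`);
  }
  return branch;
}
```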

}): Promise<DueScheduledTrigger[]> => {
const allDue: DueScheduledTrigger[] = [];

for (const project of params.projects) {
Contributor


This loops over every project one-at-a-time with a separate SQL query per project. Fine for small deployments, but could become a bottleneck at scale (N round-trips to Doltgres). Consider adding a log/metric for the iteration count so you can detect when this becomes slow.


@pullfrog
Contributor

pullfrog bot commented Mar 16, 2026

TL;DR — Replaces the per-trigger daisy-chaining workflow architecture with a single centralized scheduler workflow that ticks every 60 seconds, queries a new next_run_at column on the manage scheduled_triggers table, and dispatches one-shot runner workflows for each due trigger. A post-deploy CI step restarts the scheduler on the latest deployment so scheduled triggers always execute against current code.

Key changes

  • schedulerWorkflow — New long-lived workflow that sleeps until the next minute, queries all due triggers, and dispatches independent runner workflows. Self-supersedes via a scheduler_state singleton row.
  • triggerDispatcher — New service that queries next_run_at <= now() across all project branches using DoltgreSQL AS OF syntax, starts a one-shot runner for each, and advances next_run_at (or disables one-time triggers).
  • scheduledTriggerRunner — Rewritten from a daisy-chaining multi-iteration workflow to a simple one-shot executor: check enabled → create invocation → retry loop → done. Removes startNextIterationStep, sleep-until-scheduled, and parent/child adoption logic.
  • ScheduledTriggerService — Gutted from ~340 → ~70 lines. Removes startScheduledTriggerWorkflow, signalStopScheduledTriggerWorkflow, restartScheduledTriggerWorkflow, onTriggerCreated, and onTriggerDeleted. Only onTriggerUpdated survives (for cancelling stale invocations).
  • next_run_at column — Added to scheduled_triggers in manage schema. Computed on create/upsert via computeNextRunAt() and advanced by the dispatcher after each dispatch.
  • scheduler_state table — New runtime singleton table tracking the active scheduler's current_run_id and deployment_id for supersession detection.
  • /api/deploy/restart-scheduler — New endpoint authenticated via INKEEP_AGENTS_RUN_API_BYPASS_SECRET. Called by a new CI job in vercel-production.yml after deploy to restart the scheduler on the fresh instance.
  • computeNextRunAt() — Extracted to agents-core as a shared pure function (cron-parser based), replacing the workflow step that did the same inline.
  • Data reconciliation — scheduled-triggers handler simplified: removes onCreated/onDeleted hooks and workflow-based audit checks, replaces them with a next_run_at-based missing-workflow check.

Summary | 41 files | 16 commits | base: main ← feat/manage-table-cron-dispatcher-v2


Centralized schedulerWorkflow replaces per-trigger daisy-chaining

Before: Each scheduled trigger had its own long-lived workflow that slept until the next execution time, ran the agent, then daisy-chained to a fresh workflow for the next iteration. Trigger lifecycle events (create/update/delete) directly started/stopped these per-trigger workflows.
After: A single schedulerWorkflow runs an infinite loop — sleep until next minute → query all due triggers → dispatch independent one-shot runners. Lifecycle orchestration moves to the data layer (next_run_at column), and the workflow system only handles execution.

The old model was fragile: if a daisy-chain broke (crash between starting the child and updating the DB), the trigger became orphaned. The new model is self-healing — the scheduler re-queries every 60 seconds, so missed triggers are picked up on the next tick.

How does supersession work across deploys? The scheduler registers its run ID in a scheduler_state singleton row on startup. Each tick checks scheduler_state.current_run_id === myRunId. When a new deployment starts a new scheduler (via the /api/deploy/restart-scheduler CI hook), it overwrites current_run_id, causing the old scheduler to return { status: 'superseded' } on its next tick.

schedulerWorkflow.ts · schedulerSteps.ts · SchedulerService.ts · schedulerState.ts


triggerDispatcher — cross-project due-trigger query and dispatch

Before: Each workflow independently computed its own next execution time and slept until then. No centralized view of what's due.
After: dispatchDueTriggers() queries all projects' scheduled_triggers table (via DoltgreSQL AS OF per branch) for rows where enabled = true AND next_run_at <= now(), then dispatches runners in parallel with Promise.allSettled.

After dispatching, the dispatcher advances next_run_at to the next cron occurrence (or sets it to null and disables one-time triggers). If the workflow start succeeds but next_run_at advance fails, the next tick will retry — the runner is idempotent via the sched_{triggerId}_{scheduledFor} key.
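The dedupe guarantee can be illustrated with the key format from the PR and a toy in-memory store standing in for the real invocations table (the store and helper name are assumptions):

```typescript
// Key format from the PR: one key per (trigger, slot) pair.
function invocationKey(triggerId: string, scheduledFor: string): string {
  return `sched_${triggerId}_${scheduledFor}`;
}

class InvocationStore {
  private seen = new Set<string>();

  // Returns true only on first creation; a re-dispatched tick for the
  // same (trigger, slot) pair is a no-op instead of a duplicate run.
  createIfAbsent(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```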

triggerDispatcher.ts · triggerDispatcher.test.ts · scheduledTriggers.ts (DAL)


scheduledTriggerRunner — one-shot executor

Before: 160-line workflow with sleep-until-scheduled, pre/post-sleep enabled checks, daisy-chain logic, parent run ID adoption, and cron next-time computation.
After: ~50-line workflow: check enabled → create idempotent invocation → retry loop → mark completed/failed. No sleeping, no chaining, no adoption. TriggerPayload now includes scheduledFor directly (computed by the dispatcher).

scheduledTriggerRunner.ts · scheduledTriggerSteps.ts


Schema: next_run_at column and scheduler_state table

Before: Scheduling state lived in scheduled_workflows rows (one per trigger) with workflow_run_id tracking the active daisy-chain.
After: Scheduling state is a next_run_at timestamp directly on scheduled_triggers. A new scheduler_state singleton table in the runtime DB tracks which scheduler workflow is active.

Migration | Target | Change
0013_lumpy_apocalypse.sql | manage | ALTER TABLE scheduled_triggers ADD COLUMN next_run_at timestamptz
0023_broad_sharon_ventura.sql | runtime | CREATE TABLE scheduler_state (singleton with current_run_id, deployment_id)

manage-schema.ts · runtime-schema.ts · compute-next-run-at.ts


Post-deploy restart endpoint and CI integration

Before: No mechanism to move scheduled triggers onto a new deployment. Old workflows continued running on stale instances until they naturally terminated.
After: POST /api/deploy/restart-scheduler (authenticated via INKEEP_AGENTS_RUN_API_BYPASS_SECRET) starts a new scheduler workflow, superseding the old one. A new restart-scheduler CI job in vercel-production.yml calls this endpoint after deploy.

restartScheduler.ts · vercel-production.yml


ScheduledTriggerService and reconciliation simplification

Before: ScheduledTriggerService (~340 lines) managed full lifecycle — start/stop/restart workflows on create/update/delete. The reconciliation handler checked for orphaned/stale/dead workflows by cross-referencing workflow run statuses.
After: Service reduced to ~70 lines — only onTriggerUpdated survives for cancelling stale invocations on re-enable or schedule change. Reconciliation checks only for enabled triggers missing a next_run_at value.

ScheduledTriggerService.ts · scheduled-triggers.ts (reconciliation)


Contributor

@claude claude bot left a comment


PR Review Summary

(1) Total Issues | Risk: High

This is a well-designed architectural improvement — moving from per-trigger daisy-chaining workflows to a centralized scheduler is cleaner and more maintainable. The supersession mechanism, one-shot runner design, and upsertScheduledTrigger fix (computing nextRunAt at the DAL layer) are all well thought out.

✅ Issues Fixed Since Prior Reviews

Prior Issue | Status | Evidence
Reconciliation check broken (nextRunAt missing in select) | Fixed | audit-queries.ts:18 now includes nextRunAt
Bulk routes (agentFull/projectFull) don't set nextRunAt | Fixed | upsertScheduledTrigger() at scheduledTriggers.ts:154-160 now computes nextRunAt internally — all callers benefit
Crash between advance-and-dispatch loses one-time triggers | Fixed | triggerDispatcher.ts:84 now starts workflow before advancing
Timing-safe comparison leaks secret length | Fixed | restartScheduler.ts:11-14 uses SHA256 hash comparison
Error response exposes err.message | Fixed | restartScheduler.ts:54 returns generic error
No test coverage for computeNextRunAt | Fixed | 136 lines of tests added
No test coverage for triggerDispatcher | Fixed | 240 lines of tests added

🔴❗ Critical (1) ❗🔴

🔴 1) 0013_lumpy_apocalypse.sql:1 Missing data migration for existing enabled triggers

Issue: The migration adds a nullable next_run_at column but does NOT backfill existing enabled triggers. All currently-enabled triggers will have next_run_at = NULL after migration.

Why: The scheduler at findDueScheduledTriggersAcrossProjects only dispatches triggers where next_run_at IS NOT NULL AND next_run_at <= now(). Existing enabled triggers will silently stop running after deploy — a production outage for customers relying on scheduled triggers. This is a one-way door that's very difficult to diagnose in production.

This issue has been raised in all 4 prior reviews and remains the sole blocker.

Fix: Add a backfill to the migration:

ALTER TABLE "scheduled_triggers" ADD COLUMN "next_run_at" timestamp with time zone;

UPDATE scheduled_triggers 
SET next_run_at = COALESCE(run_at, NOW()) 
WHERE enabled = true;


Inline Comments:

  • 🔴 Critical: 0013_lumpy_apocalypse.sql:1 Missing data migration backfill

🟠⚠️ Major (1) 🟠⚠️

Inline Comments:

  • 🟠 Major: triggerDispatcher.ts:44 Unbounded concurrent dispatches

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: triggerDispatcher.ts:102-105 Error log missing correlation context

💭 Consider (1) 💭

Inline Comments:

  • 💭 Consider: computeNextRunAt.test.ts:104-126 DST tests use weak assertions

🕐 Pending Recommendations (2)

Prior issues from pullfrog and earlier claude reviews that remain relevant (lower priority than the critical blocker):


🚫 REQUEST CHANGES

Summary: Excellent progress on this PR — 7 of 8 prior issues have been addressed, including critical fixes to bulk route handling, dispatch ordering, test coverage, and security. The architecture is sound.

However, the data migration backfill remains the critical blocker. Existing customers with scheduled triggers will experience a silent production outage after deploy because their triggers will have next_run_at = NULL and won't be picked up by the scheduler.

This is straightforward to fix with a few lines of SQL added to the migration file. Once that's addressed, this PR is ready to merge.

Discarded (5)
Location | Issue | Reason Discarded
schedulerWorkflow.ts | 60-second polling could miss sub-minute schedules | Intentional design — cron granularity is minutes
restartScheduler.ts | No rate limiting on restart endpoint | Acceptable — endpoint is protected by secret and only called by CI
schedulerSteps.ts | Supersession check returns false when state missing | Edge case — scheduler re-registers on next tick anyway
scheduled-triggers.ts | Removed onCreated/onDeleted handlers | Intentional — scheduler polls instead of event-driven
triggerDispatcher.ts | Success log emitted when advance fails | Acceptable — log says "dispatched" which is accurate; advance failure logged separately
Reviewers (1)
Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded
orchestrator (manual analysis) | 8 | 1 | 1 | 0 | 4 | 2 | 5
Total | 8 | 1 | 1 | 0 | 4 | 2 | 5

Note: Subagent reviewers could not be dispatched due to permissions. Manual analysis performed by orchestrator using comprehensive codebase exploration.

@@ -0,0 +1 @@
ALTER TABLE "scheduled_triggers" ADD COLUMN "next_run_at" timestamp with time zone;
(no newline at end of file)
Contributor


🔴 CRITICAL: Missing data migration for existing enabled triggers

Issue: This migration adds a nullable next_run_at column but does NOT backfill existing enabled triggers. All currently-enabled triggers will have next_run_at = NULL after migration.

Why: The scheduler at findDueScheduledTriggersAcrossProjects only dispatches triggers where next_run_at IS NOT NULL AND next_run_at <= now() (line 246). Existing enabled triggers will silently stop running after deploy — a production outage for customers relying on scheduled triggers.

This issue has been raised in all 4 prior reviews and remains the sole blocker.

Fix options:

  1. Add backfill to this migration (recommended):

ALTER TABLE "scheduled_triggers" ADD COLUMN "next_run_at" timestamp with time zone;

-- Backfill: set next_run_at = NOW() for all enabled triggers
-- so scheduler picks them up on next tick
UPDATE scheduled_triggers
SET next_run_at = COALESCE(run_at, NOW())
WHERE enabled = true;

  2. Add startup reconciliation in SchedulerService.startSchedulerWorkflow() that calls computeNextRunAt for any enabled trigger with NULL next_run_at before the first dispatch tick.



logger.info({ dueCount: dueTriggers.length }, 'Found due triggers');

const results = await Promise.allSettled(
Contributor


🟠 MAJOR: Unbounded concurrent dispatch could overwhelm the system

Issue: All due triggers are dispatched in parallel via Promise.allSettled with no concurrency limit. If many triggers become due simultaneously (scheduler outage recovery, popular cron times like midnight UTC), this could spawn hundreds of concurrent workflow starts.

Why: Each dispatchSingleTrigger performs DB writes and workflow starts. Unbounded parallelism risks DB connection pool exhaustion, workflow queue backpressure, and cascading failures during recovery scenarios.

Fix: Add a concurrency limit using a semaphore:

import pLimit from 'p-limit';

const limit = pLimit(10); // max 10 concurrent dispatches

const results = await Promise.allSettled(
  dueTriggers.map((trigger) => limit(() => dispatchSingleTrigger(trigger)))
);

Refs:

  • p-limit — popular concurrency limiter

Comment on lines +102 to +105
logger.error(
{ scheduledTriggerId, err },
'Failed to advance next_run_at after workflow start; next tick will retry (idempotent)'
);
Contributor


🟡 Minor: Error log missing correlation context for incident debugging

Issue: This log includes scheduledTriggerId and err but is missing tenantId, projectId, agentId, and scheduledFor — all of which are available in scope from the trigger object on line 64.

Why: During an incident with multiple failing triggers, operators would need to manually correlate trigger IDs back to tenant/project context. Including all identifiers makes debugging significantly easier and costs nothing.

Fix:

Suggested change
logger.error(
{ scheduledTriggerId, err },
'Failed to advance next_run_at after workflow start; next tick will retry (idempotent)'
);
logger.error(
{ scheduledTriggerId, tenantId, projectId, agentId, scheduledFor: trigger.nextRunAt, err },
'Failed to advance next_run_at after workflow start; next tick will retry (idempotent)'
);

Refs:

Comment on lines +104 to +126
it('handles DST spring-forward transition', () => {
const result = computeNextRunAt({
cronExpression: '30 2 * * *',
cronTimezone: 'America/New_York',
lastScheduledFor: '2026-03-07T07:30:00.000Z',
});

expect(result).toBeDefined();
const nextDate = new Date(result as string);
expect(nextDate.getTime()).toBeGreaterThan(new Date('2026-03-07T07:30:00.000Z').getTime());
});

it('handles DST fall-back transition', () => {
const result = computeNextRunAt({
cronExpression: '30 1 * * *',
cronTimezone: 'America/New_York',
lastScheduledFor: '2026-10-31T05:30:00.000Z',
});

expect(result).toBeDefined();
const nextDate = new Date(result as string);
expect(nextDate.getTime()).toBeGreaterThan(new Date('2026-10-31T05:30:00.000Z').getTime());
});

💭 Consider: DST tests use weak assertions

Issue: These DST tests only assert that the result is defined and later than the input. They don't verify the exact expected next run time, meaning bugs in DST handling (e.g., skipping to the wrong day, firing twice during fall-back) would pass.

Why: DST bugs in scheduler systems are notoriously hard to debug in production. Spring-forward and fall-back transitions can cause triggers to fire at unexpected times or not at all.

Fix: Consider strengthening assertions to verify exact expected times. For the spring-forward test, DST in America/New_York begins March 8, 2026, so 2:30 AM does not exist that day; after the transition the zone is UTC-4, meaning 2:30 AM local on March 9 is 06:30Z. The exact instant to assert depends on how the cron library handles the nonexistent slot (skip to the next day vs. fire at 3:00 AM):

// Example: assert the exact expected time after the DST transition
expect(result).toBe('2026-03-09T06:30:00.000Z'); // or the library's skip-policy alternative

Refs:
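To derive the exact expected instants, the zone's UTC offset on each side of the 2026 transition can be checked directly with `Intl.DateTimeFormat` (a quick sanity check on a runtime with full ICU, not part of the suggested test itself):

```typescript
// Report the UTC offset America/New_York uses at a given instant.
function offsetAt(isoInstant: string): string {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone: 'America/New_York',
    timeZoneName: 'shortOffset', // renders as "GMT-5", "GMT-4", ...
  }).formatToParts(new Date(isoInstant));
  return parts.find((p) => p.type === 'timeZoneName')?.value ?? '';
}

// Saturday March 7, 2026: still EST (UTC-5), so 2:30 AM local is 07:30Z.
console.log(offsetAt('2026-03-07T12:00:00.000Z')); // "GMT-5"
// Monday March 9, 2026: EDT (UTC-4), so 2:30 AM local is 06:30Z, not 07:30Z.
console.log(offsetAt('2026-03-09T12:00:00.000Z')); // "GMT-4"
```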

@github-actions github-actions bot deleted a comment from claude bot Mar 16, 2026
@claude claude bot left a comment

PR Review Summary

(1) Total Issues | Risk: High

This is a delta review covering 2 commits since the last automated review (19ce6ddac87f). The delta is small and focused:

  • Removed unused workflowProcess.ts endpoint
  • Refactored findDueScheduledTriggersAcrossProjects from sequential to parallel queries
  • Cleaned up unused type fields in ScheduledTriggerAuditResult

✅ Delta Changes Analyzed

| File | Change | Assessment |
| --- | --- | --- |
| `createApp.ts` | Remove unused endpoint mounting | ✅ Clean |
| `workflowProcess.ts` | Deleted (unused) | ✅ Clean |
| `scheduledTriggers.ts` | Sequential → parallel query refactor | ⚠️ Minor issue (see inline) |
| `types.ts` | Remove unused type fields | ✅ Clean |

The parallel query refactor in scheduledTriggers.ts is a good performance improvement for deployments with many projects, but loses project context in error logging (see inline comment).

🔴❗ Critical (1) ❗🔴

🔴 1) 0013_lumpy_apocalypse.sql:1 Missing data migration for existing enabled triggers

Issue: The migration adds a nullable next_run_at column but does NOT backfill existing enabled triggers. All currently-enabled triggers will have next_run_at = NULL after migration.

Why: The scheduler at findDueScheduledTriggersAcrossProjects only dispatches triggers where next_run_at IS NOT NULL AND next_run_at <= now(). Existing enabled triggers will silently stop running after deploy — a production outage for customers relying on scheduled triggers.

This issue has been raised in 5 prior reviews and remains the sole blocker.

Fix: Add a backfill to the migration:

ALTER TABLE "scheduled_triggers" ADD COLUMN "next_run_at" timestamp with time zone;

-- Backfill: set next_run_at for all enabled triggers so scheduler picks them up
UPDATE scheduled_triggers 
SET next_run_at = COALESCE(run_at, NOW()) 
WHERE enabled = true;

Refs:

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: scheduledTriggers.ts:268 Error log lost project context after parallel refactor

🕐 Pending Recommendations (2)

Prior issues that were noted but are lower priority than the critical blocker:


🚫 REQUEST CHANGES

Summary: The delta is clean with one minor issue (error logging context). However, the critical data migration issue remains unresolved and continues to block this PR.

All existing scheduled triggers will have next_run_at = NULL after migration and will silently stop running. This is a production outage waiting to happen. Please add a backfill mechanism before merging.

The architecture is solid and this PR is very close to being ready. This one remaining issue is straightforward to fix with a few lines of SQL added to the migration file.

Discarded (0)

No findings were discarded in this delta review.

Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| orchestrator (direct analysis) | 2 | 1 | 0 | 0 | 1 | 2 | 0 |
| Total | 2 | 1 | 0 | 0 | 1 | 2 | 0 |

Note: This was a focused delta review of a small changeset. Subagent reviewers were not dispatched as the delta contained only 4 file changes with narrow scope.

allDue.push(...result.value);
} else {
logger.warn(
{ err: result.reason },

🟡 Minor: Error log lost project context after parallel refactor

Issue: The refactor from sequential to parallel queries removed tenantId and projectId from the error log. Previously logged { tenantId, projectId, err }, now only logs { err: result.reason }.

Why: When a project branch query fails, operators won't be able to identify which project failed without correlating timestamps with other logs. This makes debugging harder during incidents affecting specific tenants.

Fix: Capture project context in the rejection or wrap the async function to preserve it:

Suggested change
{ err: result.reason },
const results = await Promise.allSettled(
  params.projects.map(async (project) => {
    const branchName = getProjectScopedRef(project.tenantId, project.projectId, 'main');
    try {
      const rows = await db.execute(
        sql`SELECT id, tenant_id, project_id, agent_id,
              cron_expression, cron_timezone, run_at,
              next_run_at, enabled
            FROM scheduled_triggers AS OF ${sql.raw(`'${branchName}'`)}
            WHERE enabled = true
              AND next_run_at IS NOT NULL
              AND next_run_at <= ${params.asOf}`
      );
      return rows.rows.map((row) => ({
        id: row.id as string,
        tenantId: row.tenant_id as string,
        projectId: row.project_id as string,
        agentId: row.agent_id as string,
        cronExpression: row.cron_expression as string | null,
        cronTimezone: row.cron_timezone as string | null,
        runAt: row.run_at as string | null,
        nextRunAt: row.next_run_at as string | null,
        enabled: row.enabled as boolean,
      }));
    } catch (err) {
      logger.warn(
        { tenantId: project.tenantId, projectId: project.projectId, err },
        'Failed to query due triggers for project branch, skipping'
      );
      return [];
    }
  })
);
return results
  .filter((r): r is PromiseFulfilledResult<DueScheduledTrigger[]> => r.status === 'fulfilled')
  .flatMap((r) => r.value);

Refs:

@github-actions github-actions bot deleted a comment from claude bot Mar 16, 2026

itoqa bot commented Mar 16, 2026

Ito Test Report ❌

19 test cases ran. 18 passed, 1 failed.

✅ Most scheduled-trigger API, UI, validation, auth-boundary, and security checks behaved as expected in this run. 🔍 One race-condition case (rapid double-submit during scheduled-trigger creation) reproduced a real duplicate-persistence defect, and code inspection supports this as an application bug rather than test setup noise.

✅ Passed (18)
| Test Case | Summary | Timestamp | Screenshot |
| --- | --- | --- | --- |
| ROUTE-1 | Authorized restart scheduler calls returned 200 and second response linked to prior run metadata. | 4:47 | ROUTE-1_4-47.png |
| ROUTE-2 | Missing Authorization header returned 401 Unauthorized. | 4:47 | ROUTE-2_4-47.png |
| ROUTE-3 | Malformed or incorrect bearer variants returned 401 Unauthorized. | 4:47 | ROUTE-3_4-47.png |
| ROUTE-4 | Legacy workflow keep-alive endpoint returned 404 and is not available. | 0:00 | ROUTE-4_0-00.png |
| LOGIC-1 | Enabled recurring trigger was created with populated Next Run. | 1:26:06 | LOGIC-1_1-26-06.png |
| LOGIC-2 | One-time trigger persisted with Next Run matching selected runAt. | 1:26:57 | LOGIC-2_1-26-57.png |
| LOGIC-3 | Disabled trigger showed Disabled status and empty Next Run. | 1:27:19 | LOGIC-3_1-27-19.png |
| LOGIC-4 | Editing cron cadence recomputed Next Run and persisted after refresh. | 1:28:07 | LOGIC-4_1-28-07.png |
| LOGIC-5 | Disable/re-enable flow correctly cleared and repopulated Next Run. | 1:28:35 | LOGIC-5_1-28-35.png |
| LOGIC-6 | Full agent upsert persisted new scheduled trigger with non-null nextRunAt. | 1:34:52 | LOGIC-6_1-34-52.png |
| LOGIC-7 | Full project upsert persisted new trigger with non-null nextRunAt. | 1:34:52 | LOGIC-7_1-34-52.png |
| EDGE-2 | After rapid toggles, persisted state matched the final intended state. | 1:37:59 | EDGE-2_1-37-59.png |
| EDGE-3 | Non-linear edit navigation remained stable and saved only intended updates. | 1:38:40 | EDGE-3_1-38-40.png |
| EDGE-4 | Mobile viewport edit flow remained accessible and saved successfully. | 1:39:07 | EDGE-4_1-39-07.png |
| ADV-1 | Invalid payload with both cronExpression and runAt was rejected with 400. | 1:34:52 | ADV-1_1-34-52.png |
| ADV-2 | Invalid null schedule patch was rejected with 400 and prior schedule remained intact. | 1:34:52 | ADV-2_1-34-52.png |
| ADV-3 | Non-admin cross-user runAsUserId mutation was rejected with 403. | 1:34:52 | ADV-3_1-34-52.png |
| ADV-4 | Injection-like strings rendered inertly with no script execution side effects. | 1:40:31 | ADV-4_1-40-31.png |
❌ Failed (1)
| Test Case | Summary | Timestamp | Screenshot |
| --- | --- | --- | --- |
| EDGE-1 | Rapid double submit created duplicate persisted scheduled-trigger rows instead of one. | 1:37:39 | EDGE-1_1-37-39.png |
Rapid double-submit on create form does not duplicate trigger – Failed
  • Where: Scheduled trigger creation flow (Manage UI form + create scheduled-trigger API endpoint).

  • Steps to reproduce: Open create scheduled trigger form, enter a valid trigger, rapidly submit twice, then inspect scheduled trigger list.

  • What failed: Two persisted rows were created for the same user action window; expected behavior is one logical creation.

  • Code analysis: The UI sends create requests without an idempotency key, and the API path accepts each POST as a fresh insert by generating a new ID when one is not provided. The data layer inserts directly, and schema constraints only enforce primary key uniqueness by ID (not logical dedupe for repeated submit intent).

  • Relevant code:

    agents-manage-ui/src/components/scheduled-triggers/scheduled-trigger-form.tsx (lines 244-251)

    if (usersToCreate.length === 1) {
      const apiPayload = {
        ...basePayload,
        name: data.name,
        runAsUserId: resolveRunAsUserId(usersToCreate[0]),
      };
      const result = await createScheduledTriggerAction(tenantId, projectId, agentId, apiPayload);
    }

    agents-api/src/domains/manage/routes/scheduledTriggers.ts (lines 348-349, 382-387)

    const id = body.id || generateId();
    
    const trigger = await createScheduledTrigger(db)({
      ...body,
      id,
      tenantId,
      projectId,
      agentId,
    });

    packages/agents-core/src/data-access/manage/scheduledTriggers.ts (lines 83-87) and packages/agents-core/src/db/manage/manage-schema.ts (lines 186-187)

    const result = await db
      .insert(scheduledTriggers)
      .values(params as any)
      .returning();
    return result[0] as ScheduledTrigger;
    primaryKey({ columns: [table.tenantId, table.projectId, table.agentId, table.id] }),
    foreignKey({
  • Why this is likely a bug: The create path is non-idempotent for rapid duplicate submissions and lacks a logical dedupe guard, so concurrent/repeated submits can persist duplicate trigger records.

  • Introduced by this PR: Unknown – unable to determine.

  • Timestamp: 1:37:39
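The conventional fix this analysis points toward is an idempotency key: the form generates one key per logical submission and the API dedupes on it. A minimal in-memory sketch under those assumptions (the `createTrigger` signature, key parameter, and store here are illustrative, not the project's API):

```typescript
// Minimal in-memory idempotency guard (hypothetical: a real fix would persist
// the key, e.g. via a unique index on an idempotency_key column).
type CreateResult = { id: string; deduped: boolean };

const seenKeys = new Map<string, string>(); // idempotency key -> created trigger id
let counter = 0;

function createTrigger(idempotencyKey: string, _payload: { name: string }): CreateResult {
  const priorId = seenKeys.get(idempotencyKey);
  if (priorId) {
    // Replay of the same logical submission: return the first result, insert nothing.
    return { id: priorId, deduped: true };
  }
  const id = `trig_${++counter}`;
  seenKeys.set(idempotencyKey, id);
  return { id, deduped: false };
}

// Rapid double submit with the same key persists a single logical row.
const first = createTrigger('key-abc', { name: 'nightly-report' });
const second = createTrigger('key-abc', { name: 'nightly-report' });
```

In the real system the form would generate the key once per open form (e.g. `crypto.randomUUID()`) and send it with each POST; backing the guard with a DB unique index makes it race-safe, which an in-memory map is not.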

📋 View Recording

Screen Recording
