Conversation
…ock contention When processing batchTriggerAndWait items, each batch item was acquiring a Redis lock on the parent run to insert a TaskRunWaitpoint row. With high concurrency (processingConcurrency=50), this caused LockAcquisitionTimeoutError (880 errors/24h in prod), orphaned runs, and stuck parent runs. Since blockRunWithCreatedBatch already transitions the parent to EXECUTING_WITH_WAITPOINTS before items are processed, the per-item lock is unnecessary. The new blockRunWithWaitpointLockless method performs only the idempotent CTE insert and timeout scheduling without acquiring the lock. refs TRI-7837
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
WalkthroughAdds a lockless waitpoint insertion path to reduce Redis lock contention for batchTriggerAndWait processing. A new method, blockRunWithWaitpointLockless, was introduced in WaitpointSystem to perform idempotent CTE inserts (ON CONFLICT DO NOTHING) of TaskRunWaitpoint and _WaitpointRunConnections and to enqueue timeout jobs when provided; it intentionally avoids acquiring the parent run lock. The engine trigger flow was changed to call the lockless path for batch items and retain the locked path for non-batch items. The new method was duplicated in the file. The RunLocker test timeout was increased. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (2)
540-559: Guard the lockless path with an explicit state invariant check.This method relies on a strict precondition (parent already in
EXECUTING_WITH_WAITPOINTS). Add a fast-fail guard so future call-site drift doesn’t silently create blockers without the corresponding snapshot transition.Suggested hardening
async blockRunWithWaitpointLockless({ @@ }): Promise<void> { const prisma = tx ?? this.$.prisma; const $waitpoints = typeof waitpoints === "string" ? [waitpoints] : waitpoints; + const snapshot = await getLatestExecutionSnapshot(prisma, runId); + if (snapshot.executionStatus !== "EXECUTING_WITH_WAITPOINTS") { + throw new Error( + `blockRunWithWaitpointLockless requires EXECUTING_WITH_WAITPOINTS; got ${snapshot.executionStatus}` + ); + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts` around lines 540 - 559, Add a fast-fail state invariant check at the start of blockRunWithWaitpointLockless: fetch the parent/run record for runId (using the same prisma client used in the function) and verify its state equals the expected EXECUTING_WITH_WAITPOINTS constant; if not, throw or return an explicit error indicating the precondition failed. Place this guard immediately after resolving prisma and $waitpoints so the function fails fast on incorrect call-site state drift rather than proceeding to create blockers. Use the function name blockRunWithWaitpointLockless, the runId param, and the EXECUTING_WITH_WAITPOINTS state constant to locate where to add the check.
528-605: Add observability instrumentation and crumbs to this new system method.Please wrap this path with tracer/meter instrumentation (low-cardinality attributes only) and add
@crumbsmarkers around insert + timeout scheduling branches.As per coding guidelines
internal-packages/run-engine/src/engine/systems/**/*.ts:Integrate OpenTelemetry tracer and meter instrumentation in RunEngine systems for observability; and**/*.{ts,tsx,js}:Add crumbs as you write code using //@Crumbscomments or //#region@Crumbsblocks for agentcrumbs debug tracing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts` around lines 528 - 605, Wrap the body of blockRunWithWaitpointLockless in an OpenTelemetry span (use this.$.tracer.startActiveSpan or equivalent) and record low-cardinality attributes (runId, projectId, batchId, batchIndex, waitpointsCount, spanIdToComplete, and a boolean batch flag); use a meter counter to record successful insert attempts and timeout-job scheduling counts. Place agent-crumbs markers around the CTE insert block and around the timeout scheduling loop (e.g. // `@crumbs` begin insert, // `@crumbs` end insert, and // `@crumbs` begin schedule-timeouts, // `@crumbs` end schedule-timeouts). Ensure exceptions inside the span are recorded and rethrown, and keep waitpoint IDs out of span attributes to avoid high-cardinality telemetry.internal-packages/run-engine/src/engine/index.ts (1)
731-756: Add@crumbstracing to the new batch/non-batch waitpoint branch.Please add crumbs on this new execution fork so lockless-vs-locked routing is visible in agent debugging traces.
As per coding guidelines
**/*.{ts,tsx,js}:Add crumbs as you write code using //@Crumbscomments or //#region@Crumbsblocks for agentcrumbs debug tracing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal-packages/run-engine/src/engine/index.ts` around lines 731 - 756, The new branching between lockless and locked waitpoint paths (inside the surrounding method that calls this.waitpointSystem) lacks `@crumbs` tracing; add // `@crumbs` comments before the lockless call (blockRunWithWaitpointLockless) and before the locked call (blockRunWithWaitpoint) so agent traces show which path was taken. Each crumb should be a single-line comment with a clear message (e.g. "waitpoint:lockless" vs "waitpoint:locked") and include key context such as runId (parentTaskRunId), waitpoint id (taskRun.associatedWaitpoint.id) and batch flag so debugging traces can correlate runs. Place the crumbs immediately above the respective await calls in the same function so the agentcrumbs tooling picks them up. Ensure you follow the project's crumb format used elsewhere (// `@crumbs` ...) for consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal-packages/run-engine/src/engine/index.ts`:
- Around line 731-756: The new branching between lockless and locked waitpoint
paths (inside the surrounding method that calls this.waitpointSystem) lacks
`@crumbs` tracing; add // `@crumbs` comments before the lockless call
(blockRunWithWaitpointLockless) and before the locked call
(blockRunWithWaitpoint) so agent traces show which path was taken. Each crumb
should be a single-line comment with a clear message (e.g. "waitpoint:lockless"
vs "waitpoint:locked") and include key context such as runId (parentTaskRunId),
waitpoint id (taskRun.associatedWaitpoint.id) and batch flag so debugging traces
can correlate runs. Place the crumbs immediately above the respective await
calls in the same function so the agentcrumbs tooling picks them up. Ensure you
follow the project's crumb format used elsewhere (// `@crumbs` ...) for
consistency.
In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts`:
- Around line 540-559: Add a fast-fail state invariant check at the start of
blockRunWithWaitpointLockless: fetch the parent/run record for runId (using the
same prisma client used in the function) and verify its state equals the
expected EXECUTING_WITH_WAITPOINTS constant; if not, throw or return an explicit
error indicating the precondition failed. Place this guard immediately after
resolving prisma and $waitpoints so the function fails fast on incorrect
call-site state drift rather than proceeding to create blockers. Use the
function name blockRunWithWaitpointLockless, the runId param, and the
EXECUTING_WITH_WAITPOINTS state constant to locate where to add the check.
- Around line 528-605: Wrap the body of blockRunWithWaitpointLockless in an
OpenTelemetry span (use this.$.tracer.startActiveSpan or equivalent) and record
low-cardinality attributes (runId, projectId, batchId, batchIndex,
waitpointsCount, spanIdToComplete, and a boolean batch flag); use a meter
counter to record successful insert attempts and timeout-job scheduling counts.
Place agent-crumbs markers around the CTE insert block and around the timeout
scheduling loop (e.g. // `@crumbs` begin insert, // `@crumbs` end insert, and //
`@crumbs` begin schedule-timeouts, // `@crumbs` end schedule-timeouts). Ensure
exceptions inside the span are recorded and rethrown, and keep waitpoint IDs out
of span attributes to avoid high-cardinality telemetry.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 15f9a942-52b3-4d81-9d6f-83e06f97c480
📒 Files selected for processing (3)
.server-changes/fix-batch-waitpoint-lock-contention.mdinternal-packages/run-engine/src/engine/index.tsinternal-packages/run-engine/src/engine/systems/waitpointSystem.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (26)
- GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: typecheck / typecheck
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: sdk-compat / Bun Runtime
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: sdk-compat / Cloudflare Workers
- GitHub Check: sdk-compat / Deno Runtime
- GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
**/*.{ts,tsx}: Use task export syntax: export const myTask = task({ id: 'my-task', run: async (payload) => { ... } })
Use Run Engine 2.0 (@internal/run-engine) and redis-worker for all new work - avoid DEPRECATED zodworker (Graphile-worker wrapper)
Prisma 6.14.0 client and schema use PostgreSQL in internal-packages/database - import only from Prisma client
Files:
internal-packages/run-engine/src/engine/index.tsinternal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
Files:
internal-packages/run-engine/src/engine/index.tsinternal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries
Files:
internal-packages/run-engine/src/engine/index.tsinternal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
Format code using Prettier before committing
Files:
internal-packages/run-engine/src/engine/index.tsinternal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.{ts,tsx,js}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.{ts,tsx,js}: Always import from@trigger.dev/sdkfor Trigger.dev tasks - never use@trigger.dev/sdk/v3or deprecated client.defineJob
Import subpaths only from@trigger.dev/core, never import from root
Add crumbs as you write code using //@crumbscomments or //#region@crumbsblocks for agentcrumbs debug tracing
Files:
internal-packages/run-engine/src/engine/index.tsinternal-packages/run-engine/src/engine/systems/waitpointSystem.ts
internal-packages/run-engine/src/engine/systems/**/*.ts
📄 CodeRabbit inference engine (internal-packages/run-engine/CLAUDE.md)
Integrate OpenTelemetry tracer and meter instrumentation in RunEngine systems for observability
Files:
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
🧠 Learnings (13)
📓 Common learnings
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: internal-packages/run-engine/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:25.254Z
Learning: Use Redis for distributed locks and queues in the RunEngine system via `RunQueue` and `RunLocker` utilities
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `batch.triggerByTaskAndWait()` to batch trigger tasks by passing task instances and wait for results
Applied to files:
.server-changes/fix-batch-waitpoint-lock-contention.mdinternal-packages/run-engine/src/engine/index.tsinternal-packages/run-engine/src/engine/systems/waitpointSystem.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `batch.triggerAndWait()` to batch trigger multiple different tasks and wait for results
Applied to files:
.server-changes/fix-batch-waitpoint-lock-contention.mdinternal-packages/run-engine/src/engine/index.tsinternal-packages/run-engine/src/engine/systems/waitpointSystem.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `yourTask.batchTriggerAndWait()` to batch trigger tasks and wait for all results from a parent task
Applied to files:
internal-packages/run-engine/src/engine/index.tsinternal-packages/run-engine/src/engine/systems/waitpointSystem.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `runs.subscribeToBatch()` to subscribe to changes for all runs in a batch
Applied to files:
internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2026-03-02T12:42:56.114Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/webapp/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:56.114Z
Learning: Applies to apps/webapp/app/v3/services/**/*.server.ts : When editing services that branch on `RunEngineVersion` to support both V1 and V2 (e.g., `cancelTaskRun.server.ts`, `batchTriggerV3.server.ts`), only modify V2 code paths
Applied to files:
internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `yourTask.batchTrigger()` to trigger multiple runs of a task from inside another task
Applied to files:
internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2026-03-02T12:43:43.173Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: packages/redis-worker/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:43.173Z
Learning: Applies to packages/redis-worker/**/redis-worker/src/worker.ts : Worker loop and job processing should implement concurrency control in src/worker.ts
Applied to files:
internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2026-03-13T13:37:49.544Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T13:37:49.544Z
Learning: Applies to **/*.{ts,tsx} : Use Run Engine 2.0 (internal/run-engine) and redis-worker for all new work - avoid DEPRECATED zodworker (Graphile-worker wrapper)
Applied to files:
internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `batch.trigger()` to trigger multiple different tasks at once from backend code
Applied to files:
internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `tasks.batchTrigger()` to trigger multiple runs of a single task with different payloads
Applied to files:
internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `yourTask.triggerAndWait()` to trigger a task and wait for its result from a parent task
Applied to files:
internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `batch.triggerByTask()` to batch trigger tasks by passing task instances for static task sets
Applied to files:
internal-packages/run-engine/src/engine/index.ts
🔇 Additional comments (1)
.server-changes/fix-batch-waitpoint-lock-contention.md (1)
1-6: Clear and actionable release note.Good summary of the production symptom, root cause, and behavioral change.
|
Needs approval |
When processing batchTriggerAndWait items, each batch item was acquiring a
Redis lock on the parent run to insert a TaskRunWaitpoint row. With high
concurrency (processingConcurrency=50), this caused LockAcquisitionTimeoutError
(880 errors/24h in prod), orphaned runs, and stuck parent runs.
Since blockRunWithCreatedBatch already transitions the parent to
EXECUTING_WITH_WAITPOINTS before items are processed, the per-item lock is
unnecessary. The new blockRunWithWaitpointLockless method performs only the
idempotent CTE insert and timeout scheduling without acquiring the lock.