Skip to content

ENG-1815 Add sync attempt telemetry and stale end-task handling#1094

Open
mdroidian wants to merge 6 commits into
mainfrom
eng-1815-add-sync-attempt-telemetry-and-stale-end-task-handling
Open

ENG-1815 Add sync attempt telemetry and stale end-task handling#1094
mdroidian wants to merge 6 commits into
mainfrom
eng-1815-add-sync-attempt-telemetry-and-stale-end-task-handling

Conversation

@mdroidian
Copy link
Copy Markdown
Member

@mdroidian mdroidian commented May 28, 2026

This is from a back and forth with codex 5.5 xhigh with linear/posthog mcp access.

Summary

Fixes misleading sync reporting around ENG-1322 and adds the telemetry needed to debug remaining “Wrong worker” sync failures.

Previously, Roam could hit an end_sync_task failure and still emit Sync complete, which made PostHog sessions look contradictory. This change makes end_sync_task return structured results, classifies stale completions explicitly, and only reports sync success after the database confirms the task was ended successfully.

What Changed

  • Changed Roam sync worker identity from the Roam user UID to a per-runtime random worker ID.
  • Added per-sync syncAttemptId, syncWorkerId, syncUserUid, space ID, status, and per-phase timing telemetry.
  • Changed endSyncTask to return a typed success/stale/error result instead of swallowing failures.
  • Sync complete now only fires after end_sync_task returns success.
  • Stale completions now emit Sync stale; end-task failures emit Sync error with status: "end_task_failed".

Database Changes

  • public.end_sync_task(...) now returns jsonb instead of void.
  • If an older sync attempt finishes after a newer task has already started, the function returns:
    • ok: false
    • stale: true
    • reason: "completed_by_newer_task"
  • Real worker mismatches still raise Wrong worker, but now include diagnostic detail in the Postgres error detail payload.
  • Added migration 20260528033000_end_sync_task_result.sql.
  • Updated generated DB types and sync function docs.

Why This Helps

  • Prevents false-positive Sync complete events after failed task cleanup.
  • Separates stale/overlapping sync attempts from true wrong-worker errors.
  • Gives PostHog enough context to debug remaining ENG-1322 cases by phase duration, worker, attempt, and DB task state.

Out of Scope

  • Fully resolving every root cause of ENG-1322.
  • Changing sync timeout/interval policy.
  • Reworking the broader sync scheduler.

Open in Devin Review

Summary by CodeRabbit

Release Notes

  • Improvements

    • Enhanced sync task reliability with improved error detection and recovery mechanisms
    • Better handling of concurrent sync operations to prevent duplicate or stale task processing
  • Documentation

    • Updated sync function documentation for clearer completion and worker behavior specifications

Review Change Stack

…text

- Refactor end_sync_task to return a JSON object containing task status and metadata.
- Enhance error handling to include telemetry context for better debugging.
- Update related API routes and database schema to accommodate new return type.
- Document changes in sync_functions.md to reflect the new response structure.
@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 28, 2026

ENG-1815

@graphite-app
Copy link
Copy Markdown
Contributor

graphite-app Bot commented May 28, 2026

PR size/scope check

This PR is over our review-size guideline.

  • Recommended: ~200 lines changed
  • Acceptable limit: up to 400 lines when well-scoped/self-contained
  • Preferred file count: fewer than 5 files

Please split this into smaller PRs unless there is a clear reason the changes need to land together.

If keeping it as one PR, please add a brief justification covering:

  • What single problem this PR solves
  • Why the files/changes are coupled

@supabase
Copy link
Copy Markdown

supabase Bot commented May 28, 2026

Updates to Preview Branch (eng-1815-add-sync-attempt-telemetry-and-stale-end-task-handling) ↗︎

Deployments Status Updated
Database Thu, 28 May 2026 20:21:53 UTC
Services Thu, 28 May 2026 20:21:53 UTC
APIs Thu, 28 May 2026 20:21:53 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks Status Updated
Configurations Thu, 28 May 2026 20:21:55 UTC
Migrations Thu, 28 May 2026 20:21:55 UTC
Seeding Thu, 28 May 2026 20:21:56 UTC
Edge Functions Thu, 28 May 2026 20:22:00 UTC

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

Open in Devin Review

@mdroidian

This comment was marked as resolved.

@coderabbitai

This comment was marked as resolved.

@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

…error responses

- Update EndSyncTaskRpcResult to require 'ok' and 'stale' fields.
- Enhance isEndSyncTaskRpcResult function to validate response structure.
- Modify endSyncTask function to handle claimed timestamps and improve error handling.
- Update API route to parse and validate request body for task status and timestamps.
- Document changes in sync_functions.md to reflect new requirements for end_sync_task.
@mdroidian mdroidian marked this pull request as draft May 28, 2026 04:40
@mdroidian
Copy link
Copy Markdown
Member Author

PR size/scope check

This PR is over our review-size guideline.

  • Recommended: ~200 lines changed
  • Acceptable limit: up to 400 lines when well-scoped/self-contained
  • Preferred file count: fewer than 5 files

Please split this into smaller PRs unless there is a clear reason the changes need to land together.

If keeping it as one PR, please add a brief justification covering:

  • What single problem this PR solves
  • Why the files/changes are coupled

This PR fixes one sync completion correctness issue: end_sync_task failures/stale completions were not distinguishable enough, and Roam could report Sync complete even when completion did not actually succeed.

The files are coupled by the same contract change. The database function now returns structured completion results, the generated DB type and website route need to reflect that return shape, and the Roam caller must consume the result to avoid false success telemetry/backoff behavior. Splitting would create intermediate states where the server exposes a new contract without the client using it, or the client expects behavior the DB has not yet shipped.

@mdroidian mdroidian marked this pull request as ready for review May 28, 2026 04:43
@mdroidian mdroidian assigned maparent and unassigned maparent May 28, 2026
@mdroidian mdroidian requested a review from maparent May 28, 2026 04:44
devin-ai-integration[bot]

This comment was marked as resolved.

mdroidian added 3 commits May 27, 2026 23:02
…prove request body validation

- Made 'startedAt' in ParsedEndSyncTaskBody optional.
- Enhanced parseEndSyncTaskBody to accept a task status string directly.
- Updated error messages for better clarity on request body requirements.
- Adjusted RPC call to conditionally include 's_started_at' based on availability.
Copy link
Copy Markdown
Collaborator

@maparent maparent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Nothing wrong, but one change I'd appreciate:
Add a version number to the JSON output, and check for it in the code. If the version number coming from the database is higher than the one in the code, be more lenient with json parsing errors. This gives you more leeway to change the database json format without breaking old versions of the plugins.
Otherwise lgtm.

@mdroidian mdroidian requested a review from maparent May 28, 2026 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants