Investigate converting ShapeStream to explicit state machine #3785

@KyleAMathews

Summary

Recent bugs in ShapeStream have been caused by implicit state machine complexity. Both #3773 (infinite loop in replay mode) and the stale cache offset update bug share a common root cause: the client maintains many interdependent state variables that are updated across scattered code paths, making it easy to forget to update related state when taking a specific branch.

The Problem

ShapeStream currently maintains 20+ private state variables that together form an implicit state machine, including:

Core sync state:

  • #shapeHandle, #lastOffset, #liveCacheBuster, #schema

Connection state:

  • #connected, #started, #state, #isUpToDate, #isMidStream, #lastSyncedAt

Stale cache handling:

  • #staleCacheRetryCount, #staleCacheBuster, #lastSeenCursor

SSE handling:

  • #sseFallbackToLongPolling, #consecutiveShortSseConnections, #lastSseConnectionStartTime

These variables are updated across many methods (#onInitialResponse, #onMessages, #requestShape, etc.), creating an implicit state machine where:

  1. State transitions are scattered - no single place defines what variables change together
  2. Valid state combinations are implicit - easy to create invalid states
  3. Invariants are maintained manually - easy to update one variable but forget another
  4. Testing requires mocking internals - can't test state machine logic in isolation

Bug Pattern

Both recent bugs followed the same pattern:

PR #3773 (Replay mode infinite loop):

  • Code takes early return when cursor matches
  • #lastSeenCursor wasn't cleared → replay mode never exits → infinite loop

Stale cache offset bug:

  • Code logs warning and continues when stale response detected with existing handle
  • #lastOffset was updated from stale response → handle/offset mismatch → server errors

In both cases, an early return or alternate branch failed to update all of the related state variables.
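
The pattern is easy to reproduce in a few lines. The sketch below is a deliberately simplified model, not the actual ShapeStream code: `ReplayState`, `inReplayMode`, `onResponseBuggy`, and `onResponseFixed` are hypothetical names, and the real fix in #3773 may look different.

```typescript
// Hypothetical, simplified model of replay-mode state (not the real ShapeStream fields).
interface ReplayState {
  lastSeenCursor: string | undefined // set only while in replay mode
}

function inReplayMode(state: ReplayState): boolean {
  return state.lastSeenCursor !== undefined
}

// Buggy shape: the early return skips the cleanup further down.
function onResponseBuggy(state: ReplayState, cursor: string): void {
  if (state.lastSeenCursor === cursor) {
    return // lastSeenCursor is still set, so replay mode never exits
  }
  state.lastSeenCursor = undefined
}

// Fixed shape: every branch returns a complete new state.
function onResponseFixed(state: ReplayState, cursor: string): ReplayState {
  if (state.lastSeenCursor === cursor) {
    return { lastSeenCursor: undefined } // cursor already seen: leave replay mode
  }
  return { lastSeenCursor: undefined }
}

const buggy: ReplayState = { lastSeenCursor: 'c1' }
onResponseBuggy(buggy, 'c1')

const fixed = onResponseFixed({ lastSeenCursor: 'c1' }, 'c1')
```

After the buggy call `inReplayMode(buggy)` is still true, while `inReplayMode(fixed)` is false. Returning a whole state object from every branch makes the omission visible, because a branch cannot return without saying what the state is.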

Proposed Solution

Investigate converting the implicit state to an explicit state machine:

Option A: Discriminated Union States

type SyncState = 
  | { phase: 'initial'; offset: '-1' }
  | { phase: 'syncing'; handle: string; offset: Offset; schema: Schema }
  | { phase: 'live'; handle: string; offset: Offset; schema: Schema; cursor: string }
  | { phase: 'stale-retry'; handle: string; offset: Offset; retryCount: number; cacheBuster: string }
  | { phase: 'paused'; handle: string; offset: Offset }

// Transitions are explicit functions that return new state
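function handleResponse(state: SyncState, response: Response): SyncState {
  // All related state changes happen together.
  // With a switch on state.phase plus a `never` exhaustiveness check,
  // TypeScript flags any phase we forget to handle.
  return state // placeholder until the real transitions are written
}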
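
To make the exhaustiveness claim concrete, here is a minimal sketch of one transition over the union above. `Offset` and `Schema` are placeholder aliases, and `onStaleResponse`, `assertNever`, and the choice of what each phase does on a stale response are assumptions for illustration, not existing client behavior.

```typescript
// Placeholder aliases: the real Offset and Schema types are assumed, not copied from the client.
type Offset = string
type Schema = Record<string, unknown>

type SyncState =
  | { phase: 'initial'; offset: '-1' }
  | { phase: 'syncing'; handle: string; offset: Offset; schema: Schema }
  | { phase: 'live'; handle: string; offset: Offset; schema: Schema; cursor: string }
  | { phase: 'stale-retry'; handle: string; offset: Offset; retryCount: number; cacheBuster: string }
  | { phase: 'paused'; handle: string; offset: Offset }

// Hypothetical transition: a response was identified as stale.
function onStaleResponse(state: SyncState, cacheBuster: string): SyncState {
  switch (state.phase) {
    case 'initial':
    case 'paused':
      return state // assumption: nothing to retry in these phases
    case 'syncing':
    case 'live':
      // Handle and offset are carried over untouched; only retry bookkeeping is new.
      return { phase: 'stale-retry', handle: state.handle, offset: state.offset, retryCount: 1, cacheBuster }
    case 'stale-retry':
      return { ...state, retryCount: state.retryCount + 1, cacheBuster }
    default:
      // If a new phase is added to SyncState, this line stops compiling.
      return assertNever(state)
  }
}

function assertNever(value: never): never {
  throw new Error(`Unhandled state: ${JSON.stringify(value)}`)
}

const syncing: SyncState = { phase: 'syncing', handle: 'h1', offset: '10_0', schema: {} }
const retrying = onStaleResponse(syncing, 'buster-1')
```

Because the stale path builds its result only from `state.handle` and `state.offset`, there is no way for the stale response's offset to leak into the new state, which is the failure described in the stale cache offset bug.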

Option B: XState or Similar

Use a formal state machine library for complex transitions:

  • Visual state charts for documentation
  • Built-in guards and actions
  • Automatic testing of valid transitions
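
For comparison, the core of what a state chart encodes can be sketched without any library as a plain transition table. This is not XState's API. The phase names come from Option A, while the event names and the specific transitions allowed are assumptions made up for illustration.

```typescript
type Phase = 'initial' | 'syncing' | 'live' | 'stale-retry' | 'paused'
// Hypothetical event names, not taken from the client.
type SyncEvent = 'RESPONSE' | 'UP_TO_DATE' | 'STALE' | 'PAUSE' | 'RESUME'

// Allowed transitions; any (phase, event) pair missing here is rejected.
const transitions: Record<Phase, Partial<Record<SyncEvent, Phase>>> = {
  initial: { RESPONSE: 'syncing' },
  syncing: { RESPONSE: 'syncing', UP_TO_DATE: 'live', STALE: 'stale-retry', PAUSE: 'paused' },
  live: { RESPONSE: 'live', STALE: 'stale-retry', PAUSE: 'paused' },
  'stale-retry': { RESPONSE: 'syncing', STALE: 'stale-retry' },
  paused: { RESUME: 'syncing' },
}

function transition(phase: Phase, event: SyncEvent): Phase {
  const target = transitions[phase][event]
  if (target === undefined) {
    throw new Error(`Invalid transition: ${event} in phase ${phase}`)
  }
  return target
}
```

A table like this can be checked by looping over every phase and event pair, which is roughly the kind of transition testing a library would automate. Guards, actions, and visual charts are what a library would add on top.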

Option C: State Reducer Pattern

Centralize state updates through a reducer:

type StateAction = 
  | { type: 'RESPONSE_RECEIVED'; handle: string; offset: Offset; ... }
  | { type: 'STALE_RESPONSE_IGNORED' }  // No state changes!
  | { type: 'ENTER_REPLAY_MODE'; cursor: string }
  | { type: 'EXIT_REPLAY_MODE' }

function reduce(state: State, action: StateAction): State {
  // Single source of truth for state transitions
}
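
A filled-in version of the reducer might look like the sketch below. The `State` shape is hypothetical, the fields elided from `RESPONSE_RECEIVED` in the outline above are simply left out, and offsets are modeled as plain strings.

```typescript
// Hypothetical state shape; field names loosely mirror the private fields listed earlier.
interface State {
  handle: string | undefined
  offset: string
  replayCursor: string | undefined // set only while in replay mode
}

type StateAction =
  | { type: 'RESPONSE_RECEIVED'; handle: string; offset: string }
  | { type: 'STALE_RESPONSE_IGNORED' }
  | { type: 'ENTER_REPLAY_MODE'; cursor: string }
  | { type: 'EXIT_REPLAY_MODE' }

function reduce(state: State, action: StateAction): State {
  switch (action.type) {
    case 'RESPONSE_RECEIVED':
      // Handle and offset always change together.
      return { ...state, handle: action.handle, offset: action.offset }
    case 'STALE_RESPONSE_IGNORED':
      return state // deliberately unchanged
    case 'ENTER_REPLAY_MODE':
      return { ...state, replayCursor: action.cursor }
    case 'EXIT_REPLAY_MODE':
      return { ...state, replayCursor: undefined }
  }
}

const start: State = { handle: undefined, offset: '-1', replayCursor: undefined }
const afterResponse = reduce(start, { type: 'RESPONSE_RECEIVED', handle: 'h1', offset: '10_0' })
const afterStale = reduce(afterResponse, { type: 'STALE_RESPONSE_IGNORED' })
```

Here `afterStale` is the same object as `afterResponse`, so "a stale response changes nothing" becomes a property that can be asserted directly in a unit test, with no network mocks involved.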

Benefits

  1. Bugs become obvious - forgetting to handle state in a transition is a type error
  2. Testable in isolation - state machine can be unit tested without network mocks
  3. Self-documenting - state types and transitions document valid states
  4. Easier reviews - PRs show explicit state transition changes

Scope

Start with the most bug-prone areas:

  1. Stale cache detection and retry logic
  2. Replay mode (cursor tracking)
  3. Handle/offset consistency

Later extend to:

  • SSE fallback logic
  • Pause/resume
  • Error recovery

Questions to Answer

  1. Is the complexity worth it for a client library?
  2. Which approach fits best with the existing codebase?
  3. Should we use a library (XState) or roll our own?
  4. Can we migrate incrementally or is it all-or-nothing?
