Scaling bottlenecks under high-integration, high-webhook-rate workloads #161

@khaliqgant

Context

Under workloads with many integrations (30+), high webhook rates, and active mounts, several independent bottlenecks compound. This issue documents them in priority order with enough specificity to investigate and act on each independently.


1. Worker and queue defaults are too conservative for production loads

EnvelopeWorkers, WritebackWorkers, and ProviderMaxConcurrency each default to 1, and EnvelopeQueueSize defaults to 1024 (store.go:803, 807, 811, and 791 respectively). At 30 integrations with chatty webhooks the queue saturates in seconds and starts returning 429s. These settings are already configurable via StoreOptions, but only from code: there is no way to tune them at deploy time, unlike FullPullEvery, which already supports RELAYFILE_MOUNT_FULL_PULL_EVERY.

Suggestion: expose the worker and queue settings as environment variables so they can be tuned operationally, and reconsider whether the hardcoded defaults match realistic workloads.


2. Full-pull ReadFile loop is sequential

pullRemoteFullTree in syncer.go (around the ReadFile call at line 1826) fetches files one at a time in a tight loop with no concurrency. At ~20ms per file, 1000 files take ~20s, which exceeds the default 15s cycle timeout. The timeout warning is already in the code, but the underlying cause isn't addressed.

The ListTree page size is also hardcoded to 10 (syncer.go:1792), so a workspace of 1000+ files needs 100+ ListTree round trips.

Suggestion: parallelize the ReadFile calls with a bounded worker pool, and increase the ListTree page size. Results can be gathered concurrently and applied in order. The pullRemoteFullExport fast path already exists as a better alternative when the backend supports it — worth ensuring that path is preferred.


3. Periodic full pull fires too frequently for large workspaces

defaultFullPullEvery = 20 cycles (syncer.go:68) means a full O(n-files) tree pull every ~10 minutes at the default interval. For workspaces with hundreds of files this interacts badly with item 2 above. The incremental pull path (event-cursor based) is already cheap and correct for steady state; the periodic full pull is a defensive correctness measure against revision reuse.

RELAYFILE_MOUNT_FULL_PULL_EVERY already supports negative values to disable this. The question is whether the default is set appropriately for the assumed file-count range.

Suggestion: consider a higher default (100+), document the -1 escape hatch more prominently, or gate the full pull on a file-count threshold rather than a fixed cycle count.
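To make the threshold option concrete, here is a rough sketch of a gate that switches to a slower cadence once a workspace is large. fullPullPolicy, LargeEvery, and FileThreshold are hypothetical names, and treating 0 as "disabled" alongside negative values is an assumption of this sketch, not documented behavior.

```go
package main

// fullPullPolicy holds the knobs discussed above. Every corresponds to the
// existing cycle count; LargeEvery and FileThreshold are hypothetical.
type fullPullPolicy struct {
	Every         int // cycles between full pulls; negative disables
	LargeEvery    int // cycles between full pulls once the workspace is large
	FileThreshold int // tracked-file count at which LargeEvery takes over; 0 turns the gate off
}

// shouldFullPull reports whether the given cycle should do a full tree pull.
func (p fullPullPolicy) shouldFullPull(cycle, fileCount int) bool {
	every := p.Every
	if p.FileThreshold > 0 && fileCount >= p.FileThreshold {
		every = p.LargeEvery
	}
	if every <= 0 {
		// Assumption: zero is treated like a negative value (disabled).
		return false
	}
	return cycle > 0 && cycle%every == 0
}
```

With Every=20, LargeEvery=100, and FileThreshold=500, a 100-file workspace keeps today's cadence while a 1000-file workspace does a full pull only every 100th cycle.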


4. Global Store.mu serializes all workspaces

A single sync.RWMutex at store.go:430 protects all state across all workspaces. The envelope-related maps (envelopesByID, deliveryIndex, coalesceIndex, envelopeAttempts, deadLetters; store.go:442–455) are global rather than per-workspace, so every envelope operation across every integration contends on the same lock regardless of workspace.

The HTTP and provider calls already happen outside the lock (store.go:3658–3672), so the lock is only held for in-memory state mutations — the contention is real but the scope of what's locked is bounded.

workspaceState (store.go:481) already partitions files, events, ops, and watermarks per workspace. Extending that partition to the envelope maps and adding a per-workspace mutex would let workspaces process envelopes independently.

Suggestion: this is the right long-term fix for multi-workspace throughput. Scope can be managed by moving the envelope maps into workspaceState incrementally and protecting them with a per-workspace lock, while keeping the global lock for workspace map structure changes.


5. Coalesced envelopes are still processed individually

The coalesce logic (store.go:3047–3086) merges duplicate webhooks for the same object within a time window, which is good. But coalesced envelopes are still enqueued and processed one at a time. For bursty traffic to the same objects (e.g., a file edited rapidly), this means multiple lock acquisitions and provider write calls for what could be a single batched operation.

Suggestion: consider whether writes for the same object arriving within the coalesce window can be held and flushed as a single operation, rather than processed as they arrive. This is most valuable when CoalesceWindow is set longer than the default 5s.


Summary

| Bottleneck | Location | Effort | Impact |
| --- | --- | --- | --- |
| No env vars for worker/queue tuning | store.go:791–811 | Low | High |
| Sequential ReadFile in full pull | syncer.go:1826 | Medium | High |
| Full pull cadence vs. file count | syncer.go:68 | Low | Medium |
| Global lock across all workspaces | store.go:430, 442–455 | High | Very high |
| No batching after coalesce | store.go:3047–3086 | Medium | Medium |

Items 1 and 3 can be addressed without touching core logic. Item 2 is self-contained in pullRemoteFullTree. Item 4 is the structural change that removes the throughput ceiling.
