
WIP: Troubleshoot the bucket limit for sync streams #374 (Open)

benitav wants to merge 14 commits into `main` from `sync-streams-buckets`

Conversation

@benitav (Collaborator) commented Mar 6, 2026

This is currently not necessarily an exhaustive list

@benitav changed the title from "Troubleshoot the bucket limit for sync streams" to "WIP: Troubleshoot the bucket limit for sync streams" on Mar 9, 2026
@simolus3 (Contributor) left a comment:

I've only checked the examples so far, not the suggested workarounds.

| Query pattern | Buckets per user |
| --- | --- |
| Subscription parameter: `WHERE project_id = subscription.parameter('project_id')` | 1 per unique parameter value the client subscribes with |
| Subquery returning N rows: `WHERE id IN (SELECT org_id FROM org_membership WHERE user_id = auth.user_id())` | N — one per result row of the subquery |
| INNER JOIN through an intermediate table: `SELECT tasks.* FROM tasks JOIN projects ON tasks.project_id = projects.id WHERE projects.org_id IN (...)` | N — one per row of the joined table (one per project) |
| Many-to-many JOIN: `SELECT assets.* FROM assets JOIN project_assets ON project_assets.asset_id = assets.id WHERE project_assets.project_id IN (...)` | N — one per primary table row (one per asset) |
A contributor commented:

This is true, but I wonder if explaining it as a special case (the paragraph above also calls out many-to-many joins as an exception to the rule) is really that helpful. The two paragraphs below also point this out separately.

> For subqueries and one-to-many JOINs, each row returned creates a bucket

That also applies here: there would be one bucket per `project_assets` row for the user. So each row returned in the joined table creates one bucket; the many-to-many join doesn't make this an exception.

Maybe it makes sense to explain the general rule first: for each expression of the form `a = b`, `a IN b`, or `a && b` where either `a` or `b` depends on the table being synced, we create a parameter. Assuming that `a` is the expression depending on the table being synced, we create one bucket per row of `b`. All of the cases can be explained with that rule; it might be harder to grasp, but it avoids having to explain many-to-many joins separately.
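A sketch of how that general rule covers the simpler cases (table and column names here are hypothetical, chosen to match the examples in the tables below):

```sql
-- In each WHERE clause, the left side (a) depends on the table being synced
-- and the right side (b) determines the parameter values:
-- one bucket per row/value of b.

-- a = b with a single value: one bucket per user.
SELECT * FROM documents WHERE owner_id = auth.user_id();

-- a IN b with a subquery: the subquery returns N rows, so N buckets per user.
SELECT * FROM projects
WHERE id IN (SELECT project_id FROM project_members WHERE user_id = auth.user_id());

-- a && b (array overlap) follows the same pattern:
-- one bucket per element of the overlapping set.
```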

| Query pattern | Buckets per user |
| --- | --- |
| No parameters: `SELECT * FROM regions` | 1 global bucket, shared by all users |
| Direct auth filter only: `WHERE user_id = auth.user_id()` | 1 per user |
| Subscription parameter: `WHERE project_id = subscription.parameter('project_id')` | 1 per unique parameter value the client subscribes with |
| Subquery returning N rows: `WHERE id IN (SELECT org_id FROM org_membership WHERE user_id = auth.user_id())` | N — one per result row of the subquery |
A contributor commented:

Maybe also add a JSON array example, e.g. `WHERE id IN auth.parameter('project_ids')` would give `jwt.project_ids.length` buckets.
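A sketch of that suggestion (the claim name `project_ids` is the reviewer's example; treat the table name as hypothetical):

```sql
-- Hypothetical: the JWT carries a JSON array claim, e.g.
--   "project_ids": ["proj-1", "proj-2", "proj-3"]
-- Each array element becomes one parameter value, so this query would
-- create one bucket per element (3 here, i.e. jwt.project_ids.length).
SELECT * FROM projects WHERE id IN auth.parameter('project_ids');
```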

```
user_projects → [proj-1, proj-2, proj-3, proj-4, proj-5, proj-6] (6 values)
```

> Each query creates its own bucket namespace, even when two queries use the same CTE:
A contributor commented:

This is not generally true: the compiler is allowed to merge the projects and tasks queries into a single bucket in this example, precisely because the CTE is the same (or more generally, because the buckets have the same instantiation and are part of the same stream).

(I haven't checked whether the buckets are actually merged in this case, but we are supposed to be able to exploit that.)

| `tasks` | `user_projects` | proj-1 … proj-6 | 6 |
| | | **Total** | **14** |

At scale — 10 orgs and 50 projects per org — this becomes 10 + 500 + 500 = 1,010 buckets, which exceeds the limit.
A contributor commented:

This O(n * m) blowup still applies even if the buckets are merged, and it is a good thing to be aware of, so I think we should mention it, fwiw. But removing one of the projects/tasks queries might be easier to understand.

```
↔ users (org_membership.user_id → users.id)
```

| Query pattern | Buckets per user |
A contributor commented:

Perhaps also add an example using multiple parameters (`WHERE id IN (SELECT org_id FROM org_membership WHERE user_id = auth.user_id()) AND region = subscription.parameter('region')`).
This would give one bucket per `(org_id, region)` pair, so up to N * M for N org ids and M distinct subscriptions.

We could also explain `OR` separately (those would give N + M buckets in most cases).
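A sketch of both suggestions (table and column names follow the reviewer's examples; the `OR` query is hypothetical):

```sql
-- AND combines parameters: one bucket per (org_id, region) pair,
-- so up to N * M buckets for N org ids and M distinct region subscriptions.
SELECT * FROM projects
WHERE id IN (SELECT org_id FROM org_membership WHERE user_id = auth.user_id())
  AND region = subscription.parameter('region');

-- OR branches are typically counted separately: N buckets from one branch
-- plus M from the other, i.e. N + M in most cases.
SELECT * FROM projects
WHERE owner_id = auth.user_id()
   OR id IN (SELECT org_id FROM org_membership WHERE user_id = auth.user_id());
```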
