doc/design: add design doc for incremental, occ-based read-then-write#34977

Open
aljoscha wants to merge 1 commit into MaterializeInc:main from aljoscha:design-incremental-occ-read-then-write

Conversation

@aljoscha
Contributor


@bkirwi bkirwi left a comment


Love this direction!


- If the timestamp is still valid: the write is committed at exactly that
timestamp, and the oracle is advanced past it. Any concurrent OCC loops that
were targeting the same timestamp will fail and retry.
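The commit step quoted above can be sketched as a compare-and-advance against the timestamp oracle. This is a hypothetical illustration: the `Oracle` type and `try_commit` method are made-up names, not Materialize's actual API.

```rust
// Hypothetical sketch of the OCC commit step described above. `Oracle`
// and `try_commit` are illustrative names, not Materialize's actual API.

#[derive(Debug, PartialEq)]
enum CommitResult {
    /// The write landed at exactly the chosen timestamp.
    Committed { at: u64 },
    /// The oracle already moved past the timestamp: the OCC loop retries.
    Conflict,
}

struct Oracle {
    /// Lower bound for the next valid write timestamp.
    next_write_ts: u64,
}

impl Oracle {
    /// Commit at `ts` if it is still valid, advancing the oracle past it
    /// so that concurrent loops targeting the same timestamp fail and retry.
    fn try_commit(&mut self, ts: u64) -> CommitResult {
        if ts >= self.next_write_ts {
            self.next_write_ts = ts + 1;
            CommitResult::Committed { at: ts }
        } else {
            CommitResult::Conflict
        }
    }
}

fn main() {
    let mut oracle = Oracle { next_write_ts: 10 };
    // The first committer at ts=10 wins and advances the oracle to 11.
    assert_eq!(oracle.try_commit(10), CommitResult::Committed { at: 10 });
    // A concurrent OCC loop targeting ts=10 now conflicts and must retry.
    assert_eq!(oracle.try_commit(10), CommitResult::Conflict);
}
```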
Contributor


Do we have to worry about how table commit advances the logical frontier for all tables? At first blush it seems like you might get spurious conflicts. (Where, e.g., my write to table A advanced the frontier for table B, so my write attempt to B fails even though B hasn't changed.)

In theory the txn mechanism has enough metadata in its shard to know that there are not actually any changes in B between my read time and the new frontier, so it would be safe to advance the write timestamp further. But I suppose we're going to find that out via the subscribe pretty quickly anyway...
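To make the concern concrete, here is a toy model (all names hypothetical) contrasting the single shared write frontier with the per-table metadata the txn shard could in principle consult:

```rust
// Toy model of the spurious-conflict scenario: one logical write
// frontier shared by all tables, as with txn-wal table commits.
// All names here are illustrative, not Materialize's actual API.

use std::collections::HashMap;

struct Tables {
    /// Shared frontier, advanced by a commit to *any* table.
    global_frontier: u64,
    /// Per-table timestamp of the last real change.
    last_change: HashMap<&'static str, u64>,
}

impl Tables {
    fn commit(&mut self, table: &'static str, ts: u64) {
        self.last_change.insert(table, ts);
        self.global_frontier = self.global_frontier.max(ts + 1);
    }

    /// OCC validation against the shared frontier: any commit anywhere
    /// invalidates a pending write at `ts`, even on an untouched table.
    fn write_still_valid(&self, ts: u64) -> bool {
        ts >= self.global_frontier
    }

    /// What per-table metadata would say: has `table` actually changed
    /// at or after `read_ts`?
    fn changed_since(&self, table: &str, read_ts: u64) -> bool {
        self.last_change.get(table).map_or(false, |&t| t >= read_ts)
    }
}

fn main() {
    let mut tables = Tables { global_frontier: 10, last_change: HashMap::new() };
    // A pending OCC write targets table "B" at ts=10; meanwhile a commit
    // to table "A" at ts=10 advances the shared frontier to 11.
    tables.commit("A", 10);
    // Against the shared frontier, the pending write to "B" now conflicts...
    assert!(!tables.write_still_valid(10));
    // ...even though "B" itself has not changed since the read: spurious.
    assert!(!tables.changed_since("B", 10));
}
```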

Contributor Author


Yeah, any table write will clobber your write attempt. And as it stands in the prototype, you would learn through the subscribe that nothing changed and retry. Not ideal, but also ... 🤷

This removes a significant amount of complexity and uncertainty from the
codebase.

## Alternative: distributed locking service
Contributor


An alternative I'd like to see discussed is having the OCC loop run in clusterd as a dataflow export. At first glance, that seems attractive because it avoids having to stash all the data read in environmentd (which might OOM or abort the read because its response is too large). I imagine it might not work because table writes have to be performed by envd, possibly for some txn-wal reason?

Contributor

@ggevay ggevay Feb 16, 2026


Well, I think we eventually want to make multiple envds able to submit read-then-writes concurrently, in which case whatever multiple envds can cooperate on, a clusterd should be able to do as well, right? E.g., clusterds should also be able to access the timestamp oracle.

Contributor

@ggevay ggevay left a comment


LGTM, like it!

(As discussed offline, this will also unblock the removal of the old peek sequencing code, because the current ReadThenWrite code calls the old peek sequencing from the Coordinator.)


The goal is not to make writes faster, but to not regress significantly.
Benchmarking a PoC-level implementation of the OCC approach against `main` for
`UPDATE t SET x = x + 1` shows the following:
Contributor

@ggevay ggevay Feb 16, 2026


Peeks can be faster than subscribes in various situations:

- If there is an index, then a fast-path peek can avoid building a dataflow.
- We can do monotonic TopKs and min/max.

Anyhow, if it turns out that this matters for some users, then a follow-up optimization could be to first try a peek-and-write-at-the-same-timestamp instead of a subscribe, and fall back to the subscribe-based approach only if the first write doesn't complete fast enough and fails.
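The suggested fallback could be structured roughly like this; `try_peek_write` and `subscribe_loop` are hypothetical stand-ins for the two paths, not real Materialize functions:

```rust
// Hypothetical control flow for the suggested optimization: first try an
// optimistic peek-and-write at a single timestamp, and only fall back to
// the subscribe-based OCC loop if that attempt fails.

/// Optimistic fast path: peek and write at the same timestamp. Fails if
/// the timestamp was invalidated (or, in a real system, if the peek did
/// not complete fast enough).
fn try_peek_write(ts: u64, conflicted: bool) -> Result<u64, &'static str> {
    if conflicted {
        Err("write timestamp no longer valid")
    } else {
        Ok(ts)
    }
}

/// General path: in the real design this re-reads via a subscribe and
/// retries the OCC commit until a timestamp sticks; modeled here as
/// simply succeeding at the next timestamp.
fn subscribe_loop(ts: u64) -> u64 {
    ts + 1
}

fn read_then_write(ts: u64, conflicted: bool) -> u64 {
    match try_peek_write(ts, conflicted) {
        Ok(committed_at) => committed_at,
        Err(_) => subscribe_loop(ts),
    }
}

fn main() {
    // No conflict: the peek-based fast path commits at the chosen timestamp.
    assert_eq!(read_then_write(10, false), 10);
    // Conflict: fall back to the subscribe-based loop.
    assert_eq!(read_then_write(10, true), 11);
}
```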

3. After committing at timestamp T, the oracle advances past T, so additional
writes at T would fail anyway. We fail them early to avoid unnecessary work.

### Comparison with the old approach
Contributor


Maybe somewhat of a corner case, but a slight concern is what happens when the subscribe can't keep up with input changes, and thus the operation can never complete. With the old approach, the operation would usually either complete after a while or OOM the cluster, making some noise. In contrast, the new approach can silently get into a state where it can never finish, e.g., if there is some window function or cross join that keeps rewriting a significant portion of the subscribe's output at every input change. Maybe we could try to detect if the subscribe just keeps falling behind, and show a notice to the user, or even entirely fail the operation.
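One possible shape for such a detection, entirely hypothetical: count how many consecutive OCC retries saw the subscribe's frontier still trailing the candidate write timestamp, and surface a notice (or fail the operation) past a threshold.

```rust
// Hypothetical heuristic for detecting a subscribe that keeps falling
// behind: after too many consecutive retries in which the subscribe's
// frontier still trails the candidate write timestamp, stop retrying
// silently and surface the problem to the user. Names are illustrative.

struct LagDetector {
    consecutive_behind: u32,
    max_consecutive: u32,
}

impl LagDetector {
    fn new(max_consecutive: u32) -> Self {
        LagDetector { consecutive_behind: 0, max_consecutive }
    }

    /// Record one OCC retry. Returns `false` once the loop should stop
    /// retrying (and, e.g., show a notice or fail the operation).
    fn should_retry(&mut self, subscribe_frontier: u64, write_ts: u64) -> bool {
        if subscribe_frontier < write_ts {
            self.consecutive_behind += 1;
        } else {
            self.consecutive_behind = 0;
        }
        self.consecutive_behind < self.max_consecutive
    }
}

fn main() {
    let mut detector = LagDetector::new(3);
    assert!(detector.should_retry(5, 10)); // behind once: keep going
    assert!(detector.should_retry(6, 10)); // behind twice: keep going
    assert!(!detector.should_retry(7, 10)); // third time: give up, notify
}
```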

@ggevay
Contributor

ggevay commented Feb 16, 2026

One more thought: the current ReadThenWrite code has the limitation that even the read part is not allowed to refer to sources, only to tables. (See AdapterError::InvalidTableMutationSelection.) This is because we can't put a lock on sources, so source contents might change by the time we complete the read and then choose a write timestamp. However, with the new approach, lifting this limitation might be possible? Edit: or maybe even trivial, because the read timestamp will simply be the same as the write timestamp, so the sources won't move forward between them.


4 participants