doc/design: add design doc for incremental, occ-based read-then-write#34977
doc/design: add design doc for incremental, occ-based read-then-write#34977aljoscha wants to merge 1 commit intoMaterializeInc:mainfrom
Conversation
|
|
||
| - If the timestamp is still valid: the write is committed at exactly that | ||
| timestamp, and the oracle is advanced past it. Any concurrent OCC loops that | ||
| were targeting the same timestamp will fail and retry. |
There was a problem hiding this comment.
Do we have to worry about how table commit advances the logical frontier for all tables? At first blush it seems like you might get spurious conflicts . (Where eg. my write to table A advanced the frontier for table B, so my write attempt to B fails even though it hasn't changed.)
In theory the txn mechanism has enough metadata in its shard to know that there are not actually any changes in B between my read time and the new frontier, so it was safe to advance the write timestamp further. But I suppose we're going to find that out via the subscribe pretty quickly anyways...
There was a problem hiding this comment.
Yeah, any table write will clobber your write attempt. And as is in the prototype, you would learn through the subscribe that nothing changed and re-try. Not ideal, but also ... 🤷
| This removes a significant amount of complexity and uncertainty from the | ||
| codebase. | ||
|
|
||
| ## Alternative: distributed locking service |
There was a problem hiding this comment.
An alternative I'd like to see discussed is having the OCC loop run in clusterd as a dataflow export. On first glance, that seems attractive because it avoids having to stash all the data read in environmentd (which might oom or abort the read because its response is too large). I imagine it might not work because table writes have to be performed by envd, possibly due to some txn-wal reason?
There was a problem hiding this comment.
Well, I think we eventually want to make multiple envds able to submit read-then-writes concurrently, in which case whatever multiple envds can cooperate on, a clusterd should be able to do as well, right? E.g., clusterds should also able to access the timestamp oracle.
|
|
||
| The goal is not to make writes faster, but to not regress significantly. | ||
| Benchmarking a PoC-level implementation of the OCC approach against `main` for | ||
| `UPDATE t SET x = x + 1` shows the following: |
There was a problem hiding this comment.
Peeks can be faster than subscribes in various situations:
- If there is an index, then a fast-path peek can avoid building a dataflow.
- We can do monotonic TopKs and min/max.
Anyhow, if this turns out that it matters for some users, then a follow-up optimization could be to first try a peek-and-write-at-the-same-timestamp instead of a subscribe, and fall back to the subscribe-based approach only if the first write doesn't complete fast enough and fails.
| 3. After committing at timestamp T, the oracle advances past T, so additional | ||
| writes at T would fail anyway. We fail them early to avoid unnecessary work. | ||
|
|
||
| ### Comparison with the old approach |
There was a problem hiding this comment.
Maybe somewhat of a corner case, but a slight concern is what happens when the subscribe can't keep up with input changes, and thus the operation can never complete. With the old approach, the operation would usually either complete after a while or OOM the cluster, making some noise. In contrast, the new approach can silently get into a state where it can never finish, e.g., if there is some window function or cross join that keeps rewriting a significant portion of the subscribe's output at every input change. Maybe we could try to detect if the subscribe just keeps falling behind, and show a notice to the user, or even entirely fail the operation.
|
One more thought: the current ReadThenWrite code has the limitation that even the read part is not allowed to refer to sources, only to tables. (See |
Rendered: https://github.com/aljoscha/materialize/blob/design-incremental-occ-read-then-write/doc/developer/design/20260210_incremental_occ_read_then_write.md