logical: fix ReplicationError DistSQL propagation#170835
Open
DarrylWong wants to merge 4 commits into
Open
Conversation
Running on both single node and multinode is a regression test for the applier not properly propagating errors that were supposed to shut down the flow. Specifically when all the processors are draining, the DistSQL router knows to shut down. This behavior made it seem as though the error propagation was working as intended.
Previously, EndTime was set to the timestamp of a replication conflict, and we wanted to replicate timestamps strictly less than the conflict. However, this was fragile as it required all consumers of EndTime to remember to call EndTime.Prev(). Instead, we change EndTime to be inclusive and set it to the timestamp of the replication conflict minus one. In other words, we used to stop the flow once the replicated time was one logical tick before the EndTime. Now we stop the flow once the replicated time is at the EndTime. Additionally, we add a fast path to skip spinning up the DistSQL flow all together if we have already reached the EndTime. This fixes a latent bug where we failed to properly pause the job when there were no events to process (and advance the frontier) and we were exiting early before checking if we reached the end time.
Contributor
|
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
Member
DarrylWong
commented
May 22, 2026
| p.grp.GoCtx(func(ctx context.Context) error { | ||
| defer close(p.applierEvents) | ||
| return p.runInputReader(ctx) | ||
| p.sendError(p.runInputReader(ctx)) |
Contributor
Author
There was a problem hiding this comment.
@jeffswenson you mentioned you weren't a fan of this pattern but I think I misunderstood what the alternative was. We can't stick the sendError in the error group Wait() as it is called in close(). We need the error to be sent to Next() so we can start draining, but Close() is not called until after we start draining.
Previously, we propagated applier errors through TrailingMetaCallback, which only fires once an applier has finished draining. However, we only finish draining once the coordinator stops sending rows. When we only have a single applier, the coordinator sees we have zero non-draining appliers and begins shutdown. However when we have multiple appliers, the coordinator never stops sending rows so it never receives the error telling it to stop. Instead, we remove the TrailingMetaCallback approach and replace it with an errCh that will be checked in the Next() flow and signal draining.
We may need to send the ReplicationError over the wire if the coordinator is on a different node. Previously we were not properly serializing the error and we would lose the ReplicationError unwrapping, causing the pause detection to fail.
fdee629 to
06482ff
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR fixes how LDR txn mode propagates ReplicationErrors. See individual commits for explanation of fixes.
Fixes: #170452
Release note: none