Skip to content

logical: fix ReplicationError DistSQL propagation#170835

Open
DarrylWong wants to merge 4 commits into
cockroachdb:masterfrom
DarrylWong:ldr-adjust-end-time
Open

logical: fix ReplicationError DistSQL propagation#170835
DarrylWong wants to merge 4 commits into
cockroachdb:masterfrom
DarrylWong:ldr-adjust-end-time

Conversation

@DarrylWong
Copy link
Copy Markdown
Contributor

This PR fixes how LDR txn mode propagates ReplicationErrors. See individual commits for explanation of fixes.

Fixes: #170452
Release note: none

Running on both single node and multinode is a regression test for the applier
not properly propagating errors that were supposed to shut down the flow. Specifically
when all the processors are draining, the DistSQL router knows to shut down. This behavior
made it seem as though the error propagation was working as intended.
Previously, EndTime was set to the timestamp of a replication
conflict, and we wanted to replicate timestamps strictly less
than the conflict. However, this was fragile as it required
all consumers of EndTime to remember to call EndTime.Prev().

Instead, we change EndTime to be inclusive and set it to the
timestamp of the replication conflict minus one. In other words,
we used to stop the flow once the replicated time was one logical
tick before the EndTime. Now we stop the flow once the replicated
time is at the EndTime.

Additionally, we add a fast path to skip spinning up the DistSQL
flow all together if we have already reached the EndTime. This
fixes a latent bug where we failed to properly pause the job
when there were no events to process (and advance the frontier)
and we were exiting early before checking if we reached the end
time.
@DarrylWong DarrylWong requested review from a team as code owners May 22, 2026 21:08
@DarrylWong DarrylWong requested review from jeffswenson and michae2 and removed request for a team May 22, 2026 21:08
@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io Bot commented May 22, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

p.grp.GoCtx(func(ctx context.Context) error {
defer close(p.applierEvents)
return p.runInputReader(ctx)
p.sendError(p.runInputReader(ctx))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeffswenson you mentioned you weren't a fan of this pattern but I think I misunderstood what the alternative was. We can't stick the sendError in the error group Wait() as it is called in close(). We need the error to be sent to Next() so we can start draining, but Close() is not called until after we start draining.

Previously, we propagated applier errors through TrailingMetaCallback,
which only fires once an applier has finished draining. However, we
only finish draining once the coordinator stops sending rows. When we
only have a single applier, the coordinator sees we have zero
non-draining appliers and begins shutdown. However when we have
multiple appliers, the coordinator never stops sending rows so it
never receives the error telling it to stop.

Instead, we remove the TrailingMetaCallback approach and replace it
with an errCh that will be checked in the Next() flow and signal
draining.
We may need to send the ReplicationError over the wire
if the coordinator is on a different node. Previously
we were not properly serializing the error and we would
lose the ReplicationError unwrapping, causing the pause
detection to fail.
@DarrylWong DarrylWong force-pushed the ldr-adjust-end-time branch from fdee629 to 06482ff Compare May 22, 2026 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

crosscluster/logical/txnmode: TestTxnModePauseOnConflict failed

2 participants