logical: fix ReplicationError DistSQL propagation by DarrylWong · Pull Request #170835 · cockroachdb/cockroach

DarrylWong · 2026-05-22T21:08:05Z

This PR fixes how LDR txn mode propagates ReplicationErrors. See individual commits for explanation of fixes.

Fixes: #170452
Release note: none

Running on both single node and multinode is a regression test for the applier not properly propagating errors that were supposed to shut down the flow. Specifically when all the processors are draining, the DistSQL router knows to shut down. This behavior made it seem as though the error propagation was working as intended.

Previously, EndTime was set to the timestamp of a replication conflict, and we wanted to replicate timestamps strictly less than the conflict. However, this was fragile as it required all consumers of EndTime to remember to call EndTime.Prev(). Instead, we change EndTime to be inclusive and set it to the timestamp of the replication conflict minus one. In other words, we used to stop the flow once the replicated time was one logical tick before the EndTime. Now we stop the flow once the replicated time is at the EndTime. Additionally, we add a fast path to skip spinning up the DistSQL flow all together if we have already reached the EndTime. This fixes a latent bug where we failed to properly pause the job when there were no events to process (and advance the frontier) and we were exiting early before checking if we reached the end time.

trunk-io · 2026-05-22T21:08:10Z

Merging to master in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

cockroach-teamcity · 2026-05-22T21:08:30Z

This change is

DarrylWong · 2026-05-22T21:10:52Z

 	p.grp.GoCtx(func(ctx context.Context) error {
 		defer close(p.applierEvents)
-		return p.runInputReader(ctx)
+		p.sendError(p.runInputReader(ctx))


@jeffswenson you mentioned you weren't a fan of this pattern but I think I misunderstood what the alternative was. We can't stick the sendError in the error group Wait() as it is called in close(). We need the error to be sent to Next() so we can start draining, but Close() is not called until after we start draining.

Previously, we propagated applier errors through TrailingMetaCallback, which only fires once an applier has finished draining. However, we only finish draining once the coordinator stops sending rows. When we only have a single applier, the coordinator sees we have zero non-draining appliers and begins shutdown. However when we have multiple appliers, the coordinator never stops sending rows so it never receives the error telling it to stop. Instead, we remove the TrailingMetaCallback approach and replace it with an errCh that will be checked in the Next() flow and signal draining.

We may need to send the ReplicationError over the wire if the coordinator is on a different node. Previously we were not properly serializing the error and we would lose the ReplicationError unwrapping, causing the pause detection to fail.

DarrylWong added 2 commits May 22, 2026 13:35

DarrylWong requested review from a team as code owners May 22, 2026 21:08

DarrylWong requested review from jeffswenson and michae2 and removed request for a team May 22, 2026 21:08

DarrylWong commented May 22, 2026

View reviewed changes

DarrylWong added 2 commits May 22, 2026 17:12

DarrylWong force-pushed the ldr-adjust-end-time branch from fdee629 to 06482ff Compare May 22, 2026 21:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

logical: fix ReplicationError DistSQL propagation#170835

logical: fix ReplicationError DistSQL propagation#170835
DarrylWong wants to merge 4 commits into
cockroachdb:masterfrom
DarrylWong:ldr-adjust-end-time

DarrylWong commented May 22, 2026

Uh oh!

trunk-io Bot commented May 22, 2026

Uh oh!

cockroach-teamcity commented May 22, 2026

Uh oh!

DarrylWong May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

DarrylWong commented May 22, 2026

Uh oh!

trunk-io Bot commented May 22, 2026

Uh oh!

cockroach-teamcity commented May 22, 2026

Uh oh!

DarrylWong May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants