SOLR-17821: Fix error scenario for ShardInstall or Restore #3434

HoustonPutman · 2025-07-21T20:20:59Z

https://issues.apache.org/jira/browse/SOLR-17821

The scenario:

A restore or shard install is called on a shard
A non-leader replica succeeds; all others fail

Currently, the following happens:

The ZK Shard terms are updated to ensure that all terms are non-zero
A failure is returned
But the cluster state is unchanged, and all replicas are still in the state they started in, even though they do not all have the same index

What we want to happen:

The ZK Shard terms are updated such that the successful replica(s) have the highest terms
Since the leader is no longer the highest term, it should give up leadership
All failing replicas should go into Leader-Initiated-Recovery
Once recovery has started, our InstallShard/Restore command can succeed, since the results will be what the user expects
- We can add a waitForAllReplicasToBeHealthy option to wait for the recoveries to finish
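
The desired term handling above can be sketched as follows. This is a minimal, hypothetical model and not Solr's actual shard-term code: the class `ShardTermsSketch` and its methods (`markSuccessful`, `shouldGiveUpLeadership`, `replicasNeedingRecovery`) are names invented for illustration, and replicas are identified by plain strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, simplified model of per-replica shard terms (illustration only). */
final class ShardTermsSketch {
  private final Map<String, Long> terms;

  ShardTermsSketch(Map<String, Long> terms) {
    this.terms = new HashMap<>(terms);
  }

  /** Highest term held by any replica, or 0 if there are no replicas. */
  long maxTerm() {
    long max = 0L;
    for (long term : terms.values()) {
      max = Math.max(max, term);
    }
    return max;
  }

  /** Returns a copy in which only the successful replicas hold the new highest term. */
  ShardTermsSketch markSuccessful(Set<String> successfulReplicas) {
    Map<String, Long> updated = new HashMap<>(terms);
    long newMax = maxTerm() + 1;
    for (String replica : successfulReplicas) {
      if (updated.containsKey(replica)) {
        updated.put(replica, newMax);
      }
    }
    return new ShardTermsSketch(updated);
  }

  /** A leader whose term is below the highest term should step down. */
  boolean shouldGiveUpLeadership(String leader) {
    return terms.getOrDefault(leader, 0L) < maxTerm();
  }

  /** Replicas below the highest term are the ones that must recover. */
  Set<String> replicasNeedingRecovery() {
    long max = maxTerm();
    Set<String> behind = new TreeSet<>();
    for (Map.Entry<String, Long> entry : terms.entrySet()) {
      if (entry.getValue() < max) {
        behind.add(entry.getKey());
      }
    }
    return behind;
  }
}
```

In this sketch, starting from equal terms for `leader`, `replicaA` and `replicaB` and marking only `replicaA` as successful leaves `replicaA` alone at the highest term, so the leader should step down and both `leader` and `replicaB` need recovery.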

This requires a few changes:

Obviously we want to fix the Restore and InstallShard commands to update shard terms correctly
Leadership should be given up when the shard term is lower than the highest shard term
Recovery should succeed even though the collection is in read-only mode
The tests should be able to test that the leader fails, and all other replicas succeed
Restore and InstallShard should manipulate the responses, so that the AsyncTracker does not think we are unsuccessful when replicas are put into recovery
We should add flags so that the user can control which replicas to download to, and when the response should be sent back (after recovery or not).
CollectionHandlingUtils needs to encode and save coreName with requests/responses, in order to distinguish multiple core requests sent to the same node.
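
The last point, telling apart several core requests sent to the same node, can be illustrated with a small sketch. It assumes nothing about the real CollectionHandlingUtils code: the class `PerCoreResponses`, its methods, and the status strings are hypothetical and only show why the node name alone is not a sufficient key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical response store keyed by node name and then by core name (illustration only). */
final class PerCoreResponses {
  private final Map<String, Map<String, String>> byNodeThenCore = new LinkedHashMap<>();

  /** Records the outcome of one core-level request sent to the given node. */
  void record(String nodeName, String coreName, String status) {
    byNodeThenCore
        .computeIfAbsent(nodeName, node -> new LinkedHashMap<>())
        .put(coreName, status);
  }

  /** Returns the recorded status for that core on that node, or null if none exists. */
  String statusOf(String nodeName, String coreName) {
    Map<String, String> cores = byNodeThenCore.get(nodeName);
    return cores == null ? null : cores.get(coreName);
  }

  /** Total number of core-level responses recorded across all nodes. */
  int size() {
    int count = 0;
    for (Map<String, String> cores : byNodeThenCore.values()) {
      count += cores.size();
    }
    return count;
  }
}
```

If responses were keyed by node name only, a failure for one core would overwrite a success for another core on the same node. Keying by core name as well keeps both outcomes.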

HoustonPutman · 2025-07-21T20:21:26Z

Some of the code is kind of hacky right now. But the bad stuff shouldn't be too hard to clean up.

HoustonPutman · 2025-07-21T21:31:24Z

The implementation works for InstallShard, but we need to add this same functionality to Restore as well.

github-actions · 2025-09-20T00:11:09Z

This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the dev@solr.apache.org mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution!

github-actions · 2025-11-19T00:12:44Z

This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again.

HoustonPutman added 2 commits July 21, 2025 11:19

Start with test

ab05bdd

Solve issue, but need to clean up. Need to fix for Restore as well

8328c78

HoustonPutman requested a review from gerlowskija July 21, 2025 20:20

github-actions bot added test-framework client:solrj cat:search cat:cloud cat:index cat:api labels Jul 21, 2025

HoustonPutman changed the title ~~SOLR-17821: Fix error scenario for ShardInstall or Recover~~ SOLR-17821: Fix error scenario for ShardInstall or Restore Jul 21, 2025

HoustonPutman added 2 commits July 21, 2025 13:31

Use new shardrequest constructs for SyncStrategy

d1d6bf5

Response status should now be the number of replicas

f5ee234

github-actions bot added the tests label Jul 21, 2025

HoustonPutman added 6 commits July 22, 2025 11:12

Cleanup unused parts of tests

02d1e90

Huge commit - restore uses installshard - big update in locking

1be4ed0

Implement callingLock mirroring for distributed API Manager locking

8365a23

Huge commit - restore uses installshard - big update in locking

00a67b4

Implement callingLock mirroring for distributed API Manager locking

2d5e12b

Merge remote-tracking branch 'apache/main' into locking-update

37cea68

github-actions bot added the stale (PR not updated in 60 days) label Sep 20, 2025

github-actions bot added the closed-stale (Closed after being stale for 60 days) label Nov 19, 2025

github-actions bot closed this Nov 19, 2025

HoustonPutman reopened this Nov 19, 2025

github-actions bot removed the closed-stale (Closed after being stale for 60 days) and stale (PR not updated in 60 days) labels Nov 20, 2025

Merge remote-tracking branch 'apache/main' into locking-update

293f35d

HoustonPutman added 18 commits January 13, 2026 11:49

Move over remaining APIs and tests

4a49aa9

Merge remote-tracking branch 'apache/main' into solr-18011-locking-up…

021303a

…date

Merge remote-tracking branch 'apache/main' into solr-18011-locking-up…

e61279c

…date

Fix changelog entry

4c9a766

Remove files that shouldn't be changed.

c7cfe19

One more that shouldn't be changed

08326cc

Merge branch 'solr-18011-locking-update' into fix-restore-partial-suc…

eaddd46

…cess

Some more fixes

b90c283

SOLR-18080: Initiate Leader election for ShardTerms

b13d97e

Add changelog entry

7522caa

Fix precommit issues

776c28b

Fixes for passing callingLockIds around

f3e6ff8

Merge branch 'solr-18080-shard-term-induce-leader-election' into fix-…

a78ece5

…restore-partial-success

Fixes for syncStrategy with empty indexes

d2dad8e

Improve tests for recovery

198f01b

Make read only check better

a667bb4

Improve logging for results

ed8c839

Various fixes

3b898f9

github-actions bot added jetty-server module:gcs-repository labels Jan 24, 2026

HoustonPutman added 9 commits January 26, 2026 18:04

Let some parts of solr fetch an index writer for a read-only core

bb37120

Tidy

3135c5b

Fix collection and shard term deletion

e3d1d26

Tidy

40b62f6

make another wait conditional on not readOnly

da598a5

Add changelog entry, remove others

1b71b72

Fix no uLog error case

f6062da

Fix test

714a705

InstallShardTest failure scenario should not just be s3

201d31d

github-actions bot added the module:opentelemetry label Jan 28, 2026
