@HoustonPutman HoustonPutman commented Jul 21, 2025

https://issues.apache.org/jira/browse/SOLR-17821

The scenario:

  • A restore or shard install is called on a shard
  • A non-leader replica succeeds; all others fail

Currently, the following happens:

  • The ZK Shard terms are updated to ensure that all terms are non-zero
  • A failure is returned
  • But the cluster state is unchanged, and all replicas are still in the state they started in, even though they do not all have the same index

What we want to happen:

  • The ZK Shard terms are updated such that the successful replica(s) are the highest terms
  • Since the leader is no longer the highest term, it should give up leadership
  • All failing replicas should go into Leader-Initiated-Recovery
  • Once recovery has started, our InstallShard/Restore command can succeed, since the end result will be what the user expects
    • We can add a waitForAllReplicasToBeHealthy option to wait for the recoveries to finish
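
The desired term handling above can be sketched with a plain in-memory map. This is only an illustration under stated assumptions: `ShardTermsSketch`, `promoteSuccessful`, and `isBehind` are hypothetical names, the map stands in for the terms stored in ZK, and none of this is the code in this PR or Solr's actual shard-terms API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory model of per-replica shard terms (not Solr's implementation). */
public class ShardTermsSketch {

  /** Highest term currently held by any replica, or 0 if there are none. */
  private static long maxTerm(Map<String, Long> terms) {
    long max = 0L;
    for (long term : terms.values()) {
      max = Math.max(max, term);
    }
    return max;
  }

  /**
   * Returns a copy of the terms in which every successful replica sits one above
   * the previous maximum, so each of them outranks every replica that failed.
   */
  public static Map<String, Long> promoteSuccessful(
      Map<String, Long> terms, Set<String> successful) {
    long newTerm = maxTerm(terms) + 1;
    Map<String, Long> updated = new HashMap<>(terms);
    for (String replica : successful) {
      if (updated.containsKey(replica)) {
        updated.put(replica, newTerm);
      }
    }
    return updated;
  }

  /**
   * True when the replica's term is below the highest term. Under the behavior
   * proposed above, a leader in this state would give up leadership and any
   * other replica in this state would need to recover.
   */
  public static boolean isBehind(Map<String, Long> terms, String replica) {
    Long own = terms.get(replica);
    return own == null || own < maxTerm(terms);
  }
}
```

For example, if the leader and two followers all start at term 3 and only one follower installs the new index, that follower moves to term 4, and `isBehind` becomes true for the leader and the other follower.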

This requires a few changes:

  • Obviously we want to fix the Restore and InstallShard commands to update shard terms correctly
  • Leadership should be given up when the shard term is lower than the highest shard term
  • Recovery should succeed even though the collection is in read-only mode
  • The tests should be able to exercise the case where the leader fails and all other replicas succeed
  • Restore and InstallShard should manipulate the responses, so that the AsyncTracker does not think we are unsuccessful when replicas are put into recovery
  • We should add flags so that the user can control which replicas to download to, and when the response should be sent back (after recovery or not).
  • CollectionHandlingUtils needs to encode and save coreName with requests/responses, in order to distinguish multiple core requests sent to the same node.
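
To illustrate the last point, here is a minimal sketch of why responses need the core name as well as the node name. `PerCoreResponsesSketch` and its methods are hypothetical and are not how CollectionHandlingUtils stores responses; the node and core names in the usage note are made up as well.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical response tracker keyed by node name and then by core name. */
public class PerCoreResponsesSketch {

  // node name -> (core name -> outcome). Keying by node alone would let a
  // second core on the same node overwrite the first core's outcome.
  private final Map<String, Map<String, String>> outcomes = new LinkedHashMap<>();

  /** Records the outcome of a request sent to one core on one node. */
  public void record(String nodeName, String coreName, String outcome) {
    outcomes.computeIfAbsent(nodeName, n -> new LinkedHashMap<>()).put(coreName, outcome);
  }

  /** Returns the recorded outcome for that core, or null if none was recorded. */
  public String outcomeFor(String nodeName, String coreName) {
    Map<String, String> perCore = outcomes.get(nodeName);
    return perCore == null ? null : perCore.get(coreName);
  }

  /** Total number of per-core outcomes recorded across all nodes. */
  public int size() {
    int total = 0;
    for (Map<String, String> perCore : outcomes.values()) {
      total += perCore.size();
    }
    return total;
  }
}
```

With two cores on one node, say `record("node1", "coreA", "success")` followed by `record("node1", "coreB", "failure")`, both outcomes stay retrievable, whereas a map keyed only by `node1` would keep just the last one.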

@HoustonPutman
Contributor Author

Some of the code is kind of hacky right now. But the bad stuff shouldn't be too hard to clean up.

@HoustonPutman HoustonPutman changed the title SOLR-17821: Fix error scenario for ShardInstall or Recover SOLR-17821: Fix error scenario for ShardInstall or Restore Jul 21, 2025
@github-actions github-actions bot added the tests label Jul 21, 2025
@HoustonPutman
Contributor Author

The implementation works for InstallShard, but we need to add this same functionality to Restore as well.

@github-actions

This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the dev@solr.apache.org mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label Sep 20, 2025
@github-actions

This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again.

@github-actions github-actions bot added the closed-stale Closed after being stale for 60 days label Nov 19, 2025
@github-actions github-actions bot closed this Nov 19, 2025
@HoustonPutman HoustonPutman reopened this Nov 19, 2025
@github-actions github-actions bot removed closed-stale Closed after being stale for 60 days stale PR not updated in 60 days labels Nov 20, 2025