-
Notifications
You must be signed in to change notification settings - Fork 809
Support running embedded-zk in "ensemble" mode #2391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…found" with an out-dated ClusterState (apache#2363)" This reverts commit 5c399dd.
This commit augments our embedded-ZK code to support running embedded-ZK
in "quorum" or ensemble mode. Multiple Solr nodes can now all have
their embedded-ZK's join a multi-node quorum upon startup. Other than
Solr and ZK sharing a process, the embedded- ZK ensemble behaves
identically to one formed of independent processes: nodes can join or
leave the cluster, etc.
Embedded-ensemble-ZK is enabled any time the `zkQuorumRun` system
property is present, along with an explicitly specified ZK host string.
On startup, Solr will identify which host in the zk-conn-string it
should be (based on admittedly hacky heuristics), and then spins up a
'ZooKeeperServerEmbedded' instance in-process to join the ensemble. e.g.
```
export LH="localhost"
bin/solr start -p 8983 -z $LH:9983,$LH:9984,$LH:9985 -DzkQuorumRun
bin/solr start -p 8984 -z $LH:9983,$LH:9984,$LH:9985 -DzkQuorumRun
bin/solr start -p 8985 -z $LH:9983,$LH:9984,$LH:9985 -DzkQuorumRun
```
Some notes:
- this doesn't (yet) work with ZK's dynamic-ensemble feature, so all
ZK nodes must be specified in a static ZK conn string provided at
startup
- this appears to run best when the security-manager is disabled.
|
FYI: https://cwiki.apache.org/confluence/display/SOLR/SIP-14+Embedded+Zookeeper for context on others' interest in moving this direction. |
|
Great work. I think we should rip out the existing embedded ZK, which is a hack. Do you want to proceed in a PR or perhaps in a central feature branch? |
…ion not found" with an out-dated ClusterState (apache#2363)"" This reverts commit 1fb376b.
88c8da4 to
de59401
Compare
|
Spent a few minutes pulling in the latest 'main'. Hoping to give it a bit more cleanup in the next few days if I can, but I've been hoping that for more than a year at this point (😬 ), so if anyone else is interested please feel free to move this forward! |
|
@gerlowskija I took a stab at getting in the use of the solr roles to determine if we hsould run embedded zk quorum mode, and it worked! Here is my start script, where each runs with |
# Conflicts: # solr/core/src/java/org/apache/solr/core/ZkContainer.java # solr/packaging/build.gradle
|
|
FYI, the |
Uses a base class SolrCloudWithEmbeddedZkQuorumTestCase New MiniSolrCloudCluster constructor that spins up a quorum cluster
|
With the help of Claude Code I added a JUnit test |
epugh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three committers on the same PR! Woot!
| this.jettyConfig = Objects.requireNonNull(jettyConfig); | ||
| this.solrXml = solrXml == null ? DEFAULT_CLOUD_SOLR_XML : solrXml; | ||
| this.trackJettyMetrics = trackJettyMetrics; | ||
| this.externalZkServer = true; // No ZkTestServer in quorum mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something about this being true seems odd... Should it be false since there isn't a externalZkServer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, what is even an "external" zk in the context of our test framework?
| } | ||
|
|
||
| // Set system properties for embedded ZK quorum mode | ||
| System.setProperty("solr.zookeeper.server.enabled", "true"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit curious if we need this system property AND zookeeper_quorum:on? I wonder if the existence of zookeeper_quorum:on is enough? if this system property is what turns on zookeeper, I wish maybe it was solr.mode=clustered or something to specificy that it's the clustered (cloud) mode...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like node role should be enough, it should turn things on, perhaps set that property automagically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you dont have the node role, then you need it right? OR!!!????? here is an idea... What if you are starting a single solr node and want zk... The bin/solr start of SOlr 10, and was previoulsly bin/solr start -c in 9x... What if, instead, if you do NOT supply a -Dsolr.node.roles=data:on,overseer:allowed,zookeeper_quorum:on or a -z property then we do that FOR you in the start up script. I believe that we ALREADY have logic that if you don't provide a solr.node.roles then you get data:on,overseer:allowed by default. See https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/NodeRoles.java#L32
Would that simplify things both internally and from a user perspective?
You either:
Start Solr old school with bin/solr start --user-managed
Or you start single node with embedded zk with bin/solr start
Or you start three nodes with embedded zk quorum with bin/solr start -z blahblah:blahblah:blahblah -Dsolr.node.roles=data:on,overseer:allowed,zookeeper_quorum:on
Or you start N nodes and connect to an external ZK with bin/solr start -z external_zk:2181.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if -z should be reserved for connecting to an external ZK quorum, and we invent a new option for telling SolrCloud which solr nodes will run the zk node role? That would make it clearer to read, and we make ZK an implementation detail by not refering to it as ZK, but as quorum-node, e.g.
bin/solr start --quorum-nodes=solr-node-1:8983,solr-node-2:8983,solr-node-3:8983That would mean we want Solr to manage our ZK, and translate into the correct solr.zk.host setting for each host, no matter if the host you are starting is one of the quorum nodes or not. And if the node you are starting is a quorum node, it will detect that it's own host name matches a quorum host, and add -Dsolr.node.roles... during startup, and the ZK bootstrap code will configure its internal ZK config using the --quorum / -Dsolr.quorum.nodes values.
| if (!zkClient.exists("/solr", true)) { | ||
| zkClient.makePath("/solr", false, true); | ||
| } | ||
| if (!zkClient.exists("/solr/initialized", true)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a concept in solr? this path? New to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be some hallucination going on
| formatZkServer); | ||
| MiniSolrCloudCluster cluster; | ||
|
|
||
| if (useEmbeddedZkQuorum) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this big if statement seems challenging... Could a builder pattern make it easier? Thir are a lot of params and overlap with subtle differences ;-).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this code is inside the builder :) So the builder chose to instantiate the MSCC using one of two constructors.
solr/test-framework/src/java/org/apache/solr/cloud/SolrCloudWithEmbeddedZkQuorumTestCase.java
Outdated
Show resolved
Hide resolved
|
|
||
| @Test | ||
| public void testBasicQuorumFunctionality() throws Exception { | ||
| log.info("Starting testBasicQuorumFunctionality"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if claude tools could go back and revamp old unit tests to be clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is verifying the happy path! I think that we need a test that demonstrates that when you LOSE a node from a three node cluster, we still function, just degraded. I don't know if we have a test that verifies this behavior elsewhere (a stress test??) that could be reused with embedded zk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There used to be such a test in the first phase, but it kept failing, so it was removed 😂🤣
Good point though. With this foundation it should be possible to keep adding more exceptional tests. Hey, perahps even randomize running with and without qorum. But currently we requre >=3 nodes. We could evolve the quorum to use 1 zk node if <3 and use 3 zk nodes for >=3. Then if you kill a solr node, it should all work even if that node also has zk.
But I think there were some issues with stopping a JettySolrRunner in quorum mode and then restarting it...
| "Cannot start Solr in cloud mode - no cloud config provided"); | ||
| } | ||
|
|
||
| if (config == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this even get hit? Sometimes I get confused by all the null checks! If you are running in standalone, wouldn't this class never be run???!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who knows? Tons of tests failed due to NPE, and the existing code on main has this check, which was removed in this PR...
…ing flag to not do the saem thing twice Our code is just single threaded at this point...
|
I asked Claude to summarize the last seven commits and he said: Recent work on the I think my summary is: I discovered that we have TWO ways of starting ZK, not including our independent We now have:
I am now wondering could we convert SolrZkServer to using a Secondly, now seeing the use of Thirdly, why is I did take a stab at the |
Thanks for chipping away on this. At some point we'll have something that can be released in 10.1 or 10.2 as an experimental feature for folks to try out, without dynamic reconfig. Then we can tackle the dynamic stuff... |
|
Thanks for taking a look at this @janhoy I think I've stalled a bit on this... It feels "so close' yet not sure how to get it over the hump. |
Description
Prior to this commit, Solr only supported running embedded-ZK in "standalone" mode (which cannot take part in any larger ZK ensemble or quorum). But there are usecases that would benefit from being able to do this, both on the development and testing side, and event for adventurous users who might want the benefits of a small multi-node SolrCloud cluster without the headache of also deploying ZK.
Solution
This commit augments our embedded-ZK code to support running embedded-ZK
in "quorum" or ensemble mode. Multiple Solr nodes can now all have
their embedded-ZK's join a multi-node quorum upon startup. Other than
Solr and ZK sharing a process, the embedded- ZK ensemble behaves
identically to one formed of independent processes: nodes can join or
leave the cluster, etc.
Embedded-ensemble-ZK is enabled any time the
zkQuorumRunsystemproperty is present, along with an explicitly specified ZK host string.
On startup, Solr will identify which host in the zk-conn-string it
should be (based on admittedly hacky heuristics), and then spins up a
'ZooKeeperServerEmbedded' instance in-process to join the ensemble. e.g.
Some notes:
- this doesn't (yet) work with ZK's dynamic-ensemble feature, so all
ZK nodes must be specified in a static ZK conn string provided at
startup
- this appears to run best when the security-manager is disabled.
- the interface in particular for how we expose this is pretty rough, and there's a lot of room for improvement.
Tests
None, yet.
Checklist
Please review the following and check all that apply:
mainbranch../gradlew check.