Conversation

@gerlowskija
Contributor

Description

Prior to this commit, Solr only supported running embedded-ZK in "standalone" mode (which cannot take part in any larger ZK ensemble or quorum). But there are use cases that would benefit from running embedded ZK as part of an ensemble, both on the development and testing side, and even for adventurous users who might want the benefits of a small multi-node SolrCloud cluster without the headache of also deploying ZK.

Solution

This commit augments our embedded-ZK code to support running embedded-ZK
in "quorum" or ensemble mode. Multiple Solr nodes can now all have
their embedded ZKs join a multi-node quorum upon startup. Other than
Solr and ZK sharing a process, the embedded-ZK ensemble behaves
identically to one formed of independent processes: nodes can join or
leave the cluster, etc.

Embedded-ensemble-ZK is enabled any time the `zkQuorumRun` system
property is present, along with an explicitly specified ZK host string.
On startup, Solr identifies which host in the zk-conn-string it
should be (based on admittedly hacky heuristics), and then spins up a
'ZooKeeperServerEmbedded' instance in-process to join the ensemble, e.g.

```
export LH="localhost"

# Run each below in a separate terminal, one for each host in the zk-conn string
bin/solr start -p 8983 -z $LH:9983,$LH:9984,$LH:9985 -DzkQuorumRun
bin/solr start -p 8984 -z $LH:9983,$LH:9984,$LH:9985 -DzkQuorumRun
bin/solr start -p 8985 -z $LH:9983,$LH:9984,$LH:9985 -DzkQuorumRun
```
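
For illustration only, here is a minimal Java sketch of the kind of self-identification the description alludes to ("admittedly hacky heuristics"). It is not the PR's actual code; the assumption that the embedded ZK client port is the Solr port plus 1000 (8983 -> 9983, as in the example above) is mine.

```
// Illustrative sketch only: pick out which entry in the ZK connection string
// belongs to this node. Assumes the embedded ZK client port follows the usual
// "Solr port + 1000" convention (8983 -> 9983); the real PR heuristics may differ.
import java.net.InetAddress;
import java.net.UnknownHostException;

public class EmbeddedZkSelfLocator {

  /**
   * Returns the 1-based position ("myid") of the entry in zkHostString that
   * appears to belong to this Solr node, or -1 if no entry matches.
   */
  public static int locateSelf(String zkHostString, int solrPort) throws UnknownHostException {
    String localHost = InetAddress.getLocalHost().getHostName();
    int expectedZkPort = solrPort + 1000; // assumed convention, see note above
    String[] entries = zkHostString.split(",");
    for (int i = 0; i < entries.length; i++) {
      String[] hostAndPort = entries[i].trim().split(":");
      String host = hostAndPort[0];
      int port = Integer.parseInt(hostAndPort[1]);
      boolean hostMatches = host.equals("localhost") || host.equalsIgnoreCase(localHost);
      if (hostMatches && port == expectedZkPort) {
        return i + 1; // ZooKeeper server ids are conventionally 1-based
      }
    }
    return -1;
  }
}
```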

Some notes:
- this doesn't (yet) work with ZK's dynamic-ensemble feature, so all
ZK nodes must be specified in a static ZK conn string provided at
startup
- this appears to run best when the security-manager is disabled.
- the interface for how we expose this, in particular, is pretty rough, and there's a lot of room for improvement.

Tests

None, yet.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@gerlowskija
Contributor Author

FYI: https://cwiki.apache.org/confluence/display/SOLR/SIP-14+Embedded+Zookeeper for context on others' interest in moving in this direction.

@janhoy
Contributor

janhoy commented Apr 29, 2024

Great work. I think we should rip out the existing embedded ZK, which is a hack.

Do you want to proceed in a PR or perhaps in a central feature branch?

@gerlowskija
Contributor Author

Spent a few minutes pulling in the latest 'main'. Hoping to give it a bit more cleanup in the next few days if I can, but I've been hoping that for more than a year at this point (😬 ), so if anyone else is interested please feel free to move this forward!

@epugh
Contributor

epugh commented Oct 14, 2025

@gerlowskija I took a stab at using the Solr node roles to determine whether we should run embedded ZK in quorum mode, and it worked! Here is my start script; each node runs with -f so you need three terminals. Omit the -f and you don't need the separate terminals.

```
mkdir -p ./three_nodes/node1/solr
mkdir -p ./three_nodes/node2/solr
mkdir -p ./three_nodes/node3/solr

SOLR_SECURITY_MANAGER_ENABLED=false bin/solr start -f -p 8983 --solr-home "./three_nodes/node1/solr" -z localhost:9983,localhost:9984,localhost:9985 -Dsolr.node.roles=data:on,overseer:allowed,zookeeper_quorum:on
SOLR_SECURITY_MANAGER_ENABLED=false bin/solr start -f -p 8984 --solr-home "./three_nodes/node2/solr" -z localhost:9983,localhost:9984,localhost:9985 -Dsolr.node.roles=data:on,overseer:allowed,zookeeper_quorum:on
SOLR_SECURITY_MANAGER_ENABLED=false bin/solr start -f -p 8985 --solr-home "./three_nodes/node3/solr" -z localhost:9983,localhost:9984,localhost:9985 -Dsolr.node.roles=data:on,overseer:allowed,zookeeper_quorum:on
```

@janhoy
Contributor

janhoy commented Oct 15, 2025

  • Merged in main branch
  • Properly clean up the ZK server resource in close()

@epugh
Contributor

epugh commented Oct 16, 2025

FYI, the .bats test worked a few times for me and then quit working ;-(

Uses a base class SolrCloudWithEmbeddedZkQuorumTestCase
New MiniSolrCloudCluster constructor that spins up a quorum cluster
@janhoy
Contributor

janhoy commented Oct 16, 2025

With the help of Claude Code I added a JUnit test TestEmbeddedZkQuorum, that actually spins up a quorum cluster with zookeeper_quorum:on node roles (after securing a port so we know beforehand what the zkHost string will be).

@epugh epugh left a comment

Three committers on the same PR! Woot!

```
this.jettyConfig = Objects.requireNonNull(jettyConfig);
this.solrXml = solrXml == null ? DEFAULT_CLOUD_SOLR_XML : solrXml;
this.trackJettyMetrics = trackJettyMetrics;
this.externalZkServer = true; // No ZkTestServer in quorum mode
```

Something about this being true seems odd... Should it be false, since there isn't an externalZkServer?


Yea, what is even an "external" zk in the context of our test framework?

```
}

// Set system properties for embedded ZK quorum mode
System.setProperty("solr.zookeeper.server.enabled", "true");
```

I am a bit curious if we need this system property AND zookeeper_quorum:on? I wonder if the existence of zookeeper_quorum:on is enough? If this system property is what turns on ZooKeeper, I wish maybe it were solr.mode=clustered or something to specify that it's the clustered (cloud) mode...


Sounds like node role should be enough, it should turn things on, perhaps set that property automagically?
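
For what it's worth, that "automagic" mapping could be as small as something like this (property and role names copied from this thread; a sketch of the idea, not code from the PR):

```
// Hypothetical sketch: derive the ZK-server property from the node role, so users
// only need to set the role. Names come from this discussion, not from the PR.
public class ZkRoleDefaults {

  public static void applyZkDefaultsFromRoles() {
    String roles = System.getProperty("solr.node.roles", "");
    boolean alreadySet = System.getProperty("solr.zookeeper.server.enabled") != null;
    if (roles.contains("zookeeper_quorum:on") && !alreadySet) {
      System.setProperty("solr.zookeeper.server.enabled", "true");
    }
  }
}
```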


If you don't have the node role, then you need it, right? Or, here is an idea... What if you are starting a single Solr node and want ZK... the bin/solr start of Solr 10, which was previously bin/solr start -c in 9x... What if, instead, when you do NOT supply a -Dsolr.node.roles=data:on,overseer:allowed,zookeeper_quorum:on or a -z property, we do that FOR you in the startup script? I believe that we ALREADY have logic that if you don't provide solr.node.roles then you get data:on,overseer:allowed by default. See https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/NodeRoles.java#L32

Would that simplify things both internally and from a user perspective?

You either:

  • Start Solr old school with bin/solr start --user-managed
  • Or you start single node with embedded zk with bin/solr start
  • Or you start three nodes with embedded zk quorum with bin/solr start -z blahblah:blahblah:blahblah -Dsolr.node.roles=data:on,overseer:allowed,zookeeper_quorum:on
  • Or you start N nodes and connect to an external ZK with bin/solr start -z external_zk:2181.

@janhoy janhoy Nov 3, 2025

I wonder if -z should be reserved for connecting to an external ZK quorum, and we invent a new option for telling SolrCloud which Solr nodes will run the zk node role? That would make it clearer to read, and we make ZK an implementation detail by not referring to it as ZK, but as quorum-node, e.g.

bin/solr start --quorum-nodes=solr-node-1:8983,solr-node-2:8983,solr-node-3:8983

That would mean we want Solr to manage our ZK, and translate into the correct solr.zk.host setting for each host, no matter if the host you are starting is one of the quorum nodes or not. And if the node you are starting is a quorum node, it will detect that its own host name matches a quorum host, and add -Dsolr.node.roles... during startup, and the ZK bootstrap code will configure its internal ZK config using the --quorum / -Dsolr.quorum.nodes values.
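
A tiny sketch of what that translation could look like, with the caveats that --quorum-nodes is only a proposal here and the "Solr port + 1000" ZK client-port offset is an assumption of mine:

```
// Hypothetical: turn a proposed --quorum-nodes value (Solr host:port pairs) into a
// ZK connection string. The option name and the +1000 port offset are assumptions.
import java.util.ArrayList;
import java.util.List;

public class QuorumNodesTranslator {

  /** e.g. "solr-node-1:8983,solr-node-2:8983" -> "solr-node-1:9983,solr-node-2:9983" */
  public static String toZkHost(String quorumNodes) {
    List<String> zkEntries = new ArrayList<>();
    for (String entry : quorumNodes.split(",")) {
      String[] hostAndPort = entry.trim().split(":");
      int zkPort = Integer.parseInt(hostAndPort[1]) + 1000;
      zkEntries.add(hostAndPort[0] + ":" + zkPort);
    }
    return String.join(",", zkEntries);
  }
}
```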

```
if (!zkClient.exists("/solr", true)) {
  zkClient.makePath("/solr", false, true);
}
if (!zkClient.exists("/solr/initialized", true)) {
```

Is this a concept in Solr? This path? New to me!


Could be some hallucination going on

```
    formatZkServer);
MiniSolrCloudCluster cluster;

if (useEmbeddedZkQuorum) {
```

This big if statement seems challenging... Could a builder pattern make it easier? There are a lot of params and overlap with subtle differences ;-).


Hmm, this code is inside the builder :) So the builder chose to instantiate the MSCC using one of two constructors.


```
@Test
public void testBasicQuorumFunctionality() throws Exception {
  log.info("Starting testBasicQuorumFunctionality");
```

I wonder if claude tools could go back and revamp old unit tests to be clearer?


This test is verifying the happy path! I think that we need a test that demonstrates that when you LOSE a node from a three node cluster, we still function, just degraded. I don't know if we have a test that verifies this behavior elsewhere (a stress test??) that could be reused with embedded zk?
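
A sketch of the kind of degraded-cluster test this could be, assuming the SolrCloudWithEmbeddedZkQuorumTestCase base class from this PR exposes the running MiniSolrCloudCluster as cluster (that field name is an assumption on my part):

```
// Sketch only: stop one of the three nodes and check the remaining quorum still
// serves cluster state. Assumes the base class exposes a "cluster" field.
import static org.junit.Assert.assertFalse;

import org.junit.Test;

public class TestQuorumSurvivesNodeLoss extends SolrCloudWithEmbeddedZkQuorumTestCase {

  @Test
  public void testClusterStillAnswersAfterLosingOneNode() throws Exception {
    // Stop one Solr node (and, with it, its embedded ZK peer).
    cluster.stopJettySolrRunner(0);

    // With 2 of 3 ZK peers remaining, the ensemble should still have quorum, and the
    // surviving nodes should still be able to read live-node state from ZK.
    assertFalse(
        cluster.getSolrClient().getClusterStateProvider().getLiveNodes().isEmpty());
  }
}
```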


There used to be such a test in the first phase, but it kept failing, so it was removed 😂🤣

Good point though. With this foundation it should be possible to keep adding more exceptional tests. Hey, perhaps even randomize running with and without a quorum. But currently we require >=3 nodes. We could evolve the quorum to use 1 ZK node if <3 and use 3 ZK nodes for >=3. Then if you kill a Solr node, it should all work even if that node also has ZK.

But I think there were some issues with stopping a JettySolrRunner in quorum mode and then restarting it...

"Cannot start Solr in cloud mode - no cloud config provided");
}

if (config == null) {

How does this even get hit? Sometimes I get confused by all the null checks! If you are running in standalone, wouldn't this class never be run???!


Who knows? Tons of tests failed due to NPE, and the existing code on main has this check, which was removed in this PR...

@epugh
Contributor

epugh commented Nov 3, 2025

I asked Claude to summarize the last seven commits and he said:

Recent work on the spike-zk-quorum branch has focused on refining Solr's embedded ZooKeeper quorum mode implementation through targeted code improvements. The codebase has been streamlined by removing unnecessary variables and conditional logic in the SolrZkServer class, particularly eliminating redundant checks around the zkServerEnabled flag and renaming methods like startZKSE to more descriptive alternatives. Initialization has been optimized by removing the /solr/initialized ZooKeeper node, which was only needed for multi-threading protection that's unnecessary in the current implementation. The ZooKeeper initialization logic in ZkContainer has been restructured with clearer nesting and better separation between standalone and quorum modes, making the codebase more maintainable. These changes collectively enhance the embedded ZooKeeper quorum functionality while simplifying the code structure, resulting in a more robust and easier-to-understand implementation that properly manages ZooKeeper server resources throughout Solr's lifecycle.

I think my summary is: I discovered that we have TWO ways of starting ZK, not including our independent ZkTestServer that runs a ZkTestServer.ZKServerMain during tests.

We now have:

  • For testing only, we have org.apache.solr.cloud.ZkTestServer.ZKServerMain that wraps org.apache.zookeeper.server.ZooKeeperServer.
  • org.apache.solr.cloud.SolrZkServer that wraps either an org.apache.zookeeper.server.ZooKeeperServer or, if there is more than one ZK server defined, an org.apache.zookeeper.server.quorum.QuorumPeerMain
  • org.apache.solr.core.ZkContainer that starts up org.apache.zookeeper.server.embedded.ZooKeeperServerEmbedded

I am now wondering: could we convert SolrZkServer to use a ZooKeeperServerEmbedded instead of a ZooKeeperServer?

Secondly, now seeing the use of QuorumPeerMain, which currently kicks in if there are multiple ZooKeeper servers defined (presumably only in an external cluster setup), I wonder if we need to rename the role zookeeper_quorum to just zookeeper. If you have zookeeper then you fire up ZooKeeperServerEmbedded. And if not, then you default to QuorumPeerMain.

Thirdly, why is ZkContainer in o.a.s.core and not o.a.s.cloud?

I did take a stab at having SolrZkServer use ZooKeeperServerEmbedded, updating the start/stop methods, but failed ;-(. Going to need to put this down for a bit.
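
For anyone picking this back up, here is a bare-bones sketch of driving ZooKeeperServerEmbedded directly (the stock ZooKeeper 3.7+ API, not the SolrZkServer/ZkContainer wiring from this PR; the config shown is just standard zoo.cfg-style properties):

```
// Bare-bones sketch of the stock ZooKeeperServerEmbedded API (ZooKeeper 3.7+).
// This is the raw building block, not the PR's SolrZkServer/ZkContainer wiring.
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import org.apache.zookeeper.server.embedded.ExitHandler;
import org.apache.zookeeper.server.embedded.ZooKeeperServerEmbedded;

public class EmbeddedZkScratch {
  public static void main(String[] args) throws Exception {
    Path baseDir = Files.createTempDirectory("embedded-zk");

    // Standard zoo.cfg-style settings; for quorum mode you would also add
    // server.1=host:2888:3888 entries (one per peer) and a matching myid.
    Properties config = new Properties();
    config.setProperty("clientPort", "9983");

    ZooKeeperServerEmbedded zk =
        ZooKeeperServerEmbedded.builder()
            .baseDir(baseDir)
            .configuration(config)
            .exitHandler(ExitHandler.LOG_ONLY)
            .build();
    zk.start();
    System.out.println("Embedded ZK listening at " + zk.getConnectionString());
    zk.close();
  }
}
```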

@janhoy
Contributor

janhoy commented Nov 3, 2025

I did take a stab at having SolrZkServer use ZooKeeperServerEmbedded, updating the start/stop methods, but failed ;-(. Going to need to put this down for a bit.

Thanks for chipping away on this. At some point we'll have something that can be released in 10.1 or 10.2 as an experimental feature for folks to try out, without dynamic reconfig. Then we can tackle the dynamic stuff...

@epugh
Contributor

epugh commented Jan 28, 2026

Thanks for taking a look at this @janhoy. I think I've stalled a bit on this... It feels "so close" yet I'm not sure how to get it over the hump.
