use current node load estimation when placing search jobs#6390
Merged
Conversation
PSeitz
reviewed
May 7, 2026
PSeitz
reviewed
May 11, 2026
| // Seed each candidate node with its current load so the placer avoids | ||
| // routing work to already-loaded nodes. If a node fails to report its | ||
| // load (error or timeout), `load` stays `None`: we still route work | ||
| // there if all other nodes are overloaded, but we prefer reachable |
Collaborator
There was a problem hiding this comment.
maybe we should error instead if all nodes are not reachable or overloaded
Platane
reviewed
May 22, 2026
PSeitz
reviewed
May 22, 2026
| const GET_LOAD_TIMEOUT: Duration = Duration::from_millis(200); | ||
| let load_futures = candidate_nodes.iter_mut().map(|node| { | ||
| let mut client = node.client.clone(); | ||
| async move { tokio::time::timeout(GET_LOAD_TIMEOUT, client.get_load()).await } |
Collaborator
There was a problem hiding this comment.
I wonder if we should have a 1 sec cache for the load to reduce traffic
PSeitz
reviewed
May 22, 2026
Mallets
reviewed
May 26, 2026
| self.total_job_cost.fetch_add(added, Ordering::Relaxed); | ||
| // fetch_add returns the previous value, so add `added` to get the new total. | ||
| let new_load = self.total_job_cost.fetch_add(added, Ordering::Relaxed) + added; | ||
| SEARCHER_NODE_LOAD.set(new_load as f64); |
Contributor
There was a problem hiding this comment.
nit: it seems that SEARCHER_NODE_LOAD stores the same value (1-to-1 mapping) of total_job_cost. SEARCHER_NODE_LOAD provides a get() method to retrieve the value if needed.
Contributor
Author
There was a problem hiding this comment.
i'm generally worried by global mutable state. imo it's okay to write to one for telemetry purpose, but i prefer not relying on it for any actual logic
PSeitz-dd
approved these changes
May 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When placing jobs, first query all nodes for their current load, and bias placement toward less loaded nodes to even load on the cluster
this improves on a problem where some nodes might be overloaded while other are underloaded, causing queueing despite not all nodes being at max capacity
in testing, this was seen improving slightly p50+ under constant light load, and increased the max qps before latency explodes. Metrics also showed all searcher busy when some would be only part-time working before
future improvements: