Skip to content

use current node load estimation when placing search jobs#6390

Merged
trinity-1686a merged 15 commits into
mainfrom
trinity.pointard/placer-consider-load
May 27, 2026
Merged

use current node load estimation when placing search jobs#6390
trinity-1686a merged 15 commits into
mainfrom
trinity.pointard/placer-consider-load

Conversation

@trinity-1686a
Copy link
Copy Markdown
Contributor

@trinity-1686a trinity-1686a commented May 6, 2026

When placing jobs, first query all nodes for their current load, and bias placement toward less loaded nodes to even load on the cluster
this improves on a problem where some nodes might be overloaded while other are underloaded, causing queueing despite not all nodes being at max capacity
in testing, this was seen improving slightly p50+ under constant light load, and increased the max qps before latency explodes. Metrics also showed all searcher busy when some would be only part-time working before

future improvements:

  • we could debounce calls to GetLoad
  • fetch_docs could reuse the same searcher as used in the leaf_search phase, this guarantees a footer-cache hit and respect pre-existing load without the need for more GetLoad calls on the critical path

Comment thread quickwit/quickwit-search/src/root.rs
// Seed each candidate node with its current load so the placer avoids
// routing work to already-loaded nodes. If a node fails to report its
// load (error or timeout), `load` stays `None`: we still route work
// there if all other nodes are overloaded, but we prefer reachable
Copy link
Copy Markdown
Collaborator

@PSeitz PSeitz May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should error instead if all nodes are not reachable or overloaded

Comment thread quickwit/quickwit-search/src/service.rs
const GET_LOAD_TIMEOUT: Duration = Duration::from_millis(200);
let load_futures = candidate_nodes.iter_mut().map(|node| {
let mut client = node.client.clone();
async move { tokio::time::timeout(GET_LOAD_TIMEOUT, client.get_load()).await }
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should have a 1 sec cache for the load to reduce traffic

Comment thread quickwit/quickwit-search/src/search_job_placer.rs
@trinity-1686a trinity-1686a requested a review from a team as a code owner May 22, 2026 13:19
self.total_job_cost.fetch_add(added, Ordering::Relaxed);
// fetch_add returns the previous value, so add `added` to get the new total.
let new_load = self.total_job_cost.fetch_add(added, Ordering::Relaxed) + added;
SEARCHER_NODE_LOAD.set(new_load as f64);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it seems that SEARCHER_NODE_LOAD stores the same value (1-to-1 mapping) of total_job_cost. SEARCHER_NODE_LOAD provides a get() method to retrieve the value if needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm generally worried by global mutable state. imo it's okay to write to one for telemetry purpose, but i prefer not relying on it for any actual logic

@trinity-1686a trinity-1686a enabled auto-merge (squash) May 27, 2026 08:57
@trinity-1686a trinity-1686a merged commit 9a7b228 into main May 27, 2026
9 checks passed
@trinity-1686a trinity-1686a deleted the trinity.pointard/placer-consider-load branch May 27, 2026 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants