Description
Hi everyone, I need some help debugging Quickwit. We have been having intermittent issues with our Quickwit setup.
We have a cluster of 4 indexer nodes (each with a minimum of 4 CPUs, up to 32, and 8Gi of RAM, up to 48Gi) handling about 440 indexes (110 tenants × 4 indexes per tenant).
For each tenant, the source configs for all 4 indexes end up creating 16 indexing pipelines connected to Kafka (1760 pipelines in total). The tenants and indexes don't all have the same ingest rate: for each tenant, one index (main) is expected to receive most of the data. We also have a few heavy tenants (5); all the others send us less data.
With this setup, indexing works most of the time, but sometimes it just stops and we have not been able to find the root cause.
The last occurrence of this issue appeared when we deployed a change to set the metastore Postgres min_connection: 8 and max_connection: 20. After this change, we started seeing errors related to the indexing actors failing (see screenshot).
We reverted the change and the errors stopped, but the indexers were still not picking up indexing. Usually, when we have an indexing issue (some or all tenants not indexing), we restart the control-plane. This restart triggers a new indexing plan that stops the old pipelines and starts new ones. This sometimes solves the issue, but this time it did not work.
Another solution that had worked previously was the saviour this time. We selected a few of our tenants, disabled their index sources (thus stopping the indexing pipelines), and enabled them again. We did this in batches, and after two batches most tenants were indexing, except 4. We had to redo it for the remaining ones before all tenants' indexes started fully working.
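For anyone who wants to script the same workaround, here is a minimal sketch of the batching logic only. It assumes a hypothetical `toggle_source(tenant, enable)` callable that you supply yourself (for example, a wrapper around whatever you already use to enable or disable a source); the batch size and the pause between batches are made-up values, not something we tuned.

```python
import time
from typing import Callable, List, Sequence


def batched(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def restart_sources_in_batches(
    tenants: Sequence[str],
    toggle_source: Callable[[str, bool], None],  # hypothetical, supplied by the caller
    batch_size: int = 10,        # assumption: not a tuned value
    pause_seconds: float = 0.0,  # assumption: optional pause between batches
) -> List[List[str]]:
    """Disable, then re-enable, the sources of each tenant, one batch at a time."""
    batches = batched(tenants, batch_size)
    for batch in batches:
        # Disabling the sources stops the tenant's indexing pipelines.
        for tenant in batch:
            toggle_source(tenant, False)
        # Enabling them again lets new pipelines be scheduled.
        for tenant in batch:
            toggle_source(tenant, True)
        time.sleep(pause_seconds)
    return batches
```

Passing the toggle function in keeps the sketch independent of how the sources are actually toggled, and limits how many pipelines restart at the same time.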
My hypotheses:
- We are probably under-provisioned and currently rely on the randomness of indexing pipelines getting assigned to the correct node.
- Creating 4 to 8 pipelines as new indexes are created is fine, but when a restart is triggered, starting all tenants' pipelines at once seems to be problematic. It is still a small number of pipelines, though: 440 indexing pipelines per node if we assume perfect balancing across nodes. Maybe spawning all these pipelines at the same time creates pressure on the metastore (PG connections).
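
To make the numbers behind the second point concrete, here is a quick back-of-the-envelope sketch. The last figure assumes the connection limit of 20 applies per indexer node, which is only a guess about how the setting is scoped.

```python
# Figures taken from the setup described above.
TENANTS = 110
INDEXES_PER_TENANT = 4
PIPELINES_PER_TENANT = 16  # across the tenant's 4 indexes
INDEXER_NODES = 4
PG_MAX_CONNECTIONS = 20    # the max_connection value we tried

total_indexes = TENANTS * INDEXES_PER_TENANT
total_pipelines = TENANTS * PIPELINES_PER_TENANT

# Assumes perfect balancing of pipelines across the indexer nodes.
pipelines_per_node = total_pipelines // INDEXER_NODES

# Assumption: the connection limit is per node. If it is scoped differently,
# this ratio changes.
pipelines_per_connection = pipelines_per_node / PG_MAX_CONNECTIONS

print(f"indexes: {total_indexes}")
print(f"pipelines: {total_pipelines}")
print(f"pipelines per node: {pipelines_per_node}")
print(f"pipelines per PG connection on restart: {pipelines_per_connection:.0f}")
```

Under that assumption, a full restart would have each node's pipelines contending for a much smaller number of metastore connections at the same moment.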
