Describe the bug
Occasionally, indexers do not terminate gracefully and are force-killed after the timeout (5 mins).
Steps to reproduce (if applicable)
I believe this is what is happening:
- On SIGTERM, indexer-1 drains its local shards
- However, indexer-1 might be assigned shard A, which is owned by indexer-0. Data continues to be written to shard A because it's owned by the healthy indexer-0, so indexer-1 keeps fetching data from indexer-0's WAL and never gets decommissioned.
- The control plane doesn't know whether an indexer is being decommissioned, so an indexer that was assigned a remote shard will continue to be assigned that remote shard.
Expected behavior
The control plane should probably reassign shard A to an indexer that's not being decommissioned. https://github.com/quickwit-oss/quickwit/blob/main/quickwit/quickwit-control-plane/src/indexing_scheduler/mod.rs#L390
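To make the expected behavior concrete, here is a minimal sketch of the reassignment step. It is not Quickwit's actual scheduler code: `IndexerInfo`, the `is_decommissioning` flag, and `reassign_shards_from_decommissioning` are hypothetical names, and it assumes the control plane can somehow learn that an indexer is draining. Shards and nodes are plain strings for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified view of an indexer as seen by the control plane.
#[derive(Debug, Clone)]
pub struct IndexerInfo {
    pub node_id: String,
    /// Assumed flag: true once the indexer has received SIGTERM and is draining.
    pub is_decommissioning: bool,
}

/// Moves every shard assigned to a decommissioning indexer to the least-loaded
/// indexer that is not decommissioning. `assignments` maps shard ID to the node
/// ID of the indexer that indexes it. Returns the IDs of the shards that moved.
/// If no healthy indexer exists, assignments are left untouched.
pub fn reassign_shards_from_decommissioning(
    indexers: &[IndexerInfo],
    assignments: &mut BTreeMap<String, String>,
) -> Vec<String> {
    let draining: BTreeSet<&str> = indexers
        .iter()
        .filter(|indexer| indexer.is_decommissioning)
        .map(|indexer| indexer.node_id.as_str())
        .collect();

    // Current number of shards per healthy indexer.
    let mut load: BTreeMap<&str, usize> = indexers
        .iter()
        .filter(|indexer| !indexer.is_decommissioning)
        .map(|indexer| (indexer.node_id.as_str(), 0))
        .collect();
    for node_id in assignments.values() {
        if let Some(count) = load.get_mut(node_id.as_str()) {
            *count += 1;
        }
    }

    let mut moved = Vec::new();
    for (shard_id, node_id) in assignments.iter_mut() {
        if !draining.contains(node_id.as_str()) {
            continue;
        }
        // Pick the healthy indexer with the fewest shards (ties: smallest node ID).
        let target = match load.iter().min_by_key(|(_, count)| **count) {
            Some((&target, _)) => target,
            None => break, // No healthy indexer to move shards to.
        };
        if let Some(count) = load.get_mut(target) {
            *count += 1;
        }
        *node_id = target.to_string();
        moved.push(shard_id.clone());
    }
    moved
}
```

With this kind of filter in place, indexer-1 would stop being assigned shard A once it starts draining, so it would no longer need to keep pulling from indexer-0's WAL.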
Configuration: