Indexers not gracefully terminating and getting force killed #6413

@mattmkim

Describe the bug
Occasionally, indexers do not terminate gracefully and get force-killed after the timeout (5 minutes).

Steps to reproduce (if applicable)
I believe this is what is happening:

  1. On SIGTERM, indexer-1 drains its local shards.
  2. However, indexer-1 might be assigned shard A, which is owned by indexer-0. Data will continue to be written to shard A because it's owned by the healthy indexer-0. indexer-1 will keep fetching data from indexer-0's WAL, so indexer-1 will not get decommissioned.
  3. The control plane doesn't know whether an indexer is being decommissioned, so if an indexer was assigned a remote shard, it will continue to be assigned that remote shard.

Expected behavior
The control plane should probably reassign shard A to an indexer that's not being decommissioned. https://github.com/quickwit-oss/quickwit/blob/main/quickwit/quickwit-control-plane/src/indexing_scheduler/mod.rs#L390
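
To make the expected behavior concrete, here is a rough sketch of the idea, not the actual scheduler code. All names in it (`NodeId`, `ShardId`, `reassign_away_from_decommissioning`) are hypothetical, and it assumes the control plane can somehow learn which indexers are decommissioning, which it cannot do today. Shards assigned to a decommissioning indexer are moved to the least-loaded remaining indexer:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical, simplified types; not Quickwit's actual data structures.
type NodeId = &'static str;
type ShardId = &'static str;

/// Returns a new shard-to-indexer assignment in which no shard is assigned
/// to a decommissioning indexer. Displaced shards go to the eligible indexer
/// that currently holds the fewest shards. If no indexer is eligible,
/// displaced shards are left unassigned.
fn reassign_away_from_decommissioning(
    assignments: &BTreeMap<ShardId, NodeId>,
    indexers: &BTreeSet<NodeId>,
    decommissioning: &BTreeSet<NodeId>,
) -> BTreeMap<ShardId, NodeId> {
    // Shard count per indexer that is still allowed to receive shards.
    let mut load: BTreeMap<NodeId, usize> = indexers
        .difference(decommissioning)
        .map(|node| (*node, 0))
        .collect();

    let mut result = BTreeMap::new();
    let mut displaced = Vec::new();

    // Keep assignments that already point at an eligible indexer.
    for (&shard, &node) in assignments {
        if let Some(count) = load.get_mut(&node) {
            *count += 1;
            result.insert(shard, node);
        } else {
            displaced.push(shard);
        }
    }

    // Move every displaced shard to the least-loaded eligible indexer.
    for shard in displaced {
        let target = load
            .iter()
            .min_by_key(|(_, count)| **count)
            .map(|(node, _)| *node);
        if let Some(node) = target {
            *load.entry(node).or_insert(0) += 1;
            result.insert(shard, node);
        }
    }
    result
}
```

With shard A assigned to indexer-1, shard B assigned to indexer-0, and indexer-1 marked as decommissioning, this moves shard A to another indexer and leaves shard B where it is, so indexer-1 no longer has a remote shard to keep fetching from.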

Configuration:

Labels: bug