Skip to content

feat(coordination): add cross-instance Coordinator for multi-instance consistency#2363

Open
magiccao wants to merge 1 commit into
volcengine:mainfrom
magiccao:feat/multi-instance-coordinator
Open

feat(coordination): add cross-instance Coordinator for multi-instance consistency#2363
magiccao wants to merge 1 commit into
volcengine:mainfrom
magiccao:feat/multi-instance-coordinator

Conversation

@magiccao
Copy link
Copy Markdown

@magiccao magiccao commented Jun 1, 2026

Motivation

Several process-local singletons (EmbeddingTaskTracker, RequestWaitTracker,
SemanticQueue coalesce versions, request stats) are backed by plain
dict + threading.Lock. On a single machine this is fine; across multiple
load-balanced instances the state silently diverges — coalesce versions reset
per instance, dedupe windows only protect within one process, and
request-wait tracking sees a partial view.

Changes

New: Coordinator abstraction (openviking/service/coordinator.py)

A small KV protocol with two concrete backends:

Backend Default When to use
InProcessCoordinator Single machine, zero new deps, identical behavior to today
RedisCoordinator opt-in Multi-instance, requires redis package

Activated via storage.coordination.backend = "redis" in ov.conf:

storage:
  coordination:
    backend: redis
    dsn: redis://host:6379/0   # or env OPENVIKING_COORD_DSN
    key_prefix: "ov:coord:"
    ttl_sec: 3600

Install the optional extra: pip install 'openviking[coordination]'

Bug fix: atomic memory semantic dedupe (#769)

The previous get_int → check → set_int in SemanticQueue was a TOCTOU
race: two instances could both read 0 and both enqueue the same memory
semantic task. Replaced with a single atomic set_if_absent() call
(Redis SET NX EX; in-process monotonic-deadline check under lock).

Memory safety: amortized claim pruning

_claim_deadlines in InProcessCoordinator is swept in bulk once it
exceeds a threshold, preventing unbounded growth from one-shot
mem_dedupe:* keys that are never revisited.

Refactor: extract RequestQueueStats / RequestStatsAccumulator

The inline stats dict+lock duplicated in SemanticProcessor and
collection_schemas is unified into openviking/telemetry/request_queue_stats.py.

Compatibility

Single-machine deployments: no behavior change, no new dependencies.
The InProcessCoordinator is the default and behaves identically to the
prior singletons.

Test plan

  • tests/misc/test_coordinator.py — InProcess full coverage + Redis parity via fakeredis (skipped when absent)
  • tests/misc/test_semantic_coalesce.py — staleness check
  • tests/misc/test_request_queue_stats.py — RequestStatsAccumulator
  • tests/storage/test_embedding_tracker.py — Coordinator-backed tracker
  • tests/storage/test_semantic_queue_memory_dedupe.py — atomic dedupe regression
  • ruff check / ruff format — all clean

🤖 Generated with Claude Code

… consistency

Several process-local singletons (EmbeddingTaskTracker, RequestWaitTracker,
SemanticQueue coalesce versions, request stats) were backed by plain
dict + threading.Lock. On a single machine this is fine; across multiple
load-balanced instances the state silently diverges.

This PR introduces a Coordinator abstraction that unifies these behind a
small set of generic KV primitives:

- InProcessCoordinator (default): zero new dependencies, behaviour identical
  to the existing singletons. Single-machine deployments are unaffected.
- RedisCoordinator (opt-in): maps each primitive onto an atomic Redis command
  (INCRBY, SET NX EX, SADD, RPUSH, etc.) for cross-instance consistency.
  Requires `pip install 'openviking[coordination]'`.

Configuration (ov.conf):

    storage:
      coordination:
        backend: redis          # or "memory" (default)
        dsn: redis://host:6379  # or env OPENVIKING_COORD_DSN
        key_prefix: "ov:coord:"
        ttl_sec: 3600

Bug fix: atomic memory semantic dedupe (volcengine#769)
The previous get_int → check → set_int pattern in SemanticQueue was a
TOCTOU race: two instances could both read 0 and both enqueue the same
memory semantic task. Replaced with a single atomic set_if_absent()
(Redis SET NX EX; in-process monotonic-deadline check under lock).

Memory safety: amortized claim pruning
_claim_deadlines in InProcessCoordinator is swept in bulk once the map
exceeds a threshold, so one-shot mem_dedupe:* keys that are never
revisited cannot grow the map unboundedly.

Refactor: extract RequestQueueStats / RequestStatsAccumulator
The inline stats dict+lock duplicated across SemanticProcessor and
collection_schemas is unified into openviking/telemetry/request_queue_stats.py.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 90
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

PR Code Suggestions ✨

No code suggestions found for the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

1 participant