feat(coordination): add cross-instance Coordinator for multi-instance consistency by magiccao · Pull Request #2363 · volcengine/OpenViking

magiccao · 2026-06-01T11:33:28Z

Motivation

Several process-local singletons (EmbeddingTaskTracker, RequestWaitTracker,
SemanticQueue coalesce versions, request stats) are backed by plain
dict + threading.Lock. On a single machine this is fine; across multiple
load-balanced instances the state silently diverges — coalesce versions reset
per instance, dedupe windows only protect within one process, and
request-wait tracking sees a partial view.

Changes

New: `Coordinator` abstraction (`openviking/service/coordinator.py`)

A small KV protocol with two concrete backends:

Backend	Default	When to use
`InProcessCoordinator`	✓	Single machine, zero new deps, identical behavior to today
`RedisCoordinator`	opt-in	Multi-instance, requires `redis` package

Activated via storage.coordination.backend = "redis" in ov.conf:

storage:
  coordination:
    backend: redis
    dsn: redis://host:6379/0   # or env OPENVIKING_COORD_DSN
    key_prefix: "ov:coord:"
    ttl_sec: 3600

Install the optional extra: pip install 'openviking[coordination]'

Bug fix: atomic memory semantic dedupe (#769)

The previous get_int → check → set_int in SemanticQueue was a TOCTOU
race: two instances could both read 0 and both enqueue the same memory
semantic task. Replaced with a single atomic set_if_absent() call
(Redis SET NX EX; in-process monotonic-deadline check under lock).

Memory safety: amortized claim pruning

_claim_deadlines in InProcessCoordinator is swept in bulk once it
exceeds a threshold, preventing unbounded growth from one-shot
mem_dedupe:* keys that are never revisited.

Refactor: extract `RequestQueueStats` / `RequestStatsAccumulator`

The inline stats dict+lock duplicated in SemanticProcessor and
collection_schemas is unified into openviking/telemetry/request_queue_stats.py.

Compatibility

Single-machine deployments: no behavior change, no new dependencies.
The InProcessCoordinator is the default and behaves identically to the
prior singletons.

Test plan

tests/misc/test_coordinator.py — InProcess full coverage + Redis parity via fakeredis (skipped when absent)
tests/misc/test_semantic_coalesce.py — staleness check
tests/misc/test_request_queue_stats.py — RequestStatsAccumulator
tests/storage/test_embedding_tracker.py — Coordinator-backed tracker
tests/storage/test_semantic_queue_memory_dedupe.py — atomic dedupe regression
ruff check / ruff format — all clean

🤖 Generated with Claude Code

… consistency Several process-local singletons (EmbeddingTaskTracker, RequestWaitTracker, SemanticQueue coalesce versions, request stats) were backed by plain dict + threading.Lock. On a single machine this is fine; across multiple load-balanced instances the state silently diverges. This PR introduces a Coordinator abstraction that unifies these behind a small set of generic KV primitives: - InProcessCoordinator (default): zero new dependencies, behaviour identical to the existing singletons. Single-machine deployments are unaffected. - RedisCoordinator (opt-in): maps each primitive onto an atomic Redis command (INCRBY, SET NX EX, SADD, RPUSH, etc.) for cross-instance consistency. Requires `pip install 'openviking[coordination]'`. Configuration (ov.conf): storage: coordination: backend: redis # or "memory" (default) dsn: redis://host:6379 # or env OPENVIKING_COORD_DSN key_prefix: "ov:coord:" ttl_sec: 3600 Bug fix: atomic memory semantic dedupe (volcengine#769) The previous get_int → check → set_int pattern in SemanticQueue was a TOCTOU race: two instances could both read 0 and both enqueue the same memory semantic task. Replaced with a single atomic set_if_absent() (Redis SET NX EX; in-process monotonic-deadline check under lock). Memory safety: amortized claim pruning _claim_deadlines in InProcessCoordinator is swept in bulk once the map exceeds a threshold, so one-shot mem_dedupe:* keys that are never revisited cannot grow the map unboundedly. Refactor: extract RequestQueueStats / RequestStatsAccumulator The inline stats dict+lock duplicated across SemanticProcessor and collection_schemas is unified into openviking/telemetry/request_queue_stats.py.

github-actions · 2026-06-01T11:34:41Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 90
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

github-actions · 2026-06-01T11:35:24Z

PR Code Suggestions ✨

No code suggestions found for the PR.

github-project-automation Bot added this to OpenViking project Jun 1, 2026

github-project-automation Bot moved this to Backlog in OpenViking project Jun 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(coordination): add cross-instance Coordinator for multi-instance consistency#2363

feat(coordination): add cross-instance Coordinator for multi-instance consistency#2363
magiccao wants to merge 1 commit into
volcengine:mainfrom
magiccao:feat/multi-instance-coordinator

magiccao commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

magiccao commented Jun 1, 2026

Motivation

Changes

New: Coordinator abstraction (openviking/service/coordinator.py)

Bug fix: atomic memory semantic dedupe (#769)

Memory safety: amortized claim pruning

Refactor: extract RequestQueueStats / RequestStatsAccumulator

Compatibility

Test plan

Uh oh!

github-actions Bot commented Jun 1, 2026

PR Reviewer Guide 🔍

Uh oh!

github-actions Bot commented Jun 1, 2026

PR Code Suggestions ✨

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

New: `Coordinator` abstraction (`openviking/service/coordinator.py`)

Refactor: extract `RequestQueueStats` / `RequestStatsAccumulator`