fix(operator): configure embeddings first; SDK-direct catalog refresh by aaronsb · Pull Request #401 · aaronsb/knowledge-graph-system

aaronsb · 2026-05-21T13:53:13Z

Summary

Follow-up to #400. Init reached Step 6 (model catalog refresh) and crashed with the same Anthropic chicken-and-egg problem:

❌ Failed to refresh catalog: Anthropic requires an embedding provider.
   Set OPENAI_API_KEY, populate the OpenAI model catalog (ADR-800/801),
   or pass embedding_provider explicitly.

Two coupled root causes:

1. The operator container can never construct LocalEmbeddingProvider. EmbeddingModelManager is loaded only by the API container at startup (api/app/main.py:192). The operator container imports the same code but never calls init_embedding_model_manager(), so the module-level _model_manager global stays None. get_embedding_provider() in the operator therefore always returns None, and AnthropicProvider.__init__ falls into its eager OpenAIProvider() fallback, which fails because no OpenAI key is stored on first-run setup.
2. The wizard configured embeddings after extraction model selection. The GPU/CPU choice made at the start of the wizard wasn't being propagated to the embedding profile's device column.
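The first root cause can be reproduced in isolation. The sketch below is only an illustration: the class and function names follow the description above, but every body is a hypothetical stand-in rather than the project's code. It shows how a module-level global that only one process initializes leaves the other process stuck on the eager fallback.

```python
from typing import Optional


class EmbeddingModelManager:
    """Hypothetical stand-in for the real manager; holds only a device string."""

    def __init__(self, device: str = "cpu") -> None:
        self.device = device


# Module-level global: None until init_embedding_model_manager() runs
# in *this* process.
_model_manager: Optional[EmbeddingModelManager] = None


def init_embedding_model_manager(device: str = "cpu") -> EmbeddingModelManager:
    """Per the PR, the API container calls this at startup; the operator never does."""
    global _model_manager
    _model_manager = EmbeddingModelManager(device)
    return _model_manager


def get_embedding_provider() -> Optional[EmbeddingModelManager]:
    """Returns None in any process that has not run the init function."""
    return _model_manager


class OpenAIProvider:
    """Hypothetical stand-in: refuses to construct without a key."""

    def __init__(self, api_key: Optional[str] = None) -> None:
        if not api_key:
            raise ValueError("OPENAI_API_KEY not set")
        self.api_key = api_key


class AnthropicProvider:
    """Hypothetical stand-in showing the eager fallback chain."""

    def __init__(self, embedding_provider=None) -> None:
        # With no explicit or local provider, construction falls through to
        # OpenAIProvider(), which fails when no key is stored.
        self.embedding_provider = (
            embedding_provider or get_embedding_provider() or OpenAIProvider()
        )
```

In a process that never calls init_embedding_model_manager(), AnthropicProvider() raises; once the manager is initialized in the same process, construction succeeds.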

What changed

operator/configure.py

New _fetch_catalog_via_sdk(provider) helper. Bypasses AIProvider.__init__ via __new__, sets only self.client (or self.api_key for OpenRouter — what its fetch_model_catalog actually uses), and calls the existing fetch_model_catalog method. No catalog logic is duplicated.
models refresh uses the new helper for openai/anthropic/openrouter; falls back to get_provider() for ollama/llamacpp (different requirements, not part of guided wizard).
New --device flag on embedding subcommand; writes the chosen device onto the activated profile.
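The `__new__` bypass described for _fetch_catalog_via_sdk can be sketched as follows. This is a minimal illustration under stated assumptions: Provider, FakeSDKClient, list_models, and the function signature are all hypothetical, and the real helper may differ. The point is only that a method reading a single attribute can run on an instance whose `__init__` was never called.

```python
class FakeSDKClient:
    """Hypothetical stand-in for a vendor SDK client."""

    def list_models(self):
        return ["model-a", "model-b"]


class Provider:
    """Hypothetical provider whose __init__ has a dependency we want to skip."""

    def __init__(self):
        raise RuntimeError("requires an embedding provider")

    def fetch_model_catalog(self):
        # Reads only self.client, so it works on a partially built instance.
        return [{"id": model_id} for model_id in self.client.list_models()]


def fetch_catalog_via_sdk(provider_cls, client):
    """Allocate the instance without running __init__, then set only what is needed."""
    instance = provider_cls.__new__(provider_cls)
    instance.client = client
    return instance.fetch_model_catalog()


catalog = fetch_catalog_via_sdk(Provider, FakeSDKClient())
```

The trade-off is that the instance is only valid for methods that touch the attributes set by hand, which is why the pattern suits a narrow helper rather than general use.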

operator/lib/guided-init.sh

Embedding profile activation moved from old Step 7 to new Step 4 (right after admin user creation).
Maps wizard GPU_MODE → PyTorch device string: mac→mps, nvidia→cuda, amd|amd-host→cuda (ROCm PyTorch presents as cuda), cpu→cpu.
Steps renumbered: AI provider → 5, validate key → 6, model select → 7. Garage/start-app stay at 8/9.
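For reference, the GPU_MODE mapping listed above can be written as a lookup table. The actual mapping lives in the shell script; this Python rendering is only an illustration, and both the function name and the fallback to cpu for unrecognized values are assumptions not stated in the PR.

```python
# Wizard GPU_MODE value -> PyTorch device string, as listed in the PR.
GPU_MODE_TO_DEVICE = {
    "mac": "mps",
    "nvidia": "cuda",
    "amd": "cuda",       # ROCm builds of PyTorch present as cuda
    "amd-host": "cuda",
    "cpu": "cpu",
}


def device_for_gpu_mode(gpu_mode: str) -> str:
    """Hypothetical helper; falling back to cpu for unknown modes is an assumption."""
    return GPU_MODE_TO_DEVICE.get(gpu_mode, "cpu")
```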

Test plan

Clean teardown + fresh ./operator.sh init with hot-reload + CPU + Anthropic.
Confirm Step 4 activates the local embedding profile with device=cpu.
Confirm Step 7 (models refresh anthropic) succeeds without the embedding-provider error.
Re-run with GPU_MODE=mac on an Apple Silicon host; profile shows device=mps.
Re-run picking OpenAI and OpenRouter; both refresh paths work through the same _fetch_catalog_via_sdk helper.

Two coupled fixes for Step 6 init failure ("Anthropic requires an embedding provider") on first-run setup.

1. Reorder guided-init: configure the local embedding profile right after admin user creation (new Step 4), before AI provider selection. The wizard already collects the GPU/CPU choice at the top; map that to a PyTorch device string (mac→mps, nvidia/amd/amd-host→cuda, cpu→cpu) and pass it to `configure.py embedding --device`. The API container picks up the active profile + device at startup. Renumber subsequent steps (AI provider → 5, validate key → 6, model select → 7); Garage and start-app stay at 8 and 9.

2. SDK-direct catalog refresh. `models refresh` previously went through get_provider(), which instantiates AnthropicProvider, whose __init__ eagerly constructs an OpenAIProvider as the embedding delegate. The operator container has no loaded EmbeddingModelManager (only the API container initializes one at startup), so get_embedding_provider() returns None and the eager fallback fails for lack of an OpenAI key. New _fetch_catalog_via_sdk() bypasses __init__ via __new__, sets only the SDK client (or api_key for OpenRouter), and reuses the existing fetch_model_catalog method. Mirrors the SDK-direct pattern already used by _validate_provider_key.

Adds a --device flag to `configure.py embedding` so the wizard can write the chosen device onto the activated profile in one call.

set_model_default fetched the provider/category for the target row and then unpacked the result with `provider, category = row`. When the caller's connection is configured with RealDictCursor (as the operator container's configure.py is; see operator/configure.py line 39), the row is a dict subclass and tuple unpacking silently yields the column *names* ("provider" and "category") rather than the values. The clear-existing-default UPDATE then matched zero rows, and the set-new-default UPDATE collided with the still-set old default, violating idx_catalog_default on (provider, category).

The API container's path didn't hit this because AGEClient.pool doesn't set a cursor_factory; only this operator-driven path tripped on it.

Replace the SELECT + tuple-unpack with a subquery so the function is cursor-factory-agnostic. As a bonus the path is now idempotent: setting a model that's already the default no longer races with itself.

Manifests as Step 7 of guided init: "Models command failed: duplicate key value violates unique constraint 'idx_catalog_default'".
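Both halves of this commit can be illustrated without a PostgreSQL instance. The sketch below uses a plain dict to show the unpacking pitfall (a RealDictCursor row is a dict subclass, so it behaves the same way) and an in-memory SQLite database to show the subquery approach. The table name `catalog`, the `is_default` column, and the exact SQL are assumptions made for the illustration; only the index name and its (provider, category) columns come from the commit message.

```python
import sqlite3

# Unpacking a dict iterates over its keys, so the column names come back.
row = {"provider": "anthropic", "category": "extraction"}
provider, category = row  # provider == "provider", category == "category"


def set_model_default(conn, model_id):
    """Clear the old default and set the new one without reading the row into Python."""
    cur = conn.cursor()
    # The subqueries resolve provider/category inside SQL, so the result does
    # not depend on how the cursor represents rows.
    cur.execute(
        """
        UPDATE catalog SET is_default = 0
        WHERE is_default = 1
          AND provider = (SELECT provider FROM catalog WHERE id = ?)
          AND category = (SELECT category FROM catalog WHERE id = ?)
        """,
        (model_id, model_id),
    )
    cur.execute("UPDATE catalog SET is_default = 1 WHERE id = ?", (model_id,))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE catalog (
        id INTEGER PRIMARY KEY,
        provider TEXT,
        category TEXT,
        is_default INTEGER DEFAULT 0
    );
    -- At most one default per (provider, category), mirroring idx_catalog_default.
    CREATE UNIQUE INDEX idx_catalog_default
        ON catalog(provider, category) WHERE is_default = 1;
    INSERT INTO catalog VALUES
        (1, 'anthropic', 'extraction', 1),
        (2, 'anthropic', 'extraction', 0);
    """
)

set_model_default(conn, 2)
set_model_default(conn, 2)  # repeating the call leaves the same state
defaults = conn.execute(
    "SELECT id FROM catalog WHERE is_default = 1"
).fetchall()
```

Unpacking by key (`row["provider"]`) would also have fixed the dict case, but it would then break for tuple rows; pushing the lookup into SQL avoids depending on the row type at all.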

aaronsb added 2 commits May 21, 2026 06:52

aaronsb wants to merge 2 commits into main from fix/catalog-refresh-embedding-coupling
