fix(operator): configure embeddings first; SDK-direct catalog refresh #401

Open
aaronsb wants to merge 2 commits into main from fix/catalog-refresh-embedding-coupling

@aaronsb aaronsb commented May 21, 2026

Summary

Follow-up to #400. Init reached Step 6 (model catalog refresh) and failed with the same Anthropic chicken-and-egg error:

❌ Failed to refresh catalog: Anthropic requires an embedding provider.
   Set OPENAI_API_KEY, populate the OpenAI model catalog (ADR-800/801),
   or pass embedding_provider explicitly.

Two coupled root causes:

  1. The operator container can never construct LocalEmbeddingProvider. EmbeddingModelManager is loaded only by the API container at startup (api/app/main.py:192). The operator container imports the same code but never calls init_embedding_model_manager(), so the module-level _model_manager global stays None. get_embedding_provider() in the operator therefore always returns None, and AnthropicProvider.__init__ falls into its eager OpenAIProvider() fallback — which fails because no OpenAI key is stored on first-run setup.

  2. The wizard configured embeddings after extraction model selection, and the GPU/CPU choice made at the start of the wizard was not propagated to the embedding profile's device column.
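
To illustrate the first root cause, here is a minimal sketch of how a module-level global behaves when only one process initializes it. The function names echo the ones mentioned above, but the bodies are hypothetical stand-ins, not the project's code.

```python
from typing import Optional

# Hypothetical stand-in for the real module. A module-level global exists
# once per process, so each container has its own copy.
_model_manager: Optional[object] = None


def init_embedding_model_manager() -> None:
    """Assumed to be called by the API container at startup only."""
    global _model_manager
    _model_manager = object()  # placeholder for a loaded model manager


def get_embedding_provider() -> Optional[str]:
    """Return a local provider only if this process initialized the manager."""
    if _model_manager is None:
        return None
    return "local-embedding-provider"  # placeholder value


# Operator-like process: imports the module but never initializes it.
operator_view = get_embedding_provider()

# API-like process: initializes at startup, then asks for the provider.
init_embedding_model_manager()
api_view = get_embedding_provider()
```

In the operator-like case the lookup yields `None`, which is the condition that pushes `AnthropicProvider.__init__` into its OpenAI fallback.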

What changed

operator/configure.py

  • New _fetch_catalog_via_sdk(provider) helper. Bypasses AIProvider.__init__ via __new__, sets only self.client (or self.api_key for OpenRouter — what its fetch_model_catalog actually uses), and calls the existing fetch_model_catalog method. No catalog logic is duplicated.
  • models refresh uses the new helper for openai/anthropic/openrouter; falls back to get_provider() for ollama/llamacpp (different requirements, not part of the guided wizard).
  • New --device flag on the embedding subcommand; writes the chosen device onto the activated profile.
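
The `__new__` bypass described above can be sketched roughly as follows. `FakeSDKClient`, `SketchProvider`, and `fetch_catalog_via_sdk` are hypothetical names for illustration; the real helper and provider classes may differ in detail.

```python
class FakeSDKClient:
    """Hypothetical stand-in for a vendor SDK client."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def list_models(self) -> list:
        return ["model-b", "model-a"]


class SketchProvider:
    """Hypothetical provider whose __init__ has an eager dependency."""

    def __init__(self, api_key: str) -> None:
        # Simulates the eager embedding fallback that fails on first run.
        raise RuntimeError("requires an embedding provider")

    def fetch_model_catalog(self) -> list:
        # Uses only self.client, so it works without the rest of __init__.
        return sorted(self.client.list_models())


def fetch_catalog_via_sdk(provider_cls, api_key: str) -> list:
    # Allocate the instance without running __init__.
    provider = provider_cls.__new__(provider_cls)
    # Set only the attribute that fetch_model_catalog actually reads.
    provider.client = FakeSDKClient(api_key)
    return provider.fetch_model_catalog()


catalog = fetch_catalog_via_sdk(SketchProvider, "sk-test")
```

The trade-off is that the instance is only partially initialized, so any method that reads other attributes would fail; the approach fits here because the catalog fetch is assumed to touch nothing but the client.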

operator/lib/guided-init.sh

  • Embedding profile activation moved from old Step 7 to new Step 4 (right after admin user creation).
  • Maps wizard GPU_MODE → PyTorch device string: mac→mps, nvidia→cuda, amd|amd-host→cuda (ROCm PyTorch presents as cuda), cpu→cpu.
  • Steps renumbered: AI provider → 5, validate key → 6, model select → 7. Garage/start-app stay at 8/9.
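
The GPU_MODE mapping above lives in the shell wizard; the same table is sketched here in Python for readability. The function name and the CPU fallback for unrecognized modes are assumptions, not behavior stated in this PR.

```python
# Wizard GPU_MODE value -> PyTorch device string, as listed above.
_DEVICE_BY_GPU_MODE = {
    "mac": "mps",
    "nvidia": "cuda",
    "amd": "cuda",       # ROCm builds of PyTorch present as cuda
    "amd-host": "cuda",
    "cpu": "cpu",
}


def device_for_gpu_mode(gpu_mode: str) -> str:
    """Map a wizard GPU_MODE to a device string.

    Falling back to "cpu" for unknown modes is an assumption of this sketch.
    """
    return _DEVICE_BY_GPU_MODE.get(gpu_mode, "cpu")
```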

Test plan

  • Clean teardown + fresh ./operator.sh init with hot-reload + CPU + Anthropic.
  • Confirm Step 4 activates the local embedding profile with device=cpu.
  • Confirm Step 7 (models refresh anthropic) succeeds without the embedding-provider error.
  • Re-run with GPU_MODE=mac on an Apple Silicon host; profile shows device=mps.
  • Re-run picking OpenAI and OpenRouter; both refresh paths work through the same _fetch_catalog_via_sdk helper.

aaronsb added 2 commits May 21, 2026 06:52
Two coupled fixes for Step 6 init failure ("Anthropic requires an
embedding provider") on first-run setup.

1. Reorder guided-init: configure the local embedding profile right
   after admin user creation (new Step 4), before AI provider selection.
   The wizard already collects the GPU/CPU choice at the top; map that
   to a PyTorch device string (mac→mps, nvidia/amd/amd-host→cuda,
   cpu→cpu) and pass it to `configure.py embedding --device`. The API
   container picks up the active profile + device at startup. Renumber
   subsequent steps (AI provider → 5, validate key → 6, model select →
   7); Garage and start-app stay at 8 and 9.

2. SDK-direct catalog refresh. `models refresh` previously went through
   get_provider(), which instantiates AnthropicProvider — whose __init__
   eagerly constructs an OpenAIProvider as the embedding delegate. The
   operator container has no loaded EmbeddingModelManager (only the API
   container initializes one at startup), so get_embedding_provider()
   returns None and the eager fallback fails for lack of an OpenAI key.
   New _fetch_catalog_via_sdk() bypasses __init__ via __new__, sets
   only the SDK client (or api_key for OpenRouter), and reuses the
   existing fetch_model_catalog method. Mirrors the SDK-direct pattern
   already used by _validate_provider_key.

Adds a --device flag to `configure.py embedding` so the wizard can
write the chosen device onto the activated profile in one call.

set_model_default fetched the provider/category for the target row and
then unpacked the result with `provider, category = row`. When the
caller's connection is configured with RealDictCursor (as the operator
container's configure.py is — see operator/configure.py line 39), the
row is a dict subclass and tuple unpacking silently yields the column
*names* — "provider" and "category" — rather than the values. The
clear-existing-default UPDATE then matched zero rows, and the
set-new-default UPDATE collided with the still-set old default,
violating idx_catalog_default on (provider, category).
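
The unpacking pitfall can be reproduced with a plain dict, which iterates over its keys just as a dict-style cursor row does. The row values here are made up for illustration.

```python
# A plain dict standing in for a dict-style cursor row (values are made up).
row = {"provider": "anthropic", "category": "extraction"}

# Tuple unpacking iterates the dict, which yields its keys, not its values.
provider, category = row
unpacked = (provider, category)

# Keyed access returns the values regardless of the row's shape as a mapping.
by_key = (row["provider"], row["category"])
```

Here `unpacked` holds the column names and `by_key` holds the intended values, with no exception raised in either case.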

The API container's path didn't hit this because AGEClient.pool doesn't
set a cursor_factory; only this operator-driven path tripped on it.

Replace the SELECT + tuple-unpack with a subquery so the function is
cursor-factory-agnostic. As a bonus the path is now idempotent: setting
a model that's already the default no longer races with itself.
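
A rough sketch of the subquery approach, using an in-memory SQLite database as a stand-in for Postgres so it runs anywhere. The table name, column names, and index definition are assumptions; only the index name idx_catalog_default comes from this commit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model_catalog (           -- hypothetical schema
    id INTEGER PRIMARY KEY,
    provider TEXT NOT NULL,
    category TEXT NOT NULL,
    is_default INTEGER NOT NULL DEFAULT 0
);
-- At most one default per (provider, category).
CREATE UNIQUE INDEX idx_catalog_default
    ON model_catalog (provider, category) WHERE is_default = 1;
INSERT INTO model_catalog (id, provider, category, is_default) VALUES
    (1, 'anthropic', 'extraction', 1),
    (2, 'anthropic', 'extraction', 0);
""")


def set_model_default(conn: sqlite3.Connection, model_id: int) -> None:
    with conn:  # one transaction for both statements
        # Clear the current default; provider/category are resolved in SQL,
        # so no row is ever unpacked in Python.
        conn.execute(
            """
            UPDATE model_catalog SET is_default = 0
            WHERE is_default = 1
              AND provider = (SELECT provider FROM model_catalog WHERE id = ?)
              AND category = (SELECT category FROM model_catalog WHERE id = ?)
            """,
            (model_id, model_id),
        )
        conn.execute(
            "UPDATE model_catalog SET is_default = 1 WHERE id = ?", (model_id,)
        )


set_model_default(conn, 2)
set_model_default(conn, 2)  # repeating the call is harmless
defaults = [r[0] for r in conn.execute(
    "SELECT id FROM model_catalog WHERE is_default = 1")]
```

Because the lookup never leaves the database, the result does not depend on whether the cursor returns tuples or dict-style rows.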

Manifests as Step 7 of guided init: "Models command failed: duplicate
key value violates unique constraint 'idx_catalog_default'".