Skip to content

fix(embedding): surface upstream errors instead of silent timeout#1910

Merged
MaojiaSheng merged 1 commit into
mainfrom
fix/embedding-error-visibility
May 15, 2026
Merged

fix(embedding): surface upstream errors instead of silent timeout#1910
MaojiaSheng merged 1 commit into
mainfrom
fix/embedding-error-visibility

Conversation

@ZaynJarvis
Copy link
Copy Markdown
Collaborator

Summary

  • Problem: When an embedding provider returns 500 or hangs (e.g., BytePlus endpoint down), ov find shows misleading Network error: HTTP request failed because the Rust CLI times out (60s) before the Python server finishes retrying the embedding call (3+ minutes with SDK retries compounding on top of OV retries). Neither ov health nor ov ready checks embedding connectivity, so users have no way to detect a broken provider.
  • Set explicit 30s timeout and disable SDK-level retries (max_retries=0) on the OpenAI embedding client — the server now fails fast enough for the CLI to receive the actual error
  • Add try-except in HierarchicalRetriever.retrieve() around embed_compat() with provider/model context in the error message, so users see exactly which backend is broken
  • Add embedding connectivity probe to /ready endpoint (10s timeout) — ov ready now catches a broken embedding provider before any search is attempted

Test plan

  • All existing embedding unit tests pass (164 passed, pre-existing failures unchanged)
  • No new test regressions introduced
  • Manual test: start server with a broken embedding config → ov ready should report embedding: error: ...
  • Manual test: ov find with broken provider → should return specific error instead of "Network error"

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 92
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: fix(embedding): surface upstream errors instead of silent timeout

Relevant files:

  • openviking/models/embedder/openai_embedders.py
  • openviking/retrieve/hierarchical_retriever.py
  • openviking/server/routers/system.py

Sub-PR theme: fix(time): use UTC-aware datetimes consistently

Relevant files:

  • openviking/storage/viking_fs.py
  • openviking/utils/time_utils.py

⚡ No major issues detected

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

PR Code Suggestions ✨

No code suggestions found for the PR.

The /ready endpoint checks AGFS, VectorDB, APIKeyManager, and Ollama
but not the embedding provider. When the provider is down (500 or
unreachable), ov find silently times out with a misleading "Network
error" and neither ov health nor ov ready surfaces the real cause.

Add an embedding probe (single-token embed with 10s timeout) so that
ov ready reports a broken embedding backend before any search is
attempted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ZaynJarvis ZaynJarvis force-pushed the fix/embedding-error-visibility branch from 10022e6 to 8163183 Compare May 8, 2026 08:01
@MaojiaSheng MaojiaSheng merged commit af4c54f into main May 15, 2026
4 of 6 checks passed
@MaojiaSheng MaojiaSheng deleted the fix/embedding-error-visibility branch May 15, 2026 12:47
@github-project-automation github-project-automation Bot moved this from Backlog to Done in OpenViking project May 15, 2026
qin-ctx pushed a commit that referenced this pull request May 18, 2026
…1910 (#2086)

* docs(deployment): document embedding + ollama probes in /ready response (EN)

* docs(deployment): document embedding + ollama probes in /ready response (ZH)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants