Merged
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -75,7 +75,7 @@ jobs:
- working-directory: workers
run: |
uv sync --dev
- uv run pytest -v --cov=workers --cov-report=xml
+ uv run pytest -v --cov=workers --cov-report=xml -m "not slow"

build-web:
runs-on: ubuntu-latest
53 changes: 53 additions & 0 deletions CLAUDE.md
@@ -58,6 +58,59 @@ Plan: `thoughts/shared/plans/active-2026-05-06-deliver-orchestrator-capacity-det
Investigation: `thoughts/shared/investigations/2026-05-06-ollama-think-suppression-empirical.md`
Runbook: [`docs/admin/llm-config.md`](docs/admin/llm-config.md#backend-parallelism-and-the-max_concurrent_calls-field)

**2026-05-06 worker LLM concurrency refactor** — 7 commits, `3274ade..6eed915`.
Universal per-`(provider, base_url)` gate registry in the Python worker:
every LLM and embedding call passes through a host or per-kind semaphore,
jitter-aware tenacity retry loop, optional RPM limiter, and tok/s ring
buffer. Eliminates the 5×3=15-attempt storm from stacked hand-rolled
retries. `GetProviderCapabilities.max_concurrent_calls` is now sourced
from the gate's effective cap for the resolved context, not bootstrap
config, so Go and Python agree on capacity by construction. Phase 7
extends `/api/v1/admin/llm/activity` with a `gate_snapshot` field and
adds a live "LLM Gate Activity" section to the admin monitor page.

Load-bearing constraints for future-Claude:

- **Don't re-enable SDK retry** (`max_retries=0` on `AsyncOpenAI` and
`AsyncAnthropic`). The tenacity wrapper owns retry. Re-enabling SDK
retry produces 5×3=15-attempt storms per Decision 3.
- **Don't add a `[llm.concurrency]` TOML section.** Concurrency is
operator-tunable via env vars, not `config.toml`. Decision 7.
- **The kill switch is the rollback path**: `SOURCEBRIDGE_LLM_CONCURRENCY_WRAPPER_ENABLED=false`
reverts to pre-refactor behavior without redeploy. Use it before
assuming the gate is the problem.
- **Registry is constructed-once-passed-by-reference** — constructed in
`workers/__main__.py` and `workers/common/cli_main.py` only. No
module-level singletons. Every factory call (`create_llm_provider`,
`create_embedding_provider`, etc.) receives `gate_registry=` as a
required kwarg.
- **Don't delete the empty-content retry** at
`workers/common/llm/openai_compat.py` (around lines 249–313). It
handles `<think>`-budget exhaustion (`stop_reason=length` + empty
visible content) — it is NOT a network retry and is explicitly distinct
from the tenacity wrapper retry.
- **Gate is authoritative for `GetProviderCapabilities`**: the worker's
`GetProviderCapabilities` handler reads the registry's effective cap
for the resolved-context `(provider, base_url)` via
`workers/reasoning/servicer.py`. Don't bypass back to
`WorkerConfig.llm_max_concurrent_calls` except in the legacy fallback
path (kill switch off).
- **Host gate vs. per-kind gate classification is per-provider, not
configurable per call.** Local providers (`ollama`, `vllm`,
`llama-cpp`, `sglang`, `lmstudio`) share one host gate across LLM and
embedding. Cloud providers (`openai`, `anthropic`, `gemini`,
`openrouter`) use per-kind gates. `openai-compatible` defaults host;
flip with `SOURCEBRIDGE_LLM_PROVIDER_OPENAI_COMPATIBLE_GATING=per_kind`.
- **Don't fork the cross-language plumbing.** `/api/v1/admin/llm/activity`
(REST) and `KnowledgeStreamProgress` (proto) are the sole channels for
gate snapshot and per-job tok/s. Don't add a new endpoint or proto
field; extend these.

Plan: `thoughts/shared/plans/active-2026-05-06-deliver-worker-llm-concurrency.md`
Investigation: `thoughts/shared/investigations/2026-05-06-diagnose-llm-throughput-rotten.md`
Decisions log: `thoughts/shared/plans/active-2026-05-06-deliver-worker-llm-concurrency.decisions.md`
Runbook: [`docs/admin/llm-config.md`](docs/admin/llm-config.md#operator-concurrency-tuning)

**2026-05-05 web runtime API proxy fix** — 3 commits, `1fee78b..873bc53`.
Replaces `next.config.ts rewrites()` with a Next.js middleware at
`web/src/middleware.ts` that proxies `/api/*`, `/auth/*`, `/healthz`,
170 changes: 170 additions & 0 deletions docs/admin/llm-config.md
@@ -1297,6 +1297,176 @@ and a reinforced no-think system prompt. Total attempt budget is 3

---

## Operator concurrency tuning

The Python worker enforces per-provider concurrency through a gate
registry built from `workers/common/llm/concurrency.py`. Every LLM and
embedding call passes through a provider gate that holds the semaphore,
an optional RPM limiter, the retry loop, and the tok/s ring buffer.
This section documents the operator-visible knobs.
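
As a rough illustration of the semaphore part of a provider gate, the sketch below bounds concurrent calls and tracks in-flight and queued counts. `ProviderGate`, `slot`, and `demo` are hypothetical names, not the API of `workers/common/llm/concurrency.py`, and the RPM limiter, retry loop, and tok/s buffer are deliberately left out.

```python
import asyncio
from contextlib import asynccontextmanager


class ProviderGate:
    """Hypothetical gate: a semaphore plus in-flight and queued counters."""

    def __init__(self, max_concurrent: int) -> None:
        self.max_concurrent = max_concurrent
        self.in_flight = 0
        self.queued = 0
        self.peak_in_flight = 0
        self._sem = asyncio.Semaphore(max_concurrent)

    @asynccontextmanager
    async def slot(self):
        # Count the caller as queued until a semaphore slot is granted.
        self.queued += 1
        try:
            await self._sem.acquire()
        finally:
            self.queued -= 1
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        try:
            yield
        finally:
            self.in_flight -= 1
            self._sem.release()


async def demo(calls: int = 5, cap: int = 2) -> ProviderGate:
    gate = ProviderGate(max_concurrent=cap)

    async def fake_call() -> None:
        async with gate.slot():
            await asyncio.sleep(0.01)  # stand-in for an LLM request

    await asyncio.gather(*(fake_call() for _ in range(calls)))
    return gate
```

Running `asyncio.run(demo())` pushes five fake calls through a gate capped at two, so at most two are ever in flight at once.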

The gate registry is the **runtime source of truth** for in-worker LLM
concurrency. The `GetProviderCapabilities.max_concurrent_calls` gRPC
field (consumed by the Go orchestrator at
`internal/qa/lazy_agent_synth.go:340`) is now sourced from the gate
registry's effective cap for the resolved-context `(provider, base_url)`,
not from the bootstrap config. Setting a per-provider env var changes
both the worker's semaphore and the value the orchestrator uses to clamp
its goroutine pool — the same knob applies on both sides.

### Env var resolution order

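For the concurrency cap, resolution is first-match-wins, top to bottom:
a per-provider override takes precedence over the legacy seed and its
alias. The gating, RPM, kill-switch, retry, and metrics rows are
independent knobs rather than steps in that order.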

| Env var | Scope | Default | Notes |
|---|---|---|---|
| `SOURCEBRIDGE_LLM_PROVIDER_<NAME>_MAX_CONCURRENT` | per-provider LLM (or host-total for local providers) | see table below | `<NAME>` per canonical table; e.g., `SOURCEBRIDGE_LLM_PROVIDER_OLLAMA_MAX_CONCURRENT=4` |
| `SOURCEBRIDGE_EMBEDDING_PROVIDER_<NAME>_MAX_CONCURRENT` | per-provider embedding (cloud providers only; ignored when host-gated) | same as LLM cap | unused for Ollama, whose host gate combines both kinds |
| `SOURCEBRIDGE_LLM_PROVIDER_<NAME>_RPM` | per-provider rate limit | unset (no limiter) | applies to all providers; see tier-1 cloud values below |
| `SOURCEBRIDGE_LLM_PROVIDER_OPENAI_COMPATIBLE_GATING` | `openai-compatible` gate mode | `host` | set to `per_kind` if pointing at a managed endpoint with separate chat/embedding quotas |
| `SOURCEBRIDGE_LLM_CONCURRENCY_WRAPPER_ENABLED` | kill switch | `true` | set to `false` to revert to pre-refactor behavior without redeploy |
| `SOURCEBRIDGE_LLM_RETRY_MAX_ATTEMPTS` | tenacity retry attempts | `5` | reduce to 2–3 on unreliable networks; raising it risks retry storms |
| `SOURCEBRIDGE_LLM_METRICS_AGGREGATION_INTERVAL_SECONDS` | gate-metrics log interval | `30` | lower to 5 for debugging; not for production steady-state |
| `SOURCEBRIDGE_WORKER_LLM_MAX_CONCURRENT_CALLS` | **legacy seed** — seeds the active LLM provider's gate cap when no per-provider override is set | unset | deprecated for new deployments; preserved for backward compat |
| `SOURCEBRIDGE_LLM_PARALLEL_HINT` | alias for the legacy seed above | unset | deprecated alias; kept for backward compat |

**Canonical `<NAME>` tokens** (`key.upper().replace("-", "_")`):

| Provider | Env-var token |
|---|---|
| `openai` | `OPENAI` |
| `anthropic` | `ANTHROPIC` |
| `ollama` | `OLLAMA` |
| `vllm` | `VLLM` |
| `llama-cpp` | `LLAMA_CPP` |
| `sglang` | `SGLANG` |
| `gemini` | `GEMINI` |
| `openrouter` | `OPENROUTER` |
| `lmstudio` | `LMSTUDIO` |
| `openai-compatible` | `OPENAI_COMPATIBLE` |

The worker validates env-var tokens at startup and rejects unknown
spellings (e.g., `SOURCEBRIDGE_LLM_PROVIDER_OPENAICOMPAT_MAX_CONCURRENT`)
with an actionable error naming the canonical table.
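
The token rule and the startup check can be sketched as follows. The normalization expression comes from the table heading above; `env_token`, `unknown_provider_vars`, and the suffix list are illustrative assumptions, and the sketch only covers the `SOURCEBRIDGE_LLM_PROVIDER_` prefix.

```python
# Provider keys from the canonical table above.
CANONICAL_PROVIDERS = (
    "openai", "anthropic", "ollama", "vllm", "llama-cpp",
    "sglang", "gemini", "openrouter", "lmstudio", "openai-compatible",
)

PREFIX = "SOURCEBRIDGE_LLM_PROVIDER_"
# Assumption: these are the per-provider suffixes worth validating.
SUFFIXES = ("_MAX_CONCURRENT", "_RPM", "_GATING")


def env_token(provider: str) -> str:
    """Map a provider key to its env-var token, e.g. llama-cpp -> LLAMA_CPP."""
    return provider.upper().replace("-", "_")


CANONICAL_TOKENS = {env_token(p) for p in CANONICAL_PROVIDERS}


def unknown_provider_vars(env: dict) -> list:
    """Return per-provider env var names whose token is not canonical."""
    bad = []
    for name in env:
        if not name.startswith(PREFIX):
            continue
        rest = name[len(PREFIX):]
        for suffix in SUFFIXES:
            if rest.endswith(suffix):
                token = rest[: -len(suffix)]
                if token not in CANONICAL_TOKENS:
                    bad.append(name)
                break
    return sorted(bad)
```

A real startup check would presumably raise with a message listing the canonical tokens; this sketch only collects the offending names.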

### Default values by provider

| Provider | Gating | LLM cap | Embedding cap | RPM default |
|---|---|---|---|---|
| `ollama` | host | 1 | (shared via host gate) | none |
| `vllm` | host | 4 | (shared) | none |
| `llama-cpp` | host | 4 | (shared) | none |
| `sglang` | host | 4 | (shared) | none |
| `lmstudio` | host | 2 | (shared) | none |
| `openai-compatible` | host (operator-flippable to `per_kind`) | 4 | (shared) | none |
| `openai` | per-kind | 8 | 16 | none |
| `anthropic` | per-kind | 4 | n/a | none |
| `gemini` | per-kind | 8 | 16 | none |
| `openrouter` | per-kind | 8 | n/a | none |

**Host vs. per-kind gating**: local providers (`ollama`, `vllm`,
`llama-cpp`, `sglang`, `lmstudio`) use one host gate that combines LLM
and embedding calls. Cloud providers (`openai`, `anthropic`, `gemini`,
`openrouter`) use separate per-kind gates. `openai-compatible` defaults
to host; flip with `SOURCEBRIDGE_LLM_PROVIDER_OPENAI_COMPATIBLE_GATING=per_kind`
if the endpoint has separate chat vs. embedding quotas.
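
One way to picture the classification is as a function from provider to gate key. The provider sets and the env var name are taken from the text above; `gating_mode`, `gate_key`, and the tuple layout of the key are assumptions made for illustration.

```python
HOST_GATED = frozenset({"ollama", "vllm", "llama-cpp", "sglang", "lmstudio"})
PER_KIND_GATED = frozenset({"openai", "anthropic", "gemini", "openrouter"})
GATING_ENV = "SOURCEBRIDGE_LLM_PROVIDER_OPENAI_COMPATIBLE_GATING"


def gating_mode(provider: str, env: dict) -> str:
    """Return "host" or "per_kind" for a provider key."""
    if provider in HOST_GATED:
        return "host"
    if provider in PER_KIND_GATED:
        return "per_kind"
    if provider == "openai-compatible":
        # Defaults to host; only the exact value "per_kind" flips it.
        return "per_kind" if env.get(GATING_ENV, "host") == "per_kind" else "host"
    raise ValueError(f"unknown provider: {provider!r}")


def gate_key(provider: str, base_url: str, kind: str, env: dict) -> tuple:
    """Hypothetical registry key: host-gated providers ignore the call kind."""
    if gating_mode(provider, env) == "host":
        return (provider, base_url, "host")
    return (provider, base_url, kind)
```

With this shape, LLM and embedding calls to a host-gated provider resolve to the same key and therefore the same semaphore, while cloud providers get one key per kind.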

These are conservative real caps, not sentinels. Operators with
high-tier cloud accounts should raise the cap rather than disable the
gate:

```bash
SOURCEBRIDGE_LLM_PROVIDER_OPENAI_MAX_CONCURRENT=64
```

The hard ceiling is 256 concurrent calls (enforced at the gate, the
Go-side adapter clamp, and the SurrealDB `ASSERT` constraint).
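
Putting the resolution order, the default table, and the ceiling together, the effective LLM cap could be computed roughly as below. This is a sketch under several assumptions: `resolve_llm_cap` is a hypothetical name, the provider passed in is the active LLM provider (the legacy seed only applies to it), the legacy variable is checked before its alias, and out-of-range values are clamped rather than rejected.

```python
# LLM caps from the defaults table above.
DEFAULT_LLM_CAP = {
    "ollama": 1, "vllm": 4, "llama-cpp": 4, "sglang": 4, "lmstudio": 2,
    "openai-compatible": 4, "openai": 8, "anthropic": 4, "gemini": 8,
    "openrouter": 8,
}
HARD_CEILING = 256


def resolve_llm_cap(provider: str, env: dict) -> int:
    """First match wins: per-provider override, legacy seed, alias, default."""
    token = provider.upper().replace("-", "_")
    candidates = (
        f"SOURCEBRIDGE_LLM_PROVIDER_{token}_MAX_CONCURRENT",
        "SOURCEBRIDGE_WORKER_LLM_MAX_CONCURRENT_CALLS",
        "SOURCEBRIDGE_LLM_PARALLEL_HINT",
    )
    for name in candidates:
        raw = env.get(name)
        if raw:
            # Assumption: clamp into [1, HARD_CEILING] instead of erroring.
            return max(1, min(int(raw), HARD_CEILING))
    return DEFAULT_LLM_CAP[provider]
```

For example, an override of `1000` for `openai` resolves to 256 under the clamping assumption, and an unset environment yields the table default.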

### Ollama: one knob covers everything

For Ollama, `SOURCEBRIDGE_LLM_PROVIDER_OLLAMA_MAX_CONCURRENT` is the
**only** concurrency knob needed. The host gate combines LLM and
embedding calls against the same normalized origin
(`http://localhost:11434` regardless of whether the SDK uses
`/v1` or `/api`), so both kinds share one semaphore. Setting
`SOURCEBRIDGE_EMBEDDING_PROVIDER_OLLAMA_MAX_CONCURRENT` has no effect:
it is ignored for host-gated providers.

Set it to match `OLLAMA_NUM_PARALLEL` on the Ollama daemon.
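
The origin normalization described above can be approximated with the standard library. `normalized_origin` is a hypothetical helper, not the worker's actual function; it keeps only scheme, host, and port so that `/v1` and `/api` base URLs collapse to the same gate key.

```python
from urllib.parse import urlsplit


def normalized_origin(base_url: str) -> str:
    """Reduce a base URL to scheme://host[:port], dropping path and query."""
    parts = urlsplit(base_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not an absolute URL: {base_url!r}")
    host = parts.hostname  # urlsplit lowercases the hostname
    if parts.port is not None:
        return f"{parts.scheme}://{host}:{parts.port}"
    return f"{parts.scheme}://{host}"
```

Both `http://localhost:11434/v1` and `http://localhost:11434/api` map to `http://localhost:11434` under this sketch.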

### Capacity contract

After this refactor, `GetProviderCapabilities.max_concurrent_calls` is
sourced from the gate registry's effective cap for the **resolved
context** `(provider, base_url)` — not from the bootstrap config. The
Go orchestrator at `internal/qa/lazy_agent_synth.go:340` clamps
`MaxConcurrency` to this value. Setting the per-provider env var changes
both the worker's semaphore and the orchestrator's clamp in one step.

When the wrapper is disabled via the kill switch, the legacy
`SOURCEBRIDGE_WORKER_LLM_MAX_CONCURRENT_CALLS` / `SOURCEBRIDGE_LLM_PARALLEL_HINT`
path is used instead (same behavior as before this refactor).
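
The two paths of the capacity contract can be sketched as a single selection function. The env var names come from this document; `wrapper_enabled`, `reported_max_concurrent_calls`, the case-insensitive parsing of `false`, and returning `None` when no legacy value is set are all assumptions, not the behavior of `workers/reasoning/servicer.py`.

```python
KILL_SWITCH = "SOURCEBRIDGE_LLM_CONCURRENCY_WRAPPER_ENABLED"
LEGACY_VARS = (
    "SOURCEBRIDGE_WORKER_LLM_MAX_CONCURRENT_CALLS",
    "SOURCEBRIDGE_LLM_PARALLEL_HINT",
)


def wrapper_enabled(env: dict) -> bool:
    """The wrapper is on unless the kill switch is explicitly "false"."""
    return env.get(KILL_SWITCH, "true").strip().lower() != "false"


def reported_max_concurrent_calls(env: dict, gate_effective_cap: int):
    """Pick the value reported to the orchestrator for the resolved context."""
    if wrapper_enabled(env):
        # Normal path: the gate registry's effective cap is authoritative.
        return gate_effective_cap
    # Legacy fallback path (kill switch off): bootstrap-style env values.
    for name in LEGACY_VARS:
        raw = env.get(name)
        if raw:
            return int(raw)
    return None  # assumption: nothing to report when no legacy value is set
```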

### RPM values for tier-1 cloud accounts

The defaults ship with no RPM limiter (`None`). Operators on high-tier
accounts can layer on an RPM limit to prevent hitting provider-side
rate ceilings under burst load.

| Provider | Tier | Recommended env var |
|---|---|---|
| OpenAI | Tier 4 | `SOURCEBRIDGE_LLM_PROVIDER_OPENAI_RPM=10000` (chat models; per-model limits vary for embeddings — check the OpenAI usage dashboard) |
| Anthropic | Tier 2 | `SOURCEBRIDGE_LLM_PROVIDER_ANTHROPIC_RPM=4000` (Claude 3.5 Sonnet; other models differ) |
| Gemini | Tier 2 / Pro | per-model — see [Google's rate limit documentation](https://ai.google.dev/gemini-api/docs/rate-limits); set `SOURCEBRIDGE_LLM_PROVIDER_GEMINI_RPM=<value>` |
| OpenRouter | varies | leave unset; OpenRouter enforces its own rate limits and returns 429s; the retry wrapper handles them |

### Server-side companion knobs (Ollama)

These variables go on the **Ollama daemon**, not in SourceBridge. They
are the dominant bottleneck for Ollama throughput. Investigation
`thoughts/shared/investigations/2026-05-06-diagnose-llm-throughput-rotten.md`
confirmed that `OLLAMA_NUM_PARALLEL=1` (the stock default) is the
single largest throughput bottleneck — more impactful than any
SourceBridge-side tuning.

| Ollama env var | Recommended value | Effect |
|---|---|---|
| `OLLAMA_NUM_PARALLEL` | `4`–`8` (sufficient RAM) / `2`–`4` (16–32 GB) / `1` (≤8 GB) | Max in-flight requests per daemon. **Set `SOURCEBRIDGE_LLM_PROVIDER_OLLAMA_MAX_CONCURRENT` to the same value.** |
| `OLLAMA_KEEP_ALIVE` | `-1` or `24h` | Prevents model unload between Living Wiki pages and between consecutive jobs. Default `5m` causes full reload latency (~30–90 s) at the start of each page after idle. |
| `OLLAMA_MAX_LOADED_MODELS` | number of distinct models the workload uses + 1 | Prevents model thrashing when `OLLAMA_NUM_PARALLEL > 1` and multiple models are configured. Default `1` is conservative; raise when you have LLM + embedding models that must coexist in VRAM. |

Set these in your Ollama service file (e.g.,
`/etc/systemd/system/ollama.service` `[Service] Environment=...` or
`/Library/LaunchDaemons/com.ollama.serve.plist`) and restart the
daemon. Then update `SOURCEBRIDGE_LLM_PROVIDER_OLLAMA_MAX_CONCURRENT`
to match the new `OLLAMA_NUM_PARALLEL`.

### Where to look in the UI

Go to **Admin → Monitor** (`/admin/monitor`) on your SourceBridge
instance.

The **"LLM Gate Activity"** section shows live counters per active gate:

| Column | What it means |
|---|---|
| Provider / endpoint | Gate key: provider name + normalized base URL |
| Kind | `llm` or `embedding` |
| In-flight / cap | Current in-flight calls vs. the gate's `max_concurrent` |
| Queued | Calls waiting for a slot (Decision 11 waiter counter) |
| tok/s | 60-second rolling tokens-per-second for this gate |
| 429s | Rate-limit errors since the gate was created |
| Retries | Total tenacity retry attempts since start |
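
For intuition about the tok/s column, a 60-second rolling rate can be kept with a simple buffer of timestamped samples. `TokRate` is a hypothetical class; the worker's actual ring buffer may be structured differently, and dividing by the full window (rather than by elapsed time) is a choice made for this sketch.

```python
from collections import deque


class TokRate:
    """Rolling tokens-per-second over a fixed time window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        self._samples = deque()  # (timestamp_seconds, token_count)

    def record(self, now: float, tokens: int) -> None:
        """Add a completed call's token count at time `now`."""
        self._samples.append((now, tokens))
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop samples that have aged out of the window.
        cutoff = now - self.window
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def tokens_per_second(self, now: float) -> float:
        self._evict(now)
        return sum(tokens for _, tokens in self._samples) / self.window
```

Passing the clock in explicitly keeps the sketch deterministic; a live gate would more likely read `time.monotonic()`.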

The per-job **tok/s pill** on each `ActiveJobCard` shows live
throughput for that specific Living Wiki generation run.

The underlying data comes from `GET /api/v1/admin/llm/activity`
(`gate_snapshot` field), populated by the `GetLLMGateSnapshot` gRPC
method on the worker's `ReasoningService`.

---

## Living Wiki page-count and ops behavior

For how Living Wiki determines the number of pages to generate, how
22 changes: 18 additions & 4 deletions gen/go/common/v1/knowledge_progress.pb.go