Skip to content

fix: stabilize post-544 smoke train#551

Open
bussyjd wants to merge 18 commits into
mainfrom
fix/pr544-plus-smoke-regressions
Open

fix: stabilize post-544 smoke train#551
bussyjd wants to merge 18 commits into
mainfrom
fix/pr544-plus-smoke-regressions

Conversation

@bussyjd
Copy link
Copy Markdown
Collaborator

@bussyjd bussyjd commented May 25, 2026

Summary

This stabilizes the merged post-#544 train (#544, #546, #547, #548, #549, #550) against the smoke failures seen during the ordered test run.

LiteLLM Fork Bump

The Obol LiteLLM fork was rebuilt cleanly on top of current upstream instead of merging upstream history into the fork branch. The fork compare now shows only the six Obol-specific commits ahead of upstream and no upstream backlog:

The Renovate config now lets the existing generic ghcr.io/obolnetwork/* extractor track this image and applies the LiteLLM package rule for labels/grouping/digest pinning, avoiding a duplicate LiteLLM-specific regex manager for the same lines.

Supersedes

This PR is intentionally based on the ordered post-#544 smoke train and should be reviewed as the collapse/superseding PR for:

It contains those changes plus the follow-up fixes from the smoke review. If this PR lands, the underlying PRs should not be merged separately afterward.

Failure Map

flowchart TD
    Train["Merged PR train > #544"] --> Smoke["release-smoke run"]
    Smoke --> F06["flow-06 expected verifier replicas=2"]
    Smoke --> F04["flow-04 empty agent response"]
    Smoke --> F13["flow-13 empty agent / no PurchaseRequest"]
    Smoke --> F14["flow-14 empty agent / no PurchaseRequest"]
    Smoke --> F11["flow-11 paid inference timeout"]
    Smoke --> CF["#547 cloudflared tag/digest mismatch"]
    Smoke --> HotAdd["LiteLLM /model/new falls back"]

    F06 --> Fix06["Expect 1 local verifier replica"]
    F04 --> LLM["Endpoint preflight + discovered non-thinking request shape"]
    F13 --> LLM
    F14 --> LLM
    F11 --> Retry["Retry transient timeout once, then fail"]
    CF --> Digest["Use 2026.5.0 manifest digest"]
    HotAdd --> RuntimeConfig["Writable emptyDir runtime config"]
Loading

LiteLLM Hot-Add Root Cause

/model/new was not unreliable because of a recent #544+ product change. The failure was architectural: LiteLLM was configured to persist to /etc/litellm/config.yaml, but that path was a ConfigMap-backed subPath mount. ConfigMap volumes are read-only by design, and the container also runs with readOnlyRootFilesystem: true.

The correct Kubernetes primitive is to keep the ConfigMap as the source of truth and give the pod a writable runtime copy:

flowchart LR
    CM["litellm-config ConfigMap\nsource of truth"] --> Src["read-only mount\n/config-src/config.yaml"]
    Src --> Init["initContainer\ncopy config.yaml"]
    Init --> Work["emptyDir\n/config/config.yaml"]
    Work --> Main["LiteLLM\n/etc/litellm/config.yaml"]
    Main --> API["/model/new\nupdates router + persists YAML"]
    API --> Work
Loading

This keeps the restricted security posture intact, avoids PVC drift, and matches the existing controller/CLI flow: patch the Kubernetes ConfigMap first, then call /model/new for live routing. Restarts still converge back to the ConfigMap.

LiteLLM Model Mutation Safety

The writable runtime copy is safe for Obol-managed model changes because every supported mutation path either patches the ConfigMap and then updates the live router, or patches the ConfigMap and restarts LiteLLM so the initContainer copies the restored source of truth.

Path Persistent source Live effect Runtime-copy implication
obol model setup --provider ollama patches litellm-config /model/new; restart fallback live YAML is writable, so the API can persist
obol model setup --provider anthropic/openai patches litellm-config and Secret restart for envFrom Secret new pod copies patched ConfigMap
obol model setup custom patches litellm-config /model/new; restart fallback same as Ollama/custom OpenAI-compatible route
obol model prefer reorders litellm-config restart, because LiteLLM has no reorder API new pod copies reordered ConfigMap
obol model remove removes from litellm-config /model/delete; restart fallback is intentionally not required live route is deleted in place; ConfigMap remains source of truth
obol model sync reads litellm-config agent re-render only no LiteLLM mutation expected
purchase controller paid routes patches litellm-config /model/new / /model/delete avoids pod restart while still persisting route intent
obol stack up Helm restore preserves and restores litellm-config around Helmfile restart only when restored config semantically changes closes the upgrade window where a pod could copy chart-default config before restore

The important invariant is: no Obol CLI path relies on Kubernetes live-updating /etc/litellm/config.yaml from a ConfigMap mount. That was already false with the old subPath mount. The live paths rely on /model/new / /model/delete or an explicit rollout, and this PR makes those paths consistent with a writable LiteLLM config file.

Additional guards added after review:

  • TestLiteLLMConfigSemanticEqualIgnoresFormatting prevents formatting-only restore churn from causing unnecessary restarts.
  • TestSyncDefaultsRestartsLiteLLMAfterConfigRestore_SourceGuard pins the restore ConfigMap -> restart LiteLLM -> autoConfigureLLM ordering.

Alternatives Considered

Alternative Result Why
Keep ConfigMap subPath mount rejected subPath is read-only and does not live-update; /model/new cannot persist.
Mount the whole ConfigMap directory at /etc/litellm rejected removes subPath, but ConfigMap volumes are still read-only, so /model/new still fails.
Always restart LiteLLM for every model change rejected simple, but defeats paid-route hot-add and creates avoidable disruption.
Writable PVC for /etc/litellm rejected avoids restart drift but creates a second durable source of truth outside Helm/ConfigMap ownership.
Sidecar sync from ConfigMap to writable file rejected more moving parts and race-prone against LiteLLM writing the same file.
LiteLLM DB/Postgres persistence not for this PR durable, but introduces a database and changes the stack shape substantially.
ConfigMap source + init copy to emptyDir + explicit API/restart convergence chosen minimal Kubernetes primitive, keeps ConfigMap as source of truth, gives LiteLLM a writable configured path, and preserves zero-restart hot-add where needed.

LLM Smoke Behavior

The smoke runner now validates an endpoint before spending time on clusters:

sequenceDiagram
    participant Smoke as release-smoke / flow-01
    participant LLM as OpenAI-compatible endpoint
    participant Agent as Hermes/OpenClaw agent smoke calls

    Smoke->>LLM: GET /models
    Smoke->>LLM: POST /chat/completions marker prompt
    alt final content returned
        Smoke-->>Agent: run normal request payloads
    else reasoning only / empty final content
        Smoke->>LLM: retry with chat_template_kwargs.enable_thinking=false
        LLM-->>Smoke: final marker content
        Smoke-->>Agent: export OBOL_LLM_DISABLE_THINKING=true
        Agent->>LLM: include chat_template_kwargs.enable_thinking=false
    else still unusable
        Smoke-->>Smoke: fail fast before cluster flows
    end
Loading

This keeps smoke usable across OpenAI-compatible endpoints while avoiding the previous mismatch where the endpoint could pass a direct manual smoke only when non-thinking was forced, but the agent flows still sent a different payload.

Gateway API #544 Story

#544 bumped gatewayApiVersion in internal/embed/infrastructure/helmfile.yaml from v1.4.1 to v1.5.1. Today that value is not consumed by any rendered template or CRD installer. The Traefik Helm chart is the component that installs and owns the Gateway API CRDs in the local stack.

CRDs/resources relevant to this stack:

  • gatewayclasses.gateway.networking.k8s.io: Traefik's controller class.
  • gateways.gateway.networking.k8s.io: Traefik's traefik-gateway entry point.
  • httproutes.gateway.networking.k8s.io: eRPC, frontend, skill catalog, services JSON, ServiceOffer routes, and agent identity well-known routes.
  • referencegrants.gateway.networking.k8s.io: allows cross-namespace references needed by controller-created ServiceOffer routing to the shared x402 verifier service.
flowchart TD
    TraefikChart["traefik/traefik chart"] --> CRDs["Gateway API CRDs"]
    CRDs --> GatewayClass["GatewayClass traefik"]
    CRDs --> Gateway["Gateway traefik-gateway"]
    CRDs --> HTTPRoute["HTTPRoute"]
    CRDs --> ReferenceGrant["ReferenceGrant"]

    HTTPRoute --> ERPC["/rpc local-only"]
    HTTPRoute --> FE["frontend local-only"]
    HTTPRoute --> Catalog["/skill.md and /api/services.json"]
    HTTPRoute --> Offers["/services/* ServiceOffer routes"]
    HTTPRoute --> Identity["/.well-known/agent-registration.json"]
    ReferenceGrant --> X402["x402 verifier cross-namespace auth service"]
Loading

If we ever make gatewayApiVersion authoritative, the next change should either remove the unused value or wire it to a deliberate CRD install/update path using server-side apply and explicit validation. Gateway API v1.5.x upstream also matters because experimental CRDs are large enough to require server-side apply and TLSRoute alpha has deprecation/removal caveats. That is not happening in the current rendered stack.

Notes By Failure

  • Raw Helm lint/template: the direct raw-chart lint failure was a false check against unsubstituted {{PLACEHOLDER}} strings. The existing workflow already mirrors stack init by substituting placeholders in a temp chart copy. This PR keeps that simple path and only updates the workflow Helm version to match obolup.sh.
  • fix(monetize): drop available, use drainEndsAt as sole drain signal #548 available removal: JSON/schema remain additive and use drainEndsAt as the machine signal. /skill.md now shows active rows as available in the Status column instead of -, without reintroducing an available JSON field or - **Available**: detail bullet.
  • Paid inference: no quick-tunnel bypass was added. The smoke still asserts paid inference against the real paid endpoint; transient timeout/524/context-canceled results get one retry by default via PAID_INFERENCE_TRANSIENT_RETRIES and then fail.

Validation

  • bash -n flows/*.sh
  • jq empty renovate.json
  • go test ./internal/embed ./internal/serviceoffercontroller -count=1
  • go test ./internal/model ./internal/defaults -count=1
  • go test ./... -count=1
  • Helm base template/lint with CI-equivalent placeholder substitution
  • Helm cloudflared template/lint
  • Rendered cloudflared image contains cloudflare/cloudflared:2026.5.0@sha256:59bab8d3aceec09bf6bdb07d6beca0225ca5cd7ab79436a87ea97978fe1dc4f9
  • docker buildx imagetools inspect cloudflare/cloudflared:2026.5.0 --format '{{ .Manifest.Digest }}'
  • OpenAI-compatible QA LLM preflight passed and correctly discovered that non-thinking mode is required

Not Re-run Here

The full cluster release-smoke was not re-run after this patch because no k3d cluster was left running after cleanup. The fixes are packaged so the next release-smoke run should fail fast on bad LLM endpoint shape before cluster setup and should exercise the real paid route without the previous bypass proposal.

OBOL Permit2 Follow-Up

The final flow-13 failure after the initial train fixes was not a facilitator image regression. The agent successfully created a PurchaseRequest and the sidecar had 5 auths, but paid inference was rejected by the facilitator with PaymentTooEarly().

Root cause: Permit2 auths were using buyer host wall-clock time for validAfter. On Anvil forks, chain time only advances when blocks are mined, so time.time() - 600 can still be in the future relative to the forked chain timestamp after long LLM/cluster setup.

Fix: Permit2 presigned auths now use validAfter = "0", matching the immediate-valid behavior already used for ERC-3009-style smoke auths. The auth lifetime remains bounded by deadline.

sequenceDiagram
    participant BobHost as Bob host clock
    participant BuyPy as buy.py Permit2 signer
    participant Anvil as Anvil fork chain time
    participant Fac as x402 facilitator

    BobHost->>BuyPy: wall-clock now
    BuyPy--x Fac: old validAfter = now - 600
    Fac->>Anvil: compare against block.timestamp
    Anvil-->>Fac: chain time may lag wall-clock
    Fac--x BuyPy: PaymentTooEarly()

    BuyPy->>Fac: new validAfter = 0, bounded deadline
    Fac->>Anvil: valid immediately on chain
    Fac-->>BuyPy: payment accepted
Loading

Agent Discovery Follow-Up

The broad Hermes “discover Alice” prompt in flows 13/14 was informational, burned a 300s LLM turn, ignored curl failures, and then passed regardless. It was replaced with deterministic catalog validation from Bob’s agent pod against /api/services.json.

The structural proof is still the next step: Hermes must invoke buy.py, create the PurchaseRequest, wait for Ready=True, provision sidecar auths, make paid inference return HTTP 200, settle on-chain, and match exact OBOL balance deltas.

flowchart LR
    BobPod[Bob agent pod] --> Catalog[Alice /api/services.json]
    Catalog --> Assert[Assert service name, endpoint path, OBOL, permit2, Base Sepolia]
    Assert --> Hermes[Hermes agent buy prompt]
    Hermes --> BuyPy[buy.py creates PurchaseRequest]
    BuyPy --> Ready[PurchaseRequest Ready=True]
    Ready --> Paid[paid/qwen36-apex-i-compact HTTP 200]
    Paid --> Chain[OBOL settlement + exact balance deltas]
Loading

Latest Validation

Validated #551 + #552 with SilverMesh and the target x402 facilitator image:

  • Image: ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay:1.4.10@sha256:1fbd9e6b9863a288aba823e3107b1884746d9fb66e3c7989add4ed437c98a7ad
  • Flow: flow-13-dual-stack-obol.sh
  • Artifact: .tmp/flow13-services-json-permit2-retry-20260525-234847
  • Result: METRIC steps_failed=0
  • Key checks:
    • agent pod found alice-obol-inference in /api/services.json with OBOL + permit2
    • Hermes agent created PurchaseRequest on attempt 1
    • PurchaseRequest Ready=True
    • buyer sidecar had 5 auths
    • paid inference succeeded with HTTP 200 and expected content
    • settlement receipt archived
    • Alice balance increased and Bob signer balance decreased by exactly 1000000000000000 wei

github-actions Bot and others added 13 commits May 24, 2026 20:19
…e drain signal (re-amend of #535)

drain becomes purely additive — active offers serialize identically to
pre-drain main. The only new wire field is `drainEndsAt`, set on draining
offers only. Consumers detect drain with `if (entry.drainEndsAt) { /* draining */ }`.
No schema-breaking change for any consumer that was reading the catalog
before drain landed.

This re-ships an amendment that was originally pushed as commit dd89750 on
`feat/drain-replaces-pause` for PR #535. The amendment didn't survive the
bundle PR #536's merge into main, so the controller is shipping the
un-amended `Available bool` shape today.

- ServiceCatalogEntry: remove `Available bool`; keep `DrainEndsAt string omitempty`
- service-catalog.schema.json: drop `available` from `required` + `properties`
- buildServiceCatalogJSON: stop setting Available; only set DrainEndsAt on drain
- buildSkillCatalogMarkdown: rename `Available` table column to `Status` (active
  rows show `—`; draining rows show `draining · ends <RFC3339>`). Drop the
  per-service `- **Available**:` bullet entirely; draining services keep only
  the `- **Drain ends at**:` bullet.
- serviceDefWithDrain: stop setting the (already-additive) `Available *bool`
  on erc8004.ServiceDef during drain; signal via DrainEndsAt only.
- Tests:
  - TestBuildServiceCatalogJSON_ExcludesNonReady: replace
    `services[0].Available == true` with raw-JSON map walk asserting
    `available` and `drainEndsAt` keys are absent on active entries.
  - TestBuildServiceCatalogJSON_DrainLifecycle: rewrite to raw-map walk;
    assert active entries have neither `available` nor `drainEndsAt`, mid-drain
    entries have only `drainEndsAt` (no `available`).
  - TestBuildRegistration{,Identity}Services_IncludesDrainMetadata: replace
    `svc.Available == &false` checks with `svc.Available == nil` (DrainEndsAt
    is now the sole drain marker).
  - Add TestBuildSkillCatalogMarkdown_DrainAdditiveDetail: asserts no
    `- **Available**:` bullet appears for any offer, that draining offers
    keep their `- **Drain ends at**:` bullet, and that the table header
    uses `Status` not `Available`.
…or settlement log

The `obol_x402_verifier_last_payment_success_seconds` gauge already lands
on the verifier success branch (see internal/x402/verifier.go:206 and :261,
right next to chargedRequests.Inc()). Add the matching recording rule so
the frontend My Purchases drawer can render "last settlement: 12s ago"
labels without joining against the buyer sidecar or chasing receipts.
@bussyjd bussyjd force-pushed the fix/pr544-plus-smoke-regressions branch from 49bf2b7 to 69ca96b Compare May 25, 2026 08:43
@bussyjd bussyjd marked this pull request as ready for review May 25, 2026 20:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant