fix: stabilize post-544 smoke train by bussyjd · Pull Request #551 · ObolNetwork/obol-stack

bussyjd · 2026-05-25T08:31:47Z

Summary

This stabilizes the merged post-#544 train (#544, #546, #547, #548, #549, #550) against the smoke failures seen during the ordered test run.

Fixes the chore(deps): update cloudflare/cloudflared docker tag to v2026.5.1 #547 cloudflared tag/digest mismatch and teaches Renovate to keep that digest in lockstep.
Aligns flow-06 with the local stack's one-replica verifier manifest.
Adds a real OpenAI-compatible LLM preflight and carries its enable_thinking=false discovery into agent smoke requests.
Keeps the flow-11 paid inference assertion on the real paid route, but retries bounded transient timeout/524 failures once before failing.
Fixes LiteLLM hot-add by replacing the read-only ConfigMap subPath config file with a writable pod-local runtime copy.
Preserves fix(monetize): drop available, use drainEndsAt as sole drain signal #548's machine-wire removal of available while restoring a useful human-readable available status in /skill.md.
Documents the chore(deps): update dependency kubernetes-sigs/gateway-api to v1.5.1 #544 Gateway API story: current stack usage, CRDs involved, and why the bumped value is currently inert.

LiteLLM Fork Bump

The Obol LiteLLM fork was rebuilt cleanly on top of current upstream instead of merging upstream history into the fork branch. The fork compare now shows only the six Obol-specific commits ahead of upstream and no upstream backlog:

Upstream base: BerriAI/litellm@06f6cfc5ae
Obol fork head: ObolNetwork/litellm@9b3e569670
Clean compare: BerriAI/litellm@main...ObolNetwork:litellm:main
Verified via GitHub compare API: ahead_by=6, behind_by=0
Fork image build: https://github.com/ObolNetwork/litellm/actions/runs/26395690138
Stack image pin: ghcr.io/obolnetwork/litellm:sha-9b3e569@sha256:ac453f9cdfa3752efa38998aa5bbf4f9a67e642a68b27a647aaf667c083ddc51

The Renovate config now lets the existing generic ghcr.io/obolnetwork/* extractor track this image and applies the LiteLLM package rule for labels/grouping/digest pinning, avoiding a duplicate LiteLLM-specific regex manager for the same lines.

Supersedes

This PR is intentionally based on the ordered post-#544 smoke train and should be reviewed as the collapse/superseding PR for:

chore(deps): update dependency kubernetes-sigs/gateway-api to v1.5.1 #544 Gateway API bump
chore(deps): update obolup.sh dependency updates #546 obolup dependency bumps
chore(deps): update cloudflare/cloudflared docker tag to v2026.5.1 #547 cloudflared bump
fix(monetize): drop available, use drainEndsAt as sole drain signal #548 drain catalog available removal
docs: add obol stack release train playbook #549 release-train skill docs
feat(x402): add last-payment-success-seconds gauge + recording rule for settlement log #550 x402 settlement recording rule

It contains those changes plus the follow-up fixes from the smoke review. If this PR lands, the underlying PRs should not be merged separately afterward.

Failure Map

flowchart TD
    Train["Merged PR train > #544"] --> Smoke["release-smoke run"]
    Smoke --> F06["flow-06 expected verifier replicas=2"]
    Smoke --> F04["flow-04 empty agent response"]
    Smoke --> F13["flow-13 empty agent / no PurchaseRequest"]
    Smoke --> F14["flow-14 empty agent / no PurchaseRequest"]
    Smoke --> F11["flow-11 paid inference timeout"]
    Smoke --> CF["#547 cloudflared tag/digest mismatch"]
    Smoke --> HotAdd["LiteLLM /model/new falls back"]

    F06 --> Fix06["Expect 1 local verifier replica"]
    F04 --> LLM["Endpoint preflight + discovered non-thinking request shape"]
    F13 --> LLM
    F14 --> LLM
    F11 --> Retry["Retry transient timeout once, then fail"]
    CF --> Digest["Use 2026.5.0 manifest digest"]
    HotAdd --> RuntimeConfig["Writable emptyDir runtime config"]

LiteLLM Hot-Add Root Cause

/model/new was not unreliable because of a recent #544+ product change. The failure was architectural: LiteLLM was configured to persist to /etc/litellm/config.yaml, but that path was a ConfigMap-backed subPath mount. ConfigMap volumes are read-only by design, and the container also runs with readOnlyRootFilesystem: true.

The correct Kubernetes primitive is to keep the ConfigMap as the source of truth and give the pod a writable runtime copy:

flowchart LR
    CM["litellm-config ConfigMap\nsource of truth"] --> Src["read-only mount\n/config-src/config.yaml"]
    Src --> Init["initContainer\ncopy config.yaml"]
    Init --> Work["emptyDir\n/config/config.yaml"]
    Work --> Main["LiteLLM\n/etc/litellm/config.yaml"]
    Main --> API["/model/new\nupdates router + persists YAML"]
    API --> Work

This keeps the restricted security posture intact, avoids PVC drift, and matches the existing controller/CLI flow: patch the Kubernetes ConfigMap first, then call /model/new for live routing. Restarts still converge back to the ConfigMap.

LiteLLM Model Mutation Safety

The writable runtime copy is safe for Obol-managed model changes because every supported mutation path either patches the ConfigMap and then updates the live router, or patches the ConfigMap and restarts LiteLLM so the initContainer copies the restored source of truth.

Path	Persistent source	Live effect	Runtime-copy implication
`obol model setup --provider ollama`	patches `litellm-config`	`/model/new`; restart fallback	live YAML is writable, so the API can persist
`obol model setup --provider anthropic/openai`	patches `litellm-config` and Secret	restart for envFrom Secret	new pod copies patched ConfigMap
`obol model setup custom`	patches `litellm-config`	`/model/new`; restart fallback	same as Ollama/custom OpenAI-compatible route
`obol model prefer`	reorders `litellm-config`	restart, because LiteLLM has no reorder API	new pod copies reordered ConfigMap
`obol model remove`	removes from `litellm-config`	`/model/delete`; restart fallback is intentionally not required	live route is deleted in place; ConfigMap remains source of truth
`obol model sync`	reads `litellm-config`	agent re-render only	no LiteLLM mutation expected
purchase controller paid routes	patches `litellm-config`	`/model/new` / `/model/delete`	avoids pod restart while still persisting route intent
`obol stack up` Helm restore	preserves and restores `litellm-config` around Helmfile	restart only when restored config semantically changes	closes the upgrade window where a pod could copy chart-default config before restore

The important invariant is: no Obol CLI path relies on Kubernetes live-updating /etc/litellm/config.yaml from a ConfigMap mount. That was already false with the old subPath mount. The live paths rely on /model/new / /model/delete or an explicit rollout, and this PR makes those paths consistent with a writable LiteLLM config file.

Additional guards added after review:

TestLiteLLMConfigSemanticEqualIgnoresFormatting prevents formatting-only restore churn from causing unnecessary restarts.
TestSyncDefaultsRestartsLiteLLMAfterConfigRestore_SourceGuard pins the restore ConfigMap -> restart LiteLLM -> autoConfigureLLM ordering.

Alternatives Considered

Alternative	Result	Why
Keep ConfigMap `subPath` mount	rejected	`subPath` is read-only and does not live-update; `/model/new` cannot persist.
Mount the whole ConfigMap directory at `/etc/litellm`	rejected	removes `subPath`, but ConfigMap volumes are still read-only, so `/model/new` still fails.
Always restart LiteLLM for every model change	rejected	simple, but defeats paid-route hot-add and creates avoidable disruption.
Writable PVC for `/etc/litellm`	rejected	avoids restart drift but creates a second durable source of truth outside Helm/ConfigMap ownership.
Sidecar sync from ConfigMap to writable file	rejected	more moving parts and race-prone against LiteLLM writing the same file.
LiteLLM DB/Postgres persistence	not for this PR	durable, but introduces a database and changes the stack shape substantially.
ConfigMap source + init copy to `emptyDir` + explicit API/restart convergence	chosen	minimal Kubernetes primitive, keeps ConfigMap as source of truth, gives LiteLLM a writable configured path, and preserves zero-restart hot-add where needed.

LLM Smoke Behavior

The smoke runner now validates an endpoint before spending time on clusters:

sequenceDiagram
    participant Smoke as release-smoke / flow-01
    participant LLM as OpenAI-compatible endpoint
    participant Agent as Hermes/OpenClaw agent smoke calls

    Smoke->>LLM: GET /models
    Smoke->>LLM: POST /chat/completions marker prompt
    alt final content returned
        Smoke-->>Agent: run normal request payloads
    else reasoning only / empty final content
        Smoke->>LLM: retry with chat_template_kwargs.enable_thinking=false
        LLM-->>Smoke: final marker content
        Smoke-->>Agent: export OBOL_LLM_DISABLE_THINKING=true
        Agent->>LLM: include chat_template_kwargs.enable_thinking=false
    else still unusable
        Smoke-->>Smoke: fail fast before cluster flows
    end

This keeps smoke usable across OpenAI-compatible endpoints while avoiding the previous mismatch where the endpoint could pass a direct manual smoke only when non-thinking was forced, but the agent flows still sent a different payload.

Gateway API #544 Story

#544 bumped gatewayApiVersion in internal/embed/infrastructure/helmfile.yaml from v1.4.1 to v1.5.1. Today that value is not consumed by any rendered template or CRD installer. The Traefik Helm chart is the component that installs and owns the Gateway API CRDs in the local stack.

CRDs/resources relevant to this stack:

gatewayclasses.gateway.networking.k8s.io: Traefik's controller class.
gateways.gateway.networking.k8s.io: Traefik's traefik-gateway entry point.
httproutes.gateway.networking.k8s.io: eRPC, frontend, skill catalog, services JSON, ServiceOffer routes, and agent identity well-known routes.
referencegrants.gateway.networking.k8s.io: allows cross-namespace references needed by controller-created ServiceOffer routing to the shared x402 verifier service.

flowchart TD
    TraefikChart["traefik/traefik chart"] --> CRDs["Gateway API CRDs"]
    CRDs --> GatewayClass["GatewayClass traefik"]
    CRDs --> Gateway["Gateway traefik-gateway"]
    CRDs --> HTTPRoute["HTTPRoute"]
    CRDs --> ReferenceGrant["ReferenceGrant"]

    HTTPRoute --> ERPC["/rpc local-only"]
    HTTPRoute --> FE["frontend local-only"]
    HTTPRoute --> Catalog["/skill.md and /api/services.json"]
    HTTPRoute --> Offers["/services/* ServiceOffer routes"]
    HTTPRoute --> Identity["/.well-known/agent-registration.json"]
    ReferenceGrant --> X402["x402 verifier cross-namespace auth service"]

If we ever make gatewayApiVersion authoritative, the next change should either remove the unused value or wire it to a deliberate CRD install/update path using server-side apply and explicit validation. Gateway API v1.5.x upstream also matters because experimental CRDs are large enough to require server-side apply and TLSRoute alpha has deprecation/removal caveats. That is not happening in the current rendered stack.

Notes By Failure

Raw Helm lint/template: the direct raw-chart lint failure was a false check against unsubstituted {{PLACEHOLDER}} strings. The existing workflow already mirrors stack init by substituting placeholders in a temp chart copy. This PR keeps that simple path and only updates the workflow Helm version to match obolup.sh.
fix(monetize): drop available, use drainEndsAt as sole drain signal #548 available removal: JSON/schema remain additive and use drainEndsAt as the machine signal. /skill.md now shows active rows as available in the Status column instead of -, without reintroducing an available JSON field or - **Available**: detail bullet.
Paid inference: no quick-tunnel bypass was added. The smoke still asserts paid inference against the real paid endpoint; transient timeout/524/context-canceled results get one retry by default via PAID_INFERENCE_TRANSIENT_RETRIES and then fail.

Validation

Not Re-run Here

The full cluster release-smoke was not re-run after this patch because no k3d cluster was left running after cleanup. The fixes are packaged so the next release-smoke run should fail fast on bad LLM endpoint shape before cluster setup and should exercise the real paid route without the previous bypass proposal.

OBOL Permit2 Follow-Up

The final flow-13 failure after the initial train fixes was not a facilitator image regression. The agent successfully created a PurchaseRequest and the sidecar had 5 auths, but paid inference was rejected by the facilitator with PaymentTooEarly().

Root cause: Permit2 auths were using buyer host wall-clock time for validAfter. On Anvil forks, chain time only advances when blocks are mined, so time.time() - 600 can still be in the future relative to the forked chain timestamp after long LLM/cluster setup.

Fix: Permit2 presigned auths now use validAfter = "0", matching the immediate-valid behavior already used for ERC-3009-style smoke auths. The auth lifetime remains bounded by deadline.

sequenceDiagram
    participant BobHost as Bob host clock
    participant BuyPy as buy.py Permit2 signer
    participant Anvil as Anvil fork chain time
    participant Fac as x402 facilitator

    BobHost->>BuyPy: wall-clock now
    BuyPy--x Fac: old validAfter = now - 600
    Fac->>Anvil: compare against block.timestamp
    Anvil-->>Fac: chain time may lag wall-clock
    Fac--x BuyPy: PaymentTooEarly()

    BuyPy->>Fac: new validAfter = 0, bounded deadline
    Fac->>Anvil: valid immediately on chain
    Fac-->>BuyPy: payment accepted

Agent Discovery Follow-Up

The broad Hermes “discover Alice” prompt in flows 13/14 was informational, burned a 300s LLM turn, ignored curl failures, and then passed regardless. It was replaced with deterministic catalog validation from Bob’s agent pod against /api/services.json.

The structural proof is still the next step: Hermes must invoke buy.py, create the PurchaseRequest, wait for Ready=True, provision sidecar auths, make paid inference return HTTP 200, settle on-chain, and match exact OBOL balance deltas.

flowchart LR
    BobPod[Bob agent pod] --> Catalog[Alice /api/services.json]
    Catalog --> Assert[Assert service name, endpoint path, OBOL, permit2, Base Sepolia]
    Assert --> Hermes[Hermes agent buy prompt]
    Hermes --> BuyPy[buy.py creates PurchaseRequest]
    BuyPy --> Ready[PurchaseRequest Ready=True]
    Ready --> Paid[paid/qwen36-apex-i-compact HTTP 200]
    Paid --> Chain[OBOL settlement + exact balance deltas]

Latest Validation

Validated #551 + #552 with SilverMesh and the target x402 facilitator image:

Image: ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay:1.4.10@sha256:1fbd9e6b9863a288aba823e3107b1884746d9fb66e3c7989add4ed437c98a7ad
Flow: flow-13-dual-stack-obol.sh
Artifact: .tmp/flow13-services-json-permit2-retry-20260525-234847
Result: METRIC steps_failed=0
Key checks:
- agent pod found alice-obol-inference in /api/services.json with OBOL + permit2
- Hermes agent created PurchaseRequest on attempt 1
- PurchaseRequest Ready=True
- buyer sidecar had 5 auths
- paid inference succeeded with HTTP 200 and expected content
- settlement receipt archived
- Alice balance increased and Bob signer balance decreased by exactly 1000000000000000 wei

…e drain signal (re-amend of #535) drain becomes purely additive — active offers serialize identically to pre-drain main. The only new wire field is `drainEndsAt`, set on draining offers only. Consumers detect drain with `if (entry.drainEndsAt) { /* draining */ }`. No schema-breaking change for any consumer that was reading the catalog before drain landed. This re-ships an amendment that was originally pushed as commit dd89750 on `feat/drain-replaces-pause` for PR #535. The amendment didn't survive the bundle PR #536's merge into main, so the controller is shipping the un-amended `Available bool` shape today. - ServiceCatalogEntry: remove `Available bool`; keep `DrainEndsAt string omitempty` - service-catalog.schema.json: drop `available` from `required` + `properties` - buildServiceCatalogJSON: stop setting Available; only set DrainEndsAt on drain - buildSkillCatalogMarkdown: rename `Available` table column to `Status` (active rows show `—`; draining rows show `draining · ends <RFC3339>`). Drop the per-service `- **Available**:` bullet entirely; draining services keep only the `- **Drain ends at**:` bullet. - serviceDefWithDrain: stop setting the (already-additive) `Available *bool` on erc8004.ServiceDef during drain; signal via DrainEndsAt only. - Tests: - TestBuildServiceCatalogJSON_ExcludesNonReady: replace `services[0].Available == true` with raw-JSON map walk asserting `available` and `drainEndsAt` keys are absent on active entries. - TestBuildServiceCatalogJSON_DrainLifecycle: rewrite to raw-map walk; assert active entries have neither `available` nor `drainEndsAt`, mid-drain entries have only `drainEndsAt` (no `available`). - TestBuildRegistration{,Identity}Services_IncludesDrainMetadata: replace `svc.Available == &false` checks with `svc.Available == nil` (DrainEndsAt is now the sole drain marker). - Add TestBuildSkillCatalogMarkdown_DrainAdditiveDetail: asserts no `- **Available**:` bullet appears for any offer, that draining offers keep their `- **Drain ends at**:` bullet, and that the table header uses `Status` not `Available`.

…or settlement log The `obol_x402_verifier_last_payment_success_seconds` gauge already lands on the verifier success branch (see internal/x402/verifier.go:206 and :261, right next to chargedRequests.Inc()). Add the matching recording rule so the frontend My Purchases drawer can render "last settlement: 12s ago" labels without joining against the buyer sidecar or chasing receipts.

…-550-20260525

github-actions Bot and others added 13 commits May 24, 2026 20:19

chore(deps): update dependency kubernetes-sigs/gateway-api to v1.5.1

e81a821

chore(deps): update obolup.sh dependency updates

db4f33a

chore(deps): update cloudflare/cloudflared docker tag to v2026.5.0

9a84f30

docs: add obol stack release train playbook

ae25b41

Merge remote-tracking branch 'refs/remotes/pr/544' into smoke/prs-544…

1ad607d

…-550-20260525

Merge remote-tracking branch 'refs/remotes/pr/546' into smoke/prs-544…

75700da

…-550-20260525

Merge remote-tracking branch 'refs/remotes/pr/547' into smoke/prs-544…

c93e2c9

…-550-20260525

Merge remote-tracking branch 'refs/remotes/pr/548' into smoke/prs-544…

b2dcb27

…-550-20260525

Merge remote-tracking branch 'refs/remotes/pr/549' into smoke/prs-544…

65fd3b2

…-550-20260525

Merge remote-tracking branch 'refs/remotes/pr/550' into smoke/prs-544…

51aa86d

…-550-20260525

fix: stabilize post-544 smoke train

69ca96b

bussyjd force-pushed the fix/pr544-plus-smoke-regressions branch from 49bf2b7 to 69ca96b Compare May 25, 2026 08:43

bussyjd added 5 commits May 25, 2026 14:41

fix: bump LiteLLM fork image

d86a10e

fix: raise LiteLLM memory limit

a3a4073

fix: persist no-thinking option for custom LLM routes

061e356

fix: make tool-call smoke deterministic

28838d0

fix: stabilize OBOL Permit2 smoke

df96d9f

bussyjd marked this pull request as ready for review May 25, 2026 20:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: stabilize post-544 smoke train#551

fix: stabilize post-544 smoke train#551
bussyjd wants to merge 18 commits into
mainfrom
fix/pr544-plus-smoke-regressions

bussyjd commented May 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bussyjd commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

LiteLLM Fork Bump

Supersedes

Failure Map

LiteLLM Hot-Add Root Cause

LiteLLM Model Mutation Safety

Alternatives Considered

LLM Smoke Behavior

Gateway API #544 Story

Notes By Failure

Validation

Not Re-run Here

OBOL Permit2 Follow-Up

Agent Discovery Follow-Up

Latest Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bussyjd commented May 25, 2026 •

edited

Loading