fix: stabilize post-544 smoke train#551
Open
bussyjd wants to merge 18 commits into
Open
Conversation
…e drain signal (re-amend of #535) drain becomes purely additive — active offers serialize identically to pre-drain main. The only new wire field is `drainEndsAt`, set on draining offers only. Consumers detect drain with `if (entry.drainEndsAt) { /* draining */ }`. No schema-breaking change for any consumer that was reading the catalog before drain landed. This re-ships an amendment that was originally pushed as commit dd89750 on `feat/drain-replaces-pause` for PR #535. The amendment didn't survive the bundle PR #536's merge into main, so the controller is shipping the un-amended `Available bool` shape today. - ServiceCatalogEntry: remove `Available bool`; keep `DrainEndsAt string omitempty` - service-catalog.schema.json: drop `available` from `required` + `properties` - buildServiceCatalogJSON: stop setting Available; only set DrainEndsAt on drain - buildSkillCatalogMarkdown: rename `Available` table column to `Status` (active rows show `—`; draining rows show `draining · ends <RFC3339>`). Drop the per-service `- **Available**:` bullet entirely; draining services keep only the `- **Drain ends at**:` bullet. - serviceDefWithDrain: stop setting the (already-additive) `Available *bool` on erc8004.ServiceDef during drain; signal via DrainEndsAt only. - Tests: - TestBuildServiceCatalogJSON_ExcludesNonReady: replace `services[0].Available == true` with raw-JSON map walk asserting `available` and `drainEndsAt` keys are absent on active entries. - TestBuildServiceCatalogJSON_DrainLifecycle: rewrite to raw-map walk; assert active entries have neither `available` nor `drainEndsAt`, mid-drain entries have only `drainEndsAt` (no `available`). - TestBuildRegistration{,Identity}Services_IncludesDrainMetadata: replace `svc.Available == &false` checks with `svc.Available == nil` (DrainEndsAt is now the sole drain marker). - Add TestBuildSkillCatalogMarkdown_DrainAdditiveDetail: asserts no `- **Available**:` bullet appears for any offer, that draining offers keep their `- **Drain ends at**:` bullet, and that the table header uses `Status` not `Available`.
…or settlement log The `obol_x402_verifier_last_payment_success_seconds` gauge already lands on the verifier success branch (see internal/x402/verifier.go:206 and :261, right next to chargedRequests.Inc()). Add the matching recording rule so the frontend My Purchases drawer can render "last settlement: 12s ago" labels without joining against the buyer sidecar or chasing receipts.
49bf2b7 to
69ca96b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This stabilizes the merged post-#544 train (#544, #546, #547, #548, #549, #550) against the smoke failures seen during the ordered test run.
enable_thinking=falsediscovery into agent smoke requests.subPathconfig file with a writable pod-local runtime copy.availablewhile restoring a useful human-readableavailablestatus in/skill.md.LiteLLM Fork Bump
The Obol LiteLLM fork was rebuilt cleanly on top of current upstream instead of merging upstream history into the fork branch. The fork compare now shows only the six Obol-specific commits ahead of upstream and no upstream backlog:
BerriAI/litellm@06f6cfc5aeObolNetwork/litellm@9b3e569670ahead_by=6,behind_by=0ghcr.io/obolnetwork/litellm:sha-9b3e569@sha256:ac453f9cdfa3752efa38998aa5bbf4f9a67e642a68b27a647aaf667c083ddc51The Renovate config now lets the existing generic
ghcr.io/obolnetwork/*extractor track this image and applies the LiteLLM package rule for labels/grouping/digest pinning, avoiding a duplicate LiteLLM-specific regex manager for the same lines.Supersedes
This PR is intentionally based on the ordered post-#544 smoke train and should be reviewed as the collapse/superseding PR for:
availableremovalIt contains those changes plus the follow-up fixes from the smoke review. If this PR lands, the underlying PRs should not be merged separately afterward.
Failure Map
flowchart TD Train["Merged PR train > #544"] --> Smoke["release-smoke run"] Smoke --> F06["flow-06 expected verifier replicas=2"] Smoke --> F04["flow-04 empty agent response"] Smoke --> F13["flow-13 empty agent / no PurchaseRequest"] Smoke --> F14["flow-14 empty agent / no PurchaseRequest"] Smoke --> F11["flow-11 paid inference timeout"] Smoke --> CF["#547 cloudflared tag/digest mismatch"] Smoke --> HotAdd["LiteLLM /model/new falls back"] F06 --> Fix06["Expect 1 local verifier replica"] F04 --> LLM["Endpoint preflight + discovered non-thinking request shape"] F13 --> LLM F14 --> LLM F11 --> Retry["Retry transient timeout once, then fail"] CF --> Digest["Use 2026.5.0 manifest digest"] HotAdd --> RuntimeConfig["Writable emptyDir runtime config"]LiteLLM Hot-Add Root Cause
/model/newwas not unreliable because of a recent #544+ product change. The failure was architectural: LiteLLM was configured to persist to/etc/litellm/config.yaml, but that path was a ConfigMap-backedsubPathmount. ConfigMap volumes are read-only by design, and the container also runs withreadOnlyRootFilesystem: true.The correct Kubernetes primitive is to keep the ConfigMap as the source of truth and give the pod a writable runtime copy:
flowchart LR CM["litellm-config ConfigMap\nsource of truth"] --> Src["read-only mount\n/config-src/config.yaml"] Src --> Init["initContainer\ncopy config.yaml"] Init --> Work["emptyDir\n/config/config.yaml"] Work --> Main["LiteLLM\n/etc/litellm/config.yaml"] Main --> API["/model/new\nupdates router + persists YAML"] API --> WorkThis keeps the restricted security posture intact, avoids PVC drift, and matches the existing controller/CLI flow: patch the Kubernetes ConfigMap first, then call
/model/newfor live routing. Restarts still converge back to the ConfigMap.LiteLLM Model Mutation Safety
The writable runtime copy is safe for Obol-managed model changes because every supported mutation path either patches the ConfigMap and then updates the live router, or patches the ConfigMap and restarts LiteLLM so the initContainer copies the restored source of truth.
obol model setup --provider ollamalitellm-config/model/new; restart fallbackobol model setup --provider anthropic/openailitellm-configand Secretobol model setup customlitellm-config/model/new; restart fallbackobol model preferlitellm-configobol model removelitellm-config/model/delete; restart fallback is intentionally not requiredobol model synclitellm-configlitellm-config/model/new//model/deleteobol stack upHelm restorelitellm-configaround HelmfileThe important invariant is: no Obol CLI path relies on Kubernetes live-updating
/etc/litellm/config.yamlfrom a ConfigMap mount. That was already false with the oldsubPathmount. The live paths rely on/model/new//model/deleteor an explicit rollout, and this PR makes those paths consistent with a writable LiteLLM config file.Additional guards added after review:
TestLiteLLMConfigSemanticEqualIgnoresFormattingprevents formatting-only restore churn from causing unnecessary restarts.TestSyncDefaultsRestartsLiteLLMAfterConfigRestore_SourceGuardpins therestore ConfigMap -> restart LiteLLM -> autoConfigureLLMordering.Alternatives Considered
subPathmountsubPathis read-only and does not live-update;/model/newcannot persist./etc/litellmsubPath, but ConfigMap volumes are still read-only, so/model/newstill fails./etc/litellmemptyDir+ explicit API/restart convergenceLLM Smoke Behavior
The smoke runner now validates an endpoint before spending time on clusters:
sequenceDiagram participant Smoke as release-smoke / flow-01 participant LLM as OpenAI-compatible endpoint participant Agent as Hermes/OpenClaw agent smoke calls Smoke->>LLM: GET /models Smoke->>LLM: POST /chat/completions marker prompt alt final content returned Smoke-->>Agent: run normal request payloads else reasoning only / empty final content Smoke->>LLM: retry with chat_template_kwargs.enable_thinking=false LLM-->>Smoke: final marker content Smoke-->>Agent: export OBOL_LLM_DISABLE_THINKING=true Agent->>LLM: include chat_template_kwargs.enable_thinking=false else still unusable Smoke-->>Smoke: fail fast before cluster flows endThis keeps smoke usable across OpenAI-compatible endpoints while avoiding the previous mismatch where the endpoint could pass a direct manual smoke only when non-thinking was forced, but the agent flows still sent a different payload.
Gateway API #544 Story
#544 bumped
gatewayApiVersionininternal/embed/infrastructure/helmfile.yamlfrom v1.4.1 to v1.5.1. Today that value is not consumed by any rendered template or CRD installer. The Traefik Helm chart is the component that installs and owns the Gateway API CRDs in the local stack.CRDs/resources relevant to this stack:
gatewayclasses.gateway.networking.k8s.io: Traefik's controller class.gateways.gateway.networking.k8s.io: Traefik'straefik-gatewayentry point.httproutes.gateway.networking.k8s.io: eRPC, frontend, skill catalog, services JSON, ServiceOffer routes, and agent identity well-known routes.referencegrants.gateway.networking.k8s.io: allows cross-namespace references needed by controller-created ServiceOffer routing to the shared x402 verifier service.flowchart TD TraefikChart["traefik/traefik chart"] --> CRDs["Gateway API CRDs"] CRDs --> GatewayClass["GatewayClass traefik"] CRDs --> Gateway["Gateway traefik-gateway"] CRDs --> HTTPRoute["HTTPRoute"] CRDs --> ReferenceGrant["ReferenceGrant"] HTTPRoute --> ERPC["/rpc local-only"] HTTPRoute --> FE["frontend local-only"] HTTPRoute --> Catalog["/skill.md and /api/services.json"] HTTPRoute --> Offers["/services/* ServiceOffer routes"] HTTPRoute --> Identity["/.well-known/agent-registration.json"] ReferenceGrant --> X402["x402 verifier cross-namespace auth service"]If we ever make
gatewayApiVersionauthoritative, the next change should either remove the unused value or wire it to a deliberate CRD install/update path using server-side apply and explicit validation. Gateway API v1.5.x upstream also matters because experimental CRDs are large enough to require server-side apply and TLSRoute alpha has deprecation/removal caveats. That is not happening in the current rendered stack.Notes By Failure
{{PLACEHOLDER}}strings. The existing workflow already mirrors stack init by substituting placeholders in a temp chart copy. This PR keeps that simple path and only updates the workflow Helm version to matchobolup.sh.drainEndsAtas the machine signal./skill.mdnow shows active rows asavailablein theStatuscolumn instead of-, without reintroducing anavailableJSON field or- **Available**:detail bullet.PAID_INFERENCE_TRANSIENT_RETRIESand then fail.Validation
bash -n flows/*.shjq empty renovate.jsongo test ./internal/embed ./internal/serviceoffercontroller -count=1go test ./internal/model ./internal/defaults -count=1go test ./... -count=1cloudflare/cloudflared:2026.5.0@sha256:59bab8d3aceec09bf6bdb07d6beca0225ca5cd7ab79436a87ea97978fe1dc4f9docker buildx imagetools inspect cloudflare/cloudflared:2026.5.0 --format '{{ .Manifest.Digest }}'Not Re-run Here
The full cluster release-smoke was not re-run after this patch because no k3d cluster was left running after cleanup. The fixes are packaged so the next release-smoke run should fail fast on bad LLM endpoint shape before cluster setup and should exercise the real paid route without the previous bypass proposal.
OBOL Permit2 Follow-Up
The final flow-13 failure after the initial train fixes was not a facilitator image regression. The agent successfully created a
PurchaseRequestand the sidecar had 5 auths, but paid inference was rejected by the facilitator withPaymentTooEarly().Root cause: Permit2 auths were using buyer host wall-clock time for
validAfter. On Anvil forks, chain time only advances when blocks are mined, sotime.time() - 600can still be in the future relative to the forked chain timestamp after long LLM/cluster setup.Fix: Permit2 presigned auths now use
validAfter = "0", matching the immediate-valid behavior already used for ERC-3009-style smoke auths. The auth lifetime remains bounded bydeadline.sequenceDiagram participant BobHost as Bob host clock participant BuyPy as buy.py Permit2 signer participant Anvil as Anvil fork chain time participant Fac as x402 facilitator BobHost->>BuyPy: wall-clock now BuyPy--x Fac: old validAfter = now - 600 Fac->>Anvil: compare against block.timestamp Anvil-->>Fac: chain time may lag wall-clock Fac--x BuyPy: PaymentTooEarly() BuyPy->>Fac: new validAfter = 0, bounded deadline Fac->>Anvil: valid immediately on chain Fac-->>BuyPy: payment acceptedAgent Discovery Follow-Up
The broad Hermes “discover Alice” prompt in flows 13/14 was informational, burned a 300s LLM turn, ignored curl failures, and then passed regardless. It was replaced with deterministic catalog validation from Bob’s agent pod against
/api/services.json.The structural proof is still the next step: Hermes must invoke
buy.py, create thePurchaseRequest, wait forReady=True, provision sidecar auths, make paid inference return HTTP 200, settle on-chain, and match exact OBOL balance deltas.flowchart LR BobPod[Bob agent pod] --> Catalog[Alice /api/services.json] Catalog --> Assert[Assert service name, endpoint path, OBOL, permit2, Base Sepolia] Assert --> Hermes[Hermes agent buy prompt] Hermes --> BuyPy[buy.py creates PurchaseRequest] BuyPy --> Ready[PurchaseRequest Ready=True] Ready --> Paid[paid/qwen36-apex-i-compact HTTP 200] Paid --> Chain[OBOL settlement + exact balance deltas]Latest Validation
Validated #551 + #552 with SilverMesh and the target x402 facilitator image:
ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay:1.4.10@sha256:1fbd9e6b9863a288aba823e3107b1884746d9fb66e3c7989add4ed437c98a7adflow-13-dual-stack-obol.sh.tmp/flow13-services-json-permit2-retry-20260525-234847METRIC steps_failed=0alice-obol-inferencein/api/services.jsonwithOBOL+permit2PurchaseRequeston attempt 1PurchaseRequest Ready=True1000000000000000wei