Improve gateway restart reliability, add diagnostic endpoint, clean stale providers #261
Open
dalexeenko wants to merge 1 commit into cloudflare:main from
…tale AI Gateway providers

The gateway restart handler was unreliable: Process.kill() didn't always terminate the openclaw gateway, leaving stale processes and lock files. This overhauls the restart flow and adds a diagnostic endpoint for debugging API connectivity and AI Gateway configuration issues.

Gateway restart improvements:
- Force kill via pkill -9 before falling back to Process API
- Remove lock files (/tmp/openclaw-gateway.lock, gateway.lock)
- Wait for process to fully die before restarting
- Clean stale AI Gateway providers (cf-ai-gw-*, cloudflare-ai-gateway) from openclaw config on restart to prevent config validation failures

Diagnostic endpoint (GET /api/admin/diagnostic):
- Shows Worker env var status (masked) for all AI-related secrets
- Constructs and displays AI Gateway URL (mirrors start-openclaw.sh logic)
- Reads openclaw config from container showing providers and default model
- Checks gateway process status via ps
- Tests direct Anthropic API connectivity from inside container
- Tests AI Gateway URL connectivity when configured

Stale provider cleanup in start-openclaw.sh:
- When CF_AI_GATEWAY_MODEL is not set, remove any cf-ai-gw-* providers restored from R2 backup and reset default model if it referenced one
- Prevents config validation failures from stale R2 backups

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
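The restart sequence described in the commit message can be sketched as an ordered series of steps. This is only an illustration, not the code in src/routes/api.ts: the `GatewayOps` interface and every method name on it are hypothetical, and throwing when the process survives the kill is an assumption (the PR does not say what the handler does in that case).

```typescript
// Hypothetical operations a restart handler would need. None of these names
// come from the PR; real code would back them with the container's process API.
interface GatewayOps {
  forceKill(): Promise<void>;           // e.g. run `pkill -9` inside the container
  killViaProcessApi(): Promise<void>;   // fallback: kill through the Process API
  removeLockFiles(paths: string[]): Promise<void>;
  sleep(ms: number): Promise<void>;
  isRunning(): Promise<boolean>;        // e.g. inspect `ps` output
  cleanStaleProviders(): Promise<void>; // drop stale AI Gateway providers from config
  start(): Promise<void>;
}

async function restartGateway(
  ops: GatewayOps,
  lockFiles: string[],
  waitMs = 3000, // the PR mentions a 3s wait
): Promise<void> {
  // Kill errors are ignored: "no such process" is an acceptable outcome here.
  await ops.forceKill().catch(() => undefined);
  await ops.killViaProcessApi().catch(() => undefined);

  await ops.removeLockFiles(lockFiles);
  await ops.sleep(waitMs);

  // Assumption: refuse to start a second instance if the old one survived.
  if (await ops.isRunning()) {
    throw new Error("gateway still running after kill attempts");
  }

  await ops.cleanStaleProviders();
  await ops.start();
}
```

Passing the operations in as an interface keeps the ordering testable with a fake, without touching a real container.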
Summary
- Gateway restart: combines Process.kill() with pkill -9, lock file cleanup, process death verification, and stale AI Gateway provider removal from the openclaw config before restarting
- New GET /api/admin/diagnostic endpoint: a debugging tool that checks Worker env vars (masked), AI Gateway URL construction, openclaw provider config, and gateway process status, and tests both direct Anthropic API and AI Gateway connectivity from inside the container
- Stale provider cleanup: when CF_AI_GATEWAY_MODEL is not set, start-openclaw.sh now removes any cf-ai-gw-* providers restored from R2 backup and resets the default model if it referenced one, preventing config validation failures

Context
When debugging a non-responsive moltbot, several issues were discovered:
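- Process.kill() did not always terminate the openclaw gateway, leaving stale processes and lock files
- R2 backups can restore stale AI Gateway providers (cf-ai-gw-anthropic, cloudflare-ai-gateway) that fail openclaw's config validation when the corresponding env vars are no longer set

The diagnostic endpoint was instrumental in identifying that the gateway process was running without API keys in its environment (started before secrets were configured).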
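To illustrate the stale-provider problem, here is a minimal sketch of the cleanup as a pure function. The `OpenclawConfig` shape, the "provider/model" form of the default model, and the `fallbackModel` parameter are all assumptions made for this example. The real openclaw schema may differ, and the PR does not state what the default model is reset to.

```typescript
// Hypothetical config shape; the real openclaw config schema may differ.
interface OpenclawConfig {
  providers: Record<string, unknown>;
  defaultModel?: string; // assumed to look like "provider/model"
}

// Provider names the PR treats as stale AI Gateway entries.
const isStaleGatewayProvider = (name: string): boolean =>
  name.startsWith("cf-ai-gw-") || name === "cloudflare-ai-gateway";

function cleanStaleProviders(
  config: OpenclawConfig,
  fallbackModel: string, // assumption: the caller decides what to reset to
): OpenclawConfig {
  // Keep every provider that is not a stale AI Gateway entry.
  const providers = Object.fromEntries(
    Object.entries(config.providers).filter(
      ([name]) => !isStaleGatewayProvider(name),
    ),
  );

  // Reset the default model only if it pointed at a removed provider.
  const defaultProvider = config.defaultModel?.split("/")[0];
  const defaultModel =
    defaultProvider !== undefined && isStaleGatewayProvider(defaultProvider)
      ? fallbackModel
      : config.defaultModel;

  return { ...config, providers, defaultModel };
}
```

Returning a new object instead of mutating the input makes it easy to compare the config before and after cleanup when debugging a validation failure.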
Changes
| File | Change |
| --- | --- |
| src/routes/api.ts | Gateway restart flow: pkill -9 → Process API kill → lock file removal → 3s wait → verify death → clean stale providers from config → start new gateway |
| src/gateway/index.ts | Exports buildEnvVars for use in the diagnostic endpoint |
| start-openclaw.sh | Adds an else branch to the CF_AI_GATEWAY_MODEL check that removes stale cf-ai-gw-* providers and resets the default model |

Test plan
- Call /api/admin/diagnostic and verify all 6 sections return meaningful data
- With CF_AI_GATEWAY_MODEL set: verify the AI Gateway provider appears in config and the gateway test returns 200
- Without CF_AI_GATEWAY_MODEL: verify stale providers are cleaned on container boot

🤖 Generated with Claude Code