
Improve gateway restart reliability, add diagnostic endpoint, clean stale providers #261

Open
dalexeenko wants to merge 1 commit into cloudflare:main from dalexeenko:fix/gateway-restart-and-diagnostics


Summary

  • Overhaul gateway restart handler: replace the unreliable Process.kill() with a pkill -9 force kill, followed by lock file cleanup, verification that the process has died, and removal of stale AI Gateway providers from the openclaw config before restarting
  • Add GET /api/admin/diagnostic endpoint: a debugging tool that reports Worker env vars (masked), AI Gateway URL construction, openclaw provider config, and gateway process status, and tests both direct Anthropic API and AI Gateway connectivity from inside the container
  • Clean stale AI Gateway providers on boot: when CF_AI_GATEWAY_MODEL is not set, start-openclaw.sh now removes any cf-ai-gw-* providers restored from R2 backup and resets the default model if it referenced one, preventing config validation failures

Context

When debugging a non-responsive moltbot, several issues were discovered:

  1. Gateway restart didn't reliably kill the process, leaving stale lock files and zombie processes
  2. R2 backups could restore stale AI Gateway provider configs (e.g. cf-ai-gw-anthropic, cloudflare-ai-gateway) that fail openclaw's config validation when the corresponding env vars are no longer set
  3. There was no way to inspect the runtime state (env vars reaching the container, openclaw config, API connectivity) without SSH access to the sandbox

The diagnostic endpoint was instrumental in identifying that the gateway process was running without API keys in its environment (started before secrets were configured).

Changes

src/routes/api.ts

  • Restart handler: force kill via pkill -9 → Process API kill → lock file removal → 3s wait → verify death → clean stale providers from config → start new gateway
  • Diagnostic endpoint: 6-step check covering Worker env, AI Gateway URL, openclaw config, process status, direct API test, AI Gateway test
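
The restart order listed above can be sketched roughly as follows. This is an illustrative outline only, not the code in this PR: the `RestartDeps` interface and every method name on it are hypothetical stand-ins for the container calls, and aborting when the old process is still alive is an assumption about how a failed death check is handled.

```typescript
// Hypothetical dependency surface; none of these names come from the PR.
interface RestartDeps {
  forceKill(): Promise<void>;           // e.g. run `pkill -9` inside the container
  apiKill(): Promise<void>;             // fallback kill through the process API
  removeLockFiles(): Promise<void>;     // e.g. delete /tmp/openclaw-gateway.lock
  sleep(ms: number): Promise<void>;
  isGatewayRunning(): Promise<boolean>; // e.g. inspect `ps` output
  cleanStaleProviders(): Promise<void>;
  startGateway(): Promise<void>;
}

async function restartGateway(deps: RestartDeps, waitMs = 3000): Promise<string[]> {
  const log: string[] = [];

  // Best-effort step: a failure is recorded in the log instead of aborting.
  const attempt = async (name: string, fn: () => Promise<void>): Promise<void> => {
    try {
      await fn();
      log.push(`${name}: ok`);
    } catch (err) {
      log.push(`${name}: failed (${String(err)})`);
    }
  };

  await attempt("pkill -9", () => deps.forceKill());
  await attempt("process API kill", () => deps.apiKill());
  await attempt("remove lock files", () => deps.removeLockFiles());

  // Give the old process time to exit, then confirm it is gone.
  await deps.sleep(waitMs);
  if (await deps.isGatewayRunning()) {
    log.push("verify: gateway still running, aborting");
    throw new Error(log.join("; "));
  }
  log.push("verify: gateway stopped");

  await attempt("clean stale providers", () => deps.cleanStaleProviders());
  await deps.startGateway();
  log.push("start: ok");
  return log;
}
```

Passing the container operations in as dependencies keeps the ordering testable without a sandbox; the returned log could be included in the handler's JSON response.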

src/gateway/index.ts

  • Export buildEnvVars for use in the diagnostic endpoint

start-openclaw.sh

  • Added else branch to CF_AI_GATEWAY_MODEL check that removes stale cf-ai-gw-* providers and resets default model
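
For readers who want the cleanup rule spelled out, here is a minimal TypeScript sketch of the same logic. The real change lives in a shell script, and the `GatewayConfig` shape, the `provider/model` form of the default model, and the `fallbackModel` parameter are all assumptions made for illustration rather than openclaw's actual schema.

```typescript
// Hypothetical, simplified config shape; openclaw's real schema may differ.
interface GatewayConfig {
  providers: Record<string, unknown>;
  defaultModel?: string; // assumed to look like "provider/model"
}

// A provider is stale if it has the cf-ai-gw- prefix or is explicitly listed.
function isStaleProvider(name: string, extraNames: string[] = []): boolean {
  return name.startsWith("cf-ai-gw-") || extraNames.includes(name);
}

function cleanStaleProviders(
  config: GatewayConfig,
  fallbackModel: string,
  extraNames: string[] = [],
): { config: GatewayConfig; removed: string[] } {
  const removed: string[] = [];
  const providers: Record<string, unknown> = {};

  // Copy over only the providers that are not stale.
  for (const name of Object.keys(config.providers)) {
    if (isStaleProvider(name, extraNames)) {
      removed.push(name);
    } else {
      providers[name] = config.providers[name];
    }
  }

  // Reset the default model only if it pointed at a removed provider.
  const current = config.defaultModel;
  const provider = current === undefined ? undefined : current.split("/")[0];
  const defaultModel =
    provider !== undefined && isStaleProvider(provider, extraNames)
      ? fallbackModel
      : current;

  return { config: { ...config, providers, defaultModel }, removed };
}
```

Passing `["cloudflare-ai-gateway"]` as `extraNames` would also cover the second provider name mentioned in the Context section. The function returns a new object rather than mutating its input, so the caller can log `removed` before writing the config back.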

Test plan

  • Deploy and verify gateway restart works reliably (kills old process, starts new one)
  • Hit /api/admin/diagnostic and verify all 6 sections return meaningful data
  • Test with CF_AI_GATEWAY_MODEL set: verify AI Gateway provider appears in config and gateway test returns 200
  • Test without CF_AI_GATEWAY_MODEL: verify stale providers are cleaned on container boot
  • Verify R2 backup restore + stale provider cleanup works end-to-end

🤖 Generated with Claude Code

…tale AI Gateway providers

The gateway restart handler was unreliable — Process.kill() didn't always
terminate the openclaw gateway, leaving stale processes and lock files. This
overhauls the restart flow and adds a diagnostic endpoint for debugging
API connectivity and AI Gateway configuration issues.

Gateway restart improvements:
- Force kill via pkill -9 before falling back to Process API
- Remove lock files (/tmp/openclaw-gateway.lock, gateway.lock)
- Wait for process to fully die before restarting
- Clean stale AI Gateway providers (cf-ai-gw-*, cloudflare-ai-gateway)
  from openclaw config on restart to prevent config validation failures

Diagnostic endpoint (GET /api/admin/diagnostic):
- Shows Worker env var status (masked) for all AI-related secrets
- Constructs and displays AI Gateway URL (mirrors start-openclaw.sh logic)
- Reads openclaw config from container showing providers and default model
- Checks gateway process status via ps
- Tests direct Anthropic API connectivity from inside container
- Tests AI Gateway URL connectivity when configured
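
One way the masked env var report could work is sketched below. The `maskSecret` and `summarizeEnv` helpers are hypothetical names, the masking format is an assumption rather than the endpoint's actual output, and `ANTHROPIC_API_KEY` in the usage note is only an example variable name.

```typescript
// Show just enough of a secret to tell which one is set, never the whole value.
function maskSecret(value: string | undefined, visible = 4): string {
  if (!value) return "(not set)";
  // Short values are fully hidden so the edges cannot reveal most of the secret.
  if (value.length <= visible * 2) return "*".repeat(value.length);
  return `${value.slice(0, visible)}...${value.slice(-visible)} (${value.length} chars)`;
}

// Build a name -> masked value map for the listed variables only.
function summarizeEnv(
  env: Record<string, string | undefined>,
  names: string[],
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of names) {
    out[name] = maskSecret(env[name]);
  }
  return out;
}
```

For example, `summarizeEnv(env, ["ANTHROPIC_API_KEY", "CF_AI_GATEWAY_MODEL"])` would report which of the two are present without echoing a full key into the response.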

Stale provider cleanup in start-openclaw.sh:
- When CF_AI_GATEWAY_MODEL is not set, remove any cf-ai-gw-* providers
  restored from R2 backup and reset default model if it referenced one
- Prevents config validation failures from stale R2 backups

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>