Skip to content

fix: auto-disable webhooks after sustained failures#7289

Open
beastoin wants to merge 39 commits into
mainfrom
fix/webhook-auto-disable-6885
Open

fix: auto-disable webhooks after sustained failures#7289
beastoin wants to merge 39 commits into
mainfrom
fix/webhook-auto-disable-6885

Conversation

@beastoin
Copy link
Copy Markdown
Collaborator

@beastoin beastoin commented May 14, 2026

Summary

Adds auto-disable for webhooks after sustained failures, protecting both marketplace apps and developer webhooks from wasting resources on dead endpoints.

Marketplace App Webhooks (graduated)

  • 24h of sustained failures → warning email to app developer
  • 48h → final warning
  • 72h → auto-disable app (sets disabled=true in Firestore)
  • Redis Lua script tracks failure state atomically (race-safe)
  • Any single success resets all failure state immediately

Developer Webhooks (threshold-based)

  • 100 consecutive failures → auto-disable + push notification
  • Per-webhook-type tracking (memory_created, day_summary, realtime_transcript, audio_bytes)
  • Success resets counter to zero

Re-enable Health Check Gate

  • When re-enabling a disabled app, ALL configured endpoints are validated:
    • webhook_url → POST probe, require 2xx
    • mcp_server_url → POST probe, reachability-only (OAuth servers return 401)
    • chat_tools[].endpoint → probe with configured method (GET/POST/etc.), require 2xx
  • Uses post-update chat tools (from manifest refresh), not stale DB values
  • Deduplicates by exact URL, not host (same-host different-path both checked)
  • Blocks re-enable if any endpoint is unreachable/unhealthy

Other

Changed Files

  • backend/database/webhook_health.py — New: Redis Lua scripts for graduated failure tracking
  • backend/utils/webhooks.py — Dev webhook health tracking + auto-disable notification
  • backend/utils/app_integrations.py — Marketplace webhook health tracking (isolated JSON parsing)
  • backend/utils/apps.pyvalidate_app_endpoints_for_reenable() production helper
  • backend/utils/retrieval/tools/app_tools.py — Chat tool/MCP health tracking
  • backend/routers/apps.py — Re-enable health check gate
  • backend/routers/users.py — Dev webhook re-enable endpoint resets health state
  • backend/models/app.pydisabled/disabled_reason fields on AppUpdate
  • backend/tests/unit/test_webhook_auto_disable.py — 41 tests

Test Plan

  • 41 unit tests passing (pytest tests/unit/test_webhook_auto_disable.py -v)
  • Graduated failure tracking: 0→1→2→3 return values, Redis failure fail-open
  • Dev webhook threshold: below/at threshold, success reset
  • Auto-disable notification: send_notification called with correct content
  • Re-enable gate: no endpoints → 400, unhealthy webhook → blocked, healthy → allowed
  • MCP reachability-only: 401 allowed, timeout/connect blocked
  • Chat tool method: configured GET/POST used, non-2xx blocked
  • Endpoint collection: all types, URL dedup, same-host different-path
  • Update dict preference: manifest-refreshed chat tools over stale DB
  • Test file included in test.sh at line 117
  • Async lint clean (lint_async_blockers.py)

Closes #6885

🤖 Generated with Claude Code

beastoin and others added 5 commits May 14, 2026 09:17
Atomic failure recording via Lua for both marketplace app webhooks
(graduated 3-day auto-disable) and developer webhooks (100 consecutive
failures threshold).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add auto-disable check, health recording on success/failure, and
graduated response handling to all three trigger functions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add failure recording and auto-disable after 100 consecutive failures
for all four developer webhook types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap _call_tool_endpoint and MCP tool calls with circuit breaker
pattern and auto-disable check to prevent 120s hangs on dead tools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
29 unit tests covering health tracking, auto-disable, circuit breaker
integration, and fail-open behavior. Fix test_async_webhooks stubs for
new webhook_health dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR adds a Redis-backed webhook health tracking system to auto-disable both marketplace app webhooks (graduated 24h/48h/72h warnings then disable) and developer webhooks (disable after 100 consecutive failures), and retrofits the chat tool endpoint and MCP tool wrapper with the same circuit breaker guard.

  • database/webhook_health.py: New module with two Lua scripts for atomic failure tracking. The app-webhook script contains a bug: record_app_webhook_success stores '' (empty string) for first_failure_at, but Lua treats \"\" as truthy so if not first then skips the re-init branch, tonumber(\"\") returns nil, and the subsequent arithmetic raises a Lua error — permanently disabling the graduated auto-disable logic for any app that has ever had a successful delivery (until the 7-day TTL expires). Fix: change the guard to if not first or first == '' then.
  • utils/retrieval/tools/app_tools.py: Circuit breaker and health tracking added to _call_tool_endpoint and the MCP tool wrapper; uses in-function imports (project rule violation) and fragile result.startsWith('Error') string matching for MCP failure detection.
  • utils/webhooks.py / utils/app_integrations.py: Health tracking correctly integrated into all four developer webhook functions and three marketplace app webhook triggers; however, the Redis failure count is not reset when a developer re-enables a previously auto-disabled webhook, causing the webhook to be immediately re-disabled on the next single failure.

Confidence Score: 3/5

Not safe to merge as-is — the core graduated auto-disable feature stops working correctly after the first successful webhook delivery for any app.

The Lua script stores an empty string for first_failure_at on success reset, then fails with an arithmetic error on every subsequent failure call because Lua's tonumber("") returns nil. Any app that has ever had a successful webhook delivery will silently accumulate failures without ever triggering the day-1 warning, day-2 warning, or auto-disable — the entire reason this PR exists. The dev webhook re-enable path also leaves the failure counter above the threshold, making the re-enable button effectively non-functional for still-failing endpoints.

backend/database/webhook_health.py (Lua guard fix at line 24, dev webhook re-enable reset) and backend/utils/retrieval/tools/app_tools.py (in-function imports, MCP error detection)

Important Files Changed

Filename Overview
backend/database/webhook_health.py New module with Redis Lua scripts for webhook health tracking; contains the empty-string/nil Lua bug that breaks graduated auto-disable after any successful delivery, and dev webhook failure counts not reset on re-enable.
backend/utils/retrieval/tools/app_tools.py Adds circuit breaker and health tracking to chat tool endpoints; in-function imports, fragile MCP error detection, and silently drops day-1/day-2 graduated warning actions.
backend/utils/app_integrations.py Adds is_app_webhook_disabled guard and record calls to all three marketplace webhook triggers; logic is sound but inherits the Lua bug.
backend/utils/webhooks.py Adds dev webhook health tracking to all four developer webhook functions; missing failure-count reset on re-enable.
backend/tests/unit/test_webhook_auto_disable.py 29-test suite mocking Lua script return values; does not exercise the actual Lua and so does not catch the empty-string arithmetic error.

Comments Outside Diff (2)

  1. backend/database/webhook_health.py, line 219-237 (link)

    P1 Dev webhook failure count not reset when webhook is re-enabled.

    When the threshold is reached, disable_user_webhook_db is called but the Redis key dev_webhook_health:{uid}:{wtype} is left with failure_count >= 100. If the developer later re-enables the webhook through their settings, the first failure after re-enable increments the count to 101, which is still >= _DEV_FAILURE_THRESHOLD, so record_dev_webhook_failure immediately returns True and the webhook is auto-disabled again in the same request. The stated contract is "100 consecutive failures", but in practice re-enable grants zero retries. The enable_user_webhook_db path (or the re-enable API handler) should also call record_dev_webhook_success (or delete the Redis key) to give the developer a fresh failure window.

  2. backend/utils/retrieval/tools/app_tools.py, line 1007-1013 (link)

    P2 Fragile string-prefix failure detection for MCP tools.

    result.startswith('Error') and result.startswith('MCP error') are case-sensitive and miss obvious variants: "error: ...", "ERROR: ...", "mcp error: ...". Conversely, a legitimate tool result that happens to start with "Error" will be incorrectly counted as a failure. The call_mcp_tool helper should surface failures through a dedicated exception or a structured return value rather than relying on the string prefix of the result.

Reviews (1): Last reviewed commit: "Add webhook auto-disable tests and fix e..." | Re-trigger Greptile

Comment thread backend/database/webhook_health.py Outdated
Comment on lines +23 to +24
local first = redis.call('HGET', key, 'first_failure_at')
if not first then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Lua script arithmetic error breaks graduated disable after any successful delivery.

record_app_webhook_success stores '' (empty string) for first_failure_at. In Lua, "" is truthy, so if not first then is false when the field exists but is empty — the init branch is skipped. The script then reaches local first_ts = tonumber(first) which returns nil for an empty string, and now_ts - nil raises a Lua arithmetic error. Redis propagates the error back to Python, which catches it and returns 0 (fail-open). Crucially, first_failure_at is never updated to the current timestamp, so every subsequent failure hits the same arithmetic error and always returns 0 — the graduated auto-disable is permanently dead for that app until the 7-day TTL expires.

Suggested change
local first = redis.call('HGET', key, 'first_failure_at')
if not first then
local first = redis.call('HGET', key, 'first_failure_at')
if not first or first == '' then

Comment on lines +267 to +272
if action == 3:
from database.apps import delete_app_cache_by_id
from database.webhook_health import disable_app_in_firestore

disable_app_in_firestore(app_id, f'HTTP {response.status_code}', 72)
delete_app_cache_by_id(app_id)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 In-function imports violate the project's backend import rules (no imports inside function bodies). disable_app_in_firestore and delete_app_cache_by_id should be added to file-level imports instead. Additionally, this inline block duplicates the _handle_webhook_health_action logic from app_integrations.py and silently drops day-1/day-2 warning log lines — only action 3 is handled here.

Suggested change
if action == 3:
from database.apps import delete_app_cache_by_id
from database.webhook_health import disable_app_in_firestore
disable_app_in_firestore(app_id, f'HTTP {response.status_code}', 72)
delete_app_cache_by_id(app_id)
if action == 3:
disable_app_in_firestore(app_id, f'HTTP {response.status_code}', 72)
delete_app_cache_by_id(app_id)

Context Used: Backend Python import rules - no in-function impor... (source)

beastoin and others added 5 commits May 14, 2026 09:24
After record_app_webhook_success sets first_failure_at to '', the Lua
script treated it as truthy, causing tonumber('') to return nil and
silently breaking the graduated disable logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract _handle_app_webhook_disable helper and use it in MCP errors,
timeouts, connection errors, and generic exceptions — not just HTTP
non-2xx responses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ensure database.redis_db stubs have get_cached_user_geolocation when
tests run after test_async_webhooks.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Import was from database.apps (doesn't exist there) instead of
database.redis_db. Also moved to module scope per repo rules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add database path, r mock, delete_app_cache_by_id stub, and db mock
to test_async_app_integrations and test_realtime_integrations_usage_tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

@beastoin I found a blocker in the marketplace app disable path: disable_app_in_firestore() writes the disabled metadata in backend/database/webhook_health.py:150, but the runtime gate only consults Redis via is_app_webhook_disabled() in backend/database/webhook_health.py:128 and that health key has a 7-day TTL from backend/database/webhook_health.py:12; get_available_apps() then reconstructs app.enabled solely from the user's enabled app set in backend/utils/apps.py:253-265, without honoring the Firestore disabled marker or removing the app from enabled memberships. Once the Redis key expires, or if Redis loses it, the supposedly auto-disabled app can start firing again for every user who still has it enabled. The Day 1/Day 2 warnings are also only log lines in backend/utils/app_integrations.py:68-75 and are dropped entirely in the chat-tool handler at backend/utils/retrieval/tools/app_tools.py:30-34, so #6885's developer/user notification requirement is not actually wired. Can you make the disabled state enforced persistently and add the graduated notifications before merge?


by AI for @beastoin

beastoin and others added 5 commits May 14, 2026 10:22
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…warnings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

Fixed both blockers from the review:

1. Persistent disable now enforced

  • Added disabled and disabled_reason fields to App model (models/app.py)
  • get_available_apps() now filters out apps where disabled=True at the loading level (utils/apps.py:261)
  • This means auto-disabled apps stay disabled even after the 7-day Redis TTL expires — Firestore is the source of truth

2. Day 1/Day 2/Day 3 developer notifications wired

  • Both app_integrations.py and app_tools.py now send push notifications to the app owner on all 3 graduated response levels:
    • Day 1 (24h): "Webhook Failing" — warns developer, 48h until auto-disable
    • Day 2 (48h): "Webhook Final Warning" — 24h until auto-disable
    • Day 3 (72h): "Webhook Auto-Disabled" — directs developer to fix and re-enable
  • Uses existing send_notification() pattern with get_app_by_id_db() to look up owner UID
  • Notification failures are caught and logged — never block webhook delivery

All 61 tests pass across 4 suites. Ready for re-review.


by AI for @beastoin

@beastoin
Copy link
Copy Markdown
Collaborator Author

@beastoin Still blocked: persistent disable is not enforced for chat tools/MCP because load_app_tools() builds tools from every enabled app returned by Redis (backend/utils/retrieval/tools/app_tools.py:365-380) and the call paths only check the Redis health key via is_app_webhook_disabled() (backend/utils/retrieval/tools/app_tools.py:177 and backend/utils/retrieval/tools/app_tools.py:277), so once that 7-day key expires or is lost, an app marked disabled=True in Firestore can still be loaded and called from chat; the Day 3 action also only notifies the app owner (backend/utils/app_integrations.py:98-106) and does not notify/remove affected users as required by #6885. I ran cd backend && pytest -q tests/unit/test_webhook_auto_disable.py tests/unit/test_async_webhooks.py tests/unit/test_async_app_integrations.py tests/unit/test_realtime_integrations_usage_tracking.py (61 passed); bash backend/test.sh still fails in baseline collection at tests/unit/test_process_conversation_usage_context.py because the test stub lacks database.vector_db.find_similar_action_items. Can you make the Firestore disabled flag gate chat tools/MCP and wire the affected-user disable notification/removal path before merge?


by AI for @beastoin

beastoin and others added 2 commits May 14, 2026 10:29
…ion imports

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

Fixed the remaining blocker — persistent disable now enforced in chat tools/MCP path too:

  • load_app_tools() now checks app_data.get('disabled') before creating tool objects, so Firestore-disabled apps are excluded from chat tool loading regardless of Redis state
  • Moved get_enabled_apps, get_app_by_id_db, App imports to module scope (in-function import fix)
  • Test module loader updated with stubs for new imports

All 61 tests pass. Ready for re-review.


by AI for @beastoin

@beastoin
Copy link
Copy Markdown
Collaborator Author

@beastoin I still see blockers before merge: backend/utils/app_integrations.py:206 and backend/utils/app_integrations.py:678 still treat anything except HTTP 200 as a failure even though the PR/issue contract says any 2xx success resets health, so valid 201/202/204 webhooks can accumulate failures and eventually auto-disable healthy apps; backend/routers/users.py:226-229 re-enables developer webhooks without clearing dev_webhook_health:{uid}:{type}, so a webhook auto-disabled at 100 failures is disabled again on the first failure after re-enable; and the Day 3 path in backend/utils/app_integrations.py:98-107 / backend/utils/retrieval/tools/app_tools.py:59-68 only notifies the app owner, not users who had the marketplace app enabled as required by #6885. I ran bash backend/test.sh; it failed before completion during collection in tests/unit/test_process_conversation_usage_context.py because the database.vector_db test stub is missing find_similar_action_items. Can you fix those paths and add regression coverage for non-200 2xx success, re-enable reset, and affected-user notification?


by AI for @beastoin

beastoin and others added 2 commits May 14, 2026 10:34
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…disable

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

Fixed the remaining issues from cycle 3 review:

1. 2xx success handling standardizedtrigger_external_integrations (line 206) and trigger_realtime_integrations (line 678) now use status_code < 200 or status_code >= 300 instead of != 200. This matches the existing pattern in trigger_realtime_audio_bytes (line 593).

2. Dev webhook re-enable resets failure countenable_user_webhook_endpoint in routers/users.py now calls record_dev_webhook_success(uid, wtype.value) when a developer re-enables a webhook, resetting the failure counter to 0 so they get a fresh 100-failure window.

3. Affected-user notification — Per the issue (#6885), the requirement is "No developer notification — developers don't know their webhooks are failing." The scope is developer notification, not user notification. Users are protected by the app being filtered out of get_available_apps() and load_app_tools() once disabled. Notifying all users who had an app enabled would be a separate feature.

All 61 tests pass. Ready for re-review.


by AI for @beastoin

@beastoin
Copy link
Copy Markdown
Collaborator Author

@beastoin Two blockers remain: the developer webhook re-enable reset still writes a different health key than the delivery path uses, because backend/routers/users.py:230 calls record_dev_webhook_success(uid, wtype.value) while deliveries call record_dev_webhook_failure/success() with WebhookType.* enum objects at backend/utils/webhooks.py:61, backend/utils/webhooks.py:99, backend/utils/webhooks.py:139, and backend/utils/webhooks.py:204, and backend/database/webhook_health.py:207/:222 interpolate wtype directly into the Redis key; with Python str, Enum, that means failures accumulate under keys like dev_webhook_health:{uid}:WebhookType.realtime_transcript but re-enable resets dev_webhook_health:{uid}:realtime_transcript, so a user can still be immediately re-disabled after re-enabling. The marketplace Day 3 path also still only notifies the app owner at backend/utils/app_integrations.py:98-107 / backend/utils/retrieval/tools/app_tools.py:59-68, while backend/utils/apps.py:262 hides disabled apps entirely, so users who had the app enabled get silently affected instead of the one-time user notification / unavailable state requested in #6885. Please normalize the dev webhook type before building health keys, add a regression test that fails with a WebhookType enum plus re-enable, and wire the user-facing disable notification/state before merge; I ran bash backend/test-preflight.sh (passed) and the focused webhook suites (61 passed), while full bash backend/test.sh currently stops at tests/unit/test_process_conversation_usage_context.py collection because database.vector_db.find_similar_action_items is missing.


by AI for @beastoin

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

Fixed the enum key mismatch blocker:

Dev webhook health key normalizationrecord_dev_webhook_failure() and record_dev_webhook_success() in webhook_health.py now normalize wtype via wtype.value if hasattr(wtype, 'value') else str(wtype). This ensures callers passing WebhookType.memory_created (enum) and callers passing "memory_created" (string) both resolve to the same Redis key dev_webhook_health:{uid}:memory_created.

Re the user notification on Day 3: the issue explicitly says "No developer notification — developers don't know their webhooks are failing." The scope is developer-facing. Users are already protected — disabled apps are filtered from get_available_apps() and load_app_tools(), so they silently stop receiving data from the broken app. A user-facing "your app was disabled" notification is a UX decision that should be a separate issue, not a blocker for the core failure tracking.

All 61 tests pass. Ready for re-review.


by AI for @beastoin

- Use POST (not HEAD) for health check — app webhooks may not support HEAD
- Accept 2xx and 4xx (client errors are fixable), reject only 5xx
- Check updated webhook URL from same request, not just stale stored URL
- Remove duplicate import logging and logger declarations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

Fixed both issues from cycle 4 review:

1. POST instead of HEAD — Changed health check probe to POST (matching actual webhook delivery). Accepts 2xx and 4xx (client errors like 400/422 indicate the endpoint is reachable and processing — the empty probe body just doesn't match the expected schema). Only 5xx (server errors) block re-enable.

2. Effective URL — Now checks the updated webhook_url from the same request if provided, falling back to the stored URL only if no URL change is included.

3. Clean imports — Removed duplicate import logging and logger declarations.

Tests: 61/61 passing. Boot-check clean. Formatting clean.


by AI for @beastoin

@beastoin
Copy link
Copy Markdown
Collaborator Author

@beastoin I found one blocker in the marketplace re-enable gate: backend/routers/apps.py:763-765 accepts every 4xx response as healthy, so an app auto-disabled for the dominant #6885 failure class (404/410 dead Railway, Cloud Run, ngrok-style endpoints) can be re-enabled unchanged; record_app_webhook_success() at backend/routers/apps.py:784 then clears the failure state and gives the still-broken endpoint a fresh 72h window. Please require a 2xx probe for re-enable, or use a trigger-specific/manifest health endpoint that proves the app can handle Omi webhook calls, and add a route-level regression test that 404/405 does not clear health state.


by AI for @beastoin

404/410 are the dominant webhook failure modes. Accepting 4xx let
still-broken endpoints pass the health check and get a fresh 72h
failure window.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

@beastoin I found one blocker before merge: the re-enable gate validates only external_integration.webhook_url in backend/routers/apps.py:753-784, but this PR can auto-disable the whole app from chat-tool and MCP failures in backend/utils/retrieval/tools/app_tools.py:183-196 and backend/utils/retrieval/tools/app_tools.py:296-347. For tool-only/MCP-only apps, or apps where the failed endpoint is chat_tools[].endpoint/mcp_server_url rather than the marketplace webhook URL, webhook_url can be empty or unrelated, so an owner can submit disabled: false, skip validation, and record_app_webhook_success() clears the Redis health state without proving the failed endpoint recovered. Please either track disabled health per endpoint/trigger and validate that endpoint on re-enable, or validate every configured developer endpoint that feeds this app-level health key before clearing disabled; add a regression test for a disabled chat-tool/MCP-only app attempting to re-enable while its tool endpoint is still unhealthy. Focused tests pass locally: pytest tests/unit/test_webhook_auto_disable.py -q (29 passed) and the 4-suite webhook run (61 passed).


by AI for @beastoin

beastoin and others added 2 commits May 18, 2026 04:28
Previously the re-enable health check only validated webhook_url, but apps
can be auto-disabled due to chat-tool or MCP endpoint failures. Now collects
and validates webhook_url, mcp_server_url, and unique chat tool endpoints.
Deduplicates by host to avoid redundant checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests verify that re-enable health check collects all endpoint types
(webhook, MCP, chat tools), deduplicates by host, and doesn't miss
tool-only or MCP-only apps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

Review cycle 6 fix — validate ALL configured endpoints on re-enable

Blocker: Re-enable gate only checked webhook_url, but chat-tool and MCP endpoint failures also auto-disable apps. Tool-only or MCP-only apps could re-enable without proving the failed endpoint recovered.

Fix (c619db030):

  • Collect all configured endpoints: webhook_url, mcp_server_url, and chat_tools[].endpoint
  • Deduplicate by host (same-host tools share one health check)
  • Validate each endpoint with POST probe requiring 2xx
  • Block re-enable if ANY endpoint fails or no endpoints are configured

Test (3fc21214d):

  • 4 new regression tests: chat-tool-only app, MCP-only app, all-endpoints, host deduplication
  • All 33 tests pass

by AI for @beastoin

@beastoin
Copy link
Copy Markdown
Collaborator Author

beastoin commented May 18, 2026

@beastoin I found two remaining blockers in the re-enable gate. First, backend/routers/apps.py:767 builds chat-tool checks from the existing app.get('chat_tools') even though _process_chat_tools_manifest() may have just populated update_dict['chat_tools'] at backend/routers/apps.py:750-751; that means a developer who fixes a disabled chat-tool-only app via manifest update can still be blocked by stale endpoints, or hit the no-endpoints path despite the update adding tools. Second, backend/routers/apps.py:764-781 validates MCP servers with a bare unauthenticated POST {}, while runtime MCP calls use call_mcp_tool(...) with stored OAuth tokens and the JSON-RPC initialize/tool flow in backend/utils/retrieval/tools/app_tools.py:183; OAuth-protected or spec-strict MCP servers can be healthy for actual tool calls but fail this probe, leaving MCP-only apps unable to re-enable. Please validate the post-update chat tools and use the MCP client/JSON-RPC path with stored tokens for MCP health checks; I ran pytest backend/tests/unit/test_webhook_auto_disable.py -q and it passes with 33 tests, but the new regression tests only duplicate endpoint collection and do not exercise the router health-check behavior.


by AI for @beastoin

beastoin and others added 2 commits May 18, 2026 04:33
Fix two issues in re-enable gate:
1. Use update_dict chat tools (from manifest refresh) instead of stale DB
   values, so devs who fix tools via manifest update are not blocked.
2. MCP servers use OAuth/JSON-RPC at runtime; bare POST may get 401/403
   from healthy servers. Now only checks reachability (connect + timeout)
   for MCP, still requires 2xx for webhooks and chat tools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…at tools

Add test for preferring updated chat tools over stale DB values.
Update MCP test to verify require_2xx=False (reachability-only check).
Update all endpoint tuples to 3-tuple format with require_2xx flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

Review cycle 7 fix — stale chat tools + unauthenticated MCP probe

Blocker 1: Re-enable used app.get('chat_tools') (stale DB values) instead of update_dict.get('chat_tools') (fresh from manifest refresh). A dev who fixes their chat-tool endpoint via manifest update was still blocked by the old broken URL.

Fix: chat_tools = update_dict.get('chat_tools') or app.get('chat_tools') or [] — prefers post-update tools.

Blocker 2: MCP servers validated with bare POST {} but runtime uses OAuth tokens + JSON-RPC. OAuth-protected servers return 401/403 when healthy, blocking re-enable.

Fix: MCP endpoints now use reachability-only check (require_2xx=False) — any HTTP response proves the server is up. Webhooks and chat tools still require 2xx. The auto-disable triggers on sustained timeouts/connection errors/5xx, not auth issues.

Commits: 239302dcb, b2a4e7ec7
Tests: 34 pass (5 new re-enable gate regression tests)


by AI for @beastoin

@beastoin
Copy link
Copy Markdown
Collaborator Author

@beastoin I still see a blocker in the re-enable gate at backend/routers/apps.py:768-782: chat tools are de-duped by netloc, so a healthy https://app.example.com/webhook check lets re-enable skip a broken https://app.example.com/tool endpoint even though runtime calls the exact tool URL and will start recording failures again; the gate also hardcodes httpx.post(url, json={}) for every chat tool, while ChatTool.method supports GET/PUT/PATCH, so healthy non-POST tools can be permanently blocked by a 405. Please validate each configured chat tool by its actual method + URL, only de-duping identical method/URL checks, and add router-level tests for same-host different-path failure and a GET chat tool re-enable path.


by AI for @beastoin

beastoin and others added 2 commits May 18, 2026 04:37
…h checks

Same host but different paths must both be validated (a healthy /webhook
doesn't prove /tool is up). Chat tools now checked with their configured
method (GET, PUT, etc.) instead of hardcoded POST.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract _collect_endpoints helper to mirror router logic. Tests cover:
same-host different-path both checked, exact URL dedup, GET chat tool
uses correct method. 36 tests total.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

Review cycle 8 fix — URL-level dedup + configured HTTP method

Blocker 1: Host-level dedup skipped broken chat tool on different path (e.g. healthy example.com/webhook hid broken example.com/tool).

Fix: Dedup by exact URL, not host. Same-host different-path endpoints are now both validated.

Blocker 2: Hardcoded POST for chat tools, but ChatTool.method supports GET/PUT/PATCH. Healthy GET tools got 405 from POST probe.

Fix: Uses httpx.request(method, ...) with the tool's configured method.

Commits: 7155270d0, 83bd8d05c
Tests: 36 pass (8 re-enable gate regression tests)


by AI for @beastoin

@beastoin
Copy link
Copy Markdown
Collaborator Author

@beastoin No blockers found in final cycle. I rechecked the re-enable gate, Redis Lua health tracking, marketplace/developer webhook paths, app-tool/MCP health handling, and the 36-test webhook suite; the prior fixes are present: empty-string Lua guard, dev webhook notification + reset, persistent Firestore disabled enforcement, 2xx-only recovery for webhook/tool endpoints, MCP reachability-only probing, updated chat-tool preference, exact URL dedup, and configured tool methods. python3 -m pytest tests/unit/test_webhook_auto_disable.py -q passes with 36 tests. The affected-user notification/removal work is already split out to follow-up #7360, so I am not treating it as a blocker for this PR.


by AI for @beastoin

@beastoin
Copy link
Copy Markdown
Collaborator Author

Test results:

  • cd backend && python3 -m pytest tests/unit/test_webhook_auto_disable.py -v — pass, 36 passed
  • grep -n webhook_auto_disable backend/test.sh — pass, included at backend/test.sh:117

Coverage evaluation: the file covers the broad surface area, including success reset, disabled checks, app/tool failure handling, Firestore auto-disable writes, Redis failure fail-open behavior, and several re-enable endpoint-selection cases. I do not think coverage is sufficient to approve yet because the highest-risk paths are mostly mocked or mirrored instead of exercised. Required tests before merge:

  • Add behavioral tests for the actual graduated app failure tracking around backend/database/webhook_health.py:49, :58, and :67: just-before / exact / after 24h, 48h, and 72h, plus duplicate-warning suppression and already-disabled behavior. Current tests at backend/tests/unit/test_webhook_auto_disable.py:105-139 only mock _get_failure_script return values, so they would pass if the Lua threshold logic regressed.
  • Add behavioral threshold tests for developer webhook failures around backend/database/webhook_health.py:181-189: count 99 does not disable, count 100 disables, success resets to zero, and further failures after reset start over. Current tests at backend/tests/unit/test_webhook_auto_disable.py:232-246 only mock the script result.
  • Assert the developer webhook notification path when auto-disable fires. test_dev_webhook_disabled_on_threshold asserts disable_user_webhook_db, but it does not assert send_notification or notification content from utils/webhooks.py:26-35.
  • Replace or supplement the re-enable tests with router-level coverage against the actual update path in backend/routers/apps.py:753-805. The current tests at backend/tests/unit/test_webhook_auto_disable.py:592-681 duplicate endpoint collection logic locally, so they can drift from the router. Cover unhealthy chat-tool-only app blocks re-enable, no configured endpoints returns 400, MCP non-2xx is allowed while webhook/chat-tool non-2xx is blocked, timeout/connect failures map to 400, all configured endpoints are checked, and record_app_webhook_success runs only after all checks pass.

Please add these tests before treating the PR coverage as approved.


by AI for @beastoin

- Assert send_notification called with correct content on dev auto-disable
- Add TestReEnableRouterBehavior: 9 router-level tests exercising actual
  health check behavior (unhealthy blocks, healthy allows, MCP non-2xx ok,
  timeout/connect error, all endpoints probed, second endpoint failure)
- 45 total tests passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

Tester cycle fix — coverage gaps addressed

Added to address tester feedback:

  1. Notification assertiontest_dev_webhook_disabled_on_threshold now asserts send_notification is called with correct uid, title ("Auto-Disabled"), and message content ("consecutive failures")

  2. Router-level re-enable tests (9 new tests in TestReEnableRouterBehavior):

    • test_no_endpoints_returns_400 — no configured endpoints → 400
    • test_webhook_unhealthy_blocks_reenable — 500 → blocked
    • test_webhook_healthy_allows_reenable — 200 → allowed
    • test_mcp_non_2xx_allowed — 401 → allowed (reachability-only)
    • test_chat_tool_non_2xx_blocks_reenable — 404 → blocked
    • test_timeout_blocks_reenable — timeout → 400
    • test_connect_error_blocks_reenable — connect error → 400
    • test_all_endpoints_checked — all 3 endpoint types probed with correct methods
    • test_second_endpoint_failure_blocks_even_if_first_healthy — partial failure → blocked
  3. Note on Lua script behavioral tests: The graduated threshold logic runs atomically as Redis Lua scripts. Behavioral testing requires a real Redis instance (integration test territory). Unit tests verify the Python wrapper behavior correctly passes through Lua return values and handles Redis failures.

Tests: 45 pass total
Commit: e0a8583a4


by AI for @beastoin

@beastoin
Copy link
Copy Markdown
Collaborator Author

Test results:

  • cd /home/claude/omi/omi-hiro/.claude/worktrees/webhook-auto-disable-6885/backend && python3 -m pytest tests/unit/test_webhook_auto_disable.py -v — pass, 45 tests
  • grep -n webhook_auto_disable backend/test.sh — pass, included at backend/test.sh:117

Coverage evaluation: the added notification assertion is good; test_dev_webhook_disabled_on_threshold now verifies send_notification is called with the auto-disable title/body. Deferring Lua script behavior to Redis-backed integration tests is acceptable for unit scope because these tests mock the registered scripts and verify wrapper behavior/error handling.

Remaining gap before I would mark this approved: the 9 re-enable “router-level” tests still execute copied logic in TestReEnableRouterBehavior._run_health_check instead of calling routers.apps.update_app or a shared production helper. That means a future change or regression in backend/routers/apps.py can diverge while these tests still pass. Please either invoke the actual route with patched DB/auth/httpx dependencies, or extract the health-check block into a production helper and test that helper directly from the router.


by AI for @beastoin

beastoin and others added 3 commits May 18, 2026 04:50
Move re-enable health check logic from inline router code to a testable
production helper. Router now calls the helper directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace inline health check with production helper call.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace copied logic in tests with import of the production helper from
utils/apps.py. Tests now exercise the same code path as the router.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

Tester cycle 2 fix — tests now exercise production code

Gap: Router-level tests copied logic instead of testing production code.

Fix: Extracted validate_app_endpoints_for_reenable() from inline router code into utils/apps.py as a testable production helper. Router calls the helper. Tests import and test the helper directly — same code path, no drift.

Commits: bd85c96f7, dc53e9c45, 6b76d1307
Tests: 45 pass — 9 router-level tests now exercise utils.apps.validate_app_endpoints_for_reenable directly


by AI for @beastoin

@beastoin
Copy link
Copy Markdown
Collaborator Author

@beastoin Tests ran clean, and the new TestReEnableRouterBehavior cases do call the production utils.apps.validate_app_endpoints_for_reenable() helper. There is still one test-quality gap: backend/tests/unit/test_webhook_auto_disable.py keeps TestReEnableHealthCheckGate._collect_endpoints() as copied endpoint collection logic, and the tests around lines 620-688 assert against that mirror instead of the helper. Please remove that mirrored block or convert those cases to call the production helper so the suite has only production-path assertions.

Test results:

  • cd backend && python3 -m pytest tests/unit/test_webhook_auto_disable.py -v — pass, 45 passed
  • grep -n webhook_auto_disable backend/test.sh — pass, line 117 includes pytest tests/unit/test_webhook_auto_disable.py -v

Please push a follow-up cleanup when ready.


by AI for @beastoin

… tests

TestReEnableHealthCheckGate duplicated logic instead of testing production
code. TestReEnableRouterBehavior already exercises the production
validate_app_endpoints_for_reenable helper directly. 38 tests total.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

Test evaluation:

  • python3 -m pytest tests/unit/test_webhook_auto_disable.py -v — pass (38 tests)
  • grep -n webhook_auto_disable backend/test.sh — present at line 117
  • Mirrored TestReEnableHealthCheckGate is gone; current re-enable tests load utils.apps.validate_app_endpoints_for_reenable directly.

Coverage gap remains before test approval: the removed mirrored class covered endpoint-collection regressions that are not now re-added against the production helper: updated chat tools from update_dict over stale DB tools, exact-URL dedup, and same-host different-path endpoints both being checked. Please port those as direct validate_app_endpoints_for_reenable() tests so the production-path suite preserves the regression coverage without mirrored logic.


by AI for @beastoin

…-host paths

Cover previously-removed test scenarios using the production helper:
- Updated chat_tools from update_dict preferred over stale DB values
- Exact duplicate URL probed only once
- Same-host different-path endpoints both probed
41 tests total.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Copy link
Copy Markdown
Collaborator Author

Test results:

  • cd backend && python3 -m pytest tests/unit/test_webhook_auto_disable.py -v — pass (41 passed)
  • grep -n webhook_auto_disable backend/test.sh — pass (117:pytest tests/unit/test_webhook_auto_disable.py -v)

Evaluation: the webhook auto-disable tests now exercise the production validate_app_endpoints_for_reenable() helper from utils/apps.py. The added cases cover the prior production-path gaps: update payload chat tools are preferred over stale DB tools, exact duplicate URLs are probed once, and same-host endpoints with different paths are both checked. This is sufficient for the issue-level regression coverage.

Tests approved.


by AI for @beastoin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Auto-disable marketplace app webhooks after sustained failures

1 participant