feat add fetch_url_tool so AI chat can read direct URLs #7328

Open
krushnarout wants to merge 5 commits into main from feat/fetch-url-tool

@krushnarout
Member

Summary

  • AI chat previously said it couldn't browse a URL when given one directly — it only searched the web
  • Adds fetch_url_tool that fetches and strips HTML from a given URL, returning up to 8000 chars of readable text
  • Wires the tool into CORE_TOOLS in agentic.py so Claude uses it when a user shares a link
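
The strip-and-truncate step described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's actual parser: the names `TextExtractor` and `html_to_text` are hypothetical, and the only detail taken from the summary is the 8000-character limit.

```python
from html.parser import HTMLParser

MAX_CHARS = 8000  # limit stated in the PR summary


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.parts.append(text)


def html_to_text(html: str, limit: int = MAX_CHARS) -> str:
    """Return the readable text of an HTML document, cut to `limit` chars."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)[:limit]
```

For example, `html_to_text("<h1>Hi</h1><script>x=1</script><p>there</p>")` yields the heading and paragraph text joined by a space, with the script body dropped.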

Test plan

  • Send a message in mobile AI chat with a direct URL (e.g. a news article) and verify the content is read and summarized
  • Verify web_search still works for non-URL queries
  • Verify error path: invalid URL, non-HTML content type, non-200 status

🤖 Generated with Claude Code

krushnarout and others added 3 commits May 16, 2026 12:19
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps Bot commented May 16, 2026

Greptile Summary

This PR adds fetch_url_tool, a new LangChain tool wired into CORE_TOOLS that lets Claude fetch and summarise the content of a user-supplied URL using a lightweight custom HTML-to-text parser.

  • web_tools.py implements the new tool: scheme validation, an httpx GET via the shared webhook client, content-type gating, HTML stripping, and 8 000-character truncation.
  • agentic.py adds fetch_url_tool to CORE_TOOLS and the display-name map; tools/__init__.py exports it — both are mechanical, low-risk changes.
  • The new tool has a critical SSRF gap (no private/metadata IP filtering), loads the full response body into memory before truncation, and reuses the webhook connection pool without isolation.
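
As an illustration of the scheme validation and content-type gating listed above, the two checks could look roughly like this. The function names and the `TEXT_TYPES` allow-list are assumptions made for the sketch; the PR's real accepted types are not shown on this page.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
TEXT_TYPES = {"text/html", "text/plain"}  # assumed allow-list, not from the PR


def has_allowed_scheme(url: str) -> bool:
    """True only for http(s) URLs that also carry a hostname."""
    try:
        parsed = urlparse(url)
    except ValueError:  # e.g. malformed IPv6 literal
        return False
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.hostname)


def is_readable_content_type(header_value: str) -> bool:
    """Compare the media type only, ignoring parameters such as charset."""
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in TEXT_TYPES
```

Matching on the media type rather than the raw header keeps values such as `text/html; charset=utf-8` from being rejected.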

Confidence Score: 2/5

Not safe to merge as-is — any authenticated user can point the tool at cloud metadata services or internal network addresses and read the response directly in their AI chat.

The new tool makes outbound HTTP calls to arbitrary user-controlled URLs without filtering link-local or private-network destinations. On a cloud-hosted backend this directly exposes IAM credential endpoints and internal services to any user who can send a message to the AI chat.
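
One way to close this gap is to resolve the hostname and reject any address that is not publicly routable before the request is sent. The sketch below uses only the standard library; `is_public_ip` and `host_resolves_to_public_ips` are hypothetical names, and this is not the fix adopted in the PR.

```python
import ipaddress
import socket


def is_public_ip(ip_text: str) -> bool:
    """False for loopback, private, link-local and other non-public ranges."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # unparseable input is treated as unsafe
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before checking.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def host_resolves_to_public_ips(hostname: str) -> bool:
    """Resolve `hostname` and require every returned address to be public."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False
    return bool(infos) and all(is_public_ip(info[4][0]) for info in infos)
```

Because the sequence diagram shows `follow_redirects=True`, a check like this would presumably need to run for every redirect hop as well, and a resolve-then-connect check remains exposed to DNS rebinding unless the connection is pinned to the validated address.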

backend/utils/retrieval/tools/web_tools.py needs a hostname/IP block-list and a response-size guard before this ships.
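
A response-size guard can be as simple as reading the body in chunks and stopping at a byte cap. The helper below is a sketch under assumptions: `read_capped` and `MAX_BODY_BYTES` are hypothetical, and the cap value is arbitrary.

```python
from typing import Iterable

MAX_BODY_BYTES = 1_000_000  # hypothetical cap; tune to the deployment


def read_capped(chunks: Iterable[bytes], max_bytes: int = MAX_BODY_BYTES) -> bytes:
    """Accumulate chunks until `max_bytes` is reached, then stop reading."""
    if max_bytes <= 0:
        return b""
    buf = bytearray()
    for chunk in chunks:
        remaining = max_bytes - len(buf)
        buf.extend(chunk[:remaining])
        if len(buf) >= max_bytes:
            break  # do not pull further chunks from the source
    return bytes(buf)
```

With httpx, the chunk source could be `response.iter_bytes()` (or `aiter_bytes()` on an async client) inside a streaming request, so that the full body is never held in memory; how that fits the shared webhook client is not shown here.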

Security Review

  • SSRF (Server-Side Request Forgery), web_tools.py: fetch_url_tool accepts any http:// or https:// URL from the user and makes a server-side network request with no IP/hostname filtering. Cloud metadata endpoints (169.254.169.254, metadata.google.internal), localhost-bound services, and RFC-1918 internal addresses are all reachable. Successful exploitation would expose IAM credentials or internal service data to the calling user via the tool's return value.
  • Sensitive data in logs, web_tools.py lines 74, 89, 104: User-supplied URLs are logged verbatim without passing through sanitize(), potentially leaking API keys or OAuth tokens embedded in query parameters.
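
For the logging finding, one option is to log only the scheme, host, and path of a URL. The sketch below is not the project's `sanitize()` helper, whose behavior is not shown on this page; `redact_url_for_log` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url_for_log(url: str) -> str:
    """Drop userinfo, query string and fragment before a URL is logged."""
    try:
        parts = urlsplit(url)
        host = parts.hostname or ""
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return "<unparseable url>"
    if ":" in host:  # restore brackets around IPv6 literals
        host = f"[{host}]"
    if port is not None:
        host = f"{host}:{port}"
    return urlunsplit((parts.scheme, host, parts.path, "", ""))
```

This keeps enough context to debug a failed fetch while leaving tokens in query parameters out of the logs.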

Important Files Changed

| Filename | Overview |
|---|---|
| backend/utils/retrieval/tools/web_tools.py | New tool that fetches arbitrary user-supplied URLs with no SSRF protection, no response-size cap before decode, and shared use of the webhook connection pool. |
| backend/utils/retrieval/agentic.py | Adds fetch_url_tool to CORE_TOOLS and the display-name map; change is mechanical and correct. |
| backend/utils/retrieval/tools/__init__.py | Exports fetch_url_tool from the new web_tools module; straightforward import/export addition. |

Sequence Diagram

sequenceDiagram
    participant User as User (Mobile Chat)
    participant Claude as Claude LLM
    participant Tool as fetch_url_tool
    participant Client as get_webhook_client()
    participant Ext as External URL

    User->>Claude: "Summarize https://example.com/article"
    Claude->>Tool: fetch_url_tool(url)
    Tool->>Tool: Validate scheme (http/https only)
    Tool->>Client: "GET url (timeout=15s, follow_redirects=True)"
    Client->>Ext: HTTP GET request
    Ext-->>Client: HTTP response (status, content-type, body)
    Client-->>Tool: response object
    Tool->>Tool: Check status code
    Tool->>Tool: Check content-type
    Tool->>Tool: _html_to_text(response.text) — full body in memory
    Tool->>Tool: Truncate to 8000 chars
    Tool-->>Claude: Content from url
    Claude-->>User: Summary of page content

Reviews (1): Last reviewed commit: "feat wire fetch_url_tool into CORE_TOOLS..."

Comment thread backend/utils/retrieval/tools/web_tools.py Outdated
krushnarout and others added 2 commits May 16, 2026 20:41
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…x content-type check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>