feat(python): Add CRW web scraper plugin#13709
Conversation
Add `WebScraperPlugin` that integrates CRW (open-source web scraper for AI agents) with Semantic Kernel. CRW exposes a Firecrawl-compatible REST API for scraping, crawling, and site mapping. The plugin provides the following kernel functions:

- `scrape_url`: Scrape a single page and return markdown content
- `crawl_website`: Start an async crawl job across multiple pages
- `check_crawl_status`: Poll crawl job progress and results
- `map_site`: Discover all URLs on a website via links and sitemaps
Automated Code Review
Reviewers: 4 | Confidence: 88%
✗ Correctness
New WebScraperPlugin wrapping a CRW (Firecrawl-compatible) server. The code is internally consistent and tests pass for the mocked scenarios. However, there are two correctness concerns: (1) In both `_post` and `_get`, `response.json()` is called before checking `response.status`, so a non-JSON error response (e.g., a 502 with an HTML body) will raise an unhandled `aiohttp.ContentTypeError` instead of the intended `FunctionExecutionException`; the existing `HttpPlugin` avoids this by using `raise_for_status=True`. (2) The `map_site` method parses links from `result['data']['links']`, but the Firecrawl v1 `/map` endpoint returns links at the top level (`result['links']`), so if CRW truly follows the Firecrawl spec this will always return an empty list.
✗ Security Reliability
New WebScraperPlugin wraps a CRW (Firecrawl-compatible) server with kernel functions for scraping, crawling, and site mapping. The main security/reliability concerns are: (1) no HTTP request timeout on `aiohttp` sessions, allowing indefinite hangs if the CRW server is unresponsive, which is a denial-of-service risk; (2) the `crawl_id` parameter in `check_crawl_status` is interpolated directly into the URL path without sanitization, enabling path traversal or query injection when called by an AI agent with untrusted input; (3) `response.json()` is called before checking the HTTP status code, so non-JSON error responses (e.g., 502 HTML pages) will raise an unhandled `ContentTypeError` instead of the intended `FunctionExecutionException`. These patterns partially mirror existing plugins (`HttpPlugin`, CrewAI client), but the new plugin is a good opportunity to raise the bar.
✗ Test Coverage
The test file covers the happy path for each kernel function, empty-input validation, error responses for POST, and basic configuration. However, there are significant gaps: the `scrape_url` method has 5 distinct return-value branches (markdown, html, plainText, links, fallback `json.dumps`) but only the markdown path is tested; the `crawl_website` method has an untested error branch when the server returns no crawl ID; the `_get` error path (reachable via `check_crawl_status`) has no test; and optional parameters like `css_selector` are never exercised. These gaps leave meaningful branching logic and error handling uncovered.
✗ Design Approach
The plugin introduces a web scraping abstraction over the CRW/Firecrawl-compatible API. The most significant design issue is the two-step crawl pattern: `crawl_website` is described as 'Crawl a website … following links' but silently returns a raw job ID instead of content, leaving an LLM caller with no signal that a second polling call to `check_crawl_status` is required. This misalignment between the `@kernel_function` description and the actual return value is not a style issue; it will reliably break agent usage because the LLM has no way to infer the polling contract from the description alone. There are also fragile assumptions in the HTTP error-handling path (JSON parsing before the status check) and a silent hardcoded truncation of page content.
Flagged Issues
- In `_post` and `_get`, `await response.json()` is called before checking `response.status >= 400`. If the CRW server returns a non-JSON error response (e.g., 502 Bad Gateway with an HTML body), `response.json()` raises an unhandled `aiohttp.ContentTypeError`/`JSONDecodeError` instead of the intended `FunctionExecutionException`, masking the real HTTP failure. The status code should be checked first, then JSON parsing attempted.
- No request timeout is configured on `aiohttp.ClientSession` in `_post` and `_get`; a slow or unresponsive CRW server will cause the calling coroutine to hang indefinitely, which is a denial-of-service risk when the plugin is invoked by an AI agent.
- `crawl_website`'s `@kernel_function` description says 'Crawl a website … following links up to a specified depth', strongly implying it returns crawled content, but it returns a bare job ID string. Any LLM agent using this function will treat the job ID as the result and never call `check_crawl_status`. Either have `crawl_website` poll internally until completion, or rename/redescribe it to make the two-step contract explicit.
- `scrape_url` has 5 return-path branches (markdown → html → plainText → links → `json.dumps` fallback) but only the markdown branch is tested. The other four paths have zero coverage.
- `crawl_website` raises `FunctionExecutionException` when the server response has no 'id' field (lines 165-166), but no test covers this error branch.
Suggestions
- The `map_site` method parses `result.get('data', {}).get('links', [])`, but the Firecrawl v1 `/map` endpoint returns `{"success": true, "links": [...]}` at the top level. If CRW follows the Firecrawl spec, this will always return an empty list. Verify against the actual CRW API response shape.
- Sanitize or validate `crawl_id` in `check_crawl_status` before interpolating it into the URL path. Use `urllib.parse.quote(crawl_id, safe='')` to prevent path traversal (e.g., `../../admin`) or query injection (e.g., `id?admin=true`).
- Consider adding URL scheme validation (require http/https) on user-supplied URLs in `scrape_url`, `crawl_website`, and `map_site`, similar to how `HttpPlugin` offers `allowed_domains` filtering.
- Add a test for `_get` error handling (e.g., `check_crawl_status` with a 404/500 mock response) to mirror `test_scrape_url_error_response`, which only covers `_post` errors.
- Add a test exercising the `css_selector` parameter in `scrape_url` to verify it is included in the request body.
- Consider verifying the request body passed to the mock in happy-path tests (e.g., assert `mock_session.post` was called with the expected URL and JSON payload) rather than only asserting the return value, to catch silently dropped parameters.
- The hardcoded `[:500]` truncation of markdown in `check_crawl_status` (line 196) silently drops content with no signal to the caller. Make this limit a configurable plugin attribute (e.g., `max_markdown_preview: int = 500`) and include a `truncated: True` flag when content is cut.
Automated review by us's agents
```python
        ):
            result = await response.json()
            if response.status >= 400:
                error_msg = result.get("error", f"HTTP {response.status}")
                raise FunctionExecutionException(f"CRW request failed: {error_msg}")
            return result

    async def _get(self, path: str) -> dict[str, Any]:
```
Bug: response.json() is called before checking response.status. If the server returns a non-JSON error (e.g., 502 with HTML), this raises ContentTypeError/JSONDecodeError instead of FunctionExecutionException. Check status first, then attempt JSON parsing.
Suggested change:

```python
        async with (
            aiohttp.ClientSession() as session,
            session.post(url, headers=self._headers(), data=json.dumps(body)) as response,
        ):
            if response.status >= 400:
                try:
                    result = await response.json()
                    error_msg = result.get("error", f"HTTP {response.status}")
                except Exception:
                    error_msg = f"HTTP {response.status}"
                raise FunctionExecutionException(f"CRW request failed: {error_msg}")
            return await response.json()
```
```python
            aiohttp.ClientSession() as session,
            session.get(url, headers=self._headers()) as response,
        ):
            result = await response.json()
            if response.status >= 400:
                error_msg = result.get("error", f"HTTP {response.status}")
                raise FunctionExecutionException(f"CRW request failed: {error_msg}")
            return result
```
Same bug in _get: response.json() before status check. Non-JSON error responses will raise an unhandled exception instead of FunctionExecutionException.
Suggested change:

```python
        async with (
            aiohttp.ClientSession() as session,
            session.get(url, headers=self._headers()) as response,
        ):
            if response.status >= 400:
                try:
                    result = await response.json()
                    error_msg = result.get("error", f"HTTP {response.status}")
                except Exception:
                    error_msg = f"HTTP {response.status}"
                raise FunctionExecutionException(f"CRW request failed: {error_msg}")
            return await response.json()
```
```python
        ):
            result = await response.json()
            if response.status >= 400:
                error_msg = result.get("error", f"HTTP {response.status}")
```
No request timeout is set on the ClientSession. If the CRW server is unreachable or slow, this coroutine hangs indefinitely. Add an aiohttp.ClientTimeout to bound request duration. The same applies to _get.
Suggested change:

```python
    async def _post(self, path: str, body: dict[str, Any]) -> dict[str, Any]:
        """Send a POST request to the CRW server and return the JSON response."""
        url = f"{self.base_url.rstrip('/')}{path}"
        timeout = aiohttp.ClientTimeout(total=30)
        async with (
            aiohttp.ClientSession(timeout=timeout) as session,
```
```python
        if css_selector:
            body["cssSelector"] = css_selector
```
`crawl_website`'s `@kernel_function` description says 'Crawl a website starting from a URL, following links up to a specified depth' but the function returns a job ID string, not crawled content. An LLM invoking this function has no way to know a second polling call is needed. Update the description to reflect the actual return value, or have the function poll until completion.
Suggested change:

```python
        description="Start an async crawl job and return a job ID. Call check_crawl_status with the returned ID to retrieve results.",
```
```python
        return crawl_id

    @kernel_function(
```
crawl_id is user-supplied and interpolated directly into the URL path without sanitization. A malicious value like ../../admin or id?x=y could alter the request target. Use urllib.parse.quote(crawl_id, safe='') to encode path-unsafe characters.
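For illustration, a minimal sketch of that sanitization as a standalone helper. The endpoint path (`/v1/crawl/{id}`) and base URL are assumptions based on this thread, not taken from the plugin source:

```python
from urllib.parse import quote

def build_status_url(base_url: str, crawl_id: str) -> str:
    """Build the crawl-status URL with a percent-encoded job ID.

    safe="" also encodes "/" and "?", so a hostile crawl_id stays one opaque
    path segment instead of rewriting the request target.
    """
    return f"{base_url.rstrip('/')}/v1/crawl/{quote(crawl_id, safe='')}"

print(build_status_url("http://localhost:3002", "abc-123"))
# → http://localhost:3002/v1/crawl/abc-123

# a traversal attempt is neutralized into a single encoded segment
print(build_status_url("http://localhost:3002", "../../admin"))
```

A benign ID round-trips unchanged, while `../` and `?` sequences are encoded to `%2F`/`%3F` and can no longer change the path or query.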
```python
        }

        result = await self._post("/v1/map", body)
        links = result.get("data", {}).get("links", [])
```
Likely bug: Firecrawl v1 /map returns {"success": true, "links": [...]} — links are at the top level, not nested under data. If CRW follows the Firecrawl spec, this will always return []. Should this be result.get("links", [])?
Suggested change:

```diff
-        links = result.get("data", {}).get("links", [])
+        links = result.get("links", [])
```
```python
    assert result == "# Example\n\nHello world"


async def test_scrape_url_empty_url():
```
This test only covers the markdown return path of scrape_url. The method has 4 other branches (html, plainText, links, json.dumps fallback) that are all untested. Add tests where the mock response data omits 'markdown' but includes 'html', 'plainText', or 'links' to cover each branch.
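To see what full branch coverage needs, the fallback order described above can be modeled as a small standalone helper (a sketch; the field names and branch order are assumed from this review thread, not copied from the plugin):

```python
import json

def select_content(data: dict) -> str:
    # Branch order assumed from the review:
    # markdown → html → plainText → links → json.dumps fallback.
    for key in ("markdown", "html", "plainText"):
        if data.get(key):
            return data[key]
    if data.get("links"):
        return "\n".join(data["links"])
    return json.dumps(data)

# one input per branch — a test suite should exercise all five
cases = [
    {"markdown": "# md"},
    {"html": "<p>hi</p>"},
    {"plainText": "hi"},
    {"links": ["https://a", "https://b"]},
    {"other": 1},
]
for case in cases:
    print(select_content(case))
```

A parametrized test over inputs like `cases` would cover every branch with one mock-response test per entry.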
```python
    mock_session = MagicMock()
    mock_session.post = MagicMock(return_value=AsyncMock(__aenter__=AsyncMock(return_value=mock_resp)))
    mock_session.close = AsyncMock()
    mock_session_cls.return_value.__aenter__ = AsyncMock(return_value=mock_session)
```
No test for the case where crawl_website succeeds (HTTP 200) but the response JSON has no 'id' field. The plugin raises FunctionExecutionException at line 165-166 for this case. Add a test with a mock response like {"success": True} (no 'id' key) and assert the exception is raised.
Suggested change:

```python
async def test_crawl_website_missing_id():
    response_data = {"success": True}
    mock_resp = _mock_response(response_data)
    mock_session = MagicMock()
    mock_session.post = MagicMock(return_value=AsyncMock(__aenter__=AsyncMock(return_value=mock_resp)))
    mock_session.close = AsyncMock()
    with patch("aiohttp.ClientSession") as mock_session_cls:
        mock_session_cls.return_value.__aenter__ = AsyncMock(return_value=mock_session)
        mock_session_cls.return_value.__aexit__ = AsyncMock(return_value=False)
        plugin = WebScraperPlugin()
        with pytest.raises(FunctionExecutionException, match="did not return a crawl job ID"):
            await plugin.crawl_website("https://example.com")
```
```python
        "success": True,
        "data": {"links": ["https://example.com", "https://example.com/about"]},
    }
    mock_resp = _mock_response(response_data)
```
This tests the _get happy path but there's no corresponding error test for _get (HTTP >= 400). Add a test similar to test_scrape_url_error_response but using check_crawl_status with a mock GET returning status 404 or 500.
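The shape of that error-path test can be sketched without aiohttp at all, using a fake response object against status-first parsing (`FakeResponse` and `parse_response` are illustrative stand-ins, not the plugin's actual classes):

```python
import asyncio

class FakeResponse:
    """Minimal stand-in for an aiohttp response object."""

    def __init__(self, status: int, json_data=None):
        self.status = status
        self._json = json_data

    async def json(self):
        if self._json is None:
            # stand-in for aiohttp.ContentTypeError on an HTML error body
            raise ValueError("response body is not JSON")
        return self._json

async def parse_response(response) -> dict:
    # Status-first parsing, as the review recommends for _post/_get.
    if response.status >= 400:
        try:
            result = await response.json()
            error_msg = result.get("error", f"HTTP {response.status}")
        except Exception:
            error_msg = f"HTTP {response.status}"
        raise RuntimeError(f"CRW request failed: {error_msg}")
    return await response.json()

# a 502 with a non-JSON body now surfaces as a clean, catchable error
try:
    asyncio.run(parse_response(FakeResponse(502)))
except RuntimeError as exc:
    print(exc)  # → CRW request failed: HTTP 502
```

The real test would do the same with a mocked GET returning status 404/500 and assert `FunctionExecutionException` is raised.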
```python
    plugin = WebScraperPlugin()
    result = await plugin.scrape_url("https://example.com", formats="markdown, links")
    assert result == "# Example"
```
Good test for the formats parameter, but the css_selector parameter (plugin line 113) is never tested. Add a test that passes css_selector and verifies it's included in the request body.
Pull request overview
Adds a new Python core plugin that wraps the CRW (Firecrawl-compatible) REST API so Semantic Kernel agents can scrape pages, crawl sites, check crawl status, and map links, with unit tests and public export via `core_plugins.__init__`.
Changes:
- Introduces `WebScraperPlugin` with four `@kernel_function` endpoints backed by `aiohttp`.
- Adds unit tests for instantiation, headers, plugin registration, and mocked HTTP flows.
- Exports `WebScraperPlugin` from `semantic_kernel.core_plugins`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| python/semantic_kernel/core_plugins/web_scraper_plugin.py | Implements CRW client helpers plus scrape, crawl, status, and map kernel functions. |
| python/tests/unit/core_plugins/test_web_scraper_plugin.py | Adds unit tests with mocked aiohttp responses and Kernel plugin registration checks. |
| python/semantic_kernel/core_plugins/__init__.py | Re-exports the new plugin as part of the core plugins package. |
```python
        if pages:
            summary["pages"] = [
                {
                    "url": page.get("metadata", {}).get("sourceURL", ""),
                    "title": page.get("metadata", {}).get("title", ""),
                    "markdown": page.get("markdown", "")[:500],
                }
```
check_crawl_status() truncates each page's markdown to 500 characters, but the docstring says it returns crawl results. This truncation is a behavior change that API consumers may not expect. Consider documenting the truncation explicitly, returning full content, or making the truncation length configurable.
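A minimal sketch of the configurable variant. The attribute name `max_markdown_preview` follows the automated review's suggestion, and the `truncated` flag is likewise hypothetical:

```python
def summarize_page(page: dict, max_markdown_preview: int = 500) -> dict:
    """Build a page summary, flagging truncation instead of dropping content silently."""
    markdown = page.get("markdown", "")
    summary = {
        "url": page.get("metadata", {}).get("sourceURL", ""),
        "title": page.get("metadata", {}).get("title", ""),
        "markdown": markdown[:max_markdown_preview],
    }
    if len(markdown) > max_markdown_preview:
        summary["truncated"] = True
    return summary

long_page = {"markdown": "x" * 600, "metadata": {"sourceURL": "u", "title": "t"}}
print(summarize_page(long_page)["truncated"])  # → True
```

Callers can then detect cut content and fetch the full page with `scrape_url` if needed.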
```python
    }
    mock_resp = _mock_response(response_data)
    mock_session = MagicMock()
    mock_session.post = MagicMock(return_value=AsyncMock(__aenter__=AsyncMock(return_value=mock_resp)))
```
The mocked object returned from mock_session.post (and get) is used as an async context manager in the plugin, but the test setup only defines __aenter__ and not __aexit__. This can break async with teardown depending on mock behavior. Consider following the existing test_http_plugin.py pattern by patching aiohttp.ClientSession.post/get and setting mock_post.return_value.__aenter__.return_value..., or explicitly defining both __aenter__ and __aexit__ on the mocked request context manager.
Suggested change:

```python
    mock_session.post = MagicMock(
        return_value=AsyncMock(
            __aenter__=AsyncMock(return_value=mock_resp),
            __aexit__=AsyncMock(return_value=False),
        )
    )
```
```python
import logging
from typing import Annotated, Any

import aiohttp

from semantic_kernel.exceptions import FunctionExecutionException
from semantic_kernel.functions.kernel_function_decorator import kernel_function
from semantic_kernel.kernel_pydantic import KernelBaseModel

logger = logging.getLogger(__name__)
```
logger is defined but never used in this module, which will fail Ruff (unused variable). Remove the logging import and logger = logging.getLogger(__name__), or use logger for actual logging in this plugin.
```python
        async with (
            aiohttp.ClientSession() as session,
            session.post(url, headers=self._headers(), data=json.dumps(body)) as response,
        ):
            result = await response.json()
            if response.status >= 400:
                error_msg = result.get("error", f"HTTP {response.status}")
                raise FunctionExecutionException(f"CRW request failed: {error_msg}")
            return result
```
`_post()` always calls `response.json()` before checking `response.status`. If CRW returns a non-JSON error body (or invalid JSON), this will raise an aiohttp parsing exception instead of a `FunctionExecutionException`. Consider checking the status first and reading `response.text()` on errors, or wrapping JSON parsing in a try/except with a safe fallback error message. The same concern applies to `_get()`.
```python
        """Scrape a single web page and return its content.

        Args:
            url: The URL to scrape (must be http or https).
            formats: Comma-separated output formats. Defaults to "markdown".
            only_main_content: If True, strips navigation, footer, sidebar.
            css_selector: Optional CSS selector to extract specific elements.

        Returns:
            The scraped content as a string.
        """
        if not url:
            raise FunctionExecutionException("url cannot be `None` or empty")
```
The docstring says url "must be http or https" but the implementation only checks for empty string. Either validate the scheme (and potentially reject unsupported URLs early) or update the docstring to avoid stating requirements that are not enforced.
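A sketch of the early scheme check, if enforcement is the route taken. The helper name and the use of `ValueError` are illustrative; the plugin would presumably raise `FunctionExecutionException` instead:

```python
from urllib.parse import urlparse

def ensure_http_url(url: str) -> str:
    """Reject non-http(s) URLs early, before any request is attempted."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme: {scheme!r}")
    return url

assert ensure_http_url("https://example.com") == "https://example.com"
try:
    ensure_http_url("file:///etc/passwd")
except ValueError as exc:
    print(exc)  # → unsupported URL scheme: 'file'
```

This also blocks `file://` and other local schemes an agent might be tricked into requesting.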
```python
        }

        if formats:
            body["formats"] = [f.strip() for f in formats.split(",")]
```
When parsing formats, the current split and strip logic can still include empty entries (for example trailing commas). Consider filtering out empty strings and optionally validating against the supported set so the CRW API does not receive invalid formats.
Suggested change:

```python
        # Normalize, filter out empty entries, and validate against supported formats
        supported_formats = {"markdown", "html", "plainText", "links"}
        requested_formats = [f.strip() for f in formats.split(",") if f.strip()]
        valid_formats = [f for f in requested_formats if f in supported_formats]
        body["formats"] = valid_formats or ["markdown"]
```
@microsoft-github-policy-service agree |
- Check HTTP status before parsing JSON in `_post`/`_get` to handle non-JSON error responses (e.g. 502 HTML pages)
- Add `aiohttp.ClientTimeout(total=30)` to prevent hanging requests
- Update `crawl_website` description to clarify async behavior and job ID
- Sanitize `crawl_id` with `urllib.parse.quote` before URL interpolation
- Fix `map_site` to read links from top level (Firecrawl v1 `/map` format)
- Remove unused logger import
- Make markdown preview truncation configurable via `max_markdown_preview`
- Add format validation against supported formats set
- Add URL scheme validation (http/https only)
- Add `__aexit__` to all mock context managers in tests
- Add tests for html, plainText, links scrape return paths
- Add test for crawl response missing id field
- Add tests for `_get` error responses (404, 500, non-JSON 502)
- Add test for `css_selector` inclusion in request body
- Add test for `crawl_id` path traversal sanitization
Summary
- Adds `WebScraperPlugin`, which integrates CRW, an open-source web scraper for AI agents, with Semantic Kernel
- Provides `scrape_url`, `crawl_website`, `check_crawl_status`, and `map_site` functions with full type annotations and `@kernel_function` decorators
- Follows existing core plugin conventions (`HttpPlugin`, `WebSearchEnginePlugin`)

Motivation
AI agents frequently need to extract content from web pages for RAG pipelines, research tasks, and data gathering. CRW provides a lightweight, self-hostable web scraping backend that pairs naturally with Semantic Kernel's plugin architecture.
Test plan