feat(python): Add CRW web scraper plugin#13709
Conversation
Add `WebScraperPlugin` that integrates CRW (open-source web scraper for AI agents) with Semantic Kernel. CRW exposes a Firecrawl-compatible REST API for scraping, crawling, and site mapping. The plugin provides the following kernel functions:

- `scrape_url`: Scrape a single page and return markdown content
- `crawl_website`: Start an async crawl job across multiple pages
- `check_crawl_status`: Poll crawl job progress and results
- `map_site`: Discover all URLs on a website via links and sitemaps
Automated Code Review
Reviewers: 4 | Confidence: 88%
✗ Correctness
New WebScraperPlugin wrapping a CRW (Firecrawl-compatible) server. The code is internally consistent and tests pass for the mocked scenarios. However, there are two correctness concerns: (1) In both `_post` and `_get`, `response.json()` is called before checking `response.status`, so a non-JSON error response (e.g., a 502 with an HTML body) will raise an unhandled `aiohttp.ContentTypeError` instead of the intended `FunctionExecutionException`; the existing `HttpPlugin` avoids this by using `raise_for_status=True`. (2) The `map_site` method parses links from `result['data']['links']`, but the Firecrawl v1 `/map` endpoint returns links at the top level (`result['links']`), so if CRW truly follows the Firecrawl spec this will always return an empty list.
✗ Security Reliability
New WebScraperPlugin wraps a CRW (Firecrawl-compatible) server with kernel functions for scraping, crawling, and site mapping. The main security/reliability concerns are: (1) no HTTP request timeout on `aiohttp` sessions, allowing indefinite hangs if the CRW server is unresponsive, which is a denial-of-service risk; (2) the `crawl_id` parameter in `check_crawl_status` is interpolated directly into the URL path without sanitization, enabling path traversal or query injection when called by an AI agent with untrusted input; (3) `response.json()` is called before checking the HTTP status code, so non-JSON error responses (e.g., 502 HTML pages) will raise an unhandled `ContentTypeError` instead of the intended `FunctionExecutionException`. These patterns partially mirror existing plugins (`HttpPlugin`, CrewAI client), but the new plugin is a good opportunity to raise the bar.
✗ Test Coverage
The test file covers the happy path for each kernel function, empty-input validation, error responses for POST, and basic configuration. However, there are significant gaps: the `scrape_url` method has 5 distinct return-value branches (markdown, html, plainText, links, fallback `json.dumps`) but only the markdown path is tested; the `crawl_website` method has an untested error branch when the server returns no crawl ID; the `_get` error path (reachable via `check_crawl_status`) has no test; and optional parameters like `css_selector` are never exercised. These gaps leave meaningful branching logic and error handling uncovered.
✗ Design Approach
The plugin introduces a web scraping abstraction over the CRW/Firecrawl-compatible API. The most significant design issue is the two-step crawl pattern: `crawl_website` is described as 'Crawl a website … following links' but silently returns a raw job ID instead of content, leaving an LLM caller with no signal that a second polling call to `check_crawl_status` is required. This misalignment between the `@kernel_function` description and the actual return value is not a style issue; it will reliably break agent usage because the LLM has no way to infer the polling contract from the description alone. There are also fragile assumptions in the HTTP error-handling path (JSON parsing before the status check) and a silent hardcoded truncation of page content.
Flagged Issues
- In `_post` and `_get`, `await response.json()` is called before checking `response.status >= 400`. If the CRW server returns a non-JSON error response (e.g., 502 Bad Gateway with an HTML body), `response.json()` raises an unhandled `aiohttp.ContentTypeError`/`JSONDecodeError` instead of the intended `FunctionExecutionException`, masking the real HTTP failure. The status code should be checked first, then JSON parsing attempted.
- No request timeout is configured on `aiohttp.ClientSession` in `_post` and `_get`; a slow or unresponsive CRW server will cause the calling coroutine to hang indefinitely, which is a denial-of-service risk when the plugin is invoked by an AI agent.
- `crawl_website`'s `@kernel_function` description says 'Crawl a website … following links up to a specified depth', strongly implying it returns crawled content, but it returns a bare job ID string. Any LLM agent using this function will treat the job ID as the result and never call `check_crawl_status`. Either have `crawl_website` poll internally until completion, or rename/redescribe it to make the two-step contract explicit.
- `scrape_url` has 5 return-path branches (markdown → html → plainText → links → `json.dumps` fallback) but only the markdown branch is tested. The other four paths have zero coverage.
- `crawl_website` raises `FunctionExecutionException` when the server response has no 'id' field (lines 165-166), but no test covers this error branch.
Suggestions
- The `map_site` method parses `result.get('data', {}).get('links', [])`, but the Firecrawl v1 `/map` endpoint returns `{"success": true, "links": [...]}` at the top level. If CRW follows the Firecrawl spec, this will always return an empty list. Verify against the actual CRW API response shape.
- Sanitize or validate `crawl_id` in `check_crawl_status` before interpolating it into the URL path. Use `urllib.parse.quote(crawl_id, safe='')` to prevent path traversal (e.g., `../../admin`) or query injection (e.g., `id?admin=true`).
- Consider adding URL scheme validation (require http/https) on user-supplied URLs in `scrape_url`, `crawl_website`, and `map_site`, similar to how `HttpPlugin` offers `allowed_domains` filtering.
- Add a test for `_get` error handling (e.g., `check_crawl_status` with a 404/500 mock response) to mirror `test_scrape_url_error_response`, which only covers `_post` errors.
- Add a test exercising the `css_selector` parameter in `scrape_url` to verify it is included in the request body.
- Consider verifying the request body passed to the mock in happy-path tests (e.g., assert `mock_session.post` was called with the expected URL and JSON payload) rather than only asserting the return value, to catch silently dropped parameters.
- The hardcoded `[:500]` truncation of markdown in `check_crawl_status` (line 196) silently drops content with no signal to the caller. Make this limit a configurable plugin attribute (e.g., `max_markdown_preview: int = 500`) and include a `truncated: True` flag when content is cut.
Automated review by us's agents
```python
        ):
            result = await response.json()
            if response.status >= 400:
                error_msg = result.get("error", f"HTTP {response.status}")
                raise FunctionExecutionException(f"CRW request failed: {error_msg}")
            return result

    async def _get(self, path: str) -> dict[str, Any]:
```
Bug: response.json() is called before checking response.status. If the server returns a non-JSON error (e.g., 502 with HTML), this raises ContentTypeError/JSONDecodeError instead of FunctionExecutionException. Check status first, then attempt JSON parsing.
Suggested change:

```python
        async with (
            aiohttp.ClientSession() as session,
            session.post(url, headers=self._headers(), data=json.dumps(body)) as response,
        ):
            if response.status >= 400:
                try:
                    result = await response.json()
                    error_msg = result.get("error", f"HTTP {response.status}")
                except Exception:
                    error_msg = f"HTTP {response.status}"
                raise FunctionExecutionException(f"CRW request failed: {error_msg}")
            return await response.json()
```
```python
            aiohttp.ClientSession() as session,
            session.get(url, headers=self._headers()) as response,
        ):
            result = await response.json()
            if response.status >= 400:
                error_msg = result.get("error", f"HTTP {response.status}")
                raise FunctionExecutionException(f"CRW request failed: {error_msg}")
            return result
```
Same bug in _get: response.json() before status check. Non-JSON error responses will raise an unhandled exception instead of FunctionExecutionException.
Suggested change:

```python
        async with (
            aiohttp.ClientSession() as session,
            session.get(url, headers=self._headers()) as response,
        ):
            if response.status >= 400:
                try:
                    result = await response.json()
                    error_msg = result.get("error", f"HTTP {response.status}")
                except Exception:
                    error_msg = f"HTTP {response.status}"
                raise FunctionExecutionException(f"CRW request failed: {error_msg}")
            return await response.json()
```
```python
        ):
            result = await response.json()
            if response.status >= 400:
                error_msg = result.get("error", f"HTTP {response.status}")
```
No request timeout is set on the ClientSession. If the CRW server is unreachable or slow, this coroutine hangs indefinitely. Add an aiohttp.ClientTimeout to bound request duration. The same applies to _get.
Suggested change:

```python
    async def _post(self, path: str, body: dict[str, Any]) -> dict[str, Any]:
        """Send a POST request to the CRW server and return the JSON response."""
        url = f"{self.base_url.rstrip('/')}{path}"
        timeout = aiohttp.ClientTimeout(total=30)
        async with (
            aiohttp.ClientSession(timeout=timeout) as session,
```
```python
        if css_selector:
            body["cssSelector"] = css_selector
```
`crawl_website`'s `@kernel_function` description says 'Crawl a website starting from a URL, following links up to a specified depth' but the function returns a job ID string, not crawled content. An LLM invoking this function has no way to know a second polling call is needed. Update the description to reflect the actual return value, or have the function poll until completion.
Suggested change:

```python
        description="Start an async crawl job and return a job ID. Call check_crawl_status with the returned ID to retrieve results.",
```
```python
        return crawl_id

    @kernel_function(
```
crawl_id is user-supplied and interpolated directly into the URL path without sanitization. A malicious value like ../../admin or id?x=y could alter the request target. Use urllib.parse.quote(crawl_id, safe='') to encode path-unsafe characters.
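For illustration, a minimal sketch of that sanitization as a standalone helper. The endpoint path (`/v1/crawl/{id}`) and base URL are assumptions based on this thread, not taken from the plugin source:

```python
from urllib.parse import quote

def build_status_url(base_url: str, crawl_id: str) -> str:
    """Build the crawl-status URL with a percent-encoded job ID.

    safe="" also encodes "/" and "?", so a hostile crawl_id stays one opaque
    path segment instead of rewriting the request target.
    """
    return f"{base_url.rstrip('/')}/v1/crawl/{quote(crawl_id, safe='')}"

print(build_status_url("http://localhost:3002", "abc-123"))
# → http://localhost:3002/v1/crawl/abc-123

# a traversal attempt is neutralized into a single encoded segment
print(build_status_url("http://localhost:3002", "../../admin"))
```

A benign ID round-trips unchanged, while `../` and `?` sequences are encoded to `%2F`/`%3F` and can no longer change the path or query.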
```python
        }

        result = await self._post("/v1/map", body)
        links = result.get("data", {}).get("links", [])
```
Likely bug: Firecrawl v1 /map returns {"success": true, "links": [...]} — links are at the top level, not nested under data. If CRW follows the Firecrawl spec, this will always return []. Should this be result.get("links", [])?
Suggested change:

```diff
-        links = result.get("data", {}).get("links", [])
+        links = result.get("links", [])
```
```python
    assert result == "# Example\n\nHello world"


async def test_scrape_url_empty_url():
```
This test only covers the markdown return path of scrape_url. The method has 4 other branches (html, plainText, links, json.dumps fallback) that are all untested. Add tests where the mock response data omits 'markdown' but includes 'html', 'plainText', or 'links' to cover each branch.
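To see what full branch coverage needs, the fallback order described above can be modeled as a small standalone helper (a sketch; the field names and branch order are assumed from this review thread, not copied from the plugin):

```python
import json

def select_content(data: dict) -> str:
    # Branch order assumed from the review:
    # markdown → html → plainText → links → json.dumps fallback.
    for key in ("markdown", "html", "plainText"):
        if data.get(key):
            return data[key]
    if data.get("links"):
        return "\n".join(data["links"])
    return json.dumps(data)

# one input per branch — a test suite should exercise all five
cases = [
    {"markdown": "# md"},
    {"html": "<p>hi</p>"},
    {"plainText": "hi"},
    {"links": ["https://a", "https://b"]},
    {"other": 1},
]
for case in cases:
    print(select_content(case))
```

A parametrized test over inputs like `cases` would cover every branch with one mock-response test per entry.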
```python
    mock_session = MagicMock()
    mock_session.post = MagicMock(return_value=AsyncMock(__aenter__=AsyncMock(return_value=mock_resp)))
    mock_session.close = AsyncMock()
    mock_session_cls.return_value.__aenter__ = AsyncMock(return_value=mock_session)
```
No test for the case where crawl_website succeeds (HTTP 200) but the response JSON has no 'id' field. The plugin raises FunctionExecutionException at line 165-166 for this case. Add a test with a mock response like {"success": True} (no 'id' key) and assert the exception is raised.
Suggested change:

```python
async def test_crawl_website_missing_id():
    response_data = {"success": True}
    mock_resp = _mock_response(response_data)
    mock_session = MagicMock()
    mock_session.post = MagicMock(return_value=AsyncMock(__aenter__=AsyncMock(return_value=mock_resp)))
    mock_session.close = AsyncMock()
    with patch("aiohttp.ClientSession") as mock_session_cls:
        mock_session_cls.return_value.__aenter__ = AsyncMock(return_value=mock_session)
        mock_session_cls.return_value.__aexit__ = AsyncMock(return_value=False)
        plugin = WebScraperPlugin()
        with pytest.raises(FunctionExecutionException, match="did not return a crawl job ID"):
            await plugin.crawl_website("https://example.com")
```
```python
        "success": True,
        "data": {"links": ["https://example.com", "https://example.com/about"]},
    }
    mock_resp = _mock_response(response_data)
```
This tests the _get happy path but there's no corresponding error test for _get (HTTP >= 400). Add a test similar to test_scrape_url_error_response but using check_crawl_status with a mock GET returning status 404 or 500.
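The shape of that error-path test can be sketched without aiohttp at all, using a fake response object against status-first parsing (`FakeResponse` and `parse_response` are illustrative stand-ins, not the plugin's actual classes):

```python
import asyncio

class FakeResponse:
    """Minimal stand-in for an aiohttp response object."""

    def __init__(self, status: int, json_data=None):
        self.status = status
        self._json = json_data

    async def json(self):
        if self._json is None:
            # stand-in for aiohttp.ContentTypeError on an HTML error body
            raise ValueError("response body is not JSON")
        return self._json

async def parse_response(response) -> dict:
    # Status-first parsing, as the review recommends for _post/_get.
    if response.status >= 400:
        try:
            result = await response.json()
            error_msg = result.get("error", f"HTTP {response.status}")
        except Exception:
            error_msg = f"HTTP {response.status}"
        raise RuntimeError(f"CRW request failed: {error_msg}")
    return await response.json()

# a 502 with a non-JSON body now surfaces as a clean, catchable error
try:
    asyncio.run(parse_response(FakeResponse(502)))
except RuntimeError as exc:
    print(exc)  # → CRW request failed: HTTP 502
```

The real test would do the same with a mocked GET returning status 404/500 and assert `FunctionExecutionException` is raised.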
```python
    plugin = WebScraperPlugin()
    result = await plugin.scrape_url("https://example.com", formats="markdown, links")
    assert result == "# Example"
```
Good test for the formats parameter, but the css_selector parameter (plugin line 113) is never tested. Add a test that passes css_selector and verifies it's included in the request body.
Pull request overview
Adds a new Python core plugin that wraps the CRW (Firecrawl-compatible) REST API so Semantic Kernel agents can scrape pages, crawl sites, check crawl status, and map links, with unit tests and public export via `core_plugins.__init__`.
Changes:
- Introduces `WebScraperPlugin` with four `@kernel_function` endpoints backed by `aiohttp`.
- Adds unit tests for instantiation, headers, plugin registration, and mocked HTTP flows.
- Exports `WebScraperPlugin` from `semantic_kernel.core_plugins`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| python/semantic_kernel/core_plugins/web_scraper_plugin.py | Implements CRW client helpers plus scrape, crawl, status, and map kernel functions. |
| python/tests/unit/core_plugins/test_web_scraper_plugin.py | Adds unit tests with mocked aiohttp responses and Kernel plugin registration checks. |
| python/semantic_kernel/core_plugins/__init__.py | Re-exports the new plugin as part of the core plugins package. |
```python
        if pages:
            summary["pages"] = [
                {
                    "url": page.get("metadata", {}).get("sourceURL", ""),
                    "title": page.get("metadata", {}).get("title", ""),
                    "markdown": page.get("markdown", "")[:500],
                }
```
check_crawl_status() truncates each page's markdown to 500 characters, but the docstring says it returns crawl results. This truncation is a behavior change that API consumers may not expect. Consider documenting the truncation explicitly, returning full content, or making the truncation length configurable.
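A minimal sketch of the configurable variant. The attribute name `max_markdown_preview` follows the automated review's suggestion, and the `truncated` flag is likewise hypothetical:

```python
def summarize_page(page: dict, max_markdown_preview: int = 500) -> dict:
    """Build a page summary, flagging truncation instead of dropping content silently."""
    markdown = page.get("markdown", "")
    summary = {
        "url": page.get("metadata", {}).get("sourceURL", ""),
        "title": page.get("metadata", {}).get("title", ""),
        "markdown": markdown[:max_markdown_preview],
    }
    if len(markdown) > max_markdown_preview:
        summary["truncated"] = True
    return summary

long_page = {"markdown": "x" * 600, "metadata": {"sourceURL": "u", "title": "t"}}
print(summarize_page(long_page)["truncated"])  # → True
```

Callers can then detect cut content and fetch the full page with `scrape_url` if needed.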
```python
    }
    mock_resp = _mock_response(response_data)
    mock_session = MagicMock()
    mock_session.post = MagicMock(return_value=AsyncMock(__aenter__=AsyncMock(return_value=mock_resp)))
```
The mocked object returned from mock_session.post (and get) is used as an async context manager in the plugin, but the test setup only defines __aenter__ and not __aexit__. This can break async with teardown depending on mock behavior. Consider following the existing test_http_plugin.py pattern by patching aiohttp.ClientSession.post/get and setting mock_post.return_value.__aenter__.return_value..., or explicitly defining both __aenter__ and __aexit__ on the mocked request context manager.
Suggested change:

```python
    mock_session.post = MagicMock(
        return_value=AsyncMock(
            __aenter__=AsyncMock(return_value=mock_resp),
            __aexit__=AsyncMock(return_value=False),
        )
    )
```
```python
import logging
from typing import Annotated, Any

import aiohttp

from semantic_kernel.exceptions import FunctionExecutionException
from semantic_kernel.functions.kernel_function_decorator import kernel_function
from semantic_kernel.kernel_pydantic import KernelBaseModel

logger = logging.getLogger(__name__)
```
logger is defined but never used in this module, which will fail Ruff (unused variable). Remove the logging import and logger = logging.getLogger(__name__), or use logger for actual logging in this plugin.
```python
        async with (
            aiohttp.ClientSession() as session,
            session.post(url, headers=self._headers(), data=json.dumps(body)) as response,
        ):
            result = await response.json()
            if response.status >= 400:
                error_msg = result.get("error", f"HTTP {response.status}")
                raise FunctionExecutionException(f"CRW request failed: {error_msg}")
            return result
```
`_post()` always calls `response.json()` before checking `response.status`. If CRW returns a non-JSON error body (or invalid JSON), this will raise an aiohttp parsing exception instead of a `FunctionExecutionException`. Consider checking the status first and reading `response.text()` on errors, or wrapping JSON parsing in a try/except with a safe fallback error message. The same concern applies to `_get()`.
```python
        """Scrape a single web page and return its content.

        Args:
            url: The URL to scrape (must be http or https).
            formats: Comma-separated output formats. Defaults to "markdown".
            only_main_content: If True, strips navigation, footer, sidebar.
            css_selector: Optional CSS selector to extract specific elements.

        Returns:
            The scraped content as a string.
        """
        if not url:
            raise FunctionExecutionException("url cannot be `None` or empty")
```
The docstring says url "must be http or https" but the implementation only checks for empty string. Either validate the scheme (and potentially reject unsupported URLs early) or update the docstring to avoid stating requirements that are not enforced.
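A sketch of the early scheme check, if enforcement is the route taken. The helper name and the use of `ValueError` are illustrative; the plugin would presumably raise `FunctionExecutionException` instead:

```python
from urllib.parse import urlparse

def ensure_http_url(url: str) -> str:
    """Reject non-http(s) URLs early, before any request is attempted."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme: {scheme!r}")
    return url

assert ensure_http_url("https://example.com") == "https://example.com"
try:
    ensure_http_url("file:///etc/passwd")
except ValueError as exc:
    print(exc)  # → unsupported URL scheme: 'file'
```

This also blocks `file://` and other local schemes an agent might be tricked into requesting.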
```python
        }

        if formats:
            body["formats"] = [f.strip() for f in formats.split(",")]
```
When parsing formats, the current split and strip logic can still include empty entries (for example trailing commas). Consider filtering out empty strings and optionally validating against the supported set so the CRW API does not receive invalid formats.
Suggested change:

```python
        # Normalize, filter out empty entries, and validate against supported formats
        supported_formats = {"markdown", "html", "plainText", "links"}
        requested_formats = [f.strip() for f in formats.split(",") if f.strip()]
        valid_formats = [f for f in requested_formats if f in supported_formats]
        body["formats"] = valid_formats or ["markdown"]
```
@microsoft-github-policy-service agree |
- Check HTTP status before parsing JSON in `_post`/`_get` to handle non-JSON error responses (e.g. 502 HTML pages)
- Add `aiohttp.ClientTimeout(total=30)` to prevent hanging requests
- Update `crawl_website` description to clarify async behavior and job ID
- Sanitize `crawl_id` with `urllib.parse.quote` before URL interpolation
- Fix `map_site` to read links from top level (Firecrawl v1 `/map` format)
- Remove unused logger import
- Make markdown preview truncation configurable via `max_markdown_preview`
- Add format validation against supported formats set
- Add URL scheme validation (http/https only)
- Add `__aexit__` to all mock context managers in tests
- Add tests for html, plainText, links scrape return paths
- Add test for crawl response missing id field
- Add tests for `_get` error responses (404, 500, non-JSON 502)
- Add test for `css_selector` inclusion in request body
- Add test for `crawl_id` path traversal sanitization
Summary
- Adds `WebScraperPlugin`, which integrates CRW, an open-source web scraper for AI agents, with Semantic Kernel
- Provides `scrape_url`, `crawl_website`, `check_crawl_status`, and `map_site` functions with full type annotations and `@kernel_function` decorators
- Follows existing core plugin conventions (`HttpPlugin`, `WebSearchEnginePlugin`)

Motivation
AI agents frequently need to extract content from web pages for RAG pipelines, research tasks, and data gathering. CRW provides a lightweight, self-hostable web scraping backend that pairs naturally with Semantic Kernel's plugin architecture.
Test plan