The God-Tier Intelligence Engine for AI Agents
The Sovereign, Self-Hosted Alternative to Firecrawl, Jina, and Tavily.
ShadowCrawl is not just a scraper or a search wrapper; it is a complete intelligence layer purpose-built for AI agents. ShadowCrawl ships a native Rust meta-search engine running inside the same binary: zero extra containers, parallel engines, LLM-grade clean output.
When every other tool gets blocked, ShadowCrawl doesn't retreat; it escalates: native engines → native Chromium CDP headless → the Human-in-the-Loop (HITL) nuclear option. You always get results.
ShadowCrawl v3.0.0 ships a 100% Rust-native meta-search engine that queries 4 engines in parallel and fuses results intelligently:
| Engine | Coverage | Notes |
|---|---|---|
| DuckDuckGo | General Web | HTML scrape, no API key needed |
| Bing | General + News | Best for current events |
| Google | Authoritative Results | High-relevance, deduped |
| Brave Search | Privacy-Focused | Independent index, low overlap |
Parallel Concurrency: all 4 engines fire simultaneously, so total latency equals the slowest engine, not the sum of all.

Smart Deduplication + Scoring: cross-engine results are merged by URL fingerprint. Pages confirmed by 2+ engines receive a corroboration score boost, and domain authority weighting (docs, .gov, .edu, major outlets) pushes high-trust sources to the top.

Ultra-Clean Output for LLMs: clean fields and predictable structure:

- `published_at` is parsed and stored as a clean ISO-8601 field (`2025-07-23T00:00:00`)
- `content`/`snippet` is clean: zero date-prefix garbage
- `breadcrumbs` extracted from the URL path for navigation context
- `domain` and `source_type` auto-classified (`blog`, `docs`, `reddit`, `news`, etc.)
Result: LLMs receive dense, token-efficient, structured data instead of a wall of noisy text.
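
As an illustration only, a single fused result might look like the sketch below; the field names follow the descriptions above, but the exact schema may differ.

```jsonc
// Illustrative result shape only - field names follow the prose above,
// values and exact schema are hypothetical.
{
  "url": "https://docs.rs/tokio/latest/tokio/",
  "domain": "docs.rs",
  "source_type": "docs",
  "published_at": "2025-07-23T00:00:00",
  "snippet": "Tokio is a runtime for writing reliable asynchronous applications...",
  "breadcrumbs": ["tokio", "latest", "tokio"],
  "score": 0.87 // corroboration/authority-weighted score, illustrative value
}
```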
Unstoppable Fallback: if an engine returns a bot-challenge page (anomaly.js, Cloudflare, PerimeterX), it is automatically retried via the native Chromium CDP instance (headless Chrome, bundled in-binary). No manual intervention. No 0-result failures.

Quality > Quantity: ~20 deduplicated, scored results rather than 50 raw duplicates. For an AI agent with a limited context window, 20 high-quality results outperform 50 noisy ones every time.
| Feature | Details |
|---|---|
| God-Tier Meta-Search | Parallel Google / Bing / DDG / Brave · dedup · scoring · breadcrumbs · `published_at` |
| Universal Scraper | Rust-native + native Chromium CDP for JS-heavy and anti-bot sites |
| Human Auth (HITL) | `human_auth_session`: real browser + persistent cookies + instruction overlay + automatic re-injection. Fetch any protected URL. |
| Semantic Memory | Embedded LanceDB + Model2Vec for long-term research recall (no DB container) |
| HITL Non-Robot Search | Visible Brave Browser + keyboard hooks for human CAPTCHA / login-wall bypass |
| Deep Crawler | Recursive, bounded crawl to map entire subdomains |
| Proxy Master | Native HTTP/SOCKS5 pool rotation with health checks |
| Universal Janitor | Strips cookie banners, popups, skeleton screens; delivers clean Markdown |
| Hydration Extractor | Resolves React/Next.js hydration JSON (`__NEXT_DATA__`, embedded state) |
| Anti-Bot Arsenal | Stealth UA rotation, fingerprint spoofing, CDP automation, mobile profile emulation |
| Structured Extract | CSS-selector + prompt-driven field extraction from any page |
| Batch Scrape | Parallel scrape of N URLs with configurable concurrency |
ShadowCrawl is a pure binary: a single Rust executable exposes MCP tools (stdio) and an optional HTTP server. No Docker, no sidecars.
When standard automation fails (Cloudflare, CAPTCHA, complex logins), ShadowCrawl activates the human element.
This is our signature tool that surpasses all competitors. While most scrapers fail on login-walled content, `human_auth_session` opens a real, visible browser window for you to solve the challenge.
Once you click FINISH & RETURN, all authentication cookies are transparently captured and persisted in `~/.shadowcrawl/sessions/`. Subsequent requests to the same domain automatically inject these cookies, making future fetches fully automated and effortless.
- Instruction Overlay: a native green banner guides the user on what to solve.
- Persistent Sessions: solve once, scrape forever. No need to log in manually again for weeks.
- Security first: cookies are stored locally; encryption at rest is optional/upcoming.
- Auto-injection: the next `web_fetch` or `web_crawl` calls automatically load the saved sessions.
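
Sessions captured this way can be inspected on disk; a quick sketch, assuming the per-domain `{domain}.json` naming described in the changelog below:

```bash
# List persisted auth sessions (filenames like example.com.json are hypothetical).
ls ~/.shadowcrawl/sessions/
```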
We don't just claim; we show receipts. Everything below was captured with `human_auth_session` and our advanced CDP engines (2026-02-20):
| Target | Protection | Evidence | Extracted |
|---|---|---|---|
| | Cloudflare + Auth | JSON · Snippet | 60+ job listings ✅ |
| Ticketmaster | Cloudflare Turnstile | JSON · Snippet | Tour dates & venues ✅ |
| Airbnb | DataDome | JSON · Snippet | 1,000+ Tokyo listings ✅ |
| Upwork | reCAPTCHA | JSON · Snippet | 160K+ job postings ✅ |
| Amazon | AWS Shield | JSON · Snippet | RTX 5070 Ti search results ✅ |
| nowsecure.nl | Cloudflare | JSON | Manual button verified ✅ |
Full analysis: proof/README.md
Prebuilt assets are published for windows-x64, windows-arm64, linux-x64, and linux-arm64. Download the latest release assets from GitHub Releases and run one of:

- `shadowcrawl-mcp` - MCP stdio server (recommended for VS Code / Cursor / Claude Desktop)
- `shadowcrawl` - HTTP server (default port `5000`; override via `--port`, `PORT`, or `SHADOWCRAWL_PORT`)

Confirm the HTTP server is alive:

```bash
./shadowcrawl --port 5000
curl http://localhost:5000/health
```

To build from source instead, clone the repository:

```bash
git clone https://github.com/DevsHero/shadowcrawl.git
cd shadowcrawl
```

Build the two main binaries with the HITL search feature:

```bash
cd mcp-server
cargo build --release --features non_robot_search --bin shadowcrawl --bin shadowcrawl-mcp
```

Or build all binaries with all optional features enabled:

```bash
cd mcp-server
cargo build --release --all-features
```

Or install (puts the binaries into your Cargo bin directory):

```bash
cargo install --path mcp-server --locked
```

Binaries land at:

- `target/release/shadowcrawl` - HTTP server (default port `5000`; override via `--port`, `PORT`, or `SHADOWCRAWL_PORT`)
- `target/release/shadowcrawl-mcp` - MCP stdio server
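
As a quick smoke test (assuming the build above succeeded), you can launch the stdio binary by hand; it speaks MCP JSON-RPC over stdin/stdout, so it will simply wait for a client until interrupted.

```bash
# Start the MCP stdio server with info-level logs (RUST_LOG as in the example
# config later in this README); press Ctrl-C to stop.
RUST_LOG=info ./target/release/shadowcrawl-mcp
```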
Prerequisites for HITL:
- Brave Browser (brave.com/download)
- Accessibility permission (macOS: System Preferences → Privacy & Security → Accessibility)
- A desktop session (not SSH-only)
Platform guides: docs/window_setup.md · docs/ubuntu_setup.md
After any binary rebuild/update, restart your MCP client session to pick up new tool definitions.
Use this exact decision flow to get the highest-quality results with minimal tokens:

- `memory_search` first (avoid re-fetching)
- `web_search_json` for initial research (search + content summaries in one call)
- `web_fetch` for specific URLs (docs/articles)
  - `output_format="clean_json"` for token-efficient output
  - set `query` + `strict_relevance=true` when you want only query-relevant paragraphs
- If `web_fetch` returns 403/429/rate-limit → `proxy_control` `grab`, then retry with `use_proxy=true`
- If `web_fetch` returns `auth_risk_score >= 0.4` → `visual_scout` (confirm the login wall) → `human_auth_session` (The God-Tier Nuclear Option)

Structured extraction (schema-first):

- Prefer `fetch_then_extract` for one-shot fetch + extract.
- `strict=true` (default) enforces the schema shape: missing arrays become `[]`, missing scalars become `null` (no schema drift).
- Treat `confidence=0.0` as "placeholder / unrendered page" (often JS-only sites like crates.io). Escalate to browser rendering (CDP/HITL) instead of trusting the fields.
- New in v3.0.0: placeholder detection is now scalar-only. Pure-array schemas (only lists/structs) never trigger `confidence=0.0`, fixing prior regressions.

clean_json notes:

- Large pages are truncated to respect `max_chars` (look for the `clean_json_truncated` warning). Increase `max_chars` to see more.
- `key_code_blocks` is extracted from fenced blocks and signature-like inline code; short docs pages are supported.
- v3.0.0 fix: module extraction on docs.rs works recursively for all relative and absolute sub-paths.
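
For reference, a sketch of the arguments a token-efficient `web_fetch` call might pass; the parameter names come from this section, `url` is an assumed required field, and the exact argument schema may differ.

```jsonc
// Hypothetical web_fetch arguments - parameter names from this section,
// "url" is assumed; check the tool schema for the authoritative shape.
{
  "url": "https://docs.rs/tokio/latest/tokio/",
  "output_format": "clean_json",
  "query": "how to spawn tasks",
  "strict_relevance": true,
  "max_chars": 10000
}
```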
ShadowCrawl exposes all tools via the Model Context Protocol (stdio transport).
Add ShadowCrawl to your MCP config (`~/.config/Code/User/mcp.json`).
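
A minimal config sketch; adjust the `/YOUR_PATH/...` placeholders and env values for your machine:

```json
{
  "servers": {
    "shadowcrawl": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=info",
        "SEARCH_ENGINES=google,bing,duckduckgo,brave",
        "LANCEDB_URI=/YOUR_PATH/shadowcrawl/lancedb",
        "HTTP_TIMEOUT_SECS=30",
        "MAX_CONTENT_CHARS=10000",
        "IP_LIST_PATH=/YOUR_PATH/shadowcrawl/ip.txt",
        "PROXY_SOURCE_PATH=/YOUR_PATH/shadowcrawl/proxy_source.json",
        "/YOUR_PATH/shadowcrawl/mcp-server/target/release/shadowcrawl-mcp"
      ]
    }
  }
}
```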
Use the same stdio setup as VS Code (run `shadowcrawl-mcp` locally and pass env vars via `env` or your client's env field).
Full multi-IDE guide: docs/IDE_SETUP.md
| Variable | Default | Description |
|---|---|---|
| `CHROME_EXECUTABLE` | auto-detected | Override path to the Chromium/Chrome/Brave binary |
| `SEARCH_ENGINES` | `google,bing,duckduckgo,brave` | Active search engines (comma-separated) |
| `SEARCH_MAX_RESULTS_PER_ENGINE` | `10` | Results per engine before merge |
| `SEARCH_CDP_FALLBACK` | `true` if a browser is found | Auto-retry blocked engines via native Chromium CDP (alias: `SEARCH_BROWSERLESS_FALLBACK`) |
| `SEARCH_SIMULATE_BLOCK` | (unset) | Force the blocked path for testing: `duckduckgo,bing` or `all` |
| `LANCEDB_URI` | (unset) | Path for semantic research memory (optional) |
| `SHADOWCRAWL_NEUROSIPHON` | `1` (enabled) | Set to `0` / `false` / `off` to disable all NeuroSiphon techniques (import nuking, SPA extraction, semantic shaving, search reranking) |
| `HTTP_TIMEOUT_SECS` | `30` | Per-request timeout (seconds) |
| `OUTBOUND_LIMIT` | `32` | Max concurrent outbound connections |
| `MAX_CONTENT_CHARS` | `10000` | Max chars per scraped document |
| `IP_LIST_PATH` | (unset) | Path to a proxy IP list |
| `SCRAPE_DELAY_PRESET` | `polite` | `fast` / `polite` / `cautious` |
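
A sketch of setting these inline when launching the HTTP server; only the variable names and `--port` come from this README, the values below are illustrative examples.

```bash
# Illustrative launch: restrict the engine set, cap per-engine results,
# raise the timeout, and slow the scrape cadence. Values are examples only.
SEARCH_ENGINES=google,brave \
SEARCH_MAX_RESULTS_PER_ENGINE=5 \
HTTP_TIMEOUT_SECS=60 \
SCRAPE_DELAY_PRESET=cautious \
./shadowcrawl --port 5000
```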
| Feature | Firecrawl / Jina / Tavily | ShadowCrawl v3.0.0 |
|---|---|---|
| Cost | $49-$499/mo | $0 (self-hosted) |
| Privacy | They see your queries | 100% private, local-only |
| Search Engine | Proprietary / 3rd-party API | Native Rust (4 engines, parallel) |
| Result Quality | Mixed, noisy snippets | Deduped, scored, LLM-clean |
| Cloudflare Bypass | Rarely | Native Chromium CDP + HITL fallback |
| LinkedIn / Airbnb | Blocked | 99.99% success (HITL) |
| JS Rendering | Cloud API | Native Brave + bundled Chromium CDP |
| Semantic Memory | None | Embedded LanceDB + Model2Vec |
| Proxy Support | Paid add-on | Native SOCKS5/HTTP rotation |
| MCP Native | Partial | Full MCP stdio + HTTP |
ShadowCrawl works best when your AI agent knows the operational rules before it starts: which tool to call first, when to rotate proxies, and when not to use `extract_structured`. Without these rules, agents waste tokens re-fetching cached data and can misuse tools on incompatible sources.
The complete rules file lives at `.github/copilot-instructions.md` (VS Code / GitHub Copilot) and is also available as `.clinerules` for Cline. Copy the block below into the IDE-specific file for your editor.
Create (or append to) `.github/copilot-instructions.md` in your workspace root:
## MCP Usage Guidelines - ShadowCrawl
### ShadowCrawl Priority Rules
1. **Memory first (NEVER skip):** ALWAYS call `research_history` BEFORE calling `search_web`,
`search_structured`, or `scrape_url`.
**Cache-quality guard:** only skip a live fetch when ALL of the following are true:
   - similarity score ≥ 0.60
   - entry_type is NOT "search" (search entries have no word_count → always follow up with scrape_url)
   - word_count ≥ 50 (cached crates.io pages are JS-placeholders with ~11 words)
- no placeholder/sparse warnings (placeholder_page, short_content, content_restricted)
2. **Initial research:** use `search_structured` (search + content summaries in one call).
For private/internal tools not indexed publicly, skip search and go directly to
`scrape_url` on the known repo/docs URL.
3. **Doc/article pages:** `scrape_url` with `output_format: clean_json`,
`strict_relevance: true`, `query: "<your question>"`.
Raw `.md`/`.txt` URLs are auto-detected β HTML pipeline is skipped, raw content returned.
4. **Proxy rotation (mandatory on first block):** if `scrape_url` or `search_web` returns
403/429/rate-limit, immediately call `proxy_manager` with `action: "grab"` then retry
with `use_proxy: true`. Do NOT wait for a second failure.
4a. **Auto-escalation on low confidence:** if `scrape_url` returns confidence < 0.3 or
    extraction_score < 0.4 → retry with `quality_mode: "aggressive"` → `visual_scout`
    → `human_auth_session`. Never stay stuck on a low-confidence result.
5. **Schema extraction:** use `fetch_then_extract` (one-shot) or `extract_structured`.
Both auto-inject `raw_markdown_url` warning when called on raw file URLs.
Do NOT point at raw `.md`/`.json`/`.txt` unless intentional.
6. **Sub-page discovery:** use `crawl_website` before `scrape_url` when you only know
an index URL and need to find the right sub-page.
7. **Last resort:** `non_robot_search` only after direct fetch + proxy rotation have both
   failed (Cloudflare / CAPTCHA / login walls). Session cookies are persisted after login.

Create or append to `.cursorrules` in your project root with the same block above.
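
One way to do that, assuming you already created the rules file in your workspace as described above:

```bash
# Copy the same rules block into Cursor's project rules file
# (path assumes the workspace layout described above).
cat .github/copilot-instructions.md >> .cursorrules
```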
Already included in this repository as `.clinerules`. Cline loads it automatically; no action needed.
Paste the rules block into the Custom Instructions or System Prompt field in Claude Desktop settings (Settings → Advanced → System Prompt).
Any agent that accepts a system prompt or workspace instruction file: paste the same block. The rules are plain markdown and tool-agnostic.
```text
Question / research task
        │
        ▼
research_history ──► hit (≥ 0.60)? ──► cache-quality guard:
        │ miss                          ├─ entry_type == "search"? ──► don't skip; do scrape_url
        │                               ├─ word_count < 50 or placeholder warnings? ──► don't skip
        │                               └─ quality OK? ──► use cached result, STOP
        ▼
search_structured ──► enough content? ──► use it, STOP
        │ need deeper page
        ▼
scrape_url (clean_json + strict_relevance + query)
        ├─ confidence < 0.3 or extraction_score < 0.4?
        │      └──► retry quality_mode: aggressive ──► visual_scout ──► human_auth_session
        ├─ 403/429/blocked? ──► proxy_manager grab ──► retry use_proxy: true
        │      └──► still blocked? ──► non_robot_search (LAST RESORT)
        │
        └─ need schema JSON? ──► fetch_then_extract (schema + strict=true)
```
Full rules + per-tool quick-reference table:
.github/copilot-instructions.md
- `human_auth_session` (The Nuclear Option): launches a visible browser for human login/CAPTCHA solving. Captures and persists full authentication cookies to `~/.shadowcrawl/sessions/{domain}.json`. Enables full automation for protected URLs after a single manual session.
- Instruction Overlay: `human_auth_session` now displays a custom green "ShadowCrawl" instruction banner on top of the browser window to guide users through complex auth walls.
- Persistent Session Auto-Injection: `web_fetch`, `web_crawl`, and `visual_scout` now automatically check for and inject matching cookies from the local session store.
- `extract_structured` / `fetch_then_extract`: new optional params `placeholder_word_threshold` (int, default 10) and `placeholder_empty_ratio` (float 0-1, default 0.9) allow agents to tune placeholder-detection sensitivity per call.
- `web_crawl`: new optional `max_chars` param (default 10 000) caps total JSON output size to prevent workspace storage spill.
- Rustdoc module extraction: `extract_structured` / `fetch_then_extract` correctly populate `modules: [...]` on docs.rs pages using the `NAME/index.html` sub-directory convention.
- GitHub Discussions & Issues hydration: `fetch_via_cdp` detects `github.com/*/discussions/*` and `/issues/*` URLs; extends the network-idle window to 2.5 s / 12 s max and polls for `.timeline-comment`, `.js-discussion`, `.comment-body` DOM nodes.
- Contextual code blocks (`clean_json` mode): `SniperCodeBlock` gains a `context: Option<String>` field. Performs two-pass extraction for prose preceding fenced blocks and Markdown sentences containing inline snippets.
- IDE copilot-instructions guide (README): new "Agent Optimal Setup" section.
- `.clinerules` workspace file: all 7 priority rules + decision-flow diagram + per-tool quick-reference table.
- Agent priority rules in tool schemas: every MCP tool description now carries machine-readable "AGENT RULE" / "BEST PRACTICE" markers.
- Placeholder detection (Scalar-Only Logic): Confidence override to 0.0 now only considers scalar (non-array) fields. Pure-array schemas (headers, modules, structs) never trigger fake placeholder warnings, fixing false-positives on rich but list-heavy documentation pages.
- `web_fetch` (`output_format="clean_json"`): applies a `max_chars`-based paragraph budget and emits `clean_json_truncated` when output is clipped.
- `extract_fields` / `fetch_then_extract`: placeholder/unrendered pages (very low content + mostly empty schema fields) force `confidence=0.0`.
- Short-content bypass (`strict_relevance` / `extract_relevant_sections`): early exit with a descriptive warning when `word_count < 200`. Short pages (GitHub Discussions, Q&A threads) are returned whole.
- BUG-6: `modules: []` always empty on rustdoc pages; refactored the regex to support both absolute and simple relative module links (`init/index.html`, `optim/index.html`).
- BUG-7: false-positive `confidence=0.0` on real docs.rs pages; replaced the whole-schema empty ratio with a scalar-only ratio and raised the threshold.
- BUG-9: `web_crawl` could spill 16 KB+ of JSON into VS Code workspace storage; the handler now truncates the response to `max_chars` (default 10 000).
- `web_fetch` (`output_format="clean_json"`): the paragraph filter now adapts for `word_count < 200`.
- `fetch_then_extract`: prevents false-high confidence on JS-only placeholder pages (e.g. crates.io) by overriding confidence to 0.0.
- `cdp_fallback_failed` on GitHub Discussions: the extended CDP hydration window and selector polling ensure full thread capture.
ShadowCrawl is built with ❤️ by a solo developer for the open-source AI community. If this tool saved you from a $500/mo scraping API bill:
- Star the repo: it helps others discover this project
- Found a bug? Open an issue
- Feature request? Start a discussion
- Fuel more updates:
License: MIT, free for personal and commercial use.