Merged

Changes from all commits · 59 commits
edd0b57
Fix: Use correct URL variable for raw HTML extraction (#1116)
rbushri Aug 28, 2025
c2c4d42
Fix #1181: Preserve whitespace in code blocks during HTML scraping
ntohidi Nov 17, 2025
eca04b0
Refactor Pydantic model configuration to use ConfigDict for arbitrary…
Ahmed-Tawfik94 Nov 18, 2025
7771ed3
Merge branch 'develop' into fix/wrong_url_raw
rbushri Nov 24, 2025
84bfea8
Fix EmbeddingStrategy: Uncomment response handling for the variations…
ntohidi Nov 25, 2025
94c8a83
Merge pull request #1447 from rbushri/fix/wrong_url_raw
ntohidi Nov 25, 2025
b36c6da
Fix: permission issues with .cache/url_seeder and other runtime cache…
ntohidi Nov 25, 2025
a0c5f0f
fix: ensure BrowserConfig.to_dict serializes proxy_config
SohamKukreti Nov 26, 2025
dcb77c9
Merge pull request #1623 from unclecode/fix/deprecated_pydantic
ntohidi Nov 27, 2025
7a133e2
feat: make LLM backoff configurable end-to-end
SohamKukreti Nov 28, 2025
33a3cc3
reproduced AttributeError from #1642
murphycw Dec 1, 2025
6ec6bc4
pass timeout parameter to docker client request
murphycw Dec 1, 2025
eb76df2
added missing deep crawling objects to init
murphycw Dec 1, 2025
e95e8e1
generalized query in ContentRelevanceFilter to be a str or list
murphycw Dec 1, 2025
3a8f829
import modules from enhanceable deserialization
murphycw Dec 1, 2025
6893094
parameterized tests
murphycw Dec 1, 2025
07ccf13
Fix: capture current page URL to reflect JavaScript navigation and ad…
ntohidi Dec 2, 2025
afc31e1
Merge branch 'develop' of https://github.com/unclecode/crawl4ai into …
ntohidi Dec 2, 2025
d06c39e
Merge pull request #1641 from unclecode/fix/serialize-proxy-config
ntohidi Dec 2, 2025
f32cfc6
Merge pull request #1645 from unclecode/fix/configurable-backoff
ntohidi Dec 2, 2025
df4d87e
refactor: replace PyPDF2 with pypdf across the codebase. ref #1412
ntohidi Dec 3, 2025
5a8fb57
Merge pull request #1648 from christopher-w-murphy/fix/content-releva…
ntohidi Dec 3, 2025
306ddcb
Merge branch 'main' into develop
ntohidi Dec 11, 2025
8ae908b
Add browser_context_id and target_id parameters to BrowserConfig
unclecode Dec 13, 2025
66941a5
Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server…
unclecode Dec 13, 2025
d22825e
Fix: add cdp_cleanup_on_close to from_kwargs
unclecode Dec 13, 2025
b2e4a1f
Fix: find context by target_id for concurrent CDP connections
unclecode Dec 13, 2025
c1e485e
Fix: use target_id to find correct page in get_page
unclecode Dec 13, 2025
8014805
Fix: use CDP to find context by browserContextId for concurrent sessions
unclecode Dec 13, 2025
6185d3c
Revert context matching attempts - Playwright cannot see CDP-created …
unclecode Dec 13, 2025
55eb968
Add create_isolated_context flag for concurrent CDP crawls
unclecode Dec 13, 2025
ecedb61
Add context caching to create_isolated_context branch
unclecode Dec 13, 2025
d10ca38
Add init_scripts support to BrowserConfig for pre-page-load JS injection
unclecode Dec 14, 2025
02acad1
Fix CDP connection handling: support WS URLs and proper cleanup
unclecode Dec 18, 2025
f6b29a8
Update gitignore
unclecode Dec 21, 2025
48426f7
Some debugging for caching
unclecode Dec 21, 2025
444cb14
Add _generate_screenshot_from_html for raw: and file:// URLs
unclecode Dec 22, 2025
67e03d6
Add PDF and MHTML support for raw: and file:// URLs
unclecode Dec 22, 2025
31ebf37
Add crash recovery for deep crawl strategies
unclecode Dec 22, 2025
624e341
Fix: HTTP strategy raw: URL parsing truncates at # character
unclecode Dec 24, 2025
3937efc
Add base_url parameter to CrawlerRunConfig for raw HTML processing
unclecode Dec 24, 2025
fde4e9f
Add prefetch mode for two-phase deep crawling
unclecode Dec 25, 2025
9e7f5aa
Updates on proxy rotation and proxy configuration
unclecode Dec 26, 2025
a43256b
Add proxy support to HTTP crawler strategy
unclecode Dec 26, 2025
2550f3d
Add browser pipeline support for raw:/file:// URLs
unclecode Dec 27, 2025
3d78001
Add smart TTL cache for sitemap URL seeder
unclecode Dec 30, 2025
db61ab8
Update URL seeder docs with smart TTL cache parameters
unclecode Dec 30, 2025
0d3f9e6
Add MEMORY.md to gitignore
unclecode Dec 30, 2025
6b2dca7
Docs: Add multi-sample schema generation section
unclecode Jan 4, 2026
f24396c
Fix critical RCE and LFI vulnerabilities in Docker API deployment
unclecode Jan 12, 2026
acfab80
Enhance authentication flow by implementing JWT token retrieval and a…
ntohidi Jan 12, 2026
122b4fe
Add release notes for v0.7.9, detailing breaking changes, security fi…
ntohidi Jan 12, 2026
530cde3
Add release notes for v0.8.0, detailing breaking changes, security fi…
unclecode Jan 12, 2026
315eae9
Add examples for deep crawl crash recovery and prefetch mode in docum…
ntohidi Jan 14, 2026
f09146c
Release v0.8.0: The v0.8.0 Update
ntohidi Jan 14, 2026
177e298
Update security researcher acknowledgment with a hyperlink for Neo by…
ntohidi Jan 14, 2026
a00da65
Add async agenerate_schema method for schema generation
unclecode Jan 16, 2026
6090629
Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility
unclecode Jan 16, 2026
a5354f2
Merge branch 'develop' into release/v0.8.0
ntohidi Jan 16, 2026
2 changes: 2 additions & 0 deletions .gitignore
@@ -267,6 +267,7 @@ continue_config.json
.private/

.claude/
+.context/

CLAUDE_MONITOR.md
CLAUDE.md
@@ -295,3 +296,4 @@ scripts/
*.db
*.rdb
*.ldb
+MEMORY.md
40 changes: 40 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,46 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.8.0] - 2026-01-12

### Security
- **🔒 CRITICAL: Remote Code Execution Fix**: Removed `__import__` from hook allowed builtins
- Prevents arbitrary module imports in user-provided hook code
- Hooks now disabled by default via `CRAWL4AI_HOOKS_ENABLED` environment variable
- Credit: Neo by ProjectDiscovery
- **🔒 HIGH: Local File Inclusion Fix**: Added URL scheme validation to Docker API endpoints
- Blocks `file://`, `javascript:`, `data:` URLs on `/execute_js`, `/screenshot`, `/pdf`, `/html`
- Only allows `http://`, `https://`, and `raw:` URLs
- Credit: Neo by ProjectDiscovery

### Breaking Changes
- **Docker API: Hooks disabled by default**: Set `CRAWL4AI_HOOKS_ENABLED=true` to enable
- **Docker API: file:// URLs blocked**: Use Python library directly for local file processing

### Added
- **🚀 init_scripts for BrowserConfig**: Pre-page-load JavaScript injection for stealth evasions
- **🔄 CDP Connection Improvements**: WebSocket URL support, proper cleanup, browser reuse
- **💾 Crash Recovery for Deep Crawl**: `resume_state` and `on_state_change` for BFS/DFS/Best-First strategies
- **📄 PDF/MHTML for raw:/file:// URLs**: Generate PDFs and MHTML from cached HTML content
- **📸 Screenshots for raw:/file:// URLs**: Render cached HTML and capture screenshots
- **🔗 base_url Parameter**: Proper URL resolution for raw: HTML processing
- **⚡ Prefetch Mode**: Two-phase deep crawling with fast link extraction
- **🔀 Enhanced Proxy Support**: Improved proxy rotation and sticky sessions
- **🌐 HTTP Strategy Proxy Support**: Non-browser crawler now supports proxies
- **🖥️ Browser Pipeline for raw:/file://**: New `process_in_browser` parameter
- **📋 Smart TTL Cache for Sitemap Seeder**: `cache_ttl_hours` and `validate_sitemap_lastmod` parameters
- **📚 Security Documentation**: Added SECURITY.md with vulnerability reporting guidelines

### Fixed
- **raw: URL Parsing**: Fixed truncation at `#` character (CSS color codes like `#eee`)
- **Caching System**: Various improvements to cache validation and persistence

### Documentation
- Multi-sample schema generation section
- URL seeder smart TTL cache parameters
- v0.8.0 migration guide
- Security policy and disclosure process

## [Unreleased]

### Added
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,7 +1,7 @@
FROM python:3.12-slim-bookworm AS build

# C4ai version
-ARG C4AI_VER=0.7.8
+ARG C4AI_VER=0.8.0
ENV C4AI_VERSION=$C4AI_VER
LABEL c4ai.version=$C4AI_VER

47 changes: 43 additions & 4 deletions README.md
@@ -37,13 +37,13 @@ Limited slots._

Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.

-[✨ Check out latest update v0.7.8](#-recent-updates)
+[✨ Check out latest update v0.8.0](#-recent-updates)

-✨ **New in v0.7.8**: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues (ContentRelevanceFilter, ProxyConfig, cache permissions), LLM extraction improvements (configurable backoff, HTML input format), URL handling fixes, and dependency updates (pypdf, Pydantic v2). [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
+✨ **New in v0.8.0**: Crash Recovery & Prefetch Mode! Deep crawl crash recovery with `resume_state` and `on_state_change` callbacks for long-running crawls. New `prefetch=True` mode for 5-10x faster URL discovery. Critical security fixes for Docker API (hooks disabled by default, file:// URLs blocked). [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)

-✨ Recent v0.7.7: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, smart browser pool management, and production-ready observability. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)
+✨ Recent v0.7.8: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues, LLM extraction improvements, URL handling fixes, and dependency updates. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)

-✨ Previous v0.7.6: Complete Webhook Infrastructure for Docker Job Queue API! Real-time notifications for both `/crawl/job` and `/llm/job` endpoints with exponential backoff retry, custom headers, and flexible delivery modes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.6.md)
+✨ Previous v0.7.7: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, and smart browser pool management. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)

<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
@@ -562,6 +562,45 @@ async def test_news_crawl():

## ✨ Recent Updates

<details open>
<summary><strong>Version 0.8.0 Release Highlights - Crash Recovery & Prefetch Mode</strong></summary>

This release introduces crash recovery for deep crawls, a new prefetch mode for fast URL discovery, and critical security fixes for Docker deployments.

- **🔄 Deep Crawl Crash Recovery**:
- `on_state_change` callback fires after each URL for real-time state persistence
- `resume_state` parameter to continue from a saved checkpoint
- JSON-serializable state for Redis/database storage
- Works with BFS, DFS, and Best-First strategies
```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

strategy = BFSDeepCrawlStrategy(
max_depth=3,
resume_state=saved_state, # Continue from checkpoint
on_state_change=save_to_redis, # Called after each URL
)
```
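
A minimal persistence pair for those callbacks, assuming the state object is the JSON-serializable dict described above (file-based here for brevity; Redis or a database slots in the same way, and `STATE_FILE` is just an illustrative path — shown synchronous, so if your version expects an async callable, wrap the same logic in `async def`):
```python
import json

STATE_FILE = "crawl_state.json"  # illustrative checkpoint location

def save_state(state: dict) -> None:
    # Called after each processed URL; overwrite the latest checkpoint
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def load_state() -> dict | None:
    # Return the last checkpoint, or None on a fresh start
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```
Wiring `resume_state=load_state()` and `on_state_change=save_state` into the strategy gives a crawl that can be killed and restarted without revisiting completed URLs.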

- **⚡ Prefetch Mode for Fast URL Discovery**:
- `prefetch=True` skips markdown, extraction, and media processing
- 5-10x faster than full processing
- Perfect for two-phase crawling: discover first, process selectively
```python
config = CrawlerRunConfig(prefetch=True)
result = await crawler.arun("https://example.com", config=config)
# Returns HTML and links only - no markdown generation
```
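
Putting the two phases together might look like the sketch below (the `result.links` access follows the library's documented `internal`/`external` split; the `/blog/` filter and the cap of 10 URLs are arbitrary choices for the example):
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def two_phase_crawl():
    async with AsyncWebCrawler() as crawler:
        # Phase 1: discovery only - links come back, heavy processing is skipped
        seed = await crawler.arun("https://example.com", config=CrawlerRunConfig(prefetch=True))
        targets = [link["href"] for link in seed.links.get("internal", [])
                   if "/blog/" in link["href"]][:10]

        # Phase 2: full processing, but only for the URLs worth keeping
        for url in targets:
            result = await crawler.arun(url, config=CrawlerRunConfig())
            print(url, "ok" if result.success else "failed")

asyncio.run(two_phase_crawl())
```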

- **🔒 Security Fixes (Docker API)**:
- Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
- `file://` URLs blocked on API endpoints to prevent LFI
- `__import__` removed from hook execution sandbox

[Full v0.8.0 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)

</details>

<details>
<summary><strong>Version 0.7.8 Release Highlights - Stability & Bug Fix Release</strong></summary>

122 changes: 122 additions & 0 deletions SECURITY.md
@@ -0,0 +1,122 @@
# Security Policy

## Supported Versions

| Version | Supported |
| ------- | ------------------ |
| 0.8.x | :white_check_mark: |
| 0.7.x | :x: (upgrade recommended) |
| < 0.7 | :x: |

## Reporting a Vulnerability

We take security vulnerabilities seriously. If you discover a security issue, please report it responsibly.

### How to Report

**DO NOT** open a public GitHub issue for security vulnerabilities.

Instead, please report via one of these methods:

1. **GitHub Security Advisories (Preferred)**
- Go to [Security Advisories](https://github.com/unclecode/crawl4ai/security/advisories)
- Click "New draft security advisory"
- Fill in the details

2. **Email**
- Send details to: security@crawl4ai.com
- Use subject: `[SECURITY] Brief description`
- Include:
- Description of the vulnerability
- Steps to reproduce
- Potential impact
- Any suggested fixes

### What to Expect

- **Acknowledgment**: Within 48 hours
- **Initial Assessment**: Within 7 days
- **Resolution Timeline**: Depends on severity
- Critical: 24-72 hours
- High: 7 days
- Medium: 30 days
- Low: 90 days

### Disclosure Policy

- We follow responsible disclosure practices
- We will coordinate with you on disclosure timing
- Credit will be given to reporters (unless anonymity is requested)
- We may request CVE assignment for significant vulnerabilities

## Security Best Practices for Users

### Docker API Deployment

If you're running the Crawl4AI Docker API in production:

1. **Enable Authentication**
```yaml
# config.yml
security:
enabled: true
jwt_enabled: true
```
```bash
# Set a strong secret key
export SECRET_KEY="your-secure-random-key-here"
```
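
With JWT enabled, every client call needs a bearer token. A minimal client-side sketch (the `/token` endpoint, email payload, and default port `11235` match recent Docker API docs, but verify them against your deployed version):
```python
import requests

BASE = "http://localhost:11235"  # assumed default Docker API port

# Obtain a token (the API issues JWTs against an email identity)
token = requests.post(f"{BASE}/token", json={"email": "user@example.com"}).json()["access_token"]

# Attach it to subsequent requests
resp = requests.post(
    f"{BASE}/crawl",
    headers={"Authorization": f"Bearer {token}"},
    json={"urls": ["https://example.com"]},
)
print(resp.status_code)
```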

2. **Hooks are Disabled by Default** (v0.8.0+)
- Only enable if you trust all API users
- Set `CRAWL4AI_HOOKS_ENABLED=true` only when necessary

3. **Network Security**
- Run behind a reverse proxy (nginx, traefik)
- Use HTTPS in production
- Restrict access to trusted IPs if possible

4. **Container Security**
- Run as non-root user (default in our container)
- Use read-only filesystem where possible
- Limit container resources

### Library Usage

When using Crawl4AI as a Python library:

1. **Validate URLs** before crawling untrusted input (see the sketch after this list)
2. **Sanitize extracted content** before using in other systems
3. **Be cautious with hooks** - they execute arbitrary code
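
For point 1, a scheme allow-list mirroring what the Docker API now enforces is often enough as a first gate (the allow-list below is an assumption to tune to your own threat model):
```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}  # rejects file://, javascript:, data:, etc.

def is_safe_url(url: str) -> bool:
    parsed = urlparse(url)
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)

assert is_safe_url("https://example.com/page")
assert not is_safe_url("file:///etc/passwd")
assert not is_safe_url("javascript:alert(1)")
```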

## Known Security Issues

### Fixed in v0.8.0

| ID | Severity | Description | Fix |
|----|----------|-------------|-----|
| CVE-pending-1 | CRITICAL | RCE via hooks `__import__` | Removed from allowed builtins |
| CVE-pending-2 | HIGH | LFI via `file://` URLs | URL scheme validation added |

See [Security Advisory](https://github.com/unclecode/crawl4ai/security/advisories) for details.

## Security Features

### v0.8.0+

- **URL Scheme Validation**: Blocks `file://`, `javascript:`, `data:` URLs on API
- **Hooks Disabled by Default**: Opt-in via `CRAWL4AI_HOOKS_ENABLED=true`
- **Restricted Hook Builtins**: No `__import__`, `eval`, `exec`, `open`
- **JWT Authentication**: Optional but recommended for production
- **Rate Limiting**: Configurable request limits
- **Security Headers**: X-Frame-Options, CSP, HSTS when enabled

## Acknowledgments

We thank the following security researchers for responsibly disclosing vulnerabilities:

- **[Neo by ProjectDiscovery](https://projectdiscovery.io/blog/introducing-neo)** - RCE and LFI vulnerabilities (December 2025)

---

*Last updated: January 2026*
2 changes: 1 addition & 1 deletion crawl4ai/__version__.py
@@ -1,7 +1,7 @@
# crawl4ai/__version__.py

# This is the version that will be used for stable releases
__version__ = "0.7.8"
__version__ = "0.8.0"

# For nightly builds, this gets set during build process
__nightly_version__ = None