A comprehensive asynchronous web scraping tool built with Crawl4AI that crawls documentation sites and extracts clean content in markdown format.
- Asynchronous crawling for high performance
- Smart content filtering with domain and content type filters
- Markdown generation with content pruning
- Configurable crawling strategies with depth and page limits
- Multiple site configurations for popular documentation sites
- Content pattern filtering to skip unwanted pages
- JSON output for easy data processing (e.g. RAG pipelines)
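To give a feel for the building blocks involved, here is a minimal sketch of an async crawl with pruning-based markdown generation, based on Crawl4AI's documented `AsyncWebCrawler` API. The target URL and threshold are placeholders, and attribute names such as `fit_markdown` can vary between Crawl4AI versions:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Prune low-signal blocks (nav bars, boilerplate) before generating markdown
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.45)
    )
    run_config = CrawlerRunConfig(markdown_generator=md_generator)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/docs", config=run_config)
        # With a content filter attached, the pruned output lands in fit_markdown
        print(result.markdown.fit_markdown[:500])

if __name__ == "__main__":
    asyncio.run(main())
```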
- Python 3.8+
- Clone the repository

```bash
git clone <repository-url>
cd webscraper
```
- Install Crawl4AI

```bash
# Install the package
pip install -U crawl4ai

# For pre-release versions
pip install crawl4ai --pre

# Run post-installation setup
crawl4ai-setup

# Verify your installation
crawl4ai-doctor
```
If you're getting import errors, make sure the package is installed in the correct Python environment.
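A quick way to check which environment is active, assuming `crawl4ai` exposes a `__version__` attribute (most releases do):

```python
# Sanity check: confirm which interpreter is running and that crawl4ai imports
import sys

import crawl4ai

print(sys.executable)        # path of the Python interpreter in use
print(crawl4ai.__version__)  # installed version (assumes a __version__ attribute)
```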
If you encounter any browser-related issues, you can install the browser manually:

```bash
python -m playwright install --with-deps chromium
```
Each configuration supports the following parameters:
- `start_url` (str): Starting URL for the crawl
- `allowed_domains` (List[str]): Domains that are allowed to be crawled
- `blocked_domains` (List[str], optional): Domains to exclude from crawling
- `max_depth` (int): Maximum depth to crawl from the starting URL
- `max_pages` (int): Maximum number of pages to crawl
- `output_file` (str): Output JSON file name
- `css_selector` (str, optional): CSS selector to target specific content areas
- `excluded_tags` (List[str], optional): HTML tags to exclude from content
- `skip_patterns` (List[str], optional): Text patterns that trigger page skipping
- `prune_threshold` (float, optional): Content pruning threshold (0.0-1.0)
- `seo_keywords` (List[str], optional): Specific keywords in page metadata
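As a hypothetical illustration (the keys follow the parameter list above; the dict-based structure and all values are invented, so check `scraper.py` for the real shape), a site config might look like this:

```python
# Hypothetical site config -- keys mirror the parameter list above,
# values are placeholders only.
cuda_config = {
    "start_url": "https://docs.nvidia.com/cuda/",
    "allowed_domains": ["docs.nvidia.com"],
    "blocked_domains": ["forums.developer.nvidia.com"],
    "max_depth": 3,
    "max_pages": 200,
    "output_file": "cuda_docs.json",
    "css_selector": "main",
    "excluded_tags": ["nav", "footer", "aside"],
    "skip_patterns": ["release notes", "archive"],
    "prune_threshold": 0.45,
    "seo_keywords": ["cuda", "gpu", "kernel"],
}
```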
To run the script from the terminal:

```bash
python scraper.py
```

By default, the script crawls the config selected at the bottom of the file (e.g. `cuda_config`). You can modify `selected_config` in `main()` to crawl a different site.
Example:

```python
selected_config = tensorflow_config
```

- If you see a lot of skipped pages or empty output, adjust the `css_selector`.
- Increase `max_pages` or `max_depth` if you're not getting enough content.
- Use `prune_threshold=0.0` to disable content pruning temporarily.
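Once a crawl completes, the JSON file named by `output_file` can be consumed directly, for example as the ingestion step of a RAG pipeline. A minimal sketch, assuming each record carries a URL and the extracted markdown (the actual field names are defined by `scraper.py`, so inspect your own output file first):

```python
import json

# Hypothetical schema: a list of page records with "url" and "markdown" keys.
with open("cuda_docs.json", encoding="utf-8") as f:
    pages = json.load(f)

for page in pages:
    print(f'{page["url"]}: {len(page["markdown"])} chars of markdown')
```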