anandyadav3559/dynamicwebcrawler

Web Scraper Platform

Overview

The Web Scraper Platform is a modular, scalable system for extracting structured and unstructured data from diverse online sources. It gathers data through asynchronous, multithreaded crawling pipelines built on asyncio and concurrent.futures, renders dynamic pages with Selenium, parses HTML with BeautifulSoup, and stores results in PostgreSQL.

Features

  • Modular Architecture: The platform is designed with a plugin architecture, allowing for extensible scrapers and easy integration of new functionalities.
  • Asynchronous Crawling: Built using asyncio and concurrent.futures, the system can handle multiple crawling tasks simultaneously, improving efficiency.
  • Dynamic Content Rendering: Utilizes Selenium for rendering dynamic web content, ensuring that all relevant data is captured.
  • HTML Parsing: Employs BeautifulSoup for parsing HTML and extracting structured data from web pages.
  • Data Storage: Data is stored in a normalized PostgreSQL schema, ensuring efficient data management and retrieval.
  • CrewAI Integration: Integrates CrewAI agents for intelligent task planning and adaptive crawling workflows.
  • LLM-Ready: The platform is engineered to be ready for future NLP pipelines, enabling integration with large language models.
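The asynchronous crawling feature combines asyncio with concurrent.futures. A minimal sketch of that pattern follows, assuming the blocking work (for example a Selenium page load) is pushed onto a thread pool. The names fetch_blocking and crawl_all are hypothetical and are not taken from the project's code.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def fetch_blocking(url: str) -> str:
    """Hypothetical stand-in for a blocking fetch such as a Selenium page load."""
    return f"<html><title>{url}</title></html>"


async def crawl_all(urls: list[str], max_workers: int = 4) -> dict[str, str]:
    """Run blocking fetches in a thread pool and await them concurrently."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each call is scheduled on a worker thread and wrapped in an awaitable.
        tasks = [loop.run_in_executor(pool, fetch_blocking, url) for url in urls]
        pages = await asyncio.gather(*tasks)  # results keep the input order
    return dict(zip(urls, pages))


if __name__ == "__main__":
    results = asyncio.run(crawl_all(["https://example.com/a", "https://example.com/b"]))
    for url, html in results.items():
        print(url, len(html))
```

The thread pool keeps the event loop responsive while blocking calls run. A real pipeline would likely replace fetch_blocking with the project's own crawler call.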

Project Structure

web-scraper-platform
├── src
│   ├── main.py                # Entry point of the application
│   ├── config
│   │   └── settings.py        # Configuration settings
│   ├── crawlers
│   │   ├── __init__.py        # Crawler module initialization
│   │   └── base_crawler.py     # Base class for all crawlers
│   ├── plugins
│   │   └── __init__.py        # Plugin module initialization
│   ├── agents
│   │   └── crewai_agent.py    # CrewAI agent integration
│   ├── parsers
│   │   ├── __init__.py        # Parser module initialization
│   │   └── html_parser.py      # HTML parsing functionality
│   ├── storage
│   │   ├── __init__.py        # Storage module initialization
│   │   └── postgres.py         # PostgreSQL storage handling
│   ├── pipelines
│   │   ├── __init__.py        # Pipeline module initialization
│   │   └── async_pipeline.py    # Asynchronous crawling pipeline
│   ├── utils
│   │   └── helpers.py         # Utility functions
│   └── llm
│       └── nlp_pipeline.py     # LLM integration for NLP tasks
├── requirements.txt            # Project dependencies
├── README.md                   # Project documentation
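The layout above pairs a base crawler class with a plugins package. One way these could fit together is a small registry that maps plugin names to crawler classes. This is only a sketch under that assumption: BaseCrawler.crawl, CRAWLER_REGISTRY, register, get_crawler, and NewsCrawler are hypothetical names, and the actual base_crawler.py may look different.

```python
from abc import ABC, abstractmethod


class BaseCrawler(ABC):
    """Hypothetical base class; the real base_crawler.py may differ."""

    @abstractmethod
    def crawl(self, url: str) -> dict:
        """Fetch one URL and return the extracted record."""


# Hypothetical registry mapping a plugin name to its crawler class.
CRAWLER_REGISTRY: dict[str, type] = {}


def register(name: str):
    """Class decorator that adds a crawler to the registry under `name`."""
    def decorator(cls):
        CRAWLER_REGISTRY[name] = cls
        return cls
    return decorator


@register("news")
class NewsCrawler(BaseCrawler):
    """Example plugin; returns a placeholder record instead of scraping."""

    def crawl(self, url: str) -> dict:
        return {"url": url, "kind": "news"}


def get_crawler(name: str) -> BaseCrawler:
    """Instantiate the crawler registered under `name`."""
    return CRAWLER_REGISTRY[name]()
```

With this shape, adding a scraper means writing one decorated class in the plugins package, with no changes to the code that looks crawlers up.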

Setup Instructions

  1. Clone the repository

    git clone <repository-url>
    cd web-scraper-platform
  2. Install dependencies

    pip install -r requirements.txt
  3. Configure environment variables. Create a .env file in the project root with the following content:

    DATABASE_URL=postgresql+psycopg2://<user>:<password>@<host>:<port>/<database>
    GEMINI_API_KEY=your_gemini_api_key
    CREWAI_API_KEY=your_crewai_api_key
    LLM_API_KEY=your_llm_api_key

    Replace the placeholders with your actual credentials and API keys.

  4. Configure the application

    • Edit src/config/settings.py and/or src/config/test_config.ini to set other options if needed.
  5. Run the project

    python src/main.py
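The variables from step 3 can be checked at startup before any crawling begins. The sketch below assumes they have already been loaded into the process environment (by the shell or a dotenv-style loader, which this README does not specify). load_settings and REQUIRED_KEYS are hypothetical names, not part of src/config/settings.py.

```python
import os
from urllib.parse import urlsplit

# Variable names taken from the .env example in step 3.
REQUIRED_KEYS = ("DATABASE_URL", "GEMINI_API_KEY", "CREWAI_API_KEY", "LLM_API_KEY")


def load_settings(env=None) -> dict:
    """Validate required variables and split DATABASE_URL into its parts."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    parts = urlsplit(env["DATABASE_URL"])
    return {
        "db_host": parts.hostname,
        "db_port": parts.port,
        "db_name": parts.path.lstrip("/"),
        "db_user": parts.username,
    }


if __name__ == "__main__":
    print(load_settings())
```

Failing fast on a missing key gives a clearer error than a connection failure later in the pipeline.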

Changing Base URLs

To change the starting URLs for crawling:

  1. Open src/main.py.
  2. Locate the urls list near the top of the file:
    urls = [
       "https://www.nytimes.com",
       "https://www.amazon.com",
       "https://www.bbc.com/news",
    ]
  3. Edit this list to include your desired base URLs.
  4. Save the file and re-run the project.

Alternatively, you can modify your pipeline to read URLs from a config file or user input for more flexibility.
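A minimal sketch of that alternative, assuming URLs come first from command-line arguments, then from a plain text file with one URL per line, and finally from the built-in list. The function load_urls and the file name urls.txt are hypothetical and do not exist in the project.

```python
import sys
from pathlib import Path

# Defaults copied from the urls list shown above.
DEFAULT_URLS = [
    "https://www.nytimes.com",
    "https://www.amazon.com",
    "https://www.bbc.com/news",
]


def load_urls(argv: list[str], url_file: str = "urls.txt") -> list[str]:
    """Pick URLs from CLI arguments, then a text file, then the defaults."""
    if argv:
        return list(argv)
    path = Path(url_file)
    if path.is_file():
        lines = path.read_text(encoding="utf-8").splitlines()
        # Skip blank lines and lines starting with '#'.
        return [ln.strip() for ln in lines if ln.strip() and not ln.lstrip().startswith("#")]
    return list(DEFAULT_URLS)


if __name__ == "__main__":
    urls = load_urls(sys.argv[1:])
    print(urls)
```

With this in place, a run such as `python src/main.py https://example.com` would override the list without editing the source.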

Future Enhancements

  • Integration with additional data sources and formats.
  • Enhanced error handling and logging mechanisms.
  • Development of more sophisticated NLP pipelines using LLMs.

License

This project is licensed under the MIT License. See the LICENSE file for details.
