The Web Scraper Platform is a modular and scalable system designed to extract structured and unstructured data from diverse online sources. It uses asynchronous, multithreaded crawling pipelines built on asyncio and concurrent.futures, with Selenium for rendering, BeautifulSoup for parsing, and PostgreSQL for storage.
- Modular Architecture: The platform is designed with a plugin architecture, allowing for extensible scrapers and easy integration of new functionalities.
- Asynchronous Crawling: Built using asyncio and concurrent.futures, the system can handle multiple crawling tasks simultaneously, improving efficiency.
- Dynamic Content Rendering: Utilizes Selenium for rendering dynamic web content, ensuring that all relevant data is captured.
- HTML Parsing: Employs BeautifulSoup for parsing HTML and extracting structured data from web pages.
- Data Storage: Data is stored in a normalized PostgreSQL schema, ensuring efficient data management and retrieval.
- CrewAI Integration: Integrates CrewAI agents for intelligent task planning and adaptive crawling workflows.
- LLM-Ready: The platform is engineered to be ready for future NLP pipelines, enabling integration with large language models.
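The asynchronous crawling feature listed above can be sketched as follows. This is a minimal illustration, not the platform's actual pipeline: it assumes blocking work (such as a Selenium render) is offloaded to a thread pool from an asyncio event loop. The names `fetch_blocking` and `crawl_all` are hypothetical, and the fetch is a stand-in that returns fake HTML so the sketch runs offline.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def fetch_blocking(url: str) -> str:
    """Hypothetical stand-in for a blocking fetch (e.g. a Selenium render)."""
    return f"<html><title>{url}</title></html>"


async def crawl_all(urls: list[str], max_workers: int = 4) -> dict[str, str]:
    """Run the blocking fetch for every URL concurrently in a thread pool."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each call is scheduled on the pool and wrapped in an awaitable future.
        tasks = [loop.run_in_executor(pool, fetch_blocking, url) for url in urls]
        pages = await asyncio.gather(*tasks)  # results keep the input order
    return dict(zip(urls, pages))


if __name__ == "__main__":
    results = asyncio.run(
        crawl_all(["https://www.bbc.com/news", "https://www.nytimes.com"])
    )
    for url, html in results.items():
        print(url, len(html))
```

Keeping the blocking call in a plain function means a real fetcher could replace the stand-in without changing the scheduling code.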
web-scraper-platform
├── src
│ ├── main.py # Entry point of the application
│ ├── config
│ │ └── settings.py # Configuration settings
│ ├── crawlers
│ │ ├── __init__.py # Crawler module initialization
│ │ └── base_crawler.py # Base class for all crawlers
│ ├── plugins
│ │ └── __init__.py # Plugin module initialization
│ ├── agents
│ │ └── crewai_agent.py # CrewAI agent integration
│ ├── parsers
│ │ ├── __init__.py # Parser module initialization
│ │ └── html_parser.py # HTML parsing functionality
│ ├── storage
│ │ ├── __init__.py # Storage module initialization
│ │ └── postgres.py # PostgreSQL storage handling
│ ├── pipelines
│ │ ├── __init__.py # Pipeline module initialization
│ │ └── async_pipeline.py # Asynchronous crawling pipeline
│ ├── utils
│ │ └── helpers.py # Utility functions
│ └── llm
│ └── nlp_pipeline.py # LLM integration for NLP tasks
├── requirements.txt # Project dependencies
├── README.md # Project documentation
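The crawlers and plugins packages in the tree above suggest a base class that concrete scrapers extend. One possible shape is sketched below. The contents of base_crawler.py are not shown in this README, so `BaseCrawler`, `CRAWLER_REGISTRY`, and `EchoCrawler` are hypothetical names chosen for illustration, and the self-registration approach is an assumption rather than the project's confirmed design.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a plugin name to its crawler class.
CRAWLER_REGISTRY: dict[str, type] = {}


class BaseCrawler(ABC):
    """Assumed base class: subclasses with a non-empty name self-register."""

    name: str = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.name:
            CRAWLER_REGISTRY[cls.name] = cls

    @abstractmethod
    def crawl(self, url: str) -> dict:
        """Return extracted data for one URL."""


class EchoCrawler(BaseCrawler):
    """Toy plugin that returns the URL it was given."""

    name = "echo"

    def crawl(self, url: str) -> dict:
        return {"url": url, "source": self.name}


if __name__ == "__main__":
    crawler = CRAWLER_REGISTRY["echo"]()
    print(crawler.crawl("https://www.bbc.com/news"))
```

With this pattern, adding a scraper only requires defining a new subclass in a module that gets imported. No central list has to be edited.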
- Clone the repository
    git clone <repository-url>
    cd web-scraper-platform
- Install dependencies
    pip install -r requirements.txt
- Configure environment variables
  Create a .env file in the project root with the following content:
    DATABASE_URL=postgresql+psycopg2://<user>:<password>@<host>:<port>/<database>
    GEMINI_API_KEY=your_gemini_api_key
    CREWAI_API_KEY=your_crewai_api_key
    LLM_API_KEY=your_llm_api_key
  Replace the placeholders with your actual credentials and API keys.
- Configure the application
  - Edit src/config/settings.py and/or src/config/test_config.ini to set other options if needed.
- Run the project
    python src/main.py
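The environment variables from the setup steps have to reach the application somehow. This README does not say how settings.py loads them, so the following is only a sketch using the standard library. The helpers `load_env_file` and `get_setting` are hypothetical, and the project itself may rely on a dedicated package instead.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first "=" only
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_setting(name: str, file_values: dict[str, str]) -> str:
    """Prefer the process environment, then the .env values, else fail loudly."""
    value = os.environ.get(name) or file_values.get(name)
    if not value:
        raise RuntimeError(f"Missing required setting: {name}")
    return value


if __name__ == "__main__":
    env = load_env_file()
    print(sorted(env))  # print key names only, never the secret values
```

Failing early on a missing DATABASE_URL or API key tends to give a clearer error than a connection failure later in the crawl.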
To change the starting URLs for crawling:
- Open src/main.py.
- Locate the urls list near the top of the file:
    urls = [
        "https://www.nytimes.com",
        "https://www.amazon.com",
        "https://www.bbc.com/news",
    ]
- Edit this list to include your desired base URLs.
- Save the file and re-run the project.
Alternatively, you can modify your pipeline to read URLs from a config file or user input for more flexibility.
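One way to read the starting URLs from a config file is sketched below. The file name urls.json, its `{"urls": [...]}` layout, and the `load_urls` helper are assumptions made for illustration and are not part of the existing project. The fallback list mirrors the URLs currently hard-coded in src/main.py.

```python
import json
from pathlib import Path

# Fallback matching the list currently defined in src/main.py.
DEFAULT_URLS = [
    "https://www.nytimes.com",
    "https://www.amazon.com",
    "https://www.bbc.com/news",
]


def load_urls(config_path: str = "urls.json") -> list[str]:
    """Return the URLs from a JSON file, or the defaults if the file is absent."""
    path = Path(config_path)
    if not path.exists():
        return list(DEFAULT_URLS)
    data = json.loads(path.read_text(encoding="utf-8"))
    # Keep only non-empty strings so a stray null or blank entry is ignored.
    return [u.strip() for u in data.get("urls", []) if isinstance(u, str) and u.strip()]


if __name__ == "__main__":
    for url in load_urls():
        print(url)
```

In src/main.py, the hard-coded list could then become `urls = load_urls()`, so changing crawl targets no longer requires editing code.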
- Integration with additional data sources and formats.
- Enhanced error handling and logging mechanisms.
- Development of more sophisticated NLP pipelines using LLMs.
This project is licensed under the MIT License. See the LICENSE file for details.