Web Scraper Platform

Overview

The Web Scraper Platform is a modular, scalable system for extracting structured and unstructured data from diverse online sources. It gathers data through asynchronous, multithreaded crawling pipelines, renders dynamic pages with Selenium, parses HTML with BeautifulSoup, and stores the results in PostgreSQL.
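
As a rough illustration of what an asynchronous, multithreaded crawling pipeline can look like, the sketch below fans a list of URLs out to a thread pool from an asyncio event loop. It is not the code in src/pipelines/async_pipeline.py: `fetch_page`, `crawl_all`, and the worker count are hypothetical, and the fetch step is a stub that performs no network access.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url: str) -> dict:
    """Stand-in for a blocking fetch (e.g. an HTTP request or a browser render).

    Hypothetical stub: it performs no network access and only echoes the URL.
    """
    return {"url": url, "html": f"<html><title>{url}</title></html>"}


async def crawl_all(urls: list[str], max_workers: int = 4) -> list[dict]:
    """Run the blocking fetch for every URL in a thread pool and await all results."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        tasks = [loop.run_in_executor(pool, fetch_page, url) for url in urls]
        # gather() returns results in the same order as the input URLs.
        return await asyncio.gather(*tasks)


if __name__ == "__main__":
    pages = asyncio.run(crawl_all(["https://www.bbc.com/news", "https://www.nytimes.com"]))
    for page in pages:
        print(page["url"], len(page["html"]))
```

Handing blocking work to a thread pool keeps the event loop free to schedule other tasks, which is the usual reason to pair asyncio with concurrent.futures.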

Features

Modular Architecture: A plugin architecture keeps scrapers extensible and makes new functionality easy to integrate.
Asynchronous Crawling: Built on asyncio and concurrent.futures, the system runs multiple crawling tasks concurrently.
Dynamic Content Rendering: Uses Selenium to render dynamic web content so that data generated in the browser can be captured.
HTML Parsing: Uses BeautifulSoup to parse HTML and extract structured data from web pages.
Data Storage: Stores data in a normalized PostgreSQL schema.
CrewAI Integration: Integrates CrewAI agents for task planning and adaptive crawling workflows.
LLM-Ready: Designed so that future NLP pipelines built on large language models can be plugged in.
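
One common way to build a plugin architecture like the one listed above is a small registry that scrapers add themselves to. The sketch below is an assumption about how such a registry could look, not the contents of src/plugins: `SCRAPER_PLUGINS`, `register_scraper`, `scrape_title`, and `run_plugins` are all hypothetical names, and the toy scraper uses plain string search as a stand-in for BeautifulSoup.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a plugin name to a scraper callable.
SCRAPER_PLUGINS: Dict[str, Callable[[str], dict]] = {}


def register_scraper(name: str):
    """Decorator that adds a scraper function to the registry under `name`."""
    def decorator(func):
        SCRAPER_PLUGINS[name] = func
        return func
    return decorator


@register_scraper("title")
def scrape_title(html: str) -> dict:
    """Toy scraper: pull the text between <title> tags using plain string search."""
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        return {"title": None}
    return {"title": html[start + len("<title>"):end].strip()}


def run_plugins(html: str) -> dict:
    """Run every registered scraper on the same HTML, keyed by plugin name."""
    return {name: scraper(html) for name, scraper in SCRAPER_PLUGINS.items()}
```

With this shape, adding a scraper means writing one decorated function; the code that runs the plugins does not need to change.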

Project Structure

web-scraper-platform
├── src
│   ├── main.py                # Entry point of the application
│   ├── config
│   │   └── settings.py        # Configuration settings
│   ├── crawlers
│   │   ├── __init__.py        # Crawler module initialization
│   │   └── base_crawler.py     # Base class for all crawlers
│   ├── plugins
│   │   └── __init__.py        # Plugin module initialization
│   ├── agents
│   │   └── crewai_agent.py    # CrewAI agent integration
│   ├── parsers
│   │   ├── __init__.py        # Parser module initialization
│   │   └── html_parser.py      # HTML parsing functionality
│   ├── storage
│   │   ├── __init__.py        # Storage module initialization
│   │   └── postgres.py         # PostgreSQL storage handling
│   ├── pipelines
│   │   ├── __init__.py        # Pipeline module initialization
│   │   └── async_pipeline.py    # Asynchronous crawling pipeline
│   ├── utils
│   │   └── helpers.py         # Utility functions
│   └── llm
│       └── nlp_pipeline.py     # LLM integration for NLP tasks
├── requirements.txt            # Project dependencies
└── README.md                   # Project documentation

Setup Instructions

Clone the repository

```
git clone <repository-url>
cd web-scraper-platform
```

Install dependencies
```
pip install -r requirements.txt
```

Configure environment variables

Create a .env file in the project root with the following content:

```
DATABASE_URL=postgresql+psycopg2://<user>:<password>@<host>:<port>/<database>
GEMINI_API_KEY=your_gemini_api_key
CREWAI_API_KEY=your_crewai_api_key
LLM_API_KEY=your_llm_api_key
```

Replace the placeholders with your actual credentials and API keys.
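
To check a filled-in .env file before running the crawler, a small script along the following lines can help. This is a minimal sketch using only the standard library, not the project's own loader; `parse_env` and `describe_database_url` are hypothetical helpers, and the credentials in the usage note are made-up placeholder values. The DATABASE_URL above follows the dialect+driver URL form that SQLAlchemy accepts, which `urllib.parse.urlsplit` can take apart.

```python
from urllib.parse import urlsplit


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def describe_database_url(url: str) -> dict:
    """Split a database URL into its parts, leaving the password out of the result."""
    parts = urlsplit(url)
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }
```

For example, `describe_database_url("postgresql+psycopg2://scraper:secret@localhost:5432/scraperdb")` reports the driver, user, host, port, and database name, which makes a mistyped host or port easy to spot.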

Configure the application

Edit src/config/settings.py and/or src/config/test_config.ini to set any other options you need.

Run the project
```
python src/main.py
```

Changing Base URLs

To change the starting URLs for crawling:

Open src/main.py.

Locate the urls list near the top of the file:

```
urls = [
    "https://www.nytimes.com",
    "https://www.amazon.com",
    "https://www.bbc.com/news",
]
```

Edit this list to include your desired base URLs.
Save the file and re-run the project.

Alternatively, you can modify your pipeline to read URLs from a config file or user input for more flexibility.
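
One possible shape for that alternative is sketched below: start URLs come from explicit input first, then from a JSON config file, then from the built-in defaults. The helper names (`urls_from_json`, `load_urls`), the `"urls"` key, and the idea of a JSON file are assumptions made for illustration; the project does not define them.

```python
import json
from pathlib import Path

# Defaults mirror the list currently hard-coded in src/main.py.
DEFAULT_URLS = [
    "https://www.nytimes.com",
    "https://www.amazon.com",
    "https://www.bbc.com/news",
]


def urls_from_json(text: str) -> list:
    """Return the "urls" array from a JSON document, or an empty list if absent."""
    data = json.loads(text)
    return list(data.get("urls", []))


def load_urls(config_path=None, cli_urls=None) -> list:
    """Pick start URLs: explicit input first, then the config file, then defaults."""
    if cli_urls:
        return list(cli_urls)
    if config_path and Path(config_path).is_file():
        urls = urls_from_json(Path(config_path).read_text(encoding="utf-8"))
        if urls:
            return urls
    return list(DEFAULT_URLS)
```

In src/main.py, the hard-coded list could then become something like `urls = load_urls(config_path="urls.json")`, where urls.json is a hypothetical file containing `{"urls": ["https://example.org"]}`.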

Future Enhancements

Integration with additional data sources and formats.
Enhanced error handling and logging mechanisms.
Development of more sophisticated NLP pipelines using LLMs.

License

This project is licensed under the MIT License. See the LICENSE file for details.
