WebScraper Pro 🕸️


A configurable Python web scraping tool, built on Requests, BeautifulSoup, and lxml, that extracts structured data from multiple web pages and exports the results to CSV.
Built for automation, data collection, and Upwork-style client projects.


✨ Features

  • Scrapes multiple pages using a URL pattern with {page}
  • Fully configurable via JSON (no code changes needed)
  • Extracts data using CSS selectors (quotes, authors, tags, or any other fields)
  • Saves clean structured data to CSV
  • Logs scraping progress to logs/scraper.log
  • Easy CLI interface for clients and non-technical users
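The {page} pattern from the first feature maps directly onto Python's str.format. A minimal sketch of how the page range can be expanded into concrete URLs (build_page_urls is an illustrative helper, not part of the tool's actual API):

```python
# Illustrative sketch only: expand a base_url pattern into one URL per page.
def build_page_urls(base_url: str, start_page: int, end_page: int) -> list[str]:
    """Fill the {page} placeholder for every page in the inclusive range."""
    return [base_url.format(page=n) for n in range(start_page, end_page + 1)]

urls = build_page_urls("https://quotes.toscrape.com/page/{page}/", 1, 3)
# urls[0] == "https://quotes.toscrape.com/page/1/"
```

Each resulting URL can then be fetched with Requests in turn.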

🧱 Project Structure

webscraper_pro/
├─ README.md
├─ LICENSE
├─ requirements.txt
├─ .gitignore
├─ data/
│  ├─ sample_urls.txt
│  └─ output/
├─ logs/
├─ webscraper/
│  ├─ __init__.py
│  ├─ config_example.json
│  ├─ cli.py
│  ├─ scraper.py
│  ├─ parser.py
│  └─ storage.py

βš™οΈ Configuration

Example config file: webscraper/config_example.json

{
    "base_url": "https://quotes.toscrape.com/page/{page}/",
    "start_page": 1,
    "end_page": 3,
    "selectors": {
        "quote": ".quote .text",
        "author": ".quote .author",
        "tags": ".quote .tags .tag"
    }
}

Fields explained:

  • base_url — must contain {page} so the scraper can iterate over pages
  • start_page / end_page — the inclusive range of pages to scrape
  • selectors — a CSS selector for each field to extract

You can modify this JSON to scrape any website, not just quotes.
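As a sketch of how such a selectors mapping can drive extraction with BeautifulSoup (the code below is illustrative and uses an inline HTML snippet; the tool's actual parser.py may be structured differently):

```python
from bs4 import BeautifulSoup

# Illustrative selectors mapping, shaped like the "selectors" key in the config.
selectors = {
    "quote": ".quote .text",
    "author": ".quote .author",
}

# A tiny stand-in for one fetched page.
html = (
    '<div class="quote">'
    '<span class="text">Hello</span>'
    '<small class="author">Bob</small>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# soup.select() accepts any CSS selector and returns all matching elements,
# so each configured field becomes a list of extracted strings.
record = {
    field: [el.get_text(strip=True) for el in soup.select(css)]
    for field, css in selectors.items()
}
# record == {"quote": ["Hello"], "author": ["Bob"]}
```

Because the field names and selectors live entirely in the config, adding a new field requires no code changes.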


▶️ How to Run

Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Run the scraper:

python -m webscraper.cli --config webscraper/config_example.json --output data/output/quotes.csv

Result:

  • Fetches pages 1–3
  • Extracts quotes, authors, and tags
  • Saves them to data/output/quotes.csv
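The final export step can be sketched with the standard library's csv.DictWriter; this is illustrative only (the tool's actual storage.py may differ), and the sample rows below are made up:

```python
import csv
import io

# Hypothetical scraped records, one dict per quote.
rows = [
    {"quote": "Hi", "author": "Bob", "tags": "life,short"},
    {"quote": "Bye", "author": "Ann", "tags": "farewell"},
]

# Write to an in-memory buffer here; the real tool writes to a file path.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["quote", "author", "tags"])
writer.writeheader()      # first line: column names
writer.writerows(rows)    # one line per scraped record

csv_text = buf.getvalue()
```

DictWriter quotes any value containing a comma, so tag lists like "life,short" stay in a single CSV column.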

📜 License

This project is licensed under the MIT License.
You are free to use, modify, distribute, and incorporate the code into your own projects.

See the full license in the included LICENSE file.


πŸ“ Notes

  • This project is for demonstration and educational purposes.
  • Always respect website terms of service and robots.txt when scraping real websites.
  • The scraper is modular and easy to extend for more complex automation.
