3 changes: 3 additions & 0 deletions .gitignore
@@ -159,3 +159,6 @@ cython_debug/
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
.DS_Store

# Scraper output files
shops_*.csv
97 changes: 97 additions & 0 deletions SCRAPER_README.md
@@ -0,0 +1,97 @@
# Trusted Shops Web Scraper

A comprehensive web scraping tool that extracts company information from the Trusted Shops website (https://www.trustedshops.de).

## Features

- **Pagination Handling**: Automatically processes multiple pages by incrementing the page parameter
- **Comprehensive Data Extraction**: Collects the following information for each company:
- Company Name
- Logo URL
- Profile URL
- Company Website URL
- Phone Number
- Physical Address
- Business Categories/Tags
- Email Address
- Company Description

- **CSV Output**: Saves data to a timestamped CSV file (e.g., `shops_2025-09-23_23-42-14.csv`)
- **Incremental Saving**: Data is saved after each profile is processed to prevent data loss
- **Error Handling**: Includes retry logic and graceful error handling
- **Rate Limiting**: Built-in delays between requests to respect server resources
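
The incremental-saving step can be sketched with the standard-library `csv` module (a minimal illustration, not the script's actual code; the column names follow the output format documented below):

```python
import csv
import os

# Columns matching the documented output format
FIELDNAMES = ["Company Name", "Logo", "Profile URL", "Company URL",
              "Phone", "Address", "Tags", "Email", "Description"]

def append_row(path, row):
    """Append one company record to the CSV, writing the header first
    if the file does not exist yet. Missing fields become empty cells."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```

Writing after each profile, rather than once at the end, means a crash or interruption loses at most the record currently being processed.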

## Files

- `scraper.py` - Main scraping script for production use
- `scraper_demo.py` - Demo version with mock data for testing
- `requirements.txt` - Updated with web scraping dependencies

## Installation

1. Install the required dependencies:
```bash
pip install -r requirements.txt
```

## Usage

### Production Scraper

Run the main scraper (requires internet access):
```bash
python scraper.py
```

### Demo Version

Test the functionality with mock data:
```bash
python scraper_demo.py
```

## Output Format

The scraper creates a CSV file with the following columns:

| Column | Description |
|--------|-------------|
| Company Name | Name of the business |
| Logo | URL to company logo image |
| Profile URL | Link to the Trusted Shops profile page |
| Company URL | Company's official website |
| Phone | Contact phone number |
| Address | Physical business address |
| Tags | Business categories/tags |
| Email | Contact email address |
| Description | Company description/overview |

## Configuration

The scraper can be configured by modifying the `TrustedShopsScraper` class:

- `base_url`: Target URL for scraping (default: computer/electronics category)
- Request delays: Modify `time.sleep()` values to adjust scraping speed
- Retry logic: Adjust `max_retries` parameter in `get_page()` method
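
The retry behaviour that `max_retries` controls can be sketched as follows (a standard-library illustration; the real script uses `requests`, and the parameter names and default values here are assumptions):

```python
import time
import urllib.request
import urllib.error

def get_page(url, max_retries=3, delay=2.0):
    """Fetch a URL, retrying on failure up to max_retries times.
    Illustrative sketch only; assumes max_retries >= 1."""
    last_error = None
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc
            time.sleep(delay)  # pause before the next attempt
    raise last_error
```

A fixed delay between attempts keeps the sketch simple; exponential backoff is a common refinement when a server is rate-limiting aggressively.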

## Technical Details

- **Language**: Python 3.x
- **Libraries**: BeautifulSoup4, requests, pandas, and the standard-library `re` module
- **Approach**: Sequential page processing with profile detail extraction
- **Error Recovery**: Retry mechanism for failed requests
- **Data Persistence**: Incremental CSV writing

## Notes

- The scraper pauses between requests so as not to overload the target server
- All extracted data is cleaned and formatted for consistency
- The script handles various HTML structures and missing data gracefully
- BeautifulSoup warnings have been addressed using current best practices
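
The kind of cleaning applied to extracted fields can be illustrated with a small helper (an assumed approach, using the `re` module listed under Technical Details; this is not the script's actual function):

```python
import re

def clean_text(value):
    """Collapse runs of whitespace (including newlines) into single
    spaces and strip the ends; None becomes an empty string."""
    if value is None:
        return ""
    return re.sub(r"\s+", " ", value).strip()
```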

## Example Output

```csv
Company Name,Logo,Profile URL,Company URL,Phone,Address,Tags,Email,Description
EnjoyYourCamera.com,https://channel-settings.etrusted.com/logo-932f448d...,https://www.trustedshops.de/bewertung/info_X233BF...,https://www.enjoyyourcamera.com,+49 511 20029090,"ENJOYYOURBRANDS GmbH, Eleonorenstr. 20, Deutschland","Bücher, Computer, Unterhaltungselektronik & Zubehör",shop@enjoyyourcamera.com,"Enjoyyourcamera.com ist Ihr Versandhaus für Spezial-Fotozubehör..."
```
3 changes: 3 additions & 0 deletions requirements.txt
@@ -1,6 +1,7 @@
aiofiles==23.1.0
annotated-types==0.5.0
anyio==3.7.1
beautifulsoup4>=4.12.0
Brotli==1.0.9
certifi==2023.7.22
click==8.1.6
@@ -14,8 +15,10 @@ httpx==0.24.1
hyperframe==6.0.1
idna==3.4
lxml==4.9.3
pandas>=2.0.0
pydantic==2.1.1
pydantic_core==2.4.0
requests>=2.28.0
sniffio==1.3.0
socksio==1.0.0
starlette==0.27.0