3 changes: 3 additions & 0 deletions .gitignore
@@ -159,3 +159,6 @@ cython_debug/
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
.DS_Store

# Scraper output files
shops_*.csv
97 changes: 97 additions & 0 deletions SCRAPER_README.md
@@ -0,0 +1,97 @@
# Trusted Shops Web Scraper

A comprehensive web scraping tool that extracts company information from the Trusted Shops website (https://www.trustedshops.de).

## Features

- **Pagination Handling**: Automatically processes multiple pages by incrementing the page parameter
- **Comprehensive Data Extraction**: Collects the following information for each company:
- Company Name
- Logo URL
- Profile URL
- Company Website URL
- Phone Number
- Physical Address
- Business Categories/Tags
- Email Address
- Company Description

- **CSV Output**: Saves data to a timestamped CSV file (e.g., `shops_2025-09-23_23-42-14.csv`)
- **Incremental Saving**: Data is saved after each profile is processed to prevent data loss
- **Error Handling**: Includes retry logic and graceful error handling
- **Rate Limiting**: Built-in delays between requests to respect server resources
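
The incremental-saving step can be sketched with the standard-library `csv` module (a minimal illustration, not the script's actual code; the column names follow the output format documented below):

```python
import csv
import os

# Columns matching the documented output format
FIELDNAMES = ["Company Name", "Logo", "Profile URL", "Company URL",
              "Phone", "Address", "Tags", "Email", "Description"]

def append_row(path, row):
    """Append one company record to the CSV, writing the header first
    if the file does not exist yet. Missing fields become empty cells."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```

Writing after each profile, rather than once at the end, means a crash or interruption loses at most the record currently being processed.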

## Files

- `scraper.py` - Main scraping script for production use
- `scraper_demo.py` - Demo version with mock data for testing
- `requirements.txt` - Updated with web scraping dependencies

## Installation

1. Install the required dependencies:
```bash
pip install -r requirements.txt
```

## Usage

### Production Scraper

Run the main scraper (requires internet access):
```bash
python scraper.py
```

### Demo Version

Test the functionality with mock data:
```bash
python scraper_demo.py
```

## Output Format

The scraper creates a CSV file with the following columns:

| Column | Description |
|--------|-------------|
| Company Name | Name of the business |
| Logo | URL to company logo image |
| Profile URL | Link to the Trusted Shops profile page |
| Company URL | Company's official website |
| Phone | Contact phone number |
| Address | Physical business address |
| Tags | Business categories/tags |
| Email | Contact email address |
| Description | Company description/overview |

## Configuration

The scraper can be configured by modifying the `TrustedShopsScraper` class:

- `base_url`: Target URL for scraping (default: computer/electronics category)
- Request delays: Modify `time.sleep()` values to adjust scraping speed
- Retry logic: Adjust `max_retries` parameter in `get_page()` method
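
The retry behaviour that `max_retries` controls can be sketched as follows (a standard-library illustration; the real script uses `requests`, and the parameter names and default values here are assumptions):

```python
import time
import urllib.request
import urllib.error

def get_page(url, max_retries=3, delay=2.0):
    """Fetch a URL, retrying on failure up to max_retries times.
    Illustrative sketch only; assumes max_retries >= 1."""
    last_error = None
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc
            time.sleep(delay)  # pause before the next attempt
    raise last_error
```

A fixed delay between attempts keeps the sketch simple; exponential backoff is a common refinement when a server is rate-limiting aggressively.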

## Technical Details

- **Language**: Python 3.x
- **Libraries**: BeautifulSoup4, requests, pandas, and the standard-library `re` module
- **Approach**: Sequential page processing with profile detail extraction
- **Error Recovery**: Retry mechanism for failed requests
- **Data Persistence**: Incremental CSV writing

## Notes

- The scraper pauses between requests so as not to overload the target server
- All extracted data is cleaned and formatted for consistency
- The script handles various HTML structures and missing data gracefully
- BeautifulSoup warnings have been addressed using current best practices
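
The kind of cleaning applied to extracted fields can be illustrated with a small helper (an assumed approach, using the `re` module listed under Technical Details; this is not the script's actual function):

```python
import re

def clean_text(value):
    """Collapse runs of whitespace (including newlines) into single
    spaces and strip the ends; None becomes an empty string."""
    if value is None:
        return ""
    return re.sub(r"\s+", " ", value).strip()
```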

## Example Output

```csv
Company Name,Logo,Profile URL,Company URL,Phone,Address,Tags,Email,Description
EnjoyYourCamera.com,https://channel-settings.etrusted.com/logo-932f448d...,https://www.trustedshops.de/bewertung/info_X233BF...,https://www.enjoyyourcamera.com,+49 511 20029090,"ENJOYYOURBRANDS GmbH, Eleonorenstr. 20, Deutschland","Bücher, Computer, Unterhaltungselektronik & Zubehör",shop@enjoyyourcamera.com,"Enjoyyourcamera.com ist Ihr Versandhaus für Spezial-Fotozubehör..."
```
3 changes: 3 additions & 0 deletions requirements.txt
@@ -1,6 +1,7 @@
aiofiles==23.1.0
annotated-types==0.5.0
anyio==3.7.1
beautifulsoup4>=4.12.0
Brotli==1.0.9
certifi==2023.7.22
click==8.1.6
@@ -14,8 +15,10 @@ httpx==0.24.1
hyperframe==6.0.1
idna==3.4
lxml==4.9.3
pandas>=2.0.0
pydantic==2.1.1
pydantic_core==2.4.0
requests>=2.28.0
sniffio==1.3.0
socksio==1.0.0
starlette==0.27.0