This project can handle authenticated sessions and paid source text. Treat those materials as private.
- Playwright storage-state files.
- Cookies, HAR files, exported request headers, or bearer tokens.
- Generated
raw/output or scrape logs from paid publications. - Private target configs containing local paths or paid-post URLs.
- Raw API payload caches when
debug_cache_raw_payloadsis enabled.
The login command opens a normal browser window and stores Playwright storage
state at the configured auth.cookie_file path:
uv run substack-archive-scraper login --config config/my-paid-publication.ymlThe storage-state file should live outside the repository, preferably in a directory readable only by the operator.
The config loader rejects storage-state paths that resolve inside the detected Git repository. The login flow also writes session files with owner-only file permissions.
The crawler is intentionally polite and identifiable:
User-Agentincludes the package version and operator contact.- Requests are rate-limited by
operator.max_requests_per_second. robots.txtis loaded and respected.- Retries use bounded backoff.
- There is no stealth, CAPTCHA bypass, paywall bypass, or bot-evasion logic.
HTTP response caching is enabled by default under:
~/.cache/substack-archive-scraper/
Authenticated cache directories are scoped by target plus session fingerprint, created with private directory permissions, and expire after seven days. Public cache entries expire after thirty days. Error responses are not permanently cached by default.
The cache can include authenticated article/comment responses. Purge it before handing a machine to another operator or publishing a reproducible artifact:
uv run substack-archive-scraper cache-purgeManual removal is also safe:
rm -rf ~/.cache/substack-archive-scraperBefore publishing this repository:
- Run the release checklist.
- Verify
git grepfinds no local paths, cookies, paid text, or target-private output. - Rewrite or recreate history if private paths or target configs were ever committed.
- Select and commit an explicit license.