Skip to content

Security: xuio/substack-archive-scraper

Security

docs/security.md

Security And Compliance Notes

This project can handle authenticated sessions and paid source text. Treat those materials as private.

Never Commit

  • Playwright storage-state files.
  • Cookies, HAR files, exported request headers, or bearer tokens.
  • Generated raw/ output or scrape logs from paid publications.
  • Private target configs containing local paths or paid-post URLs.
  • Raw API payload caches when debug_cache_raw_payloads is enabled.

Authenticated Access

The login command opens a normal browser window and stores Playwright storage state at the configured auth.cookie_file path:

uv run substack-archive-scraper login --config config/my-paid-publication.yml

The storage-state file should live outside the repository, preferably in a directory readable only by the operator.

The config loader rejects storage-state paths that resolve inside the detected Git repository. The login flow also writes session files with owner-only file permissions.

Crawler Behavior

The crawler is intentionally polite and identifiable:

  • User-Agent includes the package version and operator contact.
  • Requests are rate-limited by operator.max_requests_per_second.
  • robots.txt is loaded and respected.
  • Retries use bounded backoff.
  • There is no stealth, CAPTCHA bypass, paywall bypass, or bot-evasion logic.

Cache

HTTP response caching is enabled by default under:

~/.cache/substack-archive-scraper/

Authenticated cache directories are scoped by target plus session fingerprint, created with private directory permissions, and expire after seven days. Public cache entries expire after thirty days. Error responses are not permanently cached by default.

The cache can include authenticated article/comment responses. Purge it before handing a machine to another operator or publishing a reproducible artifact:

uv run substack-archive-scraper cache-purge

Manual removal is also safe:

rm -rf ~/.cache/substack-archive-scraper

Publication Checklist

Before publishing this repository:

  1. Run the release checklist.
  2. Verify git grep finds no local paths, cookies, paid text, or target-private output.
  3. Rewrite or recreate history if private paths or target configs were ever committed.
  4. Select and commit an explicit license.

There aren't any published security advisories