xuio/substack-archive-scraper

Substack Archive Scraper

A Substack scraper and archive exporter that turns single-author Substack publications into Markdown source files for wiki ingestion.

It is intended as a local Substack-to-Markdown downloader for researchers, operators, and publication owners who need reproducible source archives.

The tool is built around a strict source contract: every article, author reply, accepted PDF, and accepted transcript gets a manifest row, a deterministic source file, provenance metadata, and validation logs. It preserves source text; it does not summarize, paraphrase, atomize, or build the wiki itself.

Safety Model

  • Use this only for publications and paid content you are allowed to access.
  • Paid Substacks are handled through a local browser login that exports a Playwright storage-state file outside the repo.
  • The scraper does not bypass paywalls, evade bot detection, or hide what it is.
  • The HTTP client uses a configured contact in its User-Agent, respects robots.txt, rate-limits requests, and logs access caveats.
  • Generated output can contain paid/private text. Keep output roots and session files outside the repository.
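The HTTP behavior described above can be sketched with the standard library. This is a minimal illustration, not the scraper's actual client: the function and class names, the User-Agent string format, and the fixed minimum interval are all assumptions.

```python
import time
from urllib.robotparser import RobotFileParser


def build_user_agent(contact: str) -> str:
    # Hypothetical format; the real User-Agent string is not documented here.
    return f"substack-archive-scraper (+{contact})"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RateLimiter:
    """Enforce a minimum interval between requests (assumed policy)."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep until the interval has elapsed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

A caller would check `is_allowed(...)` and call `wait()` before each request, and log a caveat instead of fetching when robots.txt disallows the URL.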

Install

uv sync --dev
uv run playwright install chromium

Without uv:

python3 -m venv .venv
.venv/bin/python -m pip install -e ".[dev]"
.venv/bin/python -m playwright install chromium

Quickstart

Create a config from the generic template:

cp config/public.example.yml config/my-publication.yml

Edit config/my-publication.yml:

  • target.base_url
  • target.publication_name
  • target.author.canonical_name
  • target.author.stable_id
  • output.root
  • operator.user_agent_contact

The author stable ID is required for comment disambiguation. Display-name matching is not safe enough for author replies.
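A quick way to catch an incomplete config before running preflight is to check for the fields listed above. The sketch below assumes the dotted names map to nested YAML mappings and that the config has already been loaded into a dict; `missing_keys` is a hypothetical helper, not part of the tool.

```python
REQUIRED_KEYS = (
    "target.base_url",
    "target.publication_name",
    "target.author.canonical_name",
    "target.author.stable_id",
    "output.root",
    "operator.user_agent_contact",
)


def missing_keys(config: dict) -> list[str]:
    """Return the required dotted keys that are absent or empty."""
    missing = []
    for dotted in REQUIRED_KEYS:
        node = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if node in (None, ""):
            missing.append(dotted)
    return missing
```

An empty result means every listed field is present; it says nothing about whether the values are valid.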

Run preflight and discovery:

uv run substack-archive-scraper preflight --config config/my-publication.yml
uv run substack-archive-scraper discover --config config/my-publication.yml --limit 10

Scrape and validate:

uv run substack-archive-scraper scrape --config config/my-publication.yml
uv run substack-archive-scraper validate \
  --config config/my-publication.yml \
  --output-root /absolute/path/to/output

Paid Substacks

Start from the authenticated template:

cp config/authenticated.example.yml config/my-paid-publication.yml

Set auth.cookie_file to a path outside the repo, then capture a session:

uv run substack-archive-scraper login --config config/my-paid-publication.yml

A headed Chromium window opens. Log in normally, return to the terminal, and press Enter. The scraper stores Playwright storage state at auth.cookie_file.
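Because the storage-state file holds live session cookies, it is worth confirming that auth.cookie_file really resolves outside the repository before logging in. The helper below is a hypothetical sketch (requires Python 3.9+ for `Path.is_relative_to`); the scraper may or may not perform an equivalent check itself.

```python
from pathlib import Path


def is_outside_repo(cookie_file: str, repo_root: str) -> bool:
    """Return True if cookie_file does not live under repo_root."""
    cookie = Path(cookie_file).expanduser().resolve()
    repo = Path(repo_root).expanduser().resolve()
    return not cookie.is_relative_to(repo)
```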

Credentialed scrapes require auth.known_paid_post_url so the scraper can prove that authenticated article hydration is working before it captures paid content.

Output Contract

<output-root>/
  raw/
    articles/<YYYY>/YYYY-MM-DD-<slug>.md
    pdfs/<descriptive-slug>.md
    comments/YYYY-MM-DD-<article-slug>-<reply-seq>.md
    transcripts/YYYY-MM-DD-<episode-slug>.md
    _manifests/
      source_manifest.jsonl
      candidate_manifest.jsonl
      source_relationships.jsonl
      content_duplicates.jsonl
      scrape_report.md
      voice_candidates.md
      scrape_manifest.yml
  scrape_logs/<run-id>/
    *.log

The manifest is canonical. Source-file frontmatter is a recovery mirror only.
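The layout above implies that a source file's path is a pure function of its date and slug. A minimal sketch of that idea follows; the function names are hypothetical, and both the unpadded reply sequence number and the choice of date used for comment files are assumptions, since the README does not specify them.

```python
from datetime import date
from pathlib import PurePosixPath


def article_path(published: date, slug: str) -> PurePosixPath:
    """Build raw/articles/<YYYY>/YYYY-MM-DD-<slug>.md relative to the output root."""
    name = f"{published.isoformat()}-{slug}.md"
    return PurePosixPath("raw", "articles", f"{published.year:04d}", name)


def comment_path(day: date, article_slug: str, reply_seq: int) -> PurePosixPath:
    """Build raw/comments/YYYY-MM-DD-<article-slug>-<reply-seq>.md (padding assumed absent)."""
    name = f"{day.isoformat()}-{article_slug}-{reply_seq}.md"
    return PurePosixPath("raw", "comments", name)
```

Deterministic paths like these let a rerun overwrite the same files instead of creating duplicates, which is what makes the archive reproducible.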

Completeness Policy

By default the scraper is completeness-first:

  • Keep every discovered single-author article.
  • Keep every confirmed author reply the authenticated session can see.
  • Include partially paywalled articles with paywall_truncation warnings.
  • Include confirmed author replies even when the API hides the parent comment, flagged with comment_parent_context_unavailable.
  • Log comment-access gaps instead of silently omitting them.

Use --exclude-partial-paywalled only when you intentionally want a stricter complete-body corpus.
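The policy can be read as a small decision function: sources are kept by default and carry warnings rather than being dropped. The sketch below uses hypothetical names (`Decision`, `decide_article`, `decide_reply`) and assumes an excluded article still records the reason it was excluded.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    include: bool
    warnings: list[str] = field(default_factory=list)


def decide_article(partially_paywalled: bool, exclude_partial_paywalled: bool = False) -> Decision:
    """Keep every article unless the stricter flag excludes truncated bodies."""
    if not partially_paywalled:
        return Decision(include=True)
    # Assumption: the warning is recorded whether or not the article is kept.
    return Decision(include=not exclude_partial_paywalled, warnings=["paywall_truncation"])


def decide_reply(parent_visible: bool) -> Decision:
    """Keep every confirmed author reply; flag missing parent context."""
    warnings = [] if parent_visible else ["comment_parent_context_unavailable"]
    return Decision(include=True, warnings=warnings)
```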

Cache And Progress

Scrape runs use a persistent HTTP cache by default at:

~/.cache/substack-archive-scraper/

Useful flags:

uv run substack-archive-scraper scrape --config config/my-publication.yml --progress-every 300
uv run substack-archive-scraper scrape --config config/my-publication.yml --refresh-cache
uv run substack-archive-scraper scrape --config config/my-publication.yml --no-cache

Progress output includes post counts, elapsed time, ETA, source counts, and cache hit/miss/write counts.
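For reading the ETA figure, one common approach is a linear extrapolation from the average time per post so far. Whether the scraper computes its ETA this way is an assumption; `estimate_eta_seconds` is a hypothetical helper.

```python
from typing import Optional


def estimate_eta_seconds(done: int, total: int, elapsed_seconds: float) -> Optional[float]:
    """Linear ETA: average seconds per finished post times posts remaining."""
    if done <= 0:
        return None  # no data yet, so no estimate
    remaining = max(0, total - done)
    return elapsed_seconds / done * remaining
```

Cache hits make later posts much faster than earlier ones, so an estimate like this tends to be pessimistic on a warm cache.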

Authenticated cache entries are scoped by target and session fingerprint, expire after seven days, and are stored in private cache directories. Public cache entries expire after thirty days. Error responses are not permanently cached by default.
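The scoping and expiry rules above can be sketched as a key function plus a TTL check. The names, the SHA-256 key derivation, and the use of a "public" scope marker are assumptions for illustration; only the seven-day and thirty-day lifetimes come from this README.

```python
import hashlib
from datetime import datetime, timedelta
from typing import Optional

AUTHENTICATED_TTL = timedelta(days=7)
PUBLIC_TTL = timedelta(days=30)


def cache_key(target: str, url: str, session_fingerprint: Optional[str] = None) -> str:
    """Derive a key scoped by target and, for authenticated requests, by session."""
    scope = session_fingerprint or "public"
    raw = "\n".join((target, scope, url))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def is_expired(stored_at: datetime, authenticated: bool, now: datetime) -> bool:
    """Authenticated entries live seven days, public entries thirty."""
    ttl = AUTHENTICATED_TTL if authenticated else PUBLIC_TTL
    return now - stored_at >= ttl
```

Including the session fingerprint in the key means a response fetched under one login is never served to a run using a different session or no session.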

Purge one target cache:

uv run substack-archive-scraper cache-purge --config config/my-publication.yml

Purge all scraper caches:

uv run substack-archive-scraper cache-purge

Developer Commands

uv sync --dev
uv run ruff check src tests schemas
uv run pytest
uv run python -m json.tool schemas/source_manifest.schema.json >/dev/null

Equivalent shortcuts are available through make:

make install
make check
make test

The shorter substack-ingest command is kept as a compatibility alias.

Documentation

Publishing Status

This repository is prepared for later publication under The Unlicense.
