Substack Archive Scraper

A Substack scraper and archive exporter that turns single-author Substack publications into Markdown source files for wiki ingestion.

It is intended as a local Substack-to-Markdown downloader for researchers, operators, and publication owners who need reproducible source archives.

The tool is built around a strict source contract: every article, author reply, accepted PDF, and accepted transcript gets a manifest row, a deterministic source file, provenance metadata, and validation logs. It preserves source text; it does not summarize, paraphrase, atomize, or build the wiki itself.

Safety Model

Use this only for publications and paid content you are allowed to access.
Paid Substacks are handled through a local browser login that exports a Playwright storage-state file outside the repo.
The scraper does not bypass paywalls, evade bot detection, or hide what it is.
The HTTP client uses a configured contact in its User-Agent, respects robots.txt, rate-limits requests, and logs access caveats.
Generated output can contain paid/private text. Keep output roots and session files outside the repository.
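The access rules above can be illustrated with a minimal sketch, assuming a robots.txt body that has already been fetched. It is not the scraper's actual HTTP client: the `USER_AGENT` string, the one-second interval, and the `build_robots` and `RateLimiter` names are all hypothetical.

```python
import time
import urllib.robotparser

# Hypothetical values: the real tool takes its contact from operator.user_agent_contact.
USER_AGENT = "substack-archive-scraper (contact: operator@example.com)"
MIN_INTERVAL_SECONDS = 1.0  # assumed delay, not the tool's documented setting


def build_robots(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse a robots.txt body that was already downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return the delay used."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

A caller would check `build_robots(body).can_fetch(USER_AGENT, url)` and call `RateLimiter.wait()` before each request, skipping and logging any URL that robots.txt disallows.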

Install

uv sync --dev
uv run playwright install chromium

Without uv:

python3 -m venv .venv
.venv/bin/python -m pip install -e ".[dev]"
.venv/bin/python -m playwright install chromium

Quickstart

Create a config from the generic template:

cp config/public.example.yml config/my-publication.yml

Edit config/my-publication.yml:

target.base_url
target.publication_name
target.author.canonical_name
target.author.stable_id
output.root
operator.user_agent_contact

The author stable ID is required for comment disambiguation. Display-name matching is not safe enough for author replies.
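To show why display-name matching is unsafe, here is a small sketch that compares the two approaches. The `Comment` shape, its field names, and the string type of the ID are assumptions for illustration, not the scraper's real data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    # Hypothetical fields; the real comment payload may differ.
    user_id: str
    display_name: str
    body: str


def is_author_reply(comment: Comment, author_stable_id: str) -> bool:
    """Match on the stable ID only; display names can be copied by anyone."""
    return comment.user_id == author_stable_id


def is_author_reply_by_name(comment: Comment, canonical_name: str) -> bool:
    """Unsafe variant, shown only to demonstrate the false positive."""
    return comment.display_name == canonical_name
```

A commenter who sets the same display name as the author passes the name check but fails the ID check, which is the case target.author.stable_id exists to rule out.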

Run preflight and discovery:

uv run substack-archive-scraper preflight --config config/my-publication.yml
uv run substack-archive-scraper discover --config config/my-publication.yml --limit 10

Scrape and validate:

uv run substack-archive-scraper scrape --config config/my-publication.yml
uv run substack-archive-scraper validate \
  --config config/my-publication.yml \
  --output-root /absolute/path/to/output

Paid Substacks

Start from the authenticated template:

cp config/authenticated.example.yml config/my-paid-publication.yml

Set auth.cookie_file to a path outside the repo, then capture a session:

uv run substack-archive-scraper login --config config/my-paid-publication.yml

A headed Chromium window opens. Log in normally, return to the terminal, and press Enter. The scraper stores Playwright storage state at auth.cookie_file.

Credentialed scrapes require auth.known_paid_post_url so the scraper can prove that authenticated article hydration is working before it captures paid content.
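Because the session file must live outside the repo, a check along the following lines could guard against a misconfigured auth.cookie_file. This is a sketch under assumptions: the `is_outside_repo` helper is hypothetical, and whether the scraper performs an equivalent check is not stated here.

```python
from pathlib import Path


def is_outside_repo(candidate: Path, repo_root: Path) -> bool:
    """Return True when candidate does not resolve to a path under repo_root."""
    resolved = candidate.expanduser().resolve()
    root = repo_root.expanduser().resolve()
    # Path.is_relative_to requires Python 3.9 or newer.
    return not resolved.is_relative_to(root)
```

Resolving both paths first means a value such as `repo/../repo/state.json` is still recognized as being inside the repository.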

Output Contract

<output-root>/
  raw/
    articles/<YYYY>/YYYY-MM-DD-<slug>.md
    pdfs/<descriptive-slug>.md
    comments/YYYY-MM-DD-<article-slug>-<reply-seq>.md
    transcripts/YYYY-MM-DD-<episode-slug>.md
    _manifests/
      source_manifest.jsonl
      candidate_manifest.jsonl
      source_relationships.jsonl
      content_duplicates.jsonl
      scrape_report.md
      voice_candidates.md
      scrape_manifest.yml
  scrape_logs/<run-id>/
    *.log
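The layout above implies that a source file's path can be derived from its date and slug alone. A minimal sketch of that idea follows; the function names are hypothetical, and which date a comment file uses and how `<reply-seq>` is formatted are assumptions.

```python
from datetime import date
from pathlib import PurePosixPath


def article_path(published: date, slug: str) -> PurePosixPath:
    """Build raw/articles/<YYYY>/YYYY-MM-DD-<slug>.md relative to the output root."""
    return PurePosixPath(
        "raw", "articles", f"{published.year:04d}", f"{published.isoformat()}-{slug}.md"
    )


def comment_path(stamp: date, article_slug: str, reply_seq: int) -> PurePosixPath:
    """Build raw/comments/YYYY-MM-DD-<article-slug>-<reply-seq>.md (sequence format assumed)."""
    return PurePosixPath(
        "raw", "comments", f"{stamp.isoformat()}-{article_slug}-{reply_seq}.md"
    )
```

The same inputs always produce the same path, which is what makes reruns reproducible.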

The manifest is canonical. Source-file frontmatter is a recovery mirror only.
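Since the manifest is JSON Lines, downstream tooling can read it one row at a time. The sketch below assumes only that each non-empty line is a JSON object; the `iter_manifest_rows` name is hypothetical, and the authoritative field list is in schemas/source_manifest.schema.json.

```python
import json
from typing import Iterable, Iterator


def iter_manifest_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, failing loudly on bad rows."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between rows
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"manifest line {number} is not valid JSON") from exc
        if not isinstance(row, dict):
            raise ValueError(f"manifest line {number} is not a JSON object")
        yield row
```

Usage is `with open(manifest_path, encoding="utf-8") as handle: rows = list(iter_manifest_rows(handle))`, where `manifest_path` points at `raw/_manifests/source_manifest.jsonl` under the output root.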

Completeness Policy

By default the scraper is completeness-first:

Keep every discovered single-author article.
Keep every confirmed author reply the authenticated session can see.
Include partially paywalled articles with paywall_truncation warnings.
Include confirmed author replies whose parent comment is hidden by the API, marking them comment_parent_context_unavailable.
Log comment-access gaps instead of silently omitting them.

Use --exclude-partial-paywalled only when you intentionally want a stricter complete-body corpus.
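The difference between the default policy and --exclude-partial-paywalled can be sketched as a filter over manifest-like rows. The `warnings` field name and the `select_sources` function are assumptions for illustration, not the tool's actual internals.

```python
def select_sources(rows, exclude_partial_paywalled=False):
    """Keep every row by default; optionally drop paywall-truncated ones."""
    kept = []
    for row in rows:
        # "warnings" is an assumed field holding warning codes for the source.
        warnings = row.get("warnings", [])
        if exclude_partial_paywalled and "paywall_truncation" in warnings:
            continue
        kept.append(row)
    return kept
```

With the flag off, truncated articles stay in the corpus and carry their warning; with it on, only rows without that warning remain.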

Cache And Progress

Scrape runs use a persistent HTTP cache by default at:

~/.cache/substack-archive-scraper/

Useful flags:

uv run substack-archive-scraper scrape --config config/my-publication.yml --progress-every 300
uv run substack-archive-scraper scrape --config config/my-publication.yml --refresh-cache
uv run substack-archive-scraper scrape --config config/my-publication.yml --no-cache

Progress output includes post counts, elapsed time, ETA, source counts, and cache hit/miss/write counts.

Authenticated cache entries are scoped by target and session fingerprint, expire after seven days, and are stored in private cache directories. Public cache entries expire after thirty days. Error responses are not permanently cached by default.
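The scoping and expiry rules above can be sketched as follows. Only the seven-day and thirty-day lifetimes come from this README; the key format, the use of SHA-256, and the function names are assumptions.

```python
import hashlib
from datetime import datetime, timedelta

PUBLIC_TTL = timedelta(days=30)
AUTHENTICATED_TTL = timedelta(days=7)


def cache_key(target: str, url: str, session_fingerprint=None) -> str:
    """Derive a key that differs per target and, when authenticated, per session."""
    scope = session_fingerprint or "public"
    raw = "\n".join((target, scope, url))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def is_expired(stored_at: datetime, now: datetime, authenticated: bool) -> bool:
    """Apply the shorter lifetime to authenticated entries."""
    ttl = AUTHENTICATED_TTL if authenticated else PUBLIC_TTL
    return now - stored_at >= ttl
```

Including the session fingerprint in the key means a response fetched under one login is never served to a different session or to an unauthenticated run.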

Purge one target cache:

uv run substack-archive-scraper cache-purge --config config/my-publication.yml

Purge all scraper caches:

uv run substack-archive-scraper cache-purge

Developer Commands

uv sync --dev
uv run ruff check src tests schemas
uv run pytest
uv run python -m json.tool schemas/source_manifest.schema.json >/dev/null

Equivalent shortcuts are available through make:

make install
make check
make test

The shorter substack-ingest command is kept as a compatibility alias.

Documentation

Publishing Status

This repository is prepared for later publication under The Unlicense.
