A Substack scraper and archive exporter that turns single-author Substack publications into Markdown source files for wiki ingestion.
It is intended as a local Substack-to-Markdown downloader for researchers, operators, and publication owners who need reproducible source archives.
The tool is built around a strict source contract: every article, author reply, accepted PDF, and accepted transcript gets a manifest row, a deterministic source file, provenance metadata, and validation logs. It preserves source text; it does not summarize, paraphrase, atomize, or build the wiki itself.
- Use this only for publications and paid content you are allowed to access.
- Paid Substacks are handled through a local browser login that exports a Playwright storage-state file outside the repo.
- The scraper does not bypass paywalls, evade bot detection, or hide what it is.
- The HTTP client uses a configured contact in its User-Agent, respects robots.txt, rate-limits requests, and logs access caveats.
- Generated output can contain paid/private text. Keep output roots and session files outside the repository.
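
The access rules above can be illustrated with a minimal sketch. It is not the scraper's actual HTTP client: the `PoliteGate` class, the `USER_AGENT` string, and the two-second interval are assumptions made for illustration. It only shows how a robots.txt check and a minimum delay between requests might be combined using the standard library.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical value: the real tool builds this from operator.user_agent_contact.
USER_AGENT = "substack-archive-scraper (contact: operator@example.com)"


class PoliteGate:
    """Decide whether a URL may be fetched and pace consecutive requests."""

    def __init__(self, robots_lines, min_interval=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._robots = RobotFileParser()
        self._robots.parse(robots_lines)  # lines of an already-fetched robots.txt
        self._min_interval = min_interval  # assumed delay, in seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def allowed(self, url):
        """Return True when robots.txt permits this user agent to fetch url."""
        return self._robots.can_fetch(USER_AGENT, url)

    def wait_turn(self):
        """Sleep until at least min_interval has passed since the last request."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Injecting `clock` and `sleep` keeps the pacing logic testable without real delays.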
uv sync --dev
uv run playwright install chromium

Without uv:
python3 -m venv .venv
.venv/bin/python -m pip install -e ".[dev]"
.venv/bin/python -m playwright install chromium

Create a config from the generic template:
cp config/public.example.yml config/my-publication.yml

Edit config/my-publication.yml:
- target.base_url
- target.publication_name
- target.author.canonical_name
- target.author.stable_id
- output.root
- operator.user_agent_contact
The author stable ID is required for comment disambiguation. Display-name matching is not safe enough for author replies.
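
To show why display-name matching is unsafe, here is a small sketch. The `Comment` dataclass and its field names are hypothetical and do not reflect the scraper's real data model. The point is only that two commenters can share a display name, while a stable ID identifies one account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    # Hypothetical fields, chosen for illustration only.
    user_id: str
    display_name: str
    body: str


def is_author_reply(comment, author_stable_id):
    """Match on the stable ID; the display name is deliberately ignored."""
    return str(comment.user_id) == str(author_stable_id)


real = Comment(user_id="12345", display_name="Jane Doe", body="Thanks for reading.")
impostor = Comment(user_id="99999", display_name="Jane Doe", body="Totally the author.")

print(is_author_reply(real, "12345"))      # True
print(is_author_reply(impostor, "12345"))  # False
```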
Run preflight and discovery:
uv run substack-archive-scraper preflight --config config/my-publication.yml
uv run substack-archive-scraper discover --config config/my-publication.yml --limit 10

Scrape and validate:
uv run substack-archive-scraper scrape --config config/my-publication.yml
uv run substack-archive-scraper validate \
--config config/my-publication.yml \
--output-root /absolute/path/to/output

Start from the authenticated template:
cp config/authenticated.example.yml config/my-paid-publication.yml

Set auth.cookie_file to a path outside the repo, then capture a session:
uv run substack-archive-scraper login --config config/my-paid-publication.yml

A headed Chromium window opens. Log in normally, return to the terminal, and
press Enter. The scraper stores Playwright storage state at auth.cookie_file.
Credentialed scrapes require auth.known_paid_post_url so the scraper can prove
that authenticated article hydration is working before it captures paid content.
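
Before a credentialed run, it can help to sanity-check the session file yourself. The sketch below is an assumption-laden illustration, not part of the tool: `session_file_problems` is a hypothetical helper. It relies only on the fact that a Playwright storage-state file is JSON with a top-level `cookies` list, and it does not perform the paid-post hydration check described above.

```python
import json
from pathlib import Path


def session_file_problems(cookie_file, repo_root):
    """Return a list of human-readable problems; an empty list means it looks usable."""
    problems = []
    path = Path(cookie_file).expanduser().resolve()
    root = Path(repo_root).resolve()

    # Session files may hold paid-account cookies, so they must not live in the repo.
    if path == root or root in path.parents:
        problems.append("session file is inside the repository")

    if not path.is_file():
        problems.append("session file does not exist")
        return problems

    try:
        state = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        problems.append("session file is not valid JSON")
        return problems

    # Playwright storage state keeps cookies under the top-level "cookies" key.
    if not isinstance(state, dict) or not state.get("cookies"):
        problems.append("session file has no cookies; log in again")
    return problems
```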
<output-root>/
raw/
articles/<YYYY>/YYYY-MM-DD-<slug>.md
pdfs/<descriptive-slug>.md
comments/YYYY-MM-DD-<article-slug>-<reply-seq>.md
transcripts/YYYY-MM-DD-<episode-slug>.md
_manifests/
source_manifest.jsonl
candidate_manifest.jsonl
source_relationships.jsonl
content_duplicates.jsonl
scrape_report.md
voice_candidates.md
scrape_manifest.yml
scrape_logs/<run-id>/
*.log
The manifest is canonical. Source-file frontmatter is a recovery mirror only.
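
As a rough illustration of what "the manifest is canonical" implies, the sketch below derives an article path from the documented naming pattern and reconciles manifest rows against the files on disk. The `path` field name and the helper functions are assumptions; the real manifest schema lives in schemas/source_manifest.schema.json and may differ.

```python
import json
from datetime import date
from pathlib import PurePosixPath


def article_relpath(published, slug):
    """Build articles/<YYYY>/YYYY-MM-DD-<slug>.md from a date and a slug."""
    return PurePosixPath("articles", f"{published:%Y}", f"{published:%Y-%m-%d}-{slug}.md")


def parse_manifest(text):
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def reconcile(manifest_rows, files_on_disk):
    """Compare manifest rows (hypothetical 'path' field) with the files found."""
    listed = {row["path"] for row in manifest_rows}
    on_disk = set(files_on_disk)
    return {
        "missing_files": sorted(listed - on_disk),   # in manifest, not on disk
        "unlisted_files": sorted(on_disk - listed),  # on disk, not in manifest
    }


rel = str(article_relpath(date(2024, 3, 5), "my-post"))
rows = parse_manifest('{"path": "%s"}\n{"path": "articles/2024/2024-04-01-other.md"}\n' % rel)
print(reconcile(rows, [rel, "articles/2023/2023-01-01-stray.md"]))
```

Because the same date and slug always yield the same path, a rerun can be compared against an earlier manifest without guessing at file names.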
By default the scraper is completeness-first:
- Keep every discovered single-author article.
- Keep every confirmed author reply the authenticated session can see.
- Include partially paywalled articles with paywall_truncation warnings.
- Include confirmed author replies whose parent comment is hidden by the API with comment_parent_context_unavailable.
- Log comment-access gaps instead of silently omitting them.
Use --exclude-partial-paywalled only when you intentionally want a stricter
complete-body corpus.
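
The completeness-first policy can be sketched as a small selection function. This is an illustration under stated assumptions: the candidate dictionaries and their keys (`id`, `kind`, `truncated`, `parent_hidden`) are hypothetical, and only the two warning names come from the text above.

```python
def select_sources(candidates, exclude_partial_paywalled=False):
    """Keep everything by default, attaching warnings instead of dropping items."""
    kept, log = [], []
    for item in candidates:
        warnings = list(item.get("warnings", []))
        if item.get("truncated"):
            if exclude_partial_paywalled:
                # Stricter corpus: the exclusion is logged, never silent.
                log.append(f"excluded {item['id']}: paywall_truncation")
                continue
            warnings.append("paywall_truncation")
        if item.get("kind") == "author_reply" and item.get("parent_hidden"):
            warnings.append("comment_parent_context_unavailable")
        kept.append({**item, "warnings": warnings})
    return kept, log


candidates = [
    {"id": "a1", "kind": "article"},
    {"id": "a2", "kind": "article", "truncated": True},
    {"id": "r1", "kind": "author_reply", "parent_hidden": True},
]
default_kept, _ = select_sources(candidates)
strict_kept, strict_log = select_sources(candidates, exclude_partial_paywalled=True)
```

In the default mode all three candidates are kept, two of them with warnings; in the strict mode the truncated article is dropped and the drop is recorded in the log.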
Scrape runs use a persistent HTTP cache by default at:
~/.cache/substack-archive-scraper/
Useful flags:
uv run substack-archive-scraper scrape --config config/my-publication.yml --progress-every 300
uv run substack-archive-scraper scrape --config config/my-publication.yml --refresh-cache
uv run substack-archive-scraper scrape --config config/my-publication.yml --no-cache

Progress output includes post counts, elapsed time, ETA, source counts, and cache hit/miss/write counts.
Authenticated cache entries are scoped by target and session fingerprint, expire after seven days, and are stored in private cache directories. Public cache entries expire after thirty days. Error responses are not permanently cached by default.
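
The scoping and expiry rules can be pictured with a short sketch. The key layout and function names are assumptions for illustration, not the scraper's actual cache format. Only the seven-day and thirty-day lifetimes and the target/session scoping come from the description above.

```python
import hashlib
from datetime import datetime, timedelta, timezone

PUBLIC_TTL = timedelta(days=30)
AUTHENTICATED_TTL = timedelta(days=7)


def cache_key(target, url, session_fingerprint=None):
    """Hash target, scope, and URL so entries from different sessions never collide."""
    scope = session_fingerprint or "public"
    raw = "\n".join([target, scope, url]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def is_fresh(stored_at, now, authenticated):
    """Authenticated entries expire after 7 days, public entries after 30."""
    ttl = AUTHENTICATED_TTL if authenticated else PUBLIC_TTL
    return now - stored_at < ttl


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
stored = now - timedelta(days=8)
print(is_fresh(stored, now, authenticated=True))   # False
print(is_fresh(stored, now, authenticated=False))  # True
```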
Purge one target cache:
uv run substack-archive-scraper cache-purge --config config/my-publication.yml

Purge all scraper caches:
uv run substack-archive-scraper cache-purge

uv sync --dev
uv run ruff check src tests schemas
uv run pytest
uv run python -m json.tool schemas/source_manifest.schema.json >/dev/null

Equivalent shortcuts are available through make:
make install
make check
make test

The shorter substack-ingest command is kept as a compatibility alias.
- Pipeline design
- Implementation plan
- Development guide
- Security and publishing notes
- Release checklist
This repository is prepared for later publication under The Unlicense.