Generate an LLM-optimized static mirror ("cache") of a website: clean HTML +
markdown twins of every page, plus the root discovery files (llms.txt,
llms-full.txt, corpus.jsonl, sitemap.xml, robots.txt) that LLM crawlers
and RAG pipelines look for. The mirror is SEO-neutral: every page carries
noindex,nofollow and a <link rel="canonical"> back to the original, so it
lives happily on a subdomain (e.g. llm.yourdomain.com) without competing with
the real site for rankings.
Modern websites are built for browsers and search engines, not for language models. A typical page wraps navigation, headers, footers, cookie banners, ad slots, and JavaScript around a small core of actual content. When an LLM crawler or a RAG pipeline fetches that page, it has to download the whole wrapper and then guess which part is the article. The guess is often wrong, the markup is noisy, and JS-rendered content may not be there at all on a plain fetch. The result is worse retrieval, wasted tokens, and content that models either misread or skip.
The obvious fix, publishing a clean text-first copy of your site, runs straight into an SEO problem. A second public copy of your pages looks like duplicate content to search engines and can siphon rankings away from the real site. So most teams never make one.
facsim resolves that tension. It crawls your live site once and writes a separate, read-only mirror designed for machine consumption:
- A clean reading view (index.html) for every page: just the real content, boilerplate stripped, with structured data baked in.
- A markdown twin (content.md) of every page, with YAML frontmatter, the format RAG pipelines and LLMs ingest most cheaply.
- Root discovery files at the mirror's root: llms.txt (a hierarchical index), llms-full.txt (the whole corpus in one file), corpus.jsonl (one JSON record per page for retrieval), plus sitemap.xml and robots.txt.
The objective is a single artifact you can host on a subdomain and point any LLM tool at: "here is our content, already clean, already indexed, in the formats you prefer."
Three design decisions do the heavy lifting:
- SEO-neutral by construction. Every page in the mirror carries noindex,nofollow and a <link rel="canonical"> pointing back to the original URL. Search engines are explicitly told not to index the mirror and that the source page is the real one, so a public mirror on llm.yourdomain.com never competes with your main site for rankings. This is what makes it safe to publish the clean copy at all.
- It speaks the conventions crawlers already look for. llms.txt and llms-full.txt are the emerging root-level conventions for "here's the site's content for LLMs"; corpus.jsonl is a drop-in for RAG ingestion; the per-page markdown twin is the cheapest possible thing to embed. Nothing has to be reverse-engineered at read time because the work is already done.
- The content is extracted once, well. Boilerplate removal, link absolutization, and markdown conversion happen at build time using purpose-built extraction, so every consumer gets the same clean text instead of each one re-deriving it (badly) from raw HTML.
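The two signals behind the SEO-neutral point can be sketched as a pair of head tags. This is a minimal illustration, not facsim's actual renderer: the helper name `seo_neutral_head` is hypothetical.

```python
from html import escape


def seo_neutral_head(source_url: str) -> str:
    """Return the two head tags that keep a mirror page out of search results.

    Hypothetical helper for illustration; not part of facsim's API.
    """
    canonical = escape(source_url, quote=True)
    return "\n".join([
        # Ask search engines not to index this page or follow its links.
        '<meta name="robots" content="noindex,nofollow">',
        # Declare the original page as the authoritative version.
        f'<link rel="canonical" href="{canonical}">',
    ])


print(seo_neutral_head("https://www.yourdomain.com/pricing?plan=pro&term=annual"))
```

Escaping the URL matters because query strings often contain `&`, which must appear as `&amp;` inside an HTML attribute.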
A content-hash manifest.json makes rebuilds cheap: only pages whose content
actually changed are re-fingerprinted (and, with enrichment on, only those pages
are re-billed to the LLM). --diff-against reports exactly what moved between
builds.
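The idea behind a content-hash manifest can be sketched in a few lines. This assumes the manifest is a flat mapping of output path to SHA-256 digest; the real manifest.json layout and hash function may differ, and `fingerprint` and `diff_manifests` are hypothetical names.

```python
import hashlib


def fingerprint(files: dict) -> dict:
    """Map each output path to the SHA-256 hex digest of its bytes."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def diff_manifests(old: dict, new: dict) -> dict:
    """Classify paths as new, changed, or removed between two builds."""
    return {
        "new": sorted(p for p in new if p not in old),
        "changed": sorted(p for p in new if p in old and new[p] != old[p]),
        "removed": sorted(p for p in old if p not in new),
    }


# Two in-memory "builds" standing in for real output directories.
previous = fingerprint({"about/content.md": b"v1", "blog/content.md": b"same"})
current = fingerprint({
    "about/content.md": b"v2",
    "blog/content.md": b"same",
    "pricing/content.md": b"hello",
})
print(diff_manifests(previous, current))
```

Because unchanged pages hash to the same digest, they drop out of the diff and need no further work on a rebuild.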
crawl ──▶ extract ──▶ render ──▶ index
(BFS)     (clean MD)  (HTML+MD)  (root files)
- Crawl (crawl.py): a same-host breadth-first crawl, seeded from your site's sitemap.xml (it follows one level of sitemap-index nesting and reads the child-sitemap names to label sections, e.g. post-sitemap.xml becomes "Insights & Articles"). exclude_patterns skip what you don't want; transient failures (429/5xx, dropped connections) are retried with backoff.
- Extract (extract.py): primary extraction uses trafilatura for boilerplate removal, with a configurable CSS-selector + markdownify fallback. Relative links and images are rewritten to absolute, navigation/headers/footers/scripts are stripped, and pages thinner than min_content_words are dropped to keep the corpus clean.
- Render (render.py): each page becomes a clean HTML reading view (noindex,nofollow, canonical link, Schema.org WebPage + BreadcrumbList JSON-LD, a per-page meta box, and, with enrichment on, a TL;DR and FAQ with FAQPage JSON-LD) plus its content.md twin.
- Index (indexes.py): the rendered pages are assembled into the root discovery files (llms.txt, llms-full.txt, corpus.jsonl, sitemap.xml, robots.txt), a styled HTML table of contents, an .htaccess, and the content-hash manifest.json.
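The crawl stage can be pictured as a queue-driven walk. The sketch below is an illustration only, not the code in crawl.py: `bfs_crawl` and `get_links` are hypothetical names, exclusion is assumed to use glob patterns matched against the URL path, and fetching is replaced by a callback so the example runs offline.

```python
from collections import deque
from fnmatch import fnmatch
from typing import Callable, Iterable, List, Optional
from urllib.parse import urldefrag, urljoin, urlsplit


def bfs_crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    exclude_patterns: Iterable[str] = (),
    limit: Optional[int] = None,
) -> List[str]:
    """Visit same-host pages breadth-first, starting from the seed URLs.

    get_links(url) stands in for "fetch the page and return its hrefs".
    """
    seeds = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    if not seeds:
        return []
    host = urlsplit(seeds[0]).netloc
    patterns = list(exclude_patterns)
    queue = deque(seeds)
    seen = set(seeds)
    visited: List[str] = []
    while queue and (limit is None or len(visited) < limit):
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            link, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(link).netloc != host:
                continue  # stay on the seed host
            if any(fnmatch(urlsplit(link).path, pat) for pat in patterns):
                continue  # skip excluded paths (glob matching is an assumption)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny in-memory site standing in for real HTTP fetches.
site = {
    "https://www.acme.com/": ["/about", "/blog/", "https://other.example/x"],
    "https://www.acme.com/about": ["/", "/team#bios"],
    "https://www.acme.com/blog/": ["/blog/post-1", "/tag/news"],
}
print(bfs_crawl(["https://www.acme.com/"], lambda u: site.get(u, []),
                exclude_patterns=["/tag/*"]))
```

In a real crawl the seeds would come from the sitemap, and `limit` plays the role of a page cap for test runs.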
Optional enrichment (enrich.py) adds a per-page TL;DR and FAQ via the
Anthropic API; it's off by default and cached per content hash so it only bills
pages that changed.
For every crawled page:
- path/index.html: a clean reading view (real HTML body, brand-labelled), with noindex,nofollow, canonical link, Schema.org WebPage + BreadcrumbList JSON-LD, a per-page meta box (word count · build date · links to the markdown and source), and, when enrichment is on, a TL;DR and FAQ (with FAQPage JSON-LD).
- path/content.md: YAML frontmatter (title, url, word_count, optional summary/faqs) followed by clean markdown.
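To make the content.md shape concrete, here is a minimal sketch that assembles frontmatter plus body without a YAML library. The top-level keys (title, url, word_count, summary, faqs) come from the description above; the `question`/`answer` keys inside each FAQ, the whitespace-based word count, and the function name are assumptions.

```python
import json


def render_content_md(title, url, body_md, summary=None, faqs=None):
    """Build a markdown twin: YAML frontmatter followed by the clean body.

    json.dumps is used for quoting because a JSON string is also a valid
    YAML double-quoted scalar.
    """
    lines = [
        "---",
        f"title: {json.dumps(title)}",
        f"url: {json.dumps(url)}",
        f"word_count: {len(body_md.split())}",
    ]
    if summary:
        lines.append(f"summary: {json.dumps(summary)}")
    if faqs:
        lines.append("faqs:")
        for question, answer in faqs:
            lines.append(f"  - question: {json.dumps(question)}")
            lines.append(f"    answer: {json.dumps(answer)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body_md.strip() + "\n"


print(render_content_md(
    "Pricing: plans",
    "https://www.acme.com/pricing",
    "# Pricing\n\nTwo plans.",
    faqs=[("Is there a free tier?", "Yes.")],
))
```

Quoting every string value avoids YAML surprises with titles that contain colons or leading special characters.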
Plus root files:
- llms.txt: hierarchical index with per-link descriptions and trailing Access / Recommended-use / Notes guidance.
- llms-full.txt: the whole corpus concatenated into one file.
- corpus.jsonl: one JSON record per page for RAG ingestion.
- sitemap.xml, robots.txt, a styled HTML table of contents (index.html, or cache-index.html if the homepage occupies /).
- .htaccess: correct MIME types + caching (see Deploy).
- manifest.json: a content-hash fingerprint of every output file, used by --diff-against to report what changed between rebuilds.
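For readers unfamiliar with the JSON Lines layout used by corpus.jsonl, the following sketch writes and reads one record per line. The record fields shown (`url`, `title`, `text`) are illustrative assumptions, not the documented schema.

```python
import io
import json


def write_corpus(pages, out) -> int:
    """Write one compact JSON object per line; return the record count."""
    count = 0
    for page in pages:
        out.write(json.dumps(page, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_corpus(src):
    """Yield records back, one per non-empty line."""
    for line in src:
        if line.strip():
            yield json.loads(line)


# Hypothetical records; an in-memory buffer stands in for the real file.
pages = [
    {"url": "https://www.acme.com/about", "title": "About", "text": "We make anvils."},
    {"url": "https://www.acme.com/pricing", "title": "Pricing", "text": "Two plans."},
]
buffer = io.StringIO()
write_corpus(pages, buffer)
buffer.seek(0)
for record in read_corpus(buffer):
    print(record["url"], len(record["text"].split()))
```

Because each line is an independent JSON document, a retrieval pipeline can stream the file without loading the whole corpus into memory.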
pipx install facsim # isolated, gives you the `facsim` command
# or
pip install facsim

From a checkout:

pip install -e .

Optional LLM enrichment (per-page TL;DR + FAQ) needs one extra:

pip install "facsim[enrich]"

Run the setup wizard. It asks for the site, its sitemap, and what to call the
cache, then writes config.yaml:

facsim init

Example answers produce a cache labelled "Acme LLM Cache":
Site to mirror (base URL): https://www.acme.com
Sitemap path: /sitemap.xml
Cache will be served at: https://llm.acme.com
Brand name (the "X" in "X LLM Cache"): Acme
Or copy config.yaml and edit by hand. The fields you must set for a real run:
source:
  base_url: "https://www.yourdomain.com" # the live site to mirror
  sitemap_path: "/sitemap.xml" # path to the site's sitemap
cache:
  base_url: "https://llm.yourdomain.com" # where this cache will be served
  brand: "COMPANYNAME" # wordmark, renders "COMPANYNAME LLM Cache"
  logo: "" # optional SVG logo (see below)
  site_name: "yourdomain.com" # human-readable source label

Tune extract.content_selectors to your site's markup, and
crawl.exclude_patterns to skip what you don't want mirrored.
By default the wordmark is the brand name set in the display face. To use your
own mark instead, point cache.logo at an SVG file (absolute, or relative to
where you run the build):
cache:
  logo: "assets/logo.svg"

The SVG is inlined into every page, sized to the wordmark height with its own colours preserved. If the file is missing or isn't a valid SVG, the build prints a notice and falls back to the text wordmark, and never fails.
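The "notice and fall back, never fail" behaviour can be sketched as follows. This is an assumption-laden illustration rather than the project's code: `wordmark_html` and the `wordmark` CSS class are hypothetical, validity is approximated as "parses as XML with an svg root element", and sizing is left out.

```python
import xml.etree.ElementTree as ET
from html import escape
from pathlib import Path


def wordmark_html(brand: str, logo_path: str = "") -> str:
    """Return inline SVG markup if the logo is usable, else a text wordmark."""
    text_mark = f'<span class="wordmark">{escape(brand)} LLM Cache</span>'
    if not logo_path:
        return text_mark
    try:
        raw = Path(logo_path).read_bytes()
        root = ET.fromstring(raw)  # raises ParseError on malformed XML
        svg = raw.decode("utf-8")
    except (OSError, UnicodeDecodeError, ET.ParseError) as exc:
        print(f"notice: logo {logo_path!r} unusable ({exc}); using text wordmark")
        return text_mark
    # Strip any "{namespace}" prefix before comparing the root tag name.
    if root.tag.rsplit("}", 1)[-1] != "svg":
        print(f"notice: {logo_path!r} is not an SVG; using text wordmark")
        return text_mark
    return svg


print(wordmark_html("Acme"))
print(wordmark_html("Acme", "does-not-exist.svg"))
```

Every failure path returns the text wordmark, so a bad logo setting degrades the header instead of aborting the build.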
The repo ships its own mark in assets/ (facsim-logo.svg for light
backgrounds, facsim-logo-light.svg for dark) if you want a reference for
sizing and palette.
facsim --config config.yaml
# or, from a checkout, equivalently:
python -m cache_gen --config config.yaml

Useful flags:
| Flag | Effect |
|---|---|
| --limit N | Cap the crawl at N pages (handy for a test run) |
| --clean | Wipe the output dir before building |
| --output DIR | Write to DIR instead of output.dir from the config |
| --diff-against manifest.json | Print a new/changed/removed summary vs a prior build |
| --enrich / --no-enrich | Force the LLM layer on/off (overrides config) |
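For orientation, the flags in the table could be modelled with argparse roughly like this. It is a sketch of the interface as described, not the project's actual parser: the `config.yaml` default is an assumption, and the `facsim init` subcommand is not modelled.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Model the documented build flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="facsim")
    parser.add_argument("--config", default="config.yaml",  # default is an assumption
                        help="path to the YAML config")
    parser.add_argument("--limit", type=int, metavar="N",
                        help="cap the crawl at N pages")
    parser.add_argument("--clean", action="store_true",
                        help="wipe the output dir before building")
    parser.add_argument("--output", metavar="DIR",
                        help="write to DIR instead of output.dir from the config")
    parser.add_argument("--diff-against", metavar="MANIFEST",
                        help="summarise new/changed/removed vs a prior build")
    # BooleanOptionalAction (Python 3.9+) creates both --enrich and --no-enrich.
    # None means "neither flag given", so the config value applies.
    parser.add_argument("--enrich", action=argparse.BooleanOptionalAction,
                        default=None, help="force the LLM layer on/off")
    return parser


args = build_parser().parse_args(["--limit", "20", "--no-enrich"])
print(args.limit, args.enrich, args.clean)
```

Keeping the enrichment default at `None` is one way to let a command-line flag override the config only when the user actually passes it.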
LLM enrichment is off by default. The generator runs fully without an API key or the anthropic
package. To enable: set enrich.enabled: true (or pass --enrich), export ANTHROPIC_API_KEY=..., and install the enrich extra. If the key or package
is missing the build prints a notice and continues without enrichment, and
never fails. Results are cached per content hash in .enrich-cache/, so a
rebuild only bills pages whose content actually changed.
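The two behaviours described here, graceful degradation and per-hash caching, can be sketched as below. This is not the code in enrich.py: the function names are hypothetical, the cache is assumed to hold one JSON file per SHA-256 digest, and `call_llm` is a stand-in callback rather than a real Anthropic API call.

```python
import hashlib
import importlib.util
import json
import os
from pathlib import Path


def enrichment_available(env=os.environ) -> bool:
    """Return False with a notice (never raise) if a prerequisite is missing."""
    if not env.get("ANTHROPIC_API_KEY"):
        print("notice: ANTHROPIC_API_KEY not set; continuing without enrichment")
        return False
    if importlib.util.find_spec("anthropic") is None:
        print("notice: anthropic package not installed; continuing without enrichment")
        return False
    return True


def cached_enrich(markdown: str, cache_dir: Path, call_llm) -> dict:
    """Return the enrichment for this content, calling the LLM only on a miss."""
    key = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
    entry = cache_dir / f"{key}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))
    result = call_llm(markdown)  # the only step that would be billed
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_text(json.dumps(result), encoding="utf-8")
    return result


print(enrichment_available({}))  # no key in this environment mapping
```

Keying the cache on the content hash rather than the URL means an unchanged page is never re-sent, even if it moves.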
The deploy/ directory has everything for a self-updating subdomain. Suggested
server layout:
/opt/facsim/
  src/          this repo (your checkout, with config.yaml)
  releases/     timestamped build outputs
  current -> releases/<latest>   # symlink Apache serves
- Checkout + install on the server:

  sudo mkdir -p /opt/facsim && cd /opt/facsim
  git clone <your-repo> src && cd src
  pip install . # add [enrich] if you want the LLM layer

- Apache vhost: copy deploy/apache-vhost.conf.template, replace YOURDOMAIN, enable it, then add HTTPS with certbot:

  sudo cp deploy/apache-vhost.conf.template /etc/apache2/sites-available/facsim.conf
  sudo a2enmod expires && sudo a2ensite facsim
  sudo systemctl reload apache2
  sudo certbot --apache -d llm.yourdomain.com

  The vhost's DocumentRoot is the current symlink, and it duplicates the MIME/caching directives so the cache serves correctly even where AllowOverride None ignores .htaccess.

- First publish:

  sudo ROOT=/opt/facsim deploy/rebuild.sh

  rebuild.sh builds into releases/<timestamp>, prints a diff against the live release, then atomically repoints current (so visitors never see a half-written tree) and prunes to the most recent 5 releases.
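rebuild.sh is a shell script; the Python sketch below only illustrates the swap-and-prune technique it is described as using. It assumes release directory names sort chronologically (for example, timestamps), and the `publish` function and `current.tmp` name are hypothetical.

```python
import os
import shutil
from pathlib import Path


def publish(root: Path, release_name: str, keep: int = 5) -> None:
    """Repoint root/current at a finished release, then prune old releases."""
    releases = root / "releases"
    target = releases / release_name
    if not target.is_dir():
        raise FileNotFoundError(target)
    tmp_link = root / "current.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()  # clear a leftover from an interrupted run
    # Build the new link under a temporary name, then rename it over the old
    # one. rename() is atomic on POSIX, so readers see either the old tree or
    # the new tree, never a missing or half-updated link.
    os.symlink(Path("releases") / release_name, tmp_link)
    os.replace(tmp_link, root / "current")
    # Keep only the newest `keep` releases (assumes names sort by age).
    names = sorted(p.name for p in releases.iterdir() if p.is_dir())
    for stale in names[:-max(keep, 1)]:
        shutil.rmtree(releases / stale)
```

Building into a fresh directory and switching the link last is what keeps a failed or partial build from ever being served.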
Pick one scheduler.
systemd timer (recommended):
sudo cp deploy/facsim.{service,timer} /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now facsim.timer
systemctl list-timers facsim.timer # confirm next run

Edit the .service to set ROOT and (optionally) ANTHROPIC_API_KEY; for the key,
prefer an EnvironmentFile restricted with chmod 600.
cron: see deploy/crontab.example (crontab -u www-data deploy/crontab.example).
Either way the schedule just runs rebuild.sh, so each update is an atomic
symlink swap with a logged diff.
Built by Max Avery.
- X / Twitter: @realmaxavery
- LinkedIn: maxavery