breed2vec is a lightweight, reproducible pipeline for collecting Fédération Cynologique Internationale (FCI) dog breed standards, extracting text from PDFs, and analyzing semantic similarity between breeds using modern text representations. The goal of this project is not to infer ancestry or biological truth, but to evaluate how well document embeddings recover known historical, geographic, and morphological relationships from breed-standard text alone. The workflow is designed to be transparent and inspectable.
scrape → store → ingest PDFs → analyze
This project is a small, controlled testbed for evaluating how well text embeddings recover meaningful structure from domain‑specific documents. That style of construct‑validity check is directly relevant to safety‑adjacent evaluation: if embeddings fail on known structure in a narrow domain, they are less trustworthy for high‑stakes interpretability or retrieval settings.
Using a small, interpretable set of breeds, the demo illustrates that:
- Breeds with shared geographic and functional history (e.g., Labrador Retriever and Newfoundland) exhibit high semantic similarity.
- Related breeds developed in different contexts (e.g., Golden Retriever) appear nearby but offset.
- Morphologically and historically distinct breeds (e.g., Xoloitzcuintle) separate cleanly.
- These relationships emerge without any biological labels—purely from text embeddings of breed standards.
This serves as a construct-validity sanity check for document-level embeddings in a fine-grained biological domain.
- Read the Retriever Sandbox writeup: breed2vec/docs/retriever_sandbox.md.
- Optionally run analysis on the provided breed list:
python -m breed2vec analyze --breeds breed2vec/breeds.txt
- To avoid scraping, use a cached DB (see below).
Create and activate the conda environment:
conda env create -f environment.yml
conda activate breed2vec
(Alternatively, dependencies can be installed via pip; see environment.yml for details.)
This is the shortest path from scratch to analysis output.
- Populate FCI group and breed metadata:
python -m breed2vec groups
python -m breed2vec breeds
- Specify a small list of breeds: Edit breed2vec/breeds.txt, adding one breed per line using official FCI names.
- Ingest breed standards (PDF download + text extraction):
python -m breed2vec ingest --breeds breed2vec/breeds.txt
Notes:
- Internet access is required to download new PDFs.
- If offline, use a cached DB (see below).
- Run analysis:
python -m breed2vec analyze --breeds breed2vec/breeds.txt
This produces cosine similarity matrices and low-dimensional visualizations (e.g., PCA) over document embeddings.
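To make the cosine similarity step concrete, here is a minimal sketch that builds TF-IDF vectors and a pairwise cosine matrix using only the Python standard library. It is not the project's actual analysis code: the function names, the whitespace tokenizer, the smoothed IDF formula, and the three toy descriptions are all assumptions made for illustration, and the toy texts are invented rather than taken from FCI standards.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    # Assumption: naive lowercase whitespace tokenization.
    tokenized = [text.lower().split() for text in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Pairwise cosine similarity over the TF-IDF vectors of all documents."""
    vectors = tfidf_vectors(docs)
    return [[cosine(a, b) for b in vectors] for a in vectors]


# Invented placeholder texts, not real breed-standard excerpts.
toy_docs = [
    "strong water dog with dense coat and otter tail",
    "large water dog with dense coat and heavy bone",
    "hairless dog with smooth skin and elegant build",
]

for row in similarity_matrix(toy_docs):
    print(["%.2f" % value for value in row])
```

In this toy setup the first two descriptions share more distinctive terms than either shares with the third, so their similarity is higher. That is the same kind of pattern the demo looks for across real standards.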
- breed2vec/scrape/: scrape FCI group + breed metadata.
- breed2vec/db/: sqlite schema + CRUD.
- breed2vec/ingest/: PDF download and text extraction.
- breed2vec/analyze/: analysis helpers (TF‑IDF / embeddings).
- breed2vec/pipeline/: orchestration entrypoints.
For full detail, see breed2vec/MANIFEST.md.
If you have a cached fci_cache.db, you can skip scraping and run ingest/analyze directly.
Option A: environment variables
export BREED2VEC_DB_PATH="/path/to/fci_cache.db"
python -m breed2vec ingest --breeds breed2vec/breeds.txt
python -m breed2vec analyze --breeds breed2vec/breeds.txt
Option B: CLI flags
python -m breed2vec ingest --db-path /path/to/fci_cache.db --breeds breed2vec/breeds.txt
python -m breed2vec analyze --db-path /path/to/fci_cache.db --breeds breed2vec/breeds.txt
If you have a full cached data directory (db + pdfs + layout), use:
python -m breed2vec analyze --data-dir /path/to/data
- breed2vec/data/fci_cache.db: sqlite database for breed metadata and documents.
- breed2vec/data/pdfs/: downloaded PDF standards.
- breed2vec/data/layout/: optional layout traces for section extraction.
- breed2vec/data/plots/<run_id>/: analysis outputs (cosine matrix, TF‑IDF tables, plots).
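The two ways of pointing at a cached database (the BREED2VEC_DB_PATH environment variable and the --db-path flag) could be reconciled roughly as in the sketch below. The precedence shown (flag, then environment variable, then a default) is an assumption, as are the helper name `resolve_db_path` and the default path; the project's actual resolution logic may differ.

```python
import argparse
import os
from pathlib import Path

# Assumption: fall back to the database location listed in the data layout.
DEFAULT_DB_PATH = Path("breed2vec/data/fci_cache.db")


def resolve_db_path(cli_value=None, environ=None):
    """Pick a database path: CLI flag first, then env var, then the default.

    This ordering is an illustrative assumption, not documented behavior.
    """
    environ = os.environ if environ is None else environ
    if cli_value:
        return Path(cli_value)
    env_value = environ.get("BREED2VEC_DB_PATH")
    if env_value:
        return Path(env_value)
    return DEFAULT_DB_PATH


def build_parser():
    """Hypothetical parser exposing only the --db-path flag."""
    parser = argparse.ArgumentParser(prog="db-path-sketch")
    parser.add_argument("--db-path", default=None)
    return parser


# Parse an explicit argument list so the sketch runs without a real CLI call.
args = build_parser().parse_args(["--db-path", "/path/to/fci_cache.db"])
print(resolve_db_path(args.db_path, environ={}))
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.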
- The FCI site is the source of truth for breed metadata and PDF standards.
- The pipeline is designed to be incremental: re‑running ingest re‑downloads PDFs and updates stored data only when hashes change.
- PCA visualizations are used for interpretability, not statistical inference.
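The hash-based incremental behavior noted above can be illustrated with a small sketch. It assumes SHA-256 as the hash and uses a plain dict as a stand-in for the sqlite store; the function names and the choice of hash are hypothetical, not taken from the project.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the raw PDF bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def ingest_if_changed(breed: str, pdf_bytes: bytes, store: dict) -> bool:
    """Record the new hash only when the content differs from what is stored.

    `store` maps breed name -> last seen hash and stands in for the database.
    Returns True when an update happened, False when the PDF was unchanged.
    """
    new_hash = sha256_hex(pdf_bytes)
    if store.get(breed) == new_hash:
        return False  # Same content as before: skip re-extraction.
    store[breed] = new_hash  # A real pipeline would also re-extract text here.
    return True


store = {}
print(ingest_if_changed("Example Breed", b"version 1", store))  # first sight
print(ingest_if_changed("Example Breed", b"version 1", store))  # unchanged
print(ingest_if_changed("Example Breed", b"version 2", store))  # content changed
```

Comparing content hashes rather than timestamps means a re-download of an identical file causes no downstream work.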
This demo uses a small number of breeds for clarity and interpretability. Semantic similarity reflects shared descriptive language, not proof of genetic ancestry. The project is intended as an exploratory and diagnostic tool rather than a benchmark.
- Section extraction from layout traces (history vs morphology vs temperament).
- Structured comparisons across section types.
If you are interested in this work, please feel free to reach out for a walkthrough or a curated subset of results.
Development was assisted by OpenAI tools (Codex / ChatGPT) for scaffolding, refactoring, and debugging support.