End-to-end system for detecting, reading, and identifying Magic: The Gathering cards. Trains a custom object detection model, then chains detection → OCR → card lookup → art matching to identify cards down to the exact printing.
Key result: 96.7% mAP50 detection accuracy. A photo of Krenko, Mob Boss is correctly identified as the Ravnica Remastered #335 printing (0.9515 cosine similarity).
Tech stack: YOLOv11 (Ultralytics) · DINOv2 (Meta) · RapidOCR · Scryfall API · Roboflow · FastAPI · vanilla JS
- System Architecture
- Dataset
- Model Training
- Card Identification Pipeline
- Web Application
- Platforms and Services
- Results and Limitations
- References
The system has two tracks: a training track (offline, one-time) and an inference track (online, per-request).
flowchart LR
subgraph Training["Training Track (offline)"]
RF[Roboflow Dataset<br/>4,065 images] --> YOLO[YOLOv11n Training<br/>100 epochs]
YOLO --> WEIGHTS[best.pt<br/>2.6M params]
WEIGHTS --> DEPLOY[Deploy to<br/>Roboflow API]
end
subgraph Inference["Inference Track (per-request)"]
IMG[Input Image] --> DET[Stage 1: Detection<br/>Roboflow API]
DET --> OCR_S[Stage 2: OCR<br/>RapidOCR]
OCR_S --> SF[Stage 3: Card Lookup<br/>Scryfall API]
SF --> DINO[Stage 4: Art Match<br/>DINOv2 ViT-S/14]
DINO --> RESULT[Card Name +<br/>Exact Printing]
end
DEPLOY -.-> DET
| Component | File(s) | Responsibility | Dependencies |
|---|---|---|---|
| Dataset download | `scripts/01_setup_dataset.py` | Fetch annotated images from Roboflow | roboflow |
| Data exploration | `scripts/02_explore_dataset.py` | Visualize class distribution, quality checks | opencv, matplotlib |
| Local training | `scripts/03_train.py` | Train YOLOv11n on Apple Silicon CPU | ultralytics, torch |
| Cloud training | `scripts/train_cloud*.py` | Train on RunPod GPU (v1/v2/v3 experiments) | ultralytics, roboflow |
| Validation | `scripts/04_validate.py` | Compute per-class mAP, PR curves | ultralytics |
| Batch predict | `scripts/05_predict.py` | Run inference on image files | ultralytics |
| Live detection | `scripts/06_live_detect.py` | Webcam detection with bounding boxes | ultralytics, opencv |
| Test image download | `scripts/07_download_test_images.py` | Multilingual card images from Scryfall | stdlib only |
| Label export | `scripts/08_export_for_correction.py` | Export predictions for Label Studio | ultralytics |
| Card identification | `scripts/09_identify_card.py` | CLI: detect → OCR → Scryfall | ultralytics, rapidocr |
| Live identification | `scripts/10_live_identify.py` | Webcam: detect → OCR → Scryfall + info panel | ultralytics, rapidocr, opencv |
| Cloud pipeline | `scripts/run_cloud.py` | All steps combined for RunPod | all above |
| Web server | `web/app.py` | FastAPI: orchestrates 4-stage pipeline | fastapi, httpx |
| Detection service | `web/services/detection.py` | Roboflow hosted inference API client | httpx |
| OCR service | `web/services/ocr.py` | RapidOCR wrapper for title extraction | rapidocr-onnxruntime |
| Scryfall service | `web/services/scryfall.py` | Card lookup + printings pagination | httpx |
| Art matching | `web/services/image_match.py` | DINOv2 embedding comparison | torch, torchvision, PIL |
| Frontend | `web/static/` | Upload mode, live camera, card panel | vanilla JS |
Roboflow Universe project mtg-detection-cixf6 version 8 — 4,065 annotated MTG card images, licensed CC BY 4.0.
| Split | Images | Purpose |
|---|---|---|
| Train | 3,761 | Model learns from these |
| Valid | 223 | Monitors overfitting during training |
| Test | 81 | Final unbiased evaluation |
Seven detection classes (indexed 0–6):
| ID | Class | Region | Typical Size |
|---|---|---|---|
| 0 | `art` | Card illustration | Large |
| 1 | `card` | Full card boundary | Very large |
| 2 | `description` | Rules text box | Large |
| 3 | `mana-cost` | Mana symbols (top right) | Small |
| 4 | `power` | Power/toughness (bottom right) | Small |
| 5 | `tags` | Type line (e.g., "Creature — Dragon") | Medium |
| 6 | `title` | Card name (top center) | Medium |
The 19-point gap between mAP50 (96.7%) and mAP50-95 (77.7%) is characteristic of annotation noise: bounding boxes in the training data aren't pixel-perfect. Small classes (mana-cost, power) suffer most because even a few pixels of imprecision cause proportionally large IoU drops. Further mAP50-95 gains require tighter annotations, not bigger models.
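To make the small-box sensitivity concrete, the sketch below applies the same 3-pixel offset to a large box and a small box and compares the resulting IoU. The box sizes (300×400 px and 30×20 px) are illustrative assumptions, not measurements from the dataset.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def shifted(box, d):
    """Return the box moved by d pixels along both axes."""
    x1, y1, x2, y2 = box
    return (x1 + d, y1 + d, x2 + d, y2 + d)


large = (0, 0, 300, 400)  # assumed size of an art-like region
small = (0, 0, 30, 20)    # assumed size of a mana-cost-like region

iou_large = iou(large, shifted(large, 3))
iou_small = iou(small, shifted(small, 3))
print(f"large box IoU: {iou_large:.3f}")
print(f"small box IoU: {iou_small:.3f}")
```

With these numbers the large box stays above 0.95 IoU, while the small box falls to roughly 0.62. That still counts as a hit at the 0.5 threshold used by mAP50, but it misses the stricter thresholds that are averaged into mAP50-95.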
Details: metrics-guide.md
YOLOv11n (nano) — 2.6M parameters, ~5.4 MB. Pretrained on COCO (80 everyday object classes), then fine-tuned on the MTG dataset via transfer learning.
Architecture: CSP-Net backbone → FPN+PAN neck → 3-scale detection head (80×80 / 40×40 / 20×20 grid).
Architecture details: architecture.md#yolo-model-architecture
Training parameters for scripts/03_train.py. Explicitly-passed parameters are noted; others are YOLO defaults applied automatically (visible in runs/mtg-detect/args.yaml after training):
| Category | Parameter | Value | Justification |
|---|---|---|---|
| Core | `epochs` | 100 | Sufficient for nano model convergence |
| | `batch` | 16 | Fits in 16 GB unified memory |
| | `imgsz` | 640 | Standard YOLO resolution, good speed/accuracy balance |
| | `patience` | 20 | Stop early if no improvement for 20 epochs |
| | `device` | `cpu` | MPS has training bugs in PyTorch 2.10 on macOS 26 |
| | `workers` | 8 | Parallel data loading |
| | `save_period` | 25 | Checkpoint every 25 epochs |
| Optimizer | `optimizer` | AdamW | Better generalization than SGD for small datasets |
| | `lr0` | 0.001 | Standard for AdamW with YOLO |
| | `lrf` | 0.01 | Final LR = 0.001 × 0.01 = 0.00001 |
| | `cos_lr` | true | Cosine annealing for smooth LR decay |
| Geometric Aug | `degrees` | 15.0 | Cards held at angles up to ~15° |
| | `perspective` | 0.001 | Simulates keystoning from angled holding |
| | `shear` | 2.0 | Mild perspective variety |
| Scale Aug | `multi_scale` | 0.5 | Train at 320–960px for resolution invariance |
| Color Aug | `hsv_h` | 0.015 | Hue shift (YOLO default) |
| | `hsv_s` | 0.7 | Saturation shift (YOLO default) |
| | `hsv_v` | 0.4 | Brightness shift (YOLO default) |
| Composition | `mosaic` | 1.0 | 100% mosaic: 4 images combined per training sample |
| | `mixup` | 0.05 | 5% image blending, reduces background false positives |
| | `fliplr` | 0.5 | 50% horizontal flip (YOLO default) |
| | `erasing` | 0.4 | Random erasing for occlusion robustness |
| Loss Weights | `box` | 7.5 | Bounding box regression (YOLO default) |
| | `cls` | 0.5 | Classification (YOLO default) |
| | `dfl` | 1.5 | Distribution focal loss (YOLO default) |
Full parameter reference: parameters.md · Augmentation theory: training-strategies.md
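The `lr0`, `lrf`, and `cos_lr` settings together define a learning-rate schedule. The sketch below uses the textbook cosine annealing formula, which is assumed to approximate what the trainer does; warmup and other framework-specific details are ignored, and `cosine_lr` is a hypothetical helper name.

```python
import math

LR0 = 0.001   # initial learning rate (lr0)
LRF = 0.01    # final learning rate as a fraction of lr0 (lrf)
EPOCHS = 100


def cosine_lr(epoch, lr0=LR0, lrf=LRF, epochs=EPOCHS):
    """Learning rate at a given epoch, annealed from lr0 down to lr0 * lrf."""
    progress = epoch / epochs
    # Factor goes smoothly from 1.0 (start) to lrf (end).
    factor = lrf + (1.0 - lrf) * 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr0 * factor


for epoch in (0, 25, 50, 75, 100):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.6f}")
```

Under this formula the rate starts at 0.001, passes 0.000505 at the midpoint, and ends at 0.00001, matching the final value stated in the table.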
Three iterations on RunPod (RTX 4090):
| Experiment | Script | Model | Resolution | Epochs | Key Changes | mAP50 | mAP50-95 | Cost |
|---|---|---|---|---|---|---|---|---|
| v1 | `train_cloud.py` | yolo11n | 640 | 100 | Baseline | 96.7% | 77.7% | $0.73 |
| v2-quick | `train_cloud_v2.py` | yolo11n | 1280 | 150 | Resolution ↑ | 96.2% | 74.8% | ~$0.90 |
| v2-balanced | `train_cloud_v2.py` | yolo11s | 1280 | 200 | Larger model | — | — | ~$1.80 |
| v3-final | `train_cloud_v3.py` | yolo11m | 1280 | 250 | copy_paste=0.3, close_mosaic=30 | 96.1% | 77.2% | ~$5.60 |
Takeaway: The nano model at 640px (v1) remains the best balanced result. Higher resolution helped small objects but destabilized the card class. Larger models didn't reliably beat nano — annotation quality is the ceiling.
Experiment log: training-v2-status.md · Cost analysis: training-cost-analysis.md
Training on Apple Silicon uses CPU (device="cpu") because PyTorch 2.10's MPS backend has tensor corruption bugs on macOS 26 — specifically clamp_() operations and TAL assigner indexing during training loss computation. MPS works correctly for inference (validation, prediction).
CPU training on Apple Silicon is still fast (~15–30 min on M3/M4 Pro) thanks to high-bandwidth unified memory (~200–400 GB/s on Pro/Max chips) and NEON SIMD instructions.
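A minimal sketch of how the local training call might look with the core and optimizer settings documented above, including the CPU device choice. The weights file name `yolo11n.pt` and the dataset path `data.yaml` are assumptions, and the real `scripts/03_train.py` may be organized differently. The Ultralytics import is deferred so the argument dictionary can be inspected without the package installed.

```python
def build_train_args(data_yaml="data.yaml"):
    """Collect a subset of the documented training settings as keyword arguments."""
    return {
        "data": data_yaml,       # hypothetical dataset config path
        "epochs": 100,
        "batch": 16,
        "imgsz": 640,
        "patience": 20,
        "device": "cpu",         # MPS is avoided for training, per the text
        "workers": 8,
        "save_period": 25,
        "optimizer": "AdamW",
        "lr0": 0.001,
        "lrf": 0.01,
        "cos_lr": True,
    }


def train(data_yaml="data.yaml"):
    """Fine-tune the COCO-pretrained nano model on the MTG dataset."""
    # Imported lazily: requires the ultralytics package.
    from ultralytics import YOLO

    model = YOLO("yolo11n.pt")   # assumed pretrained weights file name
    return model.train(**build_train_args(data_yaml))


print(build_train_args())
```

Calling `train()` would start the run. Since the text reports MPS as working for inference, validation and prediction calls could presumably pass `device="mps"` instead.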
v1 — yolo11n @ 640px (96.7% mAP50 / 77.7% mAP50-95):
| Class | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|
| art | 0.963 | 0.972 | 0.982 | 0.918 |
| card | 0.967 | 0.969 | 0.959 | 0.880 |
| description | 0.955 | 0.961 | 0.974 | 0.851 |
| mana-cost | 0.954 | 0.847 | 0.959 | 0.710 |
| power | 0.824 | 0.919 | 0.937 | 0.703 |
| tags | 0.984 | 0.964 | 0.983 | 0.651 |
| title | 0.986 | 0.949 | 0.974 | 0.727 |
| ALL | 0.948 | 0.940 | 0.967 | 0.777 |
Large regions (art, card, description) score highest because small positioning errors barely affect IoU. Small regions (mana-cost, power) and the thin title and tags strips score lower on the strict mAP50-95 metric; tags is lowest at 0.651.
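As a sanity check on the table, the ALL row can be reproduced as the unweighted mean of the seven per-class rows. The values below are copied from the table; the helper name `column_means` is hypothetical.

```python
per_class = {
    # class: (precision, recall, mAP50, mAP50-95)
    "art":         (0.963, 0.972, 0.982, 0.918),
    "card":        (0.967, 0.969, 0.959, 0.880),
    "description": (0.955, 0.961, 0.974, 0.851),
    "mana-cost":   (0.954, 0.847, 0.959, 0.710),
    "power":       (0.824, 0.919, 0.937, 0.703),
    "tags":        (0.984, 0.964, 0.983, 0.651),
    "title":       (0.986, 0.949, 0.974, 0.727),
}


def column_means(rows):
    """Average each metric column over all classes, rounded to 3 decimals."""
    columns = zip(*rows.values())
    return [round(sum(col) / len(col), 3) for col in columns]


means = column_means(per_class)
weakest = min(per_class, key=lambda name: per_class[name][3])
print("precision, recall, mAP50, mAP50-95:", means)
print("lowest mAP50-95:", weakest)
```

The means come out to 0.948, 0.940, 0.967, and 0.777, which agrees with the ALL row.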
The centerpiece of the system — a 4-stage pipeline that takes a photo and returns the card's name, metadata, and exact printing.
sequenceDiagram
participant Browser
participant FastAPI as FastAPI Server
participant Roboflow as Roboflow API
participant OCR as RapidOCR
participant Scryfall as Scryfall API
participant DINOv2 as DINOv2 ViT-S/14
Browser->>FastAPI: POST /api/detect (image)
FastAPI->>Roboflow: detect_cards(image, confidence=0.25, overlap=0.45)
Roboflow-->>FastAPI: predictions [{class, bbox, confidence}]
FastAPI->>FastAPI: find_best_title_box() → crop title region
FastAPI->>OCR: ocr_image(title_crop)
OCR-->>FastAPI: "Krenko, Mob Boss"
FastAPI->>Scryfall: GET /cards/named?fuzzy=Krenko,+Mob+Boss
Scryfall-->>FastAPI: card JSON (name, type, oracle, prices)
FastAPI->>Scryfall: GET prints_search_uri (paginated)
Scryfall-->>FastAPI: 50 printings [{set, number, art_crop_uri}]
FastAPI->>FastAPI: find_best_art_box() → crop art region
FastAPI->>DINOv2: embed(art_crop) → 384-dim vector
loop Each printing art_crop
FastAPI->>Scryfall: Download art_crop image (batched 10)
FastAPI->>DINOv2: embed(printing_art) → 384-dim vector
FastAPI->>FastAPI: cosine_similarity(input, printing)
end
DINOv2-->>FastAPI: best match: RVR #335 (score=0.9515)
FastAPI-->>Browser: {card, printings, matched_printing_index, annotated_image}
Service: web/services/detection.py
Sends the image to the Roboflow hosted inference API, which runs the deployed YOLOv11n model.
| Parameter | Value | Purpose |
|---|---|---|
| API endpoint | `https://detect.roboflow.com/mtg-detection-cixf6/8` | Roboflow model v8 |
| `confidence` | 0.25 | Minimum detection confidence |
| `overlap` | 0.45 | NMS overlap threshold |
| Timeout | 30s | HTTP request timeout |
Returns predictions in center-format: {class, confidence, x, y, width, height} where x,y are center coordinates in pixels.
The service also provides helper functions:
- `find_best_title_box()`: highest-confidence title prediction (≥0.3)
- `find_best_art_box()`: highest-confidence art prediction (≥0.3)
- `crop_box_from_image()`: extract a bounding box region as JPEG bytes
- `draw_detections()`: render colored bounding boxes on the image
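The center-format predictions and the "highest confidence above 0.3" selection can be sketched as follows. The function names `center_to_corners` and `find_best_box` are hypothetical stand-ins for the service helpers, and the sample predictions are invented for illustration.

```python
def center_to_corners(pred):
    """Convert a center-format prediction to integer (x1, y1, x2, y2) corners."""
    half_w, half_h = pred["width"] / 2, pred["height"] / 2
    return (
        int(pred["x"] - half_w),
        int(pred["y"] - half_h),
        int(pred["x"] + half_w),
        int(pred["y"] + half_h),
    )


def find_best_box(predictions, class_name, min_confidence=0.3):
    """Return the highest-confidence prediction of one class, or None."""
    candidates = [
        p for p in predictions
        if p["class"] == class_name and p["confidence"] >= min_confidence
    ]
    return max(candidates, key=lambda p: p["confidence"], default=None)


# Invented sample predictions in the documented center format.
predictions = [
    {"class": "title", "confidence": 0.91, "x": 200, "y": 40, "width": 300, "height": 30},
    {"class": "title", "confidence": 0.28, "x": 210, "y": 45, "width": 280, "height": 28},
    {"class": "art", "confidence": 0.88, "x": 200, "y": 180, "width": 320, "height": 220},
]

best_title = find_best_box(predictions, "title")
corners = center_to_corners(best_title)
print(best_title["confidence"], corners)
```

The corner tuple is what an image library typically needs for cropping, so a conversion like this would sit between detection and the OCR stage.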
Service: web/services/ocr.py
Runs RapidOCR (ONNX Runtime backend) on the cropped title region.
| Parameter | Value | Purpose |
|---|---|---|
| Engine | RapidOCR | ONNX-based, no GPU required |
| Input | Cropped title region (JPEG bytes) | From Stage 1 detection |
| Min text length | 2 characters | Filter noise/artifacts |
| Warm-up | On server startup (`lifespan`) | Avoid cold-start latency |
The engine is lazily initialized and reused across requests. The OCR result is a single string (all detected text lines joined with spaces).
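The joining and noise-filtering step might look like the sketch below. It assumes the OCR engine returns a list of `[box, text, score]` entries (or `None` when nothing is found), and that the 2-character minimum is applied per text line; both are assumptions, as is the helper name `join_ocr_lines`. The sample result is invented.

```python
def join_ocr_lines(result, min_length=2):
    """Join recognized text lines into one string, dropping very short fragments."""
    if not result:
        return ""
    texts = [entry[1].strip() for entry in result]
    kept = [text for text in texts if len(text) >= min_length]
    return " ".join(kept)


# Invented OCR output: two real fragments plus a one-character artifact.
fake_result = [
    [[[0, 0], [120, 0], [120, 20], [0, 20]], "Krenko,", 0.97],
    [[[125, 0], [260, 0], [260, 20], [125, 20]], "Mob Boss", 0.95],
    [[[265, 0], [270, 0], [270, 20], [265, 20]], "|", 0.41],
]

title = join_ocr_lines(fake_result)
print(title)
```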
Service: web/services/scryfall.py
Two API calls:
Fuzzy name search — GET https://api.scryfall.com/cards/named?fuzzy={ocr_text}
- Scryfall's fuzzy matching handles OCR typos (e.g., "Krenk0" → "Krenko, Mob Boss")
- Returns full card JSON with name, mana cost, type line, oracle text, prices, image URIs
- User-Agent: `MTGDetectionProject/1.0`
- Timeout: 10s
Printings pagination — follows the card's prints_search_uri to fetch all printings
- Respects Scryfall rate limits (100ms sleep between pages)
- Caps at `MAX_PRINTINGS = 50` to bound response time
- Extracts per-printing: name, set, collector number, release date, image URIs (small, normal, art_crop), Scryfall URI
- Handles double-faced cards (falls back to `card_faces[0]` image URIs)
extract_card_info() normalizes the Scryfall response for the frontend: name, mana cost, type line, oracle text, P/T, loyalty, rarity, set, prices (USD/foil), image URI, Scryfall link, and prints_search_uri for lazy-loading printings.
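The lookup steps above can be sketched without any network access by injecting the page fetcher. The helper names (`fuzzy_url`, `image_uris`, `collect_printings`) are hypothetical; the `data`, `has_more`, and `next_page` fields follow Scryfall's list object format, and the fake pages are invented.

```python
import time
from urllib.parse import urlencode

SCRYFALL_NAMED = "https://api.scryfall.com/cards/named"
MAX_PRINTINGS = 50


def fuzzy_url(ocr_text):
    """Build the fuzzy name search URL for a piece of OCR text."""
    return f"{SCRYFALL_NAMED}?{urlencode({'fuzzy': ocr_text})}"


def image_uris(card):
    """Use top-level image_uris, falling back to the front face of a double-faced card."""
    if "image_uris" in card:
        return card["image_uris"]
    return card["card_faces"][0]["image_uris"]


def collect_printings(fetch_page, first_url, limit=MAX_PRINTINGS, pause=0.1):
    """Follow next_page links until the list ends or the cap is reached.

    fetch_page is any callable mapping a URL to a decoded list object.
    """
    printings, url = [], first_url
    while url and len(printings) < limit:
        page = fetch_page(url)
        printings.extend(page["data"])
        url = page.get("next_page") if page.get("has_more") else None
        if url:
            time.sleep(pause)  # stay polite between pages
    return printings[:limit]


# Invented two-page response used in place of real HTTP calls.
fake_pages = {
    "page-1": {"data": [{"set": "set-a"}, {"set": "set-b"}], "has_more": True, "next_page": "page-2"},
    "page-2": {"data": [{"set": "set-c"}], "has_more": False},
}
printings = collect_printings(fake_pages.__getitem__, "page-1", pause=0)
print(fuzzy_url("Krenko, Mob Boss"), len(printings))
```

Passing the fetcher in as a callable keeps the pagination logic testable; in the real service it would presumably wrap the shared HTTP client.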
Service: web/services/image_match.py
Compares the detected art region against printing art crops using DINOv2 embeddings to find the exact printing.
| Property | Value |
|---|---|
| Architecture | DINOv2 ViT-S/14 (Vision Transformer, Small, patch size 14) |
| Parameters | 21M |
| Size | ~86 MB |
| Training | Self-supervised (no labels needed) |
| Source | torch.hub.load("facebookresearch/dinov2", "dinov2_vits14") |
| Embedding dim | 384 (CLS token) |
Standard ImageNet preprocessing, applied to both the input art crop and each printing art image:
Resize(256, BICUBIC) → CenterCrop(224) → ToTensor() → Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
)
- Forward pass through DINOv2 with `@torch.inference_mode()`
- Extract `x_norm_clstoken` from `forward_features()` output
- L2-normalize the 384-dim vector
- Cosine similarity = dot product of two normalized vectors
- Best match = highest similarity above `MIN_SIMILARITY = 0.5`
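The normalization, similarity, and thresholding steps can be illustrated in plain Python. This sketch uses toy 3-dimensional vectors in place of the 384-dimensional embeddings, and the helper names are hypothetical; the real service operates on tensors.

```python
import math

MIN_SIMILARITY = 0.5


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Dot product of two vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


def best_match(query, candidates, min_similarity=MIN_SIMILARITY):
    """Return (index, score) of the most similar candidate.

    The index is None when no candidate scores above the threshold.
    """
    q = l2_normalize(query)
    scores = [cosine_similarity(q, l2_normalize(c)) for c in candidates]
    if not scores:
        return None, 0.0
    index = max(range(len(scores)), key=scores.__getitem__)
    if scores[index] <= min_similarity:
        return None, scores[index]
    return index, scores[index]


# Toy embeddings: the first candidate is nearly parallel to the query.
query = [1.0, 2.0, 2.0]
candidates = [[2.0, 4.0, 4.1], [-1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
index, score = best_match(query, candidates)
print(index, round(score, 4))
```

Because both vectors are normalized first, the dot product alone gives the cosine, which is why the score is bounded by 1.0 and a fixed threshold such as 0.5 is meaningful.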
Printing art images are downloaded from Scryfall in batches:
| Parameter | Value | Purpose |
|---|---|---|
| `DOWNLOAD_BATCH_SIZE` | 10 | Concurrent downloads per batch |
| Inter-batch pause | 50ms (`asyncio.sleep(0.05)`) | Respect Scryfall rate limits |
| `DOWNLOAD_TIMEOUT` | 5.0s | Per-image timeout |
| Fallback | `art_crop_uri` → `image_uri` | Some printings lack art crops |
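The batching behavior in the table might be implemented along these lines. This is a sketch with an injected `fetch` coroutine so it runs without network access; `download_all` and `fake_fetch` are hypothetical names, and treating failed downloads as `None` is an assumption about error handling.

```python
import asyncio

DOWNLOAD_BATCH_SIZE = 10
INTER_BATCH_PAUSE = 0.05  # seconds
DOWNLOAD_TIMEOUT = 5.0    # seconds, per image


async def download_all(urls, fetch, batch_size=DOWNLOAD_BATCH_SIZE,
                       pause=INTER_BATCH_PAUSE, timeout=DOWNLOAD_TIMEOUT):
    """Fetch URLs in concurrent batches; failed or timed-out items become None."""

    async def guarded(url):
        try:
            return await asyncio.wait_for(fetch(url), timeout)
        except Exception:
            return None

    results = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        # gather preserves input order, so results line up with urls.
        results.extend(await asyncio.gather(*(guarded(u) for u in batch)))
        if start + batch_size < len(urls):
            await asyncio.sleep(pause)
    return results


async def fake_fetch(url):
    """Stand-in for an HTTP download; URLs ending in 7 simulate a failure."""
    if url.endswith("7"):
        raise ValueError("simulated download failure")
    return f"bytes-of-{url}".encode()


urls = [f"https://example.invalid/art/{i}" for i in range(25)]
images = asyncio.run(download_all(urls, fake_fetch, pause=0))
print(len(images), images.count(None))
```

Keeping results aligned with the input order matters here, since the index of the best-scoring image has to map back to the corresponding printing.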
DINOv2's self-supervised vision features are robust to:
- Lighting variation — photos taken under different light conditions
- Perspective shift — cards photographed at angles
- Domain gap — comparing a real photo against a digital render
This makes it better suited than pixel-level methods (histogram comparison, template matching), which tend to fail when the photo isn't perfectly aligned or lit.
flowchart TB
subgraph Frontend["Frontend (vanilla JS)"]
HTML[index.html<br/>Tab navigation]
UP[upload.js<br/>File/drag/paste input]
LIVE[live.js<br/>Camera + frame loop]
CP[card-panel.js<br/>Card info + printings gallery]
end
subgraph Backend["Backend (FastAPI)"]
APP[app.py<br/>Route handlers + lifespan]
DET[detection.py<br/>Roboflow API client]
OCR_B[ocr.py<br/>RapidOCR wrapper]
SCR[scryfall.py<br/>Card lookup + printings]
IMG[image_match.py<br/>DINOv2 art matcher]
end
UP --> APP
LIVE --> APP
APP --> DET
APP --> OCR_B
APP --> SCR
APP --> IMG
CP -.-> UP
CP -.-> LIVE
The server is started with:

    ROBOFLOW_API_KEY=your_key uv run uvicorn web.app:app --reload --host 0.0.0.0 --port 8000

On startup (`lifespan`), the server creates a shared `httpx.AsyncClient`, reads the Roboflow API key from the environment, and warms up the OCR engine.
| Method | Path | Purpose | Input | Output |
|---|---|---|---|---|
| `GET` | `/` | Serve `index.html` | — | HTML |
| `GET` | `/api/health` | Health check | — | `{"status": "ok"}` |
| `GET` | `/api/config` | Roboflow config for frontend | — | `{roboflow_publishable_key, model_id, model_version}` |
| `POST` | `/api/detect` | Full 4-stage pipeline (upload mode) | `image` (multipart file) | `{detections, card, printings, ocr_text, matched_printing_index, annotated_image}` |
| `POST` | `/api/identify` | OCR + Scryfall only (live mode) | `crop` (multipart file) | `{card, ocr_text}` |
| `GET` | `/api/printings` | Lazy-load printings | `prints_search_uri` (query param) | `{printings}` |
POST /api/detect is the main endpoint. It runs all 4 stages sequentially:
- Sends the image to Roboflow for detection
- Crops the best title box and OCRs it
- Looks up the card on Scryfall and fetches printings
- Crops the best art box and runs DINOv2 art matching against all printings
- Returns the annotated image (base64), detections, card info, printings, and matched printing index
POST /api/identify is a lightweight endpoint for live camera mode — accepts a pre-cropped title region, runs OCR + Scryfall lookup, and returns card info without detection or art matching.
GET /api/printings enables lazy-loading printings in the frontend when the initial response doesn't include them. Validates that the URI starts with https://api.scryfall.com/ to prevent SSRF.
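A sketch of the prefix validation described above. The function name `is_allowed_printings_uri` is hypothetical, and the extra parsed-host check is an addition for defense in depth, not something the source describes.

```python
from urllib.parse import urlsplit

ALLOWED_PREFIX = "https://api.scryfall.com/"


def is_allowed_printings_uri(uri):
    """Accept only HTTPS URLs on the Scryfall API host."""
    # The trailing slash in the prefix rejects hosts such as
    # api.scryfall.com.evil.example, which share the same leading text.
    if not uri.startswith(ALLOWED_PREFIX):
        return False
    # Additional check (an assumption): confirm the parsed host as well.
    parts = urlsplit(uri)
    return parts.scheme == "https" and parts.hostname == "api.scryfall.com"


for candidate in (
    "https://api.scryfall.com/cards/search?q=example",
    "https://api.scryfall.com.evil.example/cards",
    "http://api.scryfall.com/cards",
):
    print(candidate, is_allowed_printings_uri(candidate))
```

Without a check like this, the endpoint would fetch any URL a client supplies, which could let a caller reach internal services through the server.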
File: web/static/js/upload.js
Three input methods:
- File picker: `<input type="file" accept="image/*" capture="environment">`
- Drag & drop: drop zone with visual feedback (`dragenter`/`dragleave` events)
- Clipboard paste: `Ctrl+V`/`Cmd+V` (only active when upload tab is selected)
Flow: select image → show preview → click "Detect Card" → POST /api/detect → show results:
- Annotated image with colored bounding boxes
- Detection stats table (class, confidence bar, bbox size)
- Card info panel with OCR result, card image, metadata, and printings gallery
File: web/static/js/live.js
Uses navigator.mediaDevices.getUserMedia() with 1280×720 resolution, rear camera preferred (facingMode: 'environment').
| Parameter | Value | Purpose |
|---|---|---|
| Frame interval | 1200ms | Time between detection requests |
| JPEG quality | 0.8 (auto), 0.92 (manual capture) | Balance speed vs quality |
The detection loop:
- Captures a frame from `<video>` to an offscreen `<canvas>`
- Converts to JPEG blob
- Sends `POST /api/detect` (skips if a request is already in flight)
- Draws detection boxes on an overlay `<canvas>` (positioned over the video)
- Updates the card panel if a new card is identified
A "Capture Photo" button triggers a one-shot high-quality detection. Camera cleanup (stopLiveCamera) is exposed globally for tab switching.
File: web/static/js/card-panel.js
Shared rendering function renderCardPanel(data, container) used by both upload and live modes:
- OCR result — displayed at the top with the matched printing info
- Card image — from Scryfall (normal size)
- Card metadata — name, mana cost, type line, oracle text, P/T or loyalty, rarity, set, price (USD/foil)
- Scryfall link — direct link to the card page
- Printings gallery — horizontal scrolling thumbnail strip:
  - DINOv2-matched printing highlighted with `printing-selected` class
  - Falls back to Scryfall default match if no art match
  - Click any printing to swap the main card image, set name, and Scryfall link
  - Each printing shows set code, collector number, and a Scryfall link icon
  - Lazy-loaded via `GET /api/printings` if not included in the initial response
| Platform | Role | Cost | Auth Required |
|---|---|---|---|
| Roboflow | Dataset hosting, model deployment, hosted inference API | Free tier available | ROBOFLOW_API_KEY |
| RunPod | Cloud GPU training (RTX 4090) | $0.73–$5.60 per experiment | RunPod account + API key |
| Google Colab | Alternative cloud GPU (free T4) | Free | Google account |
| Apple Silicon | Local CPU training, inference | Free | — |
| Scryfall API | Card data, printings, art crop images | Free, no auth | — |
| Meta DINOv2 | Art matching (ViT-S/14 via `torch.hub`) | Free, no auth | — |
| RapidOCR | ONNX-based OCR for title text extraction | Free, no auth | — |
| Key | Required For | How to Get |
|---|---|---|
| `ROBOFLOW_API_KEY` | Dataset download, model deployment, hosted inference | Free at roboflow.com |
| RunPod API key | Cloud GPU training | runpod.io, configured via runpodctl doctor |
Scryfall, DINOv2, and RapidOCR require no API keys or authentication.
From pyproject.toml:
| Package | Version | Purpose |
|---|---|---|
| `ultralytics` | ≥8.3.0 | YOLOv11 training and inference |
| `roboflow` | ≥1.1.0 | Dataset download and model deployment |
| `opencv-python` | ≥4.8.0 | Image processing, bounding box drawing |
| `matplotlib` | ≥3.8.0 | Dataset visualization plots |
| `numpy` | ≥1.26.0 | Array operations |
| `Pillow` | ≥10.0.0 | Image loading for DINOv2 |
| `rapidocr-onnxruntime` | ≥1.4.0 | OCR engine |
| `tensorboard` | ≥2.14.0 | Training metric visualization |
| `fastapi` | ≥0.115.0 | Web server framework |
| `uvicorn` | ≥0.32.0 | ASGI server |
| `httpx` | ≥0.27.0 | Async HTTP client |
| `python-multipart` | ≥0.0.12 | File upload parsing |
| `torch` | (via ultralytics) | DINOv2 model, tensor operations |
| `torchvision` | (via ultralytics) | Image preprocessing transforms |
A photo of Krenko, Mob Boss taken with a webcam is correctly identified:
- Detection: 7 regions detected (art, card, description, mana-cost, power, tags, title)
- OCR: Title region reads "Krenko, Mob Boss"
- Scryfall: Fuzzy search returns the card with 50 printings
- Art match: DINOv2 identifies Ravnica Remastered #335 with 0.9515 cosine similarity
| Limitation | Cause | Mitigation |
|---|---|---|
| Stylized/artistic fonts may fail OCR | RapidOCR trained on standard fonts | Scryfall fuzzy matching tolerates some OCR errors |
| All printings must be downloaded for art match | DINOv2 compares against each printing's art crop | Capped at 50 printings, batched 10-concurrent downloads |
| Art match latency scales with printing count | Sequential embedding computation | Cards with few printings are fast; popular cards (~50 printings) take 2–5s |
| MPS training bugs on macOS 26 | PyTorch 2.10 `clamp_()` tensor corruption | Use CPU for training (still fast on Apple Silicon) |
| Detection accuracy ceiling ~97% mAP50 | Annotation noise in training data | Tighter annotations via Label Studio correction workflow |
| mAP50-95 lower for small objects | Small bbox errors = large IoU drops | Higher resolution (1280px) helps but has trade-offs |
| Document | Topic |
|---|---|
| architecture.md | Pipeline data flow, file formats, script breakdown, web app |
| concepts.md | ML concepts via software engineering analogies |
| parameters.md | Every training/inference parameter explained |
| training-strategies.md | Academic foundations of augmentation strategies |
| metrics-guide.md | How to interpret detection metrics |
| training-v2-status.md | Cloud training experiment log |
| training-cost-analysis.md | GPU cost comparisons |
| annotation-correction.md | Label Studio correction workflow |
| Resource | URL |
|---|---|
| YOLOv11 (Ultralytics) | https://docs.ultralytics.com/ |
| DINOv2 (Meta) | https://arxiv.org/abs/2304.07193 |
| Scryfall API | https://scryfall.com/docs/api |
| Roboflow | https://docs.roboflow.com/ |
| RapidOCR | https://github.com/RapidAI/RapidOCR |
| ONNX Runtime | https://onnxruntime.ai/ |
| FastAPI | https://fastapi.tiangolo.com/ |