MTG Card Detector — Complete Solution Document

End-to-end system for detecting, reading, and identifying Magic: The Gathering cards. Trains a custom object detection model, then chains detection → OCR → card lookup → art matching to identify cards down to the exact printing.

Key result: 96.7% mAP50 detection accuracy. A photo of Krenko, Mob Boss correctly identifies the Ravnica Remastered #335 printing (0.9515 cosine similarity).

Tech stack: YOLOv11 (Ultralytics) · DINOv2 (Meta) · RapidOCR · Scryfall API · Roboflow · FastAPI · vanilla JS

1. System Architecture
2. Dataset
3. Model Training
4. Card Identification Pipeline
5. Web Application
6. Platforms and Services
7. Results and Limitations
8. References

1. System Architecture

1.1 High-Level Pipeline

The system has two tracks: a training track (offline, one-time) and an inference track (online, per-request).

flowchart LR
    subgraph Training["Training Track (offline)"]
        RF[Roboflow Dataset<br/>4,065 images] --> YOLO[YOLOv11n Training<br/>100 epochs]
        YOLO --> WEIGHTS[best.pt<br/>2.6M params]
        WEIGHTS --> DEPLOY[Deploy to<br/>Roboflow API]
    end

    subgraph Inference["Inference Track (per-request)"]
        IMG[Input Image] --> DET[Stage 1: Detection<br/>Roboflow API]
        DET --> OCR_S[Stage 2: OCR<br/>RapidOCR]
        OCR_S --> SF[Stage 3: Card Lookup<br/>Scryfall API]
        SF --> DINO[Stage 4: Art Match<br/>DINOv2 ViT-S/14]
        DINO --> RESULT[Card Name +<br/>Exact Printing]
    end

    DEPLOY -.-> DET

1.2 Component Map

Component	File(s)	Responsibility	Dependencies
Dataset download	`scripts/01_setup_dataset.py`	Fetch annotated images from Roboflow	`roboflow`
Data exploration	`scripts/02_explore_dataset.py`	Visualize class distribution, quality checks	`opencv`, `matplotlib`
Local training	`scripts/03_train.py`	Train YOLOv11n on Apple Silicon CPU	`ultralytics`, `torch`
Cloud training	`scripts/train_cloud*.py`	Train on RunPod GPU (v1/v2/v3 experiments)	`ultralytics`, `roboflow`
Validation	`scripts/04_validate.py`	Compute per-class mAP, PR curves	`ultralytics`
Batch predict	`scripts/05_predict.py`	Run inference on image files	`ultralytics`
Live detection	`scripts/06_live_detect.py`	Webcam detection with bounding boxes	`ultralytics`, `opencv`
Test image download	`scripts/07_download_test_images.py`	Multilingual card images from Scryfall	stdlib only
Label export	`scripts/08_export_for_correction.py`	Export predictions for Label Studio	`ultralytics`
Card identification	`scripts/09_identify_card.py`	CLI: detect → OCR → Scryfall	`ultralytics`, `rapidocr`
Live identification	`scripts/10_live_identify.py`	Webcam: detect → OCR → Scryfall + info panel	`ultralytics`, `rapidocr`, `opencv`
Cloud pipeline	`scripts/run_cloud.py`	All steps combined for RunPod	all above
Web server	`web/app.py`	FastAPI: orchestrates 4-stage pipeline	`fastapi`, `httpx`
Detection service	`web/services/detection.py`	Roboflow hosted inference API client	`httpx`
OCR service	`web/services/ocr.py`	RapidOCR wrapper for title extraction	`rapidocr-onnxruntime`
Scryfall service	`web/services/scryfall.py`	Card lookup + printings pagination	`httpx`
Art matching	`web/services/image_match.py`	DINOv2 embedding comparison	`torch`, `torchvision`, `PIL`
Frontend	`web/static/`	Upload mode, live camera, card panel	vanilla JS

2. Dataset

2.1 Source

Roboflow Universe project mtg-detection-cixf6 version 8 — 4,065 annotated MTG card images, licensed CC BY 4.0.

2.2 Splits

Split	Images	Purpose
Train	3,761	Model learns from these
Valid	223	Monitors overfitting during training
Test	81	Final unbiased evaluation

2.3 Classes

Seven detection classes (indexed 0–6):

ID	Class	Region	Typical Size
0	`art`	Card illustration	Large
1	`card`	Full card boundary	Very large
2	`description`	Rules text box	Large
3	`mana-cost`	Mana symbols (top right)	Small
4	`power`	Power/toughness (bottom right)	Small
5	`tags`	Type line (e.g., "Creature — Dragon")	Medium
6	`title`	Card name (top center)	Medium

2.4 Annotation Quality

The 19-point gap between mAP50 (96.7%) and mAP50-95 (77.7%) is characteristic of annotation noise: bounding boxes in the training data aren't pixel-perfect. Small classes (mana-cost, power) are especially exposed because even a few pixels of imprecision cause a proportionally large IoU drop. Further mAP50-95 gains likely depend on tighter annotations rather than bigger models (see the experiments in 3.3).
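The size effect described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 30px and 300px box sizes and the 3px offset are assumed values, not measurements from the dataset.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def shifted_iou(size, shift):
    """IoU between a size x size box and the same box shifted diagonally."""
    truth = (0, 0, size, size)
    pred = (shift, shift, size + shift, size + shift)
    return iou(truth, pred)


# Same 3px labeling error, two very different box sizes (assumed sizes).
small = shifted_iou(30, 3)    # roughly mana-cost scale
large = shifted_iou(300, 3)   # roughly art scale
print(f"30px box:  IoU = {small:.3f}")
print(f"300px box: IoU = {large:.3f}")
```

With these numbers the small box lands at an IoU of about 0.681 and the large box at about 0.961. The small box still counts as a hit at the 0.50 threshold but misses every threshold from 0.70 upward, which is the pattern that pulls mAP50-95 down while leaving mAP50 mostly intact.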

Details: metrics-guide.md

3. Model Training

3.1 Model

YOLOv11n (nano) — 2.6M parameters, ~5.4 MB. Pretrained on COCO (80 everyday object classes), then fine-tuned on the MTG dataset via transfer learning.

Architecture: CSP-Net backbone → FPN+PAN neck → 3-scale detection head (80×80 / 40×40 / 20×20 grids at 640px input, i.e. strides 8 / 16 / 32).

Architecture details: architecture.md#yolo-model-architecture

3.2 Training Parameters

Training parameters for scripts/03_train.py. The table lists both explicitly passed parameters and YOLO defaults that are applied automatically (marked "YOLO default" where noted); the full resolved set is visible in runs/mtg-detect/args.yaml after training:

Category	Parameter	Value	Justification
Core	`epochs`	100	Sufficient for nano model convergence
	`batch`	16	Fits in 16 GB unified memory
	`imgsz`	640	Standard YOLO resolution, good speed/accuracy balance
	`patience`	20	Stop early if no improvement for 20 epochs
	`device`	`cpu`	MPS has training bugs in PyTorch 2.10 on macOS 26
	`workers`	8	Parallel data loading
	`save_period`	25	Checkpoint every 25 epochs
Optimizer	`optimizer`	AdamW	Better generalization than SGD for small datasets
	`lr0`	0.001	Standard for AdamW with YOLO
	`lrf`	0.01	Final LR = 0.001 × 0.01 = 0.00001
	`cos_lr`	true	Cosine annealing for smooth LR decay
Geometric Aug	`degrees`	15.0	Cards held at angles up to ~15°
	`perspective`	0.001	Simulates keystoning from angled holding
	`shear`	2.0	Mild perspective variety
Scale Aug	`multi_scale`	0.5	Train at 320–960px for resolution invariance
Color Aug	`hsv_h`	0.015	Hue shift (YOLO default)
	`hsv_s`	0.7	Saturation shift (YOLO default)
	`hsv_v`	0.4	Brightness shift (YOLO default)
Composition	`mosaic`	1.0	100% mosaic — 4 images combined per training sample
	`mixup`	0.05	5% image blending — reduces background false positives
	`fliplr`	0.5	50% horizontal flip (YOLO default)
	`erasing`	0.4	Random erasing — occlusion robustness
Loss Weights	`box`	7.5	Bounding box regression (YOLO default)
	`cls`	0.5	Classification (YOLO default)
	`dfl`	1.5	Distribution focal loss (YOLO default)

Full parameter reference: parameters.md · Augmentation theory: training-strategies.md
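To make the `lr0`, `lrf`, and `cos_lr` rows concrete, here is a minimal sketch of a cosine decay from `lr0` down to `lr0 × lrf`. It is a generic cosine-annealing formula written for illustration; it ignores warmup, and the exact schedule inside Ultralytics may differ in detail.

```python
import math


def cosine_lr(epoch, epochs=100, lr0=0.001, lrf=0.01):
    """Learning rate at `epoch` when decaying from lr0 to lr0 * lrf on a cosine curve."""
    final_lr = lr0 * lrf
    progress = epoch / epochs
    return final_lr + (lr0 - final_lr) * 0.5 * (1 + math.cos(math.pi * progress))


for e in (0, 25, 50, 75, 100):
    print(f"epoch {e:3d}: lr = {cosine_lr(e):.6f}")
```

The curve starts at 0.001, passes through 0.000505 at the midpoint, and ends at 0.00001, matching the final learning rate given in the table.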

3.3 Cloud Training Experiments

Three iterations on RunPod (RTX 4090):

Experiment	Script	Model	Resolution	Epochs	Key Changes	mAP50	mAP50-95	Cost
v1	`train_cloud.py`	yolo11n	640	100	Baseline	96.7%	77.7%	$0.73
v2-quick	`train_cloud_v2.py`	yolo11n	1280	150	Resolution ↑	96.2%	74.8%	~$0.90
v2-balanced	`train_cloud_v2.py`	yolo11s	1280	200	Larger model	—	—	~$1.80
v3-final	`train_cloud_v3.py`	yolo11m	1280	250	copy_paste=0.3, close_mosaic=30	96.1%	77.2%	~$5.60

Takeaway: The nano model at 640px (v1) remains the best balanced result. Higher resolution helped small objects but destabilized the card class. Larger models didn't reliably beat nano — annotation quality is the ceiling.

Experiment log: training-v2-status.md · Cost analysis: training-cost-analysis.md

3.4 Local Training

Training on Apple Silicon uses CPU (device="cpu") because PyTorch 2.10's MPS backend has tensor corruption bugs on macOS 26 — specifically clamp_() operations and TAL assigner indexing during training loss computation. MPS works correctly for inference (validation, prediction).

CPU training on Apple Silicon is still fast (~15–30 min on M3/M4 Pro) thanks to high-bandwidth unified memory (~200–400 GB/s on Pro/Max chips) and NEON SIMD instructions.

3.5 Best Result

v1 — yolo11n @ 640px (96.7% mAP50 / 77.7% mAP50-95):

Class	Precision	Recall	mAP50	mAP50-95
art	0.963	0.972	0.982	0.918
card	0.967	0.969	0.959	0.880
description	0.955	0.961	0.974	0.851
mana-cost	0.954	0.847	0.959	0.710
power	0.824	0.919	0.937	0.703
tags	0.984	0.964	0.983	0.651
title	0.986	0.949	0.974	0.727
ALL	0.948	0.940	0.967	0.777

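Large regions (art, card, description) score highest on the strict mAP50-95 metric because small positioning errors barely affect their IoU. Small regions (mana-cost, power) score lower there, and so do the text strips: tags has the lowest mAP50-95 (0.651) and title reaches 0.727, presumably because these boxes are short in height, so a few pixels of vertical error remove a large share of the overlap.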
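The ALL row in the table is consistent with an unweighted mean over the seven classes. The short check below copies the per-class values from the table and averages each column; the variable and function names are illustrative.

```python
# (precision, recall, mAP50, mAP50-95) per class, copied from the v1 results table.
PER_CLASS = {
    "art":         (0.963, 0.972, 0.982, 0.918),
    "card":        (0.967, 0.969, 0.959, 0.880),
    "description": (0.955, 0.961, 0.974, 0.851),
    "mana-cost":   (0.954, 0.847, 0.959, 0.710),
    "power":       (0.824, 0.919, 0.937, 0.703),
    "tags":        (0.984, 0.964, 0.983, 0.651),
    "title":       (0.986, 0.949, 0.974, 0.727),
}


def macro_average(rows):
    """Average each metric column across classes, giving every class equal weight."""
    n = len(rows)
    return tuple(round(sum(column) / n, 3) for column in zip(*rows.values()))


overall = macro_average(PER_CLASS)
print(overall)  # (0.948, 0.94, 0.967, 0.777)
```

Because every class has equal weight, one weak class moves the headline number as much as a strong one: the 0.651 for tags alone accounts for a noticeable share of the gap between 0.967 and 0.777.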

4. Card Identification Pipeline

The centerpiece of the system — a 4-stage pipeline that takes a photo and returns the card's name, metadata, and exact printing.

4.1 Pipeline Diagram

sequenceDiagram
    participant Browser
    participant FastAPI as FastAPI Server
    participant Roboflow as Roboflow API
    participant OCR as RapidOCR
    participant Scryfall as Scryfall API
    participant DINOv2 as DINOv2 ViT-S/14

    Browser->>FastAPI: POST /api/detect (image)
    FastAPI->>Roboflow: detect_cards(image, confidence=0.25, overlap=0.45)
    Roboflow-->>FastAPI: predictions [{class, bbox, confidence}]
    FastAPI->>FastAPI: find_best_title_box() → crop title region
    FastAPI->>OCR: ocr_image(title_crop)
    OCR-->>FastAPI: "Krenko, Mob Boss"
    FastAPI->>Scryfall: GET /cards/named?fuzzy=Krenko,+Mob+Boss
    Scryfall-->>FastAPI: card JSON (name, type, oracle, prices)
    FastAPI->>Scryfall: GET prints_search_uri (paginated)
    Scryfall-->>FastAPI: 50 printings [{set, number, art_crop_uri}]
    FastAPI->>FastAPI: find_best_art_box() → crop art region
    FastAPI->>DINOv2: embed(art_crop) → 384-dim vector
    loop Each printing art_crop
        FastAPI->>Scryfall: Download art_crop image (batched 10)
        FastAPI->>DINOv2: embed(printing_art) → 384-dim vector
        FastAPI->>FastAPI: cosine_similarity(input, printing)
    end
    DINOv2-->>FastAPI: best match: RVR #335 (score=0.9515)
    FastAPI-->>Browser: {card, printings, matched_printing_index, annotated_image}

4.2 Stage 1: Detection

Service: web/services/detection.py

Sends the image to the Roboflow hosted inference API, which runs the deployed YOLOv11n model.

Parameter	Value	Purpose
API endpoint	`https://detect.roboflow.com/mtg-detection-cixf6/8`	Roboflow model v8
`confidence`	0.25	Minimum detection confidence
`overlap`	0.45	NMS overlap threshold
Timeout	30s	HTTP request timeout

Returns predictions in center-format: {class, confidence, x, y, width, height} where x,y are center coordinates in pixels.
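Cropping needs corner coordinates, so the center-format prediction has to be converted first. A minimal sketch, assuming the prediction is a plain dict with the keys listed above; the function name and the sample values are hypothetical.

```python
def center_to_corners(pred, img_w, img_h):
    """Convert a center-format box to integer (x1, y1, x2, y2), clamped to the image."""
    half_w, half_h = pred["width"] / 2, pred["height"] / 2
    x1 = max(0, round(pred["x"] - half_w))
    y1 = max(0, round(pred["y"] - half_h))
    x2 = min(img_w, round(pred["x"] + half_w))
    y2 = min(img_h, round(pred["y"] + half_h))
    return x1, y1, x2, y2


# Hypothetical title prediction on a 640x480 frame.
title = {"class": "title", "confidence": 0.91,
         "x": 320, "y": 60, "width": 400, "height": 40}
print(center_to_corners(title, 640, 480))  # (120, 40, 520, 80)
```

Clamping matters for boxes near the frame edge, where half the width can push a corner outside the image.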

The service also provides helper functions:

find_best_title_box() — highest-confidence title prediction (≥0.3)
find_best_art_box() — highest-confidence art prediction (≥0.3)
crop_box_from_image() — extract a bounding box region as JPEG bytes
draw_detections() — render colored bounding boxes on the image
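The two `find_best_*_box()` helpers share one selection rule: filter by class, drop anything under the 0.3 confidence floor, keep the highest-scoring box. The sketch below shows that rule as a single generic function; `find_best_box` and its signature are assumptions for illustration, not the service's actual code.

```python
MIN_BOX_CONFIDENCE = 0.3  # confidence floor stated for the title and art helpers


def find_best_box(predictions, class_name, min_confidence=MIN_BOX_CONFIDENCE):
    """Return the highest-confidence prediction of `class_name`, or None if none qualifies."""
    candidates = [
        p for p in predictions
        if p["class"] == class_name and p["confidence"] >= min_confidence
    ]
    return max(candidates, key=lambda p: p["confidence"], default=None)


# Hypothetical detections for one frame.
predictions = [
    {"class": "title", "confidence": 0.28, "x": 300, "y": 50, "width": 380, "height": 36},
    {"class": "title", "confidence": 0.93, "x": 320, "y": 60, "width": 400, "height": 40},
    {"class": "art",   "confidence": 0.97, "x": 320, "y": 220, "width": 420, "height": 300},
]
print(find_best_box(predictions, "title")["confidence"])  # 0.93
print(find_best_box(predictions, "power"))                # None
```

Returning `None` instead of raising lets the caller skip OCR or art matching cleanly when a region was not detected.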

4.3 Stage 2: OCR

Service: web/services/ocr.py

Runs RapidOCR (ONNX Runtime backend) on the cropped title region.

Parameter	Value	Purpose
Engine	RapidOCR	ONNX-based, no GPU required
Input	Cropped title region (JPEG bytes)	From Stage 1 detection
Min text length	2 characters	Filter noise/artifacts
Warm-up	On server startup (`lifespan`)	Avoid cold-start latency

The engine is created once on first use (the startup warm-up triggers this) and reused across requests. The OCR result is a single string: all detected text lines joined with spaces.
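A minimal sketch of the create-once pattern and the line-joining step follows. Two assumptions are labeled in the comments: that the engine result is a list of `(box, text, score)` entries (or `None` when nothing is found), and that the 2-character minimum is applied to the joined string rather than per line. The function names are hypothetical.

```python
from functools import lru_cache

MIN_TEXT_LENGTH = 2  # from the parameter table above


@lru_cache(maxsize=1)
def get_engine():
    """Build the OCR engine on the first call; later calls reuse the cached instance."""
    # Imported here so the model files load only when OCR is first needed.
    from rapidocr_onnxruntime import RapidOCR
    return RapidOCR()


def title_from_result(result):
    """Join recognized lines with spaces; return None if nothing usable was read.

    Assumes `result` is a list of (box, text, score) entries or None.
    Assumes the minimum length applies to the joined text.
    """
    if not result:
        return None
    lines = [entry[1].strip() for entry in result]
    text = " ".join(line for line in lines if line)
    return text if len(text) >= MIN_TEXT_LENGTH else None


# Stand-in result with the box coordinates omitted.
sample = [([], "Krenko,", 0.98), ([], "Mob Boss", 0.97)]
print(title_from_result(sample))  # Krenko, Mob Boss
```

Calling `get_engine()` once during startup is enough to move the model-loading cost out of the first user request.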

4.4 Stage 3: Card Lookup

Service: web/services/scryfall.py

Two API calls:

Fuzzy name search — GET https://api.scryfall.com/cards/named?fuzzy={ocr_text}

Scryfall's fuzzy matching handles OCR typos (e.g., "Krenk0" → "Krenko, Mob Boss")
Returns full card JSON with name, mana cost, type line, oracle text, prices, image URIs
User-Agent: MTGDetectionProject/1.0
Timeout: 10s

Printings pagination — follows the card's prints_search_uri to fetch all printings

Respects Scryfall rate limits (100ms sleep between pages)
Caps at MAX_PRINTINGS = 50 to bound response time
Extracts per-printing: name, set, collector number, release date, image URIs (small, normal, art_crop), Scryfall URI
Handles double-faced cards (falls back to card_faces[0] image URIs)
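The two calls can be sketched as below. The HTTP layer is injected as a `fetch_json` callable so the paging logic can be read and tested without network access; that callable, the function names, and `PAGE_DELAY_SECONDS` are assumptions for illustration. The `data`, `has_more`, and `next_page` fields follow Scryfall's list object format.

```python
import time
from urllib.parse import urlencode

SCRYFALL_NAMED_URL = "https://api.scryfall.com/cards/named"
MAX_PRINTINGS = 50
PAGE_DELAY_SECONDS = 0.1  # the 100ms pause between pages


def fuzzy_lookup_url(ocr_text):
    """Build the fuzzy name search URL with the OCR text safely encoded."""
    return f"{SCRYFALL_NAMED_URL}?{urlencode({'fuzzy': ocr_text})}"


def image_uris_for(printing):
    """Return image URIs, falling back to the front face of a double-faced card."""
    uris = printing.get("image_uris")
    if uris is None and printing.get("card_faces"):
        uris = printing["card_faces"][0].get("image_uris")
    return uris or {}


def collect_printings(first_page_url, fetch_json, sleep=time.sleep):
    """Follow next_page links until the list ends or MAX_PRINTINGS is reached."""
    printings, url = [], first_page_url
    while url and len(printings) < MAX_PRINTINGS:
        page = fetch_json(url)  # hypothetical callable: URL -> parsed JSON dict
        printings.extend(page.get("data", []))
        url = page.get("next_page") if page.get("has_more") else None
        if url and len(printings) < MAX_PRINTINGS:
            sleep(PAGE_DELAY_SECONDS)  # pause only when another page will be fetched
    return printings[:MAX_PRINTINGS]


print(fuzzy_lookup_url("Krenko, Mob Boss"))
```

`urlencode` percent-encodes the comma (`%2C`), which is equivalent to the literal comma shown in the sequence diagram.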

extract_card_info() normalizes the Scryfall response for the frontend: name, mana cost, type line, oracle text, P/T, loyalty, rarity, set, prices (USD/foil), image URI, Scryfall link, and prints_search_uri for lazy-loading printings.

4.5 Stage 4: Art Matching

Service: web/services/image_match.py

Compares the detected art region against printing art crops using DINOv2 embeddings to find the exact printing.

Model

Property	Value
Architecture	DINOv2 ViT-S/14 (Vision Transformer, Small, patch size 14)
Parameters	21M
Size	~86 MB
Training	Self-supervised (no labels needed)
Source	`torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")`
Embedding dim	384 (CLS token)

Preprocessing

Standard ImageNet preprocessing, applied to both the input art crop and each printing art image:

Resize(256, BICUBIC) → CenterCrop(224) → ToTensor() → Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)

Embedding and Similarity

Forward pass through DINOv2 with @torch.inference_mode()
Extract x_norm_clstoken from forward_features() output
L2-normalize the 384-dim vector
Cosine similarity = dot product of two normalized vectors
Best match = highest similarity above MIN_SIMILARITY = 0.5
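The similarity step itself is small enough to show in plain Python. This sketch uses toy 3-dimensional vectors in place of the 384-dimensional DINOv2 embeddings, and the function names are illustrative rather than taken from image_match.py.

```python
import math

MIN_SIMILARITY = 0.5  # threshold stated above


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Cosine similarity as the dot product of the two normalized vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def best_match(query, candidates, min_similarity=MIN_SIMILARITY):
    """Return (index, score) of the most similar candidate, or None if below threshold."""
    if not candidates:
        return None
    scores = [cosine_similarity(query, c) for c in candidates]
    index = max(range(len(scores)), key=scores.__getitem__)
    return (index, scores[index]) if scores[index] >= min_similarity else None


# Toy embeddings: the second candidate points almost the same way as the query.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
print(best_match(query, candidates))
```

In practice the printing embeddings would be normalized once when computed, so each comparison reduces to a single dot product.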

Download Strategy

Printing art images are downloaded from Scryfall in batches:

Parameter	Value	Purpose
`DOWNLOAD_BATCH_SIZE`	10	Concurrent downloads per batch
Inter-batch pause	50ms (`asyncio.sleep(0.05)`)	Respect Scryfall rate limits
`DOWNLOAD_TIMEOUT`	5.0s	Per-image timeout
Fallback	`art_crop_uri` → `image_uri`	Some printings lack art crops
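A minimal sketch of this batching strategy with `asyncio` follows. The `fetch` argument is a hypothetical async callable (for example, a thin wrapper around the shared HTTP client); the printing dict keys follow the `art_crop_uri` → `image_uri` fallback in the table, and the function names are assumptions.

```python
import asyncio

DOWNLOAD_BATCH_SIZE = 10
DOWNLOAD_TIMEOUT = 5.0
BATCH_PAUSE_SECONDS = 0.05


def pick_art_url(printing):
    """Prefer the art crop; fall back to the full card image."""
    return printing.get("art_crop_uri") or printing.get("image_uri")


async def download_in_batches(printings, fetch):
    """Download art for each printing, 10 at a time; failed items become None."""

    async def safe_fetch(url):
        if url is None:
            return None
        try:
            return await asyncio.wait_for(fetch(url), timeout=DOWNLOAD_TIMEOUT)
        except Exception:
            return None  # one bad image should not abort the whole match

    results = []
    for start in range(0, len(printings), DOWNLOAD_BATCH_SIZE):
        batch = printings[start:start + DOWNLOAD_BATCH_SIZE]
        results.extend(await asyncio.gather(*(safe_fetch(pick_art_url(p)) for p in batch)))
        if start + DOWNLOAD_BATCH_SIZE < len(printings):
            await asyncio.sleep(BATCH_PAUSE_SECONDS)  # pause between batches only
    return results


async def _demo_fetch(url):
    """Stand-in for a real HTTP download."""
    return url.encode()


demo = asyncio.run(download_in_batches(
    [{"art_crop_uri": "a"}, {"image_uri": "b"}, {}], _demo_fetch))
print(demo)  # [b'a', b'b', None]
```

Results keep the same order and length as the input list, so an index into the results maps directly back to a printing.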

Why DINOv2

DINOv2's self-supervised vision features are robust to:

Lighting variation — photos taken under different light conditions
Perspective shift — cards photographed at angles
Domain gap — comparing a real photo against a digital render

This makes it better suited to the task than pixel-level methods (histogram comparison, template matching), which tend to fail when the photo isn't well aligned or evenly lit.

5. Web Application

5.1 Architecture

flowchart TB
    subgraph Frontend["Frontend (vanilla JS)"]
        HTML[index.html<br/>Tab navigation]
        UP[upload.js<br/>File/drag/paste input]
        LIVE[live.js<br/>Camera + frame loop]
        CP[card-panel.js<br/>Card info + printings gallery]
    end

    subgraph Backend["Backend (FastAPI)"]
        APP[app.py<br/>Route handlers + lifespan]
        DET[detection.py<br/>Roboflow API client]
        OCR_B[ocr.py<br/>RapidOCR wrapper]
        SCR[scryfall.py<br/>Card lookup + printings]
        IMG[image_match.py<br/>DINOv2 art matcher]
    end

    UP --> APP
    LIVE --> APP
    APP --> DET
    APP --> OCR_B
    APP --> SCR
    APP --> IMG
    CP -.-> UP
    CP -.-> LIVE

The server is started with:

ROBOFLOW_API_KEY=your_key uv run uvicorn web.app:app --reload --host 0.0.0.0 --port 8000

On startup (lifespan), the server creates a shared httpx.AsyncClient, reads the Roboflow API key from the environment, and warms up the OCR engine.

5.2 API Routes

Method	Path	Purpose	Input	Output
`GET`	`/`	Serve `index.html`	—	HTML
`GET`	`/api/health`	Health check	—	`{"status": "ok"}`
`GET`	`/api/config`	Roboflow config for frontend	—	`{roboflow_publishable_key, model_id, model_version}`
`POST`	`/api/detect`	Full 4-stage pipeline (upload mode)	`image` (multipart file)	`{detections, card, printings, ocr_text, matched_printing_index, annotated_image}`
`POST`	`/api/identify`	OCR + Scryfall only (live mode)	`crop` (multipart file)	`{card, ocr_text}`
`GET`	`/api/printings`	Lazy-load printings	`prints_search_uri` (query param)	`{printings}`

POST /api/detect is the main endpoint. It runs all 4 stages sequentially:

Sends the image to Roboflow for detection
Crops the best title box and OCRs it
Looks up the card on Scryfall and fetches printings
Crops the best art box and runs DINOv2 art matching against all printings
Returns the annotated image (base64), detections, card info, printings, and matched printing index

POST /api/identify is a lightweight endpoint for live camera mode — accepts a pre-cropped title region, runs OCR + Scryfall lookup, and returns card info without detection or art matching.

GET /api/printings enables lazy-loading printings in the frontend when the initial response doesn't include them. Validates that the URI starts with https://api.scryfall.com/ to prevent SSRF.
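The prefix check can be sketched as follows. The trailing slash in the prefix is what makes it safe, since it pins the host; the extra `urlsplit` comparison is a belt-and-braces addition in this sketch, not something the source describes. The function name is hypothetical.

```python
from urllib.parse import urlsplit

ALLOWED_PREFIX = "https://api.scryfall.com/"


def is_allowed_scryfall_uri(uri):
    """Accept only HTTPS URLs whose host is exactly api.scryfall.com."""
    if not uri.startswith(ALLOWED_PREFIX):
        return False
    parts = urlsplit(uri)
    # Redundant with the prefix check, but guards against future prefix edits.
    return parts.scheme == "https" and parts.hostname == "api.scryfall.com"


for candidate in (
    "https://api.scryfall.com/cards/search?order=released&unique=prints",
    "https://api.scryfall.com.evil.example/cards",
    "https://api.scryfall.com@evil.example/",
    "http://api.scryfall.com/cards",
):
    print(is_allowed_scryfall_uri(candidate), candidate)
```

Only the first URL passes. The lookalike host, the userinfo trick, and the plain-HTTP variant all fail because the character after `api.scryfall.com` is not `/` or the scheme does not match.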

5.3 Upload Mode

File: web/static/js/upload.js

Three input methods:

File picker — <input type="file" accept="image/*" capture="environment">
Drag & drop — drop zone with visual feedback (dragenter/dragleave events)
Clipboard paste — Ctrl+V / Cmd+V (only active when upload tab is selected)

Flow: select image → show preview → click "Detect Card" → POST /api/detect → show results:

Annotated image with colored bounding boxes
Detection stats table (class, confidence bar, bbox size)
Card info panel with OCR result, card image, metadata, and printings gallery

5.4 Live Camera Mode

File: web/static/js/live.js

Uses navigator.mediaDevices.getUserMedia() with 1280×720 resolution, rear camera preferred (facingMode: 'environment').

Parameter	Value	Purpose
Frame interval	1200ms	Time between detection requests
JPEG quality	0.8 (auto), 0.92 (manual capture)	Balance speed vs quality

The detection loop:

Captures a frame from <video> to an offscreen <canvas>
Converts to JPEG blob
Sends POST /api/detect (skips if a request is already in flight)
Draws detection boxes on an overlay <canvas> (positioned over the video)
Updates the card panel if a new card is identified

A "Capture Photo" button triggers a one-shot high-quality detection. Camera cleanup (stopLiveCamera) is exposed globally for tab switching.

5.5 Card Info Panel

File: web/static/js/card-panel.js

Shared rendering function renderCardPanel(data, container) used by both upload and live modes:

OCR result — displayed at the top with the matched printing info
Card image — from Scryfall (normal size)
Card metadata — name, mana cost, type line, oracle text, P/T or loyalty, rarity, set, price (USD/foil)
Scryfall link — direct link to the card page
Printings gallery — horizontal scrolling thumbnail strip:
- DINOv2-matched printing highlighted with printing-selected class
- Falls back to Scryfall default match if no art match
- Click any printing to swap the main card image, set name, and Scryfall link
- Each printing shows set code, collector number, and a Scryfall link icon
- Lazy-loaded via GET /api/printings if not included in the initial response

6. Platforms and Services

6.1 Platform Reference

Platform	Role	Cost	Auth Required
Roboflow	Dataset hosting, model deployment, hosted inference API	Free tier available	`ROBOFLOW_API_KEY`
RunPod	Cloud GPU training (RTX 4090)	$0.73–$5.60 per experiment	RunPod account + API key
Google Colab	Alternative cloud GPU (free T4)	Free	Google account
Apple Silicon	Local CPU training, inference	Free	—
Scryfall API	Card data, printings, art crop images	Free, no auth	—
Meta DINOv2	Art matching (ViT-S/14 via `torch.hub`)	Free, no auth	—
RapidOCR	ONNX-based OCR for title text extraction	Free, no auth	—

6.2 API Keys

Key	Required For	How to Get
`ROBOFLOW_API_KEY`	Dataset download, model deployment, hosted inference	Free at roboflow.com
RunPod API key	Cloud GPU training	runpod.io, configured via `runpodctl doctor`

Scryfall, DINOv2, and RapidOCR require no API keys or authentication.

6.3 Dependencies

From pyproject.toml:

Package	Version	Purpose
`ultralytics`	≥8.3.0	YOLOv11 training and inference
`roboflow`	≥1.1.0	Dataset download and model deployment
`opencv-python`	≥4.8.0	Image processing, bounding box drawing
`matplotlib`	≥3.8.0	Dataset visualization plots
`numpy`	≥1.26.0	Array operations
`Pillow`	≥10.0.0	Image loading for DINOv2
`rapidocr-onnxruntime`	≥1.4.0	OCR engine
`tensorboard`	≥2.14.0	Training metric visualization
`fastapi`	≥0.115.0	Web server framework
`uvicorn`	≥0.32.0	ASGI server
`httpx`	≥0.27.0	Async HTTP client
`python-multipart`	≥0.0.12	File upload parsing
`torch`	(via ultralytics)	DINOv2 model, tensor operations
`torchvision`	(via ultralytics)	Image preprocessing transforms

7. Results and Limitations

7.1 End-to-End Result

A photo of Krenko, Mob Boss taken with a webcam is correctly identified:

Detection: 7 regions detected (art, card, description, mana-cost, power, tags, title)
OCR: Title region reads "Krenko, Mob Boss"
Scryfall: Fuzzy search returns the card with 50 printings
Art match: DINOv2 identifies Ravnica Remastered #335 with 0.9515 cosine similarity

7.2 Known Limitations

Limitation	Cause	Mitigation
Stylized/artistic fonts may fail OCR	RapidOCR trained on standard fonts	Scryfall fuzzy matching tolerates some OCR errors
All printings must be downloaded for art match	DINOv2 compares against each printing's art crop	Capped at 50 printings, batched 10-concurrent downloads
Art match latency scales with printing count	Sequential embedding computation	Cards with few printings are fast; popular cards (~50 printings) take 2–5s
MPS training bugs on macOS 26	PyTorch 2.10 `clamp_()` tensor corruption	Use CPU for training (still fast on Apple Silicon)
Detection accuracy ceiling ~97% mAP50	Annotation noise in training data	Tighter annotations via Label Studio correction workflow
mAP50-95 lower for small objects	Small bbox errors = large IoU drops	Higher resolution (1280px) helps but has trade-offs

8. References

Internal Documentation

Document	Topic
architecture.md	Pipeline data flow, file formats, script breakdown, web app
concepts.md	ML concepts via software engineering analogies
parameters.md	Every training/inference parameter explained
training-strategies.md	Academic foundations of augmentation strategies
metrics-guide.md	How to interpret detection metrics
training-v2-status.md	Cloud training experiment log
training-cost-analysis.md	GPU cost comparisons
annotation-correction.md	Label Studio correction workflow

External References

Resource	URL
YOLOv11 (Ultralytics)	https://docs.ultralytics.com/
DINOv2 (Meta)	https://arxiv.org/abs/2304.07193
Scryfall API	https://scryfall.com/docs/api
Roboflow	https://docs.roboflow.com/
RapidOCR	https://github.com/RapidAI/RapidOCR
ONNX Runtime	https://onnxruntime.ai/
FastAPI	https://fastapi.tiangolo.com/
