MTG Card Detector — Complete Solution Document

End-to-end system for detecting, reading, and identifying Magic: The Gathering cards. Trains a custom object detection model, then chains detection → OCR → card lookup → art matching to identify cards down to the exact printing.

Key result: 96.7% mAP50 detection accuracy. A photo of Krenko, Mob Boss is correctly identified as the Ravnica Remastered #335 printing (0.9515 cosine similarity).

Tech stack: YOLOv11 (Ultralytics) · DINOv2 (Meta) · RapidOCR · Scryfall API · Roboflow · FastAPI · vanilla JS


Table of Contents

  1. System Architecture
  2. Dataset
  3. Model Training
  4. Card Identification Pipeline
  5. Web Application
  6. Platforms and Services
  7. Results and Limitations
  8. References

1. System Architecture

1.1 High-Level Pipeline

The system has two tracks: a training track (offline, one-time) and an inference track (online, per-request).

flowchart LR
    subgraph Training["Training Track (offline)"]
        RF[Roboflow Dataset<br/>4,065 images] --> YOLO[YOLOv11n Training<br/>100 epochs]
        YOLO --> WEIGHTS[best.pt<br/>2.6M params]
        WEIGHTS --> DEPLOY[Deploy to<br/>Roboflow API]
    end

    subgraph Inference["Inference Track (per-request)"]
        IMG[Input Image] --> DET[Stage 1: Detection<br/>Roboflow API]
        DET --> OCR_S[Stage 2: OCR<br/>RapidOCR]
        OCR_S --> SF[Stage 3: Card Lookup<br/>Scryfall API]
        SF --> DINO[Stage 4: Art Match<br/>DINOv2 ViT-S/14]
        DINO --> RESULT[Card Name +<br/>Exact Printing]
    end

    DEPLOY -.-> DET

1.2 Component Map

| Component | File(s) | Responsibility | Dependencies |
| --- | --- | --- | --- |
| Dataset download | scripts/01_setup_dataset.py | Fetch annotated images from Roboflow | roboflow |
| Data exploration | scripts/02_explore_dataset.py | Visualize class distribution, quality checks | opencv, matplotlib |
| Local training | scripts/03_train.py | Train YOLOv11n on Apple Silicon CPU | ultralytics, torch |
| Cloud training | scripts/train_cloud*.py | Train on RunPod GPU (v1/v2/v3 experiments) | ultralytics, roboflow |
| Validation | scripts/04_validate.py | Compute per-class mAP, PR curves | ultralytics |
| Batch predict | scripts/05_predict.py | Run inference on image files | ultralytics |
| Live detection | scripts/06_live_detect.py | Webcam detection with bounding boxes | ultralytics, opencv |
| Test image download | scripts/07_download_test_images.py | Multilingual card images from Scryfall | stdlib only |
| Label export | scripts/08_export_for_correction.py | Export predictions for Label Studio | ultralytics |
| Card identification | scripts/09_identify_card.py | CLI: detect → OCR → Scryfall | ultralytics, rapidocr |
| Live identification | scripts/10_live_identify.py | Webcam: detect → OCR → Scryfall + info panel | ultralytics, rapidocr, opencv |
| Cloud pipeline | scripts/run_cloud.py | All steps combined for RunPod | all above |
| Web server | web/app.py | FastAPI: orchestrates 4-stage pipeline | fastapi, httpx |
| Detection service | web/services/detection.py | Roboflow hosted inference API client | httpx |
| OCR service | web/services/ocr.py | RapidOCR wrapper for title extraction | rapidocr-onnxruntime |
| Scryfall service | web/services/scryfall.py | Card lookup + printings pagination | httpx |
| Art matching | web/services/image_match.py | DINOv2 embedding comparison | torch, torchvision, PIL |
| Frontend | web/static/ | Upload mode, live camera, card panel | vanilla JS |

2. Dataset

2.1 Source

Roboflow Universe project mtg-detection-cixf6 version 8 — 4,065 annotated MTG card images, licensed CC BY 4.0.

2.2 Splits

Split Images Purpose
Train 3,761 Model learns from these
Valid 223 Monitors overfitting during training
Test 81 Final unbiased evaluation

2.3 Classes

Seven detection classes (indexed 0–6):

ID Class Region Typical Size
0 art Card illustration Large
1 card Full card boundary Very large
2 description Rules text box Large
3 mana-cost Mana symbols (top right) Small
4 power Power/toughness (bottom right) Small
5 tags Type line (e.g., "Creature — Dragon") Medium
6 title Card name (top center) Medium

2.4 Annotation Quality

The 19-point gap between mAP50 (96.7%) and mAP50-95 (77.7%) is characteristic of annotation noise: bounding boxes in the training data aren't pixel-perfect. Classes with small boxes (mana-cost, power) suffer most because even a few pixels of imprecision cause a proportionally large IoU drop. Further mAP50-95 gains require tighter annotations, not bigger models.
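To make the IoU sensitivity concrete, the sketch below shifts a box by the same 3 pixels at two different sizes. The box sizes and the offset are illustrative assumptions, not measurements from the dataset.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def shifted_iou(width, height, shift):
    """IoU between a box and a copy of it shifted by `shift` pixels in x and y."""
    truth = (0, 0, width, height)
    pred = (shift, shift, width + shift, height + shift)
    return iou(truth, pred)


# Illustrative sizes only: a mana-cost-sized box and a card-sized box.
small = shifted_iou(30, 30, 3)
large = shifted_iou(300, 400, 3)
print(f"small box IoU: {small:.3f}, large box IoU: {large:.3f}")
```

With these numbers the small box drops to an IoU of about 0.68 while the large box stays near 0.97. The small box therefore fails the stricter IoU thresholds that mAP50-95 averages over, even though both still count as hits at IoU 0.5.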

Details: metrics-guide.md


3. Model Training

3.1 Model

YOLOv11n (nano) — 2.6M parameters, ~5.4 MB. Pretrained on COCO (80 everyday object classes), then fine-tuned on the MTG dataset via transfer learning.

Architecture: CSP-Net backbone → FPN+PAN neck → 3-scale detection head (80×80 / 40×40 / 20×20 grid).

Architecture details: architecture.md#yolo-model-architecture

3.2 Training Parameters

Training parameters for scripts/03_train.py. Explicitly-passed parameters are noted; others are YOLO defaults applied automatically (visible in runs/mtg-detect/args.yaml after training):

| Category | Parameter | Value | Justification |
| --- | --- | --- | --- |
| Core | epochs | 100 | Sufficient for nano model convergence |
| | batch | 16 | Fits in 16 GB unified memory |
| | imgsz | 640 | Standard YOLO resolution, good speed/accuracy balance |
| | patience | 20 | Stop early if no improvement for 20 epochs |
| | device | cpu | MPS has training bugs in PyTorch 2.10 on macOS 26 |
| | workers | 8 | Parallel data loading |
| | save_period | 25 | Checkpoint every 25 epochs |
| Optimizer | optimizer | AdamW | Better generalization than SGD for small datasets |
| | lr0 | 0.001 | Standard for AdamW with YOLO |
| | lrf | 0.01 | Final LR = 0.001 × 0.01 = 0.00001 |
| | cos_lr | true | Cosine annealing for smooth LR decay |
| Geometric Aug | degrees | 15.0 | Cards held at angles up to ~15° |
| | perspective | 0.001 | Simulates keystoning from angled holding |
| | shear | 2.0 | Mild perspective variety |
| Scale Aug | multi_scale | 0.5 | Train at 320–960px for resolution invariance |
| Color Aug | hsv_h | 0.015 | Hue shift (YOLO default) |
| | hsv_s | 0.7 | Saturation shift (YOLO default) |
| | hsv_v | 0.4 | Brightness shift (YOLO default) |
| Composition | mosaic | 1.0 | 100% mosaic: 4 images combined per training sample |
| | mixup | 0.05 | 5% image blending; reduces background false positives |
| | fliplr | 0.5 | 50% horizontal flip (YOLO default) |
| | erasing | 0.4 | Random erasing for occlusion robustness |
| Loss Weights | box | 7.5 | Bounding box regression (YOLO default) |
| | cls | 0.5 | Classification (YOLO default) |
| | dfl | 1.5 | Distribution focal loss (YOLO default) |

Full parameter reference: parameters.md · Augmentation theory: training-strategies.md
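As a worked illustration of how lr0, lrf, and cos_lr interact, the sketch below evaluates a textbook cosine-annealing curve with the values from the table. It is an approximation for intuition only: the function name is hypothetical, and the actual Ultralytics scheduler (warmup, per-iteration stepping) may differ in detail.

```python
import math


def cosine_lr(epoch, epochs=100, lr0=0.001, lrf=0.01):
    """Cosine-annealed learning rate: lr0 at epoch 0, lr0 * lrf at the last epoch."""
    final_lr = lr0 * lrf
    # Decays smoothly from 1 to 0 over the run.
    cosine = (1 + math.cos(math.pi * epoch / epochs)) / 2
    return final_lr + (lr0 - final_lr) * cosine


for epoch in (0, 25, 50, 75, 100):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.6f}")
```

The curve starts at 0.001, passes through roughly 0.000505 at the midpoint, and ends at 0.00001, which matches the "Final LR" arithmetic in the lrf row.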

3.3 Cloud Training Experiments

Three iterations on RunPod (RTX 4090):

| Experiment | Script | Model | Resolution | Epochs | Key Changes | mAP50 | mAP50-95 | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| v1 | train_cloud.py | yolo11n | 640 | 100 | Baseline | 96.7% | 77.7% | $0.73 |
| v2-quick | train_cloud_v2.py | yolo11n | 1280 | 150 | Resolution ↑ | 96.2% | 74.8% | ~$0.90 |
| v2-balanced | train_cloud_v2.py | yolo11s | 1280 | 200 | Larger model | | | ~$1.80 |
| v3-final | train_cloud_v3.py | yolo11m | 1280 | 250 | copy_paste=0.3, close_mosaic=30 | 96.1% | 77.2% | ~$5.60 |

Takeaway: The nano model at 640px (v1) remains the best balanced result. Higher resolution helped small objects but destabilized the card class. Larger models didn't reliably beat nano — annotation quality is the ceiling.

Experiment log: training-v2-status.md · Cost analysis: training-cost-analysis.md

3.4 Local Training

Training on Apple Silicon uses CPU (device="cpu") because PyTorch 2.10's MPS backend has tensor corruption bugs on macOS 26 — specifically clamp_() operations and TAL assigner indexing during training loss computation. MPS works correctly for inference (validation, prediction).

CPU training on Apple Silicon is still fast (~15–30 min on M3/M4 Pro) thanks to high-bandwidth unified memory (~200–400 GB/s on Pro/Max chips) and NEON SIMD instructions.

3.5 Best Result

v1 — yolo11n @ 640px (96.7% mAP50 / 77.7% mAP50-95):

Class Precision Recall mAP50 mAP50-95
art 0.963 0.972 0.982 0.918
card 0.967 0.969 0.959 0.880
description 0.955 0.961 0.974 0.851
mana-cost 0.954 0.847 0.959 0.710
power 0.824 0.919 0.937 0.703
tags 0.984 0.964 0.983 0.651
title 0.986 0.949 0.974 0.727
ALL 0.948 0.940 0.967 0.777

Large regions (art, card, description) score highest because small positioning errors barely affect IoU. Small regions (mana-cost, power) score lower on strict metrics.
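The ALL row can be reproduced from the per-class rows. The sketch below assumes the overall mAP is the unweighted mean over the seven classes (the usual definition); the dictionary simply copies the values from the table above.

```python
# class: (mAP50, mAP50-95), copied from the per-class results table
PER_CLASS = {
    "art": (0.982, 0.918),
    "card": (0.959, 0.880),
    "description": (0.974, 0.851),
    "mana-cost": (0.959, 0.710),
    "power": (0.937, 0.703),
    "tags": (0.983, 0.651),
    "title": (0.974, 0.727),
}


def mean_ap(per_class, column):
    """Unweighted mean of one metric column across all classes."""
    values = [metrics[column] for metrics in per_class.values()]
    return sum(values) / len(values)


print(f"mAP50:    {mean_ap(PER_CLASS, 0):.3f}")
print(f"mAP50-95: {mean_ap(PER_CLASS, 1):.3f}")
```

Both means round to the reported 0.967 and 0.777, so each class contributes equally to the headline numbers regardless of how many instances it has.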


4. Card Identification Pipeline

The centerpiece of the system — a 4-stage pipeline that takes a photo and returns the card's name, metadata, and exact printing.

4.1 Pipeline Diagram

sequenceDiagram
    participant Browser
    participant FastAPI as FastAPI Server
    participant Roboflow as Roboflow API
    participant OCR as RapidOCR
    participant Scryfall as Scryfall API
    participant DINOv2 as DINOv2 ViT-S/14

    Browser->>FastAPI: POST /api/detect (image)
    FastAPI->>Roboflow: detect_cards(image, confidence=0.25, overlap=0.45)
    Roboflow-->>FastAPI: predictions [{class, bbox, confidence}]
    FastAPI->>FastAPI: find_best_title_box() → crop title region
    FastAPI->>OCR: ocr_image(title_crop)
    OCR-->>FastAPI: "Krenko, Mob Boss"
    FastAPI->>Scryfall: GET /cards/named?fuzzy=Krenko,+Mob+Boss
    Scryfall-->>FastAPI: card JSON (name, type, oracle, prices)
    FastAPI->>Scryfall: GET prints_search_uri (paginated)
    Scryfall-->>FastAPI: 50 printings [{set, number, art_crop_uri}]
    FastAPI->>FastAPI: find_best_art_box() → crop art region
    FastAPI->>DINOv2: embed(art_crop) → 384-dim vector
    loop Each printing art_crop
        FastAPI->>Scryfall: Download art_crop image (batched 10)
        FastAPI->>DINOv2: embed(printing_art) → 384-dim vector
        FastAPI->>FastAPI: cosine_similarity(input, printing)
    end
    DINOv2-->>FastAPI: best match: RVR #335 (score=0.9515)
    FastAPI-->>Browser: {card, printings, matched_printing_index, annotated_image}

4.2 Stage 1: Detection

Service: web/services/detection.py

Sends the image to the Roboflow hosted inference API, which runs the deployed YOLOv11n model.

Parameter Value Purpose
API endpoint https://detect.roboflow.com/mtg-detection-cixf6/8 Roboflow model v8
confidence 0.25 Minimum detection confidence
overlap 0.45 NMS overlap threshold
Timeout 30s HTTP request timeout

Returns predictions in center-format: {class, confidence, x, y, width, height} where x,y are center coordinates in pixels.
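Cropping and drawing need corner coordinates, so center-format predictions have to be converted first. A minimal sketch of that conversion follows; the function name and the sample prediction are hypothetical, not taken from detection.py.

```python
def center_to_corners(pred):
    """Convert a center-format prediction to integer (x1, y1, x2, y2) pixel corners."""
    half_w = pred["width"] / 2
    half_h = pred["height"] / 2
    return (
        round(pred["x"] - half_w),
        round(pred["y"] - half_h),
        round(pred["x"] + half_w),
        round(pred["y"] + half_h),
    )


# Made-up prediction in the shape described above.
example = {"class": "title", "confidence": 0.91,
           "x": 320.0, "y": 60.0, "width": 400.0, "height": 40.0}
print(center_to_corners(example))  # (120, 40, 520, 80)
```

A real crop helper would likely also clamp the corners to the image bounds, since predicted boxes can extend slightly past the edges.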

The service also provides helper functions:

  • find_best_title_box() — highest-confidence title prediction (≥0.3)
  • find_best_art_box() — highest-confidence art prediction (≥0.3)
  • crop_box_from_image() — extract a bounding box region as JPEG bytes
  • draw_detections() — render colored bounding boxes on the image
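The two find_best_*_box() helpers share the same selection rule: filter by class and minimum confidence, then take the most confident prediction. One way this could look is sketched below; find_best_box is a hypothetical generalization, and the real helpers may be implemented differently.

```python
def find_best_box(predictions, class_name, min_confidence=0.3):
    """Return the highest-confidence prediction of `class_name`, or None.

    Predictions below `min_confidence` are ignored.
    """
    candidates = [
        p for p in predictions
        if p["class"] == class_name and p["confidence"] >= min_confidence
    ]
    return max(candidates, key=lambda p: p["confidence"], default=None)


# Made-up predictions for illustration.
predictions = [
    {"class": "title", "confidence": 0.28},
    {"class": "title", "confidence": 0.93},
    {"class": "art", "confidence": 0.97},
]
best_title = find_best_box(predictions, "title")
print(best_title["confidence"])  # 0.93
```

Returning None when nothing qualifies lets the caller skip OCR or art matching instead of cropping a low-quality region.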

4.3 Stage 2: OCR

Service: web/services/ocr.py

Runs RapidOCR (ONNX Runtime backend) on the cropped title region.

Parameter Value Purpose
Engine RapidOCR ONNX-based, no GPU required
Input Cropped title region (JPEG bytes) From Stage 1 detection
Min text length 2 characters Filter noise/artifacts
Warm-up On server startup (lifespan) Avoid cold-start latency

The engine is lazily initialized and reused across requests. The OCR result is a single string (all detected text lines joined with spaces).
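The post-processing step can be sketched independently of the OCR engine. The function below is hypothetical and assumes the recognized lines are already available as plain strings; it also assumes the 2-character minimum applies to the joined result, which the table does not specify.

```python
MIN_TEXT_LENGTH = 2  # from the parameter table above


def join_ocr_lines(lines):
    """Join recognized text lines with spaces; return None if the result is too short."""
    text = " ".join(line.strip() for line in lines if line and line.strip())
    return text if len(text) >= MIN_TEXT_LENGTH else None


print(join_ocr_lines(["Krenko,", "Mob Boss"]))  # Krenko, Mob Boss
print(join_ocr_lines(["|"]))                    # None
```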

4.4 Stage 3: Card Lookup

Service: web/services/scryfall.py

Two API calls:

Fuzzy name search: GET https://api.scryfall.com/cards/named?fuzzy={ocr_text}

  • Scryfall's fuzzy matching handles OCR typos (e.g., "Krenk0" → "Krenko, Mob Boss")
  • Returns full card JSON with name, mana cost, type line, oracle text, prices, image URIs
  • User-Agent: MTGDetectionProject/1.0
  • Timeout: 10s
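A small sketch of how the request URL for the fuzzy lookup might be built from OCR text. The constant and function names are hypothetical; only the endpoint, User-Agent, and timeout come from the description above.

```python
from urllib.parse import urlencode

SCRYFALL_NAMED_URL = "https://api.scryfall.com/cards/named"
HEADERS = {"User-Agent": "MTGDetectionProject/1.0"}
TIMEOUT_SECONDS = 10.0


def build_fuzzy_url(ocr_text):
    """Build the fuzzy-name lookup URL, URL-encoding the OCR text."""
    query = urlencode({"fuzzy": ocr_text.strip()})
    return f"{SCRYFALL_NAMED_URL}?{query}"


print(build_fuzzy_url("Krenko, Mob Boss"))
# https://api.scryfall.com/cards/named?fuzzy=Krenko%2C+Mob+Boss
```

urlencode percent-encodes the comma, which is equivalent to the literal comma shown in the sequence diagram. An HTTP client such as httpx could then issue the request with `client.get(url, headers=HEADERS, timeout=TIMEOUT_SECONDS)`.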

Printings pagination — follows the card's prints_search_uri to fetch all printings

  • Respects Scryfall rate limits (100ms sleep between pages)
  • Caps at MAX_PRINTINGS = 50 to bound response time
  • Extracts per-printing: name, set, collector number, release date, image URIs (small, normal, art_crop), Scryfall URI
  • Handles double-faced cards (falls back to card_faces[0] image URIs)
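The pagination rules above can be sketched as a loop over Scryfall list pages, which expose `data`, `has_more`, and `next_page` fields. This is a simplified illustration, not the code in scryfall.py: `fetch_page` is a hypothetical caller-supplied function that returns the decoded JSON for a URL, and only a subset of the per-printing fields is extracted.

```python
import time

MAX_PRINTINGS = 50
PAGE_PAUSE_SECONDS = 0.1  # 100ms between pages


def image_uris_for(card):
    """Return a card's image URIs, falling back to the first face for double-faced cards."""
    if "image_uris" in card:
        return card["image_uris"]
    faces = card.get("card_faces") or [{}]
    return faces[0].get("image_uris", {})


def collect_printings(first_url, fetch_page, pause=PAGE_PAUSE_SECONDS):
    """Follow list pages until they run out or MAX_PRINTINGS is reached."""
    printings, url = [], first_url
    while url and len(printings) < MAX_PRINTINGS:
        page = fetch_page(url)  # hypothetical: URL -> decoded JSON dict
        for card in page.get("data", []):
            uris = image_uris_for(card)
            printings.append({
                "name": card.get("name"),
                "set": card.get("set"),
                "collector_number": card.get("collector_number"),
                "released_at": card.get("released_at"),
                "art_crop": uris.get("art_crop"),
            })
            if len(printings) >= MAX_PRINTINGS:
                break
        url = page.get("next_page") if page.get("has_more") else None
        if url and len(printings) < MAX_PRINTINGS:
            time.sleep(pause)  # stay polite between page requests
    return printings
```

Passing the fetch function in keeps the loop testable without network access; in the server it would wrap the shared HTTP client.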

extract_card_info() normalizes the Scryfall response for the frontend: name, mana cost, type line, oracle text, P/T, loyalty, rarity, set, prices (USD/foil), image URI, Scryfall link, and prints_search_uri for lazy-loading printings.

4.5 Stage 4: Art Matching

Service: web/services/image_match.py

Compares the detected art region against printing art crops using DINOv2 embeddings to find the exact printing.

Model

Property Value
Architecture DINOv2 ViT-S/14 (Vision Transformer, Small, patch size 14)
Parameters 21M
Size ~86 MB
Training Self-supervised (no labels needed)
Source torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
Embedding dim 384 (CLS token)

Preprocessing

Standard ImageNet preprocessing, applied to both the input art crop and each printing art image:

Resize(256, BICUBIC) → CenterCrop(224) → ToTensor() → Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)

Embedding and Similarity

  1. Forward pass through DINOv2 with @torch.inference_mode()
  2. Extract x_norm_clstoken from forward_features() output
  3. L2-normalize the 384-dim vector
  4. Cosine similarity = dot product of two normalized vectors
  5. Best match = highest similarity above MIN_SIMILARITY = 0.5
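Steps 3 to 5 can be illustrated without loading the model. The sketch below uses plain Python lists in place of the 384-dimensional embeddings; the function names are hypothetical, and treating the threshold as inclusive is an assumption.

```python
import math

MIN_SIMILARITY = 0.5


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity as the dot product of the two L2-normalized vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def best_match(query, candidates):
    """Return (index, score) of the most similar candidate, or None below the threshold."""
    scores = [cosine_similarity(query, c) for c in candidates]
    if not scores:
        return None
    index = max(range(len(scores)), key=scores.__getitem__)
    # Assumption: a score exactly equal to MIN_SIMILARITY still counts as a match.
    return (index, scores[index]) if scores[index] >= MIN_SIMILARITY else None


# Toy 3-dimensional "embeddings" for illustration.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.1, 0.0], [-1.0, 0.0, 0.0]]
print(best_match(query, candidates))
```

Here the second candidate wins with a score just under 1.0. Because the vectors are normalized first, the result depends only on direction, not magnitude.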

Download Strategy

Printing art images are downloaded from Scryfall in batches:

| Parameter | Value | Purpose |
| --- | --- | --- |
| DOWNLOAD_BATCH_SIZE | 10 | Concurrent downloads per batch |
| Inter-batch pause | 50ms (asyncio.sleep(0.05)) | Respect Scryfall rate limits |
| DOWNLOAD_TIMEOUT | 5.0s | Per-image timeout |
| Fallback | art_crop_uri → image_uri | Some printings lack art crops |
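The batching strategy can be sketched with asyncio alone. This is an illustrative outline rather than the code in image_match.py: `download` is a hypothetical caller-supplied coroutine function, and mapping failures to None is an assumption about how errors are handled.

```python
import asyncio

DOWNLOAD_BATCH_SIZE = 10
BATCH_PAUSE_SECONDS = 0.05
DOWNLOAD_TIMEOUT = 5.0


async def download_in_batches(urls, download, pause=BATCH_PAUSE_SECONDS):
    """Download `urls` in concurrent batches, preserving input order.

    Items that raise or exceed DOWNLOAD_TIMEOUT are returned as None.
    """
    results = []
    for start in range(0, len(urls), DOWNLOAD_BATCH_SIZE):
        batch = urls[start:start + DOWNLOAD_BATCH_SIZE]
        tasks = [asyncio.wait_for(download(u), timeout=DOWNLOAD_TIMEOUT) for u in batch]
        # return_exceptions=True keeps one failure from cancelling the batch.
        outcomes = await asyncio.gather(*tasks, return_exceptions=True)
        results.extend(None if isinstance(o, Exception) else o for o in outcomes)
        if start + DOWNLOAD_BATCH_SIZE < len(urls):
            await asyncio.sleep(pause)  # brief pause between batches
    return results
```

With 50 printings this yields five batches and four pauses, so the rate-limit pause adds about 0.2s on top of the download time itself.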

Why DINOv2

DINOv2's self-supervised vision features are robust to:

  • Lighting variation — photos taken under different light conditions
  • Perspective shift — cards photographed at angles
  • Domain gap — comparing a real photo against a digital render

This makes it superior to pixel-level methods (histogram comparison, template matching) which fail when the photo isn't perfectly aligned or lit.


5. Web Application

5.1 Architecture

flowchart TB
    subgraph Frontend["Frontend (vanilla JS)"]
        HTML[index.html<br/>Tab navigation]
        UP[upload.js<br/>File/drag/paste input]
        LIVE[live.js<br/>Camera + frame loop]
        CP[card-panel.js<br/>Card info + printings gallery]
    end

    subgraph Backend["Backend (FastAPI)"]
        APP[app.py<br/>Route handlers + lifespan]
        DET[detection.py<br/>Roboflow API client]
        OCR_B[ocr.py<br/>RapidOCR wrapper]
        SCR[scryfall.py<br/>Card lookup + printings]
        IMG[image_match.py<br/>DINOv2 art matcher]
    end

    UP --> APP
    LIVE --> APP
    APP --> DET
    APP --> OCR_B
    APP --> SCR
    APP --> IMG
    CP -.-> UP
    CP -.-> LIVE

The server is started with:

ROBOFLOW_API_KEY=your_key uv run uvicorn web.app:app --reload --host 0.0.0.0 --port 8000

On startup (lifespan), the server creates a shared httpx.AsyncClient, reads the Roboflow API key from the environment, and warms up the OCR engine.

5.2 API Routes

| Method | Path | Purpose | Input | Output |
| --- | --- | --- | --- | --- |
| GET | / | Serve index.html | | HTML |
| GET | /api/health | Health check | | {"status": "ok"} |
| GET | /api/config | Roboflow config for frontend | | {roboflow_publishable_key, model_id, model_version} |
| POST | /api/detect | Full 4-stage pipeline (upload mode) | image (multipart file) | {detections, card, printings, ocr_text, matched_printing_index, annotated_image} |
| POST | /api/identify | OCR + Scryfall only (live mode) | crop (multipart file) | {card, ocr_text} |
| GET | /api/printings | Lazy-load printings | prints_search_uri (query param) | {printings} |

POST /api/detect is the main endpoint. It runs all 4 stages sequentially:

  1. Sends the image to Roboflow for detection
  2. Crops the best title box and OCRs it
  3. Looks up the card on Scryfall and fetches printings
  4. Crops the best art box and runs DINOv2 art matching against all printings
  5. Returns the annotated image (base64), detections, card info, printings, and matched printing index

POST /api/identify is a lightweight endpoint for live camera mode — accepts a pre-cropped title region, runs OCR + Scryfall lookup, and returns card info without detection or art matching.

GET /api/printings enables lazy-loading printings in the frontend when the initial response doesn't include them. Validates that the URI starts with https://api.scryfall.com/ to prevent SSRF.
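The prefix check described for GET /api/printings is simple but worth seeing in isolation, because the trailing slash does real work. A minimal sketch, with hypothetical names:

```python
ALLOWED_PREFIX = "https://api.scryfall.com/"


def is_allowed_printings_uri(uri):
    """Accept only URIs that point at the Scryfall API over HTTPS."""
    return isinstance(uri, str) and uri.startswith(ALLOWED_PREFIX)


print(is_allowed_printings_uri("https://api.scryfall.com/cards/search?q=krenko"))
print(is_allowed_printings_uri("https://api.scryfall.com.evil.example/steal"))
```

The first call prints True and the second False. Without the trailing slash in the prefix, the second URI would pass, since a hostname such as api.scryfall.com.evil.example also begins with "https://api.scryfall.com".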

5.3 Upload Mode

File: web/static/js/upload.js

Three input methods:

  • File picker: <input type="file" accept="image/*" capture="environment">
  • Drag & drop — drop zone with visual feedback (dragenter/dragleave events)
  • Clipboard paste: Ctrl+V / Cmd+V (only active when upload tab is selected)

Flow: select image → show preview → click "Detect Card" → POST /api/detect → show results:

  • Annotated image with colored bounding boxes
  • Detection stats table (class, confidence bar, bbox size)
  • Card info panel with OCR result, card image, metadata, and printings gallery

5.4 Live Camera Mode

File: web/static/js/live.js

Uses navigator.mediaDevices.getUserMedia() with 1280×720 resolution, rear camera preferred (facingMode: 'environment').

Parameter Value Purpose
Frame interval 1200ms Time between detection requests
JPEG quality 0.8 (auto), 0.92 (manual capture) Balance speed vs quality

The detection loop:

  1. Captures a frame from <video> to an offscreen <canvas>
  2. Converts to JPEG blob
  3. Sends POST /api/detect (skips if a request is already in flight)
  4. Draws detection boxes on an overlay <canvas> (positioned over the video)
  5. Updates the card panel if a new card is identified

A "Capture Photo" button triggers a one-shot high-quality detection. Camera cleanup (stopLiveCamera) is exposed globally for tab switching.

5.5 Card Info Panel

File: web/static/js/card-panel.js

Shared rendering function renderCardPanel(data, container) used by both upload and live modes:

  1. OCR result — displayed at the top with the matched printing info
  2. Card image — from Scryfall (normal size)
  3. Card metadata — name, mana cost, type line, oracle text, P/T or loyalty, rarity, set, price (USD/foil)
  4. Scryfall link — direct link to the card page
  5. Printings gallery — horizontal scrolling thumbnail strip:
    • DINOv2-matched printing highlighted with printing-selected class
    • Falls back to Scryfall default match if no art match
    • Click any printing to swap the main card image, set name, and Scryfall link
    • Each printing shows set code, collector number, and a Scryfall link icon
    • Lazy-loaded via GET /api/printings if not included in the initial response

6. Platforms and Services

6.1 Platform Reference

| Platform | Role | Cost | Auth Required |
| --- | --- | --- | --- |
| Roboflow | Dataset hosting, model deployment, hosted inference API | Free tier available | ROBOFLOW_API_KEY |
| RunPod | Cloud GPU training (RTX 4090) | $0.73–$5.60 per experiment | RunPod account + API key |
| Google Colab | Alternative cloud GPU (free T4) | Free | Google account |
| Apple Silicon | Local CPU training, inference | Free | |
| Scryfall API | Card data, printings, art crop images | Free, no auth | |
| Meta DINOv2 | Art matching (ViT-S/14 via torch.hub) | Free, no auth | |
| RapidOCR | ONNX-based OCR for title text extraction | Free, no auth | |

6.2 API Keys

Key Required For How to Get
ROBOFLOW_API_KEY Dataset download, model deployment, hosted inference Free at roboflow.com
RunPod API key Cloud GPU training runpod.io, configured via runpodctl doctor

Scryfall, DINOv2, and RapidOCR require no API keys or authentication.

6.3 Dependencies

From pyproject.toml:

Package Version Purpose
ultralytics ≥8.3.0 YOLOv11 training and inference
roboflow ≥1.1.0 Dataset download and model deployment
opencv-python ≥4.8.0 Image processing, bounding box drawing
matplotlib ≥3.8.0 Dataset visualization plots
numpy ≥1.26.0 Array operations
Pillow ≥10.0.0 Image loading for DINOv2
rapidocr-onnxruntime ≥1.4.0 OCR engine
tensorboard ≥2.14.0 Training metric visualization
fastapi ≥0.115.0 Web server framework
uvicorn ≥0.32.0 ASGI server
httpx ≥0.27.0 Async HTTP client
python-multipart ≥0.0.12 File upload parsing
torch (via ultralytics) DINOv2 model, tensor operations
torchvision (via ultralytics) Image preprocessing transforms

7. Results and Limitations

7.1 End-to-End Result

A photo of Krenko, Mob Boss taken with a webcam is correctly identified:

  1. Detection: 7 regions detected (art, card, description, mana-cost, power, tags, title)
  2. OCR: Title region reads "Krenko, Mob Boss"
  3. Scryfall: Fuzzy search returns the card with 50 printings
  4. Art match: DINOv2 identifies Ravnica Remastered #335 with 0.9515 cosine similarity

7.2 Known Limitations

| Limitation | Cause | Mitigation |
| --- | --- | --- |
| Stylized/artistic fonts may fail OCR | RapidOCR trained on standard fonts | Scryfall fuzzy matching tolerates some OCR errors |
| All printings must be downloaded for art match | DINOv2 compares against each printing's art crop | Capped at 50 printings, batched 10-concurrent downloads |
| Art match latency scales with printing count | Sequential embedding computation | Cards with few printings are fast; popular cards (~50 printings) take 2–5s |
| MPS training bugs on macOS 26 | PyTorch 2.10 clamp_() tensor corruption | Use CPU for training (still fast on Apple Silicon) |
| Detection accuracy ceiling ~97% mAP50 | Annotation noise in training data | Tighter annotations via Label Studio correction workflow |
| mAP50-95 lower for small objects | Small bbox errors = large IoU drops | Higher resolution (1280px) helps but has trade-offs |

8. References

Internal Documentation

Document Topic
architecture.md Pipeline data flow, file formats, script breakdown, web app
concepts.md ML concepts via software engineering analogies
parameters.md Every training/inference parameter explained
training-strategies.md Academic foundations of augmentation strategies
metrics-guide.md How to interpret detection metrics
training-v2-status.md Cloud training experiment log
training-cost-analysis.md GPU cost comparisons
annotation-correction.md Label Studio correction workflow

External References

Resource URL
YOLOv11 (Ultralytics) https://docs.ultralytics.com/
DINOv2 (Meta) https://arxiv.org/abs/2304.07193
Scryfall API https://scryfall.com/docs/api
Roboflow https://docs.roboflow.com/
RapidOCR https://github.com/RapidAI/RapidOCR
ONNX Runtime https://onnxruntime.ai/
FastAPI https://fastapi.tiangolo.com/