Skip to content

Open-world vehicle recognition using metric learning - PyTorch implementation achieving 72% accuracy

Notifications You must be signed in to change notification settings

jam244-web/vericar

Repository files navigation

πŸš— Veri-Car: Open-World Vehicle Information Retrieval

Python 3.8+ PyTorch License: MIT

A PyTorch implementation of an open-world vehicle recognition system based on the research paper "Veri-Car: Towards Open-world Vehicle Information Retrieval" by MuΓ±oz et al. (JPMorgan Chase AI Research). The system uses metric learning and K-NN retrieval to identify vehicle make, model, type, and year, while detecting out-of-distribution vehicles without retraining.

🎯 Key Achievement: Solved critical gradient bug that improved accuracy from 0.20% to 72% (360x improvement)


πŸ“Š Results

Performance Comparison

Metric My Implementation Paper (Veri-Car) Notes
Retrieval Accuracy 72.45% 96.18% On Stanford Cars 196 dataset
Model Backbone ResNet50 OpenCLIP ViT-B/16 Pre-trained on ImageNet vs LAION-2B
Embedding Dimension 256-D 128-D Larger embeddings improved results
Training Time 6 hours (GPU) Not specified Single NVIDIA GPU
Model Size 95 MB Not specified Lightweight and deployable

Training Progress

Epoch   5:  14.67% β–“β–“β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘
Epoch  15:  26.89% β–“β–“β–“β–“β–“β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘
Epoch  25:  36.16% β–“β–“β–“β–“β–“β–“β–“β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘
Epoch  40:  53.96% β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘
Epoch  50:  60.96% β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–‘β–‘β–‘β–‘β–‘β–‘β–‘
Epoch 100:  72.45% β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–‘β–‘β–‘β–‘β–‘ βœ“

🌟 Key Features

Open-World Learning

  • βœ… No Retraining Required: Add new vehicle models by simply adding their embeddings to the database
  • βœ… OOD Detection: Automatically flags unknown vehicles using KNN+ algorithm (FPR95: 28.72%, AUROC: 93.10%)
  • βœ… Scalable: K-NN retrieval works efficiently with growing databases

Technical Implementation

  • βœ… Multi-Similarity Loss: Advanced metric learning for robust embeddings
  • βœ… Pre-trained Backbone: ResNet50 fine-tuned on vehicle data
  • βœ… Hierarchical Structure: Supports make β†’ type β†’ model β†’ year classification
  • βœ… Production Ready: Complete training, evaluation, and inference pipeline

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Input: Car Image                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              ResNet50 Backbone (Pre-trained)                β”‚
β”‚           Extracts 2048-dimensional features                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 Projection Head (MLP)                       β”‚
β”‚         2048 β†’ 512 β†’ 256 (with BatchNorm, Dropout)         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
                  256-D Embedding
                         β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚                     β”‚
              β–Ό                     β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  K-NN Retrieval β”‚   β”‚  OOD Detection  β”‚
    β”‚    (k=1)        β”‚   β”‚    (KNN+)       β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚                     β”‚
             β–Ό                     β–Ό
      Vehicle Identity        Flag Unknown
    (Make, Model, Year)         Vehicles

πŸš€ Quick Start

Installation

# Clone repository
git clone https://github.com/jam244-web/vericar-portfolio.git
cd vericar-portfolio

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Download Dataset

# Download Stanford Cars 196 dataset
# From: https://www.kaggle.com/datasets/jessicali9530/stanford-cars-dataset
# Extract to: data/stanford_cars/

# Generate labels
python scripts/create_stanford_labels_improved.py

Train Model

# Quick training (50 epochs, ~2 hours)
python scripts/train_with_real_data.py \
    --data_dir data/stanford_cars \
    --dataset_type stanford \
    --batch_size 64 \
    --num_epochs 50

# Full training (150 epochs, ~6 hours, best results)
python scripts/train_with_real_data.py \
    --data_dir data/stanford_cars \
    --dataset_type stanford \
    --batch_size 64 \
    --embedding_dim 256 \
    --num_epochs 150

Run Inference

from src.models.embedding_model import VehicleEmbeddingModel
from src.retrieval.knn_retrieval import KNNRetrieval
import torch
from PIL import Image

# Load model
model = VehicleEmbeddingModel(embedding_dim=256)
model.load_state_dict(torch.load('models/best_model.pth'))
model.eval()

# Load database
retriever = KNNRetrieval(k=1)
retriever.build_database(train_embeddings, train_labels)

# Predict
image = Image.open('test_car.jpg')
embedding = model(preprocess(image))
prediction = retriever.predict(embedding)

print(f"Predicted: {prediction}")
# Output: "Toyota Camry Sedan 2019"

πŸ› οΈ Development Journey: From 0.20% to 72%

Challenge 1: Data Mismatch (Week 1)

Problem: Model stuck at 0.20% accuracy despite loss decreasing

Investigation:

Train samples: 8144, Classes: 196
Test samples:  8041, Classes: 1103  # ❌ Wrong!

Root Cause: Test set had 1103 different classes vs 196 in training

Solution: Split training data into train/val (80/20) instead of using corrupted test labels

Result: Still 0.20% - revealed deeper issue!


Challenge 2: The Gradient Bug (Week 2) πŸ›

Problem: Even with correct data split, accuracy remained at 0.20%

Investigation: Deep dive into loss function implementation

Root Cause: Broken gradient chain in loss accumulation

# ❌ WRONG (What I had):
loss = torch.tensor(0.0, requires_grad=True)
for i in range(batch_size):
    loss = loss + sample_loss  # Creates new tensor each iteration!
                                # Breaks gradient flow πŸ’”

# βœ… CORRECT (After fix):
losses = []
for i in range(batch_size):
    losses.append(sample_loss)  # Collect in Python list
loss = torch.stack(losses).mean()  # Proper gradient preservation ✨

Why it matters:

  • Each loss = loss + x created a new tensor, severing the computational graph
  • PyTorch couldn't backpropagate gradients properly
  • Model appeared to train (loss decreased) but wasn't actually learning

Solution: Refactored MultiSimilarityLoss to use torch.stack() for proper gradient flow

Result: πŸŽ‰ 300x improvement β†’ 61% accuracy!


Challenge 3: Optimization (Week 3)

Improvements Applied:

Change Impact
ResNet50 (vs ResNet18) +8% accuracy
256-D embeddings (vs 128-D) +5% accuracy
Better data augmentation +3% accuracy
150 epochs (vs 50) +7% accuracy

Final Result: 72.45% accuracy


🧠 Technical Deep Dive

Multi-Similarity Loss

Unlike traditional triplet loss, Multi-Similarity Loss considers all positive and negative pairs in a batch:

# For each anchor image:
# 1. Find all similar images (same car model) - positives
# 2. Find all different images (different models) - negatives
# 3. Push positives closer, push negatives farther

loss = (1/Ξ±) * log(1 + Ξ£ exp(-Ξ±(sim_pos - Ξ»))) +    # Positive term
       (1/Ξ²) * log(1 + Ξ£ exp(Ξ²(sim_neg - Ξ»)))       # Negative term

Advantages:

  • More efficient than triplet mining
  • Better gradient signal (uses all pairs, not just hard ones)
  • Achieves tighter clustering in embedding space

K-NN Retrieval

Instead of classification, uses nearest neighbor search:

# Traditional Classification (closed-world):
output = model(image)  # Fixed 196 classes
prediction = argmax(output)

# K-NN Retrieval (open-world):
embedding = model(image)  # 256-D vector
distances = euclidean(embedding, database_embeddings)
prediction = database_labels[argmin(distances)]

Benefits:

  • βœ… Add new vehicles without retraining
  • βœ… Natural confidence scores (inverse distance)
  • βœ… Can return top-K similar vehicles

πŸ“ˆ Performance Analysis

What Works Well

βœ… Common vehicles: 85%+ accuracy on popular makes (Toyota, Honda, Ford)
βœ… Distinctive models: 90%+ on unique designs (sports cars, SUVs)
βœ… Recent years: Better on 2010+ models (more training data)

Challenging Cases

⚠️ Similar models: 45% on visually similar cars (e.g., Honda Accord vs Toyota Camry)
⚠️ Rare vehicles: 55% on underrepresented classes (<20 training samples)
⚠️ Partial views: 60% when car is partially occluded

Error Analysis

# Example confusion:
Predicted: "BMW 3 Series Sedan 2012"
Actual:    "BMW 3 Series Coupe 2012"
Issue:     Sedan vs Coupe distinction (similar body styles)

# Solution: More training data or hierarchical loss

🎯 Future Improvements

Quick Wins (Expected +10-15% accuracy)

  • Use OpenCLIP ViT-B/16: Paper's backbone, pre-trained on LAION-2B
  • Implement HiMS-Min Loss: Hierarchical multi-similarity for make/type/model/year
  • Train longer: 200-300 epochs with learning rate scheduling
  • Ensemble models: Combine ResNet50, ResNet101, and EfficientNet

Advanced Features

  • License Plate Detection: YOLOv5-based detector
  • License Plate Recognition: TrOCR model fine-tuned on synthetic plates
  • Color Recognition: Separate model for vehicle color (15 classes)
  • Web Deployment: Flask/FastAPI REST API + React frontend
  • Mobile App: TensorFlow Lite conversion for on-device inference

πŸ“‚ Project Structure

vericar-portfolio/
β”œβ”€β”€ README.md                          # This file
β”œβ”€β”€ requirements.txt                   # Python dependencies
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ embedding_model.py        # ResNet50 + projection head
β”‚   β”‚   β”œβ”€β”€ loss_functions.py         # Multi-Similarity Loss
β”‚   β”‚   └── ood_detector.py           # KNN+ OOD detection
β”‚   β”œβ”€β”€ data/
β”‚   β”‚   └── dataset.py                # Stanford Cars loader
β”‚   └── retrieval/
β”‚       └── knn_retrieval.py          # K-NN search engine
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ train_with_real_data.py       # Main training script
β”‚   └── create_stanford_labels.py     # Data preprocessing
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ 01_data_exploration.ipynb     # EDA
β”‚   β”œβ”€β”€ 02_model_training.ipynb       # Training experiments
β”‚   └── 03_demo.ipynb                 # Inference examples
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ app.py                        # Flask web server
β”‚   └── templates/
β”‚       └── index.html                # Web interface
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ best_model.pth                # Trained weights (95 MB)
β”‚   β”œβ”€β”€ train_embeddings.npy          # Database embeddings
β”‚   └── train_labels.npy              # Database labels
└── data/
    └── stanford_cars/                # Dataset (not included)

πŸ”¬ Key Learnings

1. PyTorch Gradient Mechanics

Understanding how PyTorch builds computational graphs is critical:

# Correct gradient flow
x = model(input)
loss = criterion(x, target)
loss.backward()  # Gradients flow from loss β†’ model β†’ input

# Broken gradient flow (my bug)
loss = 0.0
for i in range(N):
    loss = loss + item[i]  # Each += breaks the chain!

Lesson: Use torch.stack() or torch.cat() for proper tensor operations in training loops.

2. Metric Learning vs Classification

Classification: Learn decision boundaries between fixed classes
Metric Learning: Learn a distance function in embedding space

Metric learning is better for:

  • Open-world scenarios (new classes appear)
  • Few-shot learning (limited training samples)
  • Similarity search applications

3. Importance of Pre-training

Training from scratch: ~40% accuracy
With ImageNet pre-training: ~72% accuracy

Lesson: Always use pre-trained weights when possible. Transfer learning is powerful!


πŸ“š References

Original Paper:

@article{munoz2024vericar,
  title={Veri-Car: Towards Open-world Vehicle Information Retrieval},
  author={Mu{\~n}oz, Andr{\'e}s and Thomas, Nancy and Vapsi, Annita and Borrajo, Daniel},
  journal={arXiv preprint arXiv:2411.06864},
  year={2024}
}

Key Techniques:

Datasets:


🀝 Contributing

Contributions are welcome! Areas for improvement:

  • Implement hierarchical loss (HiMS-Min)
  • Add license plate detection module
  • Create more comprehensive tests
  • Improve data augmentation strategies
  • Deploy to cloud platform (AWS/GCP/Azure)

Please open an issue or submit a pull request!


πŸ“§ Contact

Your Name
LinkedIn β€’ Email

πŸ™ Acknowledgments

  • Original paper authors: AndrΓ©s MuΓ±oz, Nancy Thomas, Annita Vapsi, Daniel Borrajo (JPMorgan Chase AI Research)
  • Stanford University for the Cars 196 dataset
  • PyTorch and open-source community

⭐ If you find this project helpful, please consider giving it a star!

About

Open-world vehicle recognition using metric learning - PyTorch implementation achieving 72% accuracy

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published