Evaluation & Annotation Tool

**Problem**: There is no reliable way to measure the results of experiments.

**Task**:  Build a small web UI for validation.

**What to implement**
- Show candidate pairs
- Ask a human to mark each pair: correct / wrong
- Export gold standard
- Compute precision / recall / F1 (only when gold labels exist)
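
The export and scoring steps could look roughly like the sketch below. It is not a fixed design: the names `export_gold_csv` and `precision_recall_f1`, the CSV columns (`left`, `right`, `label`), and the representation of a pair as a tuple of two strings are all assumptions made for illustration. Scoring uses only pairs that a human has annotated; predicted pairs without a gold label are ignored.

```python
import csv
import io

# Assumption: a candidate pair is a tuple of two string identifiers.
Pair = tuple[str, str]


def export_gold_csv(gold: dict[Pair, bool]) -> str:
    """Serialize human judgements as CSV text (hypothetical column layout)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["left", "right", "label"])
    # Sort so that repeated exports of the same annotations are identical.
    for (left, right), is_correct in sorted(gold.items()):
        writer.writerow([left, right, "correct" if is_correct else "wrong"])
    return buf.getvalue()


def precision_recall_f1(
    predicted: set[Pair], gold: dict[Pair, bool]
) -> tuple[float, float, float]:
    """Score predicted pairs against gold labels; unannotated pairs are skipped."""
    tp = sum(1 for pair in predicted if gold.get(pair) is True)
    fp = sum(1 for pair in predicted if gold.get(pair) is False)
    fn = sum(1 for pair, ok in gold.items() if ok and pair not in predicted)

    # Return 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

Note that recall computed this way is relative to the annotated positives only: a true match that was never shown as a candidate pair cannot appear in the gold standard, so it is not counted as a miss.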