
Add MTEBEvaluator for embedding model evaluation#2409

Open
natke wants to merge 4 commits into microsoft:main from natke:natke/mteb-evaluator

Conversation


@natke natke commented Apr 10, 2026

Summary

Add a built-in MTEB (Massive Text Embedding Benchmark) evaluator to Olive, following the same architecture as LMEvaluator/lmeval_ort.py.

Changes

olive/evaluator/olive_evaluator.py

  • New MTEBEvaluator class registered in the evaluator registry
  • Supports three model classes: hf (SentenceTransformer), ort (ONNX via ORT), ortgenai (GenAI with hidden_states)
  • Auto-detects model class from handler type, including GenAI detection via genai_config.json
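The auto-detection described above can be sketched as a small dispatch function. This is a hypothetical illustration, not Olive's actual code: the handler names and the `genai_config.json` probe are assumptions based on the PR description.

```python
# Hypothetical sketch of the model-class auto-detection; handler type names
# and the genai_config.json check are assumptions, not Olive's real API.
from pathlib import Path


def detect_model_class(handler_type: str, model_path: str) -> str:
    """Map an Olive model handler to an MTEB backend name."""
    if handler_type == "HfModelHandler":
        return "hf"  # evaluated via SentenceTransformer
    if handler_type == "ONNXModelHandler":
        # A genai_config.json next to the model marks an ORT GenAI export
        if (Path(model_path) / "genai_config.json").exists():
            return "ortgenai"
        return "ort"
    # Per a later commit, unknown handlers raise instead of defaulting
    raise ValueError(f"Unsupported model handler: {handler_type}")
```

Raising on unknown handlers (rather than silently defaulting to `ortgenai`) matches one of the review fixes applied in this PR.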

olive/evaluator/mteb_ort.py (new)

  • MTEBOnnxBase: abstract base implementing MTEB's EncoderProtocol with mean pooling
  • MTEBORTEvaluator: wraps plain ONNX models via ort.InferenceSession
  • MTEBORTGenAIEvaluator: wraps GenAI models using og.Generator + hidden_states output
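Both wrappers pool token embeddings with an attention-mask-weighted mean. A standalone NumPy sketch of that pooling step (not the actual `MTEBOnnxBase` implementation) looks like this:

```python
# Minimal sketch of attention-mask mean pooling as used by the MTEB wrappers;
# a self-contained NumPy version for illustration, not Olive's code.
import numpy as np


def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len).
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (b, s, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (b, h)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # avoid div by 0
    return summed / counts
```

Masking before summing ensures padded tokens contribute nothing to the sentence embedding, and the clipped count guards against all-zero masks.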

Usage

"evaluators": {
    "mteb": {
        "type": "MTEBEvaluator",
        "tasks": ["STS17"],
        "batch_size": 32
    }
},
"evaluator": "mteb"

Testing

Tested end-to-end with the Qwen3-Embedding-0.6B CPU recipe: ModelBuilder export plus MTEB STS17 evaluation on both the input HF model and the exported GenAI model.

Add built-in MTEB (Massive Text Embedding Benchmark) evaluator to Olive,
following the same architecture as LMEvaluator/lmeval_ort.py.

- MTEBEvaluator in olive_evaluator.py: supports hf, ort, and ortgenai model
  classes with auto-detection based on model handler type
- mteb_ort.py: MTEB-compatible wrappers for exported ONNX and GenAI models
  - MTEBORTEvaluator: wraps plain ONNX models via ORT InferenceSession
  - MTEBORTGenAIEvaluator: wraps GenAI models with hidden_states output
  - Both use mean pooling over attention-masked token embeddings

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

Adds a new built-in evaluator to Olive for running MTEB (Massive Text Embedding Benchmark) embedding evaluations, following the existing evaluator registry pattern used for LM evaluation.

Changes:

  • Registers a new MTEBEvaluator in olive_evaluator.py with support for HuggingFace, plain ONNX Runtime, and ORT GenAI style models.
  • Introduces olive/evaluator/mteb_ort.py with ONNX Runtime and ORT GenAI adapter classes implementing an MTEB-compatible encoder interface.
  • Converts MTEB task results into Olive MetricResult structures for reporting.
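The conversion of MTEB task results into metric structures can be pictured as a flattening step. The key format and function below are assumptions for illustration; the PR's actual `MetricResult` wiring lives in `olive_evaluator.py`.

```python
# Hypothetical illustration of flattening per-task MTEB scores into a flat
# metric dict for reporting; key naming is an assumption, not Olive's scheme.
def flatten_mteb_scores(task_results: dict) -> dict:
    """Turn {task: {metric: value}} into {"task-metric": value}."""
    flat = {}
    for task, scores in task_results.items():
        for name, value in scores.items():
            flat[f"{task}-{name}"] = float(value)
    return flat
```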

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File | Description
olive/evaluator/olive_evaluator.py | Adds MTEBEvaluator and wiring to select the correct backend and translate MTEB outputs into Olive metrics.
olive/evaluator/mteb_ort.py | Adds ONNX Runtime and ORT GenAI model wrappers to satisfy MTEB encoder requirements (tokenization, pooling, similarity).

natke and others added 3 commits April 10, 2026 15:32
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Raise ValueError for unknown model handlers instead of defaulting to ortgenai
- Pass device to SentenceTransformer for HF model evaluation
- Handle string input in encode() to avoid char-by-char iteration
- Raise RuntimeError when hidden_states unavailable instead of logits fallback
- Fix lint: blank line after docstring section (D413)
- Fix lint: replace list comprehension with list() (C416/R1721)
- Fix ruff formatting issues

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
