s-agbede/embedding-benchmark

embedding-benchmark

Benchmark embedding models across Redis-backed retrieval experiments.

What It Does

embedding-benchmark compares embedding models on retrieval tasks using:

  • Ollama models through the Ollama Python SDK
  • OpenAI models through RedisVL
  • Local Hugging Face models through RedisVL
  • Redis as the vector index and retrieval backend
  • ranx for retrieval metrics

The CLI writes:

  • terminal summary tables
  • summary.json
  • metrics.csv
  • per_dataset_metrics.csv
  • per_query_metrics.csv
  • config.resolved.yaml
  • run.log

Example Findings

This repo was used to benchmark 12 model-and-dimension configurations across 4 NanoBEIR datasets in a single Redis-backed reference run.

Benchmark caveat: this is a small, single-run reference result with no error bars or latency warmup. Treat the numbers as an example of what the tool can surface, not as a universal model leaderboard.

A few useful takeaways from that run:

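  • larger embeddings were not consistently better: on NanoNQ, openai-large at 256 dimensions beat every larger configuration
  • openai-large@256 was one of the strongest quality-versus-latency tradeoffs (0.6858 avg nDCG@10 at 0.384 ms/query, versus 0.7041 at 1.308 ms/query for openai-large@3072)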
  • retrieval latency increased with vector size, but stayed below 2 ms/query even at the largest dimension in this setup
  • the full benchmark build and indexing run took about 46 minutes

Overall examples from quality_retrieval_by_model.csv:

These overall results are macro-averages across datasets, so each dataset gets equal weight in the average.

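| Model                   | Dims | Avg nDCG@10 | Avg Hit@10 | Avg retrieval ms/query |
|-------------------------|------|-------------|------------|------------------------|
| openai-large            | 3072 | 0.7041      | 0.8900     | 1.308                  |
| openai-large            | 256  | 0.6858      | 0.8650     | 0.384                  |
| ollama-mxbai            | 1024 | 0.6694      | 0.8600     | 0.742                  |
| nomic-embed-text-v2-moe | 512  | 0.6284      | 0.8400     | 0.475                  |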
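To make the macro-averaging concrete, here is a minimal sketch that contrasts it with a query-weighted (micro) average. The function names and the scores are hypothetical placeholders chosen for illustration; they are not taken from the reference run or from this repo's code.

```python
from statistics import mean


def macro_average(per_dataset_scores: dict[str, float]) -> float:
    """Average one score per dataset, so every dataset counts equally."""
    if not per_dataset_scores:
        raise ValueError("need at least one dataset score")
    return mean(per_dataset_scores.values())


def query_weighted_average(per_dataset: dict[str, tuple[float, int]]) -> float:
    """Average weighted by query count, so larger datasets count more."""
    total_queries = sum(n for _, n in per_dataset.values())
    if total_queries == 0:
        raise ValueError("need at least one query")
    return sum(score * n for score, n in per_dataset.values()) / total_queries


# Placeholder values for illustration only (not from the reference run).
scores = {"dataset_a": 0.80, "dataset_b": 0.60}
with_counts = {"dataset_a": (0.80, 90), "dataset_b": (0.60, 10)}

macro = macro_average(scores)                     # each dataset weighs 1/2
weighted = query_weighted_average(with_counts)    # dataset_a dominates
```

With these placeholders the macro-average sits midway between the two datasets, while the query-weighted average is pulled toward the dataset with more queries.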

Per-dataset winners from quality_retrieval_by_dataset.csv:

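| Dataset      | Best Model   | Dims | nDCG@10 | Hit@10 | Retrieval ms/query |
|--------------|--------------|------|---------|--------|--------------------|
| NanoNQ       | openai-large | 256  | 0.8206  | 0.92   | 0.405              |
| NanoSciFact  | openai-large | 3072 | 0.7865  | 0.88   | 1.051              |
| NanoFiQA2018 | openai-large | 3072 | 0.6628  | 0.84   | 1.446              |
| NanoArguAna  | ollama-mxbai | 1024 | 0.6797  | 0.98   | 0.663              |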

The best configuration was not stable across datasets, which is exactly why it is worth benchmarking on the retrieval tasks that matter to you.

Supported Custom Models

The repo is built so people can benchmark their own model choices within the provider families implemented today:

  • custom Ollama embedding model names via the Ollama SDK
  • custom OpenAI embedding model names via RedisVL
  • custom Hugging Face embedding model names via RedisVL

It does not yet support arbitrary providers outside those adapter paths. Hugging Face models through RedisVL are treated as fixed-dimension models unless a future adapter explicitly supports safe truncation.
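For context on what truncation involves mechanically, the sketch below keeps a prefix of an embedding and re-normalizes it to unit length. This is only an illustration: `truncate_and_normalize` is a hypothetical helper, not part of this repo or of RedisVL, and truncation is generally only reasonable for models trained to tolerate it, which is presumably why fixed-dimension models are not truncated here.

```python
import math


def truncate_and_normalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if dims <= 0 or dims > len(vector):
        raise ValueError(f"dims must be between 1 and {len(vector)}")
    prefix = vector[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return prefix
    return [x / norm for x in prefix]


# Toy 3-dimensional vector reduced to 2 dimensions.
short = truncate_and_normalize([3.0, 4.0, 12.0], dims=2)
```

Re-normalizing matters because cosine-style comparisons assume comparable vector lengths after the cut.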

Supported Benchmark Definitions

Benchmarks are defined in YAML. Today you can configure:

  • custom model lists within the built-in provider families
  • custom dataset lists using built-in dataset kinds
  • custom ranx metric lists for benchmark scoring
  • a separate per_query_metrics list for per_query_metrics.csv

Built-in dataset kinds:

  • nanobeir
  • hf_beir

Dataset splits default to train, which matches the NanoBEIR datasets used by the starter config. For HF BEIR-style datasets with different splits, configure them explicitly:

datasets:
  - id: SciFact
    kind: hf_beir
    source: BeIR/scifact
    enabled: true
    options:
      corpus_split: train
      queries_split: test
      qrels_split: test
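One way to picture the default is the sketch below, which fills in train for any split that a dataset entry does not set. The option keys come from the YAML above; `resolve_splits` and the constants are hypothetical names, and the repo's actual loader may resolve defaults differently.

```python
DEFAULT_SPLIT = "train"
SPLIT_KEYS = ("corpus_split", "queries_split", "qrels_split")


def resolve_splits(dataset_entry: dict) -> dict[str, str]:
    """Return the split to use for corpus, queries, and qrels."""
    options = dataset_entry.get("options") or {}
    return {key: options.get(key, DEFAULT_SPLIT) for key in SPLIT_KEYS}


# An entry without options falls back to the default for every split.
nano = {"id": "NanoNQ", "kind": "nanobeir", "source": "zeta-alpha-ai/NanoNQ"}

# An entry with explicit options, mirroring the YAML example.
scifact = {
    "id": "SciFact",
    "kind": "hf_beir",
    "source": "BeIR/scifact",
    "options": {
        "corpus_split": "train",
        "queries_split": "test",
        "qrels_split": "test",
    },
}

nano_splits = resolve_splits(nano)
scifact_splits = resolve_splits(scifact)
```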

The metrics list accepts valid ranx metric strings, such as:

  • hit_rate@10
  • ndcg@10
  • mrr@10
  • precision@5
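To show what these numbers mean for a single query, here is a plain-Python sketch of the four metrics under binary relevance. It is for intuition only: the benchmark uses ranx for scoring, and ranx also handles graded relevance and edge cases that this simplified version ignores. All function names here are illustrative.

```python
import math


def hit_rate_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the k top slots that hold a relevant document."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def mrr_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: the relevant documents sit at ranks 2 and 4.
ranked_docs = ["d3", "d1", "d7", "d2"]
relevant_docs = {"d1", "d2"}
```

In this toy query, hit_rate@10 is 1.0 because at least one relevant document was retrieved, while mrr@10 is 0.5 because the first relevant document is at rank 2.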

per_query_metrics is intentionally narrower for now and supports only:

  • hit_rate@k
  • ndcg@k
  • mrr@k
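A validator for this narrower list could look like the sketch below, which splits a string such as ndcg@10 into a name and a cutoff and rejects anything outside the three supported metrics. `parse_per_query_metric` and its error messages are hypothetical; the repo's actual config validation may be structured differently.

```python
import re

# Metric names accepted for per-query output, per the list above.
SUPPORTED_PER_QUERY = {"hit_rate", "ndcg", "mrr"}

# "<name>@<k>" where k is a positive integer without leading zeros.
_METRIC_PATTERN = re.compile(r"^(?P<name>[a-z_]+)@(?P<k>[1-9][0-9]*)$")


def parse_per_query_metric(metric: str) -> tuple[str, int]:
    """Split 'name@k' into (name, k), raising ValueError if unsupported."""
    match = _METRIC_PATTERN.match(metric)
    if match is None:
        raise ValueError(f"expected '<name>@<k>', got {metric!r}")
    name = match.group("name")
    if name not in SUPPORTED_PER_QUERY:
        raise ValueError(f"{name!r} is not supported for per-query metrics")
    return name, int(match.group("k"))


parsed = parse_per_query_metric("ndcg@10")
```

Under this sketch, precision@5 would be accepted in metrics but rejected in per_query_metrics, which matches the split described above.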

Requirements

  • Python 3.11+
  • Redis with RediSearch / vector search support
  • Ollama running locally if you want to benchmark Ollama models
  • OPENAI_API_KEY set if you want to benchmark OpenAI models

Install

From the project root:

uv sync --extra dev

This creates .venv and installs the CLI locally.

Start Redis

You need a Redis instance with vector search enabled.

If you already have one running, point the config at it.

If not, a quick local option is:

docker run --rm -p 6379:6379 redis/redis-stack-server:latest

Prepare Model Access

OpenAI

Set your API key:

export OPENAI_API_KEY=<your-openai-api-key>

Ollama

Start Ollama and pull the models you want to test:

ollama serve
ollama pull nomic-embed-text-v2-moe
ollama pull mxbai-embed-large

Create A Config

Generate a local starter config:

./.venv/bin/embedding-benchmark init-config

This writes benchmark.yaml in the current directory. benchmark.yaml is treated as a local working file and is gitignored on purpose.

To write it somewhere else:

./.venv/bin/embedding-benchmark init-config --output my-benchmark.yaml

If you want a committed example to start from, copy benchmark.example.yaml.

Example dataset entry:

datasets:
  - id: NanoNQ
    kind: nanobeir
    source: zeta-alpha-ai/NanoNQ
    enabled: true

Example evaluation section:

evaluation:
  top_k: 10
  metrics:
    - hit_rate@10
    - ndcg@10
    - precision@5
  per_query_metrics:
    - hit_rate@10
    - ndcg@10
  output_dir: runs

Review Configured Models

./.venv/bin/embedding-benchmark list-models --config benchmark.yaml

The generated starter config includes:

  • ollama-nomic
  • ollama-mxbai
  • openai-small
  • openai-large
  • redisvl-minilm

Run A Benchmark

Run the benchmark with the config:

./.venv/bin/embedding-benchmark run --config benchmark.yaml

The generated starter config enables:

  • NanoNQ
  • NanoSciFact
  • NanoFiQA2018
  • NanoArguAna

Artifacts are written under runs/<timestamp>-benchmark/. The runs/ directory is gitignored because it contains local benchmark artifacts.
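The run IDs shown in this README (for example 20260519T202913Z-benchmark) look like a compact UTC timestamp followed by a suffix. Assuming that format, the sketch below builds such an ID and the matching artifact directory. `make_run_id` and `run_dir` are hypothetical helpers, not the CLI's own functions.

```python
from datetime import datetime, timezone
from pathlib import Path


def make_run_id(now: datetime, suffix: str = "benchmark") -> str:
    """Format a timezone-aware datetime as '<YYYYMMDD>T<HHMMSS>Z-<suffix>'."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    utc_now = now.astimezone(timezone.utc)
    return f"{utc_now:%Y%m%dT%H%M%SZ}-{suffix}"


def run_dir(output_dir: str, run_id: str) -> Path:
    """Directory where a run's artifacts would live, e.g. runs/<run_id>/."""
    return Path(output_dir) / run_id


example_id = make_run_id(datetime(2026, 5, 19, 20, 29, 13, tzinfo=timezone.utc))
example_dir = run_dir("runs", example_id)
```

Because the timestamp leads the name, sorting run directories by name also sorts them chronologically.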

Sweep Dimensions

For models that natively support reduced output dimensions, run a dimension sweep:

./.venv/bin/embedding-benchmark sweep-dims --config benchmark.yaml --model openai-small

Example models that can sweep dimensions:

  • openai-small
  • openai-large
  • ollama-nomic

Inspect A Run

./.venv/bin/embedding-benchmark inspect --config benchmark.yaml --run-id 20260519T202913Z-benchmark

This prints the saved summary, including Redis index names and key prefixes.

Clean Up Redis Data

Remove Redis keys for a specific run:

./.venv/bin/embedding-benchmark cleanup --config benchmark.yaml --run-id 20260519T202913Z-benchmark

Or remove by key prefix:

./.venv/bin/embedding-benchmark cleanup --config benchmark.yaml --prefix embedbench

Smoke Test Workflow

If you want the quickest manual check:

  1. Start Redis.
  2. Export OPENAI_API_KEY.
  3. Generate benchmark.yaml.
  4. Edit benchmark.yaml so only one small model and one dataset are enabled.
  5. Run:
     ./.venv/bin/embedding-benchmark run --config benchmark.yaml

Then confirm:

  • the CLI prints a results table
  • a new folder appears under runs/
  • summary.json and metrics.csv exist
  • the Redis keys/indexes are visible in Redis Insight if keep_indexes: true
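The file checks above can also be scripted. This is a small sketch, assuming artifacts land in timestamp-named folders under runs/ as described earlier; `latest_run` and `missing_artifacts` are hypothetical helpers and do not inspect Redis or the terminal output.

```python
from pathlib import Path
from typing import Optional

# Files the smoke test expects in a finished run folder.
EXPECTED_FILES = ("summary.json", "metrics.csv")


def latest_run(output_dir: Path) -> Optional[Path]:
    """Return the newest run folder by name, or None if there are no runs."""
    runs = sorted(p for p in output_dir.iterdir() if p.is_dir())
    return runs[-1] if runs else None


def missing_artifacts(run_path: Path) -> list[str]:
    """List expected files that are absent from the run folder."""
    return [name for name in EXPECTED_FILES if not (run_path / name).is_file()]
```

A typical use would be to call `latest_run(Path("runs"))` after the benchmark finishes and treat a non-empty result from `missing_artifacts` as a failed smoke test.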

Run Tests

./.venv/bin/pytest -q

Notes

  • OpenAI models in this project go through RedisVL, not the direct OpenAI SDK.
  • Ollama models use the Ollama SDK directly because RedisVL does not currently support Ollama embeddings.
  • Dataset loading is selected by datasets[].kind, so users can define different benchmark mixes without changing Python code.
  • Metric names are validated at config load time so bad ranx metric strings fail before a run starts.
  • Redis query results are converted into ranking scores before evaluation so ranx sees the correct ordering.
  • The committed benchmark.example.yaml matches the generated default config, which enables all four NanoBEIR datasets.
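On the score-conversion note: vector search typically returns a distance, where lower is better, while ranx ranks documents by descending score. The sketch below shows one possible conversion, assuming cosine distance; `distances_to_scores` is a hypothetical helper and the repo's actual conversion may differ.

```python
def distances_to_scores(hits: list[tuple[str, float]]) -> dict[str, float]:
    """Map doc_id -> score so that a smaller distance gives a larger score."""
    return {doc_id: 1.0 - distance for doc_id, distance in hits}


# Toy search result: (document ID, cosine distance), nearest first.
hits = [("doc-a", 0.12), ("doc-b", 0.35), ("doc-c", 0.80)]
scores = distances_to_scores(hits)

# Sorting by descending score reproduces the nearest-first order.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Passing raw distances as scores would invert the ranking, so the worst match would be evaluated as the best one.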

About

A simple benchmark for comparing embedding models, dimensions, and retrieval quality across public datasets using RedisVL, OpenAI, Ollama, and NanoBEIR.
