Benchmark embedding models across Redis-backed retrieval experiments.
embedding-benchmark compares embedding models on retrieval tasks using:
- Ollama models through the Ollama Python SDK
- OpenAI models through RedisVL
- Local Hugging Face models through RedisVL
- Redis as the vector index and retrieval backend
- `ranx` for retrieval metrics
The CLI writes:
- terminal summary tables
- `summary.json`
- `metrics.csv`
- `per_dataset_metrics.csv`
- `per_query_metrics.csv`
- `config.resolved.yaml`
- `run.log`
This repo was used to benchmark 12 model and dimension configurations across 4 NanoBEIR datasets in one Redis-backed reference run.
Benchmark caveat: this is a small, single-run reference result with no error bars or latency warmup. Treat the numbers as an example of what the tool can surface, not as a universal model leaderboard.
A few useful takeaways from that run:
- larger embeddings were not consistently better
- `openai-large@256` was one of the strongest quality and latency tradeoffs
- retrieval latency increased with vector size, but stayed below `2 ms/query` even at the largest dimension in this setup
- the full benchmark build and indexing run took about `46 minutes`
Overall examples from quality_retrieval_by_model.csv:
These overall results are macro-averages across datasets, so each dataset gets equal weight in the average.
| Model | Dims | Avg nDCG@10 | Avg Hit@10 | Avg retrieval ms/query |
|---|---|---|---|---|
| `openai-large` | 3072 | 0.7041 | 0.8900 | 1.308 |
| `openai-large` | 256 | 0.6858 | 0.8650 | 0.384 |
| `ollama-mxbai` | 1024 | 0.6694 | 0.8600 | 0.742 |
| `nomic-embed-text-v2-moe` | 512 | 0.6284 | 0.8400 | 0.475 |
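The macro-averaging described above can be sketched as follows. This is an illustration only: `macro_average` is a hypothetical helper, not a function from this repo, and the scores and dataset names are made-up placeholders rather than results from the reference run.

```python
from statistics import fmean


def macro_average(per_dataset_scores: dict[str, float]) -> float:
    """Average one score per dataset so every dataset gets equal weight."""
    if not per_dataset_scores:
        raise ValueError("need at least one dataset score")
    return fmean(per_dataset_scores.values())


# Placeholder nDCG@10 values for a single model (illustrative, not measured).
example_scores = {"dataset_a": 0.80, "dataset_b": 0.60, "dataset_c": 0.70}
print(round(macro_average(example_scores), 4))
```

A query-weighted (micro) average would instead let datasets with more queries dominate the overall number.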
Per-dataset winners from quality_retrieval_by_dataset.csv:
| Dataset | Best Model | Dims | nDCG@10 | Hit@10 | Retrieval ms/query |
|---|---|---|---|---|---|
| `NanoNQ` | `openai-large` | 256 | 0.8206 | 0.92 | 0.405 |
| `NanoSciFact` | `openai-large` | 3072 | 0.7865 | 0.88 | 1.051 |
| `NanoFiQA2018` | `openai-large` | 3072 | 0.6628 | 0.84 | 1.446 |
| `NanoArguAna` | `ollama-mxbai` | 1024 | 0.6797 | 0.98 | 0.663 |
The best configuration was not stable across datasets, which is exactly why it is worth benchmarking on the retrieval tasks that matter to you.
The repo is built so people can benchmark their own model choices within the provider families implemented today:
- custom Ollama embedding model names via the Ollama SDK
- custom OpenAI embedding model names via RedisVL
- custom Hugging Face embedding model names via RedisVL
It does not yet support arbitrary providers outside those adapter paths. Hugging Face models through RedisVL are treated as fixed-dimension models unless a future adapter explicitly supports safe truncation.
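To illustrate why truncation needs explicit adapter support: shortening a vector is generally only meaningful for models trained to tolerate it, and the shortened vector usually has to be re-normalized before cosine-style comparison. The sketch below shows that operation in isolation. It is not code from this repo, and `truncate_and_renormalize` is a hypothetical name.

```python
import math


def truncate_and_renormalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit L2 norm."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


# Illustrative input only; real embeddings have hundreds of dimensions.
shortened = truncate_and_renormalize([3.0, 4.0, 12.0], dims=2)
print(shortened)
```

Whether the shortened vector still retrieves well depends on how the model was trained, which is why treating Hugging Face models as fixed-dimension is the conservative default.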
Benchmarks are defined in YAML. Today you can configure:
- custom model lists within the built-in provider families
- custom dataset lists using built-in dataset kinds
- custom `ranx` metric lists for benchmark scoring
- a separate `per_query_metrics` list for `per_query_metrics.csv`
Built-in dataset kinds:
- `nanobeir`
- `hf_beir`
Dataset splits default to `train`, which matches the NanoBEIR datasets used by
the starter config. For HF BEIR-style datasets with different splits, configure
them explicitly:

    datasets:
      - id: SciFact
        kind: hf_beir
        source: BeIR/scifact
        enabled: true
        options:
          corpus_split: train
          queries_split: test
          qrels_split: test

`metrics` can use valid `ranx` metric strings such as:

- `hit_rate@10`
- `ndcg@10`
- `mrr@10`
- `precision@5`
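For intuition about what these metric strings measure, here is a simplified binary-relevance sketch in plain Python. The tool itself delegates scoring to `ranx`. The functions below are hypothetical illustrations and may differ from `ranx` in edge cases such as graded relevance or ties.

```python
import math


def hit_rate_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0


def precision_at_k(ranked_ids, relevant_ids, k):
    """Number of relevant documents in the top k, divided by k."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document in the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Illustrative query: the only relevant document is ranked second.
ranking = ["d3", "d1", "d7"]
relevant = {"d1"}
print(hit_rate_at_k(ranking, relevant, 10), mrr_at_k(ranking, relevant, 10))
```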
per_query_metrics is intentionally narrower right now and supports:
- `hit_rate@k`
- `ndcg@k`
- `mrr@k`
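A check for the narrower per-query list could look roughly like the sketch below. This is not the repo's actual validation code: `validate_per_query_metrics` and the regular expression are assumptions made for illustration.

```python
import re

# Hypothetical pattern: one of the three supported names, then "@" and a positive integer.
_PER_QUERY_PATTERN = re.compile(r"^(hit_rate|ndcg|mrr)@([1-9][0-9]*)$")


def validate_per_query_metrics(names: list[str]) -> list[str]:
    """Return the names unchanged, or raise if any is outside the supported set."""
    unsupported = [name for name in names if not _PER_QUERY_PATTERN.match(name)]
    if unsupported:
        raise ValueError(f"unsupported per-query metrics: {unsupported}")
    return list(names)


print(validate_per_query_metrics(["hit_rate@10", "ndcg@10"]))
```

Under this sketch, `precision@5` would be accepted in `metrics` but rejected in `per_query_metrics`.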
- Python `3.11+`
- Redis with RediSearch / vector search support
- Ollama running locally if you want to benchmark Ollama models
- `OPENAI_API_KEY` set if you want to benchmark OpenAI models
From the project root:

    uv sync --extra dev

This creates `.venv` and installs the CLI locally.
You need a Redis instance with vector search enabled.
If you already have one running, point the config at it.
If not, a quick local option is:

    docker run --rm -p 6379:6379 redis/redis-stack-server:latest

Set your API key:

    export OPENAI_API_KEY=<your-openai-api-key>

Start Ollama and pull the models you want to test:

    ollama serve
    ollama pull nomic-embed-text-v2-moe
    ollama pull mxbai-embed-large

Generate a local starter config:

    ./.venv/bin/embedding-benchmark init-config

This writes `benchmark.yaml` in the current directory. `benchmark.yaml` is
treated as a local working file and is gitignored on purpose.

To write it somewhere else:

    ./.venv/bin/embedding-benchmark init-config --output my-benchmark.yaml

If you want a committed example to start from, copy `benchmark.example.yaml`.
Example dataset entry:

    datasets:
      - id: NanoNQ
        kind: nanobeir
        source: zeta-alpha-ai/NanoNQ
        enabled: true

Example evaluation section:

    evaluation:
      top_k: 10
      metrics:
        - hit_rate@10
        - ndcg@10
        - precision@5
      per_query_metrics:
        - hit_rate@10
        - ndcg@10
      output_dir: runs

To list the models defined in a config:

    ./.venv/bin/embedding-benchmark list-models --config benchmark.yaml

The generated starter config includes:

- `ollama-nomic`
- `ollama-mxbai`
- `openai-small`
- `openai-large`
- `redisvl-minilm`
Run the benchmark with the config:

    ./.venv/bin/embedding-benchmark run --config benchmark.yaml

The generated starter config enables:

- `NanoNQ`
- `NanoSciFact`
- `NanoFiQA2018`
- `NanoArguAna`
Artifacts are written under runs/<timestamp>-benchmark/. The runs/
directory is gitignored because it contains local benchmark artifacts.
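If you want to pick up the most recent run from a script, a minimal sketch is shown below. It assumes run directory names start with a sortable timestamp (as in `20260519T202913Z-benchmark`) and makes no assumptions about the fields inside `summary.json`. `latest_run_dir` and `load_summary` are hypothetical helpers, not part of the CLI.

```python
import json
from pathlib import Path


def latest_run_dir(runs_root="runs") -> Path:
    """Return the run directory whose name sorts last (newest timestamp)."""
    candidates = sorted(
        (p for p in Path(runs_root).iterdir()
         if p.is_dir() and p.name.endswith("-benchmark")),
        key=lambda p: p.name,
    )
    if not candidates:
        raise FileNotFoundError(f"no run directories found under {runs_root}")
    return candidates[-1]


def load_summary(run_dir) -> dict:
    """Load summary.json from a run directory as plain Python data."""
    with open(Path(run_dir) / "summary.json", encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `load_summary(latest_run_dir())` returns the parsed summary of the newest run under `runs/`.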
For models that support native reduced dimensions:

    ./.venv/bin/embedding-benchmark sweep-dims --config benchmark.yaml --model openai-small

Example models that can sweep dimensions:

- `openai-small`
- `openai-large`
- `ollama-nomic`

To inspect a saved run:

    ./.venv/bin/embedding-benchmark inspect --config benchmark.yaml --run-id 20260519T202913Z-benchmark

This prints the saved summary, including Redis index names and key prefixes.
Remove Redis keys for a specific run:

    ./.venv/bin/embedding-benchmark cleanup --config benchmark.yaml --run-id 20260519T202913Z-benchmark

Or remove by key prefix:

    ./.venv/bin/embedding-benchmark cleanup --config benchmark.yaml --prefix embedbench

If you want the quickest manual check:

- Start Redis.
- Export `OPENAI_API_KEY`.
- Generate `benchmark.yaml`.
- Edit `benchmark.yaml` so only one small model and one dataset are enabled.
- Run:

      ./.venv/bin/embedding-benchmark run --config benchmark.yaml

Then confirm:
- the CLI prints a results table
- a new folder appears under `runs/`
- `summary.json` and `metrics.csv` exist
- the Redis keys/indexes are visible in Redis Insight if `keep_indexes: true`

Run the test suite:

    ./.venv/bin/pytest -q

Implementation notes:

- OpenAI models in this project go through RedisVL, not the direct OpenAI SDK.
- Ollama models use the Ollama SDK directly because RedisVL does not currently support Ollama embeddings.
- Dataset loading is selected by `datasets[].kind`, so users can define different benchmark mixes without changing Python code.
- Metric names are validated at config load time so bad `ranx` metric strings fail before a run starts.
- Redis query results are converted into ranking scores before evaluation so `ranx` sees the correct ordering.
- The committed `benchmark.example.yaml` matches the generated default config, which enables all four NanoBEIR datasets.
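The conversion from Redis results to ranking scores matters because vector search returns distances, where lower is better, while ranking evaluation expects scores, where higher is better. A minimal sketch of one way to do this follows. It assumes each result is a dict with `id` and `vector_distance` keys and simply negates the distance; the repo's actual conversion may differ, and `distances_to_scores` is a hypothetical name.

```python
def distances_to_scores(results, distance_field="vector_distance"):
    """Map document id to a score that increases as distance decreases."""
    scores = {}
    for row in results:
        # Distances may arrive as strings, so coerce before negating.
        scores[str(row["id"])] = -float(row[distance_field])
    return scores


# Illustrative search results for one query (values are made up).
example_results = [
    {"id": "doc:1", "vector_distance": "0.12"},
    {"id": "doc:2", "vector_distance": "0.40"},
]
print(distances_to_scores(example_results))
```

Negation preserves the ordering for any distance metric, so the closest document ends up with the highest score.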