Add Qwen3 Embedding recipes (0.6B and 8B) #355

Merged

hanbitmyths merged 30 commits into microsoft:main from natke:natke/qwen3-embedding-recipes on Apr 22, 2026

Conversation

@natke (Contributor) commented Apr 10, 2026

Summary

Add Olive recipes for Qwen3-Embedding-0.6B and Qwen3-Embedding-8B embedding models, targeting CPU, CUDA, and WebGPU execution providers.

Models

Recipes (6 total)

  • Qwen-Qwen3-Embedding-0.6B/{cpu,cuda,webgpu}
  • Qwen-Qwen3-Embedding-8B/{cpu,cuda,webgpu}

Each recipe includes (a config sketch follows this list):

  • INT4 precision with include_hidden_states=1 for embedding extraction
  • ModelBuilder pass for ONNX/GenAI export
  • MTEBEvaluator configured with STS17 benchmark
  • README with setup and evaluation instructions
  • info.yaml, requirements.txt, LICENSE
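
For orientation, here is a minimal sketch of the overall shape of one of these recipe configs, pieced together from the items above. The exact field names (extra_args, tasks, output_dir) and option spellings are assumptions following the general Olive config layout; the recipe files in this PR are authoritative.

```json
{
  "input_model": {
    "type": "HfModel",
    "model_path": "Qwen/Qwen3-Embedding-0.6B"
  },
  "passes": {
    "builder": {
      "type": "ModelBuilder",
      "precision": "int4",
      "extra_args": { "include_hidden_states": 1 }
    }
  },
  "evaluators": {
    "mteb": {
      "type": "MTEBEvaluator",
      "tasks": [ "STS17" ]
    }
  },
  "output_dir": "model"
}
```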

Dependencies

Add Olive recipes for Qwen3-Embedding-0.6B and Qwen3-Embedding-8B models
targeting CPU, CUDA, and WebGPU execution providers.

- FP32 precision with include_hidden_states for embedding extraction
- ModelBuilder pass for ONNX/GenAI export
- MTEBEvaluator configured with STS17 benchmark task
- READMEs with MTEB leaderboard scores and STS17 evaluation results

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 10, 2026 22:25
Copilot AI left a comment

Pull request overview

Adds new Olive recipe folders for the Qwen3 embedding model variants (0.6B, 8B) across CPU/CUDA/WebGPU, including configs, environment requirements, and documentation to support FP32 export with hidden states enabled and MTEB (STS17) evaluation wiring.

Changes:

  • Added CPU/CUDA/WebGPU Olive configs using ModelBuilder (FP32) with include_hidden_states=1 for Qwen3 embedding models.
  • Added per-recipe requirements.txt, info.yaml, and auto-generated README.md files.
  • Added per-model Apache 2.0 LICENSE files.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.

Summary per file:

File | Description
--- | ---
Qwen-Qwen3-Embedding-8B/LICENSE | Adds license file for the 8B embedding model recipe set.
Qwen-Qwen3-Embedding-8B/cpu/requirements.txt | Python deps for running the CPU recipe and MTEB evaluation.
Qwen-Qwen3-Embedding-8B/cpu/README.md | CPU recipe usage + leaderboard references.
Qwen-Qwen3-Embedding-8B/cpu/Qwen-Qwen3-Embedding-8B_cpu_fp32.json | Olive config: HF model → ModelBuilder FP32 + MTEBEvaluator(STS17).
Qwen-Qwen3-Embedding-8B/cpu/info.yaml | Registers the CPU recipe metadata (device/EP/file).
Qwen-Qwen3-Embedding-8B/cuda/requirements.txt | Python deps for running the CUDA recipe and MTEB evaluation.
Qwen-Qwen3-Embedding-8B/cuda/README.md | CUDA recipe usage + leaderboard references.
Qwen-Qwen3-Embedding-8B/cuda/Qwen-Qwen3-Embedding-8B_cuda_fp32.json | Olive config: HF model → ModelBuilder FP32 + MTEBEvaluator(STS17) on CUDA EP.
Qwen-Qwen3-Embedding-8B/cuda/info.yaml | Registers the CUDA recipe metadata (device/EP/file).
Qwen-Qwen3-Embedding-8B/webgpu/requirements.txt | Python deps for running the WebGPU recipe and MTEB evaluation.
Qwen-Qwen3-Embedding-8B/webgpu/README.md | WebGPU recipe usage + leaderboard references.
Qwen-Qwen3-Embedding-8B/webgpu/Qwen-Qwen3-Embedding-8B_webgpu_fp32.json | Olive config: HF model → ModelBuilder FP32 + MTEBEvaluator(STS17) on WebGPU EP.
Qwen-Qwen3-Embedding-8B/webgpu/info.yaml | Registers the WebGPU recipe metadata (device/EP/file).
Qwen-Qwen3-Embedding-0.6B/LICENSE | Adds license file for the 0.6B embedding model recipe set.
Qwen-Qwen3-Embedding-0.6B/cpu/requirements.txt | Python deps for running the CPU recipe and MTEB evaluation.
Qwen-Qwen3-Embedding-0.6B/cpu/README.md | CPU recipe usage + leaderboard references + STS17 eval results section.
Qwen-Qwen3-Embedding-0.6B/cpu/Qwen-Qwen3-Embedding-0.6B_cpu_fp32.json | Olive config: HF model → ModelBuilder FP32 + MTEBEvaluator(STS17).
Qwen-Qwen3-Embedding-0.6B/cpu/info.yaml | Registers the CPU recipe metadata (device/EP/file).
Qwen-Qwen3-Embedding-0.6B/cuda/requirements.txt | Python deps for running the CUDA recipe and MTEB evaluation.
Qwen-Qwen3-Embedding-0.6B/cuda/README.md | CUDA recipe usage + leaderboard references.
Qwen-Qwen3-Embedding-0.6B/cuda/Qwen-Qwen3-Embedding-0.6B_cuda_fp32.json | Olive config: HF model → ModelBuilder FP32 + MTEBEvaluator(STS17) on CUDA EP.
Qwen-Qwen3-Embedding-0.6B/cuda/info.yaml | Registers the CUDA recipe metadata (device/EP/file).
Qwen-Qwen3-Embedding-0.6B/webgpu/requirements.txt | Python deps for running the WebGPU recipe and MTEB evaluation.
Qwen-Qwen3-Embedding-0.6B/webgpu/README.md | WebGPU recipe usage + leaderboard references.
Qwen-Qwen3-Embedding-0.6B/webgpu/Qwen-Qwen3-Embedding-0.6B_webgpu_fp32.json | Olive config: HF model → ModelBuilder FP32 + MTEBEvaluator(STS17) on WebGPU EP.
Qwen-Qwen3-Embedding-0.6B/webgpu/info.yaml | Registers the WebGPU recipe metadata (device/EP/file).


Outdated review comment threads: Qwen-Qwen3-Embedding-8B/cpu/README.md, Qwen-Qwen3-Embedding-8B/cuda/README.md, Qwen-Qwen3-Embedding-8B/webgpu/README.md, Qwen-Qwen3-Embedding-0.6B/cuda/README.md, Qwen-Qwen3-Embedding-0.6B/webgpu/README.md
@natke natke marked this pull request as draft April 10, 2026 22:33
natke and others added 7 commits April 14, 2026 12:27
Base recipes build the model only. _with_eval variants include MTEB
evaluation, which requires the target EP to be available at runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
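
To make that split concrete: a base recipe ends at the build passes, and the _with_eval variant layers the evaluator on top, roughly as below. The evaluator reference key and schema are assumptions based on the recipe descriptions in this PR.

```json
{
  "passes": {
    "builder": { "type": "ModelBuilder" }
  },
  "evaluators": {
    "mteb": { "type": "MTEBEvaluator", "tasks": [ "STS17" ] }
  },
  "evaluator": "mteb"
}
```

Evaluation runs inference on the converted model, so the _with_eval variants only work where the target execution provider is actually available, which is why the build-only base recipes exist.
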
@natke natke marked this pull request as ready for review April 14, 2026 21:46
natke and others added 4 commits April 14, 2026 15:59
@shaahji (Contributor) commented Apr 15, 2026

/azp run

Outdated review comment thread: Qwen-Qwen3-Embedding-0.6B/cuda/Qwen-Qwen3-Embedding-0.6B_cuda_fp16.json
natke and others added 2 commits April 16, 2026 14:26
Add int4 ModelBuilder recipes for cpu, cuda, and webgpu targets with
SelectiveMixedPrecision, GPTQ, and RTN passes matching the Qwen3-8B
chat model pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add _with_eval variants of the int4 recipes for cpu, cuda, and webgpu
targets, including MTEBEvaluator with STS17 task.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
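
A sketch of the int4 pass chain these two commits describe, in the order named. The pass type strings and their defaults are assumptions; the Qwen3-8B chat model recipes are the pattern being copied.

```json
{
  "passes": {
    "mixed_precision": { "type": "SelectiveMixedPrecision" },
    "gptq": { "type": "Gptq" },
    "rtn": { "type": "Rtn" },
    "builder": { "type": "ModelBuilder", "precision": "int4" }
  }
}
```

The _with_eval variants would append the MTEBEvaluator wiring shown earlier on top of this same chain.
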
@hanbitmyths (Contributor) commented

It would be good to remove the fp32 cuda and webgpu recipes and add fp16 cuda and webgpu recipes.

natke and others added 3 commits April 16, 2026 15:48

Expand MTEB eval tasks beyond STS17 to include retrieval benchmarks
for better quantization quality assessment across all _with_eval recipes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Expand MTEB eval tasks to include NFCorpus, ArguAna, and SciFact
alongside STS17 for the 0.6B embedding model _with_eval recipes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
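
After these changes, the evaluator block in the _with_eval recipes would carry the expanded task list, roughly as follows. Task identifiers come from the commit messages; the evaluator schema is an assumption.

```json
{
  "evaluators": {
    "mteb": {
      "type": "MTEBEvaluator",
      "tasks": [ "STS17", "NFCorpus", "ArguAna", "SciFact" ]
    }
  }
}
```

STS17 measures semantic textual similarity, while NFCorpus, ArguAna, and SciFact are retrieval tasks, so the suite gives a broader read on how quantization affects embedding quality.
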
Update precision from fp32 to fp16 for cuda and webgpu targets on both
Qwen3-Embedding-0.6B and Qwen3-Embedding-8B recipes. Rename files
accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Outdated review comment thread: Qwen-Qwen3-Embedding-0.6B/cuda/Qwen-Qwen3-Embedding-0.6B_cuda_fp16.json
@hanbitmyths (Contributor) commented

Did you intentionally add int4 quantization for only the 8B model?

@hanbitmyths (Contributor) commented

The Qwen3-Embedding-0.6B model supports tie_word_embeddings. Did you check whether the ONNX model ties the embeddings correctly?

natke and others added 5 commits April 17, 2026 11:03
The filenames were renamed from fp32 to fp16 but the precision field
inside the JSON configs was not updated. This fixes all 8 affected
files (0.6B and 8B, cuda and webgpu, plain and with_eval variants).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds int4 recipes for cpu, cuda, and webgpu targets (plain and
with_eval variants). Uses same SMP+GPTQ+RTN+ModelBuilder pipeline
as the 8B int4 recipes. WebGPU uses group_size=32, others use 128.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Qwen3-Embedding-0.6B model has tie_word_embeddings=True but the
ONNX models produced by ModelBuilder do not preserve this. Add a
GraphSurgeries pass with TieWordEmbeddings surgeon after ModelBuilder,
matching the pattern used by the Qwen3-0.6B chat model recipes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
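
A sketch of that fix, with the surgery appended after the builder. GraphSurgeries is an Olive pass; the exact surgeries/surgeon spelling here is an assumption based on the commit message.

```json
{
  "passes": {
    "builder": { "type": "ModelBuilder", "precision": "int4" },
    "tie_embeddings": {
      "type": "GraphSurgeries",
      "surgeries": [ { "surgeon": "TieWordEmbeddings" } ]
    }
  }
}
```

Order matters: the surgeon edits the ONNX graph that ModelBuilder emits, restoring the weight sharing between the input embedding and the output head that tie_word_embeddings=True provides in the PyTorch model.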

- Create baseline/ folder for 0.6B and 8B with PyTorch-only eval recipes
- Baseline recipes have no passes, just evaluate the HfModel with MTEB
- Add evaluate_input_model: false to all existing _with_eval recipes
- This avoids duplicate PyTorch eval when running ONNX model evaluation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
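
A baseline recipe per this commit would look roughly like the sketch below: no passes, so the only work Olive does is MTEB evaluation of the input HfModel. Field names are assumptions; evaluate_input_model is the engine switch the commit sets to false in the _with_eval recipes so the PyTorch model is not evaluated a second time there.

```json
{
  "input_model": {
    "type": "HfModel",
    "model_path": "Qwen/Qwen3-Embedding-0.6B"
  },
  "passes": {},
  "evaluators": {
    "mteb": { "type": "MTEBEvaluator", "tasks": [ "STS17", "NFCorpus", "ArguAna", "SciFact" ] }
  },
  "evaluator": "mteb"
}
```
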
@natke natke force-pushed the natke/qwen3-embedding-recipes branch from 04bba12 to fb20739 on April 18, 2026 15:58
natke and others added 6 commits April 18, 2026 09:05
Prevents recipes in the same folder from overwriting each other's output.
e.g. model_cpu_int4, model_cuda_fp16, model_pytorch, etc.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
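
Concretely, each config would pin a distinct output directory, for example (key name per Olive convention; directory names from the commit message):

```json
{ "output_dir": "model_cpu_int4" }
```

versus "model_cuda_fp16" or "model_pytorch" in the sibling recipes, so runs launched from the same folder no longer clobber each other's output.
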
- Update all 6 READMEs to reference INT4 build/eval configs instead of FP32/FP16
- Add config_sentence_transformers.json copy step instructions where needed
- Remove stale FP32/FP16 evaluation results
- Update pipeline descriptions to match actual INT4 pass chains
- Update 0.6B CUDA INT4 eval results with prompt fix (NFCorpus: 0.287 -> 0.351)
@hanbitmyths hanbitmyths merged commit f467f99 into microsoft:main Apr 22, 2026
3 checks passed