Add Qwen3 Embedding recipes (0.6B and 8B) #355

Merged

hanbitmyths merged 30 commits into microsoft:main from natke:natke/qwen3-embedding-recipes on Apr 22, 2026

Conversation

@natke (Contributor) commented Apr 10, 2026

Summary

Add Olive recipes for Qwen3-Embedding-0.6B and Qwen3-Embedding-8B embedding models, targeting CPU, CUDA, and WebGPU execution providers.

Models

Recipes (6 total)

  • Qwen-Qwen3-Embedding-0.6B/{cpu,cuda,webgpu}
  • Qwen-Qwen3-Embedding-8B/{cpu,cuda,webgpu}

Each recipe includes (a config sketch follows this list):

  • INT4 precision with include_hidden_states=1 for embedding extraction
  • ModelBuilder pass for ONNX/GenAI export
  • MTEBEvaluator configured with STS17 benchmark
  • README with setup and evaluation instructions
  • info.yaml, requirements.txt, LICENSE
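
For orientation, here is a minimal sketch of the overall shape of one of these recipe configs, pieced together from the items above. The exact field names (extra_args, tasks, output_dir) and option spellings are assumptions following the general Olive config layout; the recipe files in this PR are authoritative.

```json
{
  "input_model": {
    "type": "HfModel",
    "model_path": "Qwen/Qwen3-Embedding-0.6B"
  },
  "passes": {
    "builder": {
      "type": "ModelBuilder",
      "precision": "int4",
      "extra_args": { "include_hidden_states": 1 }
    }
  },
  "evaluators": {
    "mteb": {
      "type": "MTEBEvaluator",
      "tasks": [ "STS17" ]
    }
  },
  "output_dir": "model"
}
```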

Dependencies

Add Olive recipes for Qwen3-Embedding-0.6B and Qwen3-Embedding-8B models
targeting CPU, CUDA, and WebGPU execution providers.

- FP32 precision with include_hidden_states for embedding extraction
- ModelBuilder pass for ONNX/GenAI export
- MTEBEvaluator configured with STS17 benchmark task
- READMEs with MTEB leaderboard scores and STS17 evaluation results

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 10, 2026 22:25
Copilot AI left a comment

Pull request overview

Adds new Olive recipe folders for the Qwen3 embedding model variants (0.6B, 8B) across CPU/CUDA/WebGPU, including configs, environment requirements, and documentation to support FP32 export with hidden states enabled and MTEB (STS17) evaluation wiring.

Changes:

  • Added CPU/CUDA/WebGPU Olive configs using ModelBuilder (FP32) with include_hidden_states=1 for Qwen3 embedding models.
  • Added per-recipe requirements.txt, info.yaml, and auto-generated README.md files.
  • Added per-model Apache 2.0 LICENSE files.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.

Summary per file:

File | Description
--- | ---
Qwen-Qwen3-Embedding-8B/LICENSE | Adds license file for the 8B embedding model recipe set.
Qwen-Qwen3-Embedding-8B/cpu/requirements.txt | Python deps for running the CPU recipe and MTEB evaluation.
Qwen-Qwen3-Embedding-8B/cpu/README.md | CPU recipe usage + leaderboard references.
Qwen-Qwen3-Embedding-8B/cpu/Qwen-Qwen3-Embedding-8B_cpu_fp32.json | Olive config: HF model → ModelBuilder FP32 + MTEBEvaluator(STS17).
Qwen-Qwen3-Embedding-8B/cpu/info.yaml | Registers the CPU recipe metadata (device/EP/file).
Qwen-Qwen3-Embedding-8B/cuda/requirements.txt | Python deps for running the CUDA recipe and MTEB evaluation.
Qwen-Qwen3-Embedding-8B/cuda/README.md | CUDA recipe usage + leaderboard references.
Qwen-Qwen3-Embedding-8B/cuda/Qwen-Qwen3-Embedding-8B_cuda_fp32.json | Olive config: HF model → ModelBuilder FP32 + MTEBEvaluator(STS17) on CUDA EP.
Qwen-Qwen3-Embedding-8B/cuda/info.yaml | Registers the CUDA recipe metadata (device/EP/file).
Qwen-Qwen3-Embedding-8B/webgpu/requirements.txt | Python deps for running the WebGPU recipe and MTEB evaluation.
Qwen-Qwen3-Embedding-8B/webgpu/README.md | WebGPU recipe usage + leaderboard references.
Qwen-Qwen3-Embedding-8B/webgpu/Qwen-Qwen3-Embedding-8B_webgpu_fp32.json | Olive config: HF model → ModelBuilder FP32 + MTEBEvaluator(STS17) on WebGPU EP.
Qwen-Qwen3-Embedding-8B/webgpu/info.yaml | Registers the WebGPU recipe metadata (device/EP/file).
Qwen-Qwen3-Embedding-0.6B/LICENSE | Adds license file for the 0.6B embedding model recipe set.
Qwen-Qwen3-Embedding-0.6B/cpu/requirements.txt | Python deps for running the CPU recipe and MTEB evaluation.
Qwen-Qwen3-Embedding-0.6B/cpu/README.md | CPU recipe usage + leaderboard references + STS17 eval results section.
Qwen-Qwen3-Embedding-0.6B/cpu/Qwen-Qwen3-Embedding-0.6B_cpu_fp32.json | Olive config: HF model → ModelBuilder FP32 + MTEBEvaluator(STS17).
Qwen-Qwen3-Embedding-0.6B/cpu/info.yaml | Registers the CPU recipe metadata (device/EP/file).
Qwen-Qwen3-Embedding-0.6B/cuda/requirements.txt | Python deps for running the CUDA recipe and MTEB evaluation.
Qwen-Qwen3-Embedding-0.6B/cuda/README.md | CUDA recipe usage + leaderboard references.
Qwen-Qwen3-Embedding-0.6B/cuda/Qwen-Qwen3-Embedding-0.6B_cuda_fp32.json | Olive config: HF model → ModelBuilder FP32 + MTEBEvaluator(STS17) on CUDA EP.
Qwen-Qwen3-Embedding-0.6B/cuda/info.yaml | Registers the CUDA recipe metadata (device/EP/file).
Qwen-Qwen3-Embedding-0.6B/webgpu/requirements.txt | Python deps for running the WebGPU recipe and MTEB evaluation.
Qwen-Qwen3-Embedding-0.6B/webgpu/README.md | WebGPU recipe usage + leaderboard references.
Qwen-Qwen3-Embedding-0.6B/webgpu/Qwen-Qwen3-Embedding-0.6B_webgpu_fp32.json | Olive config: HF model → ModelBuilder FP32 + MTEBEvaluator(STS17) on WebGPU EP.
Qwen-Qwen3-Embedding-0.6B/webgpu/info.yaml | Registers the WebGPU recipe metadata (device/EP/file).


Outdated review comment threads: Qwen-Qwen3-Embedding-8B/cpu/README.md, Qwen-Qwen3-Embedding-8B/cuda/README.md, Qwen-Qwen3-Embedding-8B/webgpu/README.md, Qwen-Qwen3-Embedding-0.6B/cuda/README.md, Qwen-Qwen3-Embedding-0.6B/webgpu/README.md
@natke natke marked this pull request as draft April 10, 2026 22:33
natke and others added 7 commits April 14, 2026 12:27
Base recipes build the model only. _with_eval variants include MTEB
evaluation, which requires the target EP to be available at runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
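
To make that split concrete: a base recipe ends at the build passes, and the _with_eval variant layers the evaluator on top, roughly as below. The evaluator reference key and schema are assumptions based on the recipe descriptions in this PR.

```json
{
  "passes": {
    "builder": { "type": "ModelBuilder" }
  },
  "evaluators": {
    "mteb": { "type": "MTEBEvaluator", "tasks": [ "STS17" ] }
  },
  "evaluator": "mteb"
}
```

Evaluation runs inference on the converted model, so the _with_eval variants only work where the target execution provider is actually available, which is why the build-only base recipes exist.
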
@natke natke marked this pull request as ready for review April 14, 2026 21:46
natke and others added 4 commits April 14, 2026 15:59
@shaahji (Contributor) commented Apr 15, 2026

/azp run

Outdated review comment thread: Qwen-Qwen3-Embedding-0.6B/cuda/Qwen-Qwen3-Embedding-0.6B_cuda_fp16.json
natke and others added 2 commits April 16, 2026 14:26
Add int4 ModelBuilder recipes for cpu, cuda, and webgpu targets with
SelectiveMixedPrecision, GPTQ, and RTN passes matching the Qwen3-8B
chat model pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add _with_eval variants of the int4 recipes for cpu, cuda, and webgpu
targets, including MTEBEvaluator with STS17 task.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
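
A sketch of the int4 pass chain these two commits describe, in the order named. The pass type strings and their defaults are assumptions; the Qwen3-8B chat model recipes are the pattern being copied.

```json
{
  "passes": {
    "mixed_precision": { "type": "SelectiveMixedPrecision" },
    "gptq": { "type": "Gptq" },
    "rtn": { "type": "Rtn" },
    "builder": { "type": "ModelBuilder", "precision": "int4" }
  }
}
```

The _with_eval variants would append the MTEBEvaluator wiring shown earlier on top of this same chain.
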
@hanbitmyths (Contributor) commented

It would be good to remove the fp32 cuda and webgpu recipes and add fp16 cuda and webgpu recipes.

natke and others added 3 commits April 16, 2026 15:48

Expand MTEB eval tasks beyond STS17 to include retrieval benchmarks
for better quantization quality assessment across all _with_eval recipes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Expand MTEB eval tasks to include NFCorpus, ArguAna, and SciFact
alongside STS17 for the 0.6B embedding model _with_eval recipes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
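
After these changes, the evaluator block in the _with_eval recipes would carry the expanded task list, roughly as follows. Task identifiers come from the commit messages; the evaluator schema is an assumption.

```json
{
  "evaluators": {
    "mteb": {
      "type": "MTEBEvaluator",
      "tasks": [ "STS17", "NFCorpus", "ArguAna", "SciFact" ]
    }
  }
}
```

STS17 measures semantic textual similarity, while NFCorpus, ArguAna, and SciFact are retrieval tasks, so the suite gives a broader read on how quantization affects embedding quality.
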
Update precision from fp32 to fp16 for cuda and webgpu targets on both
Qwen3-Embedding-0.6B and Qwen3-Embedding-8B recipes. Rename files
accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Outdated review comment thread: Qwen-Qwen3-Embedding-0.6B/cuda/Qwen-Qwen3-Embedding-0.6B_cuda_fp16.json
@hanbitmyths (Contributor) commented

Did you intentionally add int4 quantization for only the 8B model?

@hanbitmyths (Contributor) commented

The Qwen3-Embedding-0.6B model supports tie_word_embeddings. Did you check whether the ONNX model ties the embeddings correctly?

natke and others added 5 commits April 17, 2026 11:03
The filenames were renamed from fp32 to fp16 but the precision field
inside the JSON configs was not updated. This fixes all 8 affected
files (0.6B and 8B, cuda and webgpu, plain and with_eval variants).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds int4 recipes for cpu, cuda, and webgpu targets (plain and
with_eval variants). Uses same SMP+GPTQ+RTN+ModelBuilder pipeline
as the 8B int4 recipes. WebGPU uses group_size=32, others use 128.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Qwen3-Embedding-0.6B model has tie_word_embeddings=True but the
ONNX models produced by ModelBuilder do not preserve this. Add a
GraphSurgeries pass with TieWordEmbeddings surgeon after ModelBuilder,
matching the pattern used by the Qwen3-0.6B chat model recipes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
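
A sketch of that fix, with the surgery appended after the builder. GraphSurgeries is an Olive pass; the exact surgeries/surgeon spelling here is an assumption based on the commit message.

```json
{
  "passes": {
    "builder": { "type": "ModelBuilder", "precision": "int4" },
    "tie_embeddings": {
      "type": "GraphSurgeries",
      "surgeries": [ { "surgeon": "TieWordEmbeddings" } ]
    }
  }
}
```

Order matters: the surgeon edits the ONNX graph that ModelBuilder emits, restoring the weight sharing between the input embedding and the output head that tie_word_embeddings=True provides in the PyTorch model.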

- Create baseline/ folder for 0.6B and 8B with PyTorch-only eval recipes
- Baseline recipes have no passes, just evaluate the HfModel with MTEB
- Add evaluate_input_model: false to all existing _with_eval recipes
- This avoids duplicate PyTorch eval when running ONNX model evaluation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
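
A baseline recipe per this commit would look roughly like the sketch below: no passes, so the only work Olive does is MTEB evaluation of the input HfModel. Field names are assumptions; evaluate_input_model is the engine switch the commit sets to false in the _with_eval recipes so the PyTorch model is not evaluated a second time there.

```json
{
  "input_model": {
    "type": "HfModel",
    "model_path": "Qwen/Qwen3-Embedding-0.6B"
  },
  "passes": {},
  "evaluators": {
    "mteb": { "type": "MTEBEvaluator", "tasks": [ "STS17", "NFCorpus", "ArguAna", "SciFact" ] }
  },
  "evaluator": "mteb"
}
```
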
@natke natke force-pushed the natke/qwen3-embedding-recipes branch from 04bba12 to fb20739 on April 18, 2026 15:58
natke and others added 6 commits April 18, 2026 09:05
Prevents recipes in the same folder from overwriting each other's output.
e.g. model_cpu_int4, model_cuda_fp16, model_pytorch, etc.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
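
Concretely, each config would pin a distinct output directory, for example (key name per Olive convention; directory names from the commit message):

```json
{ "output_dir": "model_cpu_int4" }
```

versus "model_cuda_fp16" or "model_pytorch" in the sibling recipes, so runs launched from the same folder no longer clobber each other's output.
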
- Update all 6 READMEs to reference INT4 build/eval configs instead of FP32/FP16
- Add config_sentence_transformers.json copy step instructions where needed
- Remove stale FP32/FP16 evaluation results
- Update pipeline descriptions to match actual INT4 pass chains
- Update 0.6B CUDA INT4 eval results with prompt fix (NFCorpus: 0.287 -> 0.351)
@hanbitmyths hanbitmyths merged commit f467f99 into microsoft:main Apr 22, 2026
3 checks passed