ggml: add f16 out_prod support for CPU and out_prod op for Vulkan by Lamothe · Pull Request #23997 · ggml-org/llama.cpp

Lamothe · 2026-06-01T21:55:19Z

Overview

Add F16 OUT_PROD support for the CPU backend and OUT_PROD op support for the Vulkan backend. Also adds F16 src1 support for IM2COL_BACK on CPU. My goal is to enable training support on non-CUDA backends (CPU, Vulkan), in my case, the Strix Halo.

CPU:

Implement ggml_compute_forward_out_prod_f16_f32 — handles OUT_PROD with F16 src0 and F32 src1/dst, matching the existing quantized out_prod pattern
Enable F16 in the graph plan for OUT_PROD (ggml-cpu.c)
Relax IM2COL_BACK assertion to accept F16 src1 (convolution kernel)
Update supports_op for both ops accordingly
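
To make the CPU item concrete, here is a minimal standalone sketch of an F16 x F32 outer-product accumulation in plain C. It is not the code from this PR: the function names, the 2D-only layout, and the caller-provided scratch row are illustrative assumptions, and the half-to-float helper is a portable stand-in for whatever conversion ggml actually uses. It only shows the general idea of converting one F16 row of src0 to F32 and then accumulating it into dst scaled by each src1 value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable IEEE 754 binary16 -> binary32 conversion (stand-in helper, not ggml's). */
float half_to_float(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t       mant = h & 0x3FFu;
    uint32_t       bits;

    if (exp == 0x1Fu) {                 /* inf / NaN */
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp != 0) {              /* normal: rebias exponent from 15 to 127 */
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    } else if (mant == 0) {             /* signed zero */
        bits = sign;
    } else {                            /* subnormal: renormalize the mantissa */
        uint32_t e = 113u;
        while ((mant & 0x400u) == 0) { mant <<= 1; e--; }
        bits = sign | (e << 23) | ((mant & 0x3FFu) << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/*
 * 2D case only (assumed layout):
 *   dst[i1*ne0 + i0] = sum over k of a[k*ne0 + i0] * b[k*ne1 + i1]
 * a holds raw F16 bits, b and dst are F32. row_f32 is caller-provided
 * scratch of ne0 floats, playing the role of a per-thread work buffer.
 */
void out_prod_f16_f32_2d(const uint16_t *a, const float *b, float *dst,
                         int ne0, int ne1, int nk, float *row_f32) {
    assert(ne0 > 0 && ne1 > 0 && nk > 0);
    for (int i = 0; i < ne0 * ne1; i++) {
        dst[i] = 0.0f;
    }
    for (int k = 0; k < nk; k++) {
        /* Convert row k of a to F32 once, then reuse it for every column of dst. */
        for (int i0 = 0; i0 < ne0; i0++) {
            row_f32[i0] = half_to_float(a[k * ne0 + i0]);
        }
        for (int i1 = 0; i1 < ne1; i1++) {
            const float s = b[k * ne1 + i1];
            for (int i0 = 0; i0 < ne0; i0++) {
                dst[i1 * ne0 + i0] += row_f32[i0] * s;
            }
        }
    }
}
```

Converting each row once into scratch keeps the inner loop a plain F32 multiply-add, which is presumably why a work buffer is involved at all.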

Vulkan:

Add OUT_PROD pipeline (F32) with shader compilation, pipeline creation, dispatch, and backend registration
Link SPIRV-Headers in CMakeLists.txt (required for shader compilation)

Additional information

Implements two items from #14909 (missing ops across backends):

OUT_PROD on Vulkan (F32)
OUT_PROD F16 on CPU (was a GGML_ABORT stub)

Tested: test-backend-ops -b CPU — 15900/15900 passed, including 16 new F16 OUT_PROD test cases. Builds clean with LLAMA_FATAL_WARNINGS=ON.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES — AI was used in an assistive capacity for code review and formatting corrections. All changes are fully understood and owned by the contributor.

ggml-gh-bot · 2026-06-01T22:00:23Z

Hi @Lamothe, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple backend changes in one PR: When adding support for a new model or feature, focus on CPU support only in the initial PR. Add support for other backends like CUDA in follow-up PRs. If you have a good reason to modify multiple backends in one PR, please explain it.
AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Lamothe · 2026-06-01T22:01:16Z

Just on a more personal note, this is a huge deal for me, a game changer. With these changes I can now train AI models using my Strix Halo 128 GB.

Lamothe · 2026-06-01T22:10:53Z

On the multiple backends thing, just let me know if you really need me to do that. The Vulkan changes are minimal (shader + pipeline wiring following existing patterns) and both backends are needed for the same feature (training ops). I'm not using the GPU for everything yet but, as I mentioned, this is a huge step forward. I just 10x'd my hardware.

jeffbolznv · 2026-06-01T23:07:17Z

I haven't reviewed in detail, but I get a bunch of failures in OUT_PROD:


  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,1],nr=[1,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,1],nr=[2,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,1],nr=[2,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.683476711 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,3],nr=[1,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,3],nr=[2,1],trans_b=0): OK
[OUT_PROD] ERR = 1.812003502 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,1],nr=[1,2],trans_b=0): OK
[OUT_PROD] ERR = 0.953885595 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,1],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.165509393 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,1],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.600564174 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,3],nr=[1,2],trans_b=0): FAIL
[OUT_PROD] ERR = 1.095811581 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,3],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.717469282 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,1],nr=[1,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,1],nr=[2,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,1],nr=[2,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.261481574 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,3],nr=[1,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,3],nr=[2,1],trans_b=0): OK
[OUT_PROD] ERR = 1.328894783 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,1],nr=[1,2],trans_b=0): OK
[OUT_PROD] ERR = 1.209753475 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,1],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.381416679 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,1],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.215906795 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,3],nr=[1,2],trans_b=0): FAIL
[OUT_PROD] ERR = 1.438983623 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,3],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.722599003 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,1],nr=[1,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,1],nr=[2,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,1],nr=[2,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.319761004 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,3],nr=[1,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,3],nr=[2,1],trans_b=0): OK
[OUT_PROD] ERR = 1.316149836 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,1],nr=[1,2],trans_b=0): OK
[OUT_PROD] ERR = 1.204602585 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,1],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.279120625 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,1],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.378535947 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,3],nr=[1,2],trans_b=0): FAIL
[OUT_PROD] ERR = 1.408648883 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,3],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.787642379 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,1],nr=[1,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,1],nr=[2,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,1],nr=[2,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.333422868 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,3],nr=[1,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,3],nr=[2,1],trans_b=0): OK
[OUT_PROD] ERR = 1.308338372 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,1],nr=[1,2],trans_b=0): OK
[OUT_PROD] ERR = 1.346531372 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,1],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.339575193 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,1],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.350936315 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,3],nr=[1,2],trans_b=0): FAIL
[OUT_PROD] ERR = 1.346864411 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,3],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.764744929 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[8,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[16,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[32,1],nr=[1,1],trans_b=0): OK

On the multiple backends thing, just let me know if you really need me to do that.

Since the op already exists, I think what you're doing is fine. But you might consider splitting into two PRs, one per op, to make it easier to review (and since one is currently failing).

jeffbolznv · 2026-06-02T00:01:01Z

codex found this shader fix:

-    uint a_i2 = i2 % p.ne02;
-    uint a_i3 = i3 % p.ne03;
+    uint a_i2 = i2 / (p.ne22 / p.ne02);
+    uint a_i3 = i3 / (p.ne23 / p.ne03);
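
The difference between the two mappings in this diff can be reproduced on the host. The sketch below is a standalone C model, not shader code; the function names are made up for illustration, and it assumes (as the test names suggest) that `bs` is the source batch size and `nr` the repeat factor, so the destination dimension is `bs * nr`. Under that assumption the two mappings agree whenever `bs == 1` or `nr == 1` and diverge otherwise, which lines up with which cases in the log above report FAIL.

```c
#include <assert.h>
#include <stdint.h>

/* Old mapping from the diff: wraps around the source batch dimension. */
uint32_t bcast_index_mod(uint32_t i_dst, uint32_t ne_src) {
    assert(ne_src > 0);
    return i_dst % ne_src;
}

/* New mapping from the diff: each source slice covers
 * ne_dst / ne_src consecutive destination slices. */
uint32_t bcast_index_div(uint32_t i_dst, uint32_t ne_dst, uint32_t ne_src) {
    assert(ne_src > 0 && ne_dst % ne_src == 0);
    return i_dst / (ne_dst / ne_src);
}

/* Count destination indices for which the two mappings pick different source slices. */
uint32_t bcast_mismatches(uint32_t ne_dst, uint32_t ne_src) {
    uint32_t n = 0;
    for (uint32_t i = 0; i < ne_dst; i++) {
        if (bcast_index_mod(i, ne_src) != bcast_index_div(i, ne_dst, ne_src)) {
            n++;
        }
    }
    return n;
}
```

For example, with a source dimension of 3 repeated twice (destination dimension 6), the modulo form yields 0,1,2,0,1,2 while the division form yields 0,0,1,1,2,2.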

jeffbolznv · 2026-06-02T00:55:33Z

im2col_back.comp looks like it's not hooked up?

Lamothe · 2026-06-02T00:55:42Z

I haven't reviewed in detail, but I get a bunch of failures in OUT_PROD:


On the multiple backends thing, just let me know if you really need me to do that.

Since the op already exists, I think what you're doing is fine. But you might consider splitting into two PRs, one per op, to make it easier to review (and since one is currently failing).

So, this is not from my change. The F32 OUT_PROD failures are with type_a=f32,type_b=f32 tests with broadcast (nr>1), which use the existing F32 implementation (ggml_compute_forward_out_prod_f32 at ops.cpp:4200) that I didn't touch. These are pre-existing failures in upstream master. I can't solve everything in one commit ;)

jeffbolznv · 2026-06-02T01:12:25Z

There are no failures in master, the op isn't supported in vulkan and doesn't run. This is reporting a mismatch between the cpu and vulkan backends. The cpu backend is very likely correct, since it is the reference and is passing against other backends.

Lamothe · 2026-06-02T01:44:06Z

There are no failures in master, the op isn't supported in vulkan and doesn't run. This is reporting a mismatch between the cpu and vulkan backends. The cpu backend is very likely correct, since it is the reference and is passing against other backends.

Right you are! Apologies for not trusting you the first time. I've corrected the broadcast calculation in out_prod.comp as per your suggestion, and I believe that was the issue here, right? The shader now uses division for broadcast index mapping, matching the CPU F32 reference.

jeffbolznv · 2026-06-02T02:44:43Z

    )

-    target_link_libraries(ggml-vulkan PRIVATE Vulkan::Vulkan)
+    target_link_libraries(ggml-vulkan PRIVATE Vulkan::Vulkan SPIRV-Headers::SPIRV-Headers)


I don't think this is needed, and it is unrelated to the other changes.

0cc4m · 2026-06-02T07:34:20Z

I thought you wanted Vulkan f16 support as well and added support to the CPU backend for that, but I see Vulkan is f32-only. That means these really are two separate changes in one PR. What's the reasoning?

Lamothe · 2026-06-02T09:00:43Z

codex found this shader fix:

-    uint a_i2 = i2 % p.ne02;
-    uint a_i3 = i3 % p.ne03;
+    uint a_i2 = i2 / (p.ne22 / p.ne02);
+    uint a_i3 = i3 / (p.ne23 / p.ne03);

Updated.

Lamothe · 2026-06-02T20:04:11Z

I thought you wanted Vulkan f16 support as well and added support to the CPU backend for that, but I see Vulkan is f32-only. That means these really are two separate changes in one PR. What's the reasoning?

Great question, and I appreciate the critique because it's a little unorthodox but I believe that it's justifiable. As previously mentioned, my goal is to enable model training on non-CUDA backends, so I have a goal/project in mind and these changes serve different stages of the same goal.

The training back-propagation phase uses F16 on the CPU backend because getting that working on Vulkan is another leap that I haven't had the time to make (this is not my day job). That CPU change alone allows training on the CPU backend, but it's not a practical option for me. Vulkan is the path of least resistance; however, the entire OUT_PROD op was missing on Vulkan, so I plugged that gap. I've only implemented F32 at this point; F16 can follow.

Now, I'm sure that you can argue that the CPU changes could go in separately but my end goal was always Vulkan. If you want me to split them I will, however, that's the "reasoning" that you requested.

jeffbolznv · 2026-06-02T20:11:32Z

I think it would be OK to take this as-is, I could fill in the gaps in a follow-on change.

0cc4m · 2026-06-03T06:33:25Z

If nobody from the CPU side objects we can do it this way.

ggerganov · 2026-06-03T07:10:52Z

+    const int64_t ir0 = dr*ith;
+    const int64_t ir1 = MIN(ir0 + dr, nr);
+
+    float * wdata = (float *) params->wdata + (ne0 + CACHE_LINE_SIZE_F32) * ith;


You need to reserve work buffer by handling the F16 case here:

llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c

Lines 2841 to 2846 in d545a2a

    case GGML_OP_OUT_PROD:
        {
            if (ggml_is_quantized(node->src[0]->type)) {
                cur = ggml_type_size(GGML_TYPE_F32) * node->src[0]->ne[0] * n_tasks;
            }
        } break;
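
One possible shape for that change, sketched as a standalone C function rather than as a patch to ggml-cpu.c. The enum, the `sketch_` names, and the single quantized type are hypothetical stand-ins so the example compiles on its own; the only thing taken from the snippet above is the size formula (one F32 row of src0 per task). The actual fix may look different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for ggml's type enum and helper; not the real API. */
enum sketch_type { SKETCH_TYPE_F32, SKETCH_TYPE_F16, SKETCH_TYPE_Q4_0 };

int sketch_is_quantized(enum sketch_type t) {
    return t == SKETCH_TYPE_Q4_0;
}

/*
 * Bytes of scratch to reserve for OUT_PROD: one F32 row of src0
 * (ne00 floats) per task whenever src0 has to be converted to F32
 * first, i.e. for quantized types and (the assumed addition) for F16.
 */
size_t sketch_out_prod_work_size(enum sketch_type src0_type, int64_t ne00, int n_tasks) {
    assert(ne00 >= 0 && n_tasks > 0);
    if (sketch_is_quantized(src0_type) || src0_type == SKETCH_TYPE_F16) {
        return sizeof(float) * (size_t) ne00 * (size_t) n_tasks;
    }
    return 0; /* F32 src0 needs no conversion buffer */
}
```

Without such a reservation, a kernel that indexes into `params->wdata` per thread, as in the hunk quoted above, would be writing into a buffer that was never sized for it.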

ggerganov · 2026-06-03T07:11:00Z

            return src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16;
        case GGML_OP_OUT_PROD:
-            return (src0->type == GGML_TYPE_F32 || (ggml_is_quantized(src0->type) && src0->ne[2] == src1->ne[2] && src0->ne[3] == src1->ne[3])) &&
+            return (src0->type == GGML_TYPE_F32 || (src0->type == GGML_TYPE_F16 && src0->ne[2] == src1->ne[2] && src0->ne[3] == src1->ne[3]) || (ggml_is_quantized(src0->type) && src0->ne[2] == src1->ne[2] && src0->ne[3] == src1->ne[3])) &&


This can be simplified
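
The comment does not say which simplification is meant. One candidate, shown here as a standalone C sketch with the tensor checks reduced to boolean parameters (an assumption made purely so the example is self-contained), is to factor out the shape check that the F16 and quantized branches share. The trailing part of the condition after `&&` is cut off in the diff above and is left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* The type/shape part of the condition as written in the diff. */
bool out_prod_types_original(bool is_f32, bool is_f16, bool is_quant,
                             bool same_ne2, bool same_ne3) {
    return is_f32 ||
           (is_f16   && same_ne2 && same_ne3) ||
           (is_quant && same_ne2 && same_ne3);
}

/* Candidate simplification: share the batch-shape check between F16 and quantized. */
bool out_prod_types_simplified(bool is_f32, bool is_f16, bool is_quant,
                               bool same_ne2, bool same_ne3) {
    return is_f32 || ((is_f16 || is_quant) && same_ne2 && same_ne3);
}

/* Exhaustively compare both forms over all 32 input combinations. */
int count_disagreements(void) {
    int n = 0;
    for (int mask = 0; mask < 32; mask++) {
        const bool f32 = (mask & 1) != 0, f16 = (mask & 2) != 0, q = (mask & 4) != 0;
        const bool s2  = (mask & 8) != 0, s3  = (mask & 16) != 0;
        if (out_prod_types_original(f32, f16, q, s2, s3) !=
            out_prod_types_simplified(f32, f16, q, s2, s3)) {
            n++;
        }
    }
    return n;
}
```

The exhaustive comparison is cheap and confirms the two forms are logically equivalent, so the refactor would change readability only.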

Lamothe requested review from a team and ggerganov as code owners June 1, 2026 21:55

github-actions Bot added the Vulkan and ggml labels Jun 1, 2026

Lamothe force-pushed the master branch from 158c9d0 to f75fcba Compare June 2, 2026 00:59

Lamothe force-pushed the master branch 2 times, most recently from 14534ea to 44fff6a Compare June 2, 2026 01:34

jeffbolznv reviewed Jun 2, 2026

Lamothe force-pushed the master branch 2 times, most recently from f4f191c to 38b8dfe Compare June 2, 2026 07:18

ggml: add f16 out_prod support for CPU and out_prod op for Vulkan

898299d

Lamothe force-pushed the master branch from 38b8dfe to 898299d Compare June 3, 2026 02:58

ggerganov reviewed Jun 3, 2026
