ggml: add f16 out_prod support for CPU and out_prod op for Vulkan #23997

Open: Lamothe wants to merge 1 commit into ggml-org:master from Lamothe:master

Conversation

@Lamothe Lamothe commented Jun 1, 2026

Overview

Add F16 OUT_PROD support for the CPU backend and OUT_PROD op support for the Vulkan backend. Also adds F16 src1 support for IM2COL_BACK on CPU. My goal is to enable training on non-CUDA backends (CPU, Vulkan); in my case, on a Strix Halo.

CPU:

  • Implement ggml_compute_forward_out_prod_f16_f32 — handles OUT_PROD with F16 src0 and F32 src1/dst, matching the existing quantized out_prod pattern
  • Enable F16 in the graph plan for OUT_PROD (ggml-cpu.c)
  • Relax IM2COL_BACK assertion to accept F16 src1 (convolution kernel)
  • Update supports_op for both ops accordingly
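
As a rough illustration of the first CPU item, the sketch below shows one way an F16-by-F32 outer product can be computed: convert each F16 row of src0 into a float scratch row, then accumulate into dst. It is a 2-D, single-threaded simplification written for this discussion, not the code in the PR. The names `half_to_float` and `out_prod_f16_f32` are hypothetical, and the layout (dst[i1*m + i0] += src0[k*m + i0] * src1[k*n + i1]) is an assumption based on how OUT_PROD is usually described in ggml.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Convert an IEEE 754 binary16 bit pattern to float.
// Handles zeros, subnormals, normals, infinities and NaNs.
float half_to_float(uint16_t h) {
    const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign; // signed zero
        } else {
            // Subnormal half: shift the mantissa until the implicit bit appears.
            uint32_t shift = 0;
            while ((mant & 0x400u) == 0) { mant <<= 1; ++shift; }
            mant &= 0x3FFu;
            bits = sign | ((113u - shift) << 23) | (mant << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13); // inf or NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13); // rebias 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical 2-D out_prod: a is k rows of m F16 values, b is k rows of
// n F32 values, and the result is n rows of m F32 values.
std::vector<float> out_prod_f16_f32(const std::vector<uint16_t> & a,
                                    const std::vector<float> & b,
                                    int m, int n, int k) {
    std::vector<float> dst(static_cast<size_t>(m) * n, 0.0f);
    std::vector<float> row(m); // scratch row, plays the role of wdata
    for (int i01 = 0; i01 < k; ++i01) {
        for (int i0 = 0; i0 < m; ++i0) {
            row[i0] = half_to_float(a[static_cast<size_t>(i01) * m + i0]);
        }
        for (int i1 = 0; i1 < n; ++i1) {
            const float s = b[static_cast<size_t>(i01) * n + i1];
            for (int i0 = 0; i0 < m; ++i0) {
                dst[static_cast<size_t>(i1) * m + i0] += row[i0] * s;
            }
        }
    }
    return dst;
}
```

Converting a whole row once and reusing it for every column of src1 is the reason a per-thread float scratch buffer is needed in this style of kernel.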

Vulkan:

  • Add OUT_PROD pipeline (F32) with shader compilation, pipeline creation, dispatch, and backend registration
  • Link SPIRV-Headers in CMakeLists.txt (required for shader compilation)

Additional information

Implements two items from #14909 (missing ops across backends):

  • OUT_PROD on Vulkan (F32)
  • OUT_PROD F16 on CPU (was a GGML_ABORT stub)

Tested: test-backend-ops -b CPU — 15900/15900 passed, including 16 new F16 OUT_PROD test cases. Builds clean with LLAMA_FATAL_WARNINGS=ON.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES — AI was used in an assistive capacity for code review and formatting corrections. All changes are fully understood and owned by the contributor.

@Lamothe Lamothe requested review from a team and ggerganov as code owners June 1, 2026 21:55
@github-actions github-actions Bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Jun 1, 2026
ggml-gh-bot Bot commented Jun 1, 2026

Hi @Lamothe, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple backend changes in one PR: When adding support for a new model or feature, focus on CPU support only in the initial PR. Add support for other backends like CUDA in follow-up PRs. If you have a good reason to modify multiple backends in one PR, please explain it.

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Lamothe commented Jun 1, 2026

Just on a more personal note, this is a huge deal for me, a game changer. With these changes I can now train AI models using my Strix Halo 128 GB.

Lamothe commented Jun 1, 2026

On the multiple backends thing, just let me know if you really need me to do that. The Vulkan changes are minimal (shader + pipeline wiring following existing patterns) and both backends are needed for the same feature (training ops). I'm not using the GPU for everything yet but, as I mentioned, this is a huge step forward. I just 10x'd my hardware.

jeffbolznv commented

I haven't reviewed in detail, but I get a bunch of failures in OUT_PROD:


  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,1],nr=[1,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,1],nr=[2,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,1],nr=[2,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.683476711 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,3],nr=[1,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,3],nr=[2,1],trans_b=0): OK
[OUT_PROD] ERR = 1.812003502 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[1,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,1],nr=[1,2],trans_b=0): OK
[OUT_PROD] ERR = 0.953885595 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,1],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.165509393 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,1],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.600564174 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,3],nr=[1,2],trans_b=0): FAIL
[OUT_PROD] ERR = 1.095811581 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,3],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.717469282 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=1,bs=[3,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,1],nr=[1,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,1],nr=[2,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,1],nr=[2,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.261481574 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,3],nr=[1,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,3],nr=[2,1],trans_b=0): OK
[OUT_PROD] ERR = 1.328894783 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[1,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,1],nr=[1,2],trans_b=0): OK
[OUT_PROD] ERR = 1.209753475 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,1],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.381416679 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,1],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.215906795 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,3],nr=[1,2],trans_b=0): FAIL
[OUT_PROD] ERR = 1.438983623 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,3],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.722599003 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=1,k=16,bs=[3,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,1],nr=[1,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,1],nr=[2,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,1],nr=[2,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.319761004 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,3],nr=[1,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,3],nr=[2,1],trans_b=0): OK
[OUT_PROD] ERR = 1.316149836 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[1,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,1],nr=[1,2],trans_b=0): OK
[OUT_PROD] ERR = 1.204602585 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,1],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.279120625 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,1],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.378535947 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,3],nr=[1,2],trans_b=0): FAIL
[OUT_PROD] ERR = 1.408648883 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,3],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.787642379 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=1,bs=[3,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,1],nr=[1,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,1],nr=[2,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,1],nr=[2,2],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.333422868 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,3],nr=[1,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,3],nr=[2,1],trans_b=0): OK
[OUT_PROD] ERR = 1.308338372 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,1],nr=[1,2],trans_b=0): OK
[OUT_PROD] ERR = 1.346531372 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,1],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.339575193 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,1],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,3],nr=[1,1],trans_b=0): OK
[OUT_PROD] ERR = 1.350936315 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,3],nr=[1,2],trans_b=0): FAIL
[OUT_PROD] ERR = 1.346864411 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,3],nr=[2,1],trans_b=0): FAIL
[OUT_PROD] ERR = 1.764744929 > 0.000500000   OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[3,3],nr=[2,2],trans_b=0): FAIL
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[1,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[8,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[16,1],nr=[1,1],trans_b=0): OK
  OUT_PROD(type_a=f32,type_b=f32,m=256,n=16,k=16,bs=[32,1],nr=[1,1],trans_b=0): OK

> On the multiple backends thing, just let me know if you really need me to do that.

Since the op already exists, I think what you're doing is fine. But you might consider splitting into two PRs, one per op, to make it easier to review (and since one is currently failing).

jeffbolznv commented

codex found this shader fix:

-    uint a_i2 = i2 % p.ne02;
-    uint a_i3 = i3 % p.ne03;
+    uint a_i2 = i2 / (p.ne22 / p.ne02);
+    uint a_i3 = i3 / (p.ne23 / p.ne03);
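
To see why the modulo form only fails on some test cases, here is a small sketch comparing the two index mappings along one batch dimension. It assumes, as the names suggest, that `ne22` is the dst extent and `ne02` the src0 extent along dim 2, and that the dst extent is a whole multiple of the src0 extent. `map_modulo` and `map_division` are hypothetical helper names, not functions from the shader.

```cpp
#include <cstdint>
#include <vector>

// For each dst batch index i, the src0 batch index chosen by `i % ne_src0`.
std::vector<uint32_t> map_modulo(uint32_t ne_dst, uint32_t ne_src0) {
    std::vector<uint32_t> out(ne_dst);
    for (uint32_t i = 0; i < ne_dst; ++i) {
        out[i] = i % ne_src0;
    }
    return out;
}

// For each dst batch index i, the src0 batch index chosen by
// `i / (ne_dst / ne_src0)`, i.e. each src0 batch covers a contiguous
// run of dst batches.
std::vector<uint32_t> map_division(uint32_t ne_dst, uint32_t ne_src0) {
    const uint32_t ratio = ne_dst / ne_src0; // assumed to divide evenly
    std::vector<uint32_t> out(ne_dst);
    for (uint32_t i = 0; i < ne_dst; ++i) {
        out[i] = i / ratio;
    }
    return out;
}
```

With a src0 extent of 3 and a dst extent of 6 (bs=3, nr=2 in the same dimension), modulo yields 0,1,2,0,1,2 while division yields 0,0,1,1,2,2. When the src0 extent is 1 or the repeat factor is 1, both mappings agree, which is consistent with the pattern of OK and FAIL lines in the log above.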

jeffbolznv commented

im2col_back.comp looks like it's not hooked up?

Lamothe commented Jun 2, 2026

> I haven't reviewed in detail, but I get a bunch of failures in OUT_PROD:
>
> [test-backend-ops log as posted in the comment above]
>
> > On the multiple backends thing, just let me know if you really need me to do that.
>
> Since the op already exists, I think what you're doing is fine. But you might consider splitting into two PRs, one per op, to make it easier to review (and since one is currently failing).

So, this is not from my change. The F32 OUT_PROD failures are with type_a=f32,type_b=f32 tests with broadcast (nr>1), which use the existing F32 implementation (ggml_compute_forward_out_prod_f32 at ops.cpp:4200) that I didn't touch. These are pre-existing failures in upstream master. I can't solve everything in one commit ;)

jeffbolznv commented

There are no failures in master, the op isn't supported in vulkan and doesn't run. This is reporting a mismatch between the cpu and vulkan backends. The cpu backend is very likely correct, since it is the reference and is passing against other backends.

@Lamothe Lamothe force-pushed the master branch 2 times, most recently from 14534ea to 44fff6a Compare June 2, 2026 01:34
Lamothe commented Jun 2, 2026

> There are no failures in master, the op isn't supported in vulkan and doesn't run. This is reporting a mismatch between the cpu and vulkan backends. The cpu backend is very likely correct, since it is the reference and is passing against other backends.

Right you are! Apologies for not trusting you the first time. The broadcast calculation in out_prod.comp has been corrected as per your suggestion, and I believe that was the issue here, right? The shader now uses division for broadcast index mapping, matching the CPU F32 reference.

Comment thread ggml/src/ggml-vulkan/CMakeLists.txt Outdated
)

-target_link_libraries(ggml-vulkan PRIVATE Vulkan::Vulkan)
+target_link_libraries(ggml-vulkan PRIVATE Vulkan::Vulkan SPIRV-Headers::SPIRV-Headers)
Contributor

I don't think this is needed and is unrelated to the other changes.

Comment thread ggml/src/ggml-vulkan/vulkan-shaders/out_prod.comp
Comment thread ggml/src/ggml-vulkan/vulkan-shaders/out_prod.comp Outdated
@Lamothe Lamothe force-pushed the master branch 2 times, most recently from f4f191c to 38b8dfe Compare June 2, 2026 07:18
0cc4m commented Jun 2, 2026

I thought you wanted Vulkan f16 support as well and added support to the CPU backend for that, but I see Vulkan is f32-only. That means these really are two separate changes in one PR. What's the reasoning?

Lamothe commented Jun 2, 2026

> codex found this shader fix:
>
> -    uint a_i2 = i2 % p.ne02;
> -    uint a_i3 = i3 % p.ne03;
> +    uint a_i2 = i2 / (p.ne22 / p.ne02);
> +    uint a_i3 = i3 / (p.ne23 / p.ne03);

Updated.

Lamothe commented Jun 2, 2026

> I thought you wanted Vulkan f16 support as well and added support to the CPU backend for that, but I see Vulkan is f32-only. That means these really are two separate changes in one PR. What's the reasoning?

Great question, and I appreciate the critique because it's a little unorthodox but I believe that it's justifiable. As previously mentioned, my goal is to enable model training on non-CUDA backends, so I have a goal/project in mind and these changes serve different stages of the same goal.

The training back-propagation phase uses F16 on the CPU backend because getting that working on Vulkan is another leap that I haven't had the time to make (this is not my day job). That CPU change alone allows for training on the CPU backend, but it's not a practical option for me. Vulkan is the path of least resistance; however, the entire OUT_PROD op was missing on Vulkan, so I filled that gap. I've only implemented F32 at this point; F16 can follow.

Now, I'm sure that you can argue that the CPU changes could go in separately but my end goal was always Vulkan. If you want me to split them I will, however, that's the "reasoning" that you requested.

jeffbolznv commented

I think it would be OK to take this as-is; I could fill in the gaps in a follow-on change.

0cc4m commented Jun 3, 2026

If nobody from the CPU side objects we can do it this way.

Comment thread ggml/src/ggml-cpu/ops.cpp
const int64_t ir0 = dr*ith;
const int64_t ir1 = MIN(ir0 + dr, nr);

float * wdata = (float *) params->wdata + (ne0 + CACHE_LINE_SIZE_F32) * ith;
Member

You need to reserve work buffer by handling the F16 case here:

case GGML_OP_OUT_PROD:
    {
        if (ggml_is_quantized(node->src[0]->type)) {
            cur = ggml_type_size(GGML_TYPE_F32) * node->src[0]->ne[0] * n_tasks;
        }
    } break;
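
One possible reading of this request, sketched as a standalone function: give F16 sources the same per-task float row that quantized sources already get. This is only an assumption about what the reviewer has in mind; `SrcType` and `out_prod_work_size` are hypothetical stand-ins for the ggml types and the graph-plan code, and the size formula simply mirrors the quantized branch quoted above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the src0 type categories that matter here.
enum class SrcType { F32, F16, Quantized };

// Bytes of scratch so that each of n_tasks threads can hold one row of
// ne00 elements converted to float. F32 sources need no conversion buffer.
size_t out_prod_work_size(SrcType src0_type, int64_t ne00, int n_tasks) {
    if (src0_type == SrcType::F32) {
        return 0;
    }
    // F16 and quantized sources are both converted row by row to float.
    return sizeof(float) * static_cast<size_t>(ne00) * static_cast<size_t>(n_tasks);
}
```

For example, an F16 src0 with 256 elements per row and 4 tasks would reserve 4096 bytes under this sketch, and an F32 src0 would reserve none.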

     return src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16;
 case GGML_OP_OUT_PROD:
-    return (src0->type == GGML_TYPE_F32 || (ggml_is_quantized(src0->type) && src0->ne[2] == src1->ne[2] && src0->ne[3] == src1->ne[3])) &&
+    return (src0->type == GGML_TYPE_F32 || (src0->type == GGML_TYPE_F16 && src0->ne[2] == src1->ne[2] && src0->ne[3] == src1->ne[3]) || (ggml_is_quantized(src0->type) && src0->ne[2] == src1->ne[2] && src0->ne[3] == src1->ne[3])) &&
Member

This can be simplified
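
The reviewer does not say how. One option, sketched here under assumptions, is to factor out the batch-dimension check that the F16 and quantized branches share. `SrcType`, `Shape` and `out_prod_src0_supported` are hypothetical stand-ins, and the sketch covers only the parenthesized part of the condition shown in the diff; whatever follows the trailing `&&` is not visible in this thread and is left out.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the ggml type check and tensor extents.
enum class SrcType { F32, F16, Quantized };

struct Shape {
    int64_t ne[4];
};

// F32 is accepted unconditionally; F16 and quantized sources are accepted
// only when the two batch dimensions of src0 and src1 match exactly.
bool out_prod_src0_supported(SrcType src0_type, const Shape & src0, const Shape & src1) {
    if (src0_type == SrcType::F32) {
        return true;
    }
    const bool same_batch = src0.ne[2] == src1.ne[2] && src0.ne[3] == src1.ne[3];
    return same_batch;
}
```

This is logically equivalent to the three-way disjunction in the added line, because the F16 and quantized cases carry identical shape conditions.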
