ggml: add f16 out_prod support for CPU and out_prod op for Vulkan#23997
ggml: add f16 out_prod support for CPU and out_prod op for Vulkan#23997Lamothe wants to merge 1 commit into
Conversation
|
Hi @Lamothe, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
|
Just on a more personal note, this is huge deal for me, a game changer. With these changes I can now train AI models using my Strix Halo 128 GB. |
|
On the multiple backends thing, just let me know if you really need to do to that. The Vulkan changes are minimal (shader + pipeline wiring following existing patterns) and both backends are needed for the same feature (training ops). I'm not using the GPU for everything yet but as I mentioned, this is a huge step forward. I just 10x'd my hardware. |
|
I haven't reviewed in detail, but I get bunch of failures in OUT_PROD:
Since the op already exists, I think what you're doing is fine. But you might consider splitting into two PRs, one per op, to make it easier to review (and since one is currently failing). |
|
codex found this shader fix: |
|
im2col_back.comp looks like it's not hooked up? |
So, this is not from my change. The F32 OUT_PROD failures are with type_a=f32,type_b=f32 tests with broadcast (nr>1), which use the existing F32 implementation (ggml_compute_forward_out_prod_f32 at ops.cpp:4200) that I didn't touch. These are pre-existing failures in upstream master. I can't solve everything in one commit ;) |
|
There are no failures in master, the op isn't supported in vulkan and doesn't run. This is reporting a mismatch between the cpu and vulkan backends. The cpu backend is very likely correct, since it is the reference and is passing against other backends. |
14534ea to
44fff6a
Compare
Right you are! Apologies for not trusting in you the first time. The updated broadcast calculation in out_prod.comp has been corrected as per your suggestion and I believe that it's the issue here, right? The shader now uses division for broadcast index mapping, matching the CPU F32 reference. |
| ) | ||
|
|
||
| target_link_libraries(ggml-vulkan PRIVATE Vulkan::Vulkan) | ||
| target_link_libraries(ggml-vulkan PRIVATE Vulkan::Vulkan SPIRV-Headers::SPIRV-Headers) |
There was a problem hiding this comment.
I don't think this is needed and is unrelated to the other changes.
f4f191c to
38b8dfe
Compare
|
I thought you wanted Vulkan f16 support as well and added support to the CPU backend for that, but I see Vulkan is f32-only. That means these really are two separate changes in one PR. What's the reasoning? |
Updated. |
Great question, and I appreciate the critique because it's a little unorthodox but I believe that it's justifiable. As previously mentioned, my goal is to enable model training on non-CUDA backends, so I have a goal/project in mind and these changes serve different stages of the same goal. The training back-propagation phase is using F16 on the CPU backend because getting that working on Vulkan is another leap that I haven't had the time to make (this is not my day-job). Now, that CPU change alone allows for training on the CPU backend, but it's not a practical option for me. Vulkan is the path of least resistance however, the entire OUT_PROD op was missing on Vulkan, so I plugged it. I've only implemented F32 at this point; F16 can follow. Now, I'm sure that you can argue that the CPU changes could go in separately but my end goal was always Vulkan. If you want me to split them I will, however, that's the "reasoning" that you requested. |
|
I think it would be OK to take this as-is, I could fill in the gaps in a follow-on change. |
|
If nobody from the CPU side objects we can do it this way. |
| const int64_t ir0 = dr*ith; | ||
| const int64_t ir1 = MIN(ir0 + dr, nr); | ||
|
|
||
| float * wdata = (float *) params->wdata + (ne0 + CACHE_LINE_SIZE_F32) * ith; |
There was a problem hiding this comment.
You need to reserve work buffer by handling the F16 case here:
llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c
Lines 2841 to 2846 in d545a2a
| return src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16; | ||
| case GGML_OP_OUT_PROD: | ||
| return (src0->type == GGML_TYPE_F32 || (ggml_is_quantized(src0->type) && src0->ne[2] == src1->ne[2] && src0->ne[3] == src1->ne[3])) && | ||
| return (src0->type == GGML_TYPE_F32 || (src0->type == GGML_TYPE_F16 && src0->ne[2] == src1->ne[2] && src0->ne[3] == src1->ne[3]) || (ggml_is_quantized(src0->type) && src0->ne[2] == src1->ne[2] && src0->ne[3] == src1->ne[3])) && |
Overview
Add F16
OUT_PRODsupport for the CPU backend andOUT_PRODop support for the Vulkan backend. Also adds F16src1support forIM2COL_BACKon CPU. My goal is to enable training support on non-CUDA backends (CPU, Vulkan), in my case, the Strix Halo.CPU:
ggml_compute_forward_out_prod_f16_f32— handlesOUT_PRODwith F16 src0 and F32 src1/dst, matching the existing quantized out_prod patternOUT_PROD(ggml-cpu.c)IM2COL_BACKassertion to accept F16 src1 (convolution kernel)supports_opfor both ops accordinglyVulkan:
OUT_PRODpipeline (F32) with shader compilation, pipeline creation, dispatch, and backend registrationAdditional information
Implements two items from #14909 (missing ops across backends):
OUT_PRODon Vulkan (F32)OUT_PRODF16 on CPU (was aGGML_ABORTstub)Tested:
test-backend-ops -b CPU— 15900/15900 passed, including 16 new F16 OUT_PROD test cases. Builds clean withLLAMA_FATAL_WARNINGS=ON.Requirements