forked from ggml-org/llama.cpp
Cherry pick 12447 #9
Merged
Conversation
metal: use dequantize_q templates
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
It's useful to be able to have this from the library layer as it's a key parameter of the model (e.g. to figure out how much KV cache memory is needed).
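As a minimal sketch of how such an accessor can be used from the C API to estimate KV cache memory: the accessor names below follow current llama.h, but which parameter this particular commit exposes is an assumption, and the size formula is a rough f16 estimate only.

```cpp
// Hypothetical sketch: query model hyperparameters from the llama.cpp C API
// to estimate KV cache size. Which accessor this commit adds is an assumption.
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) return 1;
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (!model) return 1;

    const int n_ctx       = 4096; // desired context size
    const int n_layer     = llama_model_n_layer(model);
    const int n_head_kv   = llama_model_n_head_kv(model);
    const int n_embd_head = llama_model_n_embd(model) / llama_model_n_head(model);

    // K + V tensors, f16 (2 bytes) per element -- a rough estimate only
    const size_t kv_bytes = 2ull * n_ctx * n_layer * n_head_kv * n_embd_head * 2;
    printf("approx. KV cache size: %.1f MiB\n", kv_bytes / (1024.0 * 1024.0));

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```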
* Add example docs for granite vision
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
* vulkan: implement GGML_OP_ROPE_BACK
* vulkan: implement GGML_OP_RMS_NORM_BACK
* vulkan: implement GGML_OP_SILU_BACK
* vulkan: implement GGML_OP_SOFTMAX_BACK
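These `*_BACK` ops compute the gradients of the corresponding forward ops for backpropagation. As a reminder of what such a kernel must evaluate, the SILU backward pass applies the derivative of $\mathrm{silu}(x) = x\,\sigma(x)$ elementwise (multiplied by the incoming gradient):

$$\frac{d}{dx}\,\mathrm{silu}(x) = \sigma(x)\bigl(1 + x\,(1 - \sigma(x))\bigr), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$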
* ggml-cpu: Fix build with sve
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* ggml-cpu: Remove unused variable in sve q3_k vec dot
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Judd <foldl@boxvest.com>
Looks like a copy/paste bug from qx_needs_dequant.
Signed-off-by: kerthcet <kerthcet@gmail.com>
Currently self.byte_order is never used. Actually use it to byteswap read data, allowing big-endian files to be read on little-endian systems and vice versa. It is now possible to convert a little-endian model into a big-endian model and back on a little-endian system.
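The change itself lives in gguf-py (Python); as a language-neutral sketch of the underlying byteswap-on-read idea, here is how a little-endian u32 can be read portably regardless of host byte order:

```cpp
// Generic illustration of the byteswap-on-read technique (the actual change
// is in gguf-py, in Python). Reads a little-endian uint32 from a byte buffer
// and swaps it if the host happens to be big-endian.
#include <cstdint>
#include <cstring>

static bool host_is_little_endian() {
    const uint32_t probe = 1;
    uint8_t byte0;
    std::memcpy(&byte0, &probe, 1); // inspect the first byte in memory
    return byte0 == 1;
}

static uint32_t read_u32_le(const uint8_t * buf) {
    uint32_t v;
    std::memcpy(&v, buf, sizeof(v));
    if (!host_is_little_endian()) {
        // reverse the byte order to match the host
        v = (v >> 24) | ((v >> 8) & 0x0000FF00u)
          | ((v << 8) & 0x00FF0000u) | (v << 24);
    }
    return v;
}
```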
* Refactor gguf scripts to improve metadata handling
Added contents method to ReaderField class
Added endianness property to GGUFReader class
* update scripts
* fix import
* remove unused import
* attempt to work around flake and pyright errors
* second attempt
* give up, ignore type
* bump version
* apply newbyteorder fixes
* add struct for FFI bindgen
* Apply suggestions from code review
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Fix dependencies between ggml and backends
ggml backends link only to ggml-base and ggml links to all backends.
* Fix installation of ggml backends
Set up GNUInstallDirs before setting the installation directory of ggml backends
* vulkan: improve im2col performance
* faster dequant for old quants
* don't use unpack for iq4_nl
* vec2 unpack for q8
Remove unused header file that causes compilation failure on ARM platform with GCC 13.
…rg#12064)
* Added SVE Support for Q2_K Quantized Models
* Use 4-space indentation in the switch cases
* removed comment lines
* Remove the loop
Retain the curly braces for better understanding of code
* Remove the comment line added for q3_k_q8_k kernel
---------
Co-authored-by: vithulep <p.m.vithule1517@gmail.com>
…ns (ggml-org#11595)
* vulkan: implement specialized MMV kernels for IQ2 quantizations
* vulkan: add MMV kernels for IQ3 quants
* vulkan: Increase MMV batch size and unroll IQ LUT setup
* vulkan: fix init_iq_shmem for WG sizes larger than tables
* vulkan: common batch size for all I-quants
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
…2108)
* Added Phi-4-mini-instruct support
* Update regex per ngxson
* Change the vocab base to Xenova/gpt-4o
* fix conversion update script
* no need to check longrope
* minor style fix
* fix python style
---------
Co-authored-by: Nicholas Sparks <nisparks@microsoft.com>
* Upgrade init_tensor API to return a ggml_status
To prepare for an 'abort-free' ggml (ggml not aborting on OOMs but returning an OOM status), as agreed with Diego in the ggml repo, upgrade the init_tensor() and view_init() APIs to return a ggml_status.
* misc fixes
---------
Co-authored-by: slaren <slarengh@gmail.com>
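A sketch of what a backend's init_tensor hook looks like under this change. The buffer context and allocation logic below are hypothetical; `ggml_status` and its values come from ggml.h, but the exact hook signature in the PR may differ.

```cpp
// Illustrative sketch only: `my_buffer_context` and the failure condition are
// made up; the point is that init_tensor now reports failure through a
// ggml_status return value instead of aborting the process.
#include "ggml.h"
#include "ggml-backend-impl.h"

struct my_buffer_context {
    void * base;
    size_t size;
};

static enum ggml_status my_backend_buffer_init_tensor(
        ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
    my_buffer_context * ctx = (my_buffer_context *) buffer->context;

    if (tensor->data == nullptr) {
        // e.g. no room left in the buffer: report instead of aborting
        return GGML_STATUS_ALLOC_FAILED;
    }
    // ... backend-specific tensor initialization would go here ...
    (void) ctx;
    return GGML_STATUS_SUCCESS;
}
```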
* convert : fix Norway problem when parsing YAML
(the "Norway problem": YAML 1.1 parses unquoted scalars such as no, yes, or on as booleans, so a metadata value like `language: no` for Norwegian silently becomes false)
* Update gguf-py/gguf/metadata.py
* add newline at correct place
* fix typos and improve menu text clarity
* rename variable trimedValue to trimmedValue
* add updated index.html.gz
* rebuild
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
CUDA 12.8 added the option to specify stronger compression for binaries (nvcc's --compress-mode), so we now default to "size".
…versation mode (ggml-org#12131)
* Add --system-prompt parameter
* use user defined system prompt
* clarify
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* add warning
* clarify
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
…) (ggml-org#12132)
* Update outdated message
* wording
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Use jinja chat template system prompt by default
* faster conditional order
* remove nested ternary
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* vulkan: subgroup size test
* Vulkan: Add device architecture enum and logic to recognize AMD generations
* vulkan: use new architecture logic to specify subgroup size
* Initial vulkan subgroup size tuning for RDNA3
* vulkan: commonize RDNA subgroup tuning
* vulkan: override subgroup size if required_subgroup_size = 0
* vulkan: disable warp 32 for RDNA3
* vulkan: fine tuned RDNA1 subgroup sizes
* vulkan: adjusted subgroup size map
* vulkan: fixed RDNA2 subgroup map
---------
Co-authored-by: 0cc4m <picard12@live.de>
It's already found by FindVulkan.cmake in the parent CMakeLists
* Enable CUDA Graph on CTK < 12.x
The `cudaGraphExecUpdate` API changed in 12.x, so CUDA graph support had been disabled on older CUDA toolkits. This change enables CUDA graph support on CTK < 12.x by using the older API when building against it.
* Fix compilation errors with MUSA
* Disable CUDA Graph for MUSA
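A sketch of the kind of version guard this implies. The two `cudaGraphExecUpdate` signatures below are the documented CUDA runtime ones (pre-12.x vs. 12.x); the surrounding helper is illustrative, not the PR's actual code.

```cpp
// Sketch of a CUDART version guard around cudaGraphExecUpdate, whose
// signature changed in CUDA 12. Illustration only, not llama.cpp's code.
#include <cuda_runtime.h>

static cudaError_t update_graph_exec(cudaGraphExec_t exec, cudaGraph_t graph) {
#if CUDART_VERSION >= 12000
    cudaGraphExecUpdateResultInfo info;            // CUDA 12.x API
    return cudaGraphExecUpdate(exec, graph, &info);
#else
    cudaGraphNode_t           err_node = nullptr;  // pre-12.x API
    cudaGraphExecUpdateResult result;
    return cudaGraphExecUpdate(exec, graph, &err_node, &result);
#endif
}
```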
* ggml: Add op l2_norm
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* ggml: Add op rwkv_wkv7
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: Add support for RWKV7 and ARWKV7 models
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: fix inference with RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: add more (a)rwkv7 variants in size
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Apply code-format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* fix MUSA build
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: fix shape error with rwkv using llama-parallel
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
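For reference, L2 normalization of a row vector $x$ typically computes the following (the exact placement of $\varepsilon$ in ggml's implementation is an assumption here):

$$y_i = \frac{x_i}{\max\left(\sqrt{\sum_j x_j^2},\ \varepsilon\right)}$$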
…ion and driver issues (ggml-org#12434)
…g#12447)
* context : always use non-causal attention for encoder graphs
ggml-ci
* context : move the change to llama_context::encode()
ggml-ci
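To illustrate the distinction this cherry-pick fixes (a toy illustration, not llama.cpp's internal code): with a causal mask each token attends only to itself and earlier positions, while an encoder's non-causal mask lets every token attend to every other token.

```cpp
// Toy illustration of causal vs. non-causal attention masks.
// mask[i][j] == 0.0f  -> position i may attend to position j
// mask[i][j] == -inf  -> attention is blocked
#include <vector>
#include <limits>

std::vector<std::vector<float>> make_mask(int n_tokens, bool causal) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<std::vector<float>> mask(
        n_tokens, std::vector<float>(n_tokens, 0.0f));
    if (causal) {
        for (int i = 0; i < n_tokens; ++i)
            for (int j = i + 1; j < n_tokens; ++j)
                mask[i][j] = neg_inf; // block attention to future tokens
    }
    return mask; // encoder graphs use causal == false: nothing is blocked
}
```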
arthw pushed a commit that referenced this pull request on Dec 8, 2025
* Faster tensors (#8)
Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Fix .gitignore
* Add memory64 option and remove unneeded macros for setting threads to 1
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Labels
android
Apple Metal
build
devops
documentation
examples
ggml
Nvidia GPU
python
script
server
SYCL
testing
Vulkan
add: rm other backend CI.