Merge upstream by curtisgray · Pull Request #1 · curtisgray/wingman.cpp

curtisgray · 2024-06-01T20:00:35Z

No description provided.

* test-backend-ops: add support for specifying output format Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Address review comments Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Add build_commit and build_number in test_result Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Address review comments Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * refactor Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Get build commit from ggml_commit() Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Merge errors into test_operation_info && address review comments Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Address review comments Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Address review comments Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * remove visitor nonsense * remove visitor comment Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Address review comments Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> --------- Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>

* model : add hunyuan moe * tokenizer ok * fix tensor name * cgraph init * chat template * wip * almost working * skip embed, fix bos * cleanup * yarn scaling * cleanup * correct rope type * failed token fix * ntk alpha freq_base * tokenization working * cleanup and pr changes * vocab_size sanity check * ntk alpha generic * Update convert_hf_to_gguf.py * Apply suggestions from code review * fix regression * fix style --------- Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com>

* vulkan: allow FA split_k with smaller KV values * vulkan: spread split_k_reduce work across more threads k_num can get rather large. Use the whole workgroup to reduce the M/L values. Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).

* v1 * push more fixes * another fix * fix * more fixes * minor fix * more cleaning on python code * python fixes * changed precision for multipliers float 32->64 * fixes * another fix * fix * pre-norm -> norm * fix * Revert "fix" This reverts commit 243e4d1. * fix * small fix ffn_norm * try * mix instead of max * fix vocab size * conflict solve * fixed multipliers * falcon-h1 specefic vocab resolved * read arch from gguf.MODEL_ARCH * mamba_d_ssm added to d_inner find_hparam * remove unused functions from gguf_writer.py * override modify_tensors instead of get_tensors * fix conversion and d_inner * added some cb functions for debugging puposes * inp_out_ids moved outside of layers loop * mup_vec create as float64 * fix rope_theta * injected mup * clean ups * rm extra space * rm unused MAMBA_CHUNK_SIZE * rm unused key * add bos False * changed ROPE_TYPE * cleaning debugging stuff * cleaning debug quant * fix comment * some cleanups * some cleanups * Update src/llama-model-loader.cpp * more cleanups * moe cleanuips * d_ssm -> d_inner; * cleaning unused hparams * cleanup * more cleanups * more cleanups on python conversion; * minor cleanups * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * remove todo * added falcon-h1 * tensor not required * clean * remove unneeded attributes * more cleanups and fixed conversion * remove final_norm * flake8 fixes * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * flake8 fixes * Update src/llama-hparams.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-arch.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * added hashes * 
Update src/llama-arch.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update src/llama-vocab.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update the update file * Revert "update the update file" This reverts commit 082ab4a. * fix: address suggestions * fix: update convert_hf_to_gguf.py * Update gguf-py/gguf/constants.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model-loader.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * d_inner fixed * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * reshaping ssm_norm for 34B * removing generate_mup * remove duplicates metadata keys * rm comment * final comment * fix unused args * fix constants * fix bad merge * Update src/llama-model.cpp Co-authored-by: compilade <git@compilade.net> * falcon-h1: remove unused ssm_in_b and bad merge * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * falcon-h1: fix last comment * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * falcon-h1: revert add_add_bos(False) * falcon-h1: fix tied weights * falcon-h1: remove whitespace * falcon-h1: fix wrong size param * falcon-h1: fix whitespace issues --------- Co-authored-by: younesbelkada <younes.belkada@tii.ae> Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: compilade <git@compilade.net>

…4392) * compare-commits.sh: support both llama-bench and test-backend-ops Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Speed up the build by specifying -j 12 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Remove build_number from test-backend-ops db Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Apply suggestion from @JohannesGaessler Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Refine tool selection logic Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* docker: add cann build pipline * docker: add cann build pipline * docker: fix cann devops * cann : fix multi card hccl * Update ggml/src/ggml-cann/ggml-cann.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update ggml-cann.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Initial Q2_K Block Interleaving Implementation * Addressed review comments and clean up of the code * Post rebase fixes * Initial CI/CD fixes * Update declarations in arch-fallback.h * Changes for GEMV Q2_K in arch-fallback.h * Enable repacking only on AVX-512 machines * Update comments in repack.cpp * Address q2k comments --------- Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>

* support hunyuan_v1_dense Signed-off-by: stevenkuang <stevenkuang@tencent.com> * update hunyuan_moe to hunyuan_v1_moe Signed-off-by: stevenkuang <stevenkuang@tencent.com> * fix rope alpha assert and bos token Signed-off-by: stevenkuang <stevenkuang@tencent.com> * add blank line Signed-off-by: stevenkuang <stevenkuang@tencent.com> * Revert "update hunyuan_moe to hunyuan_v1_moe" This reverts commit aa973ca. * use hunyuan_dense instead of hunyuan_v1_dense Signed-off-by: stevenkuang <stevenkuang@tencent.com> * fix hunyuan_moe chat template Signed-off-by: stevenkuang <stevenkuang@tencent.com> * remove leftover code Signed-off-by: stevenkuang <stevenkuang@tencent.com> * update hunyuan dense chat template Signed-off-by: stevenkuang <stevenkuang@tencent.com> * fix hunyuan dense vocab and chat template Signed-off-by: stevenkuang <stevenkuang@tencent.com> --------- Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* vendor : update vendored copy of google/minja Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com> * Re-remove trailing whitespace Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com> * Remove another trailing whitespace Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com> --------- Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>

* vulkan: optimizations for direct convolution - Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill the GPU. The new size should be amenable to using coopmat, too. - Fix shmem bank conflicts. 16B padding should work with coopmat. - Some explicit loop unrolling. - Skip math/stores work for parts of the tile that are OOB. - Apply fastdiv opt. - Disable shuffles for NV. * Three tiles sizes for CONV_2D, and a heuristic to choose * reallow collectives for pre-Turing * make SHMEM_PAD a spec constant * fixes for intel perf - no shmem padding, placeholder shader core count * shader variants with/without unrolling * 0cc4m's fixes for AMD perf Co-authored-by: 0cc4m <picard12@live.de> --------- Co-authored-by: 0cc4m <picard12@live.de>

- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it interacts with split_k
- Allow larger/non-power of two split_k, and make the splits a multiple of 256
- Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used
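One reading of "make the splits a multiple of 256" is that the K dimension is divided across `split_k` workers and each worker's share is rounded up to the alignment. The sketch below illustrates only that rounding, under that assumption. The function name `split_k_chunk` and its parameters are hypothetical and are not taken from the Vulkan backend.

```python
def split_k_chunk(k: int, split_k: int, align: int = 256) -> int:
    """Return the per-split K size, rounded up to a multiple of `align`.

    Hypothetical helper: illustrates the rounding only, not the real backend code.
    """
    if k <= 0 or split_k <= 0 or align <= 0:
        raise ValueError("k, split_k and align must be positive")
    per_split = -(-k // split_k)          # ceiling division: share of K per split
    return -(-per_split // align) * align  # round the share up to the alignment


# A non-power-of-two split count still yields aligned chunks.
chunk = split_k_chunk(4096, 3)
```

With this rounding the last split may cover fewer than `chunk` elements, so a real kernel would have to clamp its range to `k`.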

…5040) This commit removes the right alignment of the `n_stream` value in the log message in the `llama_kv_cache_unified` constructor. The motivation for this change is to enhance the readability of the log message. Currently the output looks like this:

```console
llama_kv_cache_unified: size = 2048.00 MiB ( 4096 cells, 32 layers, 1/ 1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
```

Notice that the `n_stream` value is right aligned, which makes it a little harder to read. With the change in this commit the output will look like this:

```console
llama_kv_cache_unified: size = 2048.00 MiB ( 4096 cells, 32 layers, 1/1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
```
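The difference between the two outputs comes down to a field-width specifier in a printf-style format string. The sketch below reproduces the effect with Python's `%` formatting. The actual format string used in the `llama_kv_cache_unified` constructor is not shown in this commit message, so the width of 2 and the helper name `format_seqs` are assumptions inferred from the sample output.

```python
def format_seqs(n_seq: int, n_stream: int, right_align: bool) -> str:
    """Format the "<n_seq>/<n_stream> seqs" fragment of the log line.

    Hypothetical helper; the width of 2 is inferred from the sample output.
    """
    # "%2d" pads the value with spaces on the left to a width of 2.
    fmt = "%d/%2d seqs" if right_align else "%d/%d seqs"
    return fmt % (n_seq, n_stream)


before = format_seqs(1, 1, right_align=True)   # padded form
after = format_seqs(1, 1, right_align=False)   # compact form
```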

curtisgray self-assigned this Jun 1, 2024

ggerganov and others added 29 commits July 4, 2025 09:05

graph : prepare for 4D mask (#14515)

7b50f7c

ggml-ci

batch : add optional for sequential equal split (#14511)

67d1ef2

ggml-ci

metal : disable fast math in all quantize kernels (#14528)

ef797db

ggml-ci

eval-callback : check for empty input (#14539)

bac8bed

opencl: add GELU_ERF (#14476)

6681688

server : fix assistant prefilling when content is an array (#14360)

ddef995

vulkan: Handle updated FA dim2/3 definition (#14518)

a0374a6

* vulkan: Handle updated FA dim2/3 definition Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit. * handle null mask for gqa * allow gqa with dim3>1
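Packing a boolean and a small integer into one 32-bit word is a common way to save push constant space. The sketch below shows the general idea in Python. The bit layout (flag in bit 16, `n_head_log2` in the low 16 bits) and the function names are assumptions for illustration; the commit message does not state the layout the Vulkan shader actually uses.

```python
MASK_FLAG_BIT = 1 << 16      # assumed position of the "mask present" flag
N_HEAD_LOG2_MASK = 0xFFFF    # assumed: low 16 bits hold n_head_log2


def pack_mask_n_head_log2(has_mask: bool, n_head_log2: int) -> int:
    """Pack the flag and the integer into a single 32-bit word."""
    if not 0 <= n_head_log2 <= N_HEAD_LOG2_MASK:
        raise ValueError("n_head_log2 does not fit in the low 16 bits")
    return (MASK_FLAG_BIT if has_mask else 0) | n_head_log2


def unpack_mask_n_head_log2(word: int):
    """Recover (has_mask, n_head_log2) from the packed word."""
    return bool(word & MASK_FLAG_BIT), word & N_HEAD_LOG2_MASK


packed = pack_mask_n_head_log2(True, 8)
```

One 4-byte field replaces two, which is the kind of saving that keeps a push constant block under a 128B limit.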

vulkan: fix rms_norm+mul fusion (#14545)

e592be1

The fused operation was grabbing the epsilon value from the wrong place. Add an env var to disable fusion. Add some missing checks for supported shapes/types. Handle fused rms_norm+mul in check_results.

vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485)

6491d6e

Commit taken from remyoudompheng's PR #12260 Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>

CUDA: add bf16 and i32 to getrows (#14529)

b9c3eef

llama : remove ggml_cont where possible (#14568)

12f55c3

llama : fix incorrect minicpm3 v_states shape (#14571)

e1a7059

musa: fix build warnings (unused variable) (#14561)

68155c6

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

CUDA: add bilinear interpolation for upscale (#14563)

75c91de

cuda : fix rope with partial rotation and non-cont src (#14580)

4d0dcd4

* cuda : fix rope non-cont ggml-ci * cont : fix multi-rope + add test ggml-ci * sycl : try fix ggml-ci * cont : fix sycl + clean-up cuda ggml-ci

vulkan: increase timeout for CI (#14574)

53903ae

server: Add ability to mount server at prefix (#14544)

17a1f0d

* Add server_prefix * Correct server path env * Rename cli flag to --api-prefix * Change all to api_prefix

vulkan : fix rope with partial rotation and non-cont src (#14582)

b8eeb87

memory : fix broken batch splits for recurrent cache (#14575)

bb4f7a9

Splits producing more than one ubatch per batch for recurrent models were broken with #14512. This fixes it by moving the completeness check after the ubatch split loop.
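The ordering issue described above can be illustrated with a simplified splitter. This is a sketch under stated assumptions, not the llama.cpp implementation: `split_batch`, `n_ubatch`, and `max_ubatches` are hypothetical names, and `max_ubatches` exists only so that an incomplete split is possible in the example.

```python
def split_batch(tokens, n_ubatch, max_ubatches):
    """Split `tokens` into chunks of at most `n_ubatch` items.

    Returns the list of ubatches, or None if the batch was not fully consumed.
    """
    ubatches = []
    n_used = 0
    while n_used < len(tokens) and len(ubatches) < max_ubatches:
        ubatch = tokens[n_used:n_used + n_ubatch]
        ubatches.append(ubatch)
        n_used += len(ubatch)
        # Checking `n_used == len(tokens)` here, inside the loop, would
        # reject every batch that needs more than one ubatch.

    # Completeness check after the loop: all tokens must have been consumed.
    if n_used != len(tokens):
        return None
    return ubatches


result = split_batch(list(range(5)), n_ubatch=2, max_ubatches=8)
```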

model : add SmolLM3 (#14581)

0838286

* Init - first pass. * Model -> ModelBase. * fix errors in conversion. * Update the graph. * up. * up. * wip * cgraph ok * rm redundant code --------- Co-authored-by: Vaibhavs10 <vaibhavs10@gmail.com>

model : fix hunyuan moe chat template (#14584)

699f439

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

convert : fix smollm3 jinja template (#14586)

20b7bf8

llama : remove unintended whitespace (#14592)

1055545

model : add skt/A.X-4.0 model vocabulary (#14589)

ffd59e7

ggml : prevent integer overflow in gguf tensor size calculation (#14595)

26a48ad
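A standard way to guard a size calculation against integer overflow is to test each multiplication against the limit before performing it. The sketch below shows that technique in Python, where integers do not overflow, so a 64-bit `SIZE_MAX` is simulated. The helper names and the simplified size formula are assumptions; this is not the code from the commit.

```python
SIZE_MAX = 2**64 - 1  # assumed 64-bit size_t


def checked_mul(a: int, b: int, limit: int = SIZE_MAX):
    """Return a * b, or None if the product would exceed `limit`."""
    if a != 0 and b > limit // a:
        return None
    return a * b


def tensor_nbytes(ne, type_size: int, block_size: int = 1):
    """Compute the byte size for dimensions `ne`, or None on overflow or bad input."""
    n_elements = 1
    for dim in ne:
        if dim < 0:
            return None
        n_elements = checked_mul(n_elements, dim)
        if n_elements is None:
            return None
    if n_elements % block_size != 0:
        return None
    return checked_mul(n_elements // block_size, type_size)


size = tensor_nbytes([4096, 4096], type_size=2)
```

Returning `None` instead of a wrapped value lets a loader reject a malformed file rather than allocate a buffer that is too small.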

tc-mb and others added 29 commits July 31, 2025 17:22

mtmd : support MiniCPM-V 4.0 (#14983)

952a47f

* support minicpm-v 4 * add md * support MiniCPM-o 4.0 * add default location * temp rm MiniCPM-o 4.0 * fix code * fix "minicpmv_projector" default path

Vulkan: Fix minor debug mode issues (#14899)

e08a988

* vulkan: fix debug mode issues * vulkan: remove broken check_results GGML_OP_SET_ROWS support

llama : allow other bufts when overriding to CPU, add --no-repack option (#14990)

d6818d0

Fix params bug in diffusion example (#14993)

7845240

llama : add simple option to enable CPU for MoE weights (--cpu-moe) (#14992)

a06ed5f

quantize : skip tensor override when in fallback mode (#14995)

daf2dd7

graph : fix equal_seq() check (#14986)

ba42794

ggml-ci

opencl: add f16 for add, sub, mul, div (#14984)

1c872f7

CUDA: fix MMQ nwarps for AMD with warp_size==32 (#15014)

9c35706

server: enable token array inputs for OAI API (#15001)

f906275

model : support Qwen3-Embedding (#15023)

339bd02

vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (#15015)

ec0b188

llama-bench: rename DB table name from test to llama_bench (#15003)

3025b62

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

chat : fix multiple tool_calls on hermes-2-pro (#14962)

f738989

convert : fix Qwen3-Embedding pre-tokenizer hash (#15030)

711d5e6

ci : check that pre-tokenizer hashes are up-to-date (#15032)

2bf3fbf

* torch is not required for convert_hf_to_gguf_update * add --check-missing parameter * check that pre-tokenizer hashes are up-to-date

cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 (#15038)

15e92fd

* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 ggml-ci * cont : fix cont types ggml-ci * cont : adopt variable names and comment from the other branch

llama : enable LLAMA_SET_ROWS=1 by default (#14959)

a4569c4

ggml-ci

cuda: make im2col a little faster (#15025)

3303c19

CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (#15035)

03d4698

opencl: fix adreno compiler detection logic (#15029)

5c0eb5e

curtisgray closed this Aug 2, 2025

curtisgray wants to merge 3295 commits into curtisgray:llamacpp from ggml-org:master

20 participants
