[DO NOT MERGE] Diff with the master branch by ykhrustalev · Pull Request #4 · Liquid4All/benchmarks-llama.cpp

ykhrustalev · 2025-11-20T12:59:11Z

A control PR to show the divergence from the original llama.cpp repo

Implement conditional prefill computation skipping in llama-bench: disable computation for `--depth` prefill while keeping it enabled for prefill benchmarks.

- Default behavior (no flag): Depth prefill skips computation
- With `--enable-depth-computation`: Depth prefill performs full computation
- `-p` benchmarks: Always perform computation (not affected by this flag)
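The three cases above reduce to a single predicate. The following is a minimal sketch of that decision, not the code from this branch: `prefill_kind` and `should_compute_prefill` are hypothetical names introduced only for illustration, and the sketch assumes the flag is available as a plain boolean.

```cpp
// Hypothetical illustration only; names are not taken from llama-bench.cpp.
enum class prefill_kind {
    depth,     // prefill that only fills the context up to --depth
    benchmark, // prefill measured by a -p test
};

// Returns true when the prefill pass should perform the full computation.
// enable_depth_computation mirrors the --enable-depth-computation flag.
constexpr bool should_compute_prefill(prefill_kind kind, bool enable_depth_computation) {
    // -p benchmarks always compute; depth prefill computes only when the flag is set.
    return kind == prefill_kind::benchmark || enable_depth_computation;
}

// The three cases listed in the description, checked at compile time.
static_assert(!should_compute_prefill(prefill_kind::depth, false),
              "default: depth prefill skips computation");
static_assert(should_compute_prefill(prefill_kind::depth, true),
              "with the flag: depth prefill computes");
static_assert(should_compute_prefill(prefill_kind::benchmark, false),
              "-p benchmarks always compute");
```

Keeping the decision in one pure function would make the default (skip) explicit and keep `-p` runs independent of the flag, which matches the behavior described above.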

Added a new windows-cuda job that:

- Uses Windows 2022 runner with CUDA 12.4
- Installs CUDA toolkit and Ninja build system
- Builds llama-bench with CUDA support enabled
- Packages and uploads the benchmark tool artifacts
- Follows the same pattern as the release.yml windows-cuda job

Updated the release job to depend on the new windows-cuda job.

Co-authored-by: Claude <noreply@anthropic.com>

* Faster tensors (#8) Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Fix .gitignore
* Add memory64 option and remove unneeded macros for setting threads to 1

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* FlashAttention (#13)
* Add inplace softmax
* Move rms_norm to split row approach
* Update debug for supports_op
* clean up debug statements
* neg f16xf32xip builds and runs, haven't actually run a model that uses neg kernel yet though
* neg passes backend test
* unary operators pass ggml tests
* rms_norm double declaration bug atoned
* abides by editor-config
* removed vestigial files
* fixed autoconfig
* All operators (including xielu) working
* removed unnecessary checking if node->src[1] exists for unary operators
* responded and dealt with PR comments
* implemented REPL_Template support and removed bug in unary operators kernel
* formatted embed wgsl and ggml-webgpu.cpp
* Faster tensors (#8) Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Refactored pipelines and workgroup calculations (#10)
* refactored pipelines
* refactored workgroup calculation
* removed commented out block of prior maps
* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on flash attention
* Shader structure set up (many bugs still)
* debugging
* Working first test
* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32
* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling
* Start work on integrating pre-wgsl
* Separate structs/initial shader compilation library into separate files
* Work on compilation choices for flashattention
* Work on subgroup matrix/tile size portability
* subgroup size agnostic online softmax
* Cleanups, quantization types
* more cleanup
* fix wasm build
* Refactor flashattention to increase parallelism, use direct loads for KV in some cases
* Checkpoint
* formatting
* Update to account for default kv cache padding
* formatting shader
* Add workflow for ggml-ci webgpu
* Try passing absolute path to dawn in ggml-ci
* Avoid error on device destruction, add todos for proper cleanup
* Fix unused warning
* Forgot one parameter unused
* Move some flashattn computation to f32 for correctness


ykhrustalev added 3 commits November 20, 2025 07:33

Add llama-bench for android (#1)

908eef8

Add Vulkan build (#2)

0626fa0

Merge branch 'master' into benchmarks

5da91d5

ykhrustalev changed the title ~~[DO NOT MERGE] A control PR to show the divergence from the original llama.cpp repo~~ [DO NOT MERGE] Diff with the master branch Nov 20, 2025

ykhrustalev and others added 12 commits November 20, 2025 08:22

Add release button (#5)

1a2643d

Add release button, take 2 (#6)

bb7e07a

Add release button, take 3 (#8)

a956f73

Add macos build (#9)

2d985fa

Merge branch 'master' into benchmarks

8d04bae

Add win cpu variants (#11)

d1f2d38

Merge branch 'master' into benchmarks

4587087

Merge branch 'master' into benchmarks

583dbd2

Merge branch 'master' into benchmarks

f914051

Merge branch 'master' into benchmarks

c786e29

ykhrustalev added 7 commits December 8, 2025 16:15

Merge branch 'master' into benchmarks

d4b4c9f

Add ubuntu (#12)

b3eeeb9

Merge branch 'master' into benchmarks

989be8e

Merge branch 'master' into benchmarks

df27858

Merge branch 'master' into benchmarks

126f2fb

Merge branch 'master' into benchmarks

95eb654

Merge branch 'master' into benchmarks

592e014

ykhrustalev and others added 5 commits January 14, 2026 15:24

Merge remote-tracking branch 'origin/master' into benchmarks

7ce8fc0

curl off (#13)

3e01ecd

Merge branch 'master' into benchmarks

3a9012f

Merge branch 'master' into benchmarks

83d3594

Merge branch 'master' into benchmarks

a0c8753

ykhrustalev and others added 7 commits March 15, 2026 17:42

Merge remote-tracking branch 'origin/master' into benchmarks

bd52a98

# Conflicts:
#	tools/llama-bench/llama-bench.cpp

Merge branch 'master' into benchmarks

f647cd9

Merge branch 'master' into benchmarks

4689c7e

build llama-server

23dff81

Merge remote-tracking branch 'origin/master' into benchmarks

0f345f3

# Conflicts:
#	tools/llama-bench/llama-bench.cpp

Merge branch 'master' into benchmarks

38cc8e3

arm builder

7065e05

ykhrustalev force-pushed the benchmarks branch from 46da04d to 7065e05 on April 16, 2026 16:48

Revert "arm builder"

029457e

This reverts commit 7065e05.

ykhrustalev wants to merge 35 commits into master from benchmarks

2 participants