@@ -0,0 +1,59 @@
---
title: Unleash leading on-device AI performance with llama.cpp, SME2, and KleidiAI

minutes_to_complete: 40

who_is_this_for: This is an advanced topic for software developers, performance engineers, and AI practitioners who want to accelerate LLM inference on Arm CPUs with KleidiAI and SME2.

learning_objectives:
- Build the llama.cpp library with KleidiAI and SME2 support
- Profile the performance of LLMs running with llama-cli
- Learn how KleidiAI and SME2 accelerate LLM operators

prerequisites:
- Knowledge of KleidiAI and SME2
- A Linux or Android device with Arm SME2 support

author: Zenon Zhilong Xiu

### Tags
skilllevels: Advanced
subjects: ML
armips:
- Arm C1 CPU
- Arm SME2 unit
tools_software_languages:
- C++
- llama.cpp
operatingsystems:
- Android
- Linux



further_reading:
    - resource:
        title: "Part 1: Arm Scalable Matrix Extension Introduction"
        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction
        type: blog
    - resource:
        title: "Part 2: Arm Scalable Matrix Extension Instructions"
        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2
        type: blog
    - resource:
        title: "Part 4: Arm SME2 Introduction"
        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/part4-arm-sme2-introduction
        type: blog
    - resource:
        title: Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels
        link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/
        type: blog



### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # The weight controls the order of the pages. _index.md always has weight 1.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
@@ -0,0 +1,57 @@
---
title: Build llama.cpp with KleidiAI and SME2 enabled
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Build llama.cpp with KleidiAI and SME2 enabled
For convenience, llama.cpp is built as a statically linked executable. We use the AArch64 GCC cross-compilation toolchain, *aarch64-none-linux-gnu-*, to build the project. To support SME2, GCC version 14.2 or later is required. The toolchain can be downloaded [here](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads).
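
Optionally, you can sanity-check the toolchain before building. The hypothetical *check_toolchain.cpp* below is a minimal sketch (it is not part of llama.cpp): the static_assert fails at compile time if the GCC version is older than 14.2, and the *__ARM_FEATURE_SME2* macro from the Arm C Language Extensions indicates whether the chosen *-march* flags enable SME2 code generation. Whether this macro is defined can depend on the exact compiler version.

```c++
// check_toolchain.cpp (hypothetical file name) - compile with the cross toolchain
// and the same -march flags as the project, for example:
//   aarch64-none-linux-gnu-g++ -march=armv8.7-a+sve+i8mm+dotprod+sme2 check_toolchain.cpp -o check_toolchain
#include <cstdio>

// Fails at compile time if the GCC version is older than 14.2.
static_assert(__GNUC__ > 14 || (__GNUC__ == 14 && __GNUC_MINOR__ >= 2),
              "GCC 14.2 or later is required for SME2 support");

int main() {
#if defined(__ARM_FEATURE_SME2)
    std::printf("SME2 code generation is enabled by the -march flags\n");
#else
    std::printf("SME2 feature macro is not defined; double-check the -march flags\n");
#endif
    return 0;
}
```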

llama.cpp tag b7611 is used in this tutorial; newer versions should also work, but they have not been tested.

After downloading the llama.cpp source code from the [releases page](https://github.com/ggml-org/llama.cpp/releases/tag/b7610), create a new *build* directory under the llama.cpp root directory and change into it:

```bash
cd ~/llama.cpp
mkdir build && cd build
```
Next, configure the project:

```bash
cmake .. \
-DCMAKE_SYSTEM_NAME=Linux \
-DCMAKE_SYSTEM_PROCESSOR=arm \
-DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
-DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
-DLLAMA_NATIVE=OFF \
-DLLAMA_F16C=OFF \
-DLLAMA_GEMM_ARM=ON \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_EXE_LINKER_FLAGS="-static -g" \
-DGGML_OPENMP=OFF \
-DCMAKE_C_FLAGS=" -march=armv8.7-a+sve+i8mm+dotprod+sme2 -g" \
-DCMAKE_CXX_FLAGS=" -march=armv8.7-a+sve+i8mm+dotprod+sme2 -g" \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_CURL=OFF \
-DGGML_CPU_KLEIDIAI=ON
```
Set CMAKE_C_COMPILER and CMAKE_CXX_COMPILER to the path of your cross compiler. Make sure that the *-march* value in CMAKE_C_FLAGS and CMAKE_CXX_FLAGS includes *+sme2*.

The *-static* and *-g* options are specified to produce a statically linked executable with debug information, so that it can run on different Arm64 Linux and Android environments.

Next, build the project:

```bash
cd ~/llama.cpp/build
cmake --build ./ --config Release -j $(nproc)
```
After the build completes, you can find the application, *llama-cli*, in the *~/llama.cpp/build/bin/* directory.

To enable the SME2 microkernels, you must set the following environment variable before running the application:

```bash
GGML_KLEIDIAI_SME="1"
```
@@ -0,0 +1,19 @@
---
title: Overview
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction
Arm’s latest client CPUs, such as the Arm C1, include the Scalable Matrix Extension 2 (SME2). SME2 accelerates the matrix-heavy AI operations behind large language models (LLMs), media processing, speech recognition, computer vision, real-time applications, and multimodal applications.

llama.cpp provides extensive support for many LLMs, including Phi, Llama, DeepSeek, Gemma, and Qwen. It is designed for efficient CPU-based inference and enables on-device LLM execution, reducing latency and enhancing privacy.

llama.cpp integrates with Arm KleidiAI, a suite of optimized microkernels for Arm CPUs. KleidiAI includes SME2-optimized microkernels that provide additional performance benefits.

In this Learning Path, llama.cpp is used with the Llama-3.2-3B-Instruct-Q4_0.gguf model, which has 3 billion parameters.



@@ -0,0 +1,106 @@
---
title: Integration of SME2 optimized KleidiAI microkernels in llama.cpp
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Integration of SME2 optimized KleidiAI microkernels in llama.cpp
The KleidiAI library provides optimized matrix multiplication (matmul) microkernels tailored for hardware features such as SME, I8MM, and Dot Product (DotProd). In llama.cpp, this feature is enabled with the build option GGML_CPU_KLEIDIAI.

![Figure showing components of llama.cpp alt-text#center](images/llama_components.jpg "Components of llama.cpp")

KleidiAI is integrated as a trait of ggml-cpu in the llama.cpp CPU backend.
The integration source code is located in the following directory of llama.cpp:
```text
./ggml/src/ggml-cpu/kleidiai
```
KleidiAI matmul microkernels can be used for some types of GGML_OP_MUL_MAT operators. The table below lists the matmul operator input and output data types that can be accelerated by KleidiAI microkernels.

| LHS data type | RHS data type | Output data type |
|---------|----------------|----------------|
| GGML_TYPE_F32 | GGML_TYPE_Q4_0 | GGML_TYPE_F32 |
| GGML_TYPE_F32 | GGML_TYPE_Q8_0 | GGML_TYPE_F32 |
| GGML_TYPE_F32 | GGML_TYPE_F16 | GGML_TYPE_F32 |

Note:
LHS is short for Left-Hand Source (the left-hand input matrix).
RHS is short for Right-Hand Source (the right-hand input matrix).

Support for more operators and data types is being added to the KleidiAI microkernels.

The figure below shows how KleidiAI microkernels are used for matmul with a GGML_TYPE_Q4_0 or GGML_TYPE_Q8_0 RHS (weights).

![Figure showing how KleidiAI microkernels are used for quantization, packing and matrix multiplication in llama.cpp alt-text#center](images/kai_matmul_kernel.jpg "Quantization, packing and matrix multiply microkernels")

Packing of the GGML_TYPE_Q4_0 or GGML_TYPE_Q8_0 weights (RHS) only needs to be performed once, when llama.cpp loads the model and its weight tensor data, because the weights never change during inference. For performance, the original GGUF weights are repacked into a layout optimized for cache-friendly access by the DotProd, I8MM, and SME2 KleidiAI microkernels.
If multiple KleidiAI matmul microkernels (implemented with DotProd, I8MM, or SME2) could be used for acceleration, the KleidiAI trait selects one implementation in the following order of preference:

```text
SME2, I8MM, DotProd
```
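To make this selection concrete, the sketch below shows a priority-ordered fallback of the same shape. It is an illustration only, not the actual llama.cpp code: the type and function names are hypothetical, and only the GGML_KLEIDIAI_SME environment variable comes from this Learning Path.

```c++
// Illustrative sketch of priority-ordered microkernel selection
// (hypothetical names; not the actual llama.cpp implementation).
#include <cstdio>
#include <cstdlib>
#include <cstring>

enum class matmul_variant { sme2, i8mm, dotprod, generic };

struct cpu_features {
    bool has_sme2;
    bool has_i8mm;
    bool has_dotprod;
};

// GGML_KLEIDIAI_SME="1" is the runtime gate described in the build step of
// this Learning Path; everything else here is a hypothetical illustration.
static bool sme2_enabled_by_env() {
    const char * v = std::getenv("GGML_KLEIDIAI_SME");
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// Prefer SME2, then I8MM, then DotProd, falling back to a generic path.
static matmul_variant select_matmul_variant(const cpu_features & f) {
    if (f.has_sme2 && sme2_enabled_by_env()) return matmul_variant::sme2;
    if (f.has_i8mm)                          return matmul_variant::i8mm;
    if (f.has_dotprod)                       return matmul_variant::dotprod;
    return matmul_variant::generic;
}

int main() {
    cpu_features f = { true, true, true }; // pretend the CPU reports all features
    const char * names[] = { "SME2", "I8MM", "DotProd", "generic" };
    std::printf("selected microkernel family: %s\n",
                names[static_cast<int>(select_matmul_variant(f))]);
    return 0;
}
```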
Once the matmul microkernel is selected, its corresponding RHS packing and LHS quantization-and-packing microkernels are used.

When using the Llama-3.2-3B-Instruct-Q4_0.gguf model with SME2 microkernels, RHS packing is done by the *kai_run_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon* microkernel at model load time, as shown in the following function call stack:

```text
llama_model_load
llama_model::load_tensors
llama_model_loader::load_all_data
ggml_backend_tensor_set
ggml_backend_cpu_kleidiai_buffer_set_tensor
ggml::cpu::kleidiai::tensor_traits::repack
kai_run_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
```
The F32 activation input matrix (LHS) is dynamically quantized and packed by the *kai_run_lhs_quant_pack_qsi8d32p_f32_neon* microkernel on every invocation, because the activations change throughout the model run. This is done by the following function call stack:

```text
llama_context::decode
llama_context::process_ubatch
llama_context::graph_compute
ggml_backend_sched_compute_splits
ggml_backend_cpu_graph_compute
ggml_graph_compute //kicks off the compute threads
ggml_graph_compute_thread //the compute thread
ggml_compute_forward
ggml_cpu_extra_compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward_q4_0
kai_run_lhs_quant_pack_qsi8d32p_f32_neon
```
Once the LHS and RHS are ready, the KleidiAI matmul microkernel can be executed.

In this example, we use the Llama-3.2-3B-Instruct-Q4_0.gguf model and a 512-bit SME2 streaming vector length. At the prefill stage, the KleidiAI trait selects the SME2-optimized GEMM microkernel, *kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa*, which produces a dequantized F32 output matrix. It runs right after LHS quantization and packing, as shown in the call stack below:
```text
llama_context::decode
llama_context::process_ubatch
llama_context::graph_compute
ggml_backend_sched_compute_splits
ggml_backend_cpu_graph_compute
ggml_graph_compute //kicks off the compute threads
ggml_graph_compute_thread //the compute thread
ggml_compute_forward
ggml_cpu_extra_compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward_q4_0
kai_run_lhs_quant_pack_qsi8d32p_f32_neon
kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
```
At the LLM decode stage, the KleidiAI trait selects the SME2-optimized GEMV microkernel, *kai_run_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot*, which produces a dequantized F32 output vector. It runs right after LHS quantization and packing, as shown in the call stack below:

```text
llama_context::decode
llama_context::process_ubatch
llama_context::graph_compute
ggml_backend_sched_compute_splits
ggml_backend_cpu_graph_compute
ggml_graph_compute //kicks off the compute threads
ggml_graph_compute_thread //the compute thread
ggml_compute_forward
ggml_cpu_extra_compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward_q4_0
kai_run_lhs_quant_pack_qsi8d32p_f32_neon
kai_run_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
```
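The reason different microkernels are selected at the two stages is the shape of the LHS: prefill processes all prompt tokens at once, so the LHS has many rows and an outer-product GEMM microkernel is efficient, while decode generates one token at a time, so the LHS has a single row and a dot-product GEMV microkernel is used. The sketch below is a simplified illustration of this dispatch, with hypothetical names, not llama.cpp source.

```c++
// Simplified illustration of GEMM vs GEMV dispatch (hypothetical, not llama.cpp source).
#include <cstdio>

enum class matmul_kind { gemm, gemv };

// m is the number of LHS rows, i.e. the number of tokens processed at once.
static matmul_kind choose_matmul_kind(int m) {
    // Prefill: m > 1, outer-product (mopa) GEMM microkernel.
    // Decode:  m == 1, dot-product (sdot) GEMV microkernel.
    return m > 1 ? matmul_kind::gemm : matmul_kind::gemv;
}

int main() {
    std::printf("prefill (m = 128): %s\n", choose_matmul_kind(128) == matmul_kind::gemm ? "GEMM" : "GEMV");
    std::printf("decode  (m = 1)  : %s\n", choose_matmul_kind(1)   == matmul_kind::gemm ? "GEMM" : "GEMV");
    return 0;
}
```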
@@ -0,0 +1,86 @@
---
title: Run the Llama-3.2-3B-Instruct-Q4_0.gguf model with llama-cli
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Run the Llama-3.2-3B-Instruct-Q4_0.gguf model with llama-cli
Copy the built llama-cli executable and the Llama-3.2-3B-Instruct-Q4_0.gguf model file to your AArch64 Linux or Android target that supports SME2.
The model can be downloaded [here](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct-GGUF).
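
Optionally, you can confirm that the target CPU and kernel expose SME2 before running llama-cli. The program below is a small sketch that assumes a Linux or Android target with reasonably recent kernel headers (it is not part of llama.cpp); build it with the same AArch64 cross toolchain and run it on the target.

```c++
// check_sme2.cpp (hypothetical file name): report SME/SME2 support via the
// Linux auxiliary vector. Assumes kernel headers that define the arm64 hwcaps.
#include <cstdio>
#include <sys/auxv.h>
#include <asm/hwcap.h>

#ifndef AT_HWCAP2
#define AT_HWCAP2 26 // fallback for older headers
#endif

int main() {
    unsigned long hwcap2 = getauxval(AT_HWCAP2);
#if defined(HWCAP2_SME2)
    std::printf("SME : %s\n", (hwcap2 & HWCAP2_SME)  ? "yes" : "no");
    std::printf("SME2: %s\n", (hwcap2 & HWCAP2_SME2) ? "yes" : "no");
#elif defined(HWCAP2_SME)
    // Older headers may not define HWCAP2_SME2; report SME only.
    std::printf("SME : %s (HWCAP2_SME2 not defined in these headers)\n",
                (hwcap2 & HWCAP2_SME) ? "yes" : "no");
#else
    std::printf("These headers do not define the SME hwcaps.\n");
#endif
    return 0;
}
```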

The figure below shows the architecture of the Llama-3.2-3B model:
![Figure showing Llama-3.2-3B architecture alt-text#center](images/llama-3.2-3b_architecture.jpg "Architecture of Llama-3.2-3B")

For performance evaluation, we run the model bound to a single Arm C1-Pro core using CPU affinity.
To run the model with SME2 microkernels enabled, set the environment variable as part of the command:

```bash
env GGML_KLEIDIAI_SME="1" taskset 2 ./llama-cli -m ./ Llama-3.2-3B-Instruct-Q4_0.gguf -st -C 0x2 -Cb 0x2 -t 1 -p "input your prompt"
```
Where:
- *env GGML_KLEIDIAI_SME="1"* sets the environment variable that enables the SME2 microkernels
- *taskset 2* sets the CPU affinity mask to 0x2, binding the execution of llama-cli to the selected core (an Arm C1-Pro core in our case)
- *-C 0x2 -Cb 0x2* set the CPU affinity mask used for the execution of operators
- *-t 1* sets the number of threads to 1

For performance comparison, we also run the model with SME2 microkernels disabled by setting the environment variable:

```bash
GGML_KLEIDIAI_SME="0"
```
so that the I8MM and DotProd microkernels are used instead:

```bash
env GGML_KLEIDIAI_SME="0" taskset 2 ./llama-cli -m ./ Llama-3.2-3B-Instruct-Q4_0.gguf -st -C 0x2 -Cb 0x2 -t 1 -p "input your prompt"
```
We can profile the model execution with the approach introduced in [Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels](https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/).


The Streamline Timeline view and Annotate Markers in the figure below show that token generation speeds up significantly at both the prefill and decode stages. The PMU event counters show that many SME2 instructions, especially SME2 Integer Outer Product Accumulate instructions at the prefill stage and SME2 Integer Outer Product instructions at the decode stage, are used for acceleration.

![Figure showing Streamline Timeline view alt-text#center](images/streamline_timeline_combined.jpg "Combined Streamline Timeline view with and without SME2")

The Streamline Call Paths view below indicates a similar speedup; it also shows that the DotProd and I8MM KleidiAI microkernels are used instead when SME2 is not enabled.

![Figure showing Streamline Call Paths view alt-text#center](images/streamline_call_paths_combined.jpg "Combined Streamline Call Paths view with and without SME2")

To investigate which operators in the model graph are delegated to KleidiAI microkernels, we can add some code, as shown below, to *./ggml/src/ggml-cpu/kleidiai/kleidiai.cpp* to print the names of the operators that use KleidiAI microkernels. This is for debugging purposes only.

```c++
// Note: add #include <iostream> at the top of kleidiai.cpp for std::cout.
bool compute_forward(struct ggml_compute_params * params, struct ggml_tensor * dst) override {
    if (dst->op == GGML_OP_MUL_MAT) {
        if (dst->src[0]->type == GGML_TYPE_Q4_0) {
            // add log for kai microkernel
            std::cout << "kai matmul Q4_0 " << dst->name << std::endl;
            return compute_forward_q4_0(params, dst);
        } else if (dst->src[0]->type == GGML_TYPE_Q8_0) {
            // add log for kai microkernel
            std::cout << "kai matmul Q8_0 " << dst->name << std::endl;
            return compute_forward_q8_0(params, dst);
        } else if (dst->src[0]->type == GGML_TYPE_F16) {
            // add log for kai microkernel
            std::cout << "kai matmul fp16 " << dst->name << std::endl;
            return compute_forward_fp16(params, dst);
        }
    }
    // the rest of the function is unchanged
```
When running the model, logs like the following are printed:
```text
kai matmul Q4_0 Qcur-27
kai matmul Q4_0 Vcur-27
kai matmul Q4_0 Kcur-27
kai matmul Q4_0 attn_out-27
kai matmul Q4_0 ffn_gate-27
kai matmul Q4_0 ffn_up-27
kai matmul Q4_0 ffn_out-27
```
Taking one attention block of the Llama-3.2-3B-Instruct-Q4_0 model as an example, the operators accelerated by KleidiAI SME2-optimized microkernels are highlighted manually with blue boxes in the graph of the attention block shown below. How to generate the graph is beyond the scope of this Learning Path; please refer to external resources.

![Figure highlighting operators accelerated by KleidiAI SME2-optimized microkernels alt-text#center](images/one_attention_block.jpg "Operators accelerated by KleidiAI SME2-optimized microkernels in one attention block")

KleidiAI support in llama.cpp is still evolving; more operators will be accelerated by KleidiAI microkernels over time, unleashing more of the potential of SME2.

## Summary
With out-of-the-box KleidiAI and SME2 support in llama.cpp, we can get a significant performance uplift at both the prefill and decode stages, which enhances the experience of running LLMs locally on device.