@@ -0,0 +1,59 @@
---
title: Unleash leading on-device AI performance with llama.cpp, SME2, and KleidiAI

minutes_to_complete: 40

who_is_this_for: This is an advanced topic for software developers, performance engineers, and AI practitioners who want to accelerate LLM inference on Arm CPUs with KleidiAI and SME2.

learning_objectives:
- Build the llama.cpp library with KleidiAI and SME2 support
- Profile the performance of LLMs running with llama-cli
- Learn how KleidiAI and SME2 accelerate LLM operators

prerequisites:
- Knowledge of KleidiAI and SME2
- A Linux or Android device with Arm SME2 support

author: Zenon Zhilong Xiu

### Tags
skilllevels: Advanced
subjects: ML
armips:
- Arm C1 CPU
- Arm SME2 unit
tools_software_languages:
- C++
- llama.cpp
operatingsystems:
- Android
- Linux



further_reading:
    - resource:
        title: "Part 1: Arm Scalable Matrix Extension Introduction"
        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction
        type: blog
    - resource:
        title: "Part 2: Arm Scalable Matrix Extension Instructions"
        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2
        type: blog
    - resource:
        title: "Part 4: Arm SME2 Introduction"
        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/part4-arm-sme2-introduction
        type: blog
    - resource:
        title: Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels
        link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/
        type: blog



### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # The weight controls the order of the pages. _index.md always has weight 1.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
@@ -0,0 +1,57 @@
---
title: Build llama.cpp with KleidiAI and SME2 enabled
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Build llama.cpp with KleidiAI and SME2 enabled
For convenience, llama.cpp is built as a statically linked executable. We use the AArch64 GCC cross-compilation toolchain, *aarch64-none-linux-gnu-*, to build the project. To support SME2, GCC version 14.2 or later is required. The toolchain can be downloaded [here](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads).
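
Optionally, you can sanity-check the toolchain before building. The hypothetical *check_toolchain.cpp* below is a minimal sketch (it is not part of llama.cpp): the static_assert fails at compile time if the GCC version is older than 14.2, and the *__ARM_FEATURE_SME2* macro from the Arm C Language Extensions indicates whether the chosen *-march* flags enable SME2 code generation. Whether this macro is defined can depend on the exact compiler version.

```c++
// check_toolchain.cpp (hypothetical file name) - compile with the cross toolchain
// and the same -march flags as the project, for example:
//   aarch64-none-linux-gnu-g++ -march=armv8.7-a+sve+i8mm+dotprod+sme2 check_toolchain.cpp -o check_toolchain
#include <cstdio>

// Fails at compile time if the GCC version is older than 14.2.
static_assert(__GNUC__ > 14 || (__GNUC__ == 14 && __GNUC_MINOR__ >= 2),
              "GCC 14.2 or later is required for SME2 support");

int main() {
#if defined(__ARM_FEATURE_SME2)
    std::printf("SME2 code generation is enabled by the -march flags\n");
#else
    std::printf("SME2 feature macro is not defined; double-check the -march flags\n");
#endif
    return 0;
}
```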

llama.cpp tag b7611 is used in this tutorial; newer versions should also work, but they have not been tested.

After downloading the llama.cpp source code from the [releases page](https://github.com/ggml-org/llama.cpp/releases/tag/b7610), create a new *build* directory under the llama.cpp root directory and change into it:

```bash
cd ~/llama.cpp
mkdir build && cd build
```
Next, configure the project:

```bash
cmake .. \
-DCMAKE_SYSTEM_NAME=Linux \
-DCMAKE_SYSTEM_PROCESSOR=arm \
-DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
-DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
-DLLAMA_NATIVE=OFF \
-DLLAMA_F16C=OFF \
-DLLAMA_GEMM_ARM=ON \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_EXE_LINKER_FLAGS="-static -g" \
-DGGML_OPENMP=OFF \
-DCMAKE_C_FLAGS=" -march=armv8.7-a+sve+i8mm+dotprod+sme2 -g" \
-DCMAKE_CXX_FLAGS=" -march=armv8.7-a+sve+i8mm+dotprod+sme2 -g" \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_CURL=OFF \
-DGGML_CPU_KLEIDIAI=ON
```
Set CMAKE_C_COMPILER and CMAKE_CXX_COMPILER to the path of your cross compiler. Make sure that the *-march* value in CMAKE_C_FLAGS and CMAKE_CXX_FLAGS includes *+sme2*.

The *-static* and *-g* options are specified to produce a statically linked executable with debug information, so that it can run on different Arm64 Linux and Android environments.

Next, build the project:

```bash
cd ~/llama.cpp/build
cmake --build ./ --config Release -j $(nproc)
```
After the build completes, you can find the application, *llama-cli*, in the *~/llama.cpp/build/bin/* directory.

To enable the SME2 microkernels, you must set the following environment variable before running the application:

```bash
GGML_KLEIDIAI_SME="1"
```
@@ -0,0 +1,19 @@
---
title: Overview
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction
Arm’s latest client CPUs, such as the Arm C1, include the Scalable Matrix Extension 2 (SME2). SME2 accelerates the matrix-heavy AI operations behind large language models (LLMs), media processing, speech recognition, computer vision, real-time applications, and multimodal applications.

llama.cpp provides extensive support for many LLMs, including Phi, Llama, DeepSeek, Gemma, and Qwen. It is designed for efficient CPU-based inference and enables on-device LLM execution, reducing latency and enhancing privacy.

llama.cpp integrates with Arm KleidiAI, a suite of optimized microkernels for Arm CPUs. KleidiAI includes SME2-optimized microkernels that provide additional performance benefits.

In this Learning Path, llama.cpp is used with the Llama-3.2-3B-Instruct-Q4_0.gguf model, which has 3 billion parameters.



@@ -0,0 +1,106 @@
---
title: Integration of SME2 optimized KleidiAI microkernels in llama.cpp
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Integration of SME2 optimized KleidiAI microkernels in llama.cpp
The KleidiAI library provides optimized matrix multiplication (matmul) microkernels tailored for hardware features such as SME, I8MM, and Dot Product (DotProd). In llama.cpp, this feature is enabled with the build option GGML_CPU_KLEIDIAI.

![Figure showing components of llama.cpp alt-text#center](images/llama_components.jpg "Components of llama.cpp")

KleidiAI is integrated as a trait of ggml-cpu in the llama.cpp CPU backend.
The integration source code is located in the following directory of llama.cpp:
```text
./ggml/src/ggml-cpu/kleidiai
```
KleidiAI matmul microkernels can be used for some types of GGML_OP_MUL_MAT operators. The table below lists the matmul operator input and output data types that can be accelerated by KleidiAI microkernels.

| LHS data type | RHS data type | Output data type |
|---------|----------------|----------------|
| GGML_TYPE_F32 | GGML_TYPE_Q4_0 | GGML_TYPE_F32 |
| GGML_TYPE_F32 | GGML_TYPE_Q8_0 | GGML_TYPE_F32 |
| GGML_TYPE_F32 | GGML_TYPE_F16 | GGML_TYPE_F32 |

Note:
LHS is short for Left-Hand Source (the left-hand input matrix).
RHS is short for Right-Hand Source (the right-hand input matrix).

Support for more operators and data types is being added to the KleidiAI microkernels.

The figure below shows how KleidiAI microkernels are used for matmul with a GGML_TYPE_Q4_0 or GGML_TYPE_Q8_0 RHS (weights).

![Figure showing how KleidiAI microkernels are used for quantization, packing and matrix multiplication in llama.cpp alt-text#center](images/kai_matmul_kernel.jpg "Quantization, packing and matrix multiply microkernels")

Packing of the GGML_TYPE_Q4_0 or GGML_TYPE_Q8_0 weights (RHS) only needs to be performed once, when llama.cpp loads the model and its weight tensor data, because the weights never change during inference. For performance, the original GGUF weights are repacked into a layout optimized for cache-friendly access by the DotProd, I8MM, and SME2 KleidiAI microkernels.
If multiple KleidiAI matmul microkernels (implemented with DotProd, I8MM, or SME2) could be used for acceleration, the KleidiAI trait selects one implementation in the following order of preference:

```text
SME2, I8MM, DotProd
```
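To make this selection concrete, the sketch below shows a priority-ordered fallback of the same shape. It is an illustration only, not the actual llama.cpp code: the type and function names are hypothetical, and only the GGML_KLEIDIAI_SME environment variable comes from this Learning Path.

```c++
// Illustrative sketch of priority-ordered microkernel selection
// (hypothetical names; not the actual llama.cpp implementation).
#include <cstdio>
#include <cstdlib>
#include <cstring>

enum class matmul_variant { sme2, i8mm, dotprod, generic };

struct cpu_features {
    bool has_sme2;
    bool has_i8mm;
    bool has_dotprod;
};

// GGML_KLEIDIAI_SME="1" is the runtime gate described in the build step of
// this Learning Path; everything else here is a hypothetical illustration.
static bool sme2_enabled_by_env() {
    const char * v = std::getenv("GGML_KLEIDIAI_SME");
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// Prefer SME2, then I8MM, then DotProd, falling back to a generic path.
static matmul_variant select_matmul_variant(const cpu_features & f) {
    if (f.has_sme2 && sme2_enabled_by_env()) return matmul_variant::sme2;
    if (f.has_i8mm)                          return matmul_variant::i8mm;
    if (f.has_dotprod)                       return matmul_variant::dotprod;
    return matmul_variant::generic;
}

int main() {
    cpu_features f = { true, true, true }; // pretend the CPU reports all features
    const char * names[] = { "SME2", "I8MM", "DotProd", "generic" };
    std::printf("selected microkernel family: %s\n",
                names[static_cast<int>(select_matmul_variant(f))]);
    return 0;
}
```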
Once the matmul microkernel is selected, its corresponding RHS packing and LHS quantization-and-packing microkernels are used.

When using the Llama-3.2-3B-Instruct-Q4_0.gguf model with SME2 microkernels, RHS packing is done by the *kai_run_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon* microkernel at model load time, as shown in the following function call stack:

```text
llama_model_load
llama_model::load_tensors
llama_model_loader::load_all_data
ggml_backend_tensor_set
ggml_backend_cpu_kleidiai_buffer_set_tensor
ggml::cpu::kleidiai::tensor_traits::repack
kai_run_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
```
The F32 activation input matrix (LHS) is dynamically quantized and packed by the *kai_run_lhs_quant_pack_qsi8d32p_f32_neon* microkernel on every invocation, because the activations change throughout the model run. This is done by the following function call stack:

```text
llama_context::decode
llama_context::process_ubatch
llama_context::graph_compute
ggml_backend_sched_compute_splits
ggml_backend_cpu_graph_compute
ggml_graph_compute //kicks off the compute threads
ggml_graph_compute_thread //the compute thread
ggml_compute_forward
ggml_cpu_extra_compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward_q4_0
kai_run_lhs_quant_pack_qsi8d32p_f32_neon
```
Once the LHS and RHS are ready, the KleidiAI matmul microkernel can be executed.

In this example, we use the Llama-3.2-3B-Instruct-Q4_0.gguf model and a 512-bit SME2 streaming vector length. At the prefill stage, the KleidiAI trait selects the SME2-optimized GEMM microkernel, *kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa*, which produces a dequantized F32 output matrix. It runs right after LHS quantization and packing, as shown in the call stack below:
```text
llama_context::decode
llama_context::process_ubatch
llama_context::graph_compute
ggml_backend_sched_compute_splits
ggml_backend_cpu_graph_compute
ggml_graph_compute //kicks off the compute threads
ggml_graph_compute_thread //the compute thread
ggml_compute_forward
ggml_cpu_extra_compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward_q4_0
kai_run_lhs_quant_pack_qsi8d32p_f32_neon
kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
```
At the LLM decode stage, the KleidiAI trait selects the SME2-optimized GEMV microkernel, *kai_run_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot*, which produces a dequantized F32 output vector. It runs right after LHS quantization and packing, as shown in the call stack below:

```text
llama_context::decode
llama_context::process_ubatch
llama_context::graph_compute
ggml_backend_sched_compute_splits
ggml_backend_cpu_graph_compute
ggml_graph_compute //kicks off the compute threads
ggml_graph_compute_thread //the compute thread
ggml_compute_forward
ggml_cpu_extra_compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward_q4_0
kai_run_lhs_quant_pack_qsi8d32p_f32_neon
kai_run_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
```
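The reason different microkernels are selected at the two stages is the shape of the LHS: prefill processes all prompt tokens at once, so the LHS has many rows and an outer-product GEMM microkernel is efficient, while decode generates one token at a time, so the LHS has a single row and a dot-product GEMV microkernel is used. The sketch below is a simplified illustration of this dispatch, with hypothetical names, not llama.cpp source.

```c++
// Simplified illustration of GEMM vs GEMV dispatch (hypothetical, not llama.cpp source).
#include <cstdio>

enum class matmul_kind { gemm, gemv };

// m is the number of LHS rows, i.e. the number of tokens processed at once.
static matmul_kind choose_matmul_kind(int m) {
    // Prefill: m > 1, outer-product (mopa) GEMM microkernel.
    // Decode:  m == 1, dot-product (sdot) GEMV microkernel.
    return m > 1 ? matmul_kind::gemm : matmul_kind::gemv;
}

int main() {
    std::printf("prefill (m = 128): %s\n", choose_matmul_kind(128) == matmul_kind::gemm ? "GEMM" : "GEMV");
    std::printf("decode  (m = 1)  : %s\n", choose_matmul_kind(1)   == matmul_kind::gemm ? "GEMM" : "GEMV");
    return 0;
}
```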
@@ -0,0 +1,86 @@
---
title: Run the Llama-3.2-3B-Instruct-Q4_0.gguf model with llama-cli
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Run the Llama-3.2-3B-Instruct-Q4_0.gguf model with llama-cli
Copy the built llama-cli executable and the Llama-3.2-3B-Instruct-Q4_0.gguf model file to your AArch64 Linux or Android target that supports SME2.
The model can be downloaded [here](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct-GGUF).
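
Optionally, you can confirm that the target CPU and kernel expose SME2 before running llama-cli. The program below is a small sketch that assumes a Linux or Android target with reasonably recent kernel headers (it is not part of llama.cpp); build it with the same AArch64 cross toolchain and run it on the target.

```c++
// check_sme2.cpp (hypothetical file name): report SME/SME2 support via the
// Linux auxiliary vector. Assumes kernel headers that define the arm64 hwcaps.
#include <cstdio>
#include <sys/auxv.h>
#include <asm/hwcap.h>

#ifndef AT_HWCAP2
#define AT_HWCAP2 26 // fallback for older headers
#endif

int main() {
    unsigned long hwcap2 = getauxval(AT_HWCAP2);
#if defined(HWCAP2_SME2)
    std::printf("SME : %s\n", (hwcap2 & HWCAP2_SME)  ? "yes" : "no");
    std::printf("SME2: %s\n", (hwcap2 & HWCAP2_SME2) ? "yes" : "no");
#elif defined(HWCAP2_SME)
    // Older headers may not define HWCAP2_SME2; report SME only.
    std::printf("SME : %s (HWCAP2_SME2 not defined in these headers)\n",
                (hwcap2 & HWCAP2_SME) ? "yes" : "no");
#else
    std::printf("These headers do not define the SME hwcaps.\n");
#endif
    return 0;
}
```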

The figure below shows the architecture of the Llama-3.2-3B model:
![Figure showing Llama-3.2-3B architecture alt-text#center](images/llama-3.2-3b_architecture.jpg "Architecture of Llama-3.2-3B")

For performance evaluation, we run the model bound to a single Arm C1-Pro core using CPU affinity.
To run the model with SME2 microkernels enabled, set the environment variable as part of the command:

```bash
env GGML_KLEIDIAI_SME="1" taskset 2 ./llama-cli -m ./ Llama-3.2-3B-Instruct-Q4_0.gguf -st -C 0x2 -Cb 0x2 -t 1 -p "input your prompt"
```
Where:
- *env GGML_KLEIDIAI_SME="1"* sets the environment variable that enables the SME2 microkernels
- *taskset 2* sets the CPU affinity mask to 0x2, binding the execution of llama-cli to the selected core (an Arm C1-Pro core in our case)
- *-C 0x2 -Cb 0x2* set the CPU affinity mask used for the execution of operators
- *-t 1* sets the number of threads to 1

For performance comparison, we also run the model with SME2 microkernels disabled by setting the environment variable:

```bash
GGML_KLEIDIAI_SME="0"
```
so that the I8MM and DotProd microkernels are used instead:

```bash
env GGML_KLEIDIAI_SME="0" taskset 2 ./llama-cli -m ./ Llama-3.2-3B-Instruct-Q4_0.gguf -st -C 0x2 -Cb 0x2 -t 1 -p "input your prompt"
```
We can profile the model execution with the approach introduced in [Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels](https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/).


The Streamline Timeline view and Annotate Markers in the figure below show that token generation speeds up significantly at both the prefill and decode stages. The PMU event counters show that many SME2 instructions, especially SME2 Integer Outer Product Accumulate instructions at the prefill stage and SME2 Integer Outer Product instructions at the decode stage, are used for acceleration.

![Figure showing Streamline Timeline view alt-text#center](images/streamline_timeline_combined.jpg "Combined Streamline Timeline view with and without SME2")

The Streamline Call Paths view below indicates a similar speedup; it also shows that the DotProd and I8MM KleidiAI microkernels are used instead when SME2 is not enabled.

![Figure showing Streamline Call Paths view alt-text#center](images/streamline_call_paths_combined.jpg "Combined Streamline Call Paths view with and without SME2")

To investigate which operators in the model graph are delegated to KleidiAI microkernels, we can add some code, as shown below, to *./ggml/src/ggml-cpu/kleidiai/kleidiai.cpp* to print the names of the operators that use KleidiAI microkernels. This is for debugging purposes only.

```c++
// Note: add #include <iostream> at the top of kleidiai.cpp for std::cout.
bool compute_forward(struct ggml_compute_params * params, struct ggml_tensor * dst) override {
    if (dst->op == GGML_OP_MUL_MAT) {
        if (dst->src[0]->type == GGML_TYPE_Q4_0) {
            // add log for kai microkernel
            std::cout << "kai matmul Q4_0 " << dst->name << std::endl;
            return compute_forward_q4_0(params, dst);
        } else if (dst->src[0]->type == GGML_TYPE_Q8_0) {
            // add log for kai microkernel
            std::cout << "kai matmul Q8_0 " << dst->name << std::endl;
            return compute_forward_q8_0(params, dst);
        } else if (dst->src[0]->type == GGML_TYPE_F16) {
            // add log for kai microkernel
            std::cout << "kai matmul fp16 " << dst->name << std::endl;
            return compute_forward_fp16(params, dst);
        }
    }
    // the rest of the function is unchanged
```
When running the model, logs like the following are printed:
```text
kai matmul Q4_0 Qcur-27
kai matmul Q4_0 Vcur-27
kai matmul Q4_0 Kcur-27
kai matmul Q4_0 attn_out-27
kai matmul Q4_0 ffn_gate-27
kai matmul Q4_0 ffn_up-27
kai matmul Q4_0 ffn_out-27
```
Taking one attention block of the Llama-3.2-3B-Instruct-Q4_0 model as an example, the operators accelerated by KleidiAI SME2-optimized microkernels are highlighted manually with blue boxes in the graph of the attention block shown below. How to generate the graph is beyond the scope of this Learning Path; please refer to external resources.

![Figure highlighting operators accelerated by KleidiAI SME2-optimized microkernels alt-text#center](images/one_attention_block.jpg "Operators accelerated by KleidiAI SME2-optimized microkernels in one attention block")

KleidiAI support in llama.cpp is still evolving; more operators will be accelerated by KleidiAI microkernels over time, unleashing more of the potential of SME2.

## Summary
With out-of-the-box KleidiAI and SME2 support in llama.cpp, we can get a significant performance uplift at both the prefill and decode stages, which enhances the experience of running LLMs locally on device.