---
title: KleidiAI SME2 matmul microkernel for quantized models explained

minutes_to_complete: 40

who_is_this_for: This is an advanced topic for software developers, performance engineers, and AI practitioners who want to understand how KleidiAI SME2 matmul microkernels accelerate quantized matrix multiplication in AI frameworks such as llama.cpp.

learning_objectives:
- Learn how a KleidiAI matmul microkernel performs matrix multiplication with quantized data
- Learn how SME2 INT8 Outer Product Accumulate instructions are used for matrix multiplication
- Learn how a KleidiAI SME2 matmul microkernel accelerates matmul operators in a Large Language Model
- Learn how to integrate KleidiAI SME2 matmul microkernels into an AI framework or application

prerequisites:
- Knowledge of KleidiAI and SME2

author: Zenon Zhilong Xiu

### Tags
skilllevels: Advanced
subjects: ML
armips:
- Arm C1 CPU
- Arm SME2 unit
tools_software_languages:
- C++
- KleidiAI
- llama.cpp
operatingsystems:
- Android
- Linux



further_reading:
    - resource:
        title: "Part 1: Arm Scalable Matrix Extension Introduction"
        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction
        type: blog
    - resource:
        title: "Part 2: Arm Scalable Matrix Extension Instructions"
        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2
        type: blog
    - resource:
        title: "Part 4: Arm SME2 Introduction"
        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/part4-arm-sme2-introduction
        type: blog
    - resource:
        title: Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels
        link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/
        type: blog



### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # The weight controls the order of the pages. _index.md always has weight 1.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
---
title: Explain the SME2 matmul microkernel with an example - Part 1
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Explain the SME2 matmul microkernel with an example - Part 1
By integrating the SME2‑optimized KleidiAI kernels into llama.cpp, the heavy matrix‑multiplication workloads in the K, Q, and V computations of the attention blocks, as well as in the FFN layers, can be delegated to the SME2 matmul microkernel when running the Llama-3.2-3B-Q4_0.gguf model.
In these operators, the LHS (activation) data type is FP32, while the RHS (weight) uses the GGML Q4_0 quantized type.

To simplify the demonstration in this Learning Path, the LHS dimensions [m, k] are reduced to [16, 64], the RHS dimensions [n, k] are reduced to [64, 64], and the SME2 SVL is set to 512 bits.

### Packing the RHS
Although the original Q4_0 RHS (weight) in the model uses INT4 quantization, it is signed INT4 quantization rather than the unsigned INT4 quantization that the SME2 matmul microkernel requires. Moreover, the layout of the INT4 quantized data and the quantization scales does not meet the requirements of the SME2 matmul microkernel either. Therefore, the RHS from the model needs to be converted from signed INT4 to unsigned INT4 and repacked.
Since the RHS (weight) remains unchanged during inference, this conversion and packing only needs to be performed once, when loading the model.


Let's take a closer look at GGML Q4_0 quantization first to understand how the original FP32 weights are quantized to the Q4_0 format.
In the Q4_0 model, the Q4_0 weights are stored in an [n, k] layout.
GGML Q4_0 quantizes weights in blocks of 32 floats. For each block, it calculates a scale and then converts each value into a signed 4-bit integer. The scale is stored as FP16.
GGML Q4_0 then packs the values as follows:
- The low nibble (bits 0–3) holds the first value (even index)
- The high nibble (bits 4–7) holds the second value (odd index)

Thus, each byte contains a low/high pair.
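
To make the block-quantization step more concrete, here is a simplified C++ sketch of the idea. It is illustrative only and not the exact ggml implementation (the real code handles rounding, clamping, and nibble packing details); the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// One Q4_0-style block: 32 values share a single scale (stored as FP16 in the real format).
struct BlockQ4Sketch {
    float  scale;    // per-block scale
    int8_t q[32];    // signed 4-bit values in [-8, 7], kept unpacked here for clarity
};

BlockQ4Sketch quantize_block_q4_sketch(const float* x /* 32 FP32 weights */) {
    BlockQ4Sketch out{};
    float amax = 0.0f, maxv = 0.0f;
    for (int i = 0; i < 32; ++i) {                 // find the element with the largest magnitude
        if (std::fabs(x[i]) > amax) { amax = std::fabs(x[i]); maxv = x[i]; }
    }
    const float d  = (maxv != 0.0f) ? maxv / -8.0f : 0.0f;   // scale so values map into [-8, 7]
    const float id = (d != 0.0f) ? 1.0f / d : 0.0f;
    out.scale = d;
    for (int i = 0; i < 32; ++i) {
        const int v = (int)std::lround(x[i] * id);
        out.q[i] = (int8_t)std::max(-8, std::min(7, v));     // clamp to the signed 4-bit range
    }
    return out;
}
```
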
The following diagram shows how GGML Q4_0 quantizes and packs the original [n, k] FP32 matrix into Q4_0 type with layout of [n, k].
![Figure showing GGML Q4_0 quantization alt-text#center](images/q4_0_format.jpg "GGML Q4_0 quantization")

Unfortunately, the Q4_0 format does not meet the requirements of the SME2 matmul microkernel. It needs to be converted to an unsigned INT4 quantization format and repacked using the *kai_run_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon* function.

In this example, we use m=16 and k=64.
- The required mr value for the SME2 matmul kernel is obtained using *kai_get_mr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa*. Here, mr=16.
- The required nr value for the SME2 matmul kernel is obtained using *kai_get_nr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa*. Here, nr=64.
- The required kr value for the SME2 matmul kernel is obtained using *kai_get_kr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa*. Here, kr=4.
- The required sr value for the SME2 matmul kernel is obtained using *kai_get_sr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa*. Here, sr=2 (two INT4 elements in a byte).
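
For reference, the following minimal C++ sketch shows how these blocking parameters can be queried at runtime. The getter names are taken from the list above; the header name is an assumption, so check the KleidiAI source tree for the exact include path.

```cpp
#include <cstddef>
#include <cstdio>

// Assumed header name; the getters below are declared in the KleidiAI ukernel headers.
#include "kai_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa.h"

int main() {
    // Each getter takes no arguments and returns the blocking parameter for this microkernel.
    const size_t mr = kai_get_mr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa();
    const size_t nr = kai_get_nr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa();
    const size_t kr = kai_get_kr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa();
    const size_t sr = kai_get_sr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa();

    // With a 512-bit SVL this is expected to print mr=16, nr=64, kr=4, sr=2, as described above.
    std::printf("mr=%zu nr=%zu kr=%zu sr=%zu\n", mr, nr, kr, sr);
    return 0;
}
```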

The function call stack for this process in llama.cpp when loading the model is as follows:
```text
llama_model_load
llama_model::load_tensors
llama_model_loader::load_all_data
ggml_backend_tensor_set
ggml_backend_cpu_kleidiai_buffer_set_tensor
ggml::cpu::kleidiai::tensor_traits::repack
kai_run_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
```
This process can be illustrated with the diagram below.
![Figure showing RHS packing with KleidiAI alt-text#center](images/kai_kernel_packed_rhs.jpg "RHS packing with KleidiAI")

The numerical label of an element in the diagram indicates its row and column number in the original matrix. For example,
![Figure showing Row_Col label alt-text#center](images/row_col_lable.png "Row_Col label")
this label indicates that the element is located at row 01, column 02 of the original matrix. The row and column numbers remain unchanged in the quantized and packed matrix, so the location of each element can be tracked easily.

Now, the RHS is converted and packed into a format that can be handled by the SME2 matmul microkernel, allowing the packed RHS to be loaded into SME2 Z registers with sequential memory access. This improves memory access efficiency and reduces cache misses.
---
title: Explain the SME2 matmul microkernel with an example - Part 2
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Explain the SME2 matmul microkernel with an example - Part 2
Next, the FP32 LHS (activation) needs to be quantized and packed when the llama.cpp graph runner computes the matmul nodes/operators.

### Quantization and Packing of the LHS
Since the LHS (activation) keeps changing, we need to dynamically quantize the original FP32 matrix and pack it into the qsi8d32p1vlx4 format. This can be achieved using the *kai_run_lhs_quant_pack_qsi8d32p_f32_neon* microkernel.
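
As a rough illustration of the dynamic quantization step (this is not the KleidiAI implementation, and the struct and function names are hypothetical), each block of 32 FP32 activations gets its own scale and is rounded to signed INT8:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct LhsBlockQ8 {
    float  scale;     // per-block scale (the packed format stores it alongside the data)
    int8_t q[32];     // 32 signed INT8 values
};

LhsBlockQ8 quantize_lhs_block(const float* x /* 32 FP32 activations */) {
    LhsBlockQ8 out{};
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    out.scale = amax / 127.0f;                                // map the block into [-127, 127]
    const float inv = (out.scale != 0.0f) ? 1.0f / out.scale : 0.0f;
    for (int i = 0; i < 32; ++i) out.q[i] = (int8_t)std::lround(x[i] * inv);
    return out;
}
```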

The function call stack for this process in llama.cpp is as follows:
```text
llama_context::decode
llama_context::process_ubatch
llama_context::graph_compute
ggml_backend_sched_compute_splits
ggml_backend_cpu_graph_compute
ggml_graph_compute //tick off the compute thread
ggml_graph_compute_thread //the compute thread
ggml_compute_forward
ggml_cpu_extra_compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward_q4_0
kai_run_lhs_quant_pack_qsi8d32p_f32_neon
```
The diagram below illustrates how the LHS is quantized and packed by *kai_run_lhs_quant_pack_qsi8d32p_f32_neon*:
![Figure showing Quantization and Packing of the LHS alt-text#center](images/kai_run_lhs_quant_pack_qsi8d32p_f32_neon_for_sme2.jpg "Quantization and Packing of the LHS")

The values of mr, nr, and kr can be obtained in the same way as described earlier.
The mr, nr, and kr values, together with the matrix dimensions m and k, are passed as parameters to *kai_run_lhs_quant_pack_qsi8d32p_f32_neon*. This function quantizes the FP32 LHS to signed INT8 and packs the quantized data and quantization scales as shown in the diagram above. It divides the [m, k] matrix into submatrices of size mr x kr (16 x 4 in this example), shown as blocks outlined by dashed lines in the upper matrix of the diagram, and then sequentially packs the rows within each submatrix. This allows the SME2 matmul kernel to load an entire submatrix into an SME2 Z register from contiguous memory, reducing cache misses by avoiding loads of a submatrix spread across multiple rows.
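
The packing order described above can be sketched in C++ as follows. This illustrates the tile ordering only, assuming a row-major [m, k] input with m and k being multiples of mr and kr; the real microkernel also interleaves the per-block quantization scales and handles padding.

```cpp
#include <cstdint>
#include <vector>

// Repack a row-major quantized [m, k] LHS so that each mr x kr tile is contiguous in memory.
std::vector<int8_t> pack_lhs_tiles(const std::vector<int8_t>& q,
                                   int m, int k, int mr, int kr) {
    std::vector<int8_t> packed;
    packed.reserve(q.size());
    for (int m0 = 0; m0 < m; m0 += mr) {          // step through tile rows
        for (int k0 = 0; k0 < k; k0 += kr) {      // step through kr-wide slices of K
            for (int r = 0; r < mr; ++r) {        // rows inside the tile, written back to back
                for (int c = 0; c < kr; ++c) {
                    packed.push_back(q[(m0 + r) * k + (k0 + c)]);
                }
            }
        }
    }
    return packed;  // one mr x kr tile can now be loaded with a single contiguous read
}
```
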
---
title: Explain the SME2 matmul microkernel with an example - Part 3
weight: 7

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Explain the SME2 matmul microkernel with an example - Part 3
Once the required LHS and RHS are both ready, the *kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa* microkernel can run.

### Run the SME2 matmul microkernel
The operations performed to compute a 16x64 (1VL x 4VL) result submatrix (four 16x16 submatrices) are as follows:

- Iterate over the blocks along the K dimension
- Iterate within a block with a step of kr (kr=4)
- Load one SME2 SVL-length (512 bits) of data from the quantized and packed LHS (containing 64 INT8 values) into one SME2 Z register
- Load two SME2 SVL-lengths of data from the packed RHS (containing 2 x 64 x 2 INT4 values) into two SME2 Z registers, then use the SME2 LUTI4 lookup-table instruction to convert these INT4 values to INT8, expanding them into four SME2 Z registers (4VL)
- Use the SME2 INT8 Outer Product Accumulate (MOPA) instruction to perform outer products between the LHS Z register and each of the four RHS Z registers, accumulating the results into four ZA tiles (which are initialized to zero). This produces intermediate results for four 16x16 output submatrices.

The process of the first iteration is illustrated in the diagram below:
![Figure showing the first iteration of the inner loop alt-text#center](images/run_matmul_sme2_step1.jpg "The first iteration of the inner loop")
The diagram below illustrates the process of the second iteration along the K dimension:
![Figure showing the second iteration of the inner loop alt-text#center](images/run_matmul_sme2_step2.jpg "The second iteration of the inner loop")
- After completing the iterations within a block, the intermediate INT32 results of the four 16x16 output submatrices are dequantized to FP32 with the per-block LHS and RHS scales, using the Floating-point Multiply (FMUL), Floating-point Multiply-Accumulate (FMLA), and Signed fixed-point Convert to Floating-point (SCVTF) vector instructions. This produces intermediate FP32 results for the four 16x16 output submatrices.
- Accumulate the FP32 results above

After completing the iteration along the K dimension, the FP32 results of the four 16x16 output submatrices are ready and are then stored to memory.
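
To make the MOPA step easier to follow, here is a minimal C++ reference model of a single SME2 SMOPA (signed INT8 to INT32 widening outer product accumulate) at a 512-bit SVL. It models the architectural behavior described above; it is not KleidiAI code.

```cpp
#include <cstdint>

constexpr int SVL_BYTES = 64;   // 512-bit vector length
constexpr int TILE_DIM  = 16;   // a 32-bit ZA tile is 16 x 16 at SVL = 512

// za : one ZA tile, accumulated in place
// zn : LHS Z register, 16 rows x 4 consecutive INT8 values (a 16 x 4 sub-block)
// zm : RHS Z register, 16 columns x 4 consecutive INT8 values (a 4 x 16 sub-block)
void smopa_s8_s32(int32_t za[TILE_DIM][TILE_DIM],
                  const int8_t zn[SVL_BYTES], const int8_t zm[SVL_BYTES]) {
    for (int row = 0; row < TILE_DIM; ++row) {
        for (int col = 0; col < TILE_DIM; ++col) {
            int32_t acc = 0;
            for (int i = 0; i < 4; ++i) {   // 4-way widening dot product per output element
                acc += int32_t(zn[row * 4 + i]) * int32_t(zm[col * 4 + i]);
            }
            za[row][col] += acc;            // accumulate into the ZA tile
        }
    }
}
```
Calling this model four times with the same LHS register and each of the four RHS registers reproduces the four ZA tile updates; each call performs 16 x 16 x 4 = 1,024 INT8 multiply-accumulates.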

The code can be found [here](https://github.com/ARM-software/kleidiai/blob/main/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa_asm.S#L80).
Comments have been added below to help explain the code.
```asm
KAI_ASM_LABEL(label_3) // K Loop
KAI_ASM_INST(0xc00800ff) // zero {za} // zero the four ZA tiles (za0.s, za1.s, za2.s, za3.s)
mov x11, x4 // set the block length counter
KAI_ASM_LABEL(label_4) // Block Loop
KAI_ASM_INST(0xa0404342) // ld1w {z2.s - z3.s}, pn8/z, [x26] // load two VLs of packed RHS data (64x2x2 INT4 values)
addvl x26, x26, #2 // advance the RHS address by two VLs
ld1h {z8.h}, p0/z, [x3] // load one VL of quantized and packed LHS data (64 INT8 values)
addvl x3, x3, #1 // advance the LHS address by one VL
KAI_ASM_INST(0xc08a4044) // luti4 {z4.b - z5.b}, zt0, z2[0] // use the LUTI4 instruction to convert INT4 to INT8; one source VL produces two VLs of results
KAI_ASM_INST(0xc08a4066) // luti4 {z6.b - z7.b}, zt0, z3[0] // use the LUTI4 instruction to convert INT4 to INT8; one source VL produces two VLs of results
KAI_ASM_INST(0xa0840100) // smopa za0.s, p0/m, p0/m, z8.b, z4.b // Outer Product Accumulate of the LHS VL with the first RHS VL into ZA0.S
KAI_ASM_INST(0xa0850101) // smopa za1.s, p0/m, p0/m, z8.b, z5.b // Outer Product Accumulate of the LHS VL with the second RHS VL into ZA1.S
KAI_ASM_INST(0xa0860102) // smopa za2.s, p0/m, p0/m, z8.b, z6.b // Outer Product Accumulate of the LHS VL with the third RHS VL into ZA2.S
KAI_ASM_INST(0xa0870103) // smopa za3.s, p0/m, p0/m, z8.b, z7.b // Outer Product Accumulate of the LHS VL with the fourth RHS VL into ZA3.S

subs x11, x11, #4 // decrement the block counter by kr (4)
b.gt label_4 // end of the block iteration?

// the code below performs per block dequantization of the four tiles with LHS and RHS scales
mov w12, #0
mov x25, x24
ld1b {z17.b}, p4/z, [x3] // lhs sum
ld1b {z16.b}, p4/z, [x3, #1, mul vl] // lhs scale
addvl x3, x3, #2
KAI_ASM_INST(0xa040c354) // ld1w { z20.s - z23.s }, pn8/z, [x26] // rhs zp
KAI_ASM_INST(0xa041c340) // ld1w { z0.s - z3.s }, pn8/z, [x26, #4, mul vl ] // rhs scale
addvl x26, x26, #8
pfalse p3.b
KAI_ASM_LABEL(label_5)
// some code that performs the block dequantization and stores the result to memory is omitted
……
blt label_5
subs x10, x10, x4 // decrease the remaining K count by the block length
b.gt label_3 //end of K loop?

```
In a single block loop iteration, four pipelined SME2 INT8 MOPA instructions perform 4,096 MAC operations (each SMOPA computes 16 x 16 x 4 = 1,024 INT8 MACs at a 512-bit SVL), producing the intermediate results for the four 16x16 submatrices. This shows how SME2 MOPA can significantly improve matrix multiplication performance.

To help understand the whole process, we map the first iteration of the LHS and RHS quantization and packing steps, together with the SME2 outer product accumulate and dequantization operations, back to the original FP32 LHS and RHS. Essentially, they perform the equivalent operation shown below (apart from some quantization loss):
![Figure showing the original matrix representation of the first iteration alt-text#center](images/run_matmul_sme2_original_present_step1.jpg "The original matrix representation of the first iteration")

The second iteration can be mapped back to the original FP32 LHS and RHS operations as below:
![Figure showing the original matrix representation of the second iteration alt-text#center](images/run_matmul_sme2_original_present_step2.jpg "The original matrix representation of the second iteration")

**Note**: In these diagrams, the RHS is laid out with dimensions [N, K], which differs from the [K, N] layout used for the RHS in the 1VLx4VL video demonstration. If you interpret the RHS in the diagrams above as [K, N], the diagrams match the earlier video demonstration.

By repeating the submatrix computation across the M and N dimensions, the entire result matrix can be calculated. If a non-empty bias is passed to the SME2 matmul microkernel, it also adds the bias to the result matrix.
---
title: How does a KleidiAI matmul microkernel perform matrix multiplication with quantized data?
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## How does a KleidiAI matmul microkernel perform matrix multiplication with quantized data?
Essentially, a KleidiAI matmul microkernel uses tile-based matrix multiplication (matmul), where small submatrices of the output are computed one by one.
- **mr**: number of rows of Matrix C (and Matrix A) computed at once
- **nr**: number of columns of Matrix C (and Matrix B) computed at once
- **bl**: number of elements along the K dimension processed per block
- **kr**: number of elements along the K dimension processed per inner step

The video below demonstrates how matrix multiplication is carried out using this method.
![Figure showing Tile-Based matrix multiplication with KleidiAI alt-text#center](videos/matrix_tile.gif "Tile-Based matrix multiplication with KleidiAI")

This process is described by the following pseudocode:
```c
// RHS N LOOP
for (n_idx = 0; n_idx < n; n_idx += nr) {
    // LHS M LOOP
    for (m_idx = 0; m_idx < m; m_idx += mr) {
        // K LOOP: break the K dimension into blocks first
        blocks_in_k = k / bl;   // bl is the block length
        // Block Loop
        for (bl_idx = 0; bl_idx < blocks_in_k; bl_idx += 1) {
            // Loop inside a block
            krs_in_block = bl / kr;   // kr is the number of K elements per inner step
            for (k_idx = 0; k_idx < krs_in_block; k_idx += 1) {
                // Perform the matrix multiplication with source submatrices of size [mr, kr] and [kr, nr]
                // Accumulate the result into the per-block result
            }
            // Accumulate the per-block results along the K dimension. When the iteration over K completes,
            // a submatrix of size [mr, nr] of the output matrix is ready
        }
        // Continue computing [mr, nr] output submatrices along the M dimension
    }
    // Continue computing [mr, nr] output submatrices along the N dimension
}
```
In general, KleidiAI matmul microkernels implement matrix multiplication in a way similar to this pseudocode.

KleidiAI also provides corresponding packing microkernels for the matmul microkernels, which arrange the matrix multiplication inputs for efficient contiguous memory access and reduce cache misses.

KleidiAI supports quantized matrix multiplication to speed up AI inference on Arm CPUs. Instead of multiplying the full-precision (FP32) matrices A and B directly, it quantizes:
- The Left Hand Source (LHS, or Left Hand Matrix/activation) matrix to 8-bit integers
- The Right Hand Source (RHS, or Right Hand Matrix/weights) matrix to 4-bit or 8-bit integers

It then packs those quantized values into memory layouts suited to CPU vector instructions such as DOTPROD, I8MM, and SME2, runs a microkernel that computes efficiently on the packed quantized data, and finally scales the result back to floating point.
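
The following self-contained C++ sketch shows the same quantize, integer-compute, dequantize flow for a single dot product. It is a simplification (one symmetric scale per row and per column, INT8 on both sides, no packing) rather than KleidiAI code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Approximate the FP32 dot product of a and b using quantized integer arithmetic.
// Assumes the inputs are not all zero, for brevity.
float quantized_dot(const std::vector<float>& a, const std::vector<float>& b) {
    auto max_abs = [](const std::vector<float>& v) {
        float m = 0.0f;
        for (float x : v) m = std::max(m, std::fabs(x));
        return m;
    };

    // 1. Quantize the LHS row and the RHS column to INT8 with per-vector scales
    //    (a real kernel uses INT4 or INT8 for the RHS and per-block scales).
    const float sa = max_abs(a) / 127.0f;
    const float sb = max_abs(b) / 127.0f;
    std::vector<int8_t> qa(a.size()), qb(b.size());
    for (size_t i = 0; i < a.size(); ++i) qa[i] = (int8_t)std::lround(a[i] / sa);
    for (size_t i = 0; i < b.size(); ++i) qb[i] = (int8_t)std::lround(b[i] / sb);

    // 2. Integer multiply-accumulate: this is the part that DOTPROD, I8MM, or SME2
    //    instructions accelerate on packed data.
    int32_t acc = 0;
    for (size_t i = 0; i < a.size(); ++i) acc += int32_t(qa[i]) * int32_t(qb[i]);

    // 3. Dequantize: scale the integer result back to floating point.
    return float(acc) * sa * sb;
}
```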

This process is illustrated in the following diagram:
![Figure showing quantized matrix multiplication with KleidiAI kernels alt-text#center](images/kai_matmul_kernel.jpg "Quantized matrix multiplication with KleidiAI kernel")

Please find more information in this learning path, [Accelerate Generative AI workloads using KleidiAI](https://learn.arm.com/learning-paths/cross-platform/kleidiai-explainer/).