---
title: KleidiAI SME2 matmul microkernel for quantized models explained

minutes_to_complete: 40

who_is_this_for: This is an advanced topic for software developers, performance engineers, and AI practitioners who want to understand how KleidiAI SME2 matmul microkernels accelerate quantized matrix multiplication in AI frameworks such as llama.cpp.

learning_objectives:
- Learn how a KleidiAI matmul microkernel performs matrix multiplication with quantized data
- Learn how SME2 INT8 Outer Product Accumulate instructions are used for matrix multiplication
- Learn how a KleidiAI SME2 matmul microkernel accelerates matmul operators in a Large Language Model
- Learn how to integrate KleidiAI SME2 matmul microkernels into an AI framework or application

prerequisites:
- Knowledge of KleidiAI and SME2

author: Zenon Zhilong Xiu

### Tags
skilllevels: Advanced
subjects: ML
armips:
- Arm C1 CPU
- Arm SME2 unit
tools_software_languages:
- C++
- KleidiAI
- llama.cpp
operatingsystems:
- Android
- Linux



further_reading:
    - resource:
        title: "Part 1: Arm Scalable Matrix Extension Introduction"
        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction
        type: blog
    - resource:
        title: "Part 2: Arm Scalable Matrix Extension Instructions"
        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2
        type: blog
    - resource:
        title: "Part 4: Arm SME2 Introduction"
        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/part4-arm-sme2-introduction
        type: blog
    - resource:
        title: Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels
        link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/
        type: blog



### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # The weight controls the order of the pages. _index.md always has weight 1.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
---
title: Explain the SME2 matmul microkernel with an example - Part 1
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Explain the SME2 matmul microkernel with an example - Part 1
By integrating the SME2‑optimized KleidiAI kernels into llama.cpp, the heavy matrix‑multiplication workloads in the K, Q, and V computations of the attention blocks, as well as in the FFN layers, can be delegated to the SME2 matmul microkernel when running the Llama-3.2-3B-Q4_0.gguf model.
In these operators, the LHS (activation) data type is FP32, while the RHS (weight) uses the GGML Q4_0 quantized type.

To simplify the demonstration in this Learning Path, the LHS dimensions [m, k] are reduced to [16, 64], the RHS dimensions [n, k] are reduced to [64, 64], and the SME2 SVL is set to 512 bits.

### Packing the RHS
Although the original Q4_0 RHS (weight) in the model uses INT4 quantization, it is signed INT4 quantization rather than the unsigned INT4 quantization that the SME2 matmul microkernel requires. Moreover, the layout of the INT4 quantized data and the quantization scales does not meet the requirements of the SME2 matmul microkernel either. Therefore, the RHS from the model needs to be converted from signed INT4 to unsigned INT4 and repacked.
Since the RHS (weight) remains unchanged during inference, this conversion and packing only needs to be performed once, when loading the model.


Let's take a closer look at GGML Q4_0 quantization first to understand how the original FP32 weights are quantized to the Q4_0 format.
In the Q4_0 model, the Q4_0 weights are stored in an [n, k] layout.
GGML Q4_0 quantizes weights in blocks of 32 floats. For each block, it calculates a scale and then converts each value into a signed 4-bit integer. The scale is stored as FP16.
GGML Q4_0 then packs the values as follows:
- The low nibble (bits 0–3) holds the first value (even index)
- The high nibble (bits 4–7) holds the second value (odd index)

Thus, each byte contains a low/high pair.
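
To make the block-quantization step more concrete, here is a simplified C++ sketch of the idea. It is illustrative only and not the exact ggml implementation (the real code handles rounding, clamping, and nibble packing details); the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// One Q4_0-style block: 32 values share a single scale (stored as FP16 in the real format).
struct BlockQ4Sketch {
    float  scale;    // per-block scale
    int8_t q[32];    // signed 4-bit values in [-8, 7], kept unpacked here for clarity
};

BlockQ4Sketch quantize_block_q4_sketch(const float* x /* 32 FP32 weights */) {
    BlockQ4Sketch out{};
    float amax = 0.0f, maxv = 0.0f;
    for (int i = 0; i < 32; ++i) {                 // find the element with the largest magnitude
        if (std::fabs(x[i]) > amax) { amax = std::fabs(x[i]); maxv = x[i]; }
    }
    const float d  = (maxv != 0.0f) ? maxv / -8.0f : 0.0f;   // scale so values map into [-8, 7]
    const float id = (d != 0.0f) ? 1.0f / d : 0.0f;
    out.scale = d;
    for (int i = 0; i < 32; ++i) {
        const int v = (int)std::lround(x[i] * id);
        out.q[i] = (int8_t)std::max(-8, std::min(7, v));     // clamp to the signed 4-bit range
    }
    return out;
}
```
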
The following diagram shows how GGML Q4_0 quantizes and packs the original [n, k] FP32 matrix into Q4_0 type with layout of [n, k].
![Figure showing GGML Q4_0 quantization alt-text#center](images/q4_0_format.jpg "GGML Q4_0 quantization")

Unfortunately, the Q4_0 format does not meet the requirements of the SME2 matmul microkernel. It needs to be converted to an unsigned INT4 quantization format and repacked using the *kai_run_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon* function.

In this example, we use m=16 and k=64.
- The required mr value for the SME2 matmul kernel is obtained using *kai_get_mr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa*. Here, mr=16.
- The required nr value for the SME2 matmul kernel is obtained using *kai_get_nr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa*. Here, nr=64.
- The required kr value for the SME2 matmul kernel is obtained using *kai_get_kr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa*. Here, kr=4.
- The required sr value for the SME2 matmul kernel is obtained using *kai_get_sr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa*. Here, sr=2 (two INT4 elements in a byte).
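
For reference, the following minimal C++ sketch shows how these blocking parameters can be queried at runtime. The getter names are taken from the list above; the header name is an assumption, so check the KleidiAI source tree for the exact include path.

```cpp
#include <cstddef>
#include <cstdio>

// Assumed header name; the getters below are declared in the KleidiAI ukernel headers.
#include "kai_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa.h"

int main() {
    // Each getter takes no arguments and returns the blocking parameter for this microkernel.
    const size_t mr = kai_get_mr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa();
    const size_t nr = kai_get_nr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa();
    const size_t kr = kai_get_kr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa();
    const size_t sr = kai_get_sr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa();

    // With a 512-bit SVL this is expected to print mr=16, nr=64, kr=4, sr=2, as described above.
    std::printf("mr=%zu nr=%zu kr=%zu sr=%zu\n", mr, nr, kr, sr);
    return 0;
}
```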

The function call stack for this process in llama.cpp when loading the model is as follows:
```text
llama_model_load
llama_model::load_tensors
llama_model_loader::load_all_data
ggml_backend_tensor_set
ggml_backend_cpu_kleidiai_buffer_set_tensor
ggml::cpu::kleidiai::tensor_traits::repack
kai_run_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
```
This process can be illustrated with the diagram below.
![Figure showing RHS packing with KleidiAI alt-text#center](images/kai_kernel_packed_rhs.jpg "RHS packing with KleidiAI")

The numerical label of an element in the diagram indicates its row and column number in the original matrix. For example,
![Figure showing Row_Col label alt-text#center](images/row_col_lable.png "Row_Col label")
this label indicates that the element is located at row 01, column 02 of the original matrix. The row and column numbers remain unchanged in the quantized and packed matrix, so the location of each element can be tracked easily.

Now, the RHS is converted and packed into a format that can be handled by the SME2 matmul microkernel, allowing the packed RHS to be loaded into SME2 Z registers with sequential memory access. This improves memory access efficiency and reduces cache misses.
---
title: Explain the SME2 matmul microkernel with an example - Part 2
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Explain the SME2 matmul microkernel with an example - Part 2
Next, the FP32 LHS (activation) needs to be quantized and packed when the llama.cpp graph runner computes the matmul nodes/operators.

### Quantization and Packing of the LHS
Since the LHS (activation) keeps changing, we need to dynamically quantize the original FP32 matrix and pack it into the qsi8d32p1vlx4 format. This can be achieved using the *kai_run_lhs_quant_pack_qsi8d32p_f32_neon* microkernel.
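
As a rough illustration of the dynamic quantization step (this is not the KleidiAI implementation, and the struct and function names are hypothetical), each block of 32 FP32 activations gets its own scale and is rounded to signed INT8:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct LhsBlockQ8 {
    float  scale;     // per-block scale (the packed format stores it alongside the data)
    int8_t q[32];     // 32 signed INT8 values
};

LhsBlockQ8 quantize_lhs_block(const float* x /* 32 FP32 activations */) {
    LhsBlockQ8 out{};
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    out.scale = amax / 127.0f;                                // map the block into [-127, 127]
    const float inv = (out.scale != 0.0f) ? 1.0f / out.scale : 0.0f;
    for (int i = 0; i < 32; ++i) out.q[i] = (int8_t)std::lround(x[i] * inv);
    return out;
}
```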

The function call stack for this process in llama.cpp is as follows:
```text
llama_context::decode
llama_context::process_ubatch
llama_context::graph_compute
ggml_backend_sched_compute_splits
ggml_backend_cpu_graph_compute
ggml_graph_compute //tick off the compute thread
ggml_graph_compute_thread //the compute thread
ggml_compute_forward
ggml_cpu_extra_compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward
ggml::cpu::kleidiai::tensor_traits::compute_forward_q4_0
kai_run_lhs_quant_pack_qsi8d32p_f32_neon
```
The diagram below illustrates how the LHS is quantized and packed by *kai_run_lhs_quant_pack_qsi8d32p_f32_neon*:
![Figure showing Quantization and Packing of the LHS alt-text#center](images/kai_run_lhs_quant_pack_qsi8d32p_f32_neon_for_sme2.jpg "Quantization and Packing of the LHS")

The values of mr, nr, and kr can be obtained in the same way as described earlier.
The mr, nr, and kr values, together with the matrix dimensions m and k, are passed as parameters to *kai_run_lhs_quant_pack_qsi8d32p_f32_neon*. This function quantizes the FP32 LHS to signed INT8 and packs the quantized data and quantization scales as shown in the diagram above. It divides the [m, k] matrix into submatrices of size mr x kr (16 x 4 in this example), shown as blocks outlined by dashed lines in the upper matrix of the diagram, and then sequentially packs the rows within each submatrix. This allows the SME2 matmul kernel to load an entire submatrix into an SME2 Z register from contiguous memory, reducing cache misses by avoiding loads of a submatrix spread across multiple rows.
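
The packing order described above can be sketched in C++ as follows. This illustrates the tile ordering only, assuming a row-major [m, k] input with m and k being multiples of mr and kr; the real microkernel also interleaves the per-block quantization scales and handles padding.

```cpp
#include <cstdint>
#include <vector>

// Repack a row-major quantized [m, k] LHS so that each mr x kr tile is contiguous in memory.
std::vector<int8_t> pack_lhs_tiles(const std::vector<int8_t>& q,
                                   int m, int k, int mr, int kr) {
    std::vector<int8_t> packed;
    packed.reserve(q.size());
    for (int m0 = 0; m0 < m; m0 += mr) {          // step through tile rows
        for (int k0 = 0; k0 < k; k0 += kr) {      // step through kr-wide slices of K
            for (int r = 0; r < mr; ++r) {        // rows inside the tile, written back to back
                for (int c = 0; c < kr; ++c) {
                    packed.push_back(q[(m0 + r) * k + (k0 + c)]);
                }
            }
        }
    }
    return packed;  // one mr x kr tile can now be loaded with a single contiguous read
}
```
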
---
title: Explain the SME2 matmul microkernel with an example - Part 3
weight: 7

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Explain the SME2 matmul microkernel with an example - Part 3
Once the required LHS and RHS are both ready, the *kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa* microkernel can run.

### Run the SME2 matmul microkernel
The operations performed to compute a 16x64 (1VL x 4VL) result submatrix (four 16x16 submatrices) are as follows:

- Iterate over the blocks along the K dimension
- Iterate within a block with a step of kr (kr=4)
- Load one SME2 SVL-length (512 bits) of data from the quantized and packed LHS (containing 64 INT8 values) into one SME2 Z register
- Load two SME2 SVL-lengths of data from the packed RHS (containing 2 x 64 x 2 INT4 values) into two SME2 Z registers, then use the SME2 LUTI4 lookup-table instruction to convert these INT4 values to INT8, expanding them into four SME2 Z registers (4VL)
- Use the SME2 INT8 Outer Product Accumulate (MOPA) instruction to perform outer products between the LHS Z register and each of the four RHS Z registers, accumulating the results into four ZA tiles (which are initialized to zero). This produces intermediate results for four 16x16 output submatrices.

The process of the first iteration is illustrated in the diagram below:
![Figure showing the first iteration of the inner loop alt-text#center](images/run_matmul_sme2_step1.jpg "The first iteration of the inner loop")
The diagram below illustrates the process of the second iteration along the K dimension:
![Figure showing the second iteration of the inner loop alt-text#center](images/run_matmul_sme2_step2.jpg "The second iteration of the inner loop")
- After completing the iterations within a block, the intermediate INT32 results of the four 16x16 output submatrices are dequantized to FP32 with the per-block LHS and RHS scales, using the Floating-point Multiply (FMUL), Floating-point Multiply-Accumulate (FMLA), and Signed fixed-point Convert to Floating-point (SCVTF) vector instructions. This produces intermediate FP32 results for the four 16x16 output submatrices.
- Accumulate the FP32 results above

After completing the iteration along the K dimension, the FP32 results of the four 16x16 output submatrices are ready and are then stored to memory.
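
To make the MOPA step easier to follow, here is a minimal C++ reference model of a single SME2 SMOPA (signed INT8 to INT32 widening outer product accumulate) at a 512-bit SVL. It models the architectural behavior described above; it is not KleidiAI code.

```cpp
#include <cstdint>

constexpr int SVL_BYTES = 64;   // 512-bit vector length
constexpr int TILE_DIM  = 16;   // a 32-bit ZA tile is 16 x 16 at SVL = 512

// za : one ZA tile, accumulated in place
// zn : LHS Z register, 16 rows x 4 consecutive INT8 values (a 16 x 4 sub-block)
// zm : RHS Z register, 16 columns x 4 consecutive INT8 values (a 4 x 16 sub-block)
void smopa_s8_s32(int32_t za[TILE_DIM][TILE_DIM],
                  const int8_t zn[SVL_BYTES], const int8_t zm[SVL_BYTES]) {
    for (int row = 0; row < TILE_DIM; ++row) {
        for (int col = 0; col < TILE_DIM; ++col) {
            int32_t acc = 0;
            for (int i = 0; i < 4; ++i) {   // 4-way widening dot product per output element
                acc += int32_t(zn[row * 4 + i]) * int32_t(zm[col * 4 + i]);
            }
            za[row][col] += acc;            // accumulate into the ZA tile
        }
    }
}
```
Calling this model four times with the same LHS register and each of the four RHS registers reproduces the four ZA tile updates; each call performs 16 x 16 x 4 = 1,024 INT8 multiply-accumulates.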

The code can be found [here](https://github.com/ARM-software/kleidiai/blob/main/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa_asm.S#L80).
Comments have been added below to help explain the code.
```asm
KAI_ASM_LABEL(label_3) // K Loop
KAI_ASM_INST(0xc00800ff) // zero {za} // zero the four ZA tiles (za0.s, za1.s, za2.s, za3.s)
mov x11, x4 // set the block length counter
KAI_ASM_LABEL(label_4) // Block Loop
KAI_ASM_INST(0xa0404342) // ld1w {z2.s - z3.s}, pn8/z, [x26] // load two VLs of packed RHS data (64x2x2 INT4 values)
addvl x26, x26, #2 // advance the RHS address by two VLs
ld1h {z8.h}, p0/z, [x3] // load one VL of quantized and packed LHS data (64 INT8 values)
addvl x3, x3, #1 // advance the LHS address by one VL
KAI_ASM_INST(0xc08a4044) // luti4 {z4.b - z5.b}, zt0, z2[0] // use the LUTI4 instruction to convert INT4 to INT8; one source VL produces two VLs of results
KAI_ASM_INST(0xc08a4066) // luti4 {z6.b - z7.b}, zt0, z3[0] // use the LUTI4 instruction to convert INT4 to INT8; one source VL produces two VLs of results
KAI_ASM_INST(0xa0840100) // smopa za0.s, p0/m, p0/m, z8.b, z4.b // Outer Product Accumulate of the LHS VL with the first RHS VL into ZA0.S
KAI_ASM_INST(0xa0850101) // smopa za1.s, p0/m, p0/m, z8.b, z5.b // Outer Product Accumulate of the LHS VL with the second RHS VL into ZA1.S
KAI_ASM_INST(0xa0860102) // smopa za2.s, p0/m, p0/m, z8.b, z6.b // Outer Product Accumulate of the LHS VL with the third RHS VL into ZA2.S
KAI_ASM_INST(0xa0870103) // smopa za3.s, p0/m, p0/m, z8.b, z7.b // Outer Product Accumulate of the LHS VL with the fourth RHS VL into ZA3.S

subs x11, x11, #4 // decrement the block counter by kr (4)
b.gt label_4 // end of the block iteration?

// the code below performs per block dequantization of the four tiles with LHS and RHS scales
mov w12, #0
mov x25, x24
ld1b {z17.b}, p4/z, [x3] // lhs sum
ld1b {z16.b}, p4/z, [x3, #1, mul vl] // lhs scale
addvl x3, x3, #2
KAI_ASM_INST(0xa040c354) // ld1w { z20.s - z23.s }, pn8/z, [x26] // rhs zp
KAI_ASM_INST(0xa041c340) // ld1w { z0.s - z3.s }, pn8/z, [x26, #4, mul vl ] // rhs scale
addvl x26, x26, #8
pfalse p3.b
KAI_ASM_LABEL(label_5)
// some code that performs the block dequantization and stores the result to memory is omitted
……
blt label_5
subs x10, x10, x4 // decrease the remaining K count by the block length
b.gt label_3 //end of K loop?

```
In a single block loop iteration, four pipelined SME2 INT8 MOPA instructions perform 4,096 MAC operations (each SMOPA computes 16 x 16 x 4 = 1,024 INT8 MACs at a 512-bit SVL), producing the intermediate results for the four 16x16 submatrices. This shows how SME2 MOPA can significantly improve matrix multiplication performance.

To help understand the whole process, we map the first iteration of the LHS and RHS quantization and packing steps, together with the SME2 outer product accumulate and dequantization operations, back to the original FP32 LHS and RHS. Essentially, they perform the equivalent operation shown below (apart from some quantization loss):
![Figure showing the original matrix representation of the first iteration alt-text#center](images/run_matmul_sme2_original_present_step1.jpg "The original matrix representation of the first iteration")

The second iteration can be mapped back to the original FP32 LHS and RHS operations as below:
![Figure showing the original matrix representation of the second iteration alt-text#center](images/run_matmul_sme2_original_present_step2.jpg "The original matrix representation of the second iteration")

**Note**: In these diagrams, the RHS is laid out with dimensions [N, K], which differs from the [K, N] layout used for the RHS in the 1VLx4VL video demonstration. If you interpret the RHS in the diagrams above as [K, N], the diagrams match the earlier video demonstration.

By repeating the submatrix computation across the M and N dimensions, the entire result matrix can be calculated. If a non-empty bias is passed to the SME2 matmul microkernel, it also adds the bias to the result matrix.
---
title: How does a KleidiAI matmul microkernel perform matrix multiplication with quantized data?
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## How does a KleidiAI matmul microkernel perform matrix multiplication with quantized data?
Essentially, a KleidiAI matmul microkernel uses tile-based matrix multiplication (matmul), where small submatrices of the output are computed one by one.
- **mr**: number of rows of Matrix C (and Matrix A) computed at once
- **nr**: number of columns of Matrix C (and Matrix B) computed at once
- **bl**: number of elements along the K dimension processed per block
- **kr**: number of elements along the K dimension processed per inner step

The video below demonstrates how matrix multiplication is carried out using this method.
![Figure showing Tile-Based matrix multiplication with KleidiAI alt-text#center](videos/matrix_tile.gif "Tile-Based matrix multiplication with KleidiAI")

This process is described by the following pseudocode:
```c
// RHS N LOOP
for (n_idx = 0; n_idx < n; n_idx += nr) {
    // LHS M LOOP
    for (m_idx = 0; m_idx < m; m_idx += mr) {
        // K LOOP: break the K dimension into blocks first
        blocks_in_k = k / bl;   // bl is the block length
        // Block Loop
        for (bl_idx = 0; bl_idx < blocks_in_k; bl_idx += 1) {
            // Loop inside a block
            krs_in_block = bl / kr;   // kr is the number of K elements per inner step
            for (k_idx = 0; k_idx < krs_in_block; k_idx += 1) {
                // Perform the matrix multiplication with source submatrices of size [mr, kr] and [kr, nr]
                // Accumulate the result into the per-block result
            }
            // Accumulate the per-block results along the K dimension. When the iteration over K completes,
            // a submatrix of size [mr, nr] of the output matrix is ready
        }
        // Continue computing [mr, nr] output submatrices along the M dimension
    }
    // Continue computing [mr, nr] output submatrices along the N dimension
}
```
In general, KleidiAI matmul microkernels implement matrix multiplication in a way similar to this pseudocode.

KleidiAI also provides corresponding packing microkernels for the matmul microkernels, which arrange the matrix multiplication inputs for efficient contiguous memory access and reduce cache misses.

KleidiAI supports quantized matrix multiplication to speed up AI inference on Arm CPUs. Instead of multiplying the full-precision (FP32) matrices A and B directly, it quantizes:
- The Left Hand Source (LHS, or Left Hand Matrix/activation) matrix to 8-bit integers
- The Right Hand Source (RHS, or Right Hand Matrix/weights) matrix to 4-bit or 8-bit integers

It then packs those quantized values into memory layouts suited to CPU vector instructions such as DOTPROD, I8MM, and SME2, runs a microkernel that computes efficiently on the packed quantized data, and finally scales the result back to floating point.
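
The following self-contained C++ sketch shows the same quantize, integer-compute, dequantize flow for a single dot product. It is a simplification (one symmetric scale per row and per column, INT8 on both sides, no packing) rather than KleidiAI code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Approximate the FP32 dot product of a and b using quantized integer arithmetic.
// Assumes the inputs are not all zero, for brevity.
float quantized_dot(const std::vector<float>& a, const std::vector<float>& b) {
    auto max_abs = [](const std::vector<float>& v) {
        float m = 0.0f;
        for (float x : v) m = std::max(m, std::fabs(x));
        return m;
    };

    // 1. Quantize the LHS row and the RHS column to INT8 with per-vector scales
    //    (a real kernel uses INT4 or INT8 for the RHS and per-block scales).
    const float sa = max_abs(a) / 127.0f;
    const float sb = max_abs(b) / 127.0f;
    std::vector<int8_t> qa(a.size()), qb(b.size());
    for (size_t i = 0; i < a.size(); ++i) qa[i] = (int8_t)std::lround(a[i] / sa);
    for (size_t i = 0; i < b.size(); ++i) qb[i] = (int8_t)std::lround(b[i] / sb);

    // 2. Integer multiply-accumulate: this is the part that DOTPROD, I8MM, or SME2
    //    instructions accelerate on packed data.
    int32_t acc = 0;
    for (size_t i = 0; i < a.size(); ++i) acc += int32_t(qa[i]) * int32_t(qb[i]);

    // 3. Dequantize: scale the integer result back to floating point.
    return float(acc) * sa * sb;
}
```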

This process is illustrated in the following diagram:
![Figure showing quantized matrix multiplication with KleidiAI kernels alt-text#center](images/kai_matmul_kernel.jpg "Quantized matrix multiplication with KleidiAI kernel")

Please find more information in this learning path, [Accelerate Generative AI workloads using KleidiAI](https://learn.arm.com/learning-paths/cross-platform/kleidiai-explainer/).