
uadk supports heterogeneous computing #658

Closed
Liulongfang wants to merge 201 commits into Linaro:develop from Liulongfang:master

Collaborator

Liulongfang commented Jan 3, 2025

uadk already supports both hardware acceleration and instruction acceleration (labelled CE in the tables below). Users expect to be able to use the two together: once the hardware accelerator is saturated, instruction acceleration takes on the extra load so that business performance keeps improving. The framework should also adapt automatically to a variety of acceleration devices.

This patchset was developed for that purpose, and it has been adapted to all algorithm types of uadk.

With the updated framework, hybrid acceleration outperforms hardware-only acceleration in every benchmark row below, by about 2% to 79% for SM3 and 15% to 222% for SM4 at a 1024-byte packet length.

sm3 test cmd:
numactl --cpunodebind=0 --membind=0 uadk_tool benchmark --alg sm3 --mode sva --opt 0 --sync --pktlen 1024 --seconds 10 --thread 1 --multi 1 --ctxnum 1 --prefetch
numactl --cpunodebind=0 --membind=0 uadk_tool benchmark --alg sm3 --mode sva --opt 0 --sync --pktlen 1024 --seconds 10 --thread 1 --multi 1 --ctxnum 1 --prefetch --init2

SM3 1024B Performance (MB/s)

Threads   init1 (HW)   init2 (HW + CE)   Increase
1         393.3        437.1             11.14%
2         762.1        823.4             8.04%
4         1508.4       1564.1            3.69%
8         3007.4       3074.9            2.24%
16        4851.8       5429.2            11.90%
32        4854.1       8698.8            79.21%
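
The increase column appears to be the relative gain of init2 over init1. For the first SM3 row, (437.1 - 393.3) / 393.3 is about 11.14%. A minimal sketch of that arithmetic in C follows; `percent_increase` is an illustrative helper and is not part of uadk.

```c
#include <assert.h>

/* Relative throughput gain of the hybrid run (init2, HW + CE) over the
 * hardware-only run (init1, HW), in percent. Both inputs are in MB/s.
 * The baseline must be positive. */
double percent_increase(double baseline_mbps, double hybrid_mbps)
{
    assert(baseline_mbps > 0.0);
    return (hybrid_mbps - baseline_mbps) / baseline_mbps * 100.0;
}
```

Calling `percent_increase(4854.1, 8698.8)` reproduces the 32-thread SM3 figure of roughly 79.21%.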

sm4 test cmd:
numactl --cpunodebind=0 --membind=0 uadk_tool benchmark --alg sm4-128-ecb --mode sva --opt 0 --sync --pktlen 1024 --seconds 10 --thread 1 --multi 1 --ctxnum 1 --prefetch
numactl --cpunodebind=0 --membind=0 uadk_tool benchmark --alg sm4-128-ecb --mode sva --opt 0 --sync --pktlen 1024 --seconds 10 --thread 1 --multi 1 --ctxnum 1 --prefetch --init2
numactl --cpunodebind=0 --membind=0 uadk_tool benchmark --alg sm4-128-ecb --mode sva --opt 0 --async --pktlen 1024 --seconds 10 --thread 1 --multi 1 --ctxnum 1 --prefetch --init2

SM4 1024B Performance (MB/s)

Threads   init1 (HW)   init2 (HW + CE)   Increase
1         461          1482.5            221.58%
2         914          2575.4            181.77%
4         1699.9       4737.6            178.70%
8         3301.5       7327.8            121.95%
16        5837.5       9737.4            66.81%
32        8897.7       10432.4           17.25%

SM4 1024B async Performance (MB/s)

Threads   init1 (HW)   init2 (HW + CE)   Increase
1         1368.3       1683.9            23.07%
2         2652         3235.5            22.00%
4         3979.5       5094.5            28.02%
8         6667.7       8587              28.79%
16        8900.9       11067.8           24.34%
32        8905.9       10209.1           14.63%

Liulongfang and others added 7 commits December 30, 2024 12:02
Add the algorithm hmac(sm3)-cbc(sm4) to the nosva scenario.
The following fields of the session setup need to be set:
calg (WCRYPTO_CIPHER_SM4), cmode (WCRYPTO_CIPHER_CBC),
dalg (WCRYPTO_SM3), and dmode (WCRYPTO_DIGEST_HMAC).

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
Currently, the algorithm name of the aead cbc mode
is designed only for sha256, so it is no longer suitable
when other algorithms are added, such as
hmac(sm3)-cbc(aes).
A common name, authenc(generic,cbc(aes)), is used instead.
The actual algorithm and mode are still specified
by dalg and dmode in the session setup.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
In stream encryption mode, a long file is encrypted
block by block. Each time the accelerator is invoked,
the encryption result of the block is appended to the output,
and the assembled result is the same as the result of
encrypting the entire file at once.
For hisi_sec, the AAD is filled into the first message,
and the plaintext is processed in the middle and end messages.
In an encrypted stream, the first and end messages
are unique and must be delivered to the hardware.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
In gcm stream mode, assoc bytes must not be 0.
Check this to avoid a hardware error.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
The hardware only uses the block mode, so set the aead
message state to the block mode first.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
The hardware supports only 16-byte alignment for the aead
middle messages, so a check for invalid lengths is added.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
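
The 16-byte rule for middle messages can be sketched as a simple mask test. This is only an illustration under stated assumptions: the macro and function names are hypothetical, and the actual check in the driver is not shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; the 16-byte granularity comes from the commit message. */
#define DEMO_AEAD_MIDDLE_ALIGN 16u

/* Returns true when a middle-message length is a multiple of 16 bytes.
 * Note: a zero length also passes this mask test; whether the real driver
 * accepts zero-length middle messages is not stated in the thread. */
bool demo_aead_middle_len_valid(uint32_t in_bytes)
{
    return (in_bytes & (DEMO_AEAD_MIDDLE_ALIGN - 1u)) == 0u;
}
```

The mask form works because 16 is a power of two; for an alignment that is not a power of two, a modulo test would be needed instead.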
Collaborator

gaozhangfei commented Jan 3, 2025

Do you have data for CE alone?
Also, please post the test commands if convenient.

Collaborator Author

Liulongfang commented Jan 3, 2025

The CE-only data is as follows:

SM4 1024B CE Performance (MB/s)

Threads   init1 (CE)
1         2955.9
2         3446.6
4         5774.3
8         8399.2
16        10035.8
32        10638.9

SM3 1024B CE Performance (MB/s)

Threads   init1 (CE)
1         436.2
2         824.9
4         1571.6
8         3107.2
16        5571.2
32        9071.6

@gaozhangfei
Collaborator

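The hardware performance looks low. Have you tested with --thread 8 --ctxnum 8?
Is there any case where 1+1 > 2?
Presumably the scheduling can be switched on or off by choice.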

Qi Tao and others added 2 commits January 9, 2025 14:42
In common digest stream mode, io_bytes and iv_bytes need to
be set to 0 when the final BD is calculated. In the
appending-tag scenario, their values therefore need to be
restored to what they were before being set to 0.

The hardware can then compute the overall hash value of
the appended packet together with the previously calculated
packets, which avoids repeated calculation.

Signed-off-by: Qi Tao <taoqi10@huawei.com>
uadk: support appending tag for digest stream model
Liulongfang force-pushed the master branch 3 times, most recently from 6bb4cdb to d78e979 on January 13, 2025 03:52
@Liulongfang
Collaborator Author

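The 1+1 > 2 case cannot be fully reached; the best we can get is 1+1 ≈ 2. In other words, without increasing CPU usage, mixing software and hardware computation strengthens business performance, so that the overall performance draws as fully as possible on the compute power of all devices: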

SM4 algorithm, 8 KB packet length: performance of hardware, software, and hybrid computation measured separately, plus the achievement rate (hybrid throughput / (hardware throughput + software throughput))
sync mode:
Threads   HW       CE        (HW+CE)   achievement rate
1         1417.1   5299.5    3629.4    54.04%
2         2817     7439.3    6175.3    60.21%
4         5438.4   9680.8    9854.2    65.18%
8         9032.7   11140.2   11701.2   58.00%
16        9143.3   11837.1   12495.5   59.56%
32        9128.6   12115.2   13709.7   64.54%

async mode:
Threads   HW       CE        (HW+CE)   achievement rate
1         9113.1   5372.8    7837.7    54.11%
2         9139.1   7211.1    11365.7   69.51%
4         9132.6   9750.4    13306.6   70.47%
8         9144.1   11145.6   13948.9   68.75%
16        9139.3   11727.8   14644.3   70.18%
32        9124.8   11951     13959.6   66.24%

SM3 algorithm, 8 KB packet length: performance of hardware, software, and hybrid computation measured separately, plus the achievement rate (hybrid throughput / (hardware throughput + software throughput))
sync mode:
Threads   HW       CE        (HW+CE)   achievement rate
1         962.2    508.7     549.9     37.39%
2         1905.4   998.9     1094.2    37.68%
4         3810.9   2000.1    2163.1    37.22%
8         5161.1   3989.5    4305.7    47.05%
16        5161.1   7606.1    8107.1    63.50%
32        5161.1   13482.8   14493.8   77.74%

async mode:
Threads   HW       CE        (HW+CE)   achievement rate
1         5161.2   508.5     1419.7    25.04%
2         5161.2   1005.7    2046.6    33.19%
4         5161.1   2014.1    5683.1    79.20%
8         5161.0   4001.3    8801.7    96.06%
16        5159.2   7529.5    12098.6   95.35%
32        5160.8   12587.8   17534.1   98.79%
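
The achievement rate in these tables follows the definition given in the comment: hybrid throughput divided by the sum of hardware-only and software-only throughput. A minimal sketch in C follows; `achievement_rate` is an illustrative helper and is not part of uadk.

```c
#include <assert.h>

/* Achievement rate in percent: hybrid / (hw + ce).
 * All three inputs are throughputs in the same unit. A value near 100
 * means the hybrid run delivers close to the combined capacity of both
 * engines (the "1+1 ≈ 2" case described in the comment). */
double achievement_rate(double hw, double ce, double hybrid)
{
    assert(hw + ce > 0.0);
    return hybrid / (hw + ce) * 100.0;
}
```

For example, `achievement_rate(1417.1, 5299.5, 3629.4)` gives about 54.04, matching the first SM4 sync row.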

Liulongfang and others added 10 commits January 13, 2025 14:50
uadk: support aead stream mode and sm4-sm3 alg
When a combined algorithm is used, the authsize
must not be 0, so add a check for it.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
According to the HMAC RFC, the auth key may be 0 bytes long,
so remove the incorrect check.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
According to the HMAC RFC, the auth key may be 0 bytes long,
so remove the incorrect check.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
The auth key may be 0 bytes long, so remove the incorrect check.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
The ctx key may be NULL if the user uses the
normal mode, so return an error before
copying data to the key.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
First, move the algorithm check to the right level.
Then change the alignment from 16 bytes to 4 bytes,
according to the hardware specification.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
The alignment of authsize should be 4 bytes, not 16 bytes,
according to the hardware specification.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
Print the help text when dfx/benchmark/test is run
with empty parameters.

Signed-off-by: Junchong Pan <panjunchong@h-partners.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
When soft computing is required, an invalid BD
is used to ensure the integrity of the sending
and receiving process, which is more efficient.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Qi Tao <taoqi10@huawei.com>
Liulongfang and others added 28 commits December 11, 2025 10:38
uadk: Fix static analysis warning
The original timer in uadk_tool requires re-triggering upon expiration,
leading to nested timing and potential inaccuracy. This update improves
the timer mechanism.
Additionally, the random number generator used a fixed seed during
initialization, resulting in insufficient randomness; this has been
updated. Furthermore, for non-aligned random length values, the
generated result could be empty—this issue has also been fixed.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
uadk_tool: update component functionality in uadk_tool
Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
When CI fails, it is very difficult to reproduce.

Add set -x to the build script to make it easier to
find which command fails.

Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
sanity_test: add set -x to print cmd
Ignore the zip test, since the zip tool is not built when OpenSSL 3.0 is used.

Refer to uadk_tool/Makefile.am:
if HAVE_CRYPTO
uadk_tool_SOURCES+=test/comp_main.c
endif

Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Weili Qian <qianweili@huawei.com>
uadk: sanity_test ignore zip if OpenSSL 3.0
Release 2.10 in 2025.12

Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
When looking up the corresponding driver by algorithm type,
since the driver does not save the algorithm type, it cannot
be directly obtained. Therefore, the algorithm type should be
saved during algorithm registration.

Signed-off-by: Weili Qian <qianweili@huawei.com>
uadk adds an API for obtaining the current bandwidth
utilization of a device. When the device driver creates
the "dev_usage" file, users can obtain the current bandwidth
utilization of a specified algorithm on the device by passing
in the names of the device and the algorithm to be queried.

Signed-off-by: Weili Qian <qianweili@huawei.com>
uadk supports obtaining the bandwidth utilization of
specified devices and algorithms through user-space drivers.
After hardware resources are initialized, the bandwidth utilization
can be directly obtained through the hardware mmio space, replacing
the method of reading sysfs files and reducing system calls.

Signed-off-by: Weili Qian <qianweili@huawei.com>
Supports obtaining the device's bandwidth utilization.
For usage details, refer to "uadk_tool dfx --help".

Signed-off-by: Weili Qian <qianweili@huawei.com>
uadk: support querying device bandwidth utilization
Due to changes in chip specifications, the hash agg 8B and 16B operations
have been reduced from 9 columns to 8 columns, so the driver needs to be
adapted accordingly.

Signed-off-by: Zhushuai Yin <yinzhushuai@huawei.com>
uadk: hash agg 8B type of the adaptation chip supports 8 columns
Adjusted the rehash descriptors counta_vld, agg_col_bit_map,
Agg_Oid, Agg_Out_Type, Col_Data_Type, and Col_Data_Info.
These descriptors are consistent with those generated by the
hash aggregation task. In addition, an extra 4 bytes are added
when calculating the row size to ensure that each hash table
contains 4 bytes of empty information.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: Zhushuai Yin <yinzhushuai@huawei.com>
Add an error warning when CRC errors occur.
When the same ctx is reused, clear the context data of the
previous service flow that has ended.
Add an error message that reports related information to the zip module.
The minimum output length of the lz77_zstd_price algorithm
should be 4096 + 16 + 800 + insize.

Signed-off-by: Zongyu Wu <wuzongyu1@huawei.com>
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
When the sgl pool is busy, the hisi_qm_get_hw_sgl function
returns an error, causing the operation to fail.
Now, this function returns the code -WD_EBUSY to inform
the user to wait until the sgl pool is available again.

Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
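
A caller can treat the busy code as a signal to try again rather than as a failure. The sketch below shows one bounded retry loop under stated assumptions: `DEMO_EBUSY`, `demo_op_fn`, `demo_retry_on_busy`, and `demo_busy_then_ok` are all hypothetical names for illustration, and mapping the busy code to `EBUSY` is an assumption rather than something confirmed in this thread.

```c
#include <errno.h>

/* Stand-in for the busy code; the commit reports -WD_EBUSY, and mapping
 * it to EBUSY here is an assumption made for this sketch. */
#define DEMO_EBUSY EBUSY

/* Hypothetical operation type: returns 0 on success, a negative code on error. */
typedef int (*demo_op_fn)(void *arg);

/* Retry `op` while it reports "busy", for at most max_tries attempts.
 * Returns the last result: 0, -DEMO_EBUSY if still busy, or another error. */
int demo_retry_on_busy(demo_op_fn op, void *arg, int max_tries)
{
    int ret = -DEMO_EBUSY;

    for (int i = 0; i < max_tries; i++) {
        ret = op(arg);
        if (ret != -DEMO_EBUSY)
            break; /* success or a non-retryable error */
        /* A real caller would typically poll for completions or yield here. */
    }
    return ret;
}

/* Demo operation: reports busy until the counter pointed to by arg reaches 0. */
int demo_busy_then_ok(void *arg)
{
    int *remaining = arg;

    if (*remaining > 0) {
        (*remaining)--;
        return -DEMO_EBUSY;
    }
    return 0;
}
```

Bounding the number of attempts keeps the caller from spinning forever if the pool never drains.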
1. The original approach calls sched_getcpu() followed by
numa_node_of_cpu(), which requires two system calls and is
inefficient. With getcpu(), only one system call is needed,
and in some cases the information can even be obtained
directly from process data without any system call.
2. Use getcpu() to obtain the node id directly, instead of first
obtaining the cpu id and then the node id, to reduce the number
of system calls.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Signed-off-by: Weili Qian <qianweili@huawei.com>
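
As a rough illustration of the single-call approach, glibc (2.29 or later) exposes getcpu() through <sched.h> when _GNU_SOURCE is defined, and it fills in both the CPU and the NUMA node at once. The wrapper name `current_numa_node` below is hypothetical; how uadk itself wraps the call is not shown in this thread.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Returns the NUMA node of the CPU the calling thread is running on,
 * or -1 if getcpu() fails. The cpu value is filled in as well but is
 * not needed here. */
int current_numa_node(void)
{
    unsigned int cpu, node;

    if (getcpu(&cpu, &node) != 0)
        return -1;
    return (int)node;
}
```

The result is only a snapshot: the thread may migrate to another CPU or node right after the call returns, unless it is pinned.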
Set the fd for soft ctx to avoid requesting reserved memory.

Signed-off-by: Weili Qian <qianweili@huawei.com>
uadk: add the empty size for the hash table row size
Fix the compilation failure of wd_alg.h. The error log looks like:
wd_alg.h:121:9: error: unknown type name ‘__u8’.
Also improve code portability by including linux/types.h
instead of asm/types.h.

Signed-off-by: Weili Qian <qianweili@huawei.com>
uadk: fix the compilation failure of wd_alg.h
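
For context, <linux/types.h> is the user-space header that provides the kernel-style fixed-width types such as __u8 and __u32. A minimal illustration follows; the struct and its fields are hypothetical and do not reflect the real contents of wd_alg.h.

```c
#include <linux/types.h>

/* Hypothetical struct using kernel-style fixed-width types. With the
 * header above included, __u8 and __u32 are known type names. */
struct demo_alg_entry {
    __u8  alg_type;
    __u32 priority;
};
```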
Update the README document of the UADK project to make it more concise and understandable.
To ensure the clarity and completeness of the README, reformat it
as Markdown and refine its content.
Signed-off-by: Liulongfang <liulongfang@huawei.com>

9 participants