
Conversation

@Polaris-911
Contributor

Zicclsm: Main memory supports misaligned loads/stores
According to the RVA20U64 profile specification, the Zicclsm extension is mandatory; GCC recognizes it from version 14.1 onward.
References
GCC Zicclsm
RVA20U64 specification

Performance Test Results: zstd with Different MEM_FORCE_MEMORY_ACCESS Settings

Test Environment

[root@r2044-r1-s1 ~]# dmidecode -t processor | grep "Version"
        Version: SG2044
[root@r2044-r1-s1 ~]# uname -a
Linux r2044-r1-s1 6.12.47-25.09.16.17.riscv64 #1 SMP Tue Sep 16 17:47:24 CST 2025 riscv64 riscv64 riscv64 GNU/Linux
Compressor      Metric               MEM_FORCE_MEMORY_ACCESS=2  MEM_FORCE_MEMORY_ACCESS=1  Improvement
zstd 1.5.7 -1   Compression speed    72.0 MB/s                  57.5 MB/s                  ~25.2%
zstd 1.5.7 -1   Decompression speed  93.7 MB/s                  93.4 MB/s                  ~0.3%
zstd 1.5.7 -22  Compression speed    0.24 MB/s                  0.22 MB/s                  ~9.1%
zstd 1.5.7 -22  Decompression speed  71.0 MB/s                  66.8 MB/s                  ~6.3%
  1. MEM_FORCE_MEMORY_ACCESS=1
[root@r2044-r1-s1 lzbench-master]# ./lzbench -t0,0 -i5,5 -ezstd,1,22 ../silesia.tar
lzbench 2.1 | GCC 12.3.1 | 64-bit Linux |

Compressor name         Compress. Decompress. Compr. size  Ratio Filename
zstd 1.5.7 -1            57.5 MB/s  93.4 MB/s    73216302  34.54 ../silesia.tar
zstd 1.5.7 -22           0.22 MB/s  66.8 MB/s    52222248  24.64 ../silesia.tar
[Params] cIters=5 dIters=5 cTime=0.0 dTime=0.0 chunkSize=0KB cSpeed=0MB
  2. MEM_FORCE_MEMORY_ACCESS=2
[root@r2044-r1-s1 lzbench-master]# ./lzbench -t0,0 -i5,5 -ezstd,1,22 ../silesia.tar
lzbench 2.1 | GCC 12.3.1 | 64-bit Linux |

Compressor name         Compress. Decompress. Compr. size  Ratio Filename
zstd 1.5.7 -1            72.0 MB/s  93.7 MB/s    73216302  34.54 ../silesia.tar
zstd 1.5.7 -22           0.24 MB/s  71.0 MB/s    52222248  24.64 ../silesia.tar
[Params] cIters=5 dIters=5 cTime=0.0 dTime=0.0 chunkSize=0KB cSpeed=0MB

@meta-cla meta-cla bot added the CLA Signed label Nov 3, 2025
@Polaris-911
Contributor Author

Hi @Cyan4973, I know you're busy, but I wanted to check whether you could spare a moment to review this PR. Thanks in advance!

@Cyan4973 Cyan4973 self-assigned this Dec 2, 2025
Contributor

@Cyan4973 Cyan4973 left a comment


We recommend retaining Method 2 solely as a "last resort" to force enable unaligned memory access on a local system.

However, we do not endorse its use "in general".

Method 2 essentially misleads the C abstract machine by asserting that memory addresses are aligned when, in reality, they are not. This constitutes undefined behavior (UB), and as such, we cannot guarantee reliable or predictable results.

Consequently, we are unable to approve this pull request in its current form.

The preferred and correct approach is Method 0, which is fully portable.

If the compiler recognizes that the target CPU supports unaligned memory access, it should optimize memcpy(d, s, 8) into a single read or write instruction. If this optimization does not occur, the issue lies with the compiler's optimization capabilities.

If that's the current situation regarding RISC-V, I would recommend pursuing improvements in compiler optimization to achieve a more robust and future-proof solution.

@Polaris-911
Contributor Author

Polaris-911 commented Dec 20, 2025

Hi @Cyan4973, thank you for the detailed explanation regarding undefined behavior. I completely understand the project's policy of avoiding UB to ensure portability and correctness.

However, I have benchmarked all three methods on the target hardware (RISC-V with zicclsm, GCC 12.3.1), and the performance gap is substantial.

Benchmark Data (Compression Speed)

Method    zstd -1 speed   vs Method 0
Method 2  74.2 MB/s       +74%
Method 1  59.2 MB/s       +39%
Method 0  42.5 MB/s       baseline

My Suggestion

Method 0 is effectively unusable for high performance: as seen above, relying on memcpy results in a ~42% performance drop compared to direct access (Method 2), and a ~28% drop compared to the packed-struct approach (Method 1). The current compiler unfortunately does not optimize memcpy into single instructions on this platform.

Request regarding Method 2: Method 2 yields the peak performance (74.2 MB/s). While I strictly acknowledge the UB concerns, unaligned access is natively supported on this hardware, and the current compiler limitations are the main bottleneck.

Could we consider allowing Method 2 as a temporary optimization strategy strictly under the __riscv_zicclsm guard?

We view this as a stopgap solution to bridge the massive performance gap while RISC-V compiler support matures. We are fully open to deprecating this path and reverting to the standard approach in the future, once GCC/Clang demonstrates the capability to optimize memcpy correctly for this target.

[Data screenshot omitted]

