Skip to content

crc32: Add implementation for loongarch64#511

Draft
Gelbpunkt wants to merge 1 commit into
trifectatechfoundation:mainfrom
Gelbpunkt:loong64
Draft

crc32: Add implementation for loongarch64#511
Gelbpunkt wants to merge 1 commit into
trifectatechfoundation:mainfrom
Gelbpunkt:loong64

Conversation

@Gelbpunkt
Copy link
Copy Markdown

@Gelbpunkt Gelbpunkt commented May 22, 2026

Promised @folkertdev I'd start working on this at RustWeek (hi!), so here's the very low hanging fruit: LoongArch has CRC32 instructions that are extremely similar to those on aarch64, but they're part of the base integer instruction set and therefore unconditionally available: https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#crc-check-instructions.

stdarch has intrinsics for these, see e.g. here, but they're nightly-only, so inline assembly is necessary to make this work on stable.

I think the function signatures in stdarch are terrible, they do match LLVM's, but a parameter where only the lower 8 bits are used should really just be a u8/i8 instead of an i32. I went with that approach for the re-implementation here and will see if I can get this changed in LLVM and stdarch. The choice of signed integers makes this uglier than the aarch64 implementation, but it matches stdarch, so I went with that. This also won't work in miri yet.

I'd love to add CI for loongarch64 here, but the cross compilers are only available in Ubuntu starting with 25.10 and 26.04, so we'll have to wait until GitHub adds a runner image based on that.

cargo nextest run --features gz passes for me on a Loongson 3B6000. Are there existing benchmarks for these implementations or do you always just copy the relevant parts out and run benchmarks separately when adding new implementations?

zlib-ng has managed to somewhat abstract over the aarch64 and loongarch64 intrinsics to share the implementation. Ideally we'd do something similar here, I guess we could wrap the intrinsics to have the same signatures?

@Gelbpunkt
Copy link
Copy Markdown
Author

Testing this requires rust-lang/libz-sys#269

Comment on lines +52 to +63
// CRC32 instructions are part of the basic integer operations and therefore
// always available.
mod asm {
/// CRC32 single round checksum for bytes (8 bits).
///
/// [Loongson's documentation](https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#crc-check-instructions)
pub fn crc_w_b_w(data: i8, mut crc: i32) -> i32 {
unsafe {
core::arch::asm!("crc.w.b.w {crc}, {data}, {crc}", crc = inout(reg) crc, data = in(reg) data);
crc
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here is an idea: rust-lang/miri#4899 added crc32 support for aarch, you could add similar support for loongarch? A requirement for new shims is often that you actually check the results on real hardware, but that is something you can actually do.

With miri support we can actually test this in CI today if instead of asm you use the intrinsics here. We can add some __internal feature for it if needed.

Separately we could actually stabilize the crc intrinsics on loongarch if the target maintainer is OK with that. For signatures we follow the clang headers by default (but it's ultimately up to the target maintainer), hence the i32 instead of more accurate widths.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, adding support for these in Miri sounds good. I'll try to put together a PR this evening.

With miri support we can actually test this in CI today if instead of asm you use the intrinsics here. We can add some __internal feature for it if needed.

So you would be okay with requiring nightly behind a feature flag?

Separately we could actually stabilize the crc intrinsics on loongarch if the target maintainer is OK with that. For signatures we follow the clang headers by default (but it's ultimately up to the target maintainer), hence the i32 instead of more accurate widths.

Funny enough, the clang headers do have different (in my view, better) signatures (see https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/larchintrin.h#L60) and use i8 and i16 where appropriate. GCC also seems to do this better (https://github.com/gcc-mirror/gcc/blob/master/gcc/config/loongarch/larchintrin.h#L152). I'll submit a patch to align the LLVM intrinsics with clang and GCC. Afterwards I think stabilization would be reasonable, though I've also noticed that LLVM completely prohibits using the CRC intrinsics on 32-bit, while Rust currently exposes them on both 32-bit and 64-bit, which should cause compile errors. I'm not sure whether that's a hardware limitation or not and what is actually correct in that case.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you would be okay with requiring nightly behind a feature flag?

We already do for gzprintf, which will just become a default-enabled feature when our msrv reaches whatever version c-variadic will be stable in. We're conservative with bumping the msrv but c-variadic and custom allocators are enticing reasons to bump.

For now, given that I don't think there is a really urgent use case I'd say add it as an __internal_loongarch_crc32 feature, we can give it a proper name once we have a bit more clarity on the testing and stabilization situation.

Funny enough, the clang headers do have different (in my view, better) signatures

cc @heiher on the crc questions

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm actually, now that I'm looking at it (after I'm hitting operand promotion LLVM crashes when changing the LLVM intrinsics), the aarch64 intrinsics are also defined with i32 in LLVM. The stdarch wrappers around them simply perform the cast, similar to how clang does it. I guess this doesn't need changes in LLVM, only in stdarch.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@folkertdev
Copy link
Copy Markdown
Member

Are there existing benchmarks for these implementations

I run https://github.com/trifectatechfoundation/zlib-rs/blob/main/test-libz-rs-sys/examples/crc32_bench.rs with https://github.com/andrewrk/poop for the extra counters (cycles, instructions)

zlib-ng has managed to somewhat abstract over the aarch64 and loongarch64 intrinsics to share the implementation.

At the current level of complexity I think the duplication (perhaps with a comment pointing this out) is fine.

@Gelbpunkt
Copy link
Copy Markdown
Author

Taking a random 10GB file of bytes from /dev/urandom, here's some benchmarks. Had to patch poop since branch misses are not exposed by the CPU it seems and performing the perf_event_open syscall for that just returns EINVAL.

$ poop './target/release/examples/crc32_bench_before sse /tmp/10G.bin'
Benchmark 1 (3 runs): ./target/release/examples/crc32_bench_before sse /tmp/10G.bin
  measurement          mean ± σ            min … max           outliers
  wall_time          7.58s  ± 67.4ms    7.51s  … 7.64s           0 ( 0%)
  peak_rss           10.7GB ± 98.3KB    10.7GB … 10.7GB          0 ( 0%)
  cpu_cycles         10.5G  ± 31.1M     10.4G  … 10.5G           0 ( 0%)
  instructions       49.1G  ±    0      49.1G  … 49.1G           0 ( 0%)
  cache_references   13.4G  ±  239K     13.4G  … 13.4G           0 ( 0%)
  cache_misses        493M  ± 47.3K      493M  …  493M           0 ( 0%)

$ poop './target/release/examples/crc32_bench sse /tmp/10G.bin'
Benchmark 1 (3 runs): ./target/release/examples/crc32_bench sse /tmp/10G.bin
  measurement          mean ± σ            min … max           outliers
  wall_time          3.92s  ± 9.08ms    3.91s  … 3.93s           0 ( 0%)
  peak_rss           10.7GB ±  303KB    10.7GB … 10.7GB          0 ( 0%)
  cpu_cycles         2.74G  ± 13.4M     2.73G  … 2.76G           0 ( 0%)
  instructions       6.71G  ±    0      6.71G  … 6.71G           0 ( 0%)
  cache_references   1.34G  ± 5.40K     1.34G  … 1.34G           0 ( 0%)
  cache_misses       1.34G  ±  936K     1.34G  … 1.34G           0 ( 0%)

@folkertdev
Copy link
Copy Markdown
Member

FYI you can give poop 2 (or more) argument strings and it will do the math and statistical significance and all that. Not really needed here though, it's pretty obvious. Nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants