AVX-512 detection and Argon2 support#330

Open
NexusXe wants to merge 10 commits into
tevador:masterfrom
NexusXe:master

Conversation

@NexusXe

@NexusXe NexusXe commented May 24, 2026

This PR introduces an AVX-512F optimized implementation of the Argon2 round function used during dataset initialization. By reducing instruction cache and decoder pressure, this implementation yields a consistent minor hashrate improvement in benchmarks.

To prevent performance regressions on early Intel AVX-512 implementations (e.g., Skylake-X) that suffer from severe frequency/power state-transition penalties, this path is additionally gated on VAES presence (which only appears alongside AVX-512 on more recent microarchitectures). This ensures the AVX-512 path is only auto-enabled on architectures with fixed power scaling (Ice Lake / Zen 4 and newer), where the wider instructions can be used without transition penalties.

Support was also added to tests and benchmarks.
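
The gating rule described above can be sketched as a pure function over raw register values. This is a minimal illustration, not the code in this PR: the `sketch_` names are hypothetical, and the caller is assumed to have already read CPUID leaf 7 (sub-leaf 0) and `XCR0`. The bit positions are the architecturally documented ones (AVX512F is `EBX` bit 16, VAES is `ECX` bit 9).

```c
#include <stdint.h>

/* Hypothetical names; bit positions follow the CPUID leaf 7, sub-leaf 0 layout. */
#define SKETCH_CPUID7_EBX_AVX512F (1u << 16)
#define SKETCH_CPUID7_ECX_VAES    (1u << 9)
/* XCR0 bits the OS must enable for ZMM use: SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM. */
#define SKETCH_XCR0_AVX512_STATE  0xE6u

/* Returns 1 only when AVX-512F and VAES are both reported and the OS saves
 * the full ZMM state. The inputs are assumed to come from CPUID and XGETBV. */
static int sketch_use_avx512(uint32_t leaf7_ebx, uint32_t leaf7_ecx, uint64_t xcr0) {
    int has_avx512f = (leaf7_ebx & SKETCH_CPUID7_EBX_AVX512F) != 0;
    int has_vaes = (leaf7_ecx & SKETCH_CPUID7_ECX_VAES) != 0;
    int os_saves_zmm = (xcr0 & SKETCH_XCR0_AVX512_STATE) == SKETCH_XCR0_AVX512_STATE;
    return has_avx512f && has_vaes && os_saves_zmm;
}
```

A Skylake-X-like input (AVX512F set, VAES clear) returns 0 under this rule, so such a part would fall back to the AVX2 path. VAES alone is not sufficient either, since some CPUs report it without AVX-512.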

NexusXe added 7 commits May 23, 2026 18:34
Adds AVX-512F feature detection and uses VAES presence alongside to
detect "good" AVX-512 support, present on Ice Lake/Zen 4 and later.

This is to prevent "bad" implementations (specifically early Intel
implementations) from automatically being used.
Based on src/blake2/blamka-round-avx2.h
Based on src/argon2_avx2.c
I was unsure if extensions past AVX-512F would be needed, but it turned
out that since the primary data element for this code is a 64-bit
integer, only AVX-512F is needed.
@tevador
Owner

tevador commented May 24, 2026

Can you post some benchmark results to compare AVX2 vs AVX512 cache init performance?

Also the build is failing on most platforms.

@SChernykh
Collaborator

I don't expect more than a quarter of a second saved compared to AVX2. Argon2 is pretty fast on Zen 4/Ice Lake.

NexusXe added 3 commits May 25, 2026 11:02
GCC/Clang strictly require that the `_xgetbv` macro is only used
when the `XSAVE` target feature is enabled. This project is
(intentionally) built without strict target features, so instead use an
assembly shim that executes the `xgetbv` instruction directly. Since this
is only run when `OSXSAVE` is enabled (and thus the `XSAVE` feature *must*
be enabled on the host), this is safe.
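
A shim of the kind this commit message describes might look like the sketch below. It is an assumption-laden illustration rather than the PR's actual code: the `sketch_` names are hypothetical, GCC/Clang extended inline assembly is assumed, and non-x86 builds fall back to stubs that return 0. The OSXSAVE check uses `__get_cpuid` from `<cpuid.h>` and tests `ECX` bit 27 of CPUID leaf 1.

```c
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define SKETCH_X86 1
#else
#define SKETCH_X86 0
#endif

/* Returns 1 if CPUID leaf 1 reports OSXSAVE (ECX bit 27), else 0. */
static int sketch_os_has_xsave(void) {
#if SKETCH_X86
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ecx >> 27) & 1u);
#else
    return 0;
#endif
}

/* Reads an extended control register by executing xgetbv directly, so no
 * XSAVE target feature is needed at compile time. Only call this after
 * sketch_os_has_xsave() returned 1; otherwise the instruction faults. */
static uint64_t sketch_xgetbv(uint32_t index) {
#if SKETCH_X86
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(index));
    return ((uint64_t)hi << 32) | lo;
#else
    (void)index;
    return 0;
#endif
}
```

Calling `sketch_xgetbv(0)` returns `XCR0`, whose bits indicate which register state the OS saves and restores.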
@NexusXe
Author

NexusXe commented May 25, 2026

> Can you post some benchmark results to compare AVX2 vs AVX512 cache init performance?

Benchmarks were run on an AMD AI 9 HX 370, which is a mobile Zen 5 chip.
benchmark_plot

The AVX-512 implementation is consistently the fastest across all thread counts. Under heavy multithreading, it is 1%-2% faster than AVX2. It also scales better than every other implementation, SSSE3 in particular, as the init thread count increases.

It is also by far the most stable implementation; in all tests except for one outlier (where the difference is minimal) the run-to-run performance variation of the AVX-512 implementation is less than that of all others.

I will run more benchmarks on a desktop Zen 5 chip with a native 512-bit datapath later today, which will be more representative of what speedups should be expected on modern Intel chips and desktop/server AMD chips.

@NexusXe
Author

NexusXe commented May 25, 2026

Results on a 9950X3D:
benchmark_plot

This implementation shows significant improvements under parallel initialization: with 32 threads it is consistently more than 20% faster than the reference implementation and roughly 4% faster than AVX2.

And again, on average, AVX-512 is the most consistent performer.

@SChernykh
Collaborator

@NexusXe "20%" faster doesn't mean much if it's 0.16 seconds vs 0.2 seconds to initialize the RandomX cache. Give me the number in seconds on 9950X3D. Yes, it's faster, but what's the point of optimizing it if it takes so little time already? This is the reason this optimization doesn't exist even in XMRig yet.

@NexusXe
Author

NexusXe commented May 26, 2026

> Give me the number in seconds on 9950X3D.

Here's the test data:
randomx_bench.csv

Indeed, that referenced best-case 20% is from 0.963s to 0.784s.
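
For reference, the arithmetic behind that figure, using only the two timings quoted above (the percentages are derived here, not taken from the CSV):

$$
\Delta t = 0.963\,\mathrm{s} - 0.784\,\mathrm{s} = 0.179\,\mathrm{s},
\qquad
\frac{0.179}{0.963} \approx 18.6\%\ \text{less time},
\qquad
\frac{0.963}{0.784} \approx 1.228\ \text{(about 22.8\% higher throughput)}
$$

So ">20% faster" holds when measured as a throughput ratio, while the wall-clock reduction is a little under 19%.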

> "20%" faster doesn't mean much if it's 0.16 seconds vs 0.2 seconds to initialize the RandomX cache.

It matters because reinitializations are not a rare, one-off event. Especially for miners on profit-switching pools, dataset rebuilds happen regularly. Any absolute time saved here directly reduces dead time, allowing the host more time to work.

> Yes, it's faster, but what's the point to optimize it if it's so little time already?

For the exact same reason that there is an AVX2 implementation. From this standpoint, the SSSE3 implementation could be seen as being "fast enough", yet AVX2 was still implemented. This kind of across-the-board improvement, however small, is still an improvement.

> This is the reason this optimization doesn't exist even in XMRig yet.

In my experience, many projects historically omitted AVX-512 acceleration because of Intel's catastrophic implementations on older parts. This PR explicitly mitigates this by only auto-enabling on parts with better implementations. On these newer parts, using AVX-512 is free.

@SChernykh
Collaborator

> Indeed, that referenced best-case 20% is from 0.963s to 0.784s.

In XMRig it's about 0.5-0.6s on 9950X because it has an optimized dataset generation using AVX2 - fast enough, which is why AVX-512 is not there so far. Dataset generation takes the most time, not the cache generation. If anything, AVX2 and AVX-512 code for generating dataset would give much more impact. But still, it's less than a second to switch already, and this optimization adds a lot of code to save so little time.

@NexusXe
Author

NexusXe commented May 26, 2026

> this optimization adds a lot of code to save so little time.

The architectural footprint of this PR is actually very lightweight. The avx512-specific files introduced are nearly 1:1 identical to the existing AVX2 implementations, simply using wider registers and consequently fewer intrinsics. For any maintainer familiar with the AVX2 code for blamka/Argon2, the AVX-512 paths introduce zero new complexity. It's structurally identical.
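
To make the "same structure, wider registers" point concrete, here is a scalar sketch of the BlaMka quarter-round as it appears in the Argon2 reference code. It is illustrative only and is not the code from this PR; the `sketch_` names are hypothetical. The comments note which AVX-512F intrinsics would plausibly apply each step to eight 64-bit lanes at once, where the AVX2 path handles four.

```c
#include <stdint.h>

/* Rotate a 64-bit word right by n bits (0 < n < 64).
 * A vector version could use _mm512_ror_epi64 (AVX-512F). */
static uint64_t sketch_rotr64(uint64_t w, unsigned n) {
    return (w >> n) | (w << (64 - n));
}

/* BlaMka primitive: x + y + 2 * lo32(x) * lo32(y), modulo 2^64.
 * A vector version could use _mm512_mul_epu32 and _mm512_add_epi64. */
static uint64_t sketch_fblamka(uint64_t x, uint64_t y) {
    uint64_t lo_product = (uint64_t)(uint32_t)x * (uint32_t)y;
    return x + y + 2 * lo_product;
}

/* One G mixing step on four 64-bit words. Every operation is per-lane on
 * 64-bit integers, which is why plain AVX-512F is enough for a wide version. */
static void sketch_blamka_g(uint64_t *a, uint64_t *b, uint64_t *c, uint64_t *d) {
    *a = sketch_fblamka(*a, *b); *d = sketch_rotr64(*d ^ *a, 32);
    *c = sketch_fblamka(*c, *d); *b = sketch_rotr64(*b ^ *c, 24);
    *a = sketch_fblamka(*a, *b); *d = sketch_rotr64(*d ^ *a, 16);
    *c = sketch_fblamka(*c, *d); *b = sketch_rotr64(*b ^ *c, 63);
}
```

Because the step sequence does not depend on register width, a 512-bit version needs the same operations in the same order and half as many vector instructions per block as the 256-bit one.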

> In XMRig it's about 0.5-0.6s on 9950X because it has an optimized dataset generation using AVX2

While XMRig's optimizations to dataset generation are excellent, from what I can tell that AVX2 dataset generation code does not currently exist in this repository.

If there is interest, I can look into implementing those optimizations as well. However, that is not what this PR is for.

@SChernykh
Collaborator

> Especially for miners on profit-switching pools, dataset rebuilds happen regularly.

This repository is not used in miners, though; profit-switching is almost exclusively done on the XMRig-MO fork. While I don't have anything against the code in this PR itself, it would be better placed in the XMRig repository (XMRig-MO will pull from it anyway).

3 participants