AVX-512 detection and Argon2 support #330
Adds AVX-512F feature detection and additionally checks for VAES to identify "good" AVX-512 support, present on Ice Lake/Zen 4 and later. This prevents "bad" implementations (specifically early Intel implementations) from being used automatically.
Based on src/blake2/blamka-round-avx2.h
Based on src/argon2_avx2.c
I was unsure if extensions past AVX-512F would be needed, but it turned out that since the primary data element for this code is a 64-bit integer, only AVX-512F is needed.
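To illustrate why AVX-512F alone suffices, here is a minimal sketch of the BlaMka mixing step, x + y + 2 * lo32(x) * lo32(y), applied to eight 64-bit lanes at once. It is not the code from this PR: `fblamka8`, `fblamka8_avx512`, and `HAVE_AVX512_PATH` are hypothetical names, the runtime check uses the GCC/Clang builtin `__builtin_cpu_supports` rather than this project's own detection, and the scalar fallback is there only so the sketch runs on any host. Every intrinsic used belongs to AVX-512F.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: x + y + 2 * lo32(x) * lo32(y), wrapping modulo 2^64. */
static uint64_t fblamka_scalar(uint64_t x, uint64_t y) {
    const uint64_t m = 0xFFFFFFFFu;
    return x + y + 2 * ((x & m) * (y & m));
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX512_PATH 1 /* hypothetical macro, for this sketch only */

/* Eight lanes per call. The target attribute lets this one function use
 * AVX-512F intrinsics even when the file is built without -mavx512f. */
__attribute__((target("avx512f")))
static void fblamka8_avx512(const uint64_t *x, const uint64_t *y, uint64_t *out) {
    __m512i a = _mm512_loadu_si512((const void *)x);
    __m512i b = _mm512_loadu_si512((const void *)y);
    __m512i z = _mm512_mul_epu32(a, b); /* lo32(a) * lo32(b) in each 64-bit lane */
    __m512i r = _mm512_add_epi64(_mm512_add_epi64(a, b), _mm512_add_epi64(z, z));
    _mm512_storeu_si512((void *)out, r);
}
#endif

/* Hypothetical dispatcher: take the wide path only if the CPU reports AVX-512F. */
static void fblamka8(const uint64_t *x, const uint64_t *y, uint64_t *out) {
#ifdef HAVE_AVX512_PATH
    if (__builtin_cpu_supports("avx512f")) {
        fblamka8_avx512(x, y, out);
        return;
    }
#endif
    for (size_t i = 0; i < 8; i++) {
        out[i] = fblamka_scalar(x[i], y[i]);
    }
}
```

The AVX2 form of the same step uses `_mm256_mul_epu32` and `_mm256_add_epi64` on four lanes, so the wider version is the same sequence with half as many iterations.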
|
Can you post some benchmark results to compare AVX2 vs AVX512 cache init performance? Also the build is failing on most platforms. |
|
I don't expect more than a quarter of a second saved compared to AVX2. Argon2 is pretty fast on Zen 4/Ice Lake. |
GCC/Clang more strictly ensure that the `_xgetbv` macro is only used when the `XSAVE` target feature is enabled. This project is (intentionally) built without strict target features, so instead use an assembly shim that executes the `xgetbv` instruction directly. Since this only runs when `OSXSAVE` is enabled (and thus the `XSAVE` feature *must* be enabled on the host), this is safe.
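A shim of that kind could look roughly like the sketch below. This is an illustration for GCC/Clang on x86, not the code in this PR: `xgetbv_shim` and `xcr0_or_zero` are hypothetical names, and the non-x86 fallback exists only so the snippet builds everywhere. The `OSXSAVE` check (CPUID leaf 1, ECX bit 27) comes first because `xgetbv` faults if the OS has not enabled it.

```c
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>

/* Execute xgetbv via inline assembly, so no XSAVE target feature is needed
 * at compile time. The result comes back in EDX:EAX. */
static uint64_t xgetbv_shim(uint32_t index) {
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(index));
    return ((uint64_t)hi << 32) | lo;
}

/* Return XCR0, or 0 if the OS has not enabled XSAVE (OSXSAVE clear). */
static uint64_t xcr0_or_zero(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return 0;
    }
    if (!(ecx & (1u << 27))) { /* CPUID.1:ECX[27] = OSXSAVE */
        return 0;
    }
    return xgetbv_shim(0);
}
#else
/* Fallback for other targets: report no extended state. */
static uint64_t xcr0_or_zero(void) { return 0; }
#endif
```

Bit 0 of XCR0 (x87 state) is always set when the register is readable, which makes a returned 0 an unambiguous "not available" value.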
|
@NexusXe "20%" faster doesn't mean much if it's 0.16 seconds vs 0.2 seconds to initialize the RandomX cache. Give me the number in seconds on 9950X3D. Yes, it's faster, but what's the point to optimize it if it's so little time already? This is the reason this optimization doesn't exist even in XMRig yet. |
Here's the test data: Indeed, that referenced best-case 20% is from 0.963s to 0.784s.
It matters because reinitializations are not a rare, one-off event. Especially for miners on profit-switching pools, dataset rebuilds happen regularly. Any absolute time saved here directly reduces dead time, allowing the host more time to work.
This is the exact same reason an AVX2 implementation exists. From this standpoint, the SSSE3 implementation could be seen as being "fast enough", yet AVX2 was still implemented. This kind of unilateral improvement, however negligible, is still an improvement.
From my experience, many projects historically omitted AVX-512 acceleration because of Intel's catastrophic implementations on older parts. This PR explicitly mitigates this by only auto-enabling on parts with better implementations. On these newer parts, using AVX-512 is free. |
In XMRig it's about 0.5-0.6s on 9950X because it has an optimized dataset generation using AVX2 - fast enough, which is why AVX-512 is not there so far. Dataset generation takes the most time, not the cache generation. If anything, AVX2 and AVX-512 code for generating dataset would give much more impact. But still, it's less than a second to switch already, and this optimization adds a lot of code to save so little time. |
The architectural footprint of this PR is actually very lightweight. The avx512-specific files introduced are nearly 1:1 identical to the existing AVX2 implementations, simply using wider registers and consequently fewer intrinsics. For any maintainer familiar with the AVX2 code for blamka/Argon2, the AVX-512 paths introduce zero new complexity. It's structurally identical.
While XMRig's optimizations to dataset generation are excellent, from what I can tell that AVX2 dataset generation code does not currently exist in this repository. If there is interest, I can look into implementing those optimizations as well. However, that is not what this PR is for. |
This repository is not used in miners though; profit-switching is almost exclusively done on the XMRig-MO fork. While I don't have anything against the code in this PR itself, it has a better place in the XMRig repository (XMRig-MO will pull from it anyway). |


This PR introduces an AVX-512F optimized implementation of the Argon2 round function used during dataset initialization. By reducing instruction cache and decoder pressure, this implementation yields a consistent minor hashrate improvement in benchmarks.
To prevent performance regressions on early Intel AVX-512 implementations (e.g., Skylake-X) that suffer from severe frequency/power state-transition penalties, this path is additionally gated on VAES presence (which is only present alongside AVX-512 on more recent microarchitectures). This ensures the AVX-512 path is only auto-enabled on architectures with fixed power scaling (Ice Lake / Zen 4 and newer), where the wider instructions can be utilized without transition penalties.
Support was also added to tests and benchmarks.
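The gating described above can be sketched as a pure predicate over raw CPUID and XCR0 values. This is an assumption-laden illustration, not the detection code in this PR: `good_avx512` and the constant names are hypothetical. The bit positions are the architectural ones (OSXSAVE in CPUID.1:ECX[27], AVX-512F in CPUID.7.0:EBX[16], VAES in CPUID.7.0:ECX[9]), and the XCR0 mask covers SSE, AVX, opmask, and both ZMM state components.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the CPUID and XCR0 definitions. */
#define LEAF1_ECX_OSXSAVE (1u << 27)
#define LEAF7_EBX_AVX512F (1u << 16)
#define LEAF7_ECX_VAES    (1u << 9)
/* XCR0 bits 1 (SSE), 2 (AVX), 5 (opmask), 6 (ZMM_Hi256), 7 (Hi16_ZMM). */
#define XCR0_AVX512_STATE UINT64_C(0xE6)

/* True only when the OS saves ZMM state and the CPU reports both AVX-512F
 * and VAES. Pass xcr0 = 0 when OSXSAVE is clear and XCR0 cannot be read. */
static bool good_avx512(uint32_t leaf1_ecx, uint32_t leaf7_ebx,
                        uint32_t leaf7_ecx, uint64_t xcr0) {
    if (!(leaf1_ecx & LEAF1_ECX_OSXSAVE)) {
        return false;
    }
    if ((xcr0 & XCR0_AVX512_STATE) != XCR0_AVX512_STATE) {
        return false;
    }
    return (leaf7_ebx & LEAF7_EBX_AVX512F) != 0 &&
           (leaf7_ecx & LEAF7_ECX_VAES) != 0;
}
```

Keeping the predicate free of actual `cpuid` calls makes it testable with synthetic register values, for example a Skylake-X-like input that has AVX-512F set but VAES clear.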