AVX-512 detection and Argon2 support by NexusXe · Pull Request #330 · tevador/RandomX

NexusXe · 2026-05-24T00:08:38Z

This PR introduces an AVX-512F-optimized implementation of the Argon2 round function used during dataset initialization. By reducing instruction cache and decoder pressure, this implementation yields a small but consistent hashrate improvement in benchmarks.

To prevent performance regressions on early Intel AVX-512 implementations (e.g., Skylake-X) that suffer from severe frequency/power state-transition penalties, this path is additionally gated on VAES presence (VAES appears alongside AVX-512 only on more recent microarchitectures). This ensures the AVX-512 path is only auto-enabled on architectures with fixed power scaling (Ice Lake / Zen 4 and newer), where the wider instructions can be used without transition penalties.
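The gating described above can be sketched as a pure function over raw CPUID/XCR0 values. This is a minimal illustration, not the PR's code: the struct and function names are hypothetical, and the exact set of checks the PR performs is an assumption. The bit positions themselves (OSXSAVE in CPUID.01H:ECX bit 27, AVX512F in CPUID.07H:EBX bit 16, VAES in CPUID.07H:ECX bit 9, ZMM state in XCR0 bits 5-7) are the architecturally documented ones.

```c
#include <stdint.h>

/* Raw register values a caller would obtain from CPUID and XGETBV.
   Hypothetical container, used only for this illustration. */
typedef struct {
    uint32_t leaf1_ecx; /* CPUID.01H:ECX */
    uint32_t leaf7_ebx; /* CPUID.(EAX=07H,ECX=0):EBX */
    uint32_t leaf7_ecx; /* CPUID.(EAX=07H,ECX=0):ECX */
    uint64_t xcr0;      /* XGETBV(0); only meaningful when OSXSAVE is set */
} cpu_regs_t;

#define CPUID1_ECX_OSXSAVE (1u << 27)
#define CPUID7_EBX_AVX512F (1u << 16)
#define CPUID7_ECX_VAES    (1u << 9)
/* XCR0 bits: 1 = SSE, 2 = AVX, 5 = opmask, 6 = ZMM_Hi256, 7 = Hi16_ZMM */
#define XCR0_AVX512_STATE  0xE6u

/* AVX-512F is usable only if the CPU reports it AND the OS saves ZMM state. */
static int has_avx512f(const cpu_regs_t *r) {
    if (!(r->leaf1_ecx & CPUID1_ECX_OSXSAVE)) return 0;
    if ((r->xcr0 & XCR0_AVX512_STATE) != XCR0_AVX512_STATE) return 0;
    return (r->leaf7_ebx & CPUID7_EBX_AVX512F) != 0;
}

/* Auto-enable only when VAES is also present, as a proxy for a
   "good" AVX-512 implementation (Ice Lake / Zen 4 and newer). */
static int auto_enable_avx512(const cpu_regs_t *r) {
    return has_avx512f(r) && (r->leaf7_ecx & CPUID7_ECX_VAES) != 0;
}
```

With this split, a Skylake-X-like part (AVX-512F but no VAES) still reports `has_avx512f` as true, so the path could be selected explicitly, while `auto_enable_avx512` stays false.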

Support was also added to tests and benchmarks.

tevador · 2026-05-24T08:54:56Z

Can you post some benchmark results to compare AVX2 vs AVX512 cache init performance?

Also the build is failing on most platforms.

SChernykh · 2026-05-24T09:17:56Z

I don't expect more than a quarter of a second saved compared to AVX-256. Argon2 is pretty fast on Zen 4/Ice Lake.

GCC/Clang more strictly ensure that the `_xgetbv` intrinsic is only used when the `XSAVE` target feature is enabled. This project is (intentionally) built without strict target features, so an assembly shim that executes the `xgetbv` instruction directly is used instead. Since this only runs when `OSXSAVE` is enabled (and thus the `XSAVE` feature *must* be available on the host), this is safe.
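A shim of the kind described might look like the sketch below. The names `xgetbv_shim` and `os_enables_zmm_state` are hypothetical and the PR's actual shim may differ; the inline assembly uses the standard GCC/Clang extended-asm constraints for `xgetbv` (index in ECX, result in EDX:EAX).

```c
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
/* Executes XGETBV through inline assembly, so the translation unit does
   not need to be compiled with -mxsave. The caller must have confirmed
   OSXSAVE (CPUID.01H:ECX bit 27) first; otherwise the instruction faults. */
static inline uint64_t xgetbv_shim(uint32_t index) {
    uint32_t eax, edx;
    __asm__ volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(index));
    return ((uint64_t)edx << 32) | eax;
}
#endif

/* XCR0 must have SSE (bit 1), AVX (bit 2), opmask (bit 5), ZMM_Hi256
   (bit 6) and Hi16_ZMM (bit 7) set before AVX-512 registers may be used. */
static int os_enables_zmm_state(uint64_t xcr0) {
    return (xcr0 & 0xE6u) == 0xE6u;
}
```

A caller would typically evaluate `os_enables_zmm_state(xgetbv_shim(0))` only after the OSXSAVE check has passed.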

NexusXe · 2026-05-25T19:59:02Z

> Can you post some benchmark results to compare AVX2 vs AVX512 cache init performance?

Benchmarks were run on an AMD Ryzen AI 9 HX 370, which is a mobile Zen 5 chip.

The AVX-512 implementation is consistently the fastest across all thread counts. Under heavy multithreading, it is 1%-2% faster than AVX2. It also scales better than every other implementation as the init thread count increases, most noticeably compared to SSSE3.

It is also by far the most stable implementation; in all tests except for one outlier (where the difference is minimal) the run-to-run performance variation of the AVX-512 implementation is less than that of all others.

I will run more benchmarks on a desktop Zen 5 chip with a native 512-bit datapath later today, which will be more representative of what speedups should be expected on modern Intel chips and desktop/server AMD chips.

NexusXe · 2026-05-25T22:55:48Z

Results on a 9950X3D:

This implementation shows significant improvements in parallel, with 32 threads being consistently >20% faster than reference and 4% or so faster than AVX2.

And again, on average, AVX-512 is the most consistent performer.

SChernykh · 2026-05-26T03:21:24Z

@NexusXe "20%" faster doesn't mean much if it's 0.16 seconds vs 0.2 seconds to initialize the RandomX cache. Give me the number in seconds on 9950X3D. Yes, it's faster, but what's the point to optimize it if it's so little time already? This is the reason this optimization doesn't exist even in XMRig yet.

NexusXe · 2026-05-26T04:31:04Z

> Give me the number in seconds on 9950X3D.

Here's the test data:
randomx_bench.csv

Indeed, that referenced best-case 20% is from 0.963s to 0.784s.
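The two usual readings of that figure can be worked out from the quoted timings alone (no other data is assumed):

```math
\frac{0.963 - 0.784}{0.963} \approx 0.186, \qquad \frac{0.963}{0.784} \approx 1.228
```

That is about 0.179 s saved per initialization: roughly 18.6% less wall-clock time, or equivalently about 22.8% higher initialization throughput.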

> "20%" faster doesn't mean much if it's 0.16 seconds vs 0.2 seconds to initialize the RandomX cache.

It matters because reinitializations are not a rare, one-off event. Especially for miners on profit-switching pools, dataset rebuilds happen regularly. Any absolute time saved here directly reduces dead time, allowing the host more time to work.

> Yes, it's faster, but what's the point to optimize it if it's so little time already?

For the exact same reason that there is an AVX2 implementation. From this standpoint, the SSSE3 implementation could be seen as being "fast enough", yet AVX2 was still implemented. This kind of unilateral improvement, however negligible, is still an improvement.

> This is the reason this optimization doesn't exist even in XMRig yet.

From my experience, many projects historically omitted AVX-512 acceleration because of Intel's catastrophic implementations on older parts. This PR explicitly mitigates this by only auto-enabling on parts with better implementations. On these newer parts, using AVX-512 is free.

SChernykh · 2026-05-26T04:36:33Z

> Indeed, that referenced best-case 20% is from 0.963s to 0.784s.

In XMRig it's about 0.5-0.6s on 9950X because it has an optimized dataset generation using AVX2 - fast enough, which is why AVX-512 is not there so far. Dataset generation takes the most time, not the cache generation. If anything, AVX2 and AVX-512 code for generating the dataset would have much more impact. But still, it's less than a second to switch already, and this optimization adds a lot of code to save so little time.

NexusXe · 2026-05-26T05:04:16Z

> this optimization adds a lot of code to save so little time.

The architectural footprint of this PR is actually very lightweight. The avx512-specific files introduced are nearly 1:1 identical to the existing AVX2 implementations, simply using wider registers and consequently fewer intrinsics. For any maintainer familiar with the AVX2 code for blamka/Argon2, the AVX-512 paths introduce zero new complexity. It's structurally identical.
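To illustrate the structural similarity claimed above, here is a sketch of the BlaMka multiply-add step (x + y + 2 * lo32(x) * lo32(y)) over eight 64-bit lanes, written once with AVX2 and once with AVX-512F intrinsics. It is not the PR's code: the function names are hypothetical, and the dispatcher uses the GCC/Clang builtin `__builtin_cpu_supports` for brevity rather than the VAES-based gating the PR describes.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference for eight lanes. */
static void fblamka8_scalar(const uint64_t *x, const uint64_t *y, uint64_t *out) {
    for (size_t i = 0; i < 8; ++i) {
        uint64_t lo = (x[i] & 0xFFFFFFFFu) * (y[i] & 0xFFFFFFFFu);
        out[i] = x[i] + y[i] + 2 * lo;
    }
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_PATHS 1

/* AVX2: two 256-bit iterations cover the eight lanes. */
__attribute__((target("avx2")))
static void fblamka8_avx2(const uint64_t *x, const uint64_t *y, uint64_t *out) {
    for (size_t i = 0; i < 8; i += 4) {
        __m256i a = _mm256_loadu_si256((const __m256i *)(x + i));
        __m256i b = _mm256_loadu_si256((const __m256i *)(y + i));
        __m256i ml = _mm256_mul_epu32(a, b); /* lo32(a) * lo32(b) */
        ml = _mm256_add_epi64(ml, ml);       /* times two */
        _mm256_storeu_si256((__m256i *)(out + i),
                            _mm256_add_epi64(a, _mm256_add_epi64(b, ml)));
    }
}

/* AVX-512F: same shape, one 512-bit pass, half as many intrinsics executed. */
__attribute__((target("avx512f")))
static void fblamka8_avx512(const uint64_t *x, const uint64_t *y, uint64_t *out) {
    __m512i a = _mm512_loadu_si512(x);
    __m512i b = _mm512_loadu_si512(y);
    __m512i ml = _mm512_mul_epu32(a, b);
    ml = _mm512_add_epi64(ml, ml);
    _mm512_storeu_si512(out, _mm512_add_epi64(a, _mm512_add_epi64(b, ml)));
}
#endif

/* Picks the widest path the host supports; falls back to scalar elsewhere. */
static void fblamka8(const uint64_t *x, const uint64_t *y, uint64_t *out) {
#ifdef HAVE_X86_PATHS
    if (__builtin_cpu_supports("avx512f")) { fblamka8_avx512(x, y, out); return; }
    if (__builtin_cpu_supports("avx2"))    { fblamka8_avx2(x, y, out);   return; }
#endif
    fblamka8_scalar(x, y, out);
}
```

The two vector bodies differ only in register type and intrinsic prefix, which is the sense in which the wider path adds little new logic to review.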

> In XMRig it's about 0.5-0.6s on 9950X because it has an optimized dataset generation using AVX2

While XMRig's optimizations to dataset generation are excellent, from what I can tell that AVX2 dataset generation code does not currently exist in this repository.

If there is interest, I can look into implementing those optimizations as well. However, that is not what this PR is for.

SChernykh · 2026-05-26T05:26:58Z

> Especially for miners on profit-switching pools, dataset rebuilds happen regularly.

This repository is not used in miners though; profit-switching is almost exclusively done on the XMRig-MO fork. While I don't have anything against the code in this PR itself, it has a better place in the XMRig repository (XMRig-MO will pull from it anyway).

NexusXe added 7 commits May 23, 2026 18:34

CPU feature detection

3477b1b

Adds AVX-512F feature detection and uses VAES presence alongside to detect "good" AVX-512 support, present on Ice Lake/Zen 4 and later. This is to prevent "bad" implementations (specifically early Intel implementations) from automatically being used.

AVX-512F blamka round implementation

4d68a07

Based on src/blake2/blamka-round-avx2.h

AVX-512F Argon2 implementation

1166344

Based on src/argon2_avx2.c

Use AVX-512F Argon2

33fe836

Add AVX-512F to benchmarks & tests

1bf1e08

Add AVX-512 Argon2 files to MSVC and Clang config files

51f8369

Remove old comment

e940c9b

I was unsure if extensions past AVX-512F would be needed, but it turned out that since the primary data element for this code is a 64-bit integer, only AVX-512F is needed.
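The observation above can be checked against the scalar form of the round function. The sketch below follows the BlaMka G function as specified for Argon2 (rotations by 32, 24, 16 and 63); the function names are illustrative and not taken from this PR. Every operation it needs is a 64-bit-lane operation that AVX-512F provides, as noted in the comments.

```c
#include <stdint.h>

/* 64-bit rotate right, 0 < c < 64 (AVX-512F: _mm512_ror_epi64). */
static uint64_t rotr64(uint64_t w, unsigned c) {
    return (w >> c) | (w << (64 - c));
}

/* x + y + 2 * lo32(x) * lo32(y)
   (AVX-512F: _mm512_mul_epu32 for the 32x32->64 product,
    _mm512_add_epi64 for the additions). */
static uint64_t fblamka(uint64_t x, uint64_t y) {
    const uint64_t m = 0xFFFFFFFFu;
    return x + y + 2 * ((x & m) * (y & m));
}

/* One BlaMka G application on four 64-bit words
   (the XORs map to _mm512_xor_si512). */
static void blamka_g(uint64_t *a, uint64_t *b, uint64_t *c, uint64_t *d) {
    *a = fblamka(*a, *b); *d = rotr64(*d ^ *a, 32);
    *c = fblamka(*c, *d); *b = rotr64(*b ^ *c, 24);
    *a = fblamka(*a, *b); *d = rotr64(*d ^ *a, 16);
    *c = fblamka(*c, *d); *b = rotr64(*b ^ *c, 63);
}
```

Since there are no 8-bit or 16-bit lane operations and no byte shuffles in this scalar form, nothing here calls for AVX-512BW or AVX-512VL; whether the vectorized lane permutations stay within AVX-512F as well depends on the implementation.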

NexusXe added 3 commits May 25, 2026 11:02

ensure _XCR_XFEATURE_ENABLED_MASK gets defined

679db5d

specify macro section as for both GCC and Clang

9f583ba

NexusXe wants to merge 10 commits into tevador:master from NexusXe:master
