Eliminate caddq intrinsics #905

Open
willieyz wants to merge 1 commit into main from eliminate-caddq-intrinsics

@willieyz willieyz commented Jan 22, 2026

This PR replaces the AVX2 intrinsics implementation of poly_caddq with an x86_64 assembly version.
The two tables below estimate the performance impact.
For keypair, sign, and verify, the difference in the opt build stays within about 1% (the largest is -1.13%, for ML-DSA-44 verify with the unrolled loop), which is consistent with the no-opt build.

In the component-level benchmark for mld_poly_caddq, the assembly version is 18% to 34% slower than the intrinsics version in the opt build. After unrolling the loop by a factor of 4, the slowdown drops to between 5% and 11%. In the no-opt build, the two versions are within 0.3% of each other.
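
For readers unfamiliar with the routine being replaced, here is a scalar model of what poly_caddq computes, written in Python. It assumes the usual definition from the ML-DSA reference code (add q to every negative coefficient so the result lies in [0, q)); it does not reproduce the intrinsics or the assembly in this PR, and the function names are chosen for illustration only.

```python
Q = 8380417  # ML-DSA modulus q
N = 256      # coefficients per polynomial

def caddq(a: int) -> int:
    """Add Q if a is negative. Assumes a fits in a signed 32-bit integer.

    (a >> 31) is -1 (all ones) for negative a and 0 otherwise, so the
    mask selects either Q or 0 without branching on the value of a.
    """
    return a + ((a >> 31) & Q)

def poly_caddq(coeffs: list[int]) -> list[int]:
    """Apply caddq to each coefficient of a polynomial."""
    return [caddq(a) for a in coeffs]

# Example: negative coefficients are mapped to their representative in [0, Q).
example = [-1, 0, 5, -(Q - 1)] + [0] * (N - 4)
print(poly_caddq(example)[:4])
```

A vectorized version applies the same mask-and-add to several 32-bit lanes at once (eight per 256-bit AVX2 register), which is why the routine is a natural candidate for either intrinsics or hand-written assembly.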

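  • bench components
    • Δ (%) = (asm − AVX2) / AVX2 × 100

| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| mld_poly_caddq (avg) | AVX2 intrinsics | no-opt | 391 | 393 | 391 | |
| mld_poly_caddq (avg) | x86_64 asm | no-opt | 390 | 393 | 392 | |
| mld_poly_caddq (avg) | Δ (%) | no-opt | -0.26 | 0.00 | +0.26 | |
| mld_poly_caddq (avg) | AVX2 intrinsics | opt | 38 | 40 | 39 | |
| mld_poly_caddq (avg) | x86_64 asm | opt | 51 | 50 | 46 | |
| mld_poly_caddq (avg) | x86_64 asm (unroll) | opt | 42 | 42 | 42 | unroll by 4 |
| mld_poly_caddq (avg) | Δ (%) | opt | +34.21 | +25.00 | +17.95 | |
| mld_poly_caddq (avg) | Δ (%) (unroll) | opt | +10.53 | +5.00 | +7.69 | unroll by 4 |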
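
As a quick check of the Δ formula above, the following Python sketch recomputes the opt-build percentages from the figures in the component table. The helper name `delta_percent` is made up for this illustration and is not part of the benchmark tooling.

```python
def delta_percent(asm: float, avx2: float) -> float:
    """Relative change of the asm figure against the AVX2 baseline, in percent."""
    return (asm - avx2) / avx2 * 100

# (AVX2 intrinsics, x86_64 asm, x86_64 asm unrolled by 4), opt build,
# copied from the mld_poly_caddq rows of the table.
OPT_ROWS = {
    "ML-DSA-44": (38, 51, 42),
    "ML-DSA-65": (40, 50, 42),
    "ML-DSA-87": (39, 46, 42),
}

for name, (avx2, asm, asm_unrolled) in OPT_ROWS.items():
    print(f"{name}: {delta_percent(asm, avx2):+.2f}% plain, "
          f"{delta_percent(asm_unrolled, avx2):+.2f}% unrolled")
```

Positive values mean the assembly version is slower than the intrinsics baseline.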
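  • bench
    • Δ (%) = (asm − AVX2) / AVX2 × 100

| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| keypair cycles (avg) | AVX2 | no-opt | 134355 | 226117 | 377069 | baseline (main) |
| keypair cycles (avg) | x86_64 asm | no-opt | 133831 | 226345 | 374963 | |
| keypair cycles (avg) | Δ (%) | no-opt | -0.39 | +0.10 | -0.56 | |
| keypair cycles (avg) | AVX2 | opt | 60367 | 105019 | 166676 | baseline (main) |
| keypair cycles (avg) | x86_64 asm | opt | 60535 | 104479 | 165781 | |
| keypair cycles (avg) | x86_64 asm (unroll) | opt | 59921 | 104367 | 165795 | unroll by 4 |
| keypair cycles (avg) | Δ (%) | opt | +0.28 | -0.51 | -0.54 | |
| keypair cycles (avg) | Δ (%) (unroll) | opt | -0.74 | -0.62 | -0.53 | unroll by 4 |
| sign cycles (avg) | AVX2 | no-opt | 473892 | 779091 | 998026 | baseline (main) |
| sign cycles (avg) | x86_64 asm | no-opt | 473262 | 779359 | 993245 | |
| sign cycles (avg) | Δ (%) | no-opt | -0.13 | +0.03 | -0.48 | |
| sign cycles (avg) | AVX2 | opt | 179804 | 301077 | 364509 | baseline (main) |
| sign cycles (avg) | x86_64 asm | opt | 180253 | 298598 | 363742 | |
| sign cycles (avg) | x86_64 asm (unroll) | opt | 178255 | 299153 | 363505 | unroll by 4 |
| sign cycles (avg) | Δ (%) | opt | +0.25 | -0.82 | -0.21 | |
| sign cycles (avg) | Δ (%) (unroll) | opt | -0.86 | -0.64 | -0.28 | unroll by 4 |
| verify cycles (avg) | AVX2 | no-opt | 140765 | 228322 | 379244 | baseline (main) |
| verify cycles (avg) | x86_64 asm | no-opt | 140872 | 228255 | 377091 | |
| verify cycles (avg) | Δ (%) | no-opt | +0.08 | -0.03 | -0.57 | |
| verify cycles (avg) | AVX2 | opt | 63674 | 105734 | 164897 | baseline (main) |
| verify cycles (avg) | x86_64 asm | opt | 63924 | 105192 | 164131 | |
| verify cycles (avg) | x86_64 asm (unroll) | opt | 62955 | 105111 | 163861 | unroll by 4 |
| verify cycles (avg) | Δ (%) | opt | +0.39 | -0.51 | -0.46 | |
| verify cycles (avg) | Δ (%) (unroll) | opt | -1.13 | -0.59 | -0.63 | unroll by 4 |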
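
One way to see why a double-digit slowdown in mld_poly_caddq barely registers at the scheme level is to compare its absolute cost with a full operation. The sketch below does this for ML-DSA-44 keypair in the opt build. It assumes the component figures are cycle counts like the scheme-level ones, and the number of poly_caddq calls per keypair is a hypothetical parameter, since the PR does not state it.

```python
# Values copied from the tables in this comment (opt build, ML-DSA-44).
CADDQ_AVX2 = 38        # mld_poly_caddq (avg), AVX2 intrinsics
CADDQ_ASM = 51         # mld_poly_caddq (avg), x86_64 asm, not unrolled
KEYPAIR_AVX2 = 60367   # keypair cycles (avg), baseline (main)

def overhead_percent(calls: int) -> float:
    """Extra cost of `calls` poly_caddq invocations, relative to one keypair.

    `calls` is an assumption made for illustration, not a measured value.
    """
    return calls * (CADDQ_ASM - CADDQ_AVX2) / KEYPAIR_AVX2 * 100

for calls in (1, 4, 10):
    print(f"{calls:2d} call(s): {overhead_percent(calls):.3f}% of keypair")
```

Even with ten calls the extra cost stays well under 1% of a keypair, which is in the same range as the run-to-run differences visible in the table.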

@willieyz willieyz force-pushed the eliminate-caddq-intrinsics branch from 00b155f to 3819863 Compare January 23, 2026 06:52

oqs-bot commented Jan 23, 2026

CBMC Results (ML-DSA-87)

Full Results (175 proofs)
Proof Status Current Previous Change
**TOTAL** 2627s 2449s +7.3%
sign_verify_internal 377s 353s +7%
mld_attempt_signature_generation 239s 227s +5%
polyvecl_pointwise_acc_montgomery_c 180s 165s +9%
polyvec_matrix_expand 158s 153s +3%
poly_pointwise_montgomery_c 151s 128s +18%
rej_uniform_native 149s 139s +7%
mld_invntt_layer 126s 114s +11%
polyvec_matrix_expand_serial 112s 110s +2%
mld_ct_memcmp 88s 74s +19%
mld_ntt_layer 48s 44s +9%
sign_signature_internal 48s 46s +4%
keccak_squeezeblocks_x4 42s 42s +0%
mld_compute_t0_t1_tr_from_sk_components 25s 25s +0%
polymat_permute_bitrev_to_custom 25s 24s +4%
rej_uniform 21s 21s +0%
fqmul 20s 18s +11%
poly_uniform_eta_4x 19s 17s +12%
rej_uniform_c 18s 16s +12%
poly_chknorm_c 17s 17s +0%
polyveck_add 17s 13s +31%
polyt0_unpack 16s 17s -6%
polyeta_unpack 15s 13s +15%
polyveck_power2round 15s 14s +7%
poly_uniform_4x 14s 17s -18%
polyvec_matrix_pointwise_montgomery 14s 12s +17%
mld_ntt_butterfly_block 13s 13s +0%
keccakf1600_permute_native 12s 10s +20%
keccakf1600x4_permute_native 12s 12s +0%
polyveck_use_hint 12s 9s +33%
mld_check_pct 11s 7s +57%
polyveck_reduce 11s 13s -15%
keccak_absorb_once_x4 10s 10s +0%
poly_decompose_c 10s 7s +43%
polyvecl_ntt 10s 11s -9%
sign_pk_from_sk 10s 9s +11%
keccakf1600_permute 9s 7s +29%
poly_invntt_tomont_c 9s 9s +0%
polyveck_caddq 9s 7s +29%
polyveck_chknorm 9s 6s +50%
polyveck_invntt_tomont 9s 10s -10%
polyveck_pointwise_poly_montgomery 9s 8s +12%
sign 8s 7s +14%
mld_polyvecl_permute_bitrev_to_custom_native 7s 8s -12%
mld_sample_s1_s2 7s 6s +17%
polyveck_decompose 7s 6s +17%
polyveck_ntt 7s 8s -12%
polyveck_shiftl 7s 6s +17%
polyveck_sub 7s 7s +0%
sign_signature 7s 5s +40%
keccak_absorb 6s 7s -14%
mld_h 6s 2s +200%
pack_sk 6s 3s +100%
poly_challenge 6s 5s +20%
poly_invntt_tomont_native 6s 2s +200%
poly_pointwise_montgomery 6s 4s +50%
poly_uniform_gamma1_4x 6s 6s +0%
poly_use_hint_native 6s 4s +50%
polyt0_pack 6s 5s +20%
polyveck_make_hint 6s 5s +20%
polyveck_pack_w1 6s 4s +50%
polyvecl_uniform_gamma1_serial 6s 7s -14%
rej_eta_native 6s 5s +20%
sign_verify_pre_hash_internal 6s 5s +20%
mld_compute_pack_z 5s 7s -29%
mld_sample_s1_s2_serial 5s 6s -17%
poly_chknorm 5s 2s +150%
poly_shiftl 5s 2s +150%
polyt1_pack 5s 2s +150%
polyveck_unpack_t0 5s 6s -17%
polyvecl_pointwise_acc_montgomery 5s 4s +25%
polyvecl_unpack_eta 5s 3s +67%
polyvecl_unpack_z 5s 3s +67%
polyz_unpack_c 5s 5s +0%
power2round 5s 4s +25%
rej_eta_c 5s 4s +25%
shake256 5s 4s +25%
sign_keypair_internal 5s 6s -17%
sign_signature_pre_hash_shake256 5s 4s +25%
sign_verify 5s 2s +150%
unpack_hints 5s 4s +25%
unpack_sig 5s 5s +0%
unpack_sk 5s 6s -17%
caddq 4s 4s +0%
keccakf1600_extract_bytes (big endian) 4s 3s +33%
keccakf1600_xor_bytes 4s 2s +100%
keccakf1600_xor_bytes (big endian) 4s 3s +33%
keccakf1600x4_extract_bytes 4s 2s +100%
keccakf1600x4_xor_bytes 4s 2s +100%
mld_ct_cmask_neg_i32 4s 1s +300%
poly_caddq_native 4s 2s +100%
poly_decompose 4s 3s +33%
poly_invntt_tomont 4s 2s +100%
poly_make_hint 4s 2s +100%
poly_ntt 4s 3s +33%
poly_pointwise_montgomery_native 4s 3s +33%
poly_sub 4s 3s +33%
poly_uniform 4s 6s -33%
poly_uniform_gamma1 4s 3s +33%
poly_use_hint_c 4s 4s +0%
polyveck_pack_t0 4s 2s +100%
polyveck_unpack_eta 4s 3s +33%
polyvecl_pack_eta 4s 3s +33%
polyvecl_pointwise_acc_montgomery_native 4s 3s +33%
polyvecl_uniform_gamma1 4s 4s +0%
polyw1_pack 4s 2s +100%
shake128_absorb 4s 2s +100%
sign_open 4s 6s -33%
decompose 3s 4s -25%
fqscale 3s 5s -40%
intt_native_x86_64 3s 5s -40%
keccak_squeeze 3s 4s -25%
keccakf1600x4_permute 3s 2s +50%
make_hint 3s 5s -40%
mld_ct_cmask_nonzero_u8 3s 3s +0%
mld_keccakf1600_extract_bytes 3s 1s +200%
mld_prepare_domain_separation_prefix 3s 3s +0%
mld_value_barrier_u32 3s 2s +50%
montgomery_reduce 3s 3s +0%
pack_sig_c_h 3s 3s +0%
poly_add 3s 4s -25%
poly_caddq 3s 2s +50%
poly_caddq_native_aarch64 3s 3s +0%
poly_decompose_native 3s 5s -40%
poly_reduce 3s 4s -25%
poly_uniform_eta 3s 4s -25%
polyeta_pack 3s 4s -25%
polyt1_unpack 3s 3s +0%
polyveck_pack_eta 3s 2s +50%
polyvecl_permute_bitrev_to_custom 3s 2s +50%
polyz_pack 3s 2s +50%
polyz_unpack 3s 3s +0%
polyz_unpack_native 3s 4s -25%
reduce32 3s 4s -25%
rej_eta 3s 2s +50%
shake128_finalize 3s 3s +0%
shake256_finalize 3s 3s +0%
shake256_squeeze 3s 2s +50%
shake256x4_absorb_once 3s 2s +50%
shake256x4_squeezeblocks 3s 1s +200%
sign_keypair 3s 3s +0%
sign_signature_extmu 3s 5s -40%
sign_signature_pre_hash_internal 3s 5s -40%
sign_verify_extmu 3s 3s +0%
sign_verify_pre_hash_shake256 3s 4s -25%
unpack_pk 3s 6s -50%
use_hint 3s 3s +0%
keccak_finalize 2s 2s +0%
keccak_init 2s 3s -33%
mld_ct_abs_i32 2s 2s +0%
mld_ct_cmask_nonzero_u32 2s 4s -50%
mld_ct_get_optblocker_i64 2s 3s -33%
mld_ct_get_optblocker_u32 2s 2s +0%
mld_ct_get_optblocker_u8 2s 2s +0%
mld_value_barrier_i64 2s 3s -33%
mld_value_barrier_u8 2s 2s +0%
ntt_native_x86_64 2s 4s -50%
pack_sig_z 2s 3s -33%
poly_caddq_c 2s 3s -33%
poly_chknorm_native 2s 2s +0%
poly_ntt_c 2s 1s +100%
poly_ntt_native 2s 6s -67%
poly_power2round 2s 3s -33%
poly_use_hint 2s 3s -33%
polyvecl_chknorm 2s 4s -50%
shake128_init 2s 2s +0%
shake128_release 2s 3s -33%
shake128_squeeze 2s 5s -60%
shake128x4_absorb_once 2s 4s -50%
shake128x4_squeezeblocks 2s 2s +0%
shake256_absorb 2s 3s -33%
shake256_init 2s 2s +0%
sys_check_capability 2s 3s -33%
mld_ct_sel_int32 1s 2s -50%
pack_pk 1s 4s -75%
shake256_release 1s 3s -67%
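
The Change column appears to be the relative difference between Current and Previous. The Python sketch below assumes that definition, with whole-percent rounding for individual proofs and one decimal for the TOTAL row, and reproduces a few rows of the table above. The function name `change` is hypothetical.

```python
def change(current_s: int, previous_s: int, digits: int = 0) -> str:
    """Format the relative runtime change in the style used by the table."""
    pct = (current_s - previous_s) / previous_s * 100
    return f"{pct:+.{digits}f}%"

print(change(2627, 2449, digits=1))  # TOTAL row
print(change(377, 353))              # sign_verify_internal
print(change(4, 1))                  # mld_ct_cmask_neg_i32
```

Because runtimes are reported in whole seconds, proofs that finish in a few seconds can show very large percentages (1s versus 4s already reads as +300%), so the TOTAL row and the long-running proofs are the more informative entries.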


oqs-bot commented Jan 23, 2026

CBMC Results (ML-DSA-44)

Full Results (175 proofs)
Proof Status Current Previous Change
**TOTAL** 2002s 2055s -2.6%
sign_verify_internal 244s 254s -4%
mld_attempt_signature_generation 226s 221s +2%
polyvecl_pointwise_acc_montgomery_c 195s 208s -6%
rej_uniform_native 136s 144s -6%
poly_pointwise_montgomery_c 132s 143s -8%
mld_ct_memcmp 75s 82s -9%
mld_invntt_layer 51s 50s +2%
sign_signature_internal 47s 45s +4%
keccak_squeezeblocks_x4 44s 44s +0%
mld_ntt_layer 43s 44s -2%
poly_invntt_tomont_c 40s 39s +3%
rej_uniform 20s 20s +0%
fqmul 19s 19s +0%
polymat_permute_bitrev_to_custom 19s 16s +19%
rej_uniform_c 19s 20s -5%
poly_uniform_4x 17s 13s +31%
mld_compute_t0_t1_tr_from_sk_components 15s 15s +0%
poly_uniform_eta_4x 15s 19s -21%
polyvec_matrix_expand 15s 16s -6%
mld_polyvecl_permute_bitrev_to_custom_native 14s 14s +0%
poly_chknorm_c 14s 12s +17%
polyt0_unpack 14s 15s -7%
polyeta_unpack 13s 12s +8%
keccakf1600x4_permute_native 12s 14s -14%
mld_ntt_butterfly_block 12s 13s -8%
polyz_unpack_c 12s 12s +0%
keccak_absorb_once_x4 10s 9s +11%
keccakf1600_permute 9s 8s +12%
keccakf1600_permute_native 8s 6s +33%
polyveck_decompose 8s 7s +14%
polyveck_reduce 8s 5s +60%
polyvecl_ntt 8s 6s +33%
mld_check_pct 7s 6s +17%
mld_ct_cmask_nonzero_u32 7s 2s +250%
polyveck_add 7s 7s +0%
polyveck_pointwise_poly_montgomery 7s 5s +40%
sign 7s 5s +40%
mld_sample_s1_s2_serial 6s 3s +100%
pack_sk 6s 3s +100%
polyveck_power2round 6s 6s +0%
polyw1_pack 6s 2s +200%
sign_open 6s 6s +0%
sign_signature_pre_hash_internal 6s 4s +50%
mld_compute_pack_z 5s 6s -17%
mld_ct_sel_int32 5s 4s +25%
poly_decompose_c 5s 5s +0%
poly_ntt_native 5s 2s +150%
poly_power2round 5s 5s +0%
poly_sub 5s 3s +67%
poly_uniform_eta 5s 5s +0%
poly_uniform_gamma1 5s 4s +25%
poly_uniform_gamma1_4x 5s 4s +25%
poly_use_hint_c 5s 5s +0%
polyvec_matrix_expand_serial 5s 7s -29%
polyvec_matrix_pointwise_montgomery 5s 6s -17%
polyveck_caddq 5s 8s -38%
polyveck_ntt 5s 4s +25%
polyveck_sub 5s 4s +25%
polyveck_use_hint 5s 5s +0%
sign_keypair 5s 2s +150%
sign_keypair_internal 5s 4s +25%
decompose 4s 3s +33%
keccak_absorb 4s 8s -50%
keccak_finalize 4s 4s +0%
mld_h 4s 6s -33%
mld_sample_s1_s2 4s 4s +0%
poly_challenge 4s 3s +33%
poly_invntt_tomont_native 4s 4s +0%
poly_make_hint 4s 4s +0%
poly_pointwise_montgomery 4s 2s +100%
poly_uniform 4s 5s -20%
poly_use_hint 4s 2s +100%
polyveck_invntt_tomont 4s 6s -33%
polyveck_pack_t0 4s 3s +33%
polyvecl_chknorm 4s 6s -33%
polyz_pack 4s 2s +100%
polyz_unpack 4s 5s -20%
rej_eta_c 4s 4s +0%
shake128_finalize 4s 2s +100%
shake256_init 4s 3s +33%
shake256x4_absorb_once 4s 4s +0%
sign_pk_from_sk 4s 6s -33%
sign_signature 4s 6s -33%
sign_signature_pre_hash_shake256 4s 5s -20%
sign_verify 4s 3s +33%
sign_verify_extmu 4s 2s +100%
sys_check_capability 4s 4s +0%
unpack_hints 4s 5s -20%
intt_native_x86_64 3s 3s +0%
keccak_init 3s 3s +0%
keccakf1600_extract_bytes (big endian) 3s 3s +0%
keccakf1600_xor_bytes 3s 2s +50%
keccakf1600x4_xor_bytes 3s 2s +50%
mld_ct_abs_i32 3s 2s +50%
mld_ct_cmask_neg_i32 3s 2s +50%
mld_ct_cmask_nonzero_u8 3s 4s -25%
mld_keccakf1600_extract_bytes 3s 6s -50%
mld_value_barrier_i64 3s 4s -25%
mld_value_barrier_u32 3s 4s -25%
mld_value_barrier_u8 3s 1s +200%
ntt_native_x86_64 3s 3s +0%
pack_pk 3s 3s +0%
pack_sig_c_h 3s 4s -25%
pack_sig_z 3s 2s +50%
poly_add 3s 4s -25%
poly_caddq 3s 3s +0%
poly_caddq_c 3s 3s +0%
poly_caddq_native_aarch64 3s 3s +0%
poly_chknorm_native 3s 6s -50%
poly_decompose 3s 4s -25%
poly_invntt_tomont 3s 2s +50%
poly_ntt 3s 3s +0%
poly_pointwise_montgomery_native 3s 3s +0%
poly_reduce 3s 4s -25%
poly_shiftl 3s 4s -25%
polyeta_pack 3s 3s +0%
polyt0_pack 3s 4s -25%
polyt1_pack 3s 1s +200%
polyt1_unpack 3s 4s -25%
polyveck_pack_eta 3s 3s +0%
polyveck_shiftl 3s 5s -40%
polyveck_unpack_eta 3s 4s -25%
polyvecl_pack_eta 3s 3s +0%
polyvecl_permute_bitrev_to_custom 3s 2s +50%
power2round 3s 2s +50%
rej_eta 3s 1s +200%
rej_eta_native 3s 4s -25%
shake128_release 3s 3s +0%
shake128_squeeze 3s 2s +50%
shake256 3s 3s +0%
shake256_finalize 3s 4s -25%
shake256_release 3s 3s +0%
shake256_squeeze 3s 5s -40%
sign_signature_extmu 3s 7s -57%
sign_verify_pre_hash_internal 3s 4s -25%
sign_verify_pre_hash_shake256 3s 3s +0%
unpack_sk 3s 4s -25%
use_hint 3s 2s +50%
caddq 2s 3s -33%
keccak_squeeze 2s 5s -60%
keccakf1600_xor_bytes (big endian) 2s 2s +0%
keccakf1600x4_extract_bytes 2s 3s -33%
keccakf1600x4_permute 2s 3s -33%
make_hint 2s 3s -33%
mld_ct_get_optblocker_u32 2s 1s +100%
mld_ct_get_optblocker_u8 2s 1s +100%
mld_prepare_domain_separation_prefix 2s 5s -60%
montgomery_reduce 2s 2s +0%
poly_caddq_native 2s 2s +0%
poly_chknorm 2s 4s -50%
poly_decompose_native 2s 3s -33%
poly_ntt_c 2s 2s +0%
poly_use_hint_native 2s 3s -33%
polyveck_chknorm 2s 3s -33%
polyveck_make_hint 2s 4s -50%
polyveck_unpack_t0 2s 4s -50%
polyvecl_pointwise_acc_montgomery 2s 4s -50%
polyvecl_pointwise_acc_montgomery_native 2s 3s -33%
polyvecl_uniform_gamma1 2s 2s +0%
polyvecl_uniform_gamma1_serial 2s 3s -33%
polyvecl_unpack_eta 2s 2s +0%
polyvecl_unpack_z 2s 5s -60%
polyz_unpack_native 2s 3s -33%
reduce32 2s 3s -33%
shake128_init 2s 2s +0%
shake128x4_absorb_once 2s 4s -50%
shake256_absorb 2s 2s +0%
shake256x4_squeezeblocks 2s 2s +0%
unpack_pk 2s 3s -33%
unpack_sig 2s 3s -33%
fqscale 1s 1s +0%
mld_ct_get_optblocker_i64 1s 2s -50%
polyveck_pack_w1 1s 4s -75%
shake128_absorb 1s 1s +0%
shake128x4_squeezeblocks 1s 2s -50%


oqs-bot commented Jan 23, 2026

CBMC Results (ML-DSA-65)

Full Results (175 proofs)
Proof Status Current Previous Change
**TOTAL** 2320s 2286s +1.5%
polyvecl_pointwise_acc_montgomery_c 231s 226s +2%
mld_attempt_signature_generation 200s 197s +2%
sign_verify_internal 181s 177s +2%
polyvec_matrix_expand 146s 145s +1%
rej_uniform_native 145s 144s +1%
poly_pointwise_montgomery_c 141s 138s +2%
mld_invntt_layer 122s 117s +4%
mld_ct_memcmp 83s 79s +5%
polyvec_matrix_expand_serial 67s 65s +3%
sign_signature_internal 53s 50s +6%
mld_ntt_layer 45s 44s +2%
keccak_squeezeblocks_x4 43s 42s +2%
mld_compute_t0_t1_tr_from_sk_components 24s 27s -11%
polymat_permute_bitrev_to_custom 20s 18s +11%
fqmul 19s 18s +6%
polyveck_decompose 19s 16s +19%
rej_uniform 19s 21s -10%
rej_uniform_c 18s 19s -5%
poly_chknorm_c 17s 16s +6%
poly_uniform_4x 16s 13s +23%
poly_uniform_eta_4x 16s 17s -6%
polyt0_unpack 16s 17s -6%
mld_ntt_butterfly_block 13s 13s +0%
polyvec_matrix_pointwise_montgomery 13s 14s -7%
polyveck_use_hint 13s 14s -7%
keccakf1600x4_permute_native 12s 14s -14%
mld_check_pct 12s 9s +33%
mld_polyvecl_permute_bitrev_to_custom_native 12s 14s -14%
keccak_absorb_once_x4 11s 10s +10%
polyveck_power2round 11s 11s +0%
sign 11s 9s +22%
keccakf1600_permute_native 9s 9s +0%
poly_decompose_c 9s 7s +29%
poly_invntt_tomont_c 9s 8s +12%
polyveck_add 9s 9s +0%
polyveck_caddq 9s 8s +12%
polyveck_ntt 9s 6s +50%
polyeta_unpack 8s 7s +14%
polyveck_pointwise_poly_montgomery 8s 6s +33%
polyveck_reduce 8s 9s -11%
polyveck_shiftl 8s 8s +0%
polyveck_sub 8s 7s +14%
polyvecl_ntt 8s 8s +0%
sign_pk_from_sk 8s 8s +0%
keccak_absorb 7s 5s +40%
keccakf1600_permute 7s 9s -22%
mld_sample_s1_s2 7s 4s +75%
poly_uniform_eta 7s 3s +133%
polyveck_invntt_tomont 7s 8s -12%
unpack_sk 7s 5s +40%
mld_h 6s 3s +100%
mld_prepare_domain_separation_prefix 6s 5s +20%
poly_decompose 6s 3s +100%
poly_uniform 6s 4s +50%
shake256x4_squeezeblocks 6s 3s +100%
sign_verify_pre_hash_internal 6s 3s +100%
decompose 5s 5s +0%
intt_native_x86_64 5s 3s +67%
keccakf1600x4_extract_bytes 5s 1s +400%
mld_compute_pack_z 5s 7s -29%
mld_sample_s1_s2_serial 5s 6s -17%
montgomery_reduce 5s 2s +150%
poly_sub 5s 5s +0%
polyt0_pack 5s 4s +25%
polyveck_pack_w1 5s 6s -17%
polyvecl_permute_bitrev_to_custom 5s 3s +67%
polyvecl_pointwise_acc_montgomery_native 5s 5s +0%
sign_keypair_internal 5s 6s -17%
sign_signature_pre_hash_internal 5s 6s -17%
sign_verify 5s 4s +25%
sign_verify_extmu 5s 4s +25%
fqscale 4s 4s +0%
mld_ct_get_optblocker_u8 4s 2s +100%
mld_value_barrier_u32 4s 3s +33%
mld_value_barrier_u8 4s 3s +33%
ntt_native_x86_64 4s 3s +33%
poly_add 4s 4s +0%
poly_caddq_native_aarch64 4s 6s -33%
poly_challenge 4s 4s +0%
poly_decompose_native 4s 3s +33%
poly_invntt_tomont 4s 6s -33%
poly_ntt 4s 5s -20%
poly_pointwise_montgomery 4s 5s -20%
poly_power2round 4s 5s -20%
poly_shiftl 4s 2s +100%
poly_uniform_gamma1 4s 5s -20%
poly_use_hint_c 4s 7s -43%
polyveck_make_hint 4s 5s -20%
polyveck_pack_eta 4s 4s +0%
polyveck_unpack_t0 4s 3s +33%
polyvecl_pack_eta 4s 4s +0%
polyvecl_unpack_eta 4s 5s -20%
polyw1_pack 4s 1s +300%
polyz_unpack_c 4s 5s -20%
power2round 4s 2s +100%
rej_eta_c 4s 3s +33%
rej_eta_native 4s 4s +0%
shake256_absorb 4s 5s -20%
shake256_release 4s 1s +300%
shake256_squeeze 4s 2s +100%
sign_open 4s 5s -20%
sign_verify_pre_hash_shake256 4s 6s -33%
unpack_hints 4s 5s -20%
unpack_sig 4s 4s +0%
keccak_finalize 3s 2s +50%
keccak_squeeze 3s 2s +50%
keccakf1600_extract_bytes (big endian) 3s 3s +0%
keccakf1600_xor_bytes 3s 3s +0%
keccakf1600_xor_bytes (big endian) 3s 3s +0%
keccakf1600x4_permute 3s 1s +200%
mld_ct_get_optblocker_i64 3s 1s +200%
pack_sig_c_h 3s 2s +50%
pack_sig_z 3s 2s +50%
poly_caddq 3s 3s +0%
poly_caddq_c 3s 3s +0%
poly_chknorm 3s 3s +0%
poly_chknorm_native 3s 4s -25%
poly_invntt_tomont_native 3s 7s -57%
poly_ntt_c 3s 2s +50%
poly_ntt_native 3s 3s +0%
poly_pointwise_montgomery_native 3s 2s +50%
poly_reduce 3s 5s -40%
poly_uniform_gamma1_4x 3s 6s -50%
poly_use_hint_native 3s 4s -25%
polyeta_pack 3s 2s +50%
polyt1_pack 3s 2s +50%
polyt1_unpack 3s 6s -50%
polyveck_chknorm 3s 5s -40%
polyveck_pack_t0 3s 3s +0%
polyveck_unpack_eta 3s 4s -25%
polyvecl_chknorm 3s 5s -40%
polyvecl_uniform_gamma1 3s 5s -40%
polyz_unpack 3s 2s +50%
rej_eta 3s 4s -25%
shake128x4_squeezeblocks 3s 1s +200%
shake256 3s 2s +50%
shake256x4_absorb_once 3s 4s -25%
sign_signature 3s 4s -25%
sign_signature_extmu 3s 4s -25%
sign_signature_pre_hash_shake256 3s 4s -25%
sys_check_capability 3s 3s +0%
unpack_pk 3s 3s +0%
caddq 2s 3s -33%
keccak_init 2s 3s -33%
keccakf1600x4_xor_bytes 2s 2s +0%
make_hint 2s 3s -33%
mld_ct_abs_i32 2s 4s -50%
mld_ct_cmask_nonzero_u32 2s 2s +0%
mld_ct_sel_int32 2s 3s -33%
mld_keccakf1600_extract_bytes 2s 2s +0%
mld_value_barrier_i64 2s 2s +0%
pack_pk 2s 3s -33%
pack_sk 2s 2s +0%
poly_caddq_native 2s 4s -50%
poly_make_hint 2s 3s -33%
poly_use_hint 2s 3s -33%
polyvecl_pointwise_acc_montgomery 2s 5s -60%
polyvecl_uniform_gamma1_serial 2s 4s -50%
polyvecl_unpack_z 2s 2s +0%
polyz_pack 2s 4s -50%
polyz_unpack_native 2s 3s -33%
reduce32 2s 3s -33%
shake128_absorb 2s 2s +0%
shake128_init 2s 1s +100%
shake128_release 2s 3s -33%
shake128_squeeze 2s 2s +0%
shake128x4_absorb_once 2s 4s -50%
shake256_finalize 2s 5s -60%
shake256_init 2s 2s +0%
sign_keypair 2s 3s -33%
use_hint 2s 2s +0%
mld_ct_cmask_neg_i32 1s 2s -50%
mld_ct_cmask_nonzero_u8 1s 2s -50%
mld_ct_get_optblocker_u32 1s 4s -75%
shake128_finalize 1s 3s -67%

@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 45681 cycles 45685 cycles 1.00
ML-DSA-44 sign 131153 cycles 131164 cycles 1.00
ML-DSA-44 verify 47527 cycles 47530 cycles 1.00
ML-DSA-65 keypair 80457 cycles 80479 cycles 1.00
ML-DSA-65 sign 215715 cycles 215740 cycles 1.00
ML-DSA-65 verify 79737 cycles 79735 cycles 1.00
ML-DSA-87 keypair 131177 cycles 131175 cycles 1.00
ML-DSA-87 sign 277048 cycles 277004 cycles 1.00
ML-DSA-87 verify 130004 cycles 129971 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (no-opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 111983 cycles 111979 cycles 1.00
ML-DSA-44 sign 403592 cycles 403622 cycles 1.00
ML-DSA-44 verify 119886 cycles 119876 cycles 1.00
ML-DSA-65 keypair 192137 cycles 192166 cycles 1.00
ML-DSA-65 sign 657120 cycles 657078 cycles 1.00
ML-DSA-65 verify 193900 cycles 193891 cycles 1.00
ML-DSA-87 keypair 317930 cycles 318010 cycles 1.00
ML-DSA-87 sign 836905 cycles 836903 cycles 1.00
ML-DSA-87 verify 322922 cycles 322994 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 34340 cycles 34361 cycles 1.00
ML-DSA-44 sign 119648 cycles 120023 cycles 1.00
ML-DSA-44 verify 37990 cycles 38140 cycles 1.00
ML-DSA-65 keypair 60562 cycles 60626 cycles 1.00
ML-DSA-65 sign 201239 cycles 200228 cycles 1.01
ML-DSA-65 verify 62873 cycles 62578 cycles 1.00
ML-DSA-87 keypair 93377 cycles 93913 cycles 0.99
ML-DSA-87 sign 232229 cycles 235482 cycles 0.99
ML-DSA-87 verify 94479 cycles 94514 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 229063 cycles 232745 cycles 0.98
ML-DSA-44 sign 628858 cycles 629812 cycles 1.00
ML-DSA-44 verify 229339 cycles 229277 cycles 1.00
ML-DSA-65 keypair 378941 cycles 422090 cycles 0.90
ML-DSA-65 sign 1007370 cycles 1067756 cycles 0.94
ML-DSA-65 verify 376246 cycles 393848 cycles 0.96
ML-DSA-87 keypair 690237 cycles 673725 cycles 1.02
ML-DSA-87 sign 1396068 cycles 1405386 cycles 0.99
ML-DSA-87 verify 663094 cycles 657567 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 93562 cycles 93808 cycles 1.00
ML-DSA-44 sign 332581 cycles 332528 cycles 1.00
ML-DSA-44 verify 99714 cycles 99696 cycles 1.00
ML-DSA-65 keypair 159833 cycles 160037 cycles 1.00
ML-DSA-65 sign 543737 cycles 544483 cycles 1.00
ML-DSA-65 verify 160524 cycles 160826 cycles 1.00
ML-DSA-87 keypair 267186 cycles 266702 cycles 1.00
ML-DSA-87 sign 707232 cycles 705628 cycles 1.00
ML-DSA-87 verify 270355 cycles 270568 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 68896 cycles 69270 cycles 0.99
ML-DSA-44 sign 187431 cycles 187049 cycles 1.00
ML-DSA-44 verify 68887 cycles 69047 cycles 1.00
ML-DSA-65 keypair 119600 cycles 119031 cycles 1.00
ML-DSA-65 sign 299540 cycles 299818 cycles 1.00
ML-DSA-65 verify 115518 cycles 115291 cycles 1.00
ML-DSA-87 keypair 203742 cycles 203891 cycles 1.00
ML-DSA-87 sign 393131 cycles 394659 cycles 1.00
ML-DSA-87 verify 195707 cycles 195766 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 57327 cycles 56563 cycles 1.01
ML-DSA-44 sign 180726 cycles 181874 cycles 0.99
ML-DSA-44 verify 60901 cycles 61156 cycles 1.00
ML-DSA-65 keypair 98660 cycles 98757 cycles 1.00
ML-DSA-65 sign 298138 cycles 298537 cycles 1.00
ML-DSA-65 verify 100095 cycles 100518 cycles 1.00
ML-DSA-87 keypair 152331 cycles 152679 cycles 1.00
ML-DSA-87 sign 355616 cycles 355558 cycles 1.00
ML-DSA-87 verify 154183 cycles 152966 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 41477 cycles 41136 cycles 1.01
ML-DSA-44 sign 133286 cycles 132617 cycles 1.01
ML-DSA-44 verify 44031 cycles 44492 cycles 0.99
ML-DSA-65 keypair 72317 cycles 72104 cycles 1.00
ML-DSA-65 sign 213181 cycles 214651 cycles 0.99
ML-DSA-65 verify 71974 cycles 72444 cycles 0.99
ML-DSA-87 keypair 107833 cycles 107657 cycles 1.00
ML-DSA-87 sign 250476 cycles 250266 cycles 1.00
ML-DSA-87 verify 109230 cycles 112595 cycles 0.97

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
The benchmark result of this commit is worse than the previous result by more than the threshold ratio of 1.03.

Benchmark suite Current: 3819863 Previous: 9258ea1 Ratio
ML-DSA-65 keypair 75829 cycles 72591 cycles 1.04

This comment was automatically generated by workflow using github-action-benchmark.
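
The alert above can be read as a simple ratio test. The following Python sketch assumes the check is "current divided by previous exceeds the threshold", which is what the message and the Ratio column suggest. The function names are illustrative and are not taken from github-action-benchmark.

```python
def regression_ratio(current_cycles: int, previous_cycles: int) -> float:
    """Ratio of the current measurement to the previous one (above 1.0 is slower)."""
    return current_cycles / previous_cycles

def is_regression(current_cycles: int, previous_cycles: int,
                  threshold: float = 1.03) -> bool:
    """True when the slowdown ratio exceeds the alert threshold."""
    return regression_ratio(current_cycles, previous_cycles) > threshold

# ML-DSA-65 keypair on AMD EPYC 4th gen (c7a), values from the alert table.
print(round(regression_ratio(75829, 72591), 2), is_regression(75829, 72591))
```

The same benchmark in the later run on this machine (72317 versus 72104 cycles) gives a ratio of about 1.00, which is below the threshold.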

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 134758 cycles 134710 cycles 1.00
ML-DSA-44 sign 523723 cycles 526054 cycles 1.00
ML-DSA-44 verify 147705 cycles 147500 cycles 1.00
ML-DSA-65 keypair 226449 cycles 226690 cycles 1.00
ML-DSA-65 sign 860712 cycles 861192 cycles 1.00
ML-DSA-65 verify 235070 cycles 235381 cycles 1.00
ML-DSA-87 keypair 370974 cycles 370668 cycles 1.00
ML-DSA-87 sign 1079141 cycles 1078305 cycles 1.00
ML-DSA-87 verify 383049 cycles 383429 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 157394 cycles 157188 cycles 1.00
ML-DSA-44 sign 549561 cycles 548996 cycles 1.00
ML-DSA-44 verify 169498 cycles 169283 cycles 1.00
ML-DSA-65 keypair 267800 cycles 269077 cycles 1.00
ML-DSA-65 sign 903011 cycles 906033 cycles 1.00
ML-DSA-65 verify 273909 cycles 275229 cycles 1.00
ML-DSA-87 keypair 449680 cycles 448040 cycles 1.00
ML-DSA-87 sign 1161535 cycles 1157923 cycles 1.00
ML-DSA-87 verify 460234 cycles 457343 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 67942 cycles 68235 cycles 1.00
ML-DSA-44 sign 201998 cycles 201899 cycles 1.00
ML-DSA-44 verify 70776 cycles 70799 cycles 1.00
ML-DSA-65 keypair 121036 cycles 121045 cycles 1.00
ML-DSA-65 sign 331322 cycles 331301 cycles 1.00
ML-DSA-65 verify 117850 cycles 117988 cycles 1.00
ML-DSA-87 keypair 198669 cycles 197907 cycles 1.00
ML-DSA-87 sign 428529 cycles 426619 cycles 1.00
ML-DSA-87 verify 194582 cycles 194362 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 72254 cycles 72298 cycles 1.00
ML-DSA-44 sign 211942 cycles 211862 cycles 1.00
ML-DSA-44 verify 75645 cycles 75651 cycles 1.00
ML-DSA-65 keypair 127516 cycles 127564 cycles 1.00
ML-DSA-65 sign 350254 cycles 350256 cycles 1.00
ML-DSA-65 verify 125449 cycles 125447 cycles 1.00
ML-DSA-87 keypair 208196 cycles 208014 cycles 1.00
ML-DSA-87 sign 448893 cycles 448910 cycles 1.00
ML-DSA-87 verify 205308 cycles 205681 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 120311 cycles 120460 cycles 1.00
ML-DSA-44 sign 447300 cycles 447234 cycles 1.00
ML-DSA-44 verify 129710 cycles 130455 cycles 0.99
ML-DSA-65 keypair 204437 cycles 203981 cycles 1.00
ML-DSA-65 sign 728421 cycles 730686 cycles 1.00
ML-DSA-65 verify 209421 cycles 210398 cycles 1.00
ML-DSA-87 keypair 337688 cycles 337748 cycles 1.00
ML-DSA-87 sign 926903 cycles 922242 cycles 1.01
ML-DSA-87 verify 346060 cycles 347109 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 128224 cycles 128233 cycles 1.00
ML-DSA-44 sign 447684 cycles 447406 cycles 1.00
ML-DSA-44 verify 138331 cycles 142161 cycles 0.97
ML-DSA-65 keypair 220728 cycles 220585 cycles 1.00
ML-DSA-65 sign 727613 cycles 726570 cycles 1.00
ML-DSA-65 verify 223172 cycles 223096 cycles 1.00
ML-DSA-87 keypair 365009 cycles 365027 cycles 1.00
ML-DSA-87 sign 926270 cycles 926682 cycles 1.00
ML-DSA-87 verify 372774 cycles 372462 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3 (no-opt)

Details
Benchmark suite Current: 14097e6 Previous: daf271b Ratio
ML-DSA-44 keypair 138503 cycles 138431 cycles 1.00
ML-DSA-44 sign 484053 cycles 483804 cycles 1.00
ML-DSA-44 verify 148725 cycles 156357 cycles 0.95
ML-DSA-65 keypair 241276 cycles 241178 cycles 1.00
ML-DSA-65 sign 792427 cycles 792015 cycles 1.00
ML-DSA-65 verify 241215 cycles 241086 cycles 1.00
ML-DSA-87 keypair 396496 cycles 396336 cycles 1.00
ML-DSA-87 sign 1013013 cycles 1012796 cycles 1.00
ML-DSA-87 verify 402599 cycles 402305 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton2

Details
| Benchmark suite | Current: 14097e6 | Previous: daf271b | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 113820 cycles | 113345 cycles | 1.00 |
| ML-DSA-44 sign | 357341 cycles | 356055 cycles | 1.00 |
| ML-DSA-44 verify | 118529 cycles | 118038 cycles | 1.00 |
| ML-DSA-65 keypair | 196679 cycles | 196907 cycles | 1.00 |
| ML-DSA-65 sign | 588785 cycles | 590403 cycles | 1.00 |
| ML-DSA-65 verify | 194716 cycles | 195057 cycles | 1.00 |
| ML-DSA-87 keypair | 323237 cycles | 322985 cycles | 1.00 |
| ML-DSA-87 sign | 754039 cycles | 753517 cycles | 1.00 |
| ML-DSA-87 verify | 320375 cycles | 320636 cycles | 1.00 |


@github-actions github-actions bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)

Details
| Benchmark suite | Current: 14097e6 | Previous: daf271b | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 827476 cycles | 828088 cycles | 1.00 |
| ML-DSA-44 sign | 3238353 cycles | 3233170 cycles | 1.00 |
| ML-DSA-44 verify | 921919 cycles | 920794 cycles | 1.00 |
| ML-DSA-65 keypair | 1413613 cycles | 1413452 cycles | 1.00 |
| ML-DSA-65 sign | 5340696 cycles | 5347688 cycles | 1.00 |
| ML-DSA-65 verify | 1477470 cycles | 1477937 cycles | 1.00 |
| ML-DSA-87 keypair | 2311391 cycles | 2312894 cycles | 1.00 |
| ML-DSA-87 sign | 6659117 cycles | 6665352 cycles | 1.00 |
| ML-DSA-87 verify | 2409640 cycles | 2411069 cycles | 1.00 |

Contributor

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)

Details
| Benchmark suite | Current: 14097e6 | Previous: daf271b | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 214003 cycles | 213077 cycles | 1.00 |
| ML-DSA-44 sign | 765036 cycles | 760523 cycles | 1.01 |
| ML-DSA-44 verify | 230465 cycles | 233125 cycles | 0.99 |
| ML-DSA-65 keypair | 380442 cycles | 380915 cycles | 1.00 |
| ML-DSA-65 sign | 1253729 cycles | 1251999 cycles | 1.00 |
| ML-DSA-65 verify | 371997 cycles | 372378 cycles | 1.00 |
| ML-DSA-87 keypair | 604923 cycles | 605968 cycles | 1.00 |
| ML-DSA-87 sign | 1594853 cycles | 1593941 cycles | 1.00 |
| ML-DSA-87 verify | 619102 cycles | 617894 cycles | 1.00 |


@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

Details
| Benchmark suite | Current: 14097e6 | Previous: daf271b | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 311698 cycles | 306606 cycles | 1.02 |
| ML-DSA-44 sign | 1174058 cycles | 1166146 cycles | 1.01 |
| ML-DSA-44 verify | 333560 cycles | 335430 cycles | 0.99 |
| ML-DSA-65 keypair | 550737 cycles | 562274 cycles | 0.98 |
| ML-DSA-65 sign | 1894590 cycles | 1916493 cycles | 0.99 |
| ML-DSA-65 verify | 529438 cycles | 533535 cycles | 0.99 |
| ML-DSA-87 keypair | 872695 cycles | 865006 cycles | 1.01 |
| ML-DSA-87 sign | 2468410 cycles | 2417913 cycles | 1.02 |
| ML-DSA-87 verify | 900121 cycles | 884966 cycles | 1.02 |


@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 309195 cycles | 299195 cycles | 1.03 |
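
The ratio column is rounded to two decimals, which is why a row shown as 1.03 can still trip a 1.03 threshold. A minimal C sketch of that arithmetic follows. It assumes the alert fires when current/previous is strictly greater than the threshold; `bench_ratio` and `exceeds_threshold` are hypothetical names for illustration and are not taken from the benchmark workflow.

```c
#include <stdint.h>

/* Hypothetical helper: unrounded ratio of the current cycle count to the previous one. */
static double bench_ratio(uint64_t current, uint64_t previous)
{
  return (double)current / (double)previous;
}

/* Assumption: an alert is raised when the unrounded ratio is strictly above the threshold. */
static int exceeds_threshold(uint64_t current, uint64_t previous, double threshold)
{
  return bench_ratio(current, previous) > threshold;
}
```

With the row above, 309195 / 299195 is roughly 1.0334, so it exceeds 1.03 even though it displays as 1.03.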


@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Details
| Benchmark suite | Current: 14097e6 | Previous: daf271b | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 277182 cycles | 278160 cycles | 1.00 |
| ML-DSA-44 sign | 816109 cycles | 822535 cycles | 0.99 |
| ML-DSA-44 verify | 280990 cycles | 278070 cycles | 1.01 |
| ML-DSA-65 keypair | 477648 cycles | 476503 cycles | 1.00 |
| ML-DSA-65 sign | 1398700 cycles | 1347085 cycles | 1.04 |
| ML-DSA-65 verify | 461181 cycles | 456015 cycles | 1.01 |
| ML-DSA-87 keypair | 825204 cycles | 796551 cycles | 1.04 |
| ML-DSA-87 sign | 1886968 cycles | 1773335 cycles | 1.06 |
| ML-DSA-87 verify | 803609 cycles | 772360 cycles | 1.04 |

@willieyz willieyz force-pushed the eliminate-caddq-intrinsics branch 2 times, most recently from 72bc3f8 to d186f5e Compare January 26, 2026 10:12
@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

Details
| Benchmark suite | Current: 14097e6 | Previous: daf271b | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 113150 cycles | 113204 cycles | 1.00 |
| ML-DSA-44 sign | 355525 cycles | 355548 cycles | 1.00 |
| ML-DSA-44 verify | 117877 cycles | 117886 cycles | 1.00 |
| ML-DSA-65 keypair | 196192 cycles | 196406 cycles | 1.00 |
| ML-DSA-65 sign | 588774 cycles | 588666 cycles | 1.00 |
| ML-DSA-65 verify | 194576 cycles | 194481 cycles | 1.00 |
| ML-DSA-87 keypair | 322391 cycles | 321917 cycles | 1.00 |
| ML-DSA-87 sign | 751848 cycles | 752728 cycles | 1.00 |
| ML-DSA-87 verify | 319927 cycles | 320132 cycles | 1.00 |


@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Details
| Benchmark suite | Current: 14097e6 | Previous: daf271b | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 212745 cycles | 212659 cycles | 1.00 |
| ML-DSA-44 sign | 759537 cycles | 759393 cycles | 1.00 |
| ML-DSA-44 verify | 229014 cycles | 228980 cycles | 1.00 |
| ML-DSA-65 keypair | 380339 cycles | 380359 cycles | 1.00 |
| ML-DSA-65 sign | 1251422 cycles | 1251433 cycles | 1.00 |
| ML-DSA-65 verify | 372106 cycles | 372151 cycles | 1.00 |
| ML-DSA-87 keypair | 605932 cycles | 605385 cycles | 1.00 |
| ML-DSA-87 sign | 1591645 cycles | 1591182 cycles | 1.00 |
| ML-DSA-87 verify | 617975 cycles | 617388 cycles | 1.00 |

Contributor

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Graviton2 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 verify | 241958 cycles | 229196 cycles | 1.06 |


@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)

Details
| Benchmark suite | Current: 14097e6 | Previous: daf271b | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 464439 cycles | 465373 cycles | 1.00 |
| ML-DSA-44 sign | 2143725 cycles | 2152438 cycles | 1.00 |
| ML-DSA-44 verify | 551422 cycles | 551474 cycles | 1.00 |
| ML-DSA-65 keypair | 783189 cycles | 781420 cycles | 1.00 |
| ML-DSA-65 sign | 3519184 cycles | 3519262 cycles | 1.00 |
| ML-DSA-65 verify | 855778 cycles | 854831 cycles | 1.00 |
| ML-DSA-87 keypair | 1261568 cycles | 1263149 cycles | 1.00 |
| ML-DSA-87 sign | 4343372 cycles | 4339952 cycles | 1.00 |
| ML-DSA-87 verify | 1377900 cycles | 1379633 cycles | 1.00 |


@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: f9a6d30 | Previous: 9258ea1 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 237495 cycles | 229189 cycles | 1.04 |

@willieyz willieyz force-pushed the eliminate-caddq-intrinsics branch from 6faaac2 to 5b1b8a7 Compare January 27, 2026 10:46
@willieyz willieyz force-pushed the eliminate-caddq-intrinsics branch from 5b1b8a7 to f9a6d30 Compare January 28, 2026 03:58
@willieyz willieyz marked this pull request as ready for review January 28, 2026 06:42
@willieyz willieyz requested a review from a team as a code owner January 28, 2026 06:42
Contributor

@mkannwischer mkannwischer left a comment

Thanks @willieyz. Performance is looking good and I checked that the code is doing the correct thing. Here are a few stylistic comments.

.balign 16
MLD_ASM_FN_SYMBOL(poly_caddq_avx2)

movabsq $35993616950222849, %rdx

Why are you using a 64-bit constant here? It would be much easier to follow if you take a 32-bit one:

mov      $8380417, %edx
vmovd    %edx, %xmm1
vpbroadcastd %xmm1, %ymm1

Unless that is slower, it should be preferred.
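
For readers following the thread, a small C check (not part of the PR) shows what the 64-bit immediate encodes: it is the ML-DSA modulus q = 8380417 (0x7FE001) repeated in both 32-bit halves, so it carries the same information as the 32-bit constant suggested above. `replicate32` is a hypothetical helper introduced only for this illustration.

```c
#include <stdint.h>

#define MLDSA_Q 8380417 /* ML-DSA modulus q, 0x7FE001 */

/* Hypothetical helper: copy a 32-bit value into both halves of a 64-bit word. */
static uint64_t replicate32(uint32_t x)
{
  return ((uint64_t)x << 32) | (uint64_t)x;
}

/* Returns 1 if the movabsq immediate from the diff equals q packed twice. */
static int immediate_is_packed_q(void)
{
  const uint64_t imm = UINT64_C(35993616950222849);
  return imm == replicate32(MLDSA_Q);
}
```

Since both forms end up with q in every 32-bit lane, the choice between them is about readability and, possibly, instruction cost, as the comment says.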

addq $128, %rdi # advance by 128 bytes (4 vectors)
cmpq %rdi, %rax
jne poly_caddq_avx2_loop # 8 iterations (32/4 = 8)
vzeroupper

We never use vzeroupper in any other AVX2 files, so this should be eliminated.

Comment on lines 50 to 53
vpcmpgtd (%rdi), %ymm2, %ymm0
vpand %ymm1, %ymm0, %ymm0
vpaddd (%rdi), %ymm0, %ymm0
vmovdqa %ymm0, (%rdi)

Please wrap this in a macro - similar to the caddq code for aarch64.
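
As a reading aid (not code from the PR), the four instructions quoted above can be modelled per 32-bit lane in scalar C. The sketch assumes that `%ymm2` holds zero and `%ymm1` holds q = 8380417 in every lane, which the excerpt does not show; `caddq_lane` and `poly_caddq_ref` are hypothetical names.

```c
#include <stdint.h>

#define MLDSA_Q 8380417 /* ML-DSA modulus q */
#define MLDSA_N 256     /* coefficients per polynomial */

/* Scalar model of one lane: add q only when the coefficient is negative. */
static int32_t caddq_lane(int32_t a)
{
  /* vpcmpgtd with a zero register (assumed): all-ones if 0 > a, else 0 */
  int32_t mask = -(int32_t)(a < 0);
  /* vpand selects q or 0, vpaddd adds it to the coefficient */
  return a + (mask & MLDSA_Q);
}

/* Hypothetical reference loop over a whole polynomial, one coefficient at a time. */
static void poly_caddq_ref(int32_t r[MLDSA_N])
{
  unsigned i;
  for (i = 0; i < MLDSA_N; i++)
  {
    r[i] = caddq_lane(r[i]);
  }
}
```

The mask-and-add form avoids a data-dependent branch, which matches the compare, and, add, store shape of the assembly sequence.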

vpaddd 96(%rdi), %ymm5, %ymm5
vmovdqa %ymm5, 96(%rdi)

addq $128, %rdi # advance by 128 bytes (4 vectors)

We never use # comments.
Please use // comments or /* */ comments like in the other files. Also try to follow their style.

@mkannwischer mkannwischer marked this pull request as draft February 10, 2026 02:45
@willieyz willieyz force-pushed the eliminate-caddq-intrinsics branch 4 times, most recently from 7dc5f6f to 6761759 Compare February 11, 2026 03:12
@willieyz willieyz marked this pull request as ready for review February 11, 2026 09:40
@mkannwischer mkannwischer force-pushed the eliminate-caddq-intrinsics branch from 6761759 to 14097e6 Compare February 11, 2026 09:51
@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: 14097e6 | Previous: daf271b | Ratio |
| --- | --- | --- | --- |
| ML-DSA-65 sign | 1398700 cycles | 1347085 cycles | 1.04 |
| ML-DSA-87 keypair | 825204 cycles | 796551 cycles | 1.04 |
| ML-DSA-87 sign | 1886968 cycles | 1773335 cycles | 1.06 |
| ML-DSA-87 verify | 803609 cycles | 772360 cycles | 1.04 |

@jakemas jakemas self-requested a review February 13, 2026 03:49
This commit replaces the caddq AVX2 intrinsics implementation with assembly.
It also adds caddq to the component benchmarks.

Signed-off-by: willieyz <willie.zhao@chelpis.com>
@mkannwischer mkannwischer force-pushed the eliminate-caddq-intrinsics branch from 14097e6 to 16cbdfe Compare February 18, 2026 17:17
Contributor

@mkannwischer mkannwischer left a comment

Thanks @willieyz. I took the liberty to clean up the commit history, but this looks good to me now.

@jakemas, WDYT?

Development

Successfully merging this pull request may close these issues.

AVX2: Replace intrinsics implementation of poly_caddq with assembly

3 participants