x86_64/AArch64: Add AVX2/Neon polyw1_pack to x86_64 native backend #973

Draft
mkannwischer wants to merge 2 commits into main from x86-native-w1-pack

@mkannwischer mkannwischer commented Feb 21, 2026

Integrate polyw1_pack AVX2 implementations for both GAMMA2 variants into the native backend.
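For orientation, here is a minimal scalar sketch of what polyw1_pack computes for the two GAMMA2 variants. It follows the ML-DSA w1 encoding: with GAMMA2 = (Q-1)/88 (ML-DSA-44) each coefficient lies in [0, 43] and takes 6 bits, while with GAMMA2 = (Q-1)/32 (ML-DSA-65 and ML-DSA-87) each coefficient lies in [0, 15] and takes 4 bits. The names `w1_pack_6bit`, `w1_pack_4bit`, and `W1_N` are hypothetical and are not the identifiers used in this PR; the AVX2/Neon code is assumed to produce the same bytes with vector instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define W1_N 256 /* coefficients per polynomial */

/* GAMMA2 = (Q-1)/88 (ML-DSA-44): coefficients in [0, 43], 6 bits each.
 * Four coefficients are packed into three bytes (192 bytes in total). */
static void w1_pack_6bit(uint8_t r[192], const int32_t a[W1_N])
{
  size_t i;
  for (i = 0; i < W1_N / 4; i++)
  {
    const uint32_t c0 = (uint32_t)a[4 * i + 0];
    const uint32_t c1 = (uint32_t)a[4 * i + 1];
    const uint32_t c2 = (uint32_t)a[4 * i + 2];
    const uint32_t c3 = (uint32_t)a[4 * i + 3];
    /* Debug-only range check; illustrative, not part of any real API. */
    assert(c0 < 44 && c1 < 44 && c2 < 44 && c3 < 44);
    r[3 * i + 0] = (uint8_t)(c0 | (c1 << 6));        /* c0[5:0], c1[1:0] */
    r[3 * i + 1] = (uint8_t)((c1 >> 2) | (c2 << 4)); /* c1[5:2], c2[3:0] */
    r[3 * i + 2] = (uint8_t)((c2 >> 4) | (c3 << 2)); /* c2[5:4], c3[5:0] */
  }
}

/* GAMMA2 = (Q-1)/32 (ML-DSA-65/87): coefficients in [0, 15], 4 bits each.
 * Two coefficients are packed into one byte (128 bytes in total). */
static void w1_pack_4bit(uint8_t r[128], const int32_t a[W1_N])
{
  size_t i;
  for (i = 0; i < W1_N / 2; i++)
  {
    const uint32_t lo = (uint32_t)a[2 * i + 0];
    const uint32_t hi = (uint32_t)a[2 * i + 1];
    assert(lo < 16 && hi < 16);
    r[i] = (uint8_t)(lo | (hi << 4));
  }
}
```

Because ML-DSA-65 and ML-DSA-87 share the 4-bit variant, they are expected to show near-identical polyw1_pack timings, which matches the tables below.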

polyw1_pack component benchmarks

Intel Xeon 8375C (c6i.metal, no Turbo Boost, no SMT)

| | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 |
|---|---|---|---|
| C | 500 | 310 | 311 |
| AVX2 | 216 | 144 | 144 |
| Speedup | 2.3x | 2.2x | 2.2x |

Apple M1

| | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 |
|---|---|---|---|
| C | 203 | 49 | 49 |
| AArch64 | 32 | 22 | 22 |
| Speedup | 6.3x | 2.2x | 2.2x |

TODO:

  • Add CBMC proof
  • Add unit tests
  • Try AArch64 implementation
  • Create issues for HOL-Light proofs and conversion to asm

Integrate polyw1_pack AVX2 implementations for both GAMMA2 variants
into the native backend.

Signed-off-by: Matthias J. Kannwischer <matthias@kannwischer.eu>
@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 45691 cycles 45683 cycles 1.00
ML-DSA-44 sign 128992 cycles 131163 cycles 0.98
ML-DSA-44 verify 47012 cycles 47529 cycles 0.99
ML-DSA-65 keypair 80465 cycles 80463 cycles 1.00
ML-DSA-65 sign 214956 cycles 215738 cycles 1.00
ML-DSA-65 verify 79587 cycles 79737 cycles 1.00
ML-DSA-87 keypair 131152 cycles 131178 cycles 1.00
ML-DSA-87 sign 276231 cycles 277066 cycles 1.00
ML-DSA-87 verify 129895 cycles 129990 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (no-opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 112049 cycles 111974 cycles 1.00
ML-DSA-44 sign 403823 cycles 403601 cycles 1.00
ML-DSA-44 verify 119939 cycles 119892 cycles 1.00
ML-DSA-65 keypair 192134 cycles 192181 cycles 1.00
ML-DSA-65 sign 657108 cycles 657104 cycles 1.00
ML-DSA-65 verify 193869 cycles 193901 cycles 1.00
ML-DSA-87 keypair 318020 cycles 318040 cycles 1.00
ML-DSA-87 sign 837065 cycles 837047 cycles 1.00
ML-DSA-87 verify 323003 cycles 323045 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 34648 cycles 34818 cycles 1.00
ML-DSA-44 sign 115881 cycles 119743 cycles 0.97
ML-DSA-44 verify 37118 cycles 38134 cycles 0.97
ML-DSA-65 keypair 60681 cycles 60836 cycles 1.00
ML-DSA-65 sign 198566 cycles 200613 cycles 0.99
ML-DSA-65 verify 62513 cycles 62640 cycles 1.00
ML-DSA-87 keypair 93487 cycles 93373 cycles 1.00
ML-DSA-87 sign 236201 cycles 232798 cycles 1.01
ML-DSA-87 verify 95896 cycles 95570 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 93841 cycles 93709 cycles 1.00
ML-DSA-44 sign 333807 cycles 332609 cycles 1.00
ML-DSA-44 verify 99869 cycles 99635 cycles 1.00
ML-DSA-65 keypair 159923 cycles 160109 cycles 1.00
ML-DSA-65 sign 543699 cycles 544366 cycles 1.00
ML-DSA-65 verify 160683 cycles 160833 cycles 1.00
ML-DSA-87 keypair 266467 cycles 267045 cycles 1.00
ML-DSA-87 sign 705143 cycles 706279 cycles 1.00
ML-DSA-87 verify 270387 cycles 270100 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 113214 cycles 113201 cycles 1.00
ML-DSA-44 sign 350397 cycles 355543 cycles 0.99
ML-DSA-44 verify 116730 cycles 117896 cycles 0.99
ML-DSA-65 keypair 196335 cycles 196439 cycles 1.00
ML-DSA-65 sign 588222 cycles 588538 cycles 1.00
ML-DSA-65 verify 194453 cycles 194475 cycles 1.00
ML-DSA-87 keypair 322439 cycles 321909 cycles 1.00
ML-DSA-87 sign 751213 cycles 752725 cycles 1.00
ML-DSA-87 verify 319951 cycles 320145 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 68774 cycles 69001 cycles 1.00
ML-DSA-44 sign 179744 cycles 187631 cycles 0.96
ML-DSA-44 verify 67303 cycles 69172 cycles 0.97
ML-DSA-65 keypair 122437 cycles 119360 cycles 1.03
ML-DSA-65 sign 301955 cycles 299878 cycles 1.01
ML-DSA-65 verify 118088 cycles 115464 cycles 1.02
ML-DSA-87 keypair 203390 cycles 203890 cycles 1.00
ML-DSA-87 sign 390522 cycles 394779 cycles 0.99
ML-DSA-87 verify 195089 cycles 195702 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 57073 cycles 56454 cycles 1.01
ML-DSA-44 sign 175719 cycles 181513 cycles 0.97
ML-DSA-44 verify 60021 cycles 61053 cycles 0.98
ML-DSA-65 keypair 98321 cycles 98631 cycles 1.00
ML-DSA-65 sign 297249 cycles 298535 cycles 1.00
ML-DSA-65 verify 100096 cycles 100069 cycles 1.00
ML-DSA-87 keypair 152050 cycles 152650 cycles 1.00
ML-DSA-87 sign 353023 cycles 355109 cycles 0.99
ML-DSA-87 verify 152239 cycles 152994 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 68230 cycles 68148 cycles 1.00
ML-DSA-44 sign 196078 cycles 201830 cycles 0.97
ML-DSA-44 verify 69368 cycles 70787 cycles 0.98
ML-DSA-65 keypair 121376 cycles 121099 cycles 1.00
ML-DSA-65 sign 330550 cycles 331249 cycles 1.00
ML-DSA-65 verify 117750 cycles 117837 cycles 1.00
ML-DSA-87 keypair 198142 cycles 197912 cycles 1.00
ML-DSA-87 sign 426496 cycles 426817 cycles 1.00
ML-DSA-87 verify 194325 cycles 194367 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 212357 cycles 212585 cycles 1.00
ML-DSA-44 sign 759494 cycles 759285 cycles 1.00
ML-DSA-44 verify 228690 cycles 228959 cycles 1.00
ML-DSA-65 keypair 379923 cycles 380251 cycles 1.00
ML-DSA-65 sign 1252035 cycles 1251223 cycles 1.00
ML-DSA-65 verify 371571 cycles 372021 cycles 1.00
ML-DSA-87 keypair 604671 cycles 605353 cycles 1.00
ML-DSA-87 sign 1593513 cycles 1591234 cycles 1.00
ML-DSA-87 verify 618457 cycles 617441 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 134786 cycles 134688 cycles 1.00
ML-DSA-44 sign 523577 cycles 524187 cycles 1.00
ML-DSA-44 verify 147461 cycles 147201 cycles 1.00
ML-DSA-65 keypair 226346 cycles 226675 cycles 1.00
ML-DSA-65 sign 860567 cycles 859973 cycles 1.00
ML-DSA-65 verify 234837 cycles 234911 cycles 1.00
ML-DSA-87 keypair 372003 cycles 370452 cycles 1.00
ML-DSA-87 sign 1083875 cycles 1078410 cycles 1.01
ML-DSA-87 verify 384062 cycles 382956 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 42148 cycles 41562 cycles 1.01
ML-DSA-44 sign 129288 cycles 133651 cycles 0.97
ML-DSA-44 verify 43759 cycles 44169 cycles 0.99
ML-DSA-65 keypair 71975 cycles 72989 cycles 0.99
ML-DSA-65 sign 214109 cycles 220760 cycles 0.97
ML-DSA-65 verify 73284 cycles 74207 cycles 0.99
ML-DSA-87 keypair 107702 cycles 108105 cycles 1.00
ML-DSA-87 sign 248261 cycles 250082 cycles 0.99
ML-DSA-87 verify 109090 cycles 108427 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 157594 cycles 157373 cycles 1.00
ML-DSA-44 sign 549971 cycles 549788 cycles 1.00
ML-DSA-44 verify 169054 cycles 169220 cycles 1.00
ML-DSA-65 keypair 267930 cycles 267878 cycles 1.00
ML-DSA-65 sign 903155 cycles 903152 cycles 1.00
ML-DSA-65 verify 274249 cycles 274318 cycles 1.00
ML-DSA-87 keypair 447966 cycles 447643 cycles 1.00
ML-DSA-87 sign 1159788 cycles 1157310 cycles 1.00
ML-DSA-87 verify 457774 cycles 457942 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 128364 cycles 128172 cycles 1.00
ML-DSA-44 sign 447652 cycles 447244 cycles 1.00
ML-DSA-44 verify 138210 cycles 142135 cycles 0.97
ML-DSA-65 keypair 220785 cycles 220615 cycles 1.00
ML-DSA-65 sign 727254 cycles 726560 cycles 1.00
ML-DSA-65 verify 222808 cycles 223116 cycles 1.00
ML-DSA-87 keypair 364610 cycles 365048 cycles 1.00
ML-DSA-87 sign 926038 cycles 926588 cycles 1.00
ML-DSA-87 verify 372875 cycles 372428 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 120810 cycles 120331 cycles 1.00
ML-DSA-44 sign 447005 cycles 447521 cycles 1.00
ML-DSA-44 verify 130168 cycles 132880 cycles 0.98
ML-DSA-65 keypair 204763 cycles 205729 cycles 1.00
ML-DSA-65 sign 728023 cycles 728528 cycles 1.00
ML-DSA-65 verify 210330 cycles 211143 cycles 1.00
ML-DSA-87 keypair 337390 cycles 338699 cycles 1.00
ML-DSA-87 sign 922663 cycles 923705 cycles 1.00
ML-DSA-87 verify 348278 cycles 346629 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 72314 cycles 72302 cycles 1.00
ML-DSA-44 sign 205786 cycles 211874 cycles 0.97
ML-DSA-44 verify 73981 cycles 75647 cycles 0.98
ML-DSA-65 keypair 127445 cycles 127575 cycles 1.00
ML-DSA-65 sign 349694 cycles 350353 cycles 1.00
ML-DSA-65 verify 125303 cycles 125483 cycles 1.00
ML-DSA-87 keypair 205809 cycles 208020 cycles 0.99
ML-DSA-87 sign 448791 cycles 449002 cycles 1.00
ML-DSA-87 verify 205264 cycles 205683 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3 (no-opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 138660 cycles 138464 cycles 1.00
ML-DSA-44 sign 484010 cycles 484091 cycles 1.00
ML-DSA-44 verify 148504 cycles 156396 cycles 0.95
ML-DSA-65 keypair 241327 cycles 241147 cycles 1.00
ML-DSA-65 sign 792592 cycles 792223 cycles 1.00
ML-DSA-65 verify 240723 cycles 241092 cycles 1.00
ML-DSA-87 keypair 395470 cycles 396403 cycles 1.00
ML-DSA-87 sign 1013125 cycles 1012979 cycles 1.00
ML-DSA-87 verify 402895 cycles 402335 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton2

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 113609 cycles 113289 cycles 1.00
ML-DSA-44 sign 350629 cycles 355883 cycles 0.99
ML-DSA-44 verify 117092 cycles 117973 cycles 0.99
ML-DSA-65 keypair 196545 cycles 196446 cycles 1.00
ML-DSA-65 sign 587254 cycles 589191 cycles 1.00
ML-DSA-65 verify 194237 cycles 194679 cycles 1.00
ML-DSA-87 keypair 322301 cycles 322682 cycles 1.00
ML-DSA-87 sign 752294 cycles 752805 cycles 1.00
ML-DSA-87 verify 320094 cycles 320327 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 212986 cycles 213066 cycles 1.00
ML-DSA-44 sign 760407 cycles 760558 cycles 1.00
ML-DSA-44 verify 241293 cycles 233103 cycles 1.04
ML-DSA-65 keypair 380806 cycles 381034 cycles 1.00
ML-DSA-65 sign 1252121 cycles 1252511 cycles 1.00
ML-DSA-65 verify 372320 cycles 372570 cycles 1.00
ML-DSA-87 keypair 606317 cycles 606046 cycles 1.00
ML-DSA-87 sign 1593381 cycles 1593756 cycles 1.00
ML-DSA-87 verify 618121 cycles 617945 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Graviton2 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 verify 241293 cycles 233103 cycles 1.04

This comment was automatically generated by workflow using github-action-benchmark.
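The alert above can be reproduced by hand from the reported cycle counts. A minimal sketch, assuming the alert fires when current/previous exceeds the threshold (the helper name `exceeds_threshold` is hypothetical and this is not the code of github-action-benchmark):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true when the ratio current/previous is
 * strictly greater than the alert threshold. Lower cycle counts are better,
 * so a ratio above 1.0 means the current commit is slower. */
static bool exceeds_threshold(uint64_t current, uint64_t previous,
                              double threshold)
{
  if (previous == 0)
  {
    return false; /* no baseline, nothing to compare against */
  }
  return (double)current / (double)previous > threshold;
}
```

For the ML-DSA-44 verify row, 241293 / 233103 is roughly 1.035, which is above 1.03 and is reported as 1.04 after rounding to two decimals.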

@github-actions github-actions bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 827435 cycles 828343 cycles 1.00
ML-DSA-44 sign 3238214 cycles 3235012 cycles 1.00
ML-DSA-44 verify 921978 cycles 920749 cycles 1.00
ML-DSA-65 keypair 1412999 cycles 1413905 cycles 1.00
ML-DSA-65 sign 5347624 cycles 5341776 cycles 1.00
ML-DSA-65 verify 1477830 cycles 1478062 cycles 1.00
ML-DSA-87 keypair 2312761 cycles 2313582 cycles 1.00
ML-DSA-87 sign 6663968 cycles 6664057 cycles 1.00
ML-DSA-87 verify 2410302 cycles 2412445 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 278362 cycles 277272 cycles 1.00
ML-DSA-44 sign 798797 cycles 822468 cycles 0.97
ML-DSA-44 verify 277666 cycles 277832 cycles 1.00
ML-DSA-65 keypair 480516 cycles 475993 cycles 1.01
ML-DSA-65 sign 1349323 cycles 1333415 cycles 1.01
ML-DSA-65 verify 456855 cycles 458979 cycles 1.00
ML-DSA-87 keypair 817862 cycles 817627 cycles 1.00
ML-DSA-87 sign 1841380 cycles 1833605 cycles 1.00
ML-DSA-87 verify 788029 cycles 798022 cycles 0.99

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 464640 cycles 465283 cycles 1.00
ML-DSA-44 sign 2150220 cycles 2151007 cycles 1.00
ML-DSA-44 verify 550751 cycles 550792 cycles 1.00
ML-DSA-65 keypair 779027 cycles 780624 cycles 1.00
ML-DSA-65 sign 3514123 cycles 3517857 cycles 1.00
ML-DSA-65 verify 856201 cycles 854537 cycles 1.00
ML-DSA-87 keypair 1261706 cycles 1268967 cycles 0.99
ML-DSA-87 sign 4350624 cycles 4402745 cycles 0.99
ML-DSA-87 verify 1373405 cycles 1380067 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot commented Feb 21, 2026

CBMC Results (ML-DSA-65)

Full Results (175 proofs)
Proof Status Current Previous Change
**TOTAL** 2347s 2286s +2.7%
polyvecl_pointwise_acc_montgomery_c 246s 226s +9%
mld_attempt_signature_generation 206s 197s +5%
sign_verify_internal 183s 177s +3%
polyvec_matrix_expand 151s 145s +4%
rej_uniform_native 149s 144s +3%
poly_pointwise_montgomery_c 145s 138s +5%
mld_invntt_layer 124s 117s +6%
mld_ct_memcmp 86s 79s +9%
polyvec_matrix_expand_serial 66s 65s +2%
sign_signature_internal 53s 50s +6%
mld_ntt_layer 50s 44s +14%
keccak_squeezeblocks_x4 42s 42s +0%
mld_compute_t0_t1_tr_from_sk_components 27s 27s +0%
poly_chknorm_c 21s 16s +31%
rej_uniform 21s 21s +0%
fqmul 19s 18s +6%
polymat_permute_bitrev_to_custom 19s 18s +6%
polyveck_decompose 18s 16s +12%
rej_uniform_c 18s 19s -5%
polyt0_unpack 17s 17s +0%
poly_uniform_4x 15s 13s +15%
poly_uniform_eta_4x 15s 17s -12%
mld_ntt_butterfly_block 14s 13s +8%
mld_polyvecl_permute_bitrev_to_custom_native 13s 14s -7%
keccakf1600x4_permute_native 12s 14s -14%
polyvec_matrix_pointwise_montgomery 12s 14s -14%
polyveck_use_hint 12s 14s -14%
sign 12s 9s +33%
keccak_absorb_once_x4 11s 10s +10%
polyveck_invntt_tomont 11s 8s +38%
keccakf1600_permute_native 10s 9s +11%
mld_check_pct 10s 9s +11%
polyveck_ntt 10s 6s +67%
polyveck_power2round 9s 11s -18%
polyveck_reduce 9s 9s +0%
sign_pk_from_sk 9s 8s +12%
mld_sample_s1_s2 8s 4s +100%
poly_invntt_tomont_c 8s 8s +0%
polyveck_add 8s 9s -11%
polyveck_caddq 8s 8s +0%
polyveck_shiftl 8s 8s +0%
polyveck_sub 8s 7s +14%
keccakf1600_permute 7s 9s -22%
mld_prepare_domain_separation_prefix 7s 5s +40%
poly_decompose_c 7s 7s +0%
poly_ntt_c 7s 2s +250%
polyvecl_ntt 7s 8s -12%
mld_compute_pack_z 6s 7s -14%
poly_use_hint_native 6s 4s +50%
polyeta_unpack 6s 7s -14%
polyveck_unpack_t0 6s 3s +100%
polyvecl_uniform_gamma1_serial 6s 4s +50%
sign_open 6s 5s +20%
keccakf1600_extract_bytes (big endian) 5s 3s +67%
mld_h 5s 3s +67%
mld_sample_s1_s2_serial 5s 6s -17%
poly_add 5s 4s +25%
poly_caddq_c 5s 3s +67%
polyveck_make_hint 5s 5s +0%
polyveck_pack_t0 5s 3s +67%
polyveck_pointwise_poly_montgomery 5s 6s -17%
polyvecl_chknorm 5s 5s +0%
rej_eta_native 5s 4s +25%
sign_signature 5s 4s +25%
sign_signature_pre_hash_internal 5s 6s -17%
sign_verify 5s 4s +25%
unpack_sk 5s 5s +0%
fqscale 4s 4s +0%
intt_native_x86_64 4s 3s +33%
keccak_absorb 4s 5s -20%
keccak_init 4s 3s +33%
make_hint 4s 3s +33%
poly_caddq_native 4s 4s +0%
poly_chknorm 4s 3s +33%
poly_decompose_native 4s 3s +33%
poly_pointwise_montgomery 4s 5s -20%
poly_pointwise_montgomery_native 4s 2s +100%
poly_shiftl 4s 2s +100%
poly_uniform 4s 4s +0%
poly_uniform_eta 4s 3s +33%
poly_use_hint_c 4s 7s -43%
polyveck_chknorm 4s 5s -20%
polyveck_pack_w1 4s 6s -33%
polyveck_unpack_eta 4s 4s +0%
polyvecl_pointwise_acc_montgomery 4s 5s -20%
polyvecl_pointwise_acc_montgomery_native 4s 5s -20%
polyz_unpack 4s 2s +100%
polyz_unpack_c 4s 5s -20%
polyz_unpack_native 4s 3s +33%
power2round 4s 2s +100%
rej_eta_c 4s 3s +33%
sign_keypair 4s 3s +33%
sign_keypair_internal 4s 6s -33%
sign_signature_extmu 4s 4s +0%
sign_verify_extmu 4s 4s +0%
sign_verify_pre_hash_internal 4s 3s +33%
sign_verify_pre_hash_shake256 4s 6s -33%
unpack_hints 4s 5s -20%
unpack_sig 4s 4s +0%
keccak_finalize 3s 2s +50%
keccak_squeeze 3s 2s +50%
keccakf1600_xor_bytes 3s 3s +0%
keccakf1600x4_permute 3s 1s +200%
keccakf1600x4_xor_bytes 3s 2s +50%
mld_ct_cmask_neg_i32 3s 2s +50%
mld_ct_cmask_nonzero_u32 3s 2s +50%
mld_ct_cmask_nonzero_u8 3s 2s +50%
mld_ct_get_optblocker_u32 3s 4s -25%
mld_ct_get_optblocker_u8 3s 2s +50%
mld_value_barrier_u8 3s 3s +0%
montgomery_reduce 3s 2s +50%
pack_pk 3s 3s +0%
pack_sig_c_h 3s 2s +50%
poly_caddq 3s 3s +0%
poly_challenge 3s 4s -25%
poly_chknorm_native 3s 4s -25%
poly_decompose 3s 3s +0%
poly_invntt_tomont 3s 6s -50%
poly_invntt_tomont_native 3s 7s -57%
poly_ntt 3s 5s -40%
poly_power2round 3s 5s -40%
poly_sub 3s 5s -40%
poly_uniform_gamma1 3s 5s -40%
polyeta_pack 3s 2s +50%
polyt0_pack 3s 4s -25%
polyveck_pack_eta 3s 4s -25%
polyvecl_uniform_gamma1 3s 5s -40%
polyvecl_unpack_z 3s 2s +50%
polyw1_pack 3s 1s +200%
polyz_pack 3s 4s -25%
reduce32 3s 3s +0%
rej_eta 3s 4s -25%
shake128_absorb 3s 2s +50%
shake128_init 3s 1s +200%
shake128_release 3s 3s +0%
shake256x4_absorb_once 3s 4s -25%
sign_signature_pre_hash_shake256 3s 4s -25%
unpack_pk 3s 3s +0%
caddq 2s 3s -33%
decompose 2s 5s -60%
keccakf1600_xor_bytes (big endian) 2s 3s -33%
keccakf1600x4_extract_bytes 2s 1s +100%
mld_ct_abs_i32 2s 4s -50%
mld_ct_sel_int32 2s 3s -33%
mld_keccakf1600_extract_bytes 2s 2s +0%
ntt_native_x86_64 2s 3s -33%
pack_sig_z 2s 2s +0%
pack_sk 2s 2s +0%
poly_caddq_native_aarch64 2s 6s -67%
poly_make_hint 2s 3s -33%
poly_ntt_native 2s 3s -33%
poly_reduce 2s 5s -60%
poly_uniform_gamma1_4x 2s 6s -67%
polyt1_pack 2s 2s +0%
polyt1_unpack 2s 6s -67%
polyvecl_pack_eta 2s 4s -50%
polyvecl_permute_bitrev_to_custom 2s 3s -33%
polyvecl_unpack_eta 2s 5s -60%
shake128_finalize 2s 3s -33%
shake128x4_absorb_once 2s 4s -50%
shake128x4_squeezeblocks 2s 1s +100%
shake256 2s 2s +0%
shake256_absorb 2s 5s -60%
shake256_finalize 2s 5s -60%
shake256_init 2s 2s +0%
shake256_release 2s 1s +100%
shake256_squeeze 2s 2s +0%
sys_check_capability 2s 3s -33%
use_hint 2s 2s +0%
mld_ct_get_optblocker_i64 1s 1s +0%
mld_value_barrier_i64 1s 2s -50%
mld_value_barrier_u32 1s 3s -67%
poly_use_hint 1s 3s -67%
shake128_squeeze 1s 2s -50%
shake256x4_squeezeblocks 1s 3s -67%

oqs-bot commented Feb 21, 2026

CBMC Results (ML-DSA-44)

Full Results (175 proofs)
Proof Status Current Previous Change
**TOTAL** 2048s 2055s -0.3%
sign_verify_internal 253s 254s -0%
mld_attempt_signature_generation 228s 221s +3%
polyvecl_pointwise_acc_montgomery_c 211s 208s +1%
rej_uniform_native 142s 144s -1%
poly_pointwise_montgomery_c 130s 143s -9%
mld_ct_memcmp 80s 82s -2%
mld_invntt_layer 50s 50s +0%
sign_signature_internal 45s 45s +0%
keccak_squeezeblocks_x4 43s 44s -2%
mld_ntt_layer 43s 44s -2%
poly_invntt_tomont_c 40s 39s +3%
rej_uniform 20s 20s +0%
fqmul 18s 19s -5%
rej_uniform_c 17s 20s -15%
poly_uniform_eta_4x 16s 19s -16%
mld_ntt_butterfly_block 15s 13s +15%
poly_uniform_4x 15s 13s +15%
polymat_permute_bitrev_to_custom 15s 16s -6%
polyt0_unpack 15s 15s +0%
mld_compute_t0_t1_tr_from_sk_components 14s 15s -7%
polyvec_matrix_expand 14s 16s -12%
keccakf1600x4_permute_native 13s 14s -7%
mld_polyvecl_permute_bitrev_to_custom_native 13s 14s -7%
polyeta_unpack 13s 12s +8%
poly_chknorm_c 12s 12s +0%
polyz_unpack_c 12s 12s +0%
keccak_absorb_once_x4 11s 9s +22%
keccakf1600_permute 9s 8s +12%
mld_check_pct 9s 6s +50%
polyveck_add 9s 7s +29%
keccakf1600_permute_native 8s 6s +33%
polyveck_decompose 8s 7s +14%
polyvec_matrix_expand_serial 7s 7s +0%
polyvec_matrix_pointwise_montgomery 7s 6s +17%
polyveck_pointwise_poly_montgomery 7s 5s +40%
sign_keypair 7s 2s +250%
sign_verify_pre_hash_internal 7s 4s +75%
mld_prepare_domain_separation_prefix 6s 5s +20%
poly_uniform_eta 6s 5s +20%
polyt0_pack 6s 4s +50%
polyveck_caddq 6s 8s -25%
polyveck_invntt_tomont 6s 6s +0%
polyveck_ntt 6s 4s +50%
polyveck_reduce 6s 5s +20%
polyvecl_chknorm 6s 6s +0%
sign 6s 5s +20%
sign_verify 6s 3s +100%
unpack_hints 6s 5s +20%
decompose 5s 3s +67%
fqscale 5s 1s +400%
keccak_absorb 5s 8s -38%
keccakf1600x4_extract_bytes 5s 3s +67%
mld_compute_pack_z 5s 6s -17%
mld_h 5s 6s -17%
mld_sample_s1_s2_serial 5s 3s +67%
poly_chknorm_native 5s 6s -17%
poly_decompose_c 5s 5s +0%
poly_ntt_c 5s 2s +150%
poly_use_hint_c 5s 5s +0%
polyeta_pack 5s 3s +67%
polyveck_chknorm 5s 3s +67%
polyveck_power2round 5s 6s -17%
polyveck_use_hint 5s 5s +0%
sign_verify_extmu 5s 2s +150%
sign_verify_pre_hash_shake256 5s 3s +67%
keccak_finalize 4s 4s +0%
keccak_init 4s 3s +33%
keccak_squeeze 4s 5s -20%
keccakf1600_xor_bytes (big endian) 4s 2s +100%
keccakf1600x4_permute 4s 3s +33%
mld_ct_abs_i32 4s 2s +100%
mld_sample_s1_s2 4s 4s +0%
mld_value_barrier_u8 4s 1s +300%
montgomery_reduce 4s 2s +100%
ntt_native_x86_64 4s 3s +33%
poly_challenge 4s 3s +33%
poly_pointwise_montgomery_native 4s 3s +33%
poly_power2round 4s 5s -20%
poly_uniform 4s 5s -20%
polyt1_unpack 4s 4s +0%
polyveck_sub 4s 4s +0%
polyveck_unpack_eta 4s 4s +0%
polyvecl_ntt 4s 6s -33%
polyvecl_pack_eta 4s 3s +33%
polyvecl_permute_bitrev_to_custom 4s 2s +100%
polyvecl_pointwise_acc_montgomery 4s 4s +0%
polyvecl_pointwise_acc_montgomery_native 4s 3s +33%
polyz_unpack_native 4s 3s +33%
reduce32 4s 3s +33%
rej_eta_c 4s 4s +0%
rej_eta_native 4s 4s +0%
shake128x4_squeezeblocks 4s 2s +100%
shake256 4s 3s +33%
shake256_release 4s 3s +33%
shake256x4_absorb_once 4s 4s +0%
sign_open 4s 6s -33%
sign_pk_from_sk 4s 6s -33%
sign_signature_extmu 4s 7s -43%
sign_signature_pre_hash_internal 4s 4s +0%
unpack_sk 4s 4s +0%
caddq 3s 3s +0%
intt_native_x86_64 3s 3s +0%
keccakf1600_extract_bytes (big endian) 3s 3s +0%
mld_ct_cmask_neg_i32 3s 2s +50%
mld_ct_cmask_nonzero_u32 3s 2s +50%
mld_ct_get_optblocker_i64 3s 2s +50%
mld_ct_get_optblocker_u8 3s 1s +200%
mld_keccakf1600_extract_bytes 3s 6s -50%
mld_value_barrier_i64 3s 4s -25%
pack_sig_c_h 3s 4s -25%
pack_sig_z 3s 2s +50%
pack_sk 3s 3s +0%
poly_add 3s 4s -25%
poly_caddq 3s 3s +0%
poly_caddq_native_aarch64 3s 3s +0%
poly_decompose_native 3s 3s +0%
poly_invntt_tomont 3s 2s +50%
poly_invntt_tomont_native 3s 4s -25%
poly_pointwise_montgomery 3s 2s +50%
poly_shiftl 3s 4s -25%
poly_uniform_gamma1_4x 3s 4s -25%
poly_use_hint_native 3s 3s +0%
polyt1_pack 3s 1s +200%
polyveck_make_hint 3s 4s -25%
polyveck_pack_t0 3s 3s +0%
polyveck_pack_w1 3s 4s -25%
polyveck_shiftl 3s 5s -40%
polyveck_unpack_t0 3s 4s -25%
polyvecl_uniform_gamma1 3s 2s +50%
polyvecl_uniform_gamma1_serial 3s 3s +0%
polyw1_pack 3s 2s +50%
polyz_unpack 3s 5s -40%
rej_eta 3s 1s +200%
shake128_init 3s 2s +50%
shake256_absorb 3s 2s +50%
shake256_init 3s 3s +0%
shake256_squeeze 3s 5s -40%
sign_keypair_internal 3s 4s -25%
sign_signature 3s 6s -50%
sys_check_capability 3s 4s -25%
unpack_pk 3s 3s +0%
use_hint 3s 2s +50%
make_hint 2s 3s -33%
mld_ct_cmask_nonzero_u8 2s 4s -50%
mld_value_barrier_u32 2s 4s -50%
pack_pk 2s 3s -33%
poly_caddq_c 2s 3s -33%
poly_caddq_native 2s 2s +0%
poly_chknorm 2s 4s -50%
poly_decompose 2s 4s -50%
poly_make_hint 2s 4s -50%
poly_ntt_native 2s 2s +0%
poly_sub 2s 3s -33%
poly_uniform_gamma1 2s 4s -50%
poly_use_hint 2s 2s +0%
polyvecl_unpack_z 2s 5s -60%
polyz_pack 2s 2s +0%
shake128_absorb 2s 1s +100%
shake128_release 2s 3s -33%
shake128_squeeze 2s 2s +0%
shake256_finalize 2s 4s -50%
shake256x4_squeezeblocks 2s 2s +0%
sign_signature_pre_hash_shake256 2s 5s -60%
unpack_sig 2s 3s -33%
keccakf1600_xor_bytes 1s 2s -50%
keccakf1600x4_xor_bytes 1s 2s -50%
mld_ct_get_optblocker_u32 1s 1s +0%
mld_ct_sel_int32 1s 4s -75%
poly_ntt 1s 3s -67%
poly_reduce 1s 4s -75%
polyveck_pack_eta 1s 3s -67%
polyvecl_unpack_eta 1s 2s -50%
power2round 1s 2s -50%
shake128_finalize 1s 2s -50%
shake128x4_absorb_once 1s 4s -75%

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 231675 cycles 223693 cycles 1.04
ML-DSA-44 sign 637708 cycles 608242 cycles 1.05
ML-DSA-44 verify 228579 cycles 221112 cycles 1.03
ML-DSA-65 keypair 412694 cycles 394259 cycles 1.05
ML-DSA-65 sign 1064124 cycles 1015180 cycles 1.05
ML-DSA-65 verify 390621 cycles 372405 cycles 1.05
ML-DSA-87 keypair 682821 cycles 653922 cycles 1.04
ML-DSA-87 sign 1412962 cycles 1363561 cycles 1.04
ML-DSA-87 verify 668826 cycles 637673 cycles 1.05

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 231675 cycles 223693 cycles 1.04
ML-DSA-44 sign 637708 cycles 608242 cycles 1.05
ML-DSA-44 verify 228579 cycles 221112 cycles 1.03
ML-DSA-65 keypair 412694 cycles 394259 cycles 1.05
ML-DSA-65 sign 1064124 cycles 1015180 cycles 1.05
ML-DSA-65 verify 390621 cycles 372405 cycles 1.05
ML-DSA-87 keypair 682821 cycles 653922 cycles 1.04
ML-DSA-87 sign 1412962 cycles 1363561 cycles 1.04
ML-DSA-87 verify 668826 cycles 637673 cycles 1.05

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot commented Feb 21, 2026

CBMC Results (ML-DSA-87)

Full Results (175 proofs)
Proof Status Current Previous Change
**TOTAL** 2628s 2449s +7.3%
sign_verify_internal 376s 353s +7%
mld_attempt_signature_generation 255s 227s +12%
polyvecl_pointwise_acc_montgomery_c 195s 165s +18%
polyvec_matrix_expand 164s 153s +7%
poly_pointwise_montgomery_c 155s 128s +21%
rej_uniform_native 151s 139s +9%
mld_invntt_layer 121s 114s +6%
polyvec_matrix_expand_serial 114s 110s +4%
mld_ct_memcmp 86s 74s +16%
mld_ntt_layer 47s 44s +7%
sign_signature_internal 47s 46s +2%
keccak_squeezeblocks_x4 42s 42s +0%
mld_compute_t0_t1_tr_from_sk_components 26s 25s +4%
polymat_permute_bitrev_to_custom 26s 24s +8%
fqmul 23s 18s +28%
rej_uniform 22s 21s +5%
poly_uniform_eta_4x 20s 17s +18%
rej_uniform_c 18s 16s +12%
poly_chknorm_c 17s 17s +0%
poly_uniform_4x 16s 17s -6%
polyveck_add 15s 13s +15%
polyt0_unpack 14s 17s -18%
polyvec_matrix_pointwise_montgomery 14s 12s +17%
polyveck_power2round 14s 14s +0%
keccakf1600x4_permute_native 13s 12s +8%
polyeta_unpack 13s 13s +0%
mld_ntt_butterfly_block 12s 13s -8%
keccak_absorb_once_x4 11s 10s +10%
polyveck_reduce 11s 13s -15%
polyveck_use_hint 11s 9s +22%
sign 10s 7s +43%
mld_check_pct 9s 7s +29%
poly_invntt_tomont_c 9s 9s +0%
polyveck_pointwise_poly_montgomery 9s 8s +12%
mld_compute_pack_z 8s 7s +14%
mld_sample_s1_s2 8s 6s +33%
mld_sample_s1_s2_serial 8s 6s +33%
poly_decompose_c 8s 7s +14%
polyveck_chknorm 8s 6s +33%
polyveck_ntt 8s 8s +0%
polyveck_shiftl 8s 6s +33%
polyvecl_ntt 8s 11s -27%
keccakf1600_permute 7s 7s +0%
keccakf1600_permute_native 7s 10s -30%
mld_polyvecl_permute_bitrev_to_custom_native 7s 8s -12%
polyveck_caddq 7s 7s +0%
polyveck_decompose 7s 6s +17%
polyveck_invntt_tomont 7s 10s -30%
polyveck_sub 7s 7s +0%
sign_pk_from_sk 7s 9s -22%
sign_signature 7s 5s +40%
sign_signature_pre_hash_internal 7s 5s +40%
unpack_sig 7s 5s +40%
keccak_absorb 6s 7s -14%
polyz_unpack_native 6s 4s +50%
sign_keypair_internal 6s 6s +0%
mld_ct_sel_int32 5s 2s +150%
mld_prepare_domain_separation_prefix 5s 3s +67%
ntt_native_x86_64 5s 4s +25%
poly_caddq_c 5s 3s +67%
poly_challenge 5s 5s +0%
poly_uniform_eta 5s 4s +25%
poly_uniform_gamma1 5s 3s +67%
polyt0_pack 5s 5s +0%
polyveck_make_hint 5s 5s +0%
polyveck_pack_t0 5s 2s +150%
polyveck_unpack_t0 5s 6s -17%
polyvecl_unpack_eta 5s 3s +67%
polyvecl_unpack_z 5s 3s +67%
rej_eta 5s 2s +150%
rej_eta_c 5s 4s +25%
shake128_release 5s 3s +67%
sign_keypair 5s 3s +67%
sign_signature_pre_hash_shake256 5s 4s +25%
sign_verify_extmu 5s 3s +67%
unpack_hints 5s 4s +25%
unpack_sk 5s 6s -17%
keccak_init 4s 3s +33%
keccakf1600_xor_bytes (big endian) 4s 3s +33%
keccakf1600x4_extract_bytes 4s 2s +100%
mld_ct_abs_i32 4s 2s +100%
mld_h 4s 2s +100%
pack_sig_z 4s 3s +33%
pack_sk 4s 3s +33%
poly_add 4s 4s +0%
poly_caddq 4s 2s +100%
poly_decompose 4s 3s +33%
poly_decompose_native 4s 5s -20%
poly_make_hint 4s 2s +100%
poly_uniform 4s 6s -33%
polyt1_pack 4s 2s +100%
polyveck_unpack_eta 4s 3s +33%
polyvecl_chknorm 4s 4s +0%
polyvecl_pack_eta 4s 3s +33%
polyvecl_pointwise_acc_montgomery 4s 4s +0%
polyvecl_uniform_gamma1 4s 4s +0%
polyvecl_uniform_gamma1_serial 4s 7s -43%
polyz_unpack 4s 3s +33%
polyz_unpack_c 4s 5s -20%
rej_eta_native 4s 5s -20%
shake256x4_absorb_once 4s 2s +100%
sign_open 4s 6s -33%
sign_signature_extmu 4s 5s -20%
sign_verify_pre_hash_internal 4s 5s -20%
sign_verify_pre_hash_shake256 4s 4s +0%
caddq 3s 4s -25%
intt_native_x86_64 3s 5s -40%
keccakf1600_xor_bytes 3s 2s +50%
mld_ct_cmask_neg_i32 3s 1s +200%
mld_ct_cmask_nonzero_u8 3s 3s +0%
mld_ct_get_optblocker_i64 3s 3s +0%
mld_ct_get_optblocker_u32 3s 2s +50%
pack_sig_c_h 3s 3s +0%
poly_caddq_native 3s 2s +50%
poly_caddq_native_aarch64 3s 3s +0%
poly_chknorm_native 3s 2s +50%
poly_invntt_tomont 3s 2s +50%
poly_ntt_c 3s 1s +200%
poly_ntt_native 3s 6s -50%
poly_pointwise_montgomery 3s 4s -25%
poly_power2round 3s 3s +0%
poly_reduce 3s 4s -25%
poly_shiftl 3s 2s +50%
poly_sub 3s 3s +0%
poly_uniform_gamma1_4x 3s 6s -50%
polyt1_unpack 3s 3s +0%
polyveck_pack_eta 3s 2s +50%
polyvecl_pointwise_acc_montgomery_native 3s 3s +0%
polyw1_pack 3s 2s +50%
reduce32 3s 4s -25%
shake128_finalize 3s 3s +0%
shake256 3s 4s -25%
shake256_finalize 3s 3s +0%
shake256_release 3s 3s +0%
shake256_squeeze 3s 2s +50%
sign_verify 3s 2s +50%
unpack_pk 3s 6s -50%
use_hint 3s 3s +0%
decompose 2s 4s -50%
fqscale 2s 5s -60%
keccak_finalize 2s 2s +0%
keccak_squeeze 2s 4s -50%
keccakf1600x4_permute 2s 2s +0%
keccakf1600x4_xor_bytes 2s 2s +0%
mld_ct_get_optblocker_u8 2s 2s +0%
mld_value_barrier_i64 2s 3s -33%
mld_value_barrier_u32 2s 2s +0%
mld_value_barrier_u8 2s 2s +0%
montgomery_reduce 2s 3s -33%
pack_pk 2s 4s -50%
poly_chknorm 2s 2s +0%
poly_ntt 2s 3s -33%
poly_pointwise_montgomery_native 2s 3s -33%
poly_use_hint 2s 3s -33%
poly_use_hint_native 2s 4s -50%
polyeta_pack 2s 4s -50%
polyveck_pack_w1 2s 4s -50%
polyvecl_permute_bitrev_to_custom 2s 2s +0%
polyz_pack 2s 2s +0%
power2round 2s 4s -50%
shake128_absorb 2s 2s +0%
shake128_init 2s 2s +0%
shake128_squeeze 2s 5s -60%
shake128x4_absorb_once 2s 4s -50%
shake128x4_squeezeblocks 2s 2s +0%
shake256_absorb 2s 3s -33%
shake256x4_squeezeblocks 2s 1s +100%
sys_check_capability 2s 3s -33%
keccakf1600_extract_bytes (big endian) 1s 3s -67%
make_hint 1s 5s -80%
mld_ct_cmask_nonzero_u32 1s 4s -75%
mld_keccakf1600_extract_bytes 1s 1s +0%
poly_invntt_tomont_native 1s 2s -50%
poly_use_hint_c 1s 4s -75%
shake256_init 1s 2s -50%

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-44 keypair 313397 cycles 321433 cycles 0.97
ML-DSA-44 sign 1215690 cycles 1202861 cycles 1.01
ML-DSA-44 verify 343963 cycles 340204 cycles 1.01
ML-DSA-65 keypair 572366 cycles 569351 cycles 1.01
ML-DSA-65 sign 2038364 cycles 1955934 cycles 1.04
ML-DSA-65 verify 548523 cycles 548845 cycles 1.00
ML-DSA-87 keypair 908267 cycles 885828 cycles 1.03
ML-DSA-87 sign 2517161 cycles 2512147 cycles 1.00
ML-DSA-87 verify 925719 cycles 902578 cycles 1.03

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 885bca1 Previous: 0b1c536 Ratio
ML-DSA-65 sign 2038364 cycles 1955934 cycles 1.04

This comment was automatically generated by workflow using github-action-benchmark.
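
The alert condition can be read as a simple ratio test. The sketch below assumes the ratio is current cycles divided by previous cycles, and that the comparison uses the unrounded ratio against the 1.03 threshold. The function name `exceeds_threshold` is hypothetical and is not part of github-action-benchmark.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 when current/previous is strictly
 * greater than the threshold, 0 otherwise. Assumes the comparison is
 * done on the unrounded ratio. */
static int exceeds_threshold(uint64_t current, uint64_t previous,
                             double threshold)
{
  if (previous == 0)
  {
    return 0; /* no meaningful ratio without a previous result */
  }
  return ((double)current / (double)previous) > threshold;
}
```

Under these assumptions, ML-DSA-65 sign (2038364 vs. 1955934 cycles, ratio about 1.042) is flagged, while ML-DSA-87 keypair (908267 vs. 885828 cycles, ratio about 1.025, displayed as 1.03) is not, which matches the single row in the alert above.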

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 4b4c152 Previous: 0b1c536 Ratio
ML-DSA-87 sign 1900701 cycles 1833605 cycles 1.04

This comment was automatically generated by workflow using github-action-benchmark.

Add AArch64 assembly implementations of polyw1_pack for both GAMMA2
variants using TBL-based byte extraction from 32-bit coefficient
lanes.

Signed-off-by: Matthias J. Kannwischer <matthias@kannwischer.eu>
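
For orientation, here is a scalar sketch of what the two polyw1_pack variants compute. It assumes the standard ML-DSA w1 bit packing: 6 bits per coefficient for GAMMA2 = (Q-1)/88 and 4 bits per coefficient for GAMMA2 = (Q-1)/32. The names `MLD_N`, `polyw1_pack_88_sketch`, and `polyw1_pack_32_sketch` are hypothetical and do not come from this PR. The sketch does not show the TBL-based byte extraction used in the assembly.

```c
#include <stddef.h>
#include <stdint.h>

#define MLD_N 256 /* coefficients per polynomial (assumed name) */

/* GAMMA2 = (Q-1)/88 (ML-DSA-44): coefficients are in [0, 43], so each
 * fits in 6 bits. Four coefficients are packed into three bytes;
 * the output is 192 bytes. */
static void polyw1_pack_88_sketch(uint8_t r[192], const int32_t a[MLD_N])
{
  for (size_t i = 0; i < MLD_N / 4; i++)
  {
    uint32_t c0 = (uint32_t)a[4 * i + 0];
    uint32_t c1 = (uint32_t)a[4 * i + 1];
    uint32_t c2 = (uint32_t)a[4 * i + 2];
    uint32_t c3 = (uint32_t)a[4 * i + 3];
    r[3 * i + 0] = (uint8_t)(c0 | (c1 << 6));
    r[3 * i + 1] = (uint8_t)((c1 >> 2) | (c2 << 4));
    r[3 * i + 2] = (uint8_t)((c2 >> 4) | (c3 << 2));
  }
}

/* GAMMA2 = (Q-1)/32 (ML-DSA-65/ML-DSA-87): coefficients are in [0, 15],
 * so each fits in 4 bits. Two coefficients are packed per byte;
 * the output is 128 bytes. */
static void polyw1_pack_32_sketch(uint8_t r[128], const int32_t a[MLD_N])
{
  for (size_t i = 0; i < MLD_N / 2; i++)
  {
    uint32_t lo = (uint32_t)a[2 * i + 0];
    uint32_t hi = (uint32_t)a[2 * i + 1];
    r[i] = (uint8_t)(lo | (hi << 4));
  }
}
```

The 4-bit packing is a uniform nibble merge, while the 6-bit packing spreads coefficients across byte boundaries with three different shift patterns per group of four.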
@mkannwischer mkannwischer changed the title x86_64: Add AVX2 polyw1_pack to x86_64 native backend x86_64/AArch64: Add AVX2/Neon polyw1_pack to x86_64 native backend Feb 21, 2026
@mkannwischer
Contributor Author

The scheme-level benchmarks suggest we should implement the GAMMA2 = (Q-1)/88 variant (ML-DSA-44), but not the GAMMA2 = (Q-1)/32 variant (ML-DSA-65/ML-DSA-87). It makes sense that the 88 variant is harder to auto-vectorize, but it feels a little inconsistent to implement only one of them.

WDYT @hanno-becker?
