@RaduBerinde
Contributor

Each key maps to three indexes. During peeling, one of them is already known,
so we rearrange the math and use bit tricks to derive the other two more
efficiently.
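
For readers unfamiliar with the layout, here is a minimal sketch of the general idea, not the code in this PR. It assumes the usual binary fuse mapping (three slots in consecutive segments, power-of-two segment length); the `layout` type, its fields, and the `indexes` and `others` helpers are hypothetical names chosen for illustration. Because the XOR only touches the low bits and the segment offset only touches the high bits, both can be undone from any one known index without repeating the 128-bit multiplication.

```go
package main

import (
	"fmt"
	"math/bits"
)

// layout holds hypothetical filter parameters.
// segmentLength is assumed to be a power of two, and
// segmentCountLength a multiple of segmentLength.
type layout struct {
	segmentLength      uint32
	segmentCountLength uint32
}

// indexes maps a 64-bit hash to three slots in consecutive segments.
func (l layout) indexes(hash uint64) [3]uint32 {
	mask := l.segmentLength - 1
	hi, _ := bits.Mul64(hash, uint64(l.segmentCountLength))
	h0 := uint32(hi)
	h1 := (h0 + l.segmentLength) ^ (uint32(hash>>18) & mask)
	h2 := (h0 + 2*l.segmentLength) ^ (uint32(hash) & mask)
	return [3]uint32{h0, h1, h2}
}

// others derives the two remaining indexes from the known index at
// position found (0, 1, or 2), skipping the multiplication.
func (l layout) others(hash uint64, known uint32, found int) (uint32, uint32) {
	mask := l.segmentLength - 1
	x := [3]uint32{0, uint32(hash>>18) & mask, uint32(hash) & mask}
	// Undo the low-bit XOR, then the segment offset, to recover h0.
	h0 := (known ^ x[found]) - uint32(found)*l.segmentLength
	var out [2]uint32
	for i := 1; i <= 2; i++ {
		k := (found + i) % 3
		out[i-1] = (h0 + uint32(k)*l.segmentLength) ^ x[k]
	}
	return out[0], out[1]
}

func main() {
	l := layout{segmentLength: 1024, segmentCountLength: 64 * 1024}
	hash := uint64(0x9E3779B97F4A7C15)
	all := l.indexes(hash)
	for found := 0; found < 3; found++ {
		a, b := l.others(hash, all[found], found)
		match := a == all[(found+1)%3] && b == all[(found+2)%3]
		fmt.Printf("found=%d derived=(%d, %d) match=%v\n", found, a, b, match)
	}
}
```

The sketch trades one wide multiplication for an XOR and a subtraction per peeled key; whether that matches the savings measured below is not something it demonstrates.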

Apple M1:
```
name                                old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=10000-10         43.8 ± 2%      50.3 ± 3%  +14.88%  (p=0.000 n=8+9)
BinaryFusePopulate/8/n=100000-10        38.6 ± 3%      41.3 ± 1%   +7.09%  (p=0.000 n=9+8)
BinaryFusePopulate/8/n=1000000-10       35.0 ± 4%      36.5 ± 7%   +4.12%  (p=0.013 n=9+10)
BinaryFusePopulate/16/n=10000-10        48.6 ± 4%      48.5 ± 6%     ~     (p=1.000 n=10+10)
BinaryFusePopulate/16/n=100000-10       38.0 ± 3%      41.1 ± 1%   +8.35%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=1000000-10      33.8 ± 5%      36.6 ± 2%   +8.14%  (p=0.000 n=10+10)
```

GCE N4D (AMD Turin):
```
name                               old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=10000-8         53.2 ± 3%      57.1 ± 1%   +7.46%  (p=0.000 n=10+10)
BinaryFusePopulate/8/n=100000-8        33.0 ± 0%      37.5 ± 1%  +13.38%  (p=0.000 n=10+10)
BinaryFusePopulate/8/n=1000000-8       28.5 ± 2%      31.8 ± 2%  +11.59%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=10000-8        53.1 ± 1%      56.2 ± 1%   +5.93%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=100000-8       31.8 ± 1%      37.3 ± 1%  +17.35%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=1000000-8      27.5 ± 1%      30.9 ± 1%  +12.34%  (p=0.000 n=10+10)
```

GCE C4 (Intel Emerald Rapids, turbo boost capped at "all core" max):
```
name                               old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=10000-8         29.2 ± 1%      32.2 ± 1%  +10.00%  (p=0.000 n=10+10)
BinaryFusePopulate/8/n=100000-8        27.0 ± 3%      29.8 ± 5%  +10.22%  (p=0.000 n=10+10)
BinaryFusePopulate/8/n=1000000-8       25.6 ± 3%      28.2 ± 5%  +10.27%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=10000-8        28.9 ± 1%      32.0 ± 1%  +10.84%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=100000-8       26.2 ± 1%      28.8 ± 3%  +10.05%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=1000000-8      24.8 ± 2%      26.9 ± 2%   +8.37%  (p=0.000 n=10+10)
```

GCE C4A (Google's Axion ARM64):
```
name                               old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=10000-8         45.1 ± 1%      45.1 ± 1%    ~     (p=0.511 n=9+10)
BinaryFusePopulate/8/n=100000-8        39.8 ± 1%      39.4 ± 1%  -0.79%  (p=0.018 n=9+10)
BinaryFusePopulate/8/n=1000000-8       33.9 ± 3%      34.2 ± 3%    ~     (p=0.363 n=10+10)
BinaryFusePopulate/16/n=10000-8        44.0 ± 1%      44.7 ± 1%  +1.54%  (p=0.000 n=9+10)
BinaryFusePopulate/16/n=100000-8       37.4 ± 1%      38.4 ± 1%  +2.75%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=1000000-8      30.9 ± 5%      32.4 ± 1%  +4.84%  (p=0.000 n=10+10)
```
@lemire
Member

lemire commented Jan 15, 2026

This is on my to-do list to review.
