Increase performance for fitness_score_cardinality and genes_cardinality #10

Open
tomtomwombat wants to merge 1 commit into basvanwesting:main from tomtomwombat:cardinality-perf

@tomtomwombat

Cardinality estimation in calls to fitness_score_cardinality and genes_cardinality is a non-trivial performance bottleneck.
This PR replaces cardinality-estimator with hyperloglockless to speed up those calls. I chose foldhash as the hasher because it's fast (especially for small inputs) and because cardinality estimation doesn't need to be deterministic (correct me if I'm wrong on this! We can use a different hasher otherwise).

HyperLogLog::new(12) uses the same memory as CardinalityEstimator::<u64>::new(): 2^12 bytes.
Estimation accuracy is unchanged for small cardinalities and improves for cardinalities larger than 10^7, where cardinality-estimator no longer estimates accurately (though such large cardinalities may be outside your use case). You can find more performance and accuracy comparisons here.
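To make the precision and memory figures concrete, here is a minimal, simplified HyperLogLog sketch using only the Rust standard library. It is not the implementation used by hyperloglockless or cardinality-estimator: the type `SketchHll`, the helper `estimate_distinct`, the one-byte-per-register layout, and the classic bias constant with a linear-counting fallback are all assumptions made for illustration. It uses std's randomly seeded `RandomState` in place of foldhash, which also shows why a non-deterministic hasher is acceptable when only an approximate count is needed.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash};

/// Simplified HyperLogLog (illustrative only, not any crate's real implementation).
struct SketchHll {
    precision: u32,
    registers: Vec<u8>,  // one byte per register: 2^precision bytes in total
    hasher: RandomState, // randomly seeded, so estimates can differ slightly between runs
}

impl SketchHll {
    fn new(precision: u32) -> Self {
        // The bias constant used in `count` assumes at least 128 registers.
        assert!((7..=16).contains(&precision));
        Self {
            precision,
            registers: vec![0; 1 << precision],
            hasher: RandomState::new(),
        }
    }

    /// Bytes used by the register array.
    fn memory_bytes(&self) -> usize {
        self.registers.len()
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        let hash = self.hasher.hash_one(item);
        // The top `precision` bits select a register.
        let index = (hash >> (64 - self.precision)) as usize;
        // Rank = 1-based position of the first set bit in the remaining bits.
        let rest = hash << self.precision;
        let rank = (rest.leading_zeros().min(64 - self.precision) + 1) as u8;
        self.registers[index] = self.registers[index].max(rank);
    }

    fn count(&self) -> f64 {
        let m = self.registers.len() as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count() as f64;
        if raw <= 2.5 * m && zeros > 0.0 {
            // Small-range correction (linear counting), which covers
            // cardinalities far below the register count.
            m * (m / zeros).ln()
        } else {
            raw
        }
    }
}

/// Hypothetical helper: estimate how many distinct scores a slice contains.
fn estimate_distinct(scores: &[i64]) -> f64 {
    let mut hll = SketchHll::new(12);
    for score in scores {
        hll.insert(score);
    }
    hll.count()
}
```

With precision 12 the sketch holds 2^12 = 4096 one-byte registers, matching the 2^12 bytes quoted above. Calling `estimate_distinct` on a slice with a few hundred distinct values should return a value close to, but generally not exactly equal to, the true distinct count.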

In addition, hyperloglockless has considerably fewer dependencies than cardinality-estimator.

The benchmarks below show the change relative to the previous implementation (other benchmarks are not affected):

     Running benches\evolve.rs (target\release\deps\evolve-3edab9da1a4bc46e.exe)
Gnuplot not found, using plotters backend
evolve/binary-100-pop100-gen100
                        time:   [1.3711 ms 1.3717 ms 1.3723 ms]
                        change: [-9.0403% -8.7943% -8.5330%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe
evolve/list-100-pop100-gen100
                        time:   [1.0351 ms 1.0357 ms 1.0361 ms]
                        change: [-6.8630% -6.4940% -6.0979%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
mutates-pop1000/MultiGeneDynamic(MultiGeneDynamic { _phantom: PhantomData<genetic_algorithm::genotyp...
                        time:   [13.012 µs 13.108 µs 13.203 µs]
                        thrpt:  [75.741 Melem/s 76.289 Melem/s 76.854 Melem/s]
                 change:
                        time:   [-13.408% -7.1293% -2.4340%] (p = 0.01 < 0.05)
                        thrpt:  [+2.4948% +7.6766% +15.484%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
     Running benches\population.rs (target\release\deps\population-76b56ff9d5372789.exe)
Gnuplot not found, using plotters backend
population/fitness_score_cardinality (known score), low/100
                        time:   [1.4007 µs 1.4348 µs 1.4741 µs]
                        thrpt:  [67.837 Melem/s 69.695 Melem/s 71.395 Melem/s]
                 change:
                        time:   [-23.952% -21.707% -19.155%] (p = 0.00 < 0.05)
                        thrpt:  [+23.694% +27.726% +31.496%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
population/fitness_score_cardinality (known score), low/1000
                        time:   [13.278 µs 13.445 µs 13.640 µs]
                        thrpt:  [73.314 Melem/s 74.378 Melem/s 75.311 Melem/s]
                 change:
                        time:   [-25.040% -23.860% -22.609%] (p = 0.00 < 0.05)
                        thrpt:  [+29.214% +31.336% +33.404%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

These benchmarks were run with:

  • AMD Ryzen 9 5900X 12-Core Processor (3.70 GHz)
  • 64-bit operating system, x64-based processor
  • RUSTFLAGS="-C target-cpu=native"

@basvanwesting
Owner

Thanks, I see you are the author of hyperloglockless.

While the claimed performance increase is relevant percentage-wise, I'm not sure the cardinality estimation is a non-trivial performance bottleneck in practice. You did not provide evidence for that claim (especially with respect to the Fitness calculations, which are the bottleneck in real-world usage).

Also, our use case centers on very low cardinality counts: typically between 100 and 1000. So that is a very limited use case, and we need to keep that in mind. It will have special performance characteristics for these low levels.

Furthermore cardinality-estimator has a benchmark with conflicting conclusions regarding your implementation.

I will consider the switch. If you can provide evidence of the non-trivial performance bottleneck in real-world usage, that would make a difference.

Regards, Bas

@tomtomwombat
Author

Thanks for the quick response.

> Furthermore cardinality-estimator has a benchmark with conflicting conclusions regarding your implementation.

Which benchmark and conflicting conclusion are you referring to? Their benchmarks don't include hyperloglockless (they include the hyperloglog crate instead).

> Also, our use case centers on very low cardinality counts: typically between 100 and 1000. So that is a very limited use case, and we need to keep that in mind. It will have special performance characteristics for these low levels.

That's useful context. What do you mean by "It will have special performance characteristics for these low levels."? Also, I'm curious what you prioritize in cardinality estimation in genetic-algorithm, e.g. performance, accuracy, or memory?

> I will consider the switch. If you can provide evidence of the non-trivial performance bottleneck in real-world usage, that would make a difference.

I don't use genetic-algorithm myself. Is there a benchmark that reflects real-world usage?
