There are two big optimizations we could do on both the chacha20 and salsa20 crates.
Avoid recomputing initial state
EDIT: both crates now have a new method to compute the initial state, and separate apply_keystream / generate methods to compute a block
RFC 8439 Section 3 describes caching the initial block state once computed as a performance optimization:
Each block of ChaCha20 involves 16 move operations and one increment
operation for loading the state, 80 each of XOR, addition and roll
operations for the rounds, 16 more add operations and 16 XOR
operations for protecting the plaintext. Section 2.3 describes the
ChaCha block function as "adding the original input words". This
implies that before starting the rounds on the ChaCha state, we copy
it aside, only to add it in later. This is correct, but we can save
a few operations if we instead copy the state and do the work on the
copy. This way, for the next block you don't need to recreate the
state, but only to increment the block counter. This saves
approximately 5.5% of the cycles.
SIMD support
Both ChaCha20 and Salsa20 are amenable to SIMD optimizations. We should add SIMD optimizations on x86/x86_64 at the very least.
x86/x86_64
Other CPU architectures
There are two big optimizations we could do on both the
chacha20andsalsa20crates.Avoid recomputing initial state
EDIT: both crates now have a
newmethod to compute the initial state, and separateapply_keystream/generatemethods to compute a blockchacha20cratesalsa20crateRFC 8439 Section 3 describes caching the initial block state once computed as a performance optimization:
SIMD support
Both ChaCha20 and Salsa20 are amenable to SIMD optimizations. We should add SIMD optimizations on
x86/x86_64at the very least.x86/x86_64chacha20salsa20Other CPU architectures