How Micro-optimizations in Keccak boosted Kyber's performance #12
ronhombre announced in Announcements
After another round of micro-optimizations:
Karatsuba
Optimized bitsToBytes and bytesToBits
Reduced Barrett Reduction operations in NTT and invNTT
Lazy Montgomery Reduction (both reductions are sketched below)
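For context, the two reduction tricks in that list work roughly as follows. This is a minimal sketch modeled on the standard Kyber approach with q = 3329; the function names and constants are illustrative, not KyberKotlin's actual code.

```kotlin
const val KYBER_Q = 3329
const val QINV = -3327   // q^-1 mod 2^16, as a signed 16-bit value

// Barrett reduction: brings a 16-bit value into a small representative
// congruent to a mod q using one multiply and one shift instead of '%'.
fun barrettReduce(a: Short): Short {
    val v = ((1 shl 26) + KYBER_Q / 2) / KYBER_Q      // the constant round(2^26 / q)
    val t = ((v * a + (1 shl 25)) shr 26) * KYBER_Q   // nearest multiple of q
    return (a - t).toShort()
}

// Montgomery reduction: for a product a with |a| < q * 2^15, returns a * 2^-16 mod q.
// The "lazy" part is delaying reductions: values kept in Montgomery form can be
// added a few times before any reduction is needed, which removes work from the
// NTT butterflies.
fun montgomeryReduce(a: Int): Short {
    val t = (a.toShort() * QINV).toShort()        // a * q^-1 mod 2^16 (low half)
    return ((a - t * KYBER_Q) shr 16).toShort()   // exact division by 2^16
}

fun main() {
    println(barrettReduce(10000.toShort()))   // 13 (= 10000 mod 3329)
    println(montgomeryReduce(3329 * 5))       // 0  (multiples of q reduce to 0)
}
```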
You may or may not know that I also maintain the KeccakKotlin library that this library uses. Over the course of a day, I significantly reduced its performance overhead, which directly boosted KyberKotlin's speed.
A bit of backstory: I initially used a different library, but during a profiler run I noticed it was stuck doing a single operation repeatedly, and very slowly. So I decided to write my own Keccak implementation. It was not as bad as I feared, because plenty of KATs (Known Answer Tests) and intermediate values are already floating around on the internet. After that, I did another profiler run, and my naive Keccak implementation bested the library I had been using. And now, I have bested my own Keccak implementation by a huge lead.
See for yourself below.
Initial (0.7.1)
In my opinion, this was already fast, but I know there are many applications that would greatly benefit from a faster Kyber.
Precompute Pi Shift values
This was a surprising performance improvement as it only removed some modulo operations. I might have forgotten some other optimizations that came along with it.
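To illustrate the idea, here is a minimal sketch (with made-up names, not KeccakKotlin's actual code) of replacing the per-call modulo arithmetic of the pi step with a table computed once. The state is assumed to be a flat LongArray of 25 lanes, indexed as x + 5*y.

```kotlin
// pi moves the lane at (x, y) to (y, (2x + 3y) mod 5); with lanes flattened as
// index = x + 5*y, the destination of lane i can be computed once, up front.
val PI_INDEX = IntArray(25) { i ->
    val x = i % 5
    val y = i / 5
    y + 5 * ((2 * x + 3 * y) % 5)
}

// Before: the destination index (and its '%' operations) is recomputed inside
// every call to the permutation.
fun piSlow(state: LongArray): LongArray {
    val out = LongArray(25)
    for (x in 0 until 5) for (y in 0 until 5) {
        out[y + 5 * ((2 * x + 3 * y) % 5)] = state[x + 5 * y]
    }
    return out
}

// After: one table lookup per lane, no modulo in the hot path.
fun piFast(state: LongArray): LongArray {
    val out = LongArray(25)
    for (i in 0 until 25) out[PI_INDEX[i]] = state[i]
    return out
}
```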
Combine Theta with Rho + Pi
It had always been an idea in my head to combine the three steps. As I expected, fusing them reduces the memory overhead of moving data between separate state arrays.
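A rough sketch of what the fusion looks like, under the same flat 25-lane layout as above and again illustrative rather than the library's actual code: Theta's column parities are computed first, then a single pass applies Theta's XOR, Rho's rotation, and Pi's scatter, so the state only moves through one scratch array.

```kotlin
// Standard Keccak rho rotation offsets, lane i = x + 5*y.
val RHO = intArrayOf(
     0,  1, 62, 28, 27,
    36, 44,  6, 55, 20,
     3, 10, 43, 25, 39,
    41, 45, 15, 21,  8,
    18,  2, 61, 56, 14
)
// Same pi destination table as in the previous sketch, repeated so this stands alone.
val PI_INDEX = IntArray(25) { i -> (i / 5) + 5 * ((2 * (i % 5) + 3 * (i / 5)) % 5) }

fun thetaRhoPi(a: LongArray, b: LongArray) {
    val c = LongArray(5)                              // Theta column parities
    for (x in 0 until 5)
        c[x] = a[x] xor a[x + 5] xor a[x + 10] xor a[x + 15] xor a[x + 20]
    val d = LongArray(5)                              // Theta correction lanes
    for (x in 0 until 5)
        d[x] = c[(x + 4) % 5] xor c[(x + 1) % 5].rotateLeft(1)
    for (i in 0 until 25)                             // Theta + Rho + Pi in one pass
        b[PI_INDEX[i]] = (a[i] xor d[i % 5]).rotateLeft(RHO[i])
}
```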
Combine Chi with Iota and Unrolled loops
Combining Chi with Iota is fairly straightforward. Unrolling the loops removed Kotlin's Iterator overhead, which was ridiculously slow.
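Here is the shape this takes in a sketch (same flat-state assumptions as above, not the actual code): Chi is written out per row with no modulo-5 index arithmetic, and Iota's round constant is XORed into lane (0, 0) in the same pass. The rc value would come from the standard 24-entry Keccak round-constant table.

```kotlin
fun chiIota(b: LongArray, a: LongArray, rc: Long) {
    // Row y = 0, unrolled, with Iota's round constant folded into lane (0, 0).
    a[0] = b[0] xor (b[1].inv() and b[2]) xor rc
    a[1] = b[1] xor (b[2].inv() and b[3])
    a[2] = b[2] xor (b[3].inv() and b[4])
    a[3] = b[3] xor (b[4].inv() and b[0])
    a[4] = b[4] xor (b[0].inv() and b[1])
    // The remaining rows follow the same pattern; a full unroll simply repeats
    // these five assignments with base offsets 5, 10, 15, and 20.
    for (base in 5 until 25 step 5) {
        a[base]     = b[base]     xor (b[base + 1].inv() and b[base + 2])
        a[base + 1] = b[base + 1] xor (b[base + 2].inv() and b[base + 3])
        a[base + 2] = b[base + 2] xor (b[base + 3].inv() and b[base + 4])
        a[base + 3] = b[base + 3] xor (b[base + 4].inv() and b[base])
        a[base + 4] = b[base + 4] xor (b[base].inv() and b[base + 1])
    }
}
```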
Converted ULong to Long
I had always wanted to do this, but only recently did I figure out the right way to implement it. Dealing with ULong and Long is quite different, so previous attempts had been a headache.
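The reason the switch is safe for Keccak, sketched below: XOR, AND, NOT, and rotations depend only on the bit pattern, so a lane stored in a signed Long behaves exactly like the same 64 bits in a ULong. Only ordering comparisons and division differ, and the permutation uses neither.

```kotlin
fun main() {
    val lane: ULong = 0xA3F1_5C2E_9907_B4DDuL
    val asLong: Long = lane.toLong()          // same 64 bits, reinterpreted as signed

    val rc: ULong = 0x8000000080008008uL      // one of the standard Keccak round constants
    val rcLong: Long = rc.toLong()

    // Bitwise operations and rotations keep the two representations in lockstep.
    val u = (lane xor rc).rotateLeft(13)
    val l = (asLong xor rcLong).rotateLeft(13)

    println(u == l.toULong())                 // true: identical bit patterns
}
```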
Removed all lambdas (0.8.0)
Who would have known? Lambdas in Kotlin go through the Iterator, which, as noted before, had been ridiculously slow. Removing the lambdas is basically equivalent to unrolling the loops, because that is essentially what I did.
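An illustrative before/after, not the library's code: forEach is declared on Iterable&lt;T&gt;, so iterating an IntRange through a lambda typically walks an Iterator and boxes each index, while a plain counted loop, or its fully unrolled equivalent, does not.

```kotlin
// Before: lambda-style iteration over the 5 columns; the generic forEach path
// creates an iterator and boxes each index on the JVM.
fun thetaColumnsLambda(a: LongArray): LongArray {
    val c = LongArray(5)
    (0 until 5).forEach { x ->
        c[x] = a[x] xor a[x + 5] xor a[x + 10] xor a[x + 15] xor a[x + 20]
    }
    return c
}

// After: the same work as a plain counted loop (or written out as five explicit
// assignments, which is what unrolling amounts to), with no iterator and no boxing.
fun thetaColumnsLoop(a: LongArray): LongArray {
    val c = LongArray(5)
    for (x in 0 until 5)
        c[x] = a[x] xor a[x + 5] xor a[x + 10] xor a[x + 15] xor a[x + 20]
    return c
}
```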
50-70% Performance improvement!
All benchmarking code is in JVMBenchmark.kt.
As stated in this repository's README, this is all relative to the current 'standard' branch.
After all of these changes and optimizations, I could finally rest easy and focus on optimizing my Kyber implementation.