How Micro-optimizations in Keccak boosted Kyber's performance #12
ronhombre announced in Announcements
After another round of micro-optimizations:
Karatsuba
Optimized bitsToBytes and bytesToBits
Reduced Barrett Reduction operations in NTT and invNTT
Lazy Montgomery Reduction (both reductions are sketched below)
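For context, the two reduction tricks in that list work roughly as follows. This is a minimal sketch modeled on the standard Kyber approach with q = 3329; the function names and constants are illustrative, not KyberKotlin's actual code.

```kotlin
const val KYBER_Q = 3329
const val QINV = -3327   // q^-1 mod 2^16, as a signed 16-bit value

// Barrett reduction: brings a 16-bit value into a small representative
// congruent to a mod q using one multiply and one shift instead of '%'.
fun barrettReduce(a: Short): Short {
    val v = ((1 shl 26) + KYBER_Q / 2) / KYBER_Q      // the constant round(2^26 / q)
    val t = ((v * a + (1 shl 25)) shr 26) * KYBER_Q   // nearest multiple of q
    return (a - t).toShort()
}

// Montgomery reduction: for a product a with |a| < q * 2^15, returns a * 2^-16 mod q.
// The "lazy" part is delaying reductions: values kept in Montgomery form can be
// added a few times before any reduction is needed, which removes work from the
// NTT butterflies.
fun montgomeryReduce(a: Int): Short {
    val t = (a.toShort() * QINV).toShort()        // a * q^-1 mod 2^16 (low half)
    return ((a - t * KYBER_Q) shr 16).toShort()   // exact division by 2^16
}

fun main() {
    println(barrettReduce(10000.toShort()))   // 13 (= 10000 mod 3329)
    println(montgomeryReduce(3329 * 5))       // 0  (multiples of q reduce to 0)
}
```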
You may or may not know that I also maintain the KeccakKotlin library that this library uses. Over the course of a day, I significantly reduced its performance overhead, which directly boosted KyberKotlin's speed.
A bit of backstory: I initially used a different library, but during a profiler run I noticed it was stuck doing a single operation repeatedly, and very slowly. So I decided to write my own Keccak implementation. It was not as bad as I feared, because plenty of KATs (Known Answer Tests) and intermediate values are already floating around on the internet. After that, I did another profiler run, and my naive Keccak implementation bested the library I had been using. And now, I have bested my own Keccak implementation by a huge lead.
See for yourself below.
Initial (0.7.1)
In my opinion, this was already fast, but I know there are many applications that would greatly benefit from a faster Kyber.
Precompute Pi Shift values
This was a surprising performance improvement as it only removed some modulo operations. I might have forgotten some other optimizations that came along with it.
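To illustrate the idea, here is a minimal sketch (with made-up names, not KeccakKotlin's actual code) of replacing the per-call modulo arithmetic of the pi step with a table computed once. The state is assumed to be a flat LongArray of 25 lanes, indexed as x + 5*y.

```kotlin
// pi moves the lane at (x, y) to (y, (2x + 3y) mod 5); with lanes flattened as
// index = x + 5*y, the destination of lane i can be computed once, up front.
val PI_INDEX = IntArray(25) { i ->
    val x = i % 5
    val y = i / 5
    y + 5 * ((2 * x + 3 * y) % 5)
}

// Before: the destination index (and its '%' operations) is recomputed inside
// every call to the permutation.
fun piSlow(state: LongArray): LongArray {
    val out = LongArray(25)
    for (x in 0 until 5) for (y in 0 until 5) {
        out[y + 5 * ((2 * x + 3 * y) % 5)] = state[x + 5 * y]
    }
    return out
}

// After: one table lookup per lane, no modulo in the hot path.
fun piFast(state: LongArray): LongArray {
    val out = LongArray(25)
    for (i in 0 until 25) out[PI_INDEX[i]] = state[i]
    return out
}
```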
Combine Theta with Rho + Pi
It had always been an idea in my head to combine the three steps. As I expected, fusing them reduces the memory overhead of moving data between separate state arrays.
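A rough sketch of what the fusion looks like, under the same flat 25-lane layout as above and again illustrative rather than the library's actual code: Theta's column parities are computed first, then a single pass applies Theta's XOR, Rho's rotation, and Pi's scatter, so the state only moves through one scratch array.

```kotlin
// Standard Keccak rho rotation offsets, lane i = x + 5*y.
val RHO = intArrayOf(
     0,  1, 62, 28, 27,
    36, 44,  6, 55, 20,
     3, 10, 43, 25, 39,
    41, 45, 15, 21,  8,
    18,  2, 61, 56, 14
)
// Same pi destination table as in the previous sketch, repeated so this stands alone.
val PI_INDEX = IntArray(25) { i -> (i / 5) + 5 * ((2 * (i % 5) + 3 * (i / 5)) % 5) }

fun thetaRhoPi(a: LongArray, b: LongArray) {
    val c = LongArray(5)                              // Theta column parities
    for (x in 0 until 5)
        c[x] = a[x] xor a[x + 5] xor a[x + 10] xor a[x + 15] xor a[x + 20]
    val d = LongArray(5)                              // Theta correction lanes
    for (x in 0 until 5)
        d[x] = c[(x + 4) % 5] xor c[(x + 1) % 5].rotateLeft(1)
    for (i in 0 until 25)                             // Theta + Rho + Pi in one pass
        b[PI_INDEX[i]] = (a[i] xor d[i % 5]).rotateLeft(RHO[i])
}
```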
Combine Chi with Iota and Unrolled loops
Combining Chi with Iota is fairly straightforward. Unrolling the loops removed Kotlin's Iterator overhead, which was ridiculously slow.
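Here is the shape this takes in a sketch (same flat-state assumptions as above, not the actual code): Chi is written out per row with no modulo-5 index arithmetic, and Iota's round constant is XORed into lane (0, 0) in the same pass. The rc value would come from the standard 24-entry Keccak round-constant table.

```kotlin
fun chiIota(b: LongArray, a: LongArray, rc: Long) {
    // Row y = 0, unrolled, with Iota's round constant folded into lane (0, 0).
    a[0] = b[0] xor (b[1].inv() and b[2]) xor rc
    a[1] = b[1] xor (b[2].inv() and b[3])
    a[2] = b[2] xor (b[3].inv() and b[4])
    a[3] = b[3] xor (b[4].inv() and b[0])
    a[4] = b[4] xor (b[0].inv() and b[1])
    // The remaining rows follow the same pattern; a full unroll simply repeats
    // these five assignments with base offsets 5, 10, 15, and 20.
    for (base in 5 until 25 step 5) {
        a[base]     = b[base]     xor (b[base + 1].inv() and b[base + 2])
        a[base + 1] = b[base + 1] xor (b[base + 2].inv() and b[base + 3])
        a[base + 2] = b[base + 2] xor (b[base + 3].inv() and b[base + 4])
        a[base + 3] = b[base + 3] xor (b[base + 4].inv() and b[base])
        a[base + 4] = b[base + 4] xor (b[base].inv() and b[base + 1])
    }
}
```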
Converted ULong to Long
I had always wanted to do this, but only recently did I figure out the right way to implement it. Dealing with ULong and Long is quite different, so previous attempts had been a headache.
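The reason the switch is safe for Keccak, sketched below: XOR, AND, NOT, and rotations depend only on the bit pattern, so a lane stored in a signed Long behaves exactly like the same 64 bits in a ULong. Only ordering comparisons and division differ, and the permutation uses neither.

```kotlin
fun main() {
    val lane: ULong = 0xA3F1_5C2E_9907_B4DDuL
    val asLong: Long = lane.toLong()          // same 64 bits, reinterpreted as signed

    val rc: ULong = 0x8000000080008008uL      // one of the standard Keccak round constants
    val rcLong: Long = rc.toLong()

    // Bitwise operations and rotations keep the two representations in lockstep.
    val u = (lane xor rc).rotateLeft(13)
    val l = (asLong xor rcLong).rotateLeft(13)

    println(u == l.toULong())                 // true: identical bit patterns
}
```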
Removed all lambdas (0.8.0)
Who would have known? Lambdas in Kotlin go through the Iterator, which, as noted before, had been ridiculously slow. Removing the lambdas is basically equivalent to unrolling the loops, because that is essentially what I did.
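An illustrative before/after, not the library's code: forEach is declared on Iterable&lt;T&gt;, so iterating an IntRange through a lambda typically walks an Iterator and boxes each index, while a plain counted loop, or its fully unrolled equivalent, does not.

```kotlin
// Before: lambda-style iteration over the 5 columns; the generic forEach path
// creates an iterator and boxes each index on the JVM.
fun thetaColumnsLambda(a: LongArray): LongArray {
    val c = LongArray(5)
    (0 until 5).forEach { x ->
        c[x] = a[x] xor a[x + 5] xor a[x + 10] xor a[x + 15] xor a[x + 20]
    }
    return c
}

// After: the same work as a plain counted loop (or written out as five explicit
// assignments, which is what unrolling amounts to), with no iterator and no boxing.
fun thetaColumnsLoop(a: LongArray): LongArray {
    val c = LongArray(5)
    for (x in 0 until 5)
        c[x] = a[x] xor a[x + 5] xor a[x + 10] xor a[x + 15] xor a[x + 20]
    return c
}
```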
50-70% Performance improvement!
All benchmarking code is in JVMBenchmark.kt.
As stated in this repository's README, this is all relative to the current 'standard' branch.
After all of these changes and optimizations, I could finally rest easy and focus on optimizing my Kyber implementation.