Faster python implementation. #227
Replies: 14 comments
-
I banged my head against this for a while, as I saw some inefficiencies as compared to Dave´s implementation and other solutions (including my Numba implementation). Yours was faster to start with, but that seemed to be mainly because you dropped the class. I optimized it a bit, and got it faster than the C++ solution on my machine:
from numba import njit, b1, int32
from math import sqrt
import numpy as np
@njit(b1[:](int32))
def run_sieve(limit):
prime_list = np.full(limit // 2, True)
prime_list[0] = False
for factor in range(3, int(sqrt(limit)) + 1, 2):
if prime_list[factor // 2]:
prime_list[(factor * 3) // 2:limit // 2:factor] = False
return prime_list[:(limit // 2) + 1] The only thing is that import itertools
def primes(sieve):
return itertools.chain([2], (prime * 2 + 1 for (prime, is_prime) in enumerate(sieve) if is_prime)) Additionally, I am seeing a shocking amount of cache misses and branch mispredicts. I don´t think we can do much about the mispredicts (and I don´t really know where they come from: they are likely an interpreter artifact), but a segmented sieve could be employed to improve locality on the data-cache. |
Beta Was this translation helpful? Give feedback.
-
Nice trick to only look at the odd integers! I think the major part of my speed improvement came from the removal of the double for loops in the while loop of the sieve method, But any function or method calls add overhead. How do you check the cache misses and branch mispredicts? I managed to squeeze out a couple of thousand iterations more by starting at factor ** 2 instead of factor * 3 and possibly the fastmath flag, I don't think it matters if you use the math.sqrt() or ** (1/2) function/operator. Now I get approx. 17 000 passes compared to the initial 40 (425 times faster). Not bad for a slow language ;) import timeit
import itertools
from numba import njit, b1, int32
import numpy as np
@njit(b1[:](int32), fastmath = True)
def run_sieve(limit):
prime_list = np.full(limit // 2, True) # defines a list for odd integers
prime_list[0] = False # 1 is not a prime
for factor in range(3, int(limit ** (1/2)) + 1, 2):
if prime_list[factor // 2]: # just looking at odd integers everything halves
prime_list[(factor ** 2) // 2:limit // 2:factor] = False
return prime_list[:(limit // 2) + 1]
def print_results(time_delta, passes, primes, limit):
prime_counts = {
10: 4, # Historical data for validating our results - the number of primes
100: 25, # to be found under some limit, such as 168 primes under 1000
1000: 168,
10000: 1229,
100000: 9592,
1000000: 78498,
10000000: 664579,
100000000: 5761455,
}
print(f"""Passes: {passes}\nTime: {time_delta}\nAvg: {time_delta/passes}\
\nLimit: {limit}\nCount: {len(primes)}\nValid: {prime_counts[limit]==len(primes)}""")
def primes(sieve):
return itertools.chain([2], (prime * 2 + 1 for (prime, is_prime) in enumerate(sieve) if is_prime))
if __name__=="__main__":
t_start = timeit.default_timer() # Record our starting time
passes = 0
limit = 1_000_000
while timeit.default_timer() - t_start < 10:
prime_validators = run_sieve(limit)
passes = passes + 1
time_delta = timeit.default_timer() - t_start
primes = list(primes(prime_validators))
print(primes[:100])
print_results(time_delta, passes, primes, limit) |
Beta Was this translation helpful? Give feedback.
-
Array slicing on numpy arrays is very fast, but it´s only a performance gain when not using Numba. Numba was able to optimize a loop to set each value separately.
I used
Cool! :) |
Beta Was this translation helpful? Give feedback.
-
Oh wow. I was thinking about doing this and making a video for my channel as a sort of playful response but it's already been done, and really well. Kudos! |
Beta Was this translation helpful? Give feedback.
-
@jugge83 & @rhbvkleef & @safijari: I'm not a python expert but I believe that this is an unfair comparison given that your code does not initialize a new class each time and, probably, reuses the same array. Also, I think you're using sequences instead of arrays, which means that most of the code does not actually execute until you use the list operation. Try executing your code once with a 100 million sieve size (1e8).My results were:
|
Beta Was this translation helpful? Give feedback.
-
I'm getting the following:
The code does not need to be object oriented and my goal was to optimize the python code as much as I could. |
Beta Was this translation helpful? Give feedback.
-
Something that I haven't seen considered here is taking the compilation time into account, since the first invocation of There's also AOT compilation for Numba. Speaking of fair, that would the most fair way to do the comparison. |
Beta Was this translation helpful? Give feedback.
-
@safijari you should make the video. I can Imagine that it will be popular and the more Python resources that are out there the better. |
Beta Was this translation helpful? Give feedback.
-
I was thinking I could do a shorter video that reviews this code instead.
Lemme see, I have other projects to finish first.
…On Thu, Apr 1, 2021 at 10:37 AM jugge83 ***@***.***> wrote:
@safijari <https://github.com/safijari> you should make the video. I can
Imagine that it will be popular and the more Python resources that are out
there the better.
Sadly the cache=True kwarg or an initial "pre-run" of the function didn't
improve execution times for me.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/davepl/Primes/issues/25#issuecomment-812062989>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABHTRJHOZ2L32RXZSD6XT5LTGSVMHANCNFSM4Z5CH3XA>
.
--
*Jariullah Safi*
*Director of AI and Computer Vision*
Simbe Robotics <http://www.simberobotics.com/>
|
Beta Was this translation helpful? Give feedback.
-
You're right, I accidentally posted the results for the 1 billion sieve-size when I run the python code.
Yeah, but if you have to compare the two languages, you have to do similar things.
Yeah, no doubt. Python is, one of the fastest, if not the fastest, languages if you're comparing development time. |
Beta Was this translation helpful? Give feedback.
-
This is true, but I´ve also written an alternative that does initialize a class each time, and the performance is very similar. It is slightly slower, taking off a couple of percent, so it is not significant in any way.
It might, it might not, but that is a matter of page reuse, not a matter of data reuse. It is likely that the same happens in C++ and C#, and is thus, entirely fair.
This is demonstrably false. I extracted the LLVM IR, and that clearly shows that the code produces the actual sieve representation.
My measurements were these:
So I´m not entirely sure what to make of it. The python implementation implementation seems to at least be comparable to slightly better for a lower limit, whilst it is significantly slower for higher limit. I am certain that the reason that Python is comparable or faster for lower limits, is simply because it is actually faster, and that the C++ implementation isn´t as good as can be.
Indeed, something might be wrong. We should better look at whether the C++ implementation is as efficient as can be. We must consider this too, and I would like to point the finger squarely at this. (Technically C, and especially C++, are not the lowest level languages barring assembly, but I simply wish to point that out, and actually invalidating your point with that would be entirely unfair, and incorrect)
Indeed, and that seems to be the case. (there are more hypothetical options too, but I don´t think that is the case)
The reason that we don´t see that, is because by simply adding a type signature in the decorator, a compilation for that type signature is triggered. |
Beta Was this translation helpful? Give feedback.
-
@rhbvkleef if you're interested please read my edited comment. The c# code that I wrote is closer to the c++, the main difference is that I can't initialize the boolean array on true on c# so I used the opposite boolean, a "NoPrimes" array. The code that is written here is closer to the initial version of Dave's code which was a lot more inefficient. Note that I have practically zero experience on Github. |
Beta Was this translation helpful? Give feedback.
-
Dear Dave. I have enjoyed watching your videos. I know some about programming (and am trying to learn more as a hobby). I sat down this weekend, struggled with Anaconda, learned about Github (as linked to your video), and I re-wrote your code (I think) in a way that makes sense to me. I had to look up seive, by the way. Please keep up the videos. On my 13" Macbook Pro, this code computes the correct number of primes up to 1e9 in about 5 seconds. I tried 1e10, but my laptop decided it was too much. In some of the above posts, users mentioned a Numba package, but this did not really change the speed of my program.
|
Beta Was this translation helpful? Give feedback.
-
I am genuinely impressed by the big differences, just wanted to get that out :D |
Beta Was this translation helpful? Give feedback.
-
Hello Dave.
I saw your YouTube video and I instantly became a fan ;). Keep up the good work!
Made a slightly faster implementation with lists. Runs 255 passes compared to the original 40 (6 times faster):
Numba optimized numpy solution. Runs 10 000 passes (250 times faster):
Beta Was this translation helpful? Give feedback.
All reactions