Faster python implementation. #227

jugge83 · 2021-03-27T16:30:12Z

jugge83
Mar 27, 2021

Hello Dave.

I saw your YouTube video and I instantly became a fan ;). Keep up the good work!

Made a slightly faster implementation with lists. Runs 255 passes compared to the original 40 (6 times faster):

import math
import timeit
from textwrap import dedent
from numba import jit

@jit  # speeds up the funtion 3 times
def run_sieve(limit):
    prime_list = [0, 0, 1]
    prime_list += [i % 2 for i in range(3, limit + 1)]
    upper_limit = int(math.sqrt(limit)) + 1
    for prime_candidate in range(3, upper_limit, 2):
        if prime_list[prime_candidate] == 1:
            for non_prime in range(prime_candidate ** 2, limit, prime_candidate * 2):
                prime_list[non_prime] = 0
    return prime_list[: limit + 1]

tStart = timeit.default_timer()  # Record our starting time
passes = 0
limit = 1000000

while timeit.default_timer() - tStart < 10:
    # Run until more than 10 seconds have elapsed
    prime_validators = run_sieve(limit)  #  Find the results
    passes = passes + 1
time_delta = timeit.default_timer() - tStart
primes = [
    int(prime) for (prime, validator) in enumerate(prime_validators) if validator == 1
]
print(
    dedent(
        f"""\
    Passes: {passes}
    Time: {time_delta}
    Avg: {time_delta/passes}
    Limit: {limit}
    Count: {len(primes)}
    Valid: {primeCounts[limit]==len(primes)}"""
    )
)

Numba optimized numpy solution. Runs 10 000 passes (250 times faster):

import timeit
from textwrap import dedent
from numba import jit
import numpy as np

@jit(nopython=True, fastmath=True, parallel=False)
def run_sieve(limit, prime_list):
    prime_list[:2] = False
    prime_list[4::2] = False
    upper_limit = int(limit ** (1/2)) + 1
    for prime_candidate in range(3, upper_limit, 2):
        if prime_list[prime_candidate]:
            prime_list[prime_candidate ** 2: limit: prime_candidate * 2] = False
    return prime_list[: limit + 1]

def print_results(time_delta, passes, primes, limit):
    prime_counts = {
        10: 4,  # Historical data for validating our results - the number of primes
        100: 25,  # to be found under some limit, such as 168 primes under 1000
        1000: 168,
        10000: 1229,
        100000: 9592,
        1000000: 78498,
        10000000: 664579,
        100000000: 5761455,
    }
    print(f"""Passes: {passes}\nTime: {time_delta}\nAvg: {time_delta/passes}\
        \nLimit: {limit}\nCount: {len(primes)}\nValid: {prime_counts[limit]==len(primes)}""")

if __name__=="__main__":
    t_start = timeit.default_timer()  # Record our starting time
    passes = 0
    limit = 1000000
    while timeit.default_timer() - t_start < 10:
        # Run until more than 10 seconds have elapsed
        raw_array = np.ones(limit, bool)
        prime_validators = run_sieve(limit, raw_array)  #  Find the results
        passes = passes + 1
    time_delta = timeit.default_timer() - t_start
    primes = [prime for (prime, validator) in enumerate(prime_validators) if validator]
    print_results(time_delta, passes, primes, limit)

rhbvkleef · 2021-03-29T14:35:01Z

rhbvkleef
Mar 29, 2021

I banged my head against this for a while, as I saw some inefficiencies as compared to Dave´s implementation and other solutions (including my Numba implementation). Yours was faster to start with, but that seemed to be mainly because you dropped the class.

I optimized it a bit, and got it faster than the C++ solution on my machine:

Implementation	kops/s
Python with Numba	1.4
Python without numba	1.1
C++	1.1

from numba import njit, b1, int32
from math import sqrt
import numpy as np

@njit(b1[:](int32))
def run_sieve(limit):
    prime_list = np.full(limit // 2, True)
    prime_list[0] = False

    for factor in range(3, int(sqrt(limit)) + 1, 2):
        if prime_list[factor // 2]:
            prime_list[(factor * 3) // 2:limit // 2:factor] = False

    return prime_list[:(limit // 2) + 1]

The only thing is that 2 needs to be added to the list of primes, as even numbers are not considered:

import itertools

def primes(sieve):
    return itertools.chain([2], (prime * 2 + 1 for (prime, is_prime) in enumerate(sieve) if is_prime))

Additionally, I am seeing a shocking amount of cache misses and branch mispredicts. I don´t think we can do much about the mispredicts (and I don´t really know where they come from: they are likely an interpreter artifact), but a segmented sieve could be employed to improve locality on the data-cache.

0 replies

jugge83 · 2021-03-29T18:18:02Z

jugge83
Mar 29, 2021
Author

Nice trick to only look at the odd integers!

I think the major part of my speed improvement came from the removal of the double for loops in the while loop of the sieve method, But any function or method calls add overhead.

How do you check the cache misses and branch mispredicts?

I managed to squeeze out a couple of thousand iterations more by starting at factor ** 2 instead of factor * 3 and possibly the fastmath flag, I don't think it matters if you use the math.sqrt() or ** (1/2) function/operator.

Now I get approx. 17 000 passes compared to the initial 40 (425 times faster). Not bad for a slow language ;)

import timeit
import itertools
from numba import njit, b1, int32
import numpy as np


@njit(b1[:](int32), fastmath = True)
def run_sieve(limit):
    prime_list = np.full(limit // 2, True)  # defines a list for odd integers
    prime_list[0] = False                   # 1 is not a prime
    for factor in range(3, int(limit ** (1/2)) + 1, 2):
        if prime_list[factor // 2]:         # just looking at odd integers everything halves
            prime_list[(factor ** 2) // 2:limit // 2:factor] = False
    return prime_list[:(limit // 2) + 1]

def print_results(time_delta, passes, primes, limit):
    prime_counts = {
        10: 4,  # Historical data for validating our results - the number of primes
        100: 25,  # to be found under some limit, such as 168 primes under 1000
        1000: 168,
        10000: 1229,
        100000: 9592,
        1000000: 78498,
        10000000: 664579,
        100000000: 5761455,
    }
    print(f"""Passes: {passes}\nTime: {time_delta}\nAvg: {time_delta/passes}\
        \nLimit: {limit}\nCount: {len(primes)}\nValid: {prime_counts[limit]==len(primes)}""")

def primes(sieve):
    return itertools.chain([2], (prime * 2 + 1 for (prime, is_prime) in enumerate(sieve) if is_prime))

if __name__=="__main__":
    t_start = timeit.default_timer()  # Record our starting time
    passes = 0
    limit = 1_000_000
    while timeit.default_timer() - t_start < 10:
        prime_validators = run_sieve(limit)
        passes = passes + 1
    time_delta = timeit.default_timer() - t_start
    primes = list(primes(prime_validators))
    print(primes[:100])
    print_results(time_delta, passes, primes, limit)

0 replies

rhbvkleef · 2021-03-29T19:27:33Z

rhbvkleef
Mar 29, 2021

I think the major part of my speed improvement came from the removal of the double for loops in the while loop of the sieve method

Array slicing on numpy arrays is very fast, but it´s only a performance gain when not using Numba. Numba was able to optimize a loop to set each value separately.

How do you check the cache misses and branch mispredicts?

I used perf.

I managed to squeeze out a couple of thousand iterations more by starting at factor ** 2 instead of factor * 3

Cool! :)

0 replies

safijari · 2021-03-31T22:07:51Z

safijari
Mar 31, 2021

Oh wow. I was thinking about doing this and making a video for my channel as a sort of playful response but it's already been done, and really well. Kudos!

0 replies

iJohnnis · 2021-04-01T07:32:56Z

iJohnnis
Apr 1, 2021

@jugge83 & @rhbvkleef & @safijari:

I'm not a python expert but I believe that this is an unfair comparison given that your code does not initialize a new class each time and, probably, reuses the same array. Also, I think you're using sequences instead of arrays, which means that most of the code does not actually execute until you use the list operation.

Try executing your code once with a 100 million sieve size (1e8).

My results were:
• 347 ms for c++ (Dave's code)
• 413 ms for c# (an optimized version of Dave's code)
• 493 ms for python (your code)

Edit 03/04/2021: I had to fix the number because it will confuse ppl that read this and not the posts below this. My old results were 4093ms for python but I accidentally had the code running for 1000000000 (1e9) instead of the 1e8 that I had on the other languages. That was 100% my mistake and it is not representative of the language or the code. I have to add here that running once and comparing it doesn't mean much, all I wanted to illustrate with this example is the class initialization cost that the other two languages compare to rhbvkleef's.
If anyone is interested in the c# code that I run you can find it here: https://github.com/iJohnnis/PrimeSieve/tree/master/PrimeSieveCS.

Generally going faster than c++ means that something is wrong, given that c/c++ are the lowest level languages (apart from assembly) and that python uses c ... The only way that I see python doing anything faster than c is wrong/inefficient c code.

0 replies

jugge83 · 2021-04-01T15:44:51Z

jugge83
Apr 1, 2021
Author

I'm getting the following:

143 ms for Daves C++ code.
298 ms for the above Python code.

The code does not need to be object oriented and my goal was to optimize the python code as much as I could.
For me, Python development is way faster than C++ development ;)

0 replies

safijari · 2021-04-01T17:07:47Z

safijari
Apr 1, 2021

Something that I haven't seen considered here is taking the compilation time into account, since the first invocation of run_seive would take a fair bit longer because of Numba compiling it down into LLVM bytecode. Can be remedied either by running the function once before the timing starts or by using the cache=True kwarg so on subsequent runs the compilation doesn't need to happen.

There's also AOT compilation for Numba. Speaking of fair, that would the most fair way to do the comparison.

0 replies

jugge83 · 2021-04-01T17:36:53Z

jugge83
Apr 1, 2021
Author

@safijari you should make the video. I can Imagine that it will be popular and the more Python resources that are out there the better.
Sadly the cache=True kwarg or an initial "pre-run" of the function didn't improve execution times for me.

0 replies

safijari · 2021-04-01T17:41:16Z

safijari
Apr 1, 2021

I was thinking I could do a shorter video that reviews this code instead. Lemme see, I have other projects to finish first.

…

On Thu, Apr 1, 2021 at 10:37 AM jugge83 ***@***.***> wrote: @safijari <https://github.com/safijari> you should make the video. I can Imagine that it will be popular and the more Python resources that are out there the better. Sadly the cache=True kwarg or an initial "pre-run" of the function didn't improve execution times for me. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://github.com/davepl/Primes/issues/25#issuecomment-812062989>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABHTRJHOZ2L32RXZSD6XT5LTGSVMHANCNFSM4Z5CH3XA> .

-- *Jariullah Safi* *Director of AI and Computer Vision* Simbe Robotics <http://www.simberobotics.com/>

0 replies

iJohnnis · 2021-04-01T20:50:36Z

iJohnnis
Apr 1, 2021

I'm getting the following:

143 ms for Daves C++ code.

298 ms for the above Python code.

You're right, I accidentally posted the results for the 1 billion sieve-size when I run the python code.

The code does not need to be object oriented and my goal was to optimize the python code as much as I could.

Yeah, but if you have to compare the two languages, you have to do similar things.
c++ doesn't need to initialize a new object each time either.

For me, Python development is way faster than C++ development ;)

Yeah, no doubt. Python is, one of the fastest, if not the fastest, languages if you're comparing development time.

0 replies

rhbvkleef · 2021-04-02T15:17:41Z

rhbvkleef
Apr 2, 2021

@iJohnnis

I'm not a python expert but I believe that this is an unfair comparison given that your code does not initialize a new class each time [...

This is true, but I´ve also written an alternative that does initialize a class each time, and the performance is very similar. It is slightly slower, taking off a couple of percent, so it is not significant in any way.

...] and, probably, reuses the same array.

It might, it might not, but that is a matter of page reuse, not a matter of data reuse. It is likely that the same happens in C++ and C#, and is thus, entirely fair.

Also, I think you're using sequences instead of arrays, which means that most of the code does not actually execute until you use the list operation.

This is demonstrably false. I extracted the LLVM IR, and that clearly shows that the code produces the actual sieve representation.

Try executing your code once with a 100 million sieve size (1e8).
My results were:

347 ms for c++ (Dave's code)

413 ms for c# (an optimized version of Dave's code)

4093 ms for python (your code)

My measurements were these:

Language	Iterations (limit=1 million)	Iterations (limit=100 million)
C++ (GCC)	11556	79
C++ (Clang)	11267	83
C# (Mono)	1521	Not measured
C# (DotNet Core)	3317	Not measured
Python (with classes)	10345	32
Python (this implementation)	14845	35

So I´m not entirely sure what to make of it. The python implementation implementation seems to at least be comparable to slightly better for a lower limit, whilst it is significantly slower for higher limit. I am certain that the reason that Python is comparable or faster for lower limits, is simply because it is actually faster, and that the C++ implementation isn´t as good as can be.

Generally going faster than c++ means that something is wrong, given that c/c++ are the lowest level languages

Indeed, something might be wrong. We should better look at whether the C++ implementation is as efficient as can be. We must consider this too, and I would like to point the finger squarely at this.

(Technically C, and especially C++, are not the lowest level languages barring assembly, but I simply wish to point that out, and actually invalidating your point with that would be entirely unfair, and incorrect)

The only way that I see python doing anything faster that c is wrong/inefficient c code.

Indeed, and that seems to be the case. (there are more hypothetical options too, but I don´t think that is the case)

@safijari, @jugge83

Something that I haven't seen considered here is taking the compilation time into account, since the first invocation of run_seive would take a fair bit longer because of Numba compiling it down into LLVM bytecode.

sadly the cache=True kwarg or an initial "pre-run" of the function didn't improve execution times for me.

The reason that we don´t see that, is because by simply adding a type signature in the decorator, a compilation for that type signature is triggered.

0 replies

iJohnnis · 2021-04-03T15:40:22Z

iJohnnis
Apr 3, 2021

@rhbvkleef if you're interested please read my edited comment. The c# code that I wrote is closer to the c++, the main difference is that I can't initialize the boolean array on true on c# so I used the opposite boolean, a "NoPrimes" array. The code that is written here is closer to the initial version of Dave's code which was a lot more inefficient. Note that I have practically zero experience on Github.

0 replies

rgtoomey · 2021-04-12T03:23:06Z

rgtoomey
Apr 12, 2021

Dear Dave. I have enjoyed watching your videos. I know some about programming (and am trying to learn more as a hobby). I sat down this weekend, struggled with Anaconda, learned about Github (as linked to your video), and I re-wrote your code (I think) in a way that makes sense to me. I had to look up seive, by the way. Please keep up the videos.

On my 13" Macbook Pro, this code computes the correct number of primes up to 1e9 in about 5 seconds. I tried 1e10, but my laptop decided it was too much. In some of the above posts, users mentioned a Numba package, but this did not really change the speed of my program.

import numpy as np
import math as m
from time import time


def primes(maxNumber):                                      #Function that returns all the prime numbers up to a maximum number   

    start=time()
    
    booleanTestArray = primeSearch(maxNumber)               #Returns a Boolean test of all odd numbers between 2 and the maximum number. A prime number returns true.
    primesArray=getPrimeNumbers(booleanTestArray)           #Converts the Boolean test to an array of prime numbers between 2 and the maximum number
    
    finish=time()  
    
    print(finish-start)     
    return primesArray


def primeSearch(maxNumber) :
    
    possibleTotalPrimes = (maxNumber + 1) // 2              #Sets the total number of potential prime number candidates
    booleanTestArray = np.full(possibleTotalPrimes,True)    #Establishes Boolean test array corresponding to [2,3,5,7, ..., possibleTotalPrimes]
    maxFactor = m.floor(m.sqrt(maxNumber))                  #Largest number potentially factorable into maxNumber
    
    for factor in range(3,maxFactor+1,2) :                  #Test all divisible factors starting from 3 up the maximum
        if booleanTestArray[factor//2] :                    #Check if divisible factor has not been set to false (meaning all multiples of it has already been eliminated as potential primes) 
            booleanTestArray[(factor**2)//2:(maxNumber+1):factor] = False   #If the divisible factor is a prime, set all multiples of the prime number to false in the booleanTestArray
    return booleanTestArray


def getPrimeNumbers(booleanTestArray) :                     #Converts booleanTestArray to array of prime numbers 
    
    primesArray = np.where(booleanTestArray)[0]
    primesArray= 2*primesArray+1 
    primesArray[0]=2
    return primesArray

0 replies

Zeratoxx · 2021-06-09T09:43:22Z

Zeratoxx
Jun 9, 2021

I am genuinely impressed by the big differences, just wanted to get that out :D

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster python implementation. #227

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 14 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Faster python implementation. #227

Replies: 14 comments

jugge83 Mar 29, 2021 Author

Try executing your code once with a 100 million sieve size (1e8).

My results were: • 347 ms for c++ (Dave's code) • 413 ms for c# (an optimized version of Dave's code) • 493 ms for python (your code)

jugge83 Apr 1, 2021 Author

jugge83 Apr 1, 2021 Author

jugge83
Mar 29, 2021
Author

My results were:
• 347 ms for c++ (Dave's code)
• 413 ms for c# (an optimized version of Dave's code)
• 493 ms for python (your code)

jugge83
Apr 1, 2021
Author

jugge83
Apr 1, 2021
Author