multithreading issues #82

Open
grlee77 opened this issue Nov 4, 2020 · 7 comments

grlee77 commented Nov 4, 2020

I only recently downloaded the library and am not sure what the expected result is, but currently the benchmarks involving multiple threads do not show any improvement for me.

Possibly related: in the following I get None from fn.cpustring(), so it seems that threading is not enabled? Calling fn.thread_enable() does not seem to enable it either.

import fast_numpy_loops as fn
fn.cpustring()

seberg commented Nov 9, 2020

It appears you have to run fn.initialize() first to get a result from fn.cpustring().
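
A minimal sketch of that call order, using only the functions already mentioned in this thread:

import fast_numpy_loops as fn

fn.initialize()        # must be called before querying CPU info
print(fn.cpustring())  # now returns a CPU description instead of None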

mattip commented Nov 9, 2020

https://quansight.github.io/numpy-threading-extensions/stable/use.html

grlee77 commented Nov 9, 2020

Thanks, fn.cpustring() does work once fn.initialize() has been called. Sorry for missing that in the docs.

Regarding performance, I do not expect all functions to benefit from multithreading, but I thought perhaps some cases such as exp would. However, if I run asv run -b UFunc_exp using the branch corresponding to #71, I get nearly identical performance regardless of the number of threads:

[  0.00%] ·· Benchmarking virtualenv-py3.8-numpy
[ 16.67%] ··· Running (bench_ufunc.UFunc_exp.time_ufunc_types--)...
[ 66.67%] ··· bench_ufunc.UFunc_exp.time_ufunc_types                                                            ok
[ 66.67%] ··· ========== ============
               nthreads              
              ---------- ------------
                  0       27.6±0.2ms 
                  2       27.4±0.3ms 
                  4       27.5±0.3ms 
              ========== ============

Is this consistent with what others are seeing?
cpustring: **CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz AVX2:1 BMI2:1 0x7ffefbbf 0xbfebfbff 0xd39ffffb 0x00000000

tdimitri commented Nov 17, 2020

I have learned a few more things (at least when testing on my computer).

  1. If hyperthreading is on (which we have started detecting), we will only use every other core, because when two threads run on the same physical core it slows things down by about 10%. (A rough way to check this from Python is sketched just after this list.)
  2. If the array fits inside the L2 cache (we can start returning the L1/L2/L3 cache sizes in cpuinfo) and the array operation is simple and fast (like adding two floats), there may be no speedup, because the main thread is already pulling from the L2 cache at top speed.
  3. Some common array operations, like multiplying int64 values, are not vectorized.
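
For point 1, a rough way to check for hyperthreading from plain Python; psutil is an assumption here, not necessarily what pnumpy itself uses for detection:

import os
import psutil

logical = os.cpu_count()                    # logical CPUs, including hyperthreads
physical = psutil.cpu_count(logical=False)  # physical cores only
print(logical, physical, physical is not None and logical > physical)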

Below: a little over a 2x speedup (with 3 extra threads); int64 multiply is not vectorized.

In [31]: pn.thread_disable()
In [32]: x=np.arange(100_000, dtype=np.int64)
In [33]: y=x.copy()
In [34]: c=x+y
In [35]: %timeit np.multiply(x,y,out=c)
87.4 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [36]: pn.thread_enable()
In [37]: %timeit np.multiply(x,y,out=c)
35.6 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Now if I make arrays larger than the L3 cache size: here 100 million elements * 4 bytes per float = 400 MB per array.
Below: a little over a 2x speedup (with 3 extra threads) -- the cache is blown.

In [38]: x=np.arange(100_000_000, dtype=np.float32)
In [39]: y=x.copy()
In [40]: c=x+y

In [41]: %timeit np.add(x,y,out=c)
40.4 ms ± 82.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [42]: pn.thread_disable()
In [43]: %timeit np.add(x,y,out=c)
85.6 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Below: about a 4x speedup (with 3 extra threads); the arrays fit inside the L3 cache but not the L2 cache on my computer.

In [47]: pn.thread_enable()
In [48]: x=np.arange(1_000_000, dtype=np.float32); y=x.copy(); c=x+y
In [49]: %timeit np.add(x,y,out=c)
109 µs ± 755 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [50]: pn.thread_disable()
In [51]: %timeit np.add(x,y,out=c)
408 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Below: small arrays, reused; the gain from threading mostly disappears (the threaded run is only slightly faster here).

In [66]: pn.thread_disable()
In [67]: x=np.arange(50_000, dtype=np.float32); y=x.copy(); c=x+y
In [68]: %timeit np.add(x,x,out=c)
14.6 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [70]: pn.thread_enable()
In [71]: %timeit np.add(x,x,out=c)
11.9 µs ± 66.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

mattip commented Nov 17, 2020

This also demonstrates the problem with simple benchmarks vs. a real application. When an application uses many smaller arguments that together do not fit in the L2 cache, breaking the work into smaller blocks may let all the blocks of arguments fit in the L2 caches of the various CPUs, providing a speedup. This strategy would be very complicated to implement in pnumpy as a NumPy add-on, and would be easier to do in a framework like Dask.
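
For illustration, a rough sketch of that blocking idea with Dask; the chunk size is an arbitrary assumption, chosen so each block (about 1 MB) fits comfortably in a typical L2 cache:

import numpy as np
import dask.array as da

# 100 million float32 values split into blocks of 250,000 elements (~1 MB each)
x = da.arange(100_000_000, dtype=np.float32, chunks=250_000)
y = x.copy()
# each block is added independently; Dask's threaded scheduler runs blocks in parallel
c = (x + y).compute()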

mattip commented Dec 15, 2020

ping @jack-pappas to see if there is a way to get asv to reflect the performance increase when using multiple threads. Perhaps the benchmark should create an out array and then call the ufunc as res = ufunc(..., out=out)? Or is a 1024x1024 2D array not the right shape for pnumpy optimizations? In any case, once we figure out what is going on we should document the targeted use cases for pnumpy.
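
For reference, a hypothetical asv benchmark along those lines; the class and method names are made up, not the existing bench_ufunc code:

import numpy as np

class UFuncExpWithOut:
    def setup(self):
        self.x = np.random.random((1024, 1024)).astype(np.float32)
        self.out = np.empty_like(self.x)  # preallocate so allocation is not timed

    def time_exp_out(self):
        np.exp(self.x, out=self.out)      # the ufunc writes into the preallocated array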

jack-pappas commented

@mattip There is indeed -- I've just opened PR #107 with some changes to fix how threading is handled in the benchmarks.
