multithreading issues #82

Open
grlee77 opened this issue Nov 4, 2020 · 7 comments

grlee77 commented Nov 4, 2020

I only recently downloaded the library and am not sure what the expected result is, but currently the benchmarks involving multiple threads do not show any improvement for me.

Possibly related: in the following I get None from fn.cpustring(), so it seems that threading is not enabled? Calling fn.thread_enable() does not seem to enable it either.

import fast_numpy_loops as fn
fn.cpustring()

seberg commented Nov 9, 2020

It appears you have to run fn.initialize() first to get a result from fn.cpustring().
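
A minimal sketch of that call order, using only the functions already mentioned in this thread:

import fast_numpy_loops as fn

fn.initialize()        # must be called before querying CPU info
print(fn.cpustring())  # now returns a CPU description instead of None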

mattip commented Nov 9, 2020

https://quansight.github.io/numpy-threading-extensions/stable/use.html

grlee77 commented Nov 9, 2020

Thanks, fn.cpustring() does work once fn.initialize() has been called. Sorry for missing that in the docs.

Regarding performance, I do not expect all functions to benefit from multithreading, but I thought perhaps some cases such as exp would. However, if I run asv run -b UFunc_exp using the branch corresponding to #71, I get nearly identical performance regardless of the number of threads:

[  0.00%] ·· Benchmarking virtualenv-py3.8-numpy
[ 16.67%] ··· Running (bench_ufunc.UFunc_exp.time_ufunc_types--)...
[ 66.67%] ··· bench_ufunc.UFunc_exp.time_ufunc_types                                                            ok
[ 66.67%] ··· ========== ============
               nthreads              
              ---------- ------------
                  0       27.6±0.2ms 
                  2       27.4±0.3ms 
                  4       27.5±0.3ms 
              ========== ============

Is this consistent with what others are seeing?
cpustring: **CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz AVX2:1 BMI2:1 0x7ffefbbf 0xbfebfbff 0xd39ffffb 0x00000000

tdimitri commented Nov 17, 2020

I have learned a few more things (at least when testing on my computer).

  1. If hyperthreading is on (which we have started detecting), we will only use every other core, because when two threads run on the same physical core it slows things down by about 10%. (A rough way to check this from Python is sketched just after this list.)
  2. If the array fits inside the L2 cache (we can start returning the L1/L2/L3 cache sizes in cpuinfo) and the array operation is simple and fast (like adding two floats), there may be no speedup, because the main thread is already pulling from the L2 cache at top speed.
  3. Some common array operations, like multiplying int64 values, are not vectorized.
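
For point 1, a rough way to check for hyperthreading from plain Python; psutil is an assumption here, not necessarily what pnumpy itself uses for detection:

import os
import psutil

logical = os.cpu_count()                    # logical CPUs, including hyperthreads
physical = psutil.cpu_count(logical=False)  # physical cores only
print(logical, physical, physical is not None and logical > physical)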

Below: a little over a 2x speedup (with 3 extra threads); int64 multiply is not vectorized.

In [31]: pn.thread_disable()
In [32]: x=np.arange(100_000, dtype=np.int64)
In [33]: y=x.copy()
In [34]: c=x+y
In [35]: %timeit np.multiply(x,y,out=c)
87.4 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [36]: pn.thread_enable()
In [37]: %timeit np.multiply(x,y,out=c)
35.6 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Now if I make arrays larger than the L3 cache size: here 100 million elements * 4 bytes per float = 400 MB per array.
Below: a little over a 2x speedup (with 3 extra threads) -- the cache is blown.

In [38]: x=np.arange(100_000_000, dtype=np.float32)
In [39]: y=x.copy()
In [40]: c=x+y

In [41]: %timeit np.add(x,y,out=c)
40.4 ms ± 82.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [42]: pn.thread_disable()
In [43]: %timeit np.add(x,y,out=c)
85.6 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Below: about a 4x speedup (with 3 extra threads); the arrays fit inside the L3 cache but not the L2 cache on my computer.

In [47]: pn.thread_enable()
In [48]: x=np.arange(1_000_000, dtype=np.float32); y=x.copy(); c=x+y
In [49]: %timeit np.add(x,y,out=c)
109 µs ± 755 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [50]: pn.thread_disable()
In [51]: %timeit np.add(x,y,out=c)
408 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Below: small arrays, reused; the gain from threading mostly disappears (the threaded run is only slightly faster here).

In [66]: pn.thread_disable()
In [67]: x=np.arange(50_000, dtype=np.float32); y=x.copy(); c=x+y
In [68]: %timeit np.add(x,x,out=c)
14.6 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [70]: pn.thread_enable()
In [71]: %timeit np.add(x,x,out=c)
11.9 µs ± 66.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

mattip commented Nov 17, 2020

This also demonstrates the problem with simple benchmarks vs. a real application. When an application uses many smaller arguments that together do not fit in the L2 cache, breaking the work into smaller blocks may let all the blocks of arguments fit in the L2 caches of the various CPUs, providing a speedup. This strategy would be very complicated to implement in pnumpy as a NumPy add-on, and would be easier to do in a framework like Dask.
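
For illustration, a rough sketch of that blocking idea with Dask; the chunk size is an arbitrary assumption, chosen so each block (about 1 MB) fits comfortably in a typical L2 cache:

import numpy as np
import dask.array as da

# 100 million float32 values split into blocks of 250,000 elements (~1 MB each)
x = da.arange(100_000_000, dtype=np.float32, chunks=250_000)
y = x.copy()
# each block is added independently; Dask's threaded scheduler runs blocks in parallel
c = (x + y).compute()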

mattip commented Dec 15, 2020

ping @jack-pappas to see if there is a way to get asv to reflect the performance increase when using multiple threads. Perhaps the benchmark should create an out array and then call the ufunc as res = ufunc(..., out=out)? Or is a 1024x1024 2D array not the right shape for pnumpy optimizations? In any case, once we figure out what is going on we should document the targeted use cases for pnumpy.
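
For reference, a hypothetical asv benchmark along those lines; the class and method names are made up, not the existing bench_ufunc code:

import numpy as np

class UFuncExpWithOut:
    def setup(self):
        self.x = np.random.random((1024, 1024)).astype(np.float32)
        self.out = np.empty_like(self.x)  # preallocate so allocation is not timed

    def time_exp_out(self):
        np.exp(self.x, out=self.out)      # the ufunc writes into the preallocated array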

jack-pappas commented

@mattip There is indeed -- I've just opened PR #107 with some changes to fix how threading is handled in the benchmarks.
