
AVX2 variation of c++ #74

Merged: 7 commits into niklas-heer:master, Oct 24, 2022
Conversation

joelandman
Contributor

Use `-march=native -O` to enable.
Unrolled 4 times. Similar to the AVX2 C++ code concept.
@joelandman
Contributor Author

Added an unrolled Julia version with simplified loops/execution.

@niklas-heer
Owner

First, thank you @joelandman.
The question for me is: what do you intend for this code? Should it be added as separate runs alongside the rest, or executed instead of the current implementations?

@joelandman
Contributor Author

I would suggest a separate run. I was curious whether I could vectorize this code using AVX2. This is a naive first pass.
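For readers curious what such a vectorized pass can look like, here is a minimal sketch of the idea (not the actual leibniz_avx2.cpp from this PR): it computes four Leibniz terms per iteration with AVX intrinsics, with a hard-coded round count assumed for brevity. Compile with something like g++ -O3 -march=native.

#include <cstdio>
#include <immintrin.h>

int main() {
    const long rounds = 100000000;
    __m256d sum   = _mm256_setzero_pd();
    // Lanes hold denominators 1, 3, 5, 7 (low to high)
    __m256d denom = _mm256_set_pd(7.0, 5.0, 3.0, 1.0);
    // Alternating signs +1, -1, +1, -1 matching those lanes
    const __m256d sign = _mm256_set_pd(-1.0, 1.0, -1.0, 1.0);
    // Each iteration advances every denominator by 8 (four odd steps)
    const __m256d step = _mm256_set1_pd(8.0);
    for (long i = 0; i < rounds; i += 4) {
        sum   = _mm256_add_pd(sum, _mm256_div_pd(sign, denom));
        denom = _mm256_add_pd(denom, step);
    }
    // Horizontal sum of the four lanes, then scale by 4
    double lanes[4];
    _mm256_storeu_pd(lanes, sum);
    printf("%.16f\n", 4.0 * (lanes[0] + lanes[1] + lanes[2] + lanes[3]));
    return 0;
}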

@joelandman
Contributor Author

joelandman commented Oct 23, 2022

Note that for the Julia code, I wanted to see if simplifying and unrolling it would have an impact. The impact looks noticeable, in part due to reducing the number of loop-test conditionals.

This shouldn't replace the base code; rather, it should sit alongside it.
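To make the loop-conditional argument concrete, here is a hypothetical C++ rendering of the same unrolling idea (not code from this PR): the loop condition is evaluated once per four terms instead of once per term, and the alternating sign is folded into the constants.

double leibniz_unrolled(long rounds) {
    double pi = 1.0;  // first term of the series
    // One loop test per four terms; signs baked into the numerators
    for (long i = 2; i + 3 <= rounds + 2; i += 4) {
        pi += -1.0 / (2.0 * i - 1.0)
            +  1.0 / (2.0 * i + 1.0)
            + -1.0 / (2.0 * i + 3.0)
            +  1.0 / (2.0 * i + 5.0);
    }
    return pi * 4.0;  // remainder terms (rounds % 4) omitted for brevity
}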

@niklas-heer
Owner

That is fair enough. To run your code, new entries have to be added to the Earthfile.

If you need help implementing that, let me know and I can do it for you. Otherwise, if you want to give it a shot, here are two relevant examples:

speed-comparison/Earthfile

Lines 109 to 115 in 769babc

cpp:
FROM +alpine
RUN apk add --no-cache gcc build-base
COPY ./src/leibniz.cpp ./
RUN --no-cache g++ leibniz.cpp -o leibniz -O3 -s -static -flto -march=native -mtune=native -fomit-frame-pointer -fno-signed-zeros -fno-trapping-math -fassociative-math
DO +BENCH --name="cpp" --lang="C++ (g++)" --version="g++ --version" --cmd="./leibniz"

speed-comparison/Earthfile

Lines 201 to 209 in 769babc

julia:
# We have to use a special image since there is no Julia package on alpine 🤷‍♂️
FROM julia:1.8.2-alpine3.16
RUN apk add --no-cache hyperfine
COPY +build/scmeta ./
COPY ./src/rounds.txt ./
COPY ./src/leibniz.jl ./
DO +BENCH --name="julia" --lang="Julia" --version="julia --version" --cmd="julia leibniz.jl"

@joelandman
Contributor Author

Ok, let me play with this, and if I can get it working here, I'll add this to the PR.

@joelandman
Contributor Author

Updated the Earthfile to include c++-avx2 and julia_ux4.

@niklas-heer
Owner

I've just changed the --lang parameter, since that name is used to label the y-axis of the plot.
Let's see what the CI says.

@niklas-heer
Owner

There doesn't seem to be much difference.
The C++ solution seems to be about the same.
The Julia solution seems to be a bit slower than the standard solution.
[Image: combined_results plot]

@niklas-heer merged commit 042b84c into niklas-heer:master on Oct 24, 2022
@niklas-heer
Owner

@joelandman thank you for your contribution! 👍

@joelandman
Contributor Author

Interesting. On my machine (AMD Epyc 7551, Zen 1), the AVX2 version is about 2x faster and the Julia version about 33% faster. I'll try to replicate this on the Alpine distro (I use Debian 11 on my machine).

@Moelf
Contributor

Moelf commented Oct 24, 2022

The Julia solution seems to be a bit slower than the standard solution.

Too few iterations; startup time dominates.

@niklas-heer
Owner

The Julia solution seems to be a bit slower than the standard solution.

Too few iterations; startup time dominates.

That could be it. I don't know yet when I will get around to implementing #59.

@joelandman
Contributor Author

I could do this easily for Julia. It might be harder for C++ (as startup overhead there is minimal). The reason I think the Julia unroll is faster is this:

Original:

julia> using BenchmarkTools
julia> struct SignVector <: AbstractVector{Float64}
           len::Int
       end
julia> Base.size(s::SignVector) = (s.len,)
julia> Base.getindex(::SignVector, i::Int) = Float64((-1)^iseven(i))
julia> function f(rounds)
           xs = SignVector(rounds + 2)
           pi = 1.0
           @simd for i in 2:(rounds + 2)
               x = xs[i]
               pi += x / (2 * i - 1)
           end
           return pi*4
       end
f (generic function with 1 method)
julia> rounds = parse(Int64, readchomp("rounds.txt"))
100000000
julia> f(rounds)
3.1415926435880532
julia> @benchmark f(rounds)
BenchmarkTools.Trial: 36 samples with 1 evaluation.
 Range (min … max):  142.441 ms … 145.558 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     142.733 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   142.835 ms ± 493.724 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       █▃  ▁                                                     
  ▆▆▆▆▇██▆▁█▇▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  142 ms           Histogram: frequency by time          146 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

Unrolled:

julia> using BenchmarkTools
julia> function f(rounds)
           pi = 1.0
           x  = -1.0
           r2 = rounds + 2
           vend = r2 - r2 % 4
           @simd for i in 2:4:r2
               pi +=   x / (2.0 * i -  1.0) - 
                       x / (2.0 * i +  1.0) + 
                       x / (2.0 * i +  3.0) - 
                       x / (2.0 * i +  5.0) 
           end
           for i in vend+1:r2
                  pi += 1.0 / (2.0 * (i + 0.0) - 1.0)
               x = -x
           end
           return pi*4
       end
f (generic function with 1 method)
julia> rounds = parse(Int64, readchomp("rounds.txt"))
100000000
julia> print(f(rounds))
3.141592703611381
julia> @benchmark f(rounds)
BenchmarkTools.Trial: 57 samples with 1 evaluation.
 Range (min … max):  88.304 ms …  90.359 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     88.341 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.379 ms ± 267.732 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▅     ▂   █ ▅                                         
  ▅▁▁▅▁▅▁▅███▅▅▅█▁▁▁████▅▁▁███▅█▅███▁▅▅▁█▅▁▁▅▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▅ ▁
  88.3 ms         Histogram: frequency by time         88.4 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

The unrolled version is nearly 2x the speed of the plain one. Similarly for the C++ versions:

joe@calculon:~/bench/speed-comparison/src$ gcc -O3 -march=native leibniz.cpp -o l.x
joe@calculon:~/bench/speed-comparison/src$ gcc -O3 -march=native leibniz_avx2.cpp -o l_avx2.x
joe@calculon:~/bench/speed-comparison/src$ /usr/bin/time ./l.x
3.1415926635893259
0.26user 0.00system 0:00.26elapsed 100%CPU (0avgtext+0avgdata 1388maxresident)k
1inputs+0outputs (4major+66minor)pagefaults 0swaps
joe@calculon:~/bench/speed-comparison/src$ /usr/bin/time ./l_avx2.x
3.1415926635945883
0.11user 0.00system 0:00.11elapsed 98%CPU (0avgtext+0avgdata 1460maxresident)k
1inputs+0outputs (4major+70minor)pagefaults 0swaps

showing that the AVX2 version has a little more than 2x the performance of the regular C++ version.

I'm not sure what the differences are, though my machine runs Debian 11 (and therefore glibc), while the C++ test cases run on Alpine with musl. I've spun up an Alpine VM on that machine (using KVM) to see if I can compare and maybe generate some insight. I've heard anecdotes about performance differences between the two, but haven't measured them.

@joelandman
Contributor Author

joelandman commented Oct 24, 2022

Just moved the code over to my Zen 2 laptop (Ryzen 7 4800H) and got this for the C++:

3.1415926635893259
0.12user 0.00system 0:00.12elapsed 99%CPU (0avgtext+0avgdata 1540maxresident)k
0inputs+0outputs (0major+68minor)pagefaults 0swaps
joe@zap:~/bench/speed-comparison/src$ /usr/bin/time ./l_avx2.x 
3.1415926635945883
0.03user 0.00system 0:00.03elapsed 94%CPU (0avgtext+0avgdata 1576maxresident)k
0inputs+0outputs (0major+66minor)pagefaults 0swaps

So it's roughly 4x faster with AVX2 (which makes sense).

And for the Julia:

Original

julia> @benchmark f(rounds)
BenchmarkTools.Trial: 108 samples with 1 evaluation.
 Range (min … max):  45.974 ms …  49.822 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     46.423 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.606 ms ± 542.685 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂▂ ▅ █                                     ▃       
  ▃▁▃▃▃▃▃▅▃▆▅██▆█▇█▇█▄▄▃▃▃▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▃▃▁▁▁▁▁▁▁██▁▁▁▃▃ ▃
  46 ms           Histogram: frequency by time         47.6 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

Unroll X4

julia> @benchmark f(rounds)
BenchmarkTools.Trial: 171 samples with 1 evaluation.
 Range (min … max):  29.277 ms … 29.421 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.353 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   29.353 ms ± 24.620 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▂           ▃ ▄█ ▅  ▂  ▂                   
  ▃▁▁▁▁▁▁▁▃▃▄▃▃▁▄▃▁▃█▄▄▃█▄▇▅▇█▇▆██████▆▅█▇▆█▅▇▄▅▄▇▁▁▁▃▁▄▁▃▄▁▄ ▃
  29.3 ms         Histogram: frequency by time        29.4 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

This shows the roughly 30% better performance I was discussing. Also, the core C++ AVX2 code is about the same speed as the Julia unroll-by-4 code, which is in line with what many of us observe about Julia.

@Moelf
Contributor

Moelf commented Oct 24, 2022

julia> @benchmark f(rounds)

Yeah, but this is not how this repo does timing. This repo calls

julia leibniz.jl

which includes around 200 ms of startup and first-run compile time. And the CI system is very noisy, so all of the top languages should come out essentially tied.
