
AVX2 variation of c++ #74

Merged: 7 commits into niklas-heer:master, Oct 24, 2022
Conversation

joelandman
Contributor

Use `-march=native -O` to enable.
Unrolled 4 times. Similar to the AVX2 C++ code concept.
@joelandman
Contributor Author

Added an unrolled Julia version with simplified loops/execution.

@niklas-heer
Owner

First, thank you @joelandman.
The question for me is: what do you intend for this code? Should it be added as separate runs alongside the rest, or executed instead of the current implementations?

@joelandman
Contributor Author

I would suggest a separate run. I was curious whether I could vectorize this code using AVX2. This is a naive first pass.
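For readers curious what such a vectorized pass can look like, here is a minimal sketch of the idea (not the actual leibniz_avx2.cpp from this PR): it computes four Leibniz terms per iteration with AVX intrinsics, with a hard-coded round count assumed for brevity. Compile with something like g++ -O3 -march=native.

#include <cstdio>
#include <immintrin.h>

int main() {
    const long rounds = 100000000;
    __m256d sum   = _mm256_setzero_pd();
    // Lanes hold denominators 1, 3, 5, 7 (low to high)
    __m256d denom = _mm256_set_pd(7.0, 5.0, 3.0, 1.0);
    // Alternating signs +1, -1, +1, -1 matching those lanes
    const __m256d sign = _mm256_set_pd(-1.0, 1.0, -1.0, 1.0);
    // Each iteration advances every denominator by 8 (four odd steps)
    const __m256d step = _mm256_set1_pd(8.0);
    for (long i = 0; i < rounds; i += 4) {
        sum   = _mm256_add_pd(sum, _mm256_div_pd(sign, denom));
        denom = _mm256_add_pd(denom, step);
    }
    // Horizontal sum of the four lanes, then scale by 4
    double lanes[4];
    _mm256_storeu_pd(lanes, sum);
    printf("%.16f\n", 4.0 * (lanes[0] + lanes[1] + lanes[2] + lanes[3]));
    return 0;
}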

@joelandman
Contributor Author

joelandman commented Oct 23, 2022

Note that for the Julia code, I wanted to see if simplifying and unrolling it would have an impact. The impact looks noticeable, in part due to reducing the number of loop-test conditionals.

This shouldn't replace the base code; rather, it should sit alongside it.
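To make the loop-conditional argument concrete, here is a hypothetical C++ rendering of the same unrolling idea (not code from this PR): the loop condition is evaluated once per four terms instead of once per term, and the alternating sign is folded into the constants.

double leibniz_unrolled(long rounds) {
    double pi = 1.0;  // first term of the series
    // One loop test per four terms; signs baked into the numerators
    for (long i = 2; i + 3 <= rounds + 2; i += 4) {
        pi += -1.0 / (2.0 * i - 1.0)
            +  1.0 / (2.0 * i + 1.0)
            + -1.0 / (2.0 * i + 3.0)
            +  1.0 / (2.0 * i + 5.0);
    }
    return pi * 4.0;  // remainder terms (rounds % 4) omitted for brevity
}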

@niklas-heer
Owner

That is fair enough. To run your code, new entries have to be added to the Earthfile.

If you need help implementing that, let me know and I can do it for you. Otherwise, if you want to give it a shot, here are two relevant examples:

speed-comparison/Earthfile

Lines 109 to 115 in 769babc

cpp:
FROM +alpine
RUN apk add --no-cache gcc build-base
COPY ./src/leibniz.cpp ./
RUN --no-cache g++ leibniz.cpp -o leibniz -O3 -s -static -flto -march=native -mtune=native -fomit-frame-pointer -fno-signed-zeros -fno-trapping-math -fassociative-math
DO +BENCH --name="cpp" --lang="C++ (g++)" --version="g++ --version" --cmd="./leibniz"

speed-comparison/Earthfile

Lines 201 to 209 in 769babc

julia:
# We have to use a special image since there is no Julia package on alpine 🤷‍♂️
FROM julia:1.8.2-alpine3.16
RUN apk add --no-cache hyperfine
COPY +build/scmeta ./
COPY ./src/rounds.txt ./
COPY ./src/leibniz.jl ./
DO +BENCH --name="julia" --lang="Julia" --version="julia --version" --cmd="julia leibniz.jl"

@joelandman
Contributor Author

Ok, let me play with this, and if I can get it working here, I'll add this to the PR.

@joelandman
Contributor Author

Updated the Earthfile to include c++-avx2 and julia_ux4.

@niklas-heer
Owner

I've just changed the --lang parameter, since that name is used to label the y-axis of the plot.
Let's see what the CI says.

@niklas-heer
Owner

There doesn't seem to be much difference.
The C++ solution seems to be about the same.
The Julia solution seems to be a bit slower than the standard solution.
[Image: combined_results plot]

@niklas-heer merged commit 042b84c into niklas-heer:master on Oct 24, 2022
@niklas-heer
Owner

@joelandman thank you for your contribution! 👍

@joelandman
Contributor Author

Interesting. On my machine (AMD Epyc 7551, Zen 1), the AVX2 version is about 2x faster and the Julia version about 33% faster. I'll try to replicate this on the Alpine distro (I use Debian 11 on my machine).

@Moelf
Contributor

Moelf commented Oct 24, 2022

The Julia solution seems to be a bit slower than the standard solution.

Too few iterations; startup time dominates.

@niklas-heer
Owner

The Julia solution seems to be a bit slower than the standard solution.

Too few iterations; startup time dominates.

That could be it. I don't know yet when I will get around to implementing #59.

@joelandman
Contributor Author

I could do this easily for Julia. It might be harder for C++ (as startup overhead there is minimal). The reason I think the Julia unroll is faster is this:

Original:

julia> using BenchmarkTools
julia> struct SignVector <: AbstractVector{Float64}
           len::Int
       end
julia> Base.size(s::SignVector) = (s.len,)
julia> Base.getindex(::SignVector, i::Int) = Float64((-1)^iseven(i))
julia> function f(rounds)
           xs = SignVector(rounds + 2)
           pi = 1.0
           @simd for i in 2:(rounds + 2)
               x = xs[i]
               pi += x / (2 * i - 1)
           end
           return pi*4
       end
f (generic function with 1 method)
julia> rounds = parse(Int64, readchomp("rounds.txt"))
100000000
julia> f(rounds)
3.1415926435880532
julia> @benchmark f(rounds)
BenchmarkTools.Trial: 36 samples with 1 evaluation.
 Range (min … max):  142.441 ms … 145.558 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     142.733 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   142.835 ms ± 493.724 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       █▃  ▁                                                     
  ▆▆▆▆▇██▆▁█▇▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  142 ms           Histogram: frequency by time          146 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

Unrolled:

julia> using BenchmarkTools
julia> function f(rounds)
           pi = 1.0
           x  = -1.0
           r2 = rounds + 2
           vend = r2 - r2 % 4
           @simd for i in 2:4:r2
               pi +=   x / (2.0 * i -  1.0) - 
                       x / (2.0 * i +  1.0) + 
                       x / (2.0 * i +  3.0) - 
                       x / (2.0 * i +  5.0) 
           end
           for i in vend+1:r2
                  pi += 1.0 / (2.0 * (i + 0.0) - 1.0)
               x = -x
           end
           return pi*4
       end
f (generic function with 1 method)
julia> rounds = parse(Int64, readchomp("rounds.txt"))
100000000
julia> print(f(rounds))
3.141592703611381
julia> @benchmark f(rounds)
BenchmarkTools.Trial: 57 samples with 1 evaluation.
 Range (min … max):  88.304 ms …  90.359 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     88.341 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.379 ms ± 267.732 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▅     ▂   █ ▅                                         
  ▅▁▁▅▁▅▁▅███▅▅▅█▁▁▁████▅▁▁███▅█▅███▁▅▅▁█▅▁▁▅▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▅ ▁
  88.3 ms         Histogram: frequency by time         88.4 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

The unrolled version is nearly 2x the speed of the plain one. Similarly for the C++ versions:

joe@calculon:~/bench/speed-comparison/src$ gcc -O3 -march=native leibniz.cpp -o l.x
joe@calculon:~/bench/speed-comparison/src$ gcc -O3 -march=native leibniz_avx2.cpp -o l_avx2.x
joe@calculon:~/bench/speed-comparison/src$ /usr/bin/time ./l.x
3.1415926635893259
0.26user 0.00system 0:00.26elapsed 100%CPU (0avgtext+0avgdata 1388maxresident)k
1inputs+0outputs (4major+66minor)pagefaults 0swaps
joe@calculon:~/bench/speed-comparison/src$ /usr/bin/time ./l_avx2.x
3.1415926635945883
0.11user 0.00system 0:00.11elapsed 98%CPU (0avgtext+0avgdata 1460maxresident)k
1inputs+0outputs (4major+70minor)pagefaults 0swaps

showing that the AVX2 version has a little more than 2x the performance of the regular C++ version.

I'm not sure what the differences are, though my machine runs Debian 11 (and therefore glibc), while the C++ test cases run on Alpine with musl. I've spun up an Alpine VM on that machine (using KVM) to see if I can compare and maybe generate some insight. I've heard anecdotes about performance differences between the two, but haven't measured them.

@joelandman
Contributor Author

joelandman commented Oct 24, 2022

Just moved the code over to my Zen 2 laptop (Ryzen 7 4800H) and got this for the C++:

3.1415926635893259
0.12user 0.00system 0:00.12elapsed 99%CPU (0avgtext+0avgdata 1540maxresident)k
0inputs+0outputs (0major+68minor)pagefaults 0swaps
joe@zap:~/bench/speed-comparison/src$ /usr/bin/time ./l_avx2.x 
3.1415926635945883
0.03user 0.00system 0:00.03elapsed 94%CPU (0avgtext+0avgdata 1576maxresident)k
0inputs+0outputs (0major+66minor)pagefaults 0swaps

So it's roughly 4x faster with AVX2 (which makes sense).

And for the Julia:

Original

julia> @benchmark f(rounds)
BenchmarkTools.Trial: 108 samples with 1 evaluation.
 Range (min … max):  45.974 ms …  49.822 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     46.423 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.606 ms ± 542.685 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂▂ ▅ █                                     ▃       
  ▃▁▃▃▃▃▃▅▃▆▅██▆█▇█▇█▄▄▃▃▃▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▃▃▁▁▁▁▁▁▁██▁▁▁▃▃ ▃
  46 ms           Histogram: frequency by time         47.6 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

Unroll X4

julia> @benchmark f(rounds)
BenchmarkTools.Trial: 171 samples with 1 evaluation.
 Range (min … max):  29.277 ms … 29.421 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.353 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   29.353 ms ± 24.620 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▂           ▃ ▄█ ▅  ▂  ▂                   
  ▃▁▁▁▁▁▁▁▃▃▄▃▃▁▄▃▁▃█▄▄▃█▄▇▅▇█▇▆██████▆▅█▇▆█▅▇▄▅▄▇▁▁▁▃▁▄▁▃▄▁▄ ▃
  29.3 ms         Histogram: frequency by time        29.4 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

This shows the roughly 30% better performance I was discussing. Also, the core C++ AVX2 code is about the same speed as the Julia unroll-by-4 code, which is in line with what many of us observe about Julia.

@Moelf
Contributor

Moelf commented Oct 24, 2022

julia> @benchmark f(rounds)

Yeah, but this is not how this repo does timing. This repo calls

julia leibniz.jl

which includes around 200 ms of startup and first-run compile time. And the CI system is very noisy, so all of the top languages should come out essentially tied.
