AVX2 variation of c++ #74
Use `-march=native -O` to enable.
Unrolled 4 times. Similar to the AVX2 C++ code concept.
Added an unrolled Julia version with simplified loops/execution.
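For context, here is a minimal sketch of how a flag such as `-march=native` can switch a code path on. GCC and Clang define the `__AVX2__` macro when AVX2 code generation is enabled, which `-march=native` does on a CPU that supports AVX2. The function name `built_with_avx2` is hypothetical and is not taken from this PR; whether the PR itself checks this macro is an assumption.

```cpp
// Hypothetical helper (not from this PR): reports whether this translation
// unit was compiled with AVX2 code generation enabled.
// GCC and Clang define __AVX2__ under -mavx2, or under -march=native
// when the build machine's CPU supports AVX2.
constexpr bool built_with_avx2() {
#if defined(__AVX2__)
    return true;
#else
    return false;
#endif
}
```

Because the check happens at compile time, a binary built without the flag never contains the AVX2 path, which is one reason to keep the vectorized variant as a separate benchmark entry.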
First, thank you @joelandman.
I would suggest a separate run. I was curious as to whether I could vectorize this code using AVX2. This is a naive first pass.
Note that for the Julia code, I wanted to see if simplifying it and unrolling it would have an impact. The effect looks noticeable, in part because unrolling reduces the number of loop-test conditionals. This shouldn't replace the base code, but rather sit alongside it.
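To illustrate the unrolling idea, here is a scalar C++ sketch. It assumes the benchmark kernel is an alternating series such as the Leibniz formula for pi; the kernel is not shown in this thread, so that is an assumption, and the function names `leibniz_plain` and `leibniz_unrolled4` are hypothetical. The unrolled version evaluates one loop test per four terms and keeps four independent accumulators.

```cpp
#include <cstdint>

// Hypothetical baseline: one series term and one loop test per iteration.
double leibniz_plain(std::uint64_t terms) {
    double sum = 0.0;
    double sign = 1.0;
    for (std::uint64_t k = 0; k < terms; ++k) {
        sum += sign / (2.0 * static_cast<double>(k) + 1.0);
        sign = -sign;
    }
    return 4.0 * sum;
}

// Hypothetical 4x unrolled variant: four terms per iteration, so the loop
// condition is evaluated a quarter as often. The four accumulators have no
// dependency on each other within an iteration.
double leibniz_unrolled4(std::uint64_t terms) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::uint64_t limit = terms - terms % 4;  // largest multiple of 4
    std::uint64_t k = 0;
    for (; k < limit; k += 4) {
        // k is a multiple of 4 here, so the signs are +, -, +, -.
        const double d = 2.0 * static_cast<double>(k) + 1.0;
        s0 += 1.0 / d;
        s1 -= 1.0 / (d + 2.0);
        s2 += 1.0 / (d + 4.0);
        s3 -= 1.0 / (d + 6.0);
    }
    double sum = (s0 + s1) + (s2 + s3);
    // Handle the remaining 0 to 3 terms one at a time.
    for (; k < terms; ++k) {
        const double sign = (k % 2 == 0) ? 1.0 : -1.0;
        sum += sign / (2.0 * static_cast<double>(k) + 1.0);
    }
    return 4.0 * sum;
}
```

Both functions sum the same terms, but in a different order, so their results can differ in the last few bits of the floating-point value.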
That is fair enough. To run your code, new entries have to be added to the … If you need help implementing that, let me know and I can do it for you. Otherwise, if you want to give it a shot, here are two relevant examples:
Lines 109 to 115 in 769babc
Lines 201 to 209 in 769babc
Ok, let me play with this, and if I can get it working here, I'll add this to the PR.
Updated Earthfile to include c++-avx2, and julia_ux4 |
I've just changed the --lang parameter, as it is the name used to name the y-axis of the plot.
Let's see what the CI says.
@joelandman thank you for your contribution! 👍
Interesting. On my machine (AMD Epyc 7551, Zen 1), they are about 2x faster for the AVX2 version, and 33% faster for Julia. I'll try to replicate this on the Alpine distro (I use Debian 11 on my machine).
Too few iterations; startup time dominates.
That could be it. I don't know yet when I will get around to implementing #59.
I could do this easily for Julia. It might be harder for C++ (as startup is minimal). The reason I thought the Julia unroll is faster is this: Original:
Unrolled:
The unrolled version is nearly 2x the speed of the plain one. Similarly for the C++ versions
showing that the AVX2 version was a little more than 2x the performance of the regular C++ version. I'm not sure what the differences are, though my machine is running Debian 11 (and therefore glibc), and the test cases for C++ are running on Alpine and musl. I've lit up an Alpine VM on that machine (using KVM) to see if I can compare, and maybe generate insight. I've heard anecdotally about performance differences, but haven't measured them.
Just moved the code over to my Zen 2 laptop (Ryzen 7 4800H) and got this for the C++
So it's roughly 4x faster with AVX2 (which makes sense). And for the Julia: Original
Unroll X4
Which shows the 30% or so better performance I was discussing. Also, the core C++ AVX2 code is about the same performance as the Julia unroll-by-4 code, which is in line with what many of us observe about Julia.
Yeah, but this is not how this repo does timing; this repo calls
which includes 200 ms or so of startup time and first-run compile time. And the CI system is very unstable; all of the top languages should be exactly the same.
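The difference between the two measurement styles discussed here can be sketched in C++. Timing the whole process from outside includes startup (and, for Julia, first-run compilation), while timing inside the process covers only the kernel. The helpers `elapsed_seconds` and `best_of` below are hypothetical names and are not part of this repo; the repo's actual timing command is not shown in this thread.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: wall-clock seconds spent inside `work` only,
// excluding process startup and anything that ran before the call.
template <typename F>
double elapsed_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical helper: runs `work` at least once and returns the fastest
// of `repeats` runs, which filters out one-off costs such as a cold cache.
template <typename F>
double best_of(int repeats, F&& work) {
    double best = elapsed_seconds(work);
    for (int i = 1; i < repeats; ++i) {
        const double t = elapsed_seconds(work);
        if (t < best) {
            best = t;
        }
    }
    return best;
}
```

A call such as `best_of(5, [] { /* kernel under test */ })` would report the kernel time alone. With a short kernel, an outside-the-process measurement can be dominated by the fixed startup cost, which would compress the differences between implementations.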