LAPACK test failure with 3.28 on aarch64 #5050
Is there a way to get the …
On which flavour of aarch64 / which build TARGET do you see this? (At first glance it looks like a malloc error; it does not look familiar.)
This is the make command:
What is doubly strange is that the error has only been reproduced on the Fedora koji builders, not on other aarch64 test machines.
Is koji running on actual hardware, or on virtual machines (which may be resource-limited, or may not report hardware properties like cache sizes correctly)? The COMPLEX/COMPLEX16 parts of the LAPACK testsuite are a bit more memory-hungry than the other tests.
Also, would it be possible to add OPENBLAS_VERBOSE=2 to the environment, to see which CPU gets autodetected during the test phase of this DYNAMIC_ARCH build? (Though I think it is at least as likely that the out-of-bounds access is a bug in the test code OpenBLAS imports from Reference-LAPACK; maybe it is the glibc version in your koji setup that catches it. ISTR there were a few test-code fixes in Reference-LAPACK that I copied for 0.3.29, or still need to copy before its release.)
Just a note that 0.3.27 seems to be okay. koji is using VMs with what should be fairly decent memory - I'll get a number at some point and try with the verbose setting.
Hmm. The only remotely relevant change in 0.3.28 that I can identify is the addition of vector registers to the clobber list of the cdot/zdot assembly kernel used primarily on AppleM, ThunderX2 and Graviton2 - if anything this should have improved that kernel, certainly not led to memory overruns. (CSYL01 tests CTRSYL, which uses very few external functions, most notably CDOT.)
FWIW I have reproduced the issue on our bare-metal Ampere MtSnow system (80 cpus) doing a rawhide mock build for flexiblas. |
Thanks - that would be NeoverseN1, which (at least in theory) should be rather well tested, though perhaps not with all your additional compiler options. I'll see if I can reproduce this in the GCC Compile Farm.
In valgrind @opoplawski got:
I cannot find the reference to …
Thanks - my valgrind run has not reported anything interesting so far.
Seems extremely unlikely to me; if anything, the older nrm2 kernels that these reverted to have had much more exposure. I also don't think we have NRM2 anywhere on the call graph of CSYL01 testing CTRSYL (the hit in cgemm_beta makes it look as if the fault is coming from the test code rather than the function under test, but maybe that was a false positive from valgrind).
I cannot reproduce the error with gcc 14.2 and all your build options except the special spec files (cfarm425 runs Debian; my other option would be "Rocky 9.5", but it looks like I'd need to build my own gcc there first to get anything recent). As an unwanted side effect, the installed valgrind 3.2 trips over something in the binary when I use your build options.
FYI, newer gcc versions are available via the Developer Toolset (not sure if it's still called that) for RHEL-based distros.
Can I install them as a non-privileged user?
They are (usually) in a different repo, but still as RPMs, so admin privileges are needed.
I can run more checks if I get some details on how, as I am not really familiar with OpenBLAS (or FlexiBLAS) internals and build systems...
12 seems to be the magical number here. I've reproduced the crash by setting
Got it down to a segfault in kernel/generic/zgemm_beta.c line 105 during a bisect (suggesting that the second of the two values being processed two-at-a-time in that part of the unrolled loop is already nonexistent).
Bisect puts it down to #4655 "Expanding the scope of 2D thread distribution to improve multithreaded DGEMM performance" (51ab190). I need more time to understand whether that PR is actually at fault here (and maybe some of its performance improvement can be salvaged by limiting it to non-complex cases or certain thread counts), or whether it only exposes a flaw in (gcc 14's optimization of) the generic C gemm beta code.
pragma GCC optimize O0 in zgemm_beta.c does not help, so this is probably not a gcc 14 optimizer bug. Checking the thread redistribution produced by Yamazaki's PR now to see if it does anything interesting at the time of the crash.
Looks like the while loop in zgemm_beta.c can cause an additional roundtrip... still testing my "fix" though... |
Can you please give #5057 a spin ? |
I'm still seeing the crash with that patch. |
Yes, the zeroing loop in that function must be patched too. I did that here and the crash disappears. |
With the update from 0.3.26 to 0.3.28 in Fedora we're starting to see the following lapack test failure on aarch64 only: