Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace while loop in generic C/ZGEMM_BETA to avoid going out of bounds #5057

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from

Conversation

martin-frbg
Copy link
Collaborator

fixes #5050

@Enchufa2
Copy link

Enchufa2 commented Jan 9, 2025

Subtle one. I'll give it a spin later.

Did you figure out why this happens to cause problems only in aarch64 with 12 or more threads?

@martin-frbg
Copy link
Collaborator Author

aarch64 is the only widely used architecture for which OpenBLAS currently has no optimized GEMM_BETA kernels at all, so
every single cpu will use this C source. And I cannot confirm the 12 threads being special - it seems to be more a question of memory layout in the individual run, the outcome depending on what happens to be sitting in the overrun area. I have not been able to provoke the bug without the extra protector options, and even with them it was something like a 10 to 20 percent chance of failure on any individual run, something that might easily be missed if you're just building and testing once.
But I have not done an in-depth analysis of this yet, all I can really say is that the code survived several hundred re-runs of the LAPACK test with this change.

@Enchufa2
Copy link

Enchufa2 commented Jan 9, 2025

Could #4917 be a consequence of this?

@Enchufa2
Copy link

Enchufa2 commented Jan 9, 2025

And BTW... the valgrind output I collected reported invalid writes in line zgemm_beta.c:69 too. Shouldn't the loop in line zgemm_beta.c:62 be replaced in the same way?

@martin-frbg
Copy link
Collaborator Author

Not sure about #4917 as that was reproducibly returning sloppy results for existing data (if only for select thread counts), and yes the zeroing loop will need identical treatment if this stage of the PR is correct at all

@Enchufa2
Copy link

Enchufa2 commented Jan 9, 2025

Ok, I'm building in Copr. I've reproduced the crash by setting OMP_NUM_THREADS in FlexiBLAS' check stage. I'm building OpenBLAS there now with this patch, I just started another one with the loop in zgemm_beta.c:62 patched too, and I'll spin new FlexiBLAS builds on top of them when they finish.

@sharkcz
Copy link
Contributor

sharkcz commented Jan 9, 2025

looks good here, thanks

@martin-frbg
Copy link
Collaborator Author

looks good here, thanks

Thanks for testing - this is still quite weird to me as the code must have been like that for at least 15 years, if not 20 - granted aarch64 gave it much more exposure lately

@Enchufa2
Copy link

Enchufa2 commented Jan 9, 2025

@sharkcz You mean that the issue is not reproduced anymore in the Ampere MtSnow system? Because this patch still crashes for me (see 8492896 on top of 8492496 here). I'm building now with the other loop patched too (8492760), and I'll spin another FlexiBLAS build to see what happens.

@sharkcz
Copy link
Contributor

sharkcz commented Jan 9, 2025

@sharkcz You mean that the issue is not reproduced anymore in the Ampere MtSnow system? Because this patch still crashes for me (see 8492896 on top of 8492496 here). I'm building now with the other loop patched too (8492760), and I'll spin another FlexiBLAS build to see what happens.

correct, I have built updated openblas with this fix (in rawhide mock), installed new rpms into a new rawhide buildroot and successfully rebuilt flexiblas there

@Enchufa2
Copy link

Enchufa2 commented Jan 9, 2025

correct, I have built updated openblas with this fix (in rawhide mock), installed new rpms into a new rawhide buildroot and successfully rebuilt flexiblas there

With the most current commit in rawhide? Because yesterday I limited the number of threads to 10 in order to avoid the crash and be able to rebuild FlexiBLAS in rawhide, because it was affected by the retirement of ATLAS.

@Enchufa2
Copy link

Enchufa2 commented Jan 9, 2025

Ok, good news: the new build succeeded where the others failed. So I can confirm that patching the zeroing loop in zgemm_beta.c:62 on top of your changes here does fix the issue.

@sharkcz
Copy link
Contributor

sharkcz commented Jan 9, 2025

correct, I have built updated openblas with this fix (in rawhide mock), installed new rpms into a new rawhide buildroot and successfully rebuilt flexiblas there

With the most current commit in rawhide? Because yesterday I limited the number of threads to 10 in order to avoid the crash and be able to rebuild FlexiBLAS in rawhide, because it was affected by the retirement of ATLAS.

openblas is from rawhide HEAD, but flexiblas from https://src.fedoraproject.org/rpms/flexiblas/c/e10825622fc90f7405e4791062e6b433822a62c8?branch=rawhide (before your workaround)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LAPACK test failure with 3.28 on aarch64
3 participants