[pacman] getting stuck on clang-aarch64 #4340

Closed
1 task done
Wormnest opened this issue Jan 11, 2024 · 39 comments
@Wormnest

Description / Steps to reproduce the issue

For about a month, GIMP's aarch64 CI runner has been getting stuck when running pacman --noconfirm -Suy.
The last jobs that succeeded ran on Dec 11; every later job has failed.

Most of the time (e.g. here) it already seems to stop before the databases are updated:

$ C:\msys64\usr\bin\bash -lc "bash -x ./build/windows/gitlab-ci/1_build-deps-msys2.sh"
+ set -e
+ [[ aarch64 == \a\a\r\c\h\6\4 ]]
+ export ARTIFACTS_SUFFIX=-a64
+ ARTIFACTS_SUFFIX=-a64
+ [[ CI_NATIVE != \C\I\_\N\A\T\I\V\E ]]
+ pacman --noconfirm -Suy
WARNING: Failed to terminate process: 1 error occurred:
	* failed to attach the runner process to the console of its parent process: The handle is invalid.
WARNING: Timed out waiting for the build to finish

Sometimes it gets a little further:

$ C:\msys64\usr\bin\bash -lc "bash -x ./build/windows/gitlab-ci/1_build-deps-msys2.sh"
+ set -e
+ [[ aarch64 == \a\a\r\c\h\6\4 ]]
+ export ARTIFACTS_SUFFIX=-a64
+ ARTIFACTS_SUFFIX=-a64
+ [[ CI_NATIVE != \C\I\_\N\A\T\I\V\E ]]
+ pacman --noconfirm -Suy
:: Synchronizing package databases...
 clangarm64 downloading...
 msys downloading...
:: Starting core system upgrade...
 there is nothing to do
:: Starting full system upgrade...
 there is nothing to do
+ pacman --noconfirm -S --needed base-devel mingw-w64-clang-aarch64-toolchain mingw-w64-clang-aarch64-meson mingw-w64-clang-aarch64-cairo mingw-w64-clang-aarch64-crt-git mingw-w64-clang-aarch64-glib-networking mingw-w64-clang-aarch64-gobject-introspection mingw-w64-clang-aarch64-json-glib mingw-w64-clang-aarch64-lcms2 mingw-w64-clang-aarch64-lensfun mingw-w64-clang-aarch64-libspiro mingw-w64-clang-aarch64-maxflow mingw-w64-clang-aarch64-openexr mingw-w64-clang-aarch64-pango mingw-w64-clang-aarch64-suitesparse mingw-w64-clang-aarch64-vala
WARNING: Failed to terminate process: 1 error occurred:
	* failed to attach the runner process to the console of its parent process: The handle is invalid.

When I tested on my MS Dev Kit just now, it did not get stuck (but I do remember seeing that sometimes in the past). However, when checking with Process Explorer, I do see that after pacman finished and the terminal closed, the pacman and conhost processes were still running.

Expected behavior

Pacman finishes after doing its thing.

Actual behavior

Pacman gets stuck

Verification

Windows Version

MSYS64_NT-10.0-22621

Are you willing to submit a PR?

No response

Wormnest added the bug label Jan 11, 2024
@Biswa96
Member

Biswa96 commented Jan 11, 2024

Does the CI script terminate msys2 processes after update? See https://github.com/msys2/setup-msys2/blob/8b0d40b8912601756301a7b3de7752d5dba969cd/main.js#L408

In that main.js file, the sequence is: pacman -Syuu updates the base packages, then all msys2-related processes are terminated, then the remaining packages are updated.

@jeremyd2019
Member

jeremyd2019 commented Jan 11, 2024

There is also a bug (presumably in msys2-runtime), which I have never been able to debug, that manifests as processes hanging around when they should have exited. This usually seems to happen when pacman uses gpgme to attempt to validate signatures. As a workaround, I disable the validation of database signatures in pacman.conf, because the database signature verification seems to happen every time pacman is run, whereas package verification is only done when a package is being installed. I still see occasional hangups in package verification though.

See also msys2/msys2-autobuild#62 which I think is the closest thing to an existing bug tracking this.

Workarounds I currently apply:

REM https://github.com/msys2/msys2-autobuild/issues/62
CALL C:\msys64\msys2_shell.cmd -defterm -no-start -c "mkdir -p /etc/pacman.d/hooks && touch /etc/pacman.d/hooks/texinfo-{install,remove}.hook"
REM the caret is messing with CMD parsing, try it another way
C:\msys64\usr\bin\sed.exe -i -e 's/^^\(SigLevel\s\+=\s\+Required\)\s*$/\1 DatabaseNever/' /etc/pacman.conf
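
For anyone applying the same SigLevel change from inside an MSYS2 bash shell instead of CMD, a minimal sketch could look like the following. The function name disable_db_sig_check is made up for illustration, GNU sed is assumed, and the pattern only matches a line that reads exactly "SigLevel = Required" with nothing after it.

```shell
# Hypothetical helper (not part of MSYS2): append DatabaseNever to the
# "SigLevel = Required" line so pacman skips database signature checks.
# Assumes GNU sed, which understands \s and \+ in basic regexes.
disable_db_sig_check() {
  local conf="${1:-/etc/pacman.conf}"
  sed -i -e 's/^\(SigLevel\s\+=\s\+Required\)\s*$/\1 DatabaseNever/' "$conf"
}
```

Because the pattern requires the line to end right after "Required", running the helper a second time leaves the file unchanged.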

@Wormnest
Author

Does the CI script terminate msys2 processes after update?

We are now doing something similar, which seems to get us further, but it now runs into a database lock.

On my own Arm Dev Kit, I just updated pacman, which went without problems. However, it is now stuck compiling part of GIMP (I think I've seen this before too). So this getting stuck is probably not specific to pacman. Looking in Process Explorer, the innermost process is env.exe. Could it be related to reading/setting env vars, which I seem to remember can cause problems when done from multiple threads?

@Biswa96
Member

Biswa96 commented Jan 15, 2024

The database lock file /var/lib/pacman/db.lck can be deleted before running the pacman command, though I am not sure whether that will fix the actual issue.
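
A minimal sketch of that lock cleanup as a CI pre-step. The helper name clear_stale_lock is hypothetical, and it assumes no other pacman instance is legitimately running, since the lock exists to stop two instances from modifying the database at once.

```shell
# Hypothetical helper: remove a leftover pacman lock file before a CI run.
# Only safe when no pacman process is still using the database.
clear_stale_lock() {
  local lock="${1:-/var/lib/pacman/db.lck}"
  if [ -e "$lock" ]; then
    echo "removing stale lock: $lock"
    rm -f -- "$lock"
  fi
}
```

Called without arguments it targets /var/lib/pacman/db.lck; it does nothing and prints nothing when the file is absent.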

@brunvonlope

Hi. The database lock is being investigated externally (this is not an MSYS2 bug). The problem is that, even when the database lock is not a concern, we can't kill pacman easily. See: https://gitlab.gnome.org/GNOME/gimp/-/jobs/3458478#L157

@Biswa96
Member

Biswa96 commented Jan 15, 2024

The msys2 related processes should be terminated outside of msys2 environment. For example, taskkill command in a batch file. See setup-msys2 repository as mentioned.

@brunvonlope

brunvonlope commented Jan 15, 2024

The msys2 related processes should be terminated outside of msys2 environment. For example, taskkill command in a batch file. See setup-msys2 repository as mentioned.

I tried taskkill before, but its exit code makes the job fail.
Inkscape, for example, uses taskkill, but they use the "retry" key in their CI .yml.

@Biswa96
Member

Biswa96 commented Jan 15, 2024

OK, I am out of ideas then. By the way, @hmartinez82 has done some great work of porting apps to aarch64. He may suggest some ideas.

@jeremyd2019
Member

I wish somebody with good knowledge of low-level debugging on WoA (and/or of Cygwin) could debug this. I have tried and had no luck (I always got an error getting the context of the main thread, from every debugger I tried: windbg, gdb, lldb).

@hmartinez82
Contributor

hmartinez82 commented Jan 16, 2024

I even see this happening, randomly, on my personal laptop when using pacman.

@Biswa96 GIMP's aarch64 CI runner is actually a Windows DevKit in my living room 😅. I joke you not.

I'm not a low-level debugger. Actually, I'm thinking about installing Tailscale on that VM and letting someone else with more expertise take a look.

@Jehan

Jehan commented Feb 1, 2024

Hi all! We now have new runners contributed by Arm Ltd., in addition to the one from @hmartinez82, and this pacman hang is also happening randomly on their runners.

Is there anything we could tell the admins at Arm to look for in order to help you debug this issue?

@hmartinez82
Contributor

@Jehan I'm glad they are having it too, so we now know it's not just my runner. I don't know what the issue is.

@jeremyd2019
Member

https://cygwin.com/pipermail/cygwin-patches/2024q1/012617.html

gnomesysadmins pushed a commit to GNOME/gimp that referenced this issue Feb 8, 2024
MSYS2 pacman gets randomly stuck on Windows/Aarch64. The actual issue is still
being investigated by upstream projects, though anyway it's bad for us right
now, to the point that there are discussions to remove Aarch64 support from the
Windows installer (whereas it just got added recently!) in #10729.

This is an attempt at a workaround. Instead of getting stuck forever and waiting
until the whole job times out (per Gitlab CI settings), I time-out the pacman
command within our script and try again, up to 2 more times. Hopefully one of
the calls would succeed.

See: msys2/MSYS2-packages#4340
gnomesysadmins pushed a commit to GNOME/gimp that referenced this issue Feb 8, 2024
MSYS2 pacman gets randomly stuck on Windows/Aarch64. The actual issue is still
being investigated by upstream projects, though anyway it's bad for us right
now, to the point that there are discussions to remove Aarch64 support from the
Windows installer (whereas it just got added recently!) in #10729.

This is an attempt at a workaround. Instead of getting stuck forever and waiting
until the whole job times out (per Gitlab CI settings), I time-out (after 3
minutes) the pacman command within our script and try again, up to 2 more times.
Hopefully one of the calls would succeed.

I also send a SIGKILL through the timeout (though I have no idea how signals
translate to Windows processes) and run again taskkill after this, which may
seem overkill. Interestingly I get output for both, which seems to indicate that
the kill succeeds in both cases (because of several processes?).

Anyway clearly it's a bit of random code not completely understood, but the
inability to test this all locally clearly doesn't help so it's good enough for
the time being.

See: msys2/MSYS2-packages#4340
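
The time-out-and-retry idea from the commit message above can be sketched roughly as follows. This is not the actual GIMP script: the function name retry_with_timeout is hypothetical, and it assumes the coreutils timeout command is available in the MSYS2 shell.

```shell
# Hypothetical helper: run a command under a hard time limit and retry it.
#   $1 = time limit per attempt in seconds
#   $2 = total number of attempts
#   remaining arguments = the command to run
retry_with_timeout() {
  local limit="$1" attempts="$2"
  shift 2
  local n=1
  while [ "$n" -le "$attempts" ]; do
    # timeout sends SIGKILL once the limit is reached; a non-zero status
    # means the command either failed on its own or was killed.
    if timeout --signal=KILL "$limit" "$@"; then
      return 0
    fi
    echo "attempt $n of $attempts failed or timed out: $*" >&2
    n=$((n + 1))
  done
  return 1
}
```

A call such as retry_with_timeout 180 3 pacman --noconfirm -Suy would correspond to the 3-minute limit and "up to 2 more times" described in the commit. A killed pacman may leave its database lock behind, so a real script would probably need to deal with that between attempts.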
gnomesysadmins pushed a commit to GNOME/gimp that referenced this issue Feb 8, 2024
…64 jobs.

This is the command suggested by MSYS2 developers here:
msys2/MSYS2-packages#4340 (comment)

They also say to run it outside the MSYS2 environment, which is why it's in the
CI rules, not in the shell script.
gnomesysadmins pushed a commit to GNOME/gimp that referenced this issue Feb 8, 2024
…64 jobs.

This is the command suggested by MSYS2 developers here:
msys2/MSYS2-packages#4340 (comment)

They also say to run it outside the MSYS2 environment, which is why it's in the
CI rules, not in the shell script.

Honestly, at this point it feels like we are just stacking weird workarounds to
get it to fail not too often. ;-(
@Alovchin91

Just a heads-up: this has been fixed by @jeremyd2019's patch that made it into msys2-runtime 3.5.4-5.

@brunvonlope

After the fix pacman is getting stuck on x64 and x86 MSYSTEMs at gpg phase using @lazka runners

@Alovchin91

After the fix pacman is getting stuck on x64 and x86 MSYSTEMs at gpg phase using @lazka runners

@jeremyd2019 Have you experienced this? 🤔

@Alovchin91

Alovchin91 commented Nov 21, 2024

There's this in CancelSynchronousIo function's documentation:

Most types of operations can be canceled immediately; other operations can continue toward completion before they are actually canceled and the caller is notified. The CancelSynchronousIo function does not wait for all canceled operations to complete.

I have seen some differences with other overlapped I/O functions between x64 and arm64 systems, namely ReadDirectoryChangesEx function has returned an error code synchronously on arm64, while on x64 it has (sometimes?) reported it to the overlapped operation. (I'm not entirely certain if it was x64/arm64 or the difference in storage itself though.)

Could it be that CancelSynchronousIo doesn't really cancel all the I/O on x64? I can't think of anything that could be wrong with your suspend thread solution.

@lazka
Member

lazka commented Nov 21, 2024

Got two hangs in CI just now:

gpgme/libgpg-error also got updated and pacman rebuilt recently

[image: screenshot of a stack trace]

(should we open a new issue for this?)

@lazka
Member

lazka commented Nov 21, 2024

Which reminds me that we had a stuck job in autobuild some days ago (2024-11-17): https://github.com/msys2/msys2-autobuild/actions/runs/11876484140/job/33094866915

in a different place though (?)

@jeremyd2019
Member

jeremyd2019 commented Nov 21, 2024

(should we open a new issue for this?)

Probably. I am going to suspect the CancelSynchronousIo call, without any further evidence. Maybe there is some other synchronous io on the thread that is canceled, that does not result in the thread exiting as expected.

It looks like it's hanging in pacman/gpgme so maybe I'll start a loop of pacman -Suu and see if it hangs so I can debug.

@hmartinez82
Contributor

hmartinez82 commented Nov 22, 2024

If you think it's worth keeping the CancelSynchronousIo call only for ARM64, with the decision made at runtime, then IsWow64Process2 is an option.

@jeremyd2019
Member

I don't think it's necessary at all. The SuspendThread/GetThreadContext dance fixed the ARM64 issue; the CancelSynchronousIo call was an attempt to avoid the need for TerminateThread altogether. If I can get some insight into what's going wrong, and it is due to CancelSynchronousIo somehow, it'd be better to just revert that part of the patch.

I sort of suspect whoever got that stack trace in the image above attached to the "wrong" process in the tree, though. That stack looks like a process in the waitpid call in the "grandparent". It could be that the wait_thread is messed up and not notifying it that the "parent" is gone, or, more likely, the "parent" is screwed somehow (that's what was happening with the longstanding ARM64 issue).

@jeremyd2019
Member

jeremyd2019 commented Nov 22, 2024

I was able to reproduce and poke in the debugger. wait_thread is happily waiting in ReadFile, while the main thread is hanging in ForceCloseHandle1 (close_h, rd_proc_pipe). The pinfo::release call comes after the wait thread should have been terminated, so it's apparent the CancelSynchronousIo didn't work for whatever reason (possibly because it canceled some other synchronous IO, letting the thread come back around in the loop and call ReadFile again).

Using gdb to call CancelSynchronousIo again resulted in the wait thread exiting immediately.

@Alovchin91

Probably good to let @dscho know as well

@Alovchin91

https://inbox.sourceware.org/cygwin-patches/[email protected]/T/#m98d5eebc0cda7df653275e3294abdcfd7cb9bf85

Probably makes sense to undo the CancelSynchronousIo change at least in msys2-runtime, since it has already been merged and published and is now causing issues, wdyt?

jeremyd2019 added a commit to jeremyd2019/msys2-runtime that referenced this issue Nov 22, 2024
It appears this is causing hangs on native x86_64 in similar scenarios
as the hangs on ARM64.

Addresses: msys2/MSYS2-packages#4340
jeremyd2019 added a commit to jeremyd2019/msys2-runtime that referenced this issue Nov 22, 2024
It appears this is causing hangs on native x86_64 in similar scenarios
as the hangs on ARM64.

Addresses: msys2/MSYS2-packages#4340 (comment)
jeremyd2019 added a commit to jeremyd2019/msys2-runtime that referenced this issue Nov 22, 2024
It appears this is causing hangs on native x86_64 in similar scenarios
as the hangs on ARM64, because `CancelSynchronousIo` is returning `TRUE`
but not canceling the `ReadFile` call as expected.

Addresses: msys2/MSYS2-packages#4340 (comment)
Fixes: b091b47 ("cygthread: suspend thread before terminating.")
@jeremyd2019
Member

I did a quick commit last night (my time) and fired off some tests on x86_64 and ARM64 while I slept.
while pacsift --base xorriso > /dev/null; do true; done (this is how I reproduced the hang to debug).

There were no hangs on either architecture, though I saw a couple of errors like this on ARM64:
111 [main] pacsift 29659 dofork: child -1 - forked process 6796 died unexpectedly, retry 0, exit code 0xC0000005, errno 11

These didn't cause any hang or other error output, and didn't abort the while loop...

@Alovchin91

111 [main] pacsift 29659 dofork: child -1 - forked process 6796 died unexpectedly, retry 0, exit code 0xC0000005, errno 11

Some quick irresponsible googling has brought me here and here.

tl;dr: Might have something to do with ASLR.

jeremyd2019 added a commit to jeremyd2019/msys2-runtime that referenced this issue Nov 22, 2024
It appears this is causing hangs on native x86_64 in similar scenarios
as the hangs on ARM64, because `CancelSynchronousIo` is returning `TRUE`
but not canceling the `ReadFile` call as expected.

Addresses: msys2/MSYS2-packages#4340 (comment)
Fixes: b091b47 ("cygthread: suspend thread before terminating.")
Signed-off-by: Jeremy Drake <[email protected]>
dscho added a commit to dscho/MSYS2-packages that referenced this issue Nov 23, 2024
This change seems to have caused hangs on x86_64, so let's revert it.

Addresses msys2#4340 (comment)
and corresponds to msys2/msys2-runtime#243.

Signed-off-by: Johannes Schindelin <[email protected]>
lazka pushed a commit that referenced this issue Nov 23, 2024
This change seems to have caused hangs on x86_64, so let's revert it.

Addresses #4340 (comment)
and corresponds to msys2/msys2-runtime#243.

Signed-off-by: Johannes Schindelin <[email protected]>
jeremyd2019 added a commit to jeremyd2019/msys2-runtime that referenced this issue Nov 23, 2024
It appears this is causing hangs on native x86_64 in similar scenarios
as the hangs on ARM64, because `CancelSynchronousIo` is returning `TRUE`
but not canceling the `ReadFile` call as expected.

Addresses: msys2/MSYS2-packages#4340 (comment)
Fixes: b091b47 ("cygthread: suspend thread before terminating.")
Signed-off-by: Jeremy Drake <[email protected]>
lazka pushed a commit to msys2/msys2-runtime that referenced this issue Nov 23, 2024
It appears this is causing hangs on native x86_64 in similar scenarios
as the hangs on ARM64, because `CancelSynchronousIo` is returning `TRUE`
but not canceling the `ReadFile` call as expected.

Addresses: msys2/MSYS2-packages#4340 (comment)
Fixes: b091b47 ("cygthread: suspend thread before terminating.")
Signed-off-by: Jeremy Drake <[email protected]>
@jeremyd2019
Member

The fix for the x86_64 hang is merged now, and it looks like msys2-runtime 3.5.4-7 is already in the repo.

@jeremyd2019
Member

jeremyd2019 commented Nov 24, 2024

Looks like a hang on https://github.com/msys2/msys2-autobuild/actions/runs/11990401626/job/33427879361, not sure why though.

UPDATE: maybe not hung, just uploading really really slowly?

@Alovchin91

Yes, it looks like it's going at about 1 package per 20 minutes or something.

dscho pushed a commit to dscho/msys2-runtime that referenced this issue Nov 25, 2024
It appears this is causing hangs on native x86_64 in similar scenarios
as the hangs on ARM64, because `CancelSynchronousIo` is returning `TRUE`
but not canceling the `ReadFile` call as expected.

Cherry-picked from msys2/msys2-runtime's 2eb6be14ee (Cygwin: revert use
of CancelSyncronousIo on wait_thread., 2024-11-21).

Addresses: msys2/MSYS2-packages#4340 (comment)
Fixes: b091b47b9e56 ("cygthread: suspend thread before terminating.")
Signed-off-by: Jeremy Drake <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
github-cygwin pushed a commit to cygwin/cygwin that referenced this issue Nov 25, 2024
It appears this is causing hangs on native x86_64 in similar scenarios
as the hangs on ARM64, because `CancelSynchronousIo` is returning `TRUE`
but not canceling the `ReadFile` call as expected.

Addresses: msys2/MSYS2-packages#4340 (comment)
Fixes: b091b47 ("cygthread: suspend thread before terminating.")
Signed-off-by: Jeremy Drake <[email protected]>
stahta01 pushed a commit to stahta01/MSYS2-cygwin-packages that referenced this issue Dec 21, 2024
This change seems to have caused hangs on x86_64, so let's revert it.

Addresses msys2/MSYS2-packages#4340 (comment)
and corresponds to msys2/msys2-runtime#243.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho pushed a commit to msys2/msys2-runtime that referenced this issue Dec 21, 2024
It appears this is causing hangs on native x86_64 in similar scenarios
as the hangs on ARM64, because `CancelSynchronousIo` is returning `TRUE`
but not canceling the `ReadFile` call as expected.

Addresses: msys2/MSYS2-packages#4340 (comment)
Fixes: b091b47 ("cygthread: suspend thread before terminating.")
Signed-off-by: Jeremy Drake <[email protected]>