[BUG] A thread is stuck waiting to close an SRT socket due to an internal SRT mutex lock. #2944
Comments
From the provided stack traces it is not clear who holds the
But judging from the state of the GC thread, the case described above is not the one happening here.
|
Can this problem still be reproduced with #2893 applied? |
Thread sanitizer shows a potential deadlock around those functions:
|
There was some PR where m_ConnectionLock was released for the duration of the call to closeInternal(), but I don't remember the details. The inversion between m_LSLock and m_ConnectionLock is known, but it was later proven that the two code paths can never run simultaneously; the thread sanitizer only reports that they could in theory. It would be nice to resolve the inversion, but no sensible fix has been found. |
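For readers unfamiliar with the term, a lock-order inversion means two code paths take the same pair of mutexes in opposite order. The following is only a generic sketch of that situation and of one common way to make it safe (acquiring both mutexes through `std::lock`, which uses a deadlock-avoidance algorithm). The names `TwoLocks`, `path_one`, `path_two` and `run_both_paths` are hypothetical and are not SRT code; this is not a claim about how m_LSLock and m_ConnectionLock should be handled in the library.

```cpp
#include <mutex>
#include <thread>

// Hypothetical stand-ins for two mutexes that different code paths
// want in opposite order. Not the actual SRT members.
struct TwoLocks {
    std::mutex first;
    std::mutex second;
    long counter = 0;  // shared state protected by holding both mutexes
};

// Path one names the mutexes in the order first, second.
void path_one(TwoLocks& l) {
    std::lock(l.first, l.second);  // acquires both without deadlocking
    std::lock_guard<std::mutex> g1(l.first, std::adopt_lock);
    std::lock_guard<std::mutex> g2(l.second, std::adopt_lock);
    ++l.counter;
}

// Path two names them in the opposite order. With two plain lock() calls
// this would be a classic inversion; std::lock makes the order irrelevant.
void path_two(TwoLocks& l) {
    std::lock(l.second, l.first);
    std::lock_guard<std::mutex> g1(l.second, std::adopt_lock);
    std::lock_guard<std::mutex> g2(l.first, std::adopt_lock);
    ++l.counter;
}

// Runs both paths concurrently and returns the final counter value.
long run_both_paths(int iterations) {
    TwoLocks l;
    std::thread t1([&l, iterations] {
        for (int i = 0; i < iterations; ++i) path_one(l);
    });
    std::thread t2([&l, iterations] {
        for (int i = 0; i < iterations; ++i) path_two(l);
    });
    t1.join();
    t2.join();
    return l.counter;
}
```

Calling `run_both_paths(20000)` should finish and return 40000. This approach only works when both mutexes are acquired at the same point in the code, which may well not be the case in a real code base where the second lock is taken deep inside a call chain.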
I can try this next week |
I tried the fix from the branch "dev-add-socket-busy-counter" and can still reproduce the same problem. Let me know if you need more info. |
We need to somehow identify who is holding the mutex (see my comment above). Can you check the other threads? There should be one trying to lock another mutex while |
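One possible way to see who holds a pthread mutex, assuming glibc on Linux (which the backtraces in this issue suggest): glibc records the kernel thread ID of the owner in the internal field `__data.__owner` of `pthread_mutex_t`. This is an implementation detail, not a supported API, and it may be absent with other C libraries or when lock elision is enabled. In gdb the same field can usually be read with something like `print ((pthread_mutex_t*)ADDRESS)->__data.__owner`, where ADDRESS is the mutex address shown in the backtrace, and the value can be compared with the LWP numbers from `info threads`. The helper names below (`current_tid`, `mutex_owner_tid`, `owner_matches_while_locked`) are hypothetical and only illustrate the idea.

```cpp
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__linux__) && defined(__GLIBC__)
#define HAVE_GLIBC_MUTEX_OWNER 1
#else
#define HAVE_GLIBC_MUTEX_OWNER 0
#endif

// Kernel thread ID of the caller; this is the number gdb prints as "LWP".
long current_tid() {
#if HAVE_GLIBC_MUTEX_OWNER
    return static_cast<long>(syscall(SYS_gettid));
#else
    return -1;  // not available on this platform
#endif
}

// Reads glibc's internal owner field. ASSUMPTION: glibc layout; this is
// an unsupported implementation detail intended for debugging only.
long mutex_owner_tid(const pthread_mutex_t& m) {
#if HAVE_GLIBC_MUTEX_OWNER
    return static_cast<long>(m.__data.__owner);
#else
    (void)m;
    return -1;
#endif
}

// Locks a mutex in the calling thread and checks that the recorded owner
// is the calling thread's kernel thread ID.
bool owner_matches_while_locked() {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&m);
    const long owner = mutex_owner_tid(m);
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return owner == current_tid();
}
```

A thread whose LWP matches the owner value and which is itself blocked on a second mutex would be the one to look at first.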
You may try the fix in #2032. I can't find any PR with the fix that unlocks for the duration of closeInternal(). |
Attached full BT of all threads. |
I see 7 threads calling This issue is likely similar to #2252. |
Just took a quick look - |
And the only other thing I can see where any of the SRT threads is standing is |
if (!self->m_bClosing)
{
    self->m_pSndUList->waitNonEmpty();
    IF_DEBUG_HIGHRATE(self->m_WorkerStats.lCondWait++);
}
while this notification is also being missed
srt::CSndQueue::~CSndQueue()
{
    m_bClosing = true;
    if (m_pTimer != NULL)
    {
        m_pTimer->interrupt();
    }
    // Unblock CSndQueue worker thread if it is waiting.
    m_pSndUList->signalInterrupt(); |
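The "missed notification" concern can be illustrated generically: if a worker checks a closing flag and only afterwards starts waiting, a signal sent in between is lost. A common pattern that avoids this, sketched below under the assumption of a plain `std::condition_variable`, is to change the flag while holding the waiter's mutex and to wait with a predicate. `ClosableWorker` and `stop_always_wakes_worker` are hypothetical names; this is not the CSndQueue implementation and makes no claim about how SRT's own synchronization wrappers behave.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical worker that sleeps until it is told to close.
class ClosableWorker {
public:
    ~ClosableWorker() { stop(); }

    void start() {
        thread_ = std::thread([this] { run(); });
    }

    // Sets the flag under the same mutex the worker waits on, then
    // notifies. The worker either sees closing_ == true before it waits,
    // or is already waiting and receives the notification.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            closing_ = true;
        }
        cv_.notify_all();
        if (thread_.joinable()) thread_.join();
    }

    // Only meaningful after stop() has returned (the thread is joined).
    bool finished() const { return finished_; }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mtx_);
        // The predicate is re-checked under the lock, so a notification
        // sent before the wait begins cannot be missed.
        cv_.wait(lock, [this] { return closing_; });
        finished_ = true;
    }

    std::mutex mtx_;
    std::condition_variable cv_;
    bool closing_ = false;
    bool finished_ = false;
    std::thread thread_;
};

// Starts and stops a worker repeatedly; returns false if any worker
// failed to observe the stop request.
bool stop_always_wakes_worker(int rounds) {
    for (int i = 0; i < rounds; ++i) {
        ClosableWorker w;
        w.start();
        w.stop();
        if (!w.finished()) return false;
    }
    return true;
}
```

If the flag were set without holding `mtx_`, the worker could evaluate the predicate as false, the stopper could set the flag and notify, and the worker would then block with nobody left to wake it.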
Hi, let me know if I can help from my side or if some info is needed. |
@maxsharabayko Can you confirm that srt_sendmsg uses the same mutex that is used for configuration? I ask because in my case the neptonManager thread is stuck forever. |
What do you mean by "for configuration"? |
For example, look at this backtrace, where the SRT API function srt_getsockstate is called (that is what I meant by "configuration"; ignore the name):
Versus "srt_sendmsg" calls. |
From your last backtrace the
#1 0x00007f38c168f411 in __GI___pthread_mutex_lock (mutex=0x7f38c2b5c480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f38c2a9168e in srt::CUDTUnited::getStatus(int) () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
#3 0x00007f38c2a95a0f in srt::CUDT::getsockstate(int) () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
Thread 26 is just about to release it:
#0 __lll_unlock_wake () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:339
#1 0x00007f38c169095e in __pthread_mutex_unlock_usercnt (decr=1, mutex=0x7f38c2b5c480 <srt::CUDT::uglobal()::instance+96>) at pthread_mutex_unlock.c:56
#2 __GI___pthread_mutex_unlock (mutex=0x7f38c2b5c480 <srt::CUDT::uglobal()::instance+96>) at pthread_mutex_unlock.c:356
#3 0x00007f38c2b2e9bc in srt::sync::ScopedLock::~ScopedLock() () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
#4 0x00007f38c2a77ff1 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling) [clone .cold] () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
#5 0x00007f38c2a9523c in srt::CUDT::sendmsg2(int, char const*, int, SRT_MsgCtrl_&) () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
But there are 7 other
If it is the case, then the problem is in |
Understood. I see that the neptonManager thread 39 is stuck waiting to acquire the m_GlobControlLock indefinitely, or at least for the several minutes I observed during debugging. Could this be due to a few threads sending packets (sendmsg) and holding the m_GlobControlLock for too long? By changing the logic, do you mean modifying how the SRT library uses m_GlobControlLock and making the changes described in this issue: (#2393)? |
|
Hi everyone,
Potential issue: A thread is stuck waiting to close an SRT socket due to an internal SRT mutex lock.
Backtrace of the blocked thread:
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f6ad3751411 in __GI___pthread_mutex_lock (mutex=0x7f6ad4c1e480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f6ad4b53b33 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling) () from ../libs/libsrt.so.1.5
#3 0x00007f6ad4b5bc58 in srt::CUDTUnited::close(int) () from ../libs/libsrt.so.1.5
#4 0x00007f6ad4b5bcec in srt::CUDT::close(int) () from ../libs/libsrt.so.1.5
#5 0x00000000004c1e78 in SrtConnectionManager::CheckForConnection (this=0x10ecea0 <SrtConnectionManager::GetInstance()::inst>) at muxer/SrtConnectionManager.cpp:244
About 6 more threads are also stuck on this mutex, e.g.:
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f954b9e2b33 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling) () from ../libs/libsrt.so.1.5
#3 0x00007f954b9e623c in srt::CUDT::sendmsg2(int, char const*, int, SRT_MsgCtrl_&) () from ../libs/libsrt.so.1.5
#4 0x00007f954b9e629f in srt::CUDT::send(int, char const*, int, int) () from ../libs/libsrt.so.1.5
#5 0x00000000004dad3d in OutSrtSocket::Send (successCnt=0x2f8d484, this=0x7f950194eeb0) at Revioly/OutSrtSocket.cpp:32
I also found that the SRT thread "SRT:GC" is blocked on it too:
41 Thread 0x7f94feffd700 (LWP 32188) "SRT:GC" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f954b9ed00e in srt::CUDTUnited::checkBrokenSockets() () from ../libs/libsrt.so.1.5
#3 0x00007f954b9ed6a8 in srt::CUDTUnited::garbageCollect(void*) () from ../libs/libsrt.so.1.5
#4 0x00007f954a5ddefc in start_thread (arg=<optimized out>) at pthread_create.c:479
#5 0x00007f9549ec122f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
SRT internal sockets thread example:
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f954b9ebfb8 in srt::CUDTUnited::newConnection(int, srt::sockaddr_any const&, srt::CPacket const&, srt::CHandShake&, int&, srt::CUDT*&) () from ../libs/libsrt.so.1.5
#3 0x00007f954ba1948d in srt::CUDT::processConnectRequest(srt::sockaddr_any const&, srt::CPacket&) () from ../libs/libsrt.so.1.5
#4 0x00007f954ba60156 in srt::CRcvQueue::worker_ProcessConnectionRequest(srt::CUnit*, srt::sockaddr_any const&) () from ../libs/libsrt.so.1.5
#5 0x00007f954ba60ec4 in srt::CRcvQueue::worker(void*) () from ../libs/libsrt.so.1.5
#6 0x00007f954a5ddefc in start_thread (arg=<optimized out>) at pthread_create.c:479
#7 0x00007f9549ec122f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
To Reproduce
Steps to reproduce the behavior:
No problem in SRT versions 1.5.1 or 1.5.2 (I didn't check 1.5.2). But I found the issue in version 1.5.3.
My setup has 70 SRT sockets (senders in listener mode), with the SRT latency set to 6000 ms.
The trouble starts on the sender-listener device right after the downstream device (receiver-caller) restarts or loses its connection.
Desktop