Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix missing increment of the instance count on every call to startup() #3100

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mbechard
Copy link

@mbechard mbechard commented Jan 3, 2025

Change how the implicit startup is tracked.
The extra change is needed since the startup() calls in
srt::CUDT::socket()
and
srt::CUDT::createGroup
would endlessly increment the startup counter as sockets were created
otherwise.

Force a clean() when the singleton is getting destructed.
Previously it just did a single decrement of the instance count,
which could result in attempting to exit without actually cleaning up.

fixes #3098

@mbechard mbechard marked this pull request as draft January 3, 2025 22:08
@ethouris
Copy link
Collaborator

ethouris commented Jan 6, 2025

As for me it looks ok, but Max will be back in the second half of January.

@ethouris ethouris added this to the v1.5.5 milestone Jan 6, 2025
@mbechard
Copy link
Author

mbechard commented Jan 6, 2025

Still has bugs with existing code, as per the issue discussion

@mbechard mbechard marked this pull request as ready for review January 6, 2025 21:18
@mbechard
Copy link
Author

mbechard commented Jan 6, 2025

I think this modified solution is reasonable to handle the remaining issues.

srtcore/api.cpp Outdated Show resolved Hide resolved
@ethouris
Copy link
Collaborator

ethouris commented Jan 8, 2025

Wait, there's one thing I don't understand.

Why do you need to prevent the call to cleanup()? The call to cleanup() should be able to be done multiple times, even if at least one of the series has done the real cleanup.

The only rule was that there should be done as many (meaning, the minimum) cleanup calls as there was startup calls, but extra cleanup() calls, which actually do nothing, should be allowed. Have you found a problem with extra cleanups?

@mbechard
Copy link
Author

mbechard commented Jan 8, 2025

This doesn't disallow extra cleanup in general, that allowance remains in place. The change here though is that it avoids an unpaired cleanup from internal code. Doing a cleanup internally when a startup wasn't done internally can hide bugs in the API users code. And since the cleanup done in srt::CUDTUnited::~CUDTUnited() is only just doing a single count decrement, whether or not it actually avoids the app closing without doing the cleanup isn't dependable.
If anything, I would contend that srt::CUDTUnited::~CUDTUnited() should force a cleanup no matter what counter is, as long as it's > 0. That would ensure the cleanup on shutdown.

@ethouris
Copy link
Collaborator

ethouris commented Jan 9, 2025

The cleanup call is prevented if this wasn't started internally.

I don't understand the reason of this blocking. Starting it internally should simply call exactly the same startup(), with exception that it should check if it isn't initialized already, to ensure that an internal startup is really done only once. So, this is called when creating a socket and it should ensure that this can be called only once, and only once increase the number of instances. Just the call from srt_startup() should increase the number of instances every time it is called.

@mbechard
Copy link
Author

mbechard commented Jan 9, 2025

It depends what version of the code you are looking at. What you are stating was correct for the old code, before the original bug was introduced. However multiple changes occurred with that bug commit. The current master fails to increment on any but the first startup call, which results in incorrect cleanups due to the counting being incorrect. If we fix it so every startup() increments the count, then the counting increments too much due to internal startups, which occur every time a socket is created now. This is new behavior because the code used to avoid the internal startup() call with a if (m_bGCStatus) check, which has now been removed.

This PR handles both of those new issues. There are certainly multiple ways to solve this. My commit is one way. Feel free to solve it another if you wish.

@ethouris
Copy link
Collaborator

All I was asking was the reasoning of introducing the new field for marking the implicit initialization and performing the cleanup only under this condition. And how well it handles a situation, when after creating a socket without having called srt_startup() first this function is called again afterwards.

Things that I think should be taken into account are:

  • The user may call srt_startup() in the beginning (before creating the first socket) or not
  • The user may forget to call srt_cleanup() at the end
  • The srt_startup() may be called again without having called srt_cleanup()

This version you provided, for example, prevents the cleanup action from being called in a situation when a user called srt_startup() in the beginning, but didn't call srt_cleanup() (for example, when the program has ended with an exception). At least this is the behavior as I understand it.

@mbechard
Copy link
Author

mbechard commented Jan 10, 2025

I added the new field because making a choice on implicit startup/cleanups based on the state of the GC thread bool seems error prone. If the logic for the GC thread changes in the future, it may introduce bugs. The GC management is a sub-portion of the startup/cleanup procedure.

Agreed the new behavior does avoid a single srt_startup() getting implicitly cleaned up on program exit. However the old behavior only did that for a single call. If the user called srt_startup() twice, the old behavior wouldn't do a cleanup either since it's only doing a single decrement. So I think that old behavior was of dubious usefulness.

It would seem the improvement over both the old and new behavior would to ignore the counter and force a cleanup on program exit if it has still been left in the started state. I'll add those changes.

@mbechard
Copy link
Author

The old code has race condition in it too. The check in srt::CUDT::socket() for the state of m_bGCStatus isn't protected by a lock so it could result in multiple implicit startups calls if multiple threads hit it at the same time. That would result in the count growing more than 1 via implicit startups, so the cleanup() wouldn't end up actually triggering a cleanup anyways.

@mbechard
Copy link
Author

Ok see how you feel about this one instead. I'm not a huge fan of the boolean traps added to startup()/cleanup(). For cleanup() I could make it an internal flag in the class that the destructor sets, if you prefer.
For startup() it's hard to do this cleanly since the internal implementation of UniqueLock doesn't have move constructor/operator.

@mbechard mbechard changed the title fix missing increment of the reference count on every call to startup() fix missing increment of the instance count on every call to startup() Jan 10, 2025
srtcore/api.cpp Outdated
if (m_iInstanceCount == 0)
return 0;

if (--m_iInstanceCount > 0)
return 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be bound.

Suggested change
if (m_iInstanceCount == 0)
return 0;
if (--m_iInstanceCount > 0)
return 0;
if (m_iInstanceCount == 0 || --m_iInstanceCount > 0)
return 0;

srtcore/api.cpp Outdated
@@ -233,15 +228,24 @@ string srt::CUDTUnited::CONID(SRTSOCKET sock)
return os.str();
}

int srt::CUDTUnited::startup()
void srt::CUDTUnited::implicitStartup()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you have a parameter in startup(), this function is no longer necessary. You can call it as startup(true) in socket and group creation functions.

@ethouris
Copy link
Collaborator

Ok, now this looks definitely better. I would keenly have some unit tests for this, though; I'll try to add something myself. For this these functions would have to return some status value, which simply states whether something was executed or not. I'll try to figure out something myself; we'd need also a function that can report the current number of instances, but should be able to use the CUDTClass directly, not the API.

@ethouris
Copy link
Collaborator

Ok, I made a new #3104 as draft so that you can take the changes there. Things that I reported in review comments are applied, and the test is added. I marked places that should be changed for 1.6.0, but for the 1.5.5 they must stay the same to maintain the rules of the API, so this is only marked, and changes would be then provided in a separate PR for 1.6.0. And the test is also there.

Please take a look and you can also get these changes to apply here - just add ".patch" to the URI.

@mbechard
Copy link
Author

Ok thanks, I've re-pushed your patch. Appreciate the feedback and the patch

@ethouris
Copy link
Collaborator

Ok, that comment I reported I added it - forgot about that myself ;).

@ethouris ethouris added Type: Bug Indicates an unexpected problem or unintended behavior [core] Area: Changes in SRT library core labels Jan 14, 2025
@ethouris
Copy link
Collaborator

Ok, something's wrong, the tests on Travis fail. On my draft PR they fail either. Need to see into this.

@ethouris
Copy link
Collaborator

Alright, I found the problem. there are two additional changes in the tests. One of them just performs one testing step in the text that HAPPENS NEXT to check if the PREVIOUS TEST did the job right ;). Unfortunately there's no way to check if the global destructor has run correctly because it happens at the end of the testing application itself. Can you please take the updates from #3104?

@ethouris
Copy link
Collaborator

Ok small update: only RIGHT NOW I was able to make all tests pass also in Travis. Some tests rely on doing some things in timely manner and irregularly working machines can fail there, including in cases when an error is expected.

Change how the implicit startup is tracked.
The extra change is needed since the startup() calls in
srt::CUDT::socket()
and
srt::CUDT::createGroup
would endlessly increment the startup counter as sockets were created
otherwise.

Force a clean() when the singleton is getting destructed.
Previously it just did a single decrement of the instance count,
which could result in attempting to exit without actually cleaning up.

fixes Haivision#3098
@mbechard
Copy link
Author

Ok re-pushed, thanks for your edits!

if (m_bGCStatus)

if (implicit && m_iInstanceCount > 0)
// [reinstate in 1.6.0] return m_iInstanceCount;

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
return 1;

if (m_iInstanceCount++ > 0)
// [reinstate in 1.6.0] return m_iInstanceCount;

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
// [reinstate in 1.6.0] return m_iInstanceCount;
return 1;

if (m_bGCStatus)

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
return 0;
if (!force)
{
if (m_iInstanceCount == 0 || --m_iInstanceCount > 0)

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Regression with commit bc2f48e057644ab9
2 participants