-
Notifications
You must be signed in to change notification settings - Fork 914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ros::Timer & ros::ServiceServer finalization failure #2283
Comments
Making sure currently executing callback finishes before removing thread returns from removeByID allows to avoid race condition in a case when e.g. ros::Timer held in a class and capturing `this` is stopped just before destruction of the class and its data members: https://github.com/aurzenligl/study/blob/master/ros-timer/src/race.cpp
Rationale for the testcase was the following deadlock scenario: https://gist.github.com/iwanders/ede48fb649fd47f9b1f9a52c527b463c Changed testcase presents how the same scenario can be carried out with blocking removeByID (with exception for self-removal). The external mutex from the scenario must be unlocked for the call of ros::Timer::stop, otherwise scenario stays as it is. External thread returns from removeByID once cb call finishes, spinner thread returns immediately.
Yikes, that's definitely not good. Thanks for a minimal example to reproduce this failure.
I'm not sure I agree with this behaviour, nothing in the documentation states that cancelling the timer will only happen after any current callbacks complete their execution. In my opinion cancelling a timer should be non blocking in all situations. But I think this is a matter of opinion, and different people may have different assumptions around this. If it is blocking until any callbacks are completed (like in #2284) we go back to the situation where it's very easy to accidentally make the system deadlock with an external mutex in play. Requiring the user to unlock its mutex to allow any callback to complete before being able to cancel the timer (as proposed there) may not be possible to do without introducing any thread safety issues on the user side. 🤔 I'm not too sure yet how we can solve this in a way that works for all use cases, especially if the historic behaviour has been to block and there's code out there that relies on that. Maybe we revert the fix for the deadlock (#2121) and update the documentation? Stating that all running callbacks must be able to finish while blocked on the timer stop? Basically reverting back to the old behaviour and possibly deadlocking when there's a user mutex involved. fyi @guillaumeautran , @mikepurvis , @jasonimercer for awareness. The timer deadlock fix from #2121 (CORE-18232) introduces memory unsafety in some situations. |
True, external mutex use-case can coexist with the old blocking stop behavior, assuming that:
It would be great to have the old blocking behavior back and defined in documentation. 👍 Still, self-removal might be a worthwhile addition, previously self-removal would deadlock, so couldn't have been relied on. |
Delivering #2121 broke existing
ros::Timer
andros::ServiceServer
usages, introducing possibility of using destroyed objects by last callback call. Affected usages in public projects look like this:https://github.com/ros-planning/moveit/blob/1.1.10/moveit_ros/moveit_servo/include/moveit_servo/collision_check.h
https://github.com/ros-planning/moveit/blob/1.1.10/moveit_ros/moveit_servo/src/collision_check.cpp
One production system I'm aware of was affected as well.
Before #2121 stop/shutdown post-condition ensured callback is not being executed, now it doesn't, allowing destruction of referred-to data and undefined behavior when last callback call continues. Following example illustrates such crash scenario:
https://github.com/aurzenligl/study/blob/master/ros-timer/src/race.cpp
When
ros::Timer
orros::ServiceServer
is created with a callback capturing data by reference, there is no easy way to guarantee that lifetime of callback execution doesn't exceed the lifetime of data it uses. Problem occurs in code bases where objects composing timers or srv-servers (capturingthis
) are created/destroyed repeatedly while spinners are running.Phrasing that documentation uses to describe stop/shutdown/dtor behavior is as such:
It's imprecise wrt waiting for currently executing callback, but since all above classes did block in stop/shutdown before #2121, and
ros::Subscriber
still blocks, I assume proper/expected/consistent behavior is to block.The text was updated successfully, but these errors were encountered: