feat: handle sigterm to allow for graceful shutdown #337
@@ -22,8 +17,7 @@ def main():
        return

    try:
        doers = args.handler(args)
        directing.runController(doers=doers, expire=0.0)
I removed this call because the handlers themselves do not return any doers; they run them themselves.
agree
keria/src/keria/app/cli/commands/start.py, line 114 in 4981d84:
    directing.runController(doers=agency, expire=0.0)
I concur, this makes sense.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #337      +/-   ##
==========================================
+ Coverage   93.06%   93.77%   +0.71%
==========================================
  Files          36       37       +1
  Lines        7121     8288    +1167
==========================================
+ Hits         6627     7772    +1145
- Misses        494      516      +22
src/keria/app/cli/commands/start.py
@@ -111,8 +112,15 @@ def launch(args):
            bootPassword=args.bootPassword,
            bootUsername=args.bootUsername)

    directing.runController(doers=agency, expire=0.0)
    tock = 0.03125
    doist = doing.Doist(limit=0.0, tock=tock, real=True)
The `directing.runController` call did not return the `doist`, so there was no way to access it to be able to call `exit`. However, it was only two lines of code, so I thought we could just inline it here.
This makes sense. This is good enough for now. We could consider making a PR to change KERIpy to have `runController` either return the `Doist` or have a parameter to pass a shutdown signal.
lgtm
Converting to draft because it does not shut down the agents, only the agency, so it still needs some adjustments. Will pick up again soon.
LGTM
I just tried this with the vLEI#85 and it worked well. Yes, we need to send the shutdown signal to the agents, though that should be as simple as making a way to call |
I added an initial implementation here that calls |
Looks like the manual

    class Doist(tyming.Tymist):
        def exit(self, deeds=None):
            ...
            if deeds is None:
                deeds = self.deeds
            while(deeds):  # .close each remaining dog in deeds in reverse order
                dog, retime, doer = deeds.pop()  # pop it off in reverse (right side)
                if not dog:  # marker deed
                    continue  # skip marker
                try:
                    tock = dog.close()  # force GeneratorExit. Maybe log exit tock tyme <---- THIS IS THE CLOSE
                except StopIteration:
                    pass  # Hmm? Not supposed to happen!
                else:  # set done state
                    try:
                        doer.done = False  # forced close
                    except AttributeError:  # when using bound method for generator function
                        doer.__func__.done = False  # forced close
And, really, since this is fully cooperative multitasking I believe calling |
For any Doer, any resources that it opens should be opened inside the Doer's "enter" method and then closed inside the Doer's "exit" method. The Doist, when shutting down (i.e., exiting), will call "exit" on each of its Doers and DoDoers, and DoDoers will call exit on each of their Doers and DoDoers, all the way down. The Doist main loop has a try-finally that calls the Doist's exit method. Any break in the main loop will trigger the finally clause, thereby calling exit. This happens on a SigInt (KeyboardInterrupt or Ctrl-C), a SystemExit, or any other exception. Some system exits like SigTerm will terminate the process prematurely, before the exit method has time to complete. But SigInt will not, and it should exit cleanly. Therefore, in your DevOps process manager you should terminate Hio using SigInt, not SigTerm. Then you are reasonably assured that all the resources that are closed in Doer.exit() methods will be closed before the process running the Doist terminates.
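The lifecycle described in this comment can be illustrated without hio. The following is a minimal standard-library sketch, not hio's implementation: `MiniDoer`, `run_loop`, and `demo` are hypothetical names, and the only behavior taken from the comment is that a try-finally around the main loop calls exit on everything that was entered, in reverse order, whenever the loop is broken.

```python
class MiniDoer:
    """Hypothetical stand-in for a Doer: opens in enter(), closes in exit()."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def enter(self):
        self.log.append(f"open {self.name}")

    def recur(self):
        pass  # one slice of cooperative work per cycle

    def exit(self):
        self.log.append(f"close {self.name}")


def run_loop(doers, cycles, interrupt_at=None):
    """Run doers for some cycles; always exit entered doers in reverse order."""
    entered = []
    try:
        for doer in doers:
            doer.enter()
            entered.append(doer)
        for cycle in range(cycles):
            if interrupt_at is not None and cycle == interrupt_at:
                raise KeyboardInterrupt  # simulates Ctrl-C arriving mid-loop
            for doer in doers:
                doer.recur()
    finally:
        # Runs on normal completion and on any exception alike.
        while entered:
            entered.pop().exit()


def demo():
    log = []
    doers = [MiniDoer("db", log), MiniDoer("server", log)]
    try:
        run_loop(doers, cycles=10, interrupt_at=3)
    except KeyboardInterrupt:
        log.append("interrupted")
    return log
```

In this sketch the cleanup entries are logged before the caller ever sees the `KeyboardInterrupt`, which is the property the comment relies on for SigInt.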
@SmithSamuelM thank you for clarifying the default behavior of the Doist. That makes sense to me now. We need to move some of the startup logic for both the Agency and the individual Agent instances to the enter and close lifecycle functions in order to leverage the lifecycle context management of HIO. Then graceful shutdown will be as simple as calling Doist.exit(). I will go make that change.

I do support the addition of the SIGTERM handler since the default behavior of both Docker and Kubernetes is to send a SIGTERM, pause for a period (10s and 30s, respectively), and then send a SIGKILL if the process is still alive. systemd and other process managers that manage daemon-like processes also use SIGTERM, so it seems an appropriate choice to trigger the doist.exit() call like @lenkan has written.

@SmithSamuelM would it make sense to add a SIGTERM signal handler directly to the HIO library, or would you prefer the HIO library only handle SIGINT and leave it up to application libraries to decide if they want to handle SIGTERM? I imagine the broad adoption of SIGTERM by DevOps tools like Docker and K8s may make handling SIGTERM in HIO attractive. Let me know and I'll open a PR there if you would like.
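The "SIGTERM, wait, then SIGKILL" sequence mentioned in this comment can be sketched from the supervisor's side with the standard library. This is an illustration only, assuming a POSIX system; `stop_gracefully` and `demo` are hypothetical names, and the code does not reproduce how Docker or Kubernetes implement their grace periods.

```python
import signal
import subprocess
import sys


def stop_gracefully(proc, grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; the child cannot catch or delay it
        proc.wait()
        return "killed"


def demo():
    # A child with no SIGTERM handler: the default action ends it promptly.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    outcome = stop_gracefully(child, grace_seconds=10)
    return outcome, child.returncode
```

A child that needs longer than the grace period to clean up, or that never reacts to SIGTERM, ends up on the `"killed"` path, where no cleanup code in the child gets a chance to run.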
Are you not mixing up SIGTERM with, for example, SIGKILL? I would say that the SIGTERM signal is specifically meant to ask a process to terminate gracefully. Upon receiving it, the process decides for itself when it is time to exit. https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html
Agree. I also vote for implementing graceful shutdown of keria using SIGTERM. I also believe this is the behaviour the vast majority of users expect. I am not sure I would expect hio to do it for me, though. I would rather explicitly tell hio to stop at the keria level, by calling Doist.exit, like you suggested.
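The distinction drawn here between SIGTERM and SIGKILL can be checked directly from Python. The sketch below assumes a POSIX system and that it runs in the main thread (Python only allows signal handlers to be registered there); `can_install_handler` is a hypothetical helper name.

```python
import signal


def can_install_handler(signum):
    """Return True if a Python-level handler can be registered for signum."""

    def noop(received_signum, frame):
        pass

    try:
        previous = signal.signal(signum, noop)
    except OSError:
        return False  # the OS refuses, as it does for SIGKILL
    # Put back whatever was there before; None means a non-Python handler.
    restore = previous if previous is not None else signal.SIG_DFL
    signal.signal(signum, restore)
    return True
```

On such a system `can_install_handler(signal.SIGTERM)` succeeds while `can_install_handler(signal.SIGKILL)` does not, which is why a process can choose how to react to SIGTERM but has no say when SIGKILL arrives.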
Extensive testing history indicates that SigInt works because, unlike SigTerm, the Python interpreter catches SigInt at a low level and propagates it through the exception handling hierarchy. While there are corner cases where this may be problematic, the Doist loop avoids most of those. Explicitly handling SigTerm requires using the signal library, which is then run by the interpreter instead of being in the interpreter. This introduces other corner cases. Often the solution to those corner cases looks like running two processes or threads, one thread

Phil ran into problems with SigTerm when using supervisord, which by default was shutting down with SigTerm. When he switched to SigInt, the problems went away. It may very well be true that the problem of doist.exit() not running to completion is the SigKill, not the SigTerm, and that simply registering a signal handler for SigTerm on the Doist main loop is sufficient without worrying about the corner cases. But it was simple enough for Phil to reconfigure supervisord to use SigInt to kill the Doist rather than solve the signal corner cases.

In the use case where you are terminating KERI, is it possible to use SigInt instead of SigTerm?
This is interesting, I did not know this about the signals library.
In the context of Docker and Kubernetes the Doist main loop did not respond to a SigTerm at all. SigTerm had no effect, which caused the process to remain alive and later be killed with a SigKill.
@SmithSamuelM You mentioned having a Doer in the main loop that checks for the SigTerm on every execution. I added this in my latest commit to @lenkan's fork's branch for this work. Let me know if I did it right or wrong. I used a This DoDoer does work as intended though.

    class GracefulShutdownDoer(doing.DoDoer):
        def __init__(self, doist, agency, **kwa):
            self.doist: Doist = doist
            self.agency = agency
            self.shutdown_flag = False
            # Register signal handler
            signal.signal(signal.SIGTERM, self.handle_sigterm)
            signal.signal(signal.SIGINT, self.handle_sigterm)
            logger.info("Registered signal handlers for SIGTERM and SIGINT")
            super().__init__(doers=[self.shutdown], **kwa)

        def handle_sigterm(self, signum, frame):
            logger.info(f"Received signal {signum}, initiating graceful shutdown.")
            self.shutdown_flag = True

        def shutdown_agents(self, agents):
            logger.info("Stopping %s agents", len(agents))
            for caid in agents:
                self.agency.shut(self.agency.agents[caid])

        @doing.doize()
        def shutdown(self, tymth, tock=0.0):
            self.wind(tymth)
            while not self.shutdown_flag:
                yield tock
            # Once shutdown_flag is set, exit the Doist loop
            self.shutdown_agents(list(self.agency.agents.keys()))
            logger.info(f"Shutting down main Doist loop")
            self.doist.exit()

@lenkan will you accept my PR to this draft PR? I did change a few more things in order to make unit testing easier for both SigTerm and SigInt. The DoDoer I added listens for both signals and then changes one piece of state, I also cleaned up the arguments to

I believe we can merge this PR once my PR to Daniel's fork is accepted.
I got it figured out. My problem was that I was not calling

    class GracefulShutdownDoer(doing.Doer):
        """
        Shuts all Agency agents down before exiting the Doist loop, performing a graceful shutdown.
        Sets up signal handlers in the Doer.enter lifecycle method and exits the Doist scheduler loop in Doer.exit
        Checks for the signals in the Doer.recur lifecycle method.
        """

        def __init__(self, doist, agency, **kwa):
            """
            Parameters:
                doist (Doist): The Doist running this Doer
                agency (Agency): The Agency containing Agent instances to be gracefully shut down
                kwa (dict): Additional keyword arguments for Doer initialization
            """
            self.doist: Doist = doist
            self.agency = agency
            self.shutdown_received = False
            super().__init__(**kwa)

        def handle_sigterm(self, signum, frame):
            """Handler function for SIGTERM"""
            logger.info(f"Received SIGTERM, initiating graceful shutdown.")
            self.shutdown_received = True

        def handle_sigint(self, signum, frame):
            """Handler function for SIGINT"""
            logger.info(f"Received SIGINT, initiating graceful shutdown.")
            self.shutdown_received = True

        def shutdown_agents(self, agents):
            """Helper function to shut down the agents."""
            logger.info("Stopping %s agents", len(agents))
            for caid in agents:
                self.agency.shut(self.agency.agents[caid])

        def enter(self):
            """
            Sets up signal handlers.
            Lifecycle method called once when the Doist running this Doer enters the context for this Doer.
            """
            # Register signal handler
            signal.signal(signal.SIGTERM, self.handle_sigterm)
            signal.signal(signal.SIGINT, self.handle_sigint)
            logger.info("Registered signal handlers for SIGTERM and SIGINT")

        def recur(self, tock=0.0):
            """Generator coroutine checking once per tock for shutdown flag"""
            # Checks once per tock if the shutdown flag has been set and if so initiates the shutdown process
            logger.info("Recurring graceful shutdown doer")
            while not self.shutdown_received:
                yield tock  # will iterate forever in here until shutdown flag set
            # Once shutdown_flag is set, exit the Doist loop
            self.shutdown_agents(list(self.agency.agents.keys()))
            return True  # Returns a "done" status
            # Causes the Doist scheduler to call .exit() lifecycle method below, killing the doist loop

        def exit(self):
            """
            Exits the Doist loop.
            Lifecycle method called once when the Doist running this Doer exits the context for this Doer.
            """
            logger.info(f"Shutting down main Doist loop")
            self.doist.exit()
This uses the lifecycle methods appropriately: it sets up the signal handlers in enter, marks the GracefulShutdownDoer as done when the shutdown flag is set, and tears down the Doist in the Doer.exit() method.
Since the Doist loop already listens for the KeyboardInterrupt exception, it effectively handles the Ctrl-C SigInt interrupt, which means the shutdown handler only has to handle the SigTerm signal.
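The flag-polling pattern used by the shutdown Doer can be exercised in isolation with the standard library. The sketch below is not hio code: `ShutdownFlag`, `watch_for_shutdown`, and `demo` are hypothetical names, a plain generator stands in for the recur coroutine, and the signal is sent by the process to itself with `signal.raise_signal` (Python 3.8+), assuming the code runs in the main thread.

```python
import signal


class ShutdownFlag:
    """Holds the single piece of state that the signal handler flips."""

    def __init__(self):
        self.received = False

    def handle(self, signum, frame):
        self.received = True  # do as little as possible inside the handler


def watch_for_shutdown(flag, shut_agents):
    """Generator that yields once per cycle until the flag is set."""
    while not flag.received:
        yield
    shut_agents()  # runs in the normal loop context, not in the handler
    return True  # "done" status, carried on StopIteration.value


def demo(max_cycles=100):
    log = []
    flag = ShutdownFlag()
    previous = signal.signal(signal.SIGTERM, flag.handle)
    try:
        task = watch_for_shutdown(flag, lambda: log.append("agents shut"))
        for cycle in range(1, max_cycles + 1):
            try:
                next(task)
            except StopIteration as stop:
                log.append(f"done={stop.value}")
                break
            if cycle == 3:
                signal.raise_signal(signal.SIGTERM)  # like `kill -TERM <pid>`
    finally:
        restore = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGTERM, restore)
    return log
```

Keeping the handler down to one assignment and doing the real work in the polling task means the agents are shut from ordinary loop code rather than from inside a signal handler.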
feat: shut down agents gracefully
LGTM
ci: add requests to tests
test: add test coverage for graceful shutdown
It would be better to just raise KeyboardInterrupt, or to call sys.exit(), rather than calling doist.exit().
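One way to read this suggestion is to have the SIGTERM handler raise `KeyboardInterrupt`, so that SIGTERM travels the same try-finally path as Ctrl-C. The sketch below shows that idea with the standard library only; `sigterm_as_keyboard_interrupt` and `run_with_cleanup` are hypothetical names, the loop is a stand-in rather than the Doist loop, and whether this behaves well inside hio is not verified here. It assumes Python 3.8+ and the main thread.

```python
import signal


def sigterm_as_keyboard_interrupt(signum, frame):
    """Translate SIGTERM into the exception that Ctrl-C would raise."""
    raise KeyboardInterrupt(f"signal {signum}")


def run_with_cleanup(max_cycles=1000):
    events = []
    previous = signal.signal(signal.SIGTERM, sigterm_as_keyboard_interrupt)
    try:
        for cycle in range(max_cycles):
            if cycle == 3:
                signal.raise_signal(signal.SIGTERM)  # like `kill -TERM <pid>`
        events.append("finished")
    except KeyboardInterrupt:
        events.append("interrupted")
    finally:
        events.append("cleanup")  # where exit-style teardown would happen
        restore = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGTERM, restore)
    return events
```

With this approach no reference to the scheduler is needed inside the handler, because the existing exception handling around the loop does the teardown.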
See #175 for rationale.
This PR implements handling of the SIGTERM signal, which runtimes such as Docker send to the process to allow it to shut down gracefully. I do not know `hio`, but the `Doist` class exposes an `exit` function that seems to do the trick.

Before this change, KERIA would exit with a non-zero exit code when receiving a `SIGTERM` signal.

Closes #175