Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: handle sigterm to allow for graceful shutdown #337

Merged
merged 11 commits into from
Jan 13, 2025

Conversation

lenkan
Copy link
Collaborator

@lenkan lenkan commented Dec 11, 2024

See #175 for rationale.

This PR implements handling of the SIGTERM, which is sent by runtimes such as docker to the process to allow it to shut down gracefully. I do not know hio, but the Doist class exposes an exit function that seems to do the trick.

Before this change, KERIA would exit with a non-zero exit when receiving a SIGTERM signal.

Closes #175

@lenkan lenkan requested a review from kentbull December 11, 2024 13:36
@@ -22,8 +17,7 @@ def main():
return

try:
doers = args.handler(args)
directing.runController(doers=doers, expire=0.0)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this call because the handlers themselves to not return any doers, they are running them themselves.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

directing.runController(doers=agency, expire=0.0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur, this makes sense.

Copy link

codecov bot commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 85.11905% with 25 lines in your changes missing coverage. Please review.

Project coverage is 93.77%. Comparing base (18d3ad7) to head (fa3cb0f).
Report is 57 commits behind head on main.

Files with missing lines Patch % Lines
src/keria/app/agenting.py 70.12% 23 Missing ⚠️
src/keria/app/serving.py 96.42% 1 Missing ⚠️
tests/app/test_agenting.py 98.41% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #337      +/-   ##
==========================================
+ Coverage   93.06%   93.77%   +0.71%     
==========================================
  Files          36       37       +1     
  Lines        7121     8288    +1167     
==========================================
+ Hits         6627     7772    +1145     
- Misses        494      516      +22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -111,8 +112,15 @@ def launch(args):
bootPassword=args.bootPassword,
bootUsername=args.bootUsername)

directing.runController(doers=agency, expire=0.0)
tock = 0.03125
doist = doing.Doist(limit=0.0, tock=tock, real=True)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The directing.runController did not return the doist, so there was no way to access it to be able to call exit. However, it was only two lines of code, so I thought we could just inline it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. This is good enough for now. We could consider making a PR to change KERIpy to have runController either return the Doist or have a parameter to pass a shutdown signal.

Copy link
Collaborator

@2byrds 2byrds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@lenkan lenkan marked this pull request as draft December 11, 2024 14:13
@lenkan
Copy link
Collaborator Author

lenkan commented Dec 11, 2024

Converting to draft because it does not shutdown the agents, only the agency. So it still needs some adjustments. Will pick up again soon.

Copy link
Contributor

@kentbull kentbull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -111,8 +112,15 @@ def launch(args):
bootPassword=args.bootPassword,
bootUsername=args.bootUsername)

directing.runController(doers=agency, expire=0.0)
tock = 0.03125
doist = doing.Doist(limit=0.0, tock=tock, real=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. This is good enough for now. We could consider making a PR to change KERIpy to have runController either return the Doist or have a parameter to pass a shutdown signal.

@@ -22,8 +17,7 @@ def main():
return

try:
doers = args.handler(args)
directing.runController(doers=doers, expire=0.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur, this makes sense.

@kentbull
Copy link
Contributor

I just tried this with the vLEI#85 and it worked well.

Yes, we need to send the shutdown signal to the agents, though that should be as simple as making a way to call .exit() on their Doer or DoDoer instances.

@kentbull
Copy link
Contributor

I added an initial implementation here that calls Agency.shut on each agent prior to shutting down the main Doist loop. I added a doer.close() to instruct each of the Doers and DoDoers to end their generator loop. If we were to add anything to the shutdown logic to round this out it would be some sort of check on the list of doers that their "done" attribute is "True" which would mean looping in the Agency.shut function while checking that the doers have completed.

@kentbull
Copy link
Contributor

kentbull commented Dec 22, 2024

Looks like the manual doer.close() is not necessary because the Doist does this in the Doist.exit() function in reverse-nested order, so I removed that code. For reference, here is the code from the Doist class

class Doist(tyming.Tymist):
    def exit(self, deeds=None):
        ...
        if deeds is None:
            deeds = self.deeds

        while(deeds):  # .close each remaining dog in deeds in reverse order
            dog, retime, doer = deeds.pop()  # pop it off in reverse (right side)
            if not dog:  # marker deed
                continue  # skip marker
            try:
                tock = dog.close()  # force GeneratorExit. Maybe log exit tock tyme  <---- THIS IS THE CLOSE
            except StopIteration:
                pass  # Hmm? Not supposed to happen!
            else:  # set done state
                try:
                    doer.done = False  # forced close
                except AttributeError:  # when using bound method for generator function
                    doer.__func__.done = False  # forced close

@kentbull
Copy link
Contributor

And, really, since this is fully cooperative multitasking I believe calling Doist.exit() is sufficient to close all of the agents as well. So calling agent.close() per agent should not be necessary because the Doist scheduler should do all of that on its own. I will confirm this with Sam and if true will remove that code from my PR to Daniel L's PR.

@SmithSamuelM
Copy link
Contributor

SmithSamuelM commented Dec 22, 2024

For and Doer, any resources that it opens, should be opened inside the Doer's "enter" method and then closed inside the Doer's "exit" method. The Doist, when shutting down i.e exiting will call "exit" on each of its Doers and DoDoers, and DoDoers will call exit on each of their Doers and DoDoers all that way down.

The Doist main loop has a try-finally that calls the Doist's exit method. Any break in the main loop will trigger the finally clause, thereby calling exit. This happens on a SigInt (KeyboardInterrupt or Cntl-C, a SystemExit, or any other exception. Some system exits like SigTerm will terminate the process prematurely before the exit method has time to complete. But SigInt will not and should exit cleanly. Therefore In your dev ops process manager you should terminate Hio using SigInt not SigTerm. Then you are reasonably assured that all your resources that are closed in Doer. exit() methods will be closed before the process running the Doist terminates.

@kentbull
Copy link
Contributor

kentbull commented Dec 23, 2024

@SmithSamuelM thank you for clarifying the default behavior of the Doist. That makes sense to me now. We need to move some of the startup logic for both the Agency and the individual Agent instances to the enter and close lifecycle functions in order to leverage the lifecycle context management of HIO. Then graceful shutdown will be as simple as calling Doist.exit().

I will go make that change.

I do support the addition of the SIGTERM handler since the default behavior of both Docker and Kubernetes is to send a SIGTERM, pause for a period (10s and 30s, respectively), and then send a SIGKILL if the process is still alive. SystemD and other process managers that manage daemon-like processes also use SIGTERM so it seems an appropriate choice to trigger the doist.exit() call like @lenkan has written.

@SmithSamuelM would it make sense to add a SIGTERM signal handler directly to the HIO library or would you prefer the HIO library only handle SIGINT and leave it up to application libraries to decide if they want to handle SIGTERM?

I imagine the broad adoption of SIGTERM by DevOps tools like Docker and K8s may make handling SIGTERM in HIO attractive. Let me know and I'll open a PR there if you would like.

@lenkan
Copy link
Collaborator Author

lenkan commented Dec 23, 2024

Some system exits like SigTerm will terminate the process prematurely before the exit method has time to complete.

Are you not mixing up SIGTERM with, for example SIGKILL? I would say that the SIGTERM signal is specifically meant to ask a process to terminate gracefully. Upon receiving it, the process will decide itself when it is time to exit. https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html

I do support the addition of the SIGTERM handler

Agree. I also vote for implementing graceful shutdown of keria using SIGTERM. I also believe this is the expected behaviour to the vast majority of users. I am not sure I would expect hio to do it for me though. I would rather explicitly tell hio to stop at the keria level, by calling Doist.exit, like you suggested.

@SmithSamuelM
Copy link
Contributor

Extensive testing history indicates that SigInt works because unlike SigTerm, the python interpreter at a low lever catches SigInt and propogates it through the exception handling hierarchy. While there are corner cases where this may be problematic, the Doist loop is avoiding most of those. To explicitly handle SegTerm requires using the signals library which is then being run by the interpreter instead of being in the interpreter. This introduces other corner cases. Often the solution to those corner cases looks like, running two processes or threads, one thread
/process that handles the signals and the other thread/process that runs the code.

Phil ran into problems with SigTerm when using supervisord which by default was shutting down with sigterm. When he switched to sigint then the problems went away.

It may very well be true that the problem of doist.exit() not running to completion is the SigKill not the SigTerm and simply calling registring signals for SigTerm on the Doist main loop may be sufficient without worrying about the corner cases. But it was simply enough for Phil to reconfigure SuperviserD to use SigInt to kill the Doist rather than solve the signals corner cases.

@SmithSamuelM
Copy link
Contributor

In the use case where you are terminating KERI is it possible to use SigInt instead of SigTerm?

@kentbull
Copy link
Contributor

signals library which is then being run by the interpreter instead of being in the interpreter.

This is interesting, I did not know this about the signals library.

It may very well be true that the problem of doist.exit() not running to completion is the SigKill not the SigTerm

In the context of Docker and Kubernetes the Doist main loop did not respond to a SigTerm at all. SigTerm had no effect, which caused the process to remain alive and later be killed with a SigKill.

registring signals for SigTerm on the Doist main loop may be sufficient without worrying about the corner cases

@SmithSamuelM You mentioned having a Doer in the main loop that checks for the SigTerm on every execution. I added this in my latest commit to @lenkan 's fork's branch for this work. Let me know if I did it right or wrong.

I used a DoDoer as follows. I tried to use a Doer yet when I tried to call Doist.exit() from within Doer.recur() then it never seemed to have any effect, which means I was using it wrong. I imagine this DoDoer could be simplified to a Doer, I just don't know how to do that.

This DoDoer does work as intended though.

class GracefulShutdownDoer(doing.DoDoer):
    def __init__(self, doist, agency, **kwa):
        self.doist: Doist = doist
        self.agency = agency
        self.shutdown_flag = False

        # Register signal handler
        signal.signal(signal.SIGTERM, self.handle_sigterm)
        signal.signal(signal.SIGINT, self.handle_sigterm)
        logger.info("Registered signal handlers for SIGTERM and SIGINT")

        super().__init__(doers=[self.shutdown], **kwa)

    def handle_sigterm(self, signum, frame):
        logger.info(f"Received signal {signum}, initiating graceful shutdown.")
        self.shutdown_flag = True

    def shutdown_agents(self, agents):
        logger.info("Stopping %s agents", len(agents))
        for caid in agents:
            self.agency.shut(self.agency.agents[caid])

    @doing.doize()
    def shutdown(self, tymth, tock=0.0):
        self.wind(tymth)
        while not self.shutdown_flag:
            yield tock

        # Once shutdown_flag is set, exit the Doist loop
        self.shutdown_agents(list(self.agency.agents.keys()))
        logger.info(f"Shutting down main Doist loop")
        self.doist.exit()

@lenkan will you accept my PR to this draft PR? I did change a few more things in order to make unit testing easier for both SigTerm and SigInt. The DoDoer I added listens for both signals and then changes one piece of state, self.shutdown_flag, that the Doer checks on each loop iteration.

I also cleaned up the arguments to agenting.setup which I renamed to agenting.setupDoers since it was getting unwieldy. I added a documented @dataclass for those arguments and moved all of the agency setup and runtime logic to the agenting module.

I believe we can merge this PR once my PR to Daniel's fork is accepted.

@kentbull
Copy link
Contributor

I got it figured out. My problem was that I was not calling Doist.exit() from the Doer.exit() function as shown below. Now I finally understand how to use the Doer context functions. In the updated code below, also submitted to Daniel's PR, .enter() sets up the signal handlers, .recur() is a generator waiting for the shutdown flag to be set, and .exit() is called when .recur() returns True which shuts down the Doist scheduler with Doist.exit().

class GracefulShutdownDoer(doing.Doer):
    """
    Shuts all Agency agents down before exiting the Doist loop, performing a graceful shutdown.
    Sets up signal handlers in the Doer.enter lifecycle method and exits the Doist scheduler loop in Doer.exit
    Checks for the signals in the Doer.recur lifecycle method.
    """
    def __init__(self, doist, agency, **kwa):
        """
        Parameters:
            doist (Doist): The Doist running this Doer
            agency (Agency): The Agency containing Agent instances to be gracefully shut down
            kwa (dict): Additional keyword arguments for Doer initialization
        """
        self.doist: Doist = doist
        self.agency = agency
        self.shutdown_received = False

        super().__init__(**kwa)

    def handle_sigterm(self, signum, frame):
        """Handler function for SIGTERM"""
        logger.info(f"Received SIGTERM, initiating graceful shutdown.")
        self.shutdown_received = True

    def handle_sigint(self, signum, frame):
        """Handler function for SIGINT"""
        logger.info(f"Received SIGINT, initiating graceful shutdown.")
        self.shutdown_received = True

    def shutdown_agents(self, agents):
        """Helper function to shut down the agents."""
        logger.info("Stopping %s agents", len(agents))
        for caid in agents:
            self.agency.shut(self.agency.agents[caid])

    def enter(self):
        """
        Sets up signal handlers.
        Lifecycle method called once when the Doist running this Doer enters the context for this Doer.
        """
        # Register signal handler
        signal.signal(signal.SIGTERM, self.handle_sigterm)
        signal.signal(signal.SIGINT, self.handle_sigint)
        logger.info("Registered signal handlers for SIGTERM and SIGINT")

    def recur(self, tock=0.0):
        """Generator coroutine checking once per tock for shutdown flag"""
        # Checks once per tock if the shutdown flag has been set and if so initiates the shutdown process 
        logger.info("Recurring graceful shutdown doer")
        while not self.shutdown_received:
            yield tock # will iterate forever in here until shutdown flag set

        # Once shutdown_flag is set, exit the Doist loop
        self.shutdown_agents(list(self.agency.agents.keys()))
        
        return True # Returns a "done" status
        # Causes the Doist scheduler to call .exit() lifecycle method below, killing the doist loop

    def exit(self):
        """
        Exits the Doist loop.
        Lifecycle method called once when the Doist running this Doer exits the context for this Doer.
        """
        logger.info(f"Shutting down main Doist loop")
        self.doist.exit()

This uses lifecycle methods appropriately to set up the signal handlers, mark the GracefulShutdownDoer as done when the shutdown flag is set, and tear down the Doist in the Doer.exit() method
@kentbull kentbull marked this pull request as ready for review January 11, 2025 21:13
@kentbull kentbull marked this pull request as draft January 11, 2025 21:13
kentbull and others added 2 commits January 12, 2025 21:03
Since the Doist loop listens for the KeyboardInterrupt exception then it effectively handles the Ctrl-C SigInt interrupt already which means the shutdown handler only has to handle the SigTerm signal.
feat: shut down agents gracefully
Copy link
Contributor

@kentbull kentbull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kentbull kentbull marked this pull request as ready for review January 13, 2025 15:23
@kentbull kentbull merged commit ca5cc7a into WebOfTrust:main Jan 13, 2025
5 checks passed
@SmithSamuelM
Copy link
Contributor

Better would be to just

raise KeyboardInterrupt

Or to call

sys.exit()

rather than calling doist.exit()
The latter is more brittle as it fixes the dependency on a what the Doist does in response to exiting conditions rather than simply triggering the exiting conditions as the former two examples do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

How to do graceful shutdown of KERIA
4 participants