This repository has been archived by the owner on Feb 21, 2024. It is now read-only.

Completed Job pods cause an error #33

Open
dlaidlaw opened this issue Aug 28, 2020 · 5 comments

@dlaidlaw

If the node being drained has any pods that are not ready, such as a pod created by a Job that has already completed, then the drain fails with an error.

The completed Job's pod is never removed from the evictable pods list, so the code loops forever (or until the lambda times out) waiting for it to be evicted.

The pod_is_evictable method should ignore any pods that are not in a ready state, as well as DaemonSet pods. An alternative would be to ignore the pod if its owner_reference is a Job.
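A minimal sketch of the check described above, assuming the shape of the kubernetes Python client's `V1Pod` objects (`SimpleNamespace` stands in for them here so the sketch is self-contained; this is not the project's actual code):

```python
from types import SimpleNamespace

CONTROLLER_KIND_DAEMON_SET = "DaemonSet"
CONTROLLER_KIND_JOB = "Job"

def pod_is_evictable(pod):
    """Return False for pods that a node drain should skip."""
    for ref in (pod.metadata.owner_references or []):
        if ref.kind == CONTROLLER_KIND_DAEMON_SET:
            # DaemonSet pods are rescheduled onto the same node; evicting
            # them just loops.
            return False
        if ref.kind == CONTROLLER_KIND_JOB:
            phase = pod.status.phase if pod.status else None
            if phase in ("Succeeded", "Failed"):
                # A completed Job pod can never become ready, so waiting
                # for its eviction hangs the drain.
                return False
    return True
```

With this check, a completed Job pod is dropped from the evictable list up front instead of blocking the drain loop.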

@svozza
Contributor

svozza commented Sep 1, 2020

Hi there, apologies for the delay in responding. It sounds like you have an idea of how to fix this; would you like to open a pull request with the changes you think are required?

@dlaidlaw
Author

dlaidlaw commented Sep 3, 2020

I would love to. Unfortunately, I am unable to do so in a reasonable time frame due to company policies.

@svozza
Contributor

svozza commented Sep 3, 2020

No problem. Is it really just as simple as looking for the value in that owner_reference field? If I get time next week I can give that a go.

@dlaidlaw
Author

dlaidlaw commented Sep 3, 2020

What I settled on was:

            if ref.kind == CONTROLLER_KIND_DAEMON_SET:
                logger.info("Skipping DaemonSet {}/{}".format(pod.metadata.namespace, pod.metadata.name))
                return False
            elif ref.kind == CONTROLLER_KIND_JOB:
                if pod.status and pod.status.phase:
                    if pod.status.phase == "Failed":
                        logger.info("Skipping failed Job pod {}/{}".format(pod.metadata.namespace, pod.metadata.name))
                        return False
                    elif pod.status.phase == "Succeeded":
                        logger.info("Skipping succeeded Job pod {}/{}".format(pod.metadata.namespace, pod.metadata.name))
                        return False

CONTROLLER_KIND_JOB was set to "Job".

The thought being that if the Job has failed or succeeded it can be ignored; otherwise it is still running and could be evicted. I am not sure whether everyone would want to evict running Job pods, however.
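One way to address that concern would be to make eviction of still-running Job pods opt-in. This is a hypothetical variant, not anything the project ships; the `evict_running_jobs` flag is an invented name for illustration:

```python
def job_pod_is_evictable(phase, evict_running_jobs=False):
    """Decide whether a Job-owned pod in the given phase should be evicted.

    Completed pods (Succeeded/Failed) are always skipped, since waiting for
    their eviction hangs the drain loop; running Job pods are evicted only
    when the operator explicitly opts in.
    """
    if phase in ("Succeeded", "Failed"):
        return False
    return evict_running_jobs
```

Operators who are happy to have a node drain interrupt in-flight Jobs would pass `evict_running_jobs=True`; the default leaves Jobs alone.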

@svozza
Contributor

svozza commented Sep 3, 2020

Yeah, I see what you mean about people not wanting to evict running jobs, leave it with me.
