If client disappears then jobs get stuck in running state #13

Open

vrih opened this issue May 15, 2021 · 6 comments

vrih commented May 15, 2021

If a client disappears without erroring (e.g. a forced eviction from Kubernetes), the jobs that the client owned become permanently stuck in the running state.

Ideally we would record a heartbeat from each worker in the database, and if multiple heartbeats are missed, all in-progress jobs owned by that worker would have their owner and started_at reset to NULL.
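
For illustration only, a rough sketch of such a reset with ActiveRecord (the WorkerHeartbeat model, heartbeat_at column and constants here are made up, not part of this gem):

HEARTBEAT_INTERVAL = 30 # seconds between heartbeats (assumed)
MISSED_HEARTBEATS  = 3  # misses before a worker is considered gone (assumed)

# Release jobs whose owner has not sent a heartbeat recently.
def release_stuck_jobs!
  cutoff = Time.zone.now - MISSED_HEARTBEATS * HEARTBEAT_INTERVAL
  dead   = WorkerHeartbeat.where("heartbeat_at < ?", cutoff).pluck(:owner)
  Job.where(owner: dead).where(finished_at: nil)
     .update_all(owner: nil, started_at: nil)
end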

dim (Member) commented May 16, 2021

  1. Is the worker shut down via TERM or KILL?
  2. Strange, I thought we do have a TTL after which a job becomes available for pickup again?

vrih (Author) commented May 16, 2021 via email

mxmCherry (Contributor) commented Sep 30, 2021

@dim please advise on this:

First, the neat way:

Have something like @vrih suggested: a heartbeat.

We can add something like jobs.lock_expires_at and update it in parallel while the job is being performed (maybe configurable globally as HEARTBEAT_INTERVAL or so).

And treat jobs with lock_expires_at < Time.zone.now (with some threshold based on HEARTBEAT_INTERVAL, of course) as not started (available for processing).

That will require another migration + library users applying it manually.
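
A rough sketch of what that could look like; the migration class, heartbeat! helper and interval value are illustrative assumptions, only lock_expires_at and HEARTBEAT_INTERVAL come from the description above:

class AddLockExpiresAtToJobs < ActiveRecord::Migration[6.1]
  def change
    add_column :jobs, :lock_expires_at, :datetime
  end
end

HEARTBEAT_INTERVAL = 30 # seconds, configurable globally (assumed value)

# Called by the worker in parallel with the job being performed.
def heartbeat!(job)
  job.update_column(:lock_expires_at, Time.zone.now + 2 * HEARTBEAT_INTERVAL)
end

# Anything with lock_expires_at in the past is available for pickup again:
# Job.where("lock_expires_at < ?", Time.zone.now)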

Second, the brutal way:

Have something like a global worker option MAX_JOB_EXEC_TIME (configurable, doing nothing unless set explicitly) and consider jobs available for processing if (started_at + MAX_JOB_EXEC_TIME) < Time.zone.now.

And have it default to something around a few hours, or even a day or so.

Quick to implement, but dirty (not a solution, more of a workaround).
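
A sketch of how small the brutal variant could be; the scope name is made up and MAX_JOB_EXEC_TIME is the hypothetical option described above:

MAX_JOB_EXEC_TIME = 24 * 60 * 60 # seconds; does nothing unless set explicitly

class Job < ActiveRecord::Base
  # (started_at + MAX_JOB_EXEC_TIME) < now, expressed with a Ruby-side cutoff.
  scope :exec_time_exceeded, -> {
    where("started_at < ? AND finished_at IS NULL", Time.zone.now - MAX_JOB_EXEC_TIME)
  }
end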

dim (Member) commented Sep 30, 2021

So yes, we could add a new column, but "update it in parallel with the job being performed" isn't really THAT easy: you need a background thread, which can error too, and you will end up with many headaches.

One other option is to use the existing expires_at. If a job has started_at < NOW AND expires_at IS NOT NULL AND expires_at < NOW AND finished_at IS NULL, we may want to reschedule it if job.reschedule?

mxmCherry (Contributor) commented Sep 30, 2021

Just a reminder: this will only work if the job implements self.ttl. For jobs without a ttl implemented, expires_at is always NULL.

This is not a big issue, and it's fine to have such a disclaimer, as it's quite an abnormal situation (jobs stuck in running).
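
For context, a rough sketch of a job that does define a ttl; the exact signature and units are guesses based only on this thread, not verified against the gem:

class CleanupJob < Job
  # With a ttl defined, expires_at is set for the job, so the
  # expires_at-based rescue discussed above can apply to it.
  def self.ttl
    2.hours.to_i # seconds (assumed unit)
  end
end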

I'm not really following what job.reschedule? should check.

Just in case, I'm aware of def reschedule(owner, now: Time.zone.now), but that's not a checker.

I'm thinking of something like this (pseudocode):

class Job # effectively a Concern, but doesn't matter
  ...
  # done with Arel in the real thing; naming is hard, but lock_expired indicates intent just fine
  scope :lock_expired, -> {
    where("started_at < ? AND expires_at IS NOT NULL AND expires_at < ? AND finished_at IS NULL", Time.zone.now, Time.zone.now)
  }
  scope :not_started, -> { where(...).or(lock_expired) }
end

And no job.reschedule? checker.

Then we don't even have to declare more scopes for the backend "interface"; this lock_expired will be hidden in the AR implementation.

Does this work, @dim?

dim (Member) commented Sep 30, 2021

I think so, let's try that, it's the simplest solution.
