If client disappears then jobs get stuck in running state #13

Open

vrih opened this issue May 15, 2021 · 6 comments

vrih commented May 15, 2021

If a client disappears without erroring (e.g. a forced eviction from Kubernetes), the jobs that the client owned become permanently stuck in the running state.

Ideally we would record a heartbeat from each worker in the database, and if multiple heartbeats are missed, all in-progress jobs owned by that worker would have their owner and started_at reset to NULL.
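
For illustration only, a rough sketch of such a reset with ActiveRecord (the WorkerHeartbeat model, heartbeat_at column and constants here are made up, not part of this gem):

HEARTBEAT_INTERVAL = 30 # seconds between heartbeats (assumed)
MISSED_HEARTBEATS  = 3  # misses before a worker is considered gone (assumed)

# Release jobs whose owner has not sent a heartbeat recently.
def release_stuck_jobs!
  cutoff = Time.zone.now - MISSED_HEARTBEATS * HEARTBEAT_INTERVAL
  dead   = WorkerHeartbeat.where("heartbeat_at < ?", cutoff).pluck(:owner)
  Job.where(owner: dead).where(finished_at: nil)
     .update_all(owner: nil, started_at: nil)
end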

dim (Member) commented May 16, 2021

  1. Is the worker shut down via TERM or KILL?
  2. Strange, I thought we do have a TTL after which a job becomes available for pickup again?

vrih (Author) commented May 16, 2021 via email

mxmCherry (Contributor) commented Sep 30, 2021

@dim please advise on this:

First, the neat way:

Have something like @vrih suggested: a heartbeat.

We can add something like jobs.lock_expires_at and update it in parallel while the job is being performed (maybe configurable globally as HEARTBEAT_INTERVAL or so).

And treat jobs with lock_expires_at < Time.zone.now (with some threshold based on HEARTBEAT_INTERVAL, of course) as not started (available for processing).

That will require another migration + library users applying it manually.
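
A rough sketch of what that could look like; the migration class, heartbeat! helper and interval value are illustrative assumptions, only lock_expires_at and HEARTBEAT_INTERVAL come from the description above:

class AddLockExpiresAtToJobs < ActiveRecord::Migration[6.1]
  def change
    add_column :jobs, :lock_expires_at, :datetime
  end
end

HEARTBEAT_INTERVAL = 30 # seconds, configurable globally (assumed value)

# Called by the worker in parallel with the job being performed.
def heartbeat!(job)
  job.update_column(:lock_expires_at, Time.zone.now + 2 * HEARTBEAT_INTERVAL)
end

# Anything with lock_expires_at in the past is available for pickup again:
# Job.where("lock_expires_at < ?", Time.zone.now)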

Second, the brutal way:

Have something like a global worker option MAX_JOB_EXEC_TIME (configurable, doing nothing unless set explicitly) and consider jobs available for processing if (started_at + MAX_JOB_EXEC_TIME) < Time.zone.now.

And have it default to something around a few hours, or even a day or so.

Quick to implement, but dirty (not a solution, more of a workaround).
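
A sketch of how small the brutal variant could be; the scope name is made up and MAX_JOB_EXEC_TIME is the hypothetical option described above:

MAX_JOB_EXEC_TIME = 24 * 60 * 60 # seconds; does nothing unless set explicitly

class Job < ActiveRecord::Base
  # (started_at + MAX_JOB_EXEC_TIME) < now, expressed with a Ruby-side cutoff.
  scope :exec_time_exceeded, -> {
    where("started_at < ? AND finished_at IS NULL", Time.zone.now - MAX_JOB_EXEC_TIME)
  }
end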

dim (Member) commented Sep 30, 2021

So yes, we could add a new column, but "update it in parallel with the job being performed" isn't really THAT easy: you need a background thread, which can error too, and you will end up with many headaches.

One other option is to use the existing expires_at. If a job has started_at < NOW AND expires_at IS NOT NULL AND expires_at < NOW AND finished_at IS NULL, we may want to reschedule it if job.reschedule?

mxmCherry (Contributor) commented Sep 30, 2021

Just a reminder: this will only work if the job implements self.ttl. For jobs without a ttl implemented, expires_at is always NULL.

This is not a big issue, and it's fine to have such a disclaimer, as it's quite an abnormal situation (jobs stuck in running).
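
For context, a rough sketch of a job that does define a ttl; the exact signature and units are guesses based only on this thread, not verified against the gem:

class CleanupJob < Job
  # With a ttl defined, expires_at is set for the job, so the
  # expires_at-based rescue discussed above can apply to it.
  def self.ttl
    2.hours.to_i # seconds (assumed unit)
  end
end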

I'm not really following what job.reschedule? should check.

Just in case, I'm aware of def reschedule(owner, now: Time.zone.now), but that's not a checker.

I'm thinking of something like this (pseudocode):

class Job # effectively a Concern, but doesn't matter
  ...
  # done with Arel in the real thing; naming is hard, but lock_expired indicates intent just fine
  scope :lock_expired, -> {
    where("started_at < ? AND expires_at IS NOT NULL AND expires_at < ? AND finished_at IS NULL", Time.zone.now, Time.zone.now)
  }
  scope :not_started, -> { where(...).or(lock_expired) }
end

And no job.reschedule? checker.

Then we don't even have to declare more scopes for the backend "interface"; this lock_expired will be hidden in the AR implementation.

Does this work, @dim?

dim (Member) commented Sep 30, 2021

I think so, let's try that, it's the simplest solution.
