If client disappears then jobs get stuck in running state #13
1. The deployment rotated fast enough that I assume it was a TERM and not a KILL on timeout.
2. There's only expiration, beyond which a job won't be retried.
16 May 2021 17:21:22 Dimitrij Denissenko ***@***.***>:
1. > Is the worker shutdown via TERM or KILL?
2. > Strange, I thought we do have a TTL after which a job becomes
available for pickup again?
@dim please advise on this:

First, the neat way: have something like what @vrih suggested, a heartbeat. We can add something like
And treat jobs with
That will require another migration, plus lib users doing the same manually.

Second, the brutal way: have something like a global worker option
And set it to around a few hours, or even 1d or so. Quick to implement, but dirty (more of a workaround than a solution).
So yes, we could add a new column, but "update it in parallel with job being performed" isn't really THAT easy: you need a background thread, which can error too, and you will end up with many headaches. One other option is to use the existing
Just reminding: this will work only if job implements
This is not a big issue, and it's fine to have such a disclaimer, as jobs getting stuck in running is quite an abnormal thing.

I'm not really following what should
Just in case, I'm aware of

I'm thinking of this, in pseudocode:

    class Job # effectively a Concern, but doesn't matter
      ...
      # done with arel; naming is hard, but lock_expired indicates intent just fine
      scope :lock_expired, -> { where("started_at < NOW() AND expires_at IS NOT NULL AND expires_at < NOW() AND finished_at IS NULL") }
      scope :not_started, -> { where(...).or(lock_expired) }
    end

And no
Then we don't even have to declare more scopes for the backend "interface", this

Does this work, @dim?
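To make the intended pickup rule concrete, here is a minimal plain-Ruby sketch of the same predicate, with no database or ActiveRecord involved. The field names `started_at`, `expires_at` and `finished_at` come from the pseudocode above; the `StuckCheckJob` struct and the `lock_expired?`, `available?` and `available_jobs` helpers are hypothetical names used only for illustration, and "not started" is assumed to mean `started_at` is null.

```ruby
# In-memory stand-in for a job row; hypothetical, for illustration only.
StuckCheckJob = Struct.new(:id, :started_at, :expires_at, :finished_at, keyword_init: true) do
  # Started, never finished, and past its expiration: the owning worker
  # presumably disappeared without reporting an error.
  def lock_expired?(now)
    !started_at.nil? && finished_at.nil? && !expires_at.nil? && expires_at < now
  end

  # A job can be picked up if it was never started (assumed: started_at is nil)
  # or if its lock has expired.
  def available?(now)
    started_at.nil? || lock_expired?(now)
  end
end

# Returns the jobs a worker would be allowed to pick up at time `now`.
def available_jobs(jobs, now: Time.now)
  jobs.select { |job| job.available?(now) }
end

now = Time.utc(2021, 5, 16, 18, 0, 0)
jobs = [
  StuckCheckJob.new(id: 1),                                                # never started
  StuckCheckJob.new(id: 2, started_at: now - 600, expires_at: now + 600),  # running, not yet expired
  StuckCheckJob.new(id: 3, started_at: now - 7200, expires_at: now - 3600), # stuck in running
  StuckCheckJob.new(id: 4, started_at: now - 7200, expires_at: now - 3600,
                    finished_at: now - 5000),                              # finished
  StuckCheckJob.new(id: 5, started_at: now - 7200)                         # running, no expiration
]
puts available_jobs(jobs, now: now).map(&:id).inspect # prints [1, 3]
```

Note that job 5 is never reclaimed under this rule, because the condition requires a non-null `expires_at`.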
I think so, let's try that, it's the simplest solution.
If a client disappears without erroring, e.g. a forced eviction from Kubernetes, the jobs that the client owned become permanently stuck in the running state.
Ideally we would record a heartbeat from each worker in the database, and if multiple heartbeats are missed, all of that worker's in-progress jobs would have their owner and started-at fields reset to null.
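A rough sketch of that heartbeat idea, again in plain Ruby with in-memory structs rather than real tables. Everything named here is an assumption for illustration: the `WorkerBeat` and `OwnedJob` structs, the `owner` field, the 30-second interval, and the threshold of three missed beats are not taken from the library.

```ruby
HEARTBEAT_INTERVAL   = 30 # seconds between heartbeats (hypothetical value)
MISSED_BEATS_ALLOWED = 3  # beats a worker may miss before it is presumed dead

# Hypothetical stand-ins for a heartbeat row and a job row.
WorkerBeat = Struct.new(:worker_id, :last_seen_at, keyword_init: true)
OwnedJob   = Struct.new(:id, :owner, :started_at, keyword_init: true)

# IDs of workers whose last heartbeat is older than the allowed window.
def dead_worker_ids(beats, now)
  cutoff = now - HEARTBEAT_INTERVAL * MISSED_BEATS_ALLOWED
  beats.select { |beat| beat.last_seen_at < cutoff }.map(&:worker_id)
end

# Resets owner and started_at on every job owned by a presumed-dead worker,
# making the job eligible for pickup again. Returns the released jobs.
def release_orphaned_jobs(jobs, beats, now: Time.now)
  dead = dead_worker_ids(beats, now)
  jobs.select { |job| dead.include?(job.owner) }.each do |job|
    job.owner = nil
    job.started_at = nil
  end
end

now = Time.utc(2021, 5, 16, 18, 0, 0)
beats = [
  WorkerBeat.new(worker_id: "w1", last_seen_at: now - 10),  # alive
  WorkerBeat.new(worker_id: "w2", last_seen_at: now - 300)  # missed more than 3 beats
]
jobs = [
  OwnedJob.new(id: 1, owner: "w1", started_at: now - 60),
  OwnedJob.new(id: 2, owner: "w2", started_at: now - 600)
]
released = release_orphaned_jobs(jobs, beats, now: now)
puts released.map(&:id).inspect # prints [2]
```

In a real backend the reset would need to be a single atomic update, and the sketch ignores jobs whose owner has no heartbeat record at all.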