max_client_disconnect setting causes allocation states to hang in a non-terminal state when a client disconnects/reconnects #19729
Comments
@louievandyke you've provided a lot of extraneous detail here, like garbage collection behavior (which seems correct to me given the allocation states). But it looks like the core issue you're reporting is simply that these allocations never transition to a terminal state. I know that @Juanadelacuesta fixed a bug in this behavior in #18701 (which appears to be missing a changelog entry). It would be helpful to know if we're seeing the evaluations we expect.
@tgross: Regarding "garbage collection behavior (which seems correct to me given the allocation states)": the issue is that when running `nomad job status example`, the allocation Desired/Status shows run/failed and never moves to stop/complete. Can you please tell us what we should do to mark the allocation Desired as stop? `nomad system gc` will not clean the allocations because they are not in a terminal state and do not transition to terminal.
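A minimal sketch of the inspection steps described in this comment, assuming the job is named `example` as in the report (the allocation ID below is a hypothetical placeholder):

```sh
# List the job's allocations; in the reported state the old allocation
# stays at Desired=run / Status=failed and never becomes terminal.
nomad job status example

# Inspect the stuck allocation directly (hypothetical ID).
nomad alloc status 4f3e1c2a

# Force a garbage-collection pass; this only collects allocations that
# are already terminal, so the stuck allocation is left behind.
nomad system gc
```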
@vaiaro the bug is that this isn't happening automatically as we'd expect. As a workaround, you could …
@vaiaro I've been looking at this and I want to make sure I'm following the correct process:
Am I missing anything?
@tgross I've done some more testing and have reproduced this now in a specific way. The issue is not that the client node encounters heartbeat failure; rather, it can be triggered without any client RPC failure at all. I've been able to reproduce this by killing the Docker process on the node where the allocation for the below jobspec is running.
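A sketch of the kill step, assuming Docker runs under systemd on the client node; the exact command will vary by environment:

```sh
# On the client node hosting the allocation: kill the Docker daemon
# so the running task fails and Nomad begins restarting it.
sudo systemctl kill --signal=SIGKILL docker

# Without systemd, killing dockerd directly has the same effect:
sudo pkill -9 dockerd
```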
Screenshots of allocation state behavior on 1.4.14+ent:
- Nomad starts to restart the task
- I kill docker again...
- A replacement Allocation ID is now spinning up and the old Allocation ID is in desired.stop and status.failed, i.e. terminal
I ran through the same process on Nomad v1.5.10+ent, and the problem is that the old allocation will not go terminal until the max_client_disconnect duration has completed… odd, because it has nothing to do with a client RPC failure. Screenshots of reproduction on 1.5.10+ent:
- Nomad starts to restart the task
- I kill it again
- A replacement Allocation ID is spun up and the old Allocation ID is in desired.run and status.failed, i.e. NOT terminal, and will stay that way for the duration of max_client_disconnect
[gh-19729] Fix logic for updating terminal allocs on clients with max client disconnect (#20181): Only ignore allocs on terminal states that are updated. Co-authored-by: Tim Gross <[email protected]>
[gh-19729] Fix logic for updating terminal allocs on clients with max client disconnect (#20181) (#20253): Only ignore allocs on terminal states that are updated. Co-authored-by: Tim Gross <[email protected]>
Backport of [gh-19729] Fix logic for updating terminal allocs on clients with max client disconnect into release/1.6.x (#20249): Only ignore allocs on terminal states that are updated; fix: update test. Co-authored-by: Juana De La Cuesta <[email protected]>, Tim Gross <[email protected]>, Juanadelacuesta <[email protected]>
Fix merged and backported to 1.7, 1.6 and 1.5.
Backport of [gh-19729] Fix logic for updating terminal allocs on clients with max client disconnect into release/1.7.x (#20248): Only ignore allocs on terminal states that are updated; fix: update test. Co-authored-by: Juana De La Cuesta <[email protected]>, Tim Gross <[email protected]>, Juanadelacuesta <[email protected]>
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Output from `nomad version`: see below for tests on multiple versions.
Operating system and Environment details
Ubuntu Linux
Issue
We are running a job in which the allocation Desired is run and Status is failed. `nomad system gc` will not clean the allocations because they are not in a terminal state and do not transition to terminal.
Reproduction steps
Create one Nomad server and one client.
I have tested the functionality with an example job (a minimal sketch of the jobspec is shown after this list).
With 1.4.14, I tried restart attempts of 3 or 0, and after some time the allocation Desired becomes stop.
With 1.4.14, I used restart attempts = 3 and max_client_disconnect = "1m", then stopped the client agent. After 1 minute, the alloc Desired/Status becomes (stop, lost) on the server. If I start the client agent again, the allocation tries to run on the same client node.
With 1.5.10, I used restart attempts = 3, and after some time the allocation Desired/Status shows run, pending. Even with max_client_disconnect = "1m", I see only run/pending. nomad system gc is not able to clean the failed allocation because it never moves to a terminal state.
With 1.7.2, I used restart attempts = 3, and after some time the allocation Desired/Status does not change to stop, even with max_client_disconnect = "1m".
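The original job file is not included in this report, so here is a minimal sketch consistent with the settings described above; the job name, Docker image, and restart timings are illustrative assumptions, while restart attempts = 3 and max_client_disconnect = "1m" come from the report:

```sh
cat > example.nomad.hcl <<'EOF'
job "example" {
  datacenters = ["dc1"]

  group "cache" {
    count = 1

    # Keep allocations on a disconnected client for up to 1 minute
    # before the server replaces them.
    max_client_disconnect = "1m"

    restart {
      attempts = 3
      interval = "5m"
      delay    = "15s"
      mode     = "fail"
    }

    task "redis" {
      driver = "docker"
      config {
        image = "redis:7"
      }
    }
  }
}
EOF

nomad job run example.nomad.hcl
```

After the job is running, stop the Nomad client agent (or kill the Docker daemon as described in the comments above) and watch the old allocation's Desired/Status with `nomad job status example`.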
Expected Result
After some time, the allocation Desired becomes stop.
Actual Result
The status for allocations does not change even when the count is fulfilled.
Job file (if appropriate)
see above
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)