fix: handling non reschedule disconnecting and reconnecting allocs #18701
Conversation
The code changes in this PR seem to make sense @Juanadelacuesta, but unfortunately it looks like it fixes this bug while introducing a new one (or maybe I'm discovering a related bug here?).
I built Nomad from this branch for testing. I deployed a 2-node cluster and a job with the following relevant bits of the jobspec:
job "httpd" {
reschedule {
delay = "2m"
delay_function = "exponential"
max_delay = "1h"
unlimited = true
}
group "web" {
max_client_disconnect = "5m"
# ... etc
}
}
full job spec here
job "httpd" {
reschedule {
delay = "2m"
delay_function = "exponential"
max_delay = "1h"
unlimited = true
}
group "web" {
max_client_disconnect = "5m"
network {
mode = "bridge"
port "www" {
to = 8001
}
}
task "http" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-vv", "-f", "-p", "8001", "-h", "/local"]
ports = ["www"]
}
resources {
cpu = 100
memory = 100
}
}
}
}
First, I waited for the deployment to finish. After stopping the node the alloc was on and waiting for the missed heartbeat, the alloc is marked unknown as expected, but it gets immediately rescheduled. It doesn't wait for the delay window:
$ nomad eval list
ID Priority Triggered By Job ID Namespace Node ID Status Placement Failures
90de133f 50 max-disconnect-timeout httpd default <none> pending false
f0b4cd16 50 node-update httpd default d55ab857 complete false
768e0dbe 50 deployment-watcher httpd default <none> complete false
ef752be3 50 job-register httpd default <none> complete false
$ nomad job status httpd
...
Allocations
ID Node ID Task Group Version Desired Status Created Modified
7baca89e 8910b70c web 0 run running 33s ago 21s ago
7abbdab0 d55ab857 web 0 run unknown 1m48s ago 33s ago
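For reference, the delay window I expected is the usual exponential progression before any replacement is placed. A rough sketch of that progression under the policy values above (nextRescheduleDelay is a made-up helper for illustration, not scheduler code):

package main

import (
    "fmt"
    "time"
)

// nextRescheduleDelay illustrates an exponential reschedule policy:
// start at the base delay and double it on each attempt, capped at
// max_delay. With delay = "2m" and max_delay = "1h" the expected
// waits are 2m, 4m, 8m, ... up to 1h.
func nextRescheduleDelay(base, maxDelay time.Duration, attempt int) time.Duration {
    d := base
    for i := 0; i < attempt; i++ {
        d *= 2
        if d >= maxDelay {
            return maxDelay
        }
    }
    return d
}

func main() {
    for attempt := 0; attempt < 6; attempt++ {
        fmt.Printf("attempt %d: wait %s\n", attempt, nextRescheduleDelay(2*time.Minute, time.Hour, attempt))
    }
}

So the first replacement should only appear roughly 2 minutes after the alloc goes unknown, not immediately.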
If I start over and I disable reschedules entirely, then the bug fix works as intended.
job "httpd" {
reschedule {
delay = "2m"
delay_function = "exponential"
max_delay = "1h"
unlimited = false
}
group "web" {
max_client_disconnect = "5m"
# ... etc
}
}
Then it's not immediately rescheduled and I end up seeing the following evals in my list:
$ nomad eval list
ID Priority Triggered By Job ID Namespace Node ID Status Placement Failures
38554c98 50 max-disconnect-timeout httpd default <none> pending false
e8976a61 50 node-update httpd default d55ab857 complete false
89feb842 50 deployment-watcher httpd default <none> complete false
5a25579f 50 job-register httpd default <none> complete false
…ding them to the updates
…ve been previously replaced can be replaced again
…y count untainted and reconnected using the adequate alloc replacement logic
… to be rescheduled and set as untainted the ones that don't
…locs, to avoid replacing them by an empty follow-up eval
Nice work on this @Juanadelacuesta!
Code LGTM, and I've pulled it down for testing manually and everything checks out.
…18701) This PR fixes a long-lived bug where disconnecting allocations were never rescheduled according to their policy, but only because the group count was short. The default reschedule time for services and batch jobs is 30 and 5 seconds respectively; in order to properly reschedule disconnected allocs, they need to be able to be rescheduled for later, a path that was not handled before. This PR introduces a way to handle such allocations.
…18701) (#18899)
…18701) (#18898)
This PR fixes "Max_client_disconnect ignores no-reschedule policy".
The bug had 2 root causes:
1. nomad/scheduler/reconcile.go, line 461 in cecd9b0: any disconnecting allocation that should not be rescheduled right away was ignored for the rest of the reconcile process and would always end up with a new placement, because it was not taken into consideration for the deployment's final count.
2. Disconnecting allocations were never set to be explicitly replaced because, according to the shouldFilter logic, they are untainted and are ignored on line 396 (L396), so they never make it to the eligibleNow selection. Also, in updateByReschedulable, disconnecting allocs were never set to reschedule now. These allocations would once again end up being replaced because the deployment count would be one short, not because they should be replaced, and most definitely not taking their reschedule policy into account.
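To make that fall-through concrete, here is a deliberately simplified sketch of the old behavior. This is not the actual reconciler code; allocState and classify are made-up names used only to illustrate how a disconnecting alloc never had its reschedule policy consulted:

package main

import "fmt"

// Deliberately simplified model of the buggy classification described
// above; the names are invented for illustration only.
type allocState struct {
    ID            string
    Failed        bool
    Disconnecting bool
}

// classify mimics the old behavior: disconnecting allocs are treated
// like ordinary untainted allocs. Note that Disconnecting is never
// consulted here, which is exactly the problem.
func classify(allocs []allocState) (untainted, rescheduleNow []allocState) {
    for _, a := range allocs {
        if a.Failed {
            rescheduleNow = append(rescheduleNow, a)
            continue
        }
        untainted = append(untainted, a) // disconnecting allocs fall through to here
    }
    return untainted, rescheduleNow
}

func main() {
    untainted, now := classify([]allocState{{ID: "example", Disconnecting: true}})
    fmt.Printf("untainted=%d rescheduleNow=%d\n", len(untainted), len(now))
    // The disconnecting alloc is then ignored for the rest of the
    // reconcile pass and not counted toward the deployment's final
    // count, so a placement is made just to fill the gap, without the
    // reschedule policy ever being applied.
}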
Once the previous problem was fixed and the correct reschedule policy was being taken into account, a new bug became visible: disconnecting allocations don't have a fail time, so to calculate the next reschedule time, the alloc reconciler's now was used. The minimum delay for batch jobs is 5 seconds and 30 seconds for services, so following the logic in the code, the next reschedule time always ends up being at least 5 seconds from now, and the disconnecting allocations were always set to rescheduleLater. But for disconnecting allocations, the rescheduleLater group was ignored, as shown previously on line 461 of the reconciler.
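A minimal sketch of why that happens, assuming a simplified decision function (rescheduleNow and its arguments are illustrative names, not Nomad's):

package main

import (
    "fmt"
    "time"
)

// rescheduleNow shows why basing the next reschedule time on the
// reconciler's "now" always pushes a disconnecting alloc into the
// reschedule-later bucket.
func rescheduleNow(baseTime, now time.Time, minDelay time.Duration) (bool, time.Time) {
    next := baseTime.Add(minDelay)
    return !next.After(now), next
}

func main() {
    now := time.Now()

    // A failed alloc has a real fail time in the past, so its next
    // reschedule time can already be due.
    due, _ := rescheduleNow(now.Add(-time.Minute), now, 30*time.Second)
    fmt.Println("failed alloc due now:", due) // true

    // A disconnecting alloc has no fail time, so "now" is used as the
    // base: the next reschedule time is always at least minDelay in
    // the future, so the alloc always lands in rescheduleLater, which
    // the old code ignored for disconnecting allocs.
    due, next := rescheduleNow(now, now, 30*time.Second)
    fmt.Println("disconnecting alloc due now:", due, "next in:", next.Sub(now))
}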
The function filterByRescheduleable was modified to correctly process disconnecting allocations, and the disconnecting untainted and reconnect-later allocations were added to the result to avoid unnecessary placements and to correctly replace the disconnecting allocs using the reschedule policy.
Once the previous change was done, new evaluations were created with the new reschedule time to replace the disconnecting allocs, but by the time the evaluation was executed, the allocs no longer qualified as disconnecting, so a new restriction needed to be added in the updateByReschedulable function for allocs that were not disconnecting and had not failed yet: the unknown allocs.
In order to avoid adding replacements for these allocs every time the reconciler runs, a validation using the followupEvalID was added.
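A rough sketch of the kind of check this implies (unknownAlloc and needsReplacement are hypothetical names; the exact condition in the scheduler may differ):

package main

import "fmt"

// Only schedule a replacement for an unknown alloc when the evaluation
// being processed is the follow-up eval that was created for it;
// otherwise every reconciler pass would add yet another replacement.
type unknownAlloc struct {
    ID             string
    FollowupEvalID string
}

func needsReplacement(a unknownAlloc, currentEvalID string) bool {
    return a.FollowupEvalID != "" && a.FollowupEvalID == currentEvalID
}

func main() {
    a := unknownAlloc{ID: "example", FollowupEvalID: "eval-123"}
    fmt.Println(needsReplacement(a, "eval-123")) // true: its own follow-up eval
    fmt.Println(needsReplacement(a, "eval-999")) // false: unrelated eval, skip
}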
When taking into account the reschedule policy, every disconnecting alloc generates 2 new evaluations: one at the next reschedule time, in order to place a replacement, and a second one to set the alloc as lost once max_client_disconnect times out.
The way the code was written, the disconnecting alloc would get updated with the evaluation ID of the second generated evaluation, making the nextEvalID useless for verifying whether the unknown alloc being rescheduled was in fact a replacement for an old one. A change to overwrite the nextEvalID with the ID of the evaluation corresponding to the replacement was put in place. This change also made it a lot easier to determine, on reconnect, whether an alloc was a replacement for an older one.
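To illustrate the bookkeeping change under assumed names (followupEval and evalIDToRecord are not the scheduler's types; this only shows which of the two IDs is worth recording on the alloc):

package main

import "fmt"

// Each disconnecting alloc gets two follow-up evaluations: one at the
// next reschedule time (to place a replacement) and one at the
// max_client_disconnect timeout (to mark the alloc lost). Recording
// the timeout eval's ID on the alloc made it impossible to tell later
// whether an unknown alloc had a pending replacement, so the fix
// records the replacement eval's ID instead.
type followupEval struct {
    ID     string
    Reason string // "replacement" or "lost-timeout"
}

func evalIDToRecord(evals []followupEval) string {
    for _, e := range evals {
        if e.Reason == "replacement" {
            return e.ID // this ID lets the reconciler match the replacement later
        }
    }
    if len(evals) > 0 {
        return evals[len(evals)-1].ID // fall back to the last follow-up eval
    }
    return ""
}

func main() {
    evals := []followupEval{
        {ID: "eval-replace", Reason: "replacement"},
        {ID: "eval-lost", Reason: "lost-timeout"},
    }
    fmt.Println("record on alloc:", evalIDToRecord(evals))
}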