Backport of fix: handling non reschedule disconnecting and reconnecting allocs into release/1.5.x #18885
Backport
This PR is auto-generated from #18701 to be assessed for backporting due to the inclusion of the label backport/1.5.x.
🚨 The person who merged in the original PR is:
@Juanadelacuesta
This person should manually cherry-pick the original PR into a new backport PR,
and close this one when the manual backport PR is merged in.
The below text is copied from the body of the original PR.
This PR fixes "Max_client_disconnect ignores no-reschedule policy".
The bug had two root causes, sketched in simplified form below:
`nomad/scheduler/reconcile.go`, line 461 in cecd9b0:
Any disconnecting allocation that should not be rescheduled now was ignored for the rest of the reconcile process and would always end up as a new placement, because it was not taken into consideration for the deployment's final count and was never set to be explicitly replaced: according to the `shouldFilter` logic, these allocations are `untainted` and are ignored on line 396, so they never make it to the `eligibleNow` selection.
Also, in `updateByReschedulable`, disconnecting allocs were never set to reschedule now. These allocations would once again end up being replaced because the deployment count would be one short, not because they should be replaced, and most definitely not taking their reschedule policy into account.
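Both root causes can be read off a heavily simplified sketch like the one below. The types, function names, and signatures are illustrative assumptions, not Nomad's actual `shouldFilter` / `updateByReschedulable` code: the disconnecting alloc is dropped before the eligibility check ever runs, and even if it got there, only failed allocs could be marked for an immediate reschedule, so the replacement only ever appears because the deployment count comes up short.

```go
package main

import "fmt"

// Illustrative stand-ins for the reconciler's inputs; not Nomad's real types.
type alloc struct {
	ID              string
	Disconnecting   bool
	Failed          bool
	PolicyAllowsNow bool // what the job's reschedule policy would permit
}

// buggyFilter mirrors root cause 1: a disconnecting alloc is treated as
// untainted and ignored, so it is neither counted toward the deployment nor
// passed to the reschedule decision.
func buggyFilter(allocs []alloc) (counted, rescheduleNow, ignored []alloc) {
	for _, a := range allocs {
		if a.Disconnecting {
			ignored = append(ignored, a) // never reaches buggyUpdateByReschedulable
			continue
		}
		if buggyUpdateByReschedulable(a) {
			rescheduleNow = append(rescheduleNow, a)
			continue
		}
		counted = append(counted, a)
	}
	return counted, rescheduleNow, ignored
}

// buggyUpdateByReschedulable mirrors root cause 2: only failed allocs could
// ever be marked for an immediate reschedule, never disconnecting ones.
func buggyUpdateByReschedulable(a alloc) bool {
	return a.Failed && a.PolicyAllowsNow
}

func main() {
	desired := 1
	// A disconnecting alloc whose policy forbids rescheduling.
	counted, now, ignored := buggyFilter([]alloc{{ID: "a1", Disconnecting: true}})

	// The deployment count comes up one short, so a replacement is placed
	// even though the reschedule policy was never consulted.
	missing := desired - len(counted) - len(now)
	fmt.Printf("missing=%d ignored=%d\n", missing, len(ignored)) // missing=1 ignored=1
}
```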
Once the previous problem was fixed and the correct reschedule policy was being taken into account, a new bug became visible: disconnecting allocations don't have a fail time, so the alloc reconciler's `now` was used to calculate the next reschedule time. The minimum delay is 5 seconds for batch jobs and 30 seconds for services, so following the logic in the code, the next reschedule time always ends up being at least 5 seconds from now, and the disconnecting allocations were always set to `rescheduleLater`. But for disconnecting allocations, the `rescheduleLater` group was ignored, as shown previously on line 461 of the reconciler.
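The timing effect can be reproduced with a small, self-contained calculation. The helper below is an assumption about how the anchoring works, not Nomad's actual reschedule-time code; only the 5-second and 30-second minimum delays come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// nextRescheduleTime is an illustrative helper: it anchors the reschedule
// delay on the alloc's fail time when one exists and on "now" otherwise,
// which is the situation described for disconnecting allocs, since they
// have not failed.
func nextRescheduleTime(failTime, now time.Time, minDelay time.Duration) time.Time {
	anchor := failTime
	if failTime.IsZero() {
		anchor = now // disconnecting allocs: no fail time recorded
	}
	return anchor.Add(minDelay)
}

func main() {
	now := time.Now()

	// Minimum delays quoted above: 5s for batch jobs, 30s for services.
	next := nextRescheduleTime(time.Time{}, now, 5*time.Second)

	// The next reschedule time is always at least 5s in the future, so the
	// alloc lands in the "reschedule later" bucket, which before the fix was
	// ignored for disconnecting allocations.
	fmt.Println("reschedule later:", next.After(now)) // true
}
```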
The function `filterByRescheduleable` was modified to correctly process disconnecting allocations, and the disconnecting untainted and reconnect-later allocations were added to the result, to avoid unnecessary placements and to correctly replace the disconnecting allocs using the reschedule policy.
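A rough sketch of the shape of such a fix, under the same simplifying assumptions as before (the real change is in `filterByRescheduleable` and works on Nomad's own allocation sets): disconnecting allocations are routed through the reschedule decision, and those that stay untainted or are only due later remain part of the returned result, so no spurious placement is created for them.

```go
package sketch

// Hypothetical, simplified types; not Nomad's scheduler structures.
type alloc struct {
	ID            string
	Disconnecting bool
}

type buckets struct {
	untainted, rescheduleNow, rescheduleLater []alloc
}

// filterSketch routes disconnecting allocs through the reschedule decision
// instead of skipping them, and keeps the untainted / later ones in the
// result so the reconciler counts them toward the deployment.
func filterSketch(allocs []alloc, allowNow, allowLater func(alloc) bool) buckets {
	var b buckets
	for _, a := range allocs {
		if !a.Disconnecting {
			b.untainted = append(b.untainted, a)
			continue
		}
		switch {
		case allowNow(a):
			b.rescheduleNow = append(b.rescheduleNow, a)
		case allowLater(a):
			// Counted as untainted for the deployment and also queued for a
			// replacement at the computed reschedule time.
			b.untainted = append(b.untainted, a)
			b.rescheduleLater = append(b.rescheduleLater, a)
		default:
			// No-reschedule policy: keep the alloc in the result so no
			// replacement is placed for it.
			b.untainted = append(b.untainted, a)
		}
	}
	return b
}
```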
Once the previous change was done, new evaluations were created with the new reschedule time to replace the disconnecting allocs, but by the time the evaluation was executed, the allocs no longer qualified as disconnecting, so a new restriction needed to be added in the `updateByReschedulable` function for allocs that were not disconnecting and had not failed yet: the `unknown` allocs.
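A hedged sketch of that restriction, using made-up parameters rather than the real `updateByReschedulable` arguments:

```go
package sketch

// shouldRescheduleSketch illustrates, in hypothetical simplified form, the
// extra case described above: by the time the follow-up evaluation runs, the
// alloc is no longer "disconnecting" but "unknown", so the decision has to
// also accept allocs that have neither failed nor are currently
// disconnecting, as long as their client status is unknown.
func shouldRescheduleSketch(clientStatus string, failed, disconnecting, policyAllows bool) bool {
	if failed || disconnecting {
		return policyAllows
	}
	// New case (sketched): an unknown alloc that disconnected earlier and is
	// now being handled by the evaluation created for it.
	if clientStatus == "unknown" {
		return policyAllows
	}
	return false
}
```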
In order to avoid adding replacements for these allocs every time the reconciler runs, a validation using the `followupEvalID` was added.
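A minimal sketch of such a guard, assuming the follow-up evaluation's ID is recorded on the allocation (Nomad's allocation struct does carry a `FollowupEvalID` field, but everything else here is illustrative):

```go
package sketch

// unknownAlloc is an illustrative stand-in for an allocation that went
// unknown after disconnecting; not Nomad's real Allocation struct.
type unknownAlloc struct {
	ID             string
	FollowupEvalID string
}

// shouldReplaceNow only acts when the current reconciler run is executing the
// follow-up evaluation created at the alloc's reschedule time, so repeated
// reconciler runs do not keep adding replacements for the same alloc.
func shouldReplaceNow(a unknownAlloc, currentEvalID string) bool {
	return a.FollowupEvalID != "" && a.FollowupEvalID == currentEvalID
}
```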
When taking into account the reschedule policy, every disconnecting alloc generates two new evaluations: one for the next reschedule time, in order to place a replacement, and a second one to set the alloc as lost once the `max_client_disconnect` times out.
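Sketched with hypothetical types (Nomad's real `Evaluation` struct and trigger reasons are different), the two evaluations described above look roughly like this:

```go
package sketch

import "time"

// evalSketch is an illustrative stand-in for an evaluation.
type evalSketch struct {
	Reason    string
	WaitUntil time.Time
}

// evalsForDisconnectingAlloc sketches the behavior described above: each
// disconnecting alloc yields one evaluation at its next reschedule time (to
// place the replacement) and one at the max_client_disconnect deadline (to
// mark the alloc lost if it never reconnects).
func evalsForDisconnectingAlloc(now time.Time, rescheduleDelay, maxClientDisconnect time.Duration) []evalSketch {
	return []evalSketch{
		{Reason: "replace at next reschedule time", WaitUntil: now.Add(rescheduleDelay)},
		{Reason: "mark lost after max_client_disconnect", WaitUntil: now.Add(maxClientDisconnect)},
	}
}
```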
The way the code is written, the disconnecting alloc would get updated with the evaluation ID of the second generated evaluation, making the `nextEvalID` useless for verifying whether the `unknown` alloc being rescheduled was in fact a replacement for an old one. A change was put in place to overwrite the `nextEvalID` with the ID of the evaluation corresponding to the replacement. This change also made it much easier to determine, on reconnect, whether an alloc was a replacement for an older one.
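A hypothetical illustration of that overwrite, not Nomad's actual data model:

```go
package sketch

// disconnectingAlloc is illustrative; the field name NextEvalID mirrors the
// wording above, not a real Nomad struct field.
type disconnectingAlloc struct {
	ID         string
	NextEvalID string
}

// recordEvals keeps the replacement evaluation's ID on the alloc, even though
// the timeout/lost evaluation is created afterwards, so the link "this
// unknown alloc was replaced by that eval" stays verifiable on reconnect.
func recordEvals(a *disconnectingAlloc, replacementEvalID, timeoutEvalID string) {
	_ = timeoutEvalID // the second eval still exists but is not recorded here
	a.NextEvalID = replacementEvalID
}
```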
Overview of commits