
fix: handling non reschedule disconnecting and reconnecting allocs #18701

Merged
merged 27 commits into main from b-gh-16515
Oct 27, 2023

Conversation

Member

@Juanadelacuesta Juanadelacuesta commented Oct 9, 2023

This PR fixes the issue "Max_client_disconnect ignores no-reschedule policy".

The bug had 2 root causes:

  • The code was written under the assumption that disconnecting allocs should always be rescheduled:
    _, rescheduleDisconnecting, _ := disconnecting.filterByRescheduleable(a.batch, true, a.now, a.evalID, a.deployment)

Any disconnecting allocation that should not be rescheduled now was ignored for the rest of the reconcile process and always ended up with a new placement, because it was not taken into account in the deployment's final count (this effect is sketched below, after the description of the fix).

  • The logic that decides whether an allocation should be rescheduled didn't take disconnecting allocations into
    account: they were never set to be explicitly replaced because, according to the shouldFilter logic, they are untainted and are ignored on line 396, so they never make it to the eligibleNow selection.
	isUntainted, ignore := shouldFilter(alloc, isBatch)
	if isUntainted && !isDisconnecting {
		untainted[alloc.ID] = alloc
	}
	if isUntainted || ignore {
		continue
	}

	// Only failed allocs with desired state run get to this point
	// If the failed alloc is not eligible for rescheduling now we
	// add it to the untainted set. Disconnecting delay evals are
	// handled by allocReconciler.createTimeoutLaterEvals
	eligibleNow, eligibleLater, rescheduleTime = updateByReschedulable(alloc, now, evalID, deployment, isDisconnecting)
	if !isDisconnecting && !eligibleNow {

Also, in updateByReschedulable, disconnecting allocs were never set to reschedule now:

	if eligible && (alloc.FollowupEvalID == evalID || rescheduleTime.Sub(now) <= rescheduleWindowSize) {
		rescheduleNow = true
		return
	}

These allocations would once again end up being replaced because the deployment count was one short, not because they should be replaced, and definitely without taking their reschedule policy into account.
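For illustration, here is a minimal, self-contained sketch (hypothetical types and a 1-second window picked for the example, not the real updateByReschedulable) of why that check never fires for a disconnecting alloc: there is no follow-up eval yet, and the computed reschedule time never falls inside the window:

package main

import (
	"fmt"
	"time"
)

const rescheduleWindowSize = 1 * time.Second // illustrative window

type alloc struct {
	FollowupEvalID string    // empty: no follow-up eval has been created yet
	RescheduleTime time.Time // earliest time the policy allows a replacement
}

// rescheduleNow mirrors the shape of the old condition quoted above.
func rescheduleNow(a alloc, now time.Time, evalID string, eligible bool) bool {
	return eligible &&
		(a.FollowupEvalID == evalID || a.RescheduleTime.Sub(now) <= rescheduleWindowSize)
}

func main() {
	now := time.Now()
	disconnecting := alloc{
		FollowupEvalID: "",
		RescheduleTime: now.Add(30 * time.Second), // minimum service delay, see below
	}
	fmt.Println(rescheduleNow(disconnecting, now, "eval-1", true)) // false: never "now"
}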

Once the previous problem was fixed and the correct reschedule policy was taken into account, a new bug became visible: disconnecting allocations don't have a fail time, so the alloc reconciler's now was used to calculate the next reschedule time. The minimum delay is 5 seconds for batch jobs and 30 seconds for services, so following the logic in the code the next reschedule time always ends up being at least 5 seconds from now, and disconnecting allocations were always set to rescheduleLater; but for disconnecting allocations the rescheduleLater group was ignored, as shown previously on line 461 of the reconciler.
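A quick sketch of that arithmetic (illustrative constants, assuming the same 1-second window as above) shows why a disconnecting alloc could only ever be classified as rescheduleLater:

package main

import (
	"fmt"
	"time"
)

const (
	rescheduleWindowSize = 1 * time.Second  // illustrative window
	minBatchDelay        = 5 * time.Second  // minimum batch reschedule delay
	minServiceDelay      = 30 * time.Second // minimum service reschedule delay
)

func main() {
	now := time.Now()
	for name, minDelay := range map[string]time.Duration{
		"batch":   minBatchDelay,
		"service": minServiceDelay,
	} {
		// With no fail time, the base for the delay is the reconciler's now,
		// so the next reschedule time is always at least minDelay away...
		next := now.Add(minDelay)
		// ...and therefore never inside the window: the alloc lands in
		// rescheduleLater, a group that was ignored for disconnecting allocs.
		fmt.Printf("%s: reschedule now? %v\n", name, next.Sub(now) <= rescheduleWindowSize)
	}
}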

The function filterByRescheduleable was modified to correctly process disconnecting allocations, and the untainted disconnecting allocations and the ones set to reconnect later were added to the result, to avoid unnecessary placements and to correctly replace the disconnecting allocs according to their reschedule policy.
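As a rough sketch of why keeping those allocs in the result matters (hypothetical types and a deliberately simplified placement count, not the real filterByRescheduleable or reconciler math):

package main

import "fmt"

// alloc stands in for a disconnecting allocation; ShouldReschedule is what
// its reschedule policy allows.
type alloc struct {
	ID               string
	ShouldReschedule bool
}

// oldFilter mirrors the buggy assumption: only reschedulable allocs are
// returned, the rest silently disappear from every result set.
func oldFilter(disconnecting []alloc) (untainted, rescheduleNow []alloc) {
	for _, a := range disconnecting {
		if a.ShouldReschedule {
			rescheduleNow = append(rescheduleNow, a)
		}
	}
	return untainted, rescheduleNow
}

// fixedFilter keeps non-reschedulable disconnecting allocs as untainted so
// they still count toward the group total.
func fixedFilter(disconnecting []alloc) (untainted, rescheduleNow []alloc) {
	for _, a := range disconnecting {
		if a.ShouldReschedule {
			rescheduleNow = append(rescheduleNow, a)
		} else {
			untainted = append(untainted, a)
		}
	}
	return untainted, rescheduleNow
}

// placements is a very rough stand-in for the reconciler's count: anything
// not covered by an existing or replaced alloc gets a new placement.
func placements(desired int, untainted, rescheduleNow []alloc) int {
	return desired - len(untainted) - len(rescheduleNow)
}

func main() {
	group := []alloc{{ID: "a1", ShouldReschedule: false}} // no-reschedule policy

	u, r := oldFilter(group)
	fmt.Println("old:", placements(1, u, r)) // 1: replaced despite the policy

	u, r = fixedFilter(group)
	fmt.Println("fixed:", placements(1, u, r)) // 0: still counted, no extra placement
}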

Once the previous change was done, new evaluations were created with the new reschedule time to replace the disconnecting allocs, but by the time such an evaluation was executed the allocs no longer qualified as disconnecting, so a new restriction had to be added in the updateByReschedulable function for allocs that were not disconnecting and had not failed yet: the unknown allocs.
To avoid adding replacements for these allocs every time the reconciler runs, a validation using the followupEvalID was added.
When the reschedule policy is taken into account, every disconnecting alloc generates 2 new evaluations: one for the next reschedule time, in order to place a replacement, and a second one to set the alloc as lost once the max_client_disconnect times out.
The way the code was written, the disconnecting alloc would get updated with the evaluation ID of the second generated evaluation, making the nextEvalID useless for verifying whether the unknown alloc being rescheduled was in fact a replacement for an old one. A change was put in place to overwrite the nextEvalID with the ID of the evaluation corresponding to the replacement. This change also made it a lot easier to determine, on reconnect, whether an alloc was a replacement of an older one.
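A minimal sketch (hypothetical field and function names) of that kind of guard: an unknown alloc is only treated as reschedulable when the evaluation being processed is the one that was created to place its replacement:

package main

import "fmt"

// Simplified stand-ins for the alloc fields the reconciler inspects.
type alloc struct {
	ID             string
	ClientStatus   string // e.g. "unknown"
	FollowupEvalID string // overwritten with the replacement eval's ID by the fix
}

// shouldReplaceUnknown only allows a replacement when the current eval is the
// follow-up eval created for it, instead of adding one on every reconciler pass.
func shouldReplaceUnknown(a alloc, currentEvalID string) bool {
	return a.ClientStatus == "unknown" && a.FollowupEvalID == currentEvalID
}

func main() {
	a := alloc{ID: "a1", ClientStatus: "unknown", FollowupEvalID: "replacement-eval"}

	fmt.Println(shouldReplaceUnknown(a, "replacement-eval"))  // true: this eval owns the replacement
	fmt.Println(shouldReplaceUnknown(a, "lost-timeout-eval")) // false: ignored on unrelated passes
}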

@Juanadelacuesta Juanadelacuesta force-pushed the b-gh-16515 branch 2 times, most recently from 1ab800f to a069806 on October 10, 2023 19:52
@Juanadelacuesta Juanadelacuesta marked this pull request as ready for review October 12, 2023 14:36
Member

@tgross tgross left a comment


The code changes in this PR seem to make sense @Juanadelacuesta, but unfortunately it looks like this fixes the bug while introducing a new one (or maybe I'm discovering a related bug here?).

I built Nomad from this branch for testing. I deployed a 2-node cluster and a job with the following relevant bits of the jobspec:

job "httpd" {
  reschedule {
    delay          = "2m"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }

  group "web" {
    max_client_disconnect = "5m"

    # ... etc
  }
}
Full job spec:
job "httpd" {

  reschedule {
    delay          = "2m"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }

  group "web" {

    max_client_disconnect = "5m"

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      resources {
        cpu    = 100
        memory = 100
      }

    }
  }
}

First, I waited for the deployment to finish. After I stopped the node the alloc was on and waited for the missed heartbeat, the alloc was marked unknown as expected, but it got rescheduled immediately; it didn't wait for the delay window:

$ nomad eval list
ID        Priority  Triggered By            Job ID  Namespace  Node ID   Status    Placement Failures
90de133f  50        max-disconnect-timeout  httpd   default    <none>    pending   false
f0b4cd16  50        node-update             httpd   default    d55ab857  complete  false
768e0dbe  50        deployment-watcher      httpd   default    <none>    complete  false
ef752be3  50        job-register            httpd   default    <none>    complete  false
$ nomad job status httpd
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
7baca89e  8910b70c  web         0        run      running  33s ago    21s ago
7abbdab0  d55ab857  web         0        run      unknown  1m48s ago  33s ago

If I start over and disable reschedules entirely, the bug fix works as intended.

job "httpd" {
  reschedule {
    delay          = "2m"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = false
  }

  group "web" {
    max_client_disconnect = "5m"

    # ... etc
  }
}

Then it's not immediately rescheduled, and I end up seeing the following evals in my list:

$ nomad eval list
ID        Priority  Triggered By            Job ID  Namespace  Node ID   Status    Placement Failures
38554c98  50        max-disconnect-timeout  httpd   default    <none>    pending   false
e8976a61  50        node-update             httpd   default    d55ab857  complete  false
89feb842  50        deployment-watcher      httpd   default    <none>    complete  false
5a25579f  50        job-register            httpd   default    <none>    complete  false

…ve been previously replaced can be replaced again
…y count untainted and reconnected using the adequate alloc replacement logic
… to be rescheduled and set as untainted the ones that don't
…locs, to avoid replacing them by empty follow up eval
Member

@tgross tgross left a comment


Nice work on this @Juanadelacuesta!

Code LGTM, and I've pulled it down for testing manually and everything checks out. :shipit:

@Juanadelacuesta Juanadelacuesta merged commit e8efe2d into main Oct 27, 2023
22 checks passed
@Juanadelacuesta Juanadelacuesta deleted the b-gh-16515 branch October 27, 2023 11:14
@Juanadelacuesta Juanadelacuesta added the backport/1.4.x, backport/1.5.x, and backport/1.6.x labels Oct 27, 2023
Juanadelacuesta added a commit that referenced this pull request Oct 27, 2023
…18701)

This PR fixes a long-lived bug where disconnecting allocations were never rescheduled according to their policy, but only because the group count was short. The default reschedule delay for services and batch jobs is 30 and 5 seconds respectively; in order to properly reschedule disconnected allocs, they need to be able to be rescheduled for later, a path that was not handled before. This PR introduces a way to handle such allocations.
Juanadelacuesta added a commit that referenced this pull request Oct 27, 2023
…18701)
Juanadelacuesta added a commit that referenced this pull request Oct 27, 2023
…18701) (#18899)
Juanadelacuesta added a commit that referenced this pull request Oct 27, 2023
…18701) (#18898)
Labels
backport/1.4.x (backport to 1.4.x release line), backport/1.5.x (backport to 1.5.x release line), backport/1.6.x (backport to 1.6.x release line)
Development

Successfully merging this pull request may close these issues.

Max_client_disconnect ignores no-reschedule policy.
2 participants