
fix: handling non reschedule disconnecting and reconnecting allocs #18701

Merged
merged 27 commits into main from b-gh-16515
Oct 27, 2023

Conversation

Member

@Juanadelacuesta Juanadelacuesta commented Oct 9, 2023

This PR fixes the issue "Max_client_disconnect ignores no-reschedule policy".

The bug had 2 root causes:

  • The code was written under the assumption that disconnecting allocs should always be rescheduled:
    _, rescheduleDisconnecting, _ := disconnecting.filterByRescheduleable(a.batch, true, a.now, a.evalID, a.deployment)

Any disconnecting allocation that should not be rescheduled now was ignored for the rest of the reconcile process and always ended up with a new placement, because it was not taken into account in the deployment's final count (this effect is sketched below, after the description of the fix).

  • The logic that decides whether an allocation should be rescheduled didn't take disconnecting allocations into
    account: they were never set to be explicitly replaced because, according to the shouldFilter logic, they are untainted and are ignored on line 396, so they never make it to the eligibleNow selection.
	isUntainted, ignore := shouldFilter(alloc, isBatch)
	if isUntainted && !isDisconnecting {
		untainted[alloc.ID] = alloc
	}
	if isUntainted || ignore {
		continue
	}

	// Only failed allocs with desired state run get to this point
	// If the failed alloc is not eligible for rescheduling now we
	// add it to the untainted set. Disconnecting delay evals are
	// handled by allocReconciler.createTimeoutLaterEvals
	eligibleNow, eligibleLater, rescheduleTime = updateByReschedulable(alloc, now, evalID, deployment, isDisconnecting)
	if !isDisconnecting && !eligibleNow {

Also, in updateByReschedulable, disconnecting allocs were never set to reschedule now:

	if eligible && (alloc.FollowupEvalID == evalID || rescheduleTime.Sub(now) <= rescheduleWindowSize) {
		rescheduleNow = true
		return
	}

These allocations would once again end up being replaced because the deployment count was one short, not because they should be replaced, and definitely without taking their reschedule policy into account.
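For illustration, here is a minimal, self-contained sketch (hypothetical types and a 1-second window picked for the example, not the real updateByReschedulable) of why that check never fires for a disconnecting alloc: there is no follow-up eval yet, and the computed reschedule time never falls inside the window:

package main

import (
	"fmt"
	"time"
)

const rescheduleWindowSize = 1 * time.Second // illustrative window

type alloc struct {
	FollowupEvalID string    // empty: no follow-up eval has been created yet
	RescheduleTime time.Time // earliest time the policy allows a replacement
}

// rescheduleNow mirrors the shape of the old condition quoted above.
func rescheduleNow(a alloc, now time.Time, evalID string, eligible bool) bool {
	return eligible &&
		(a.FollowupEvalID == evalID || a.RescheduleTime.Sub(now) <= rescheduleWindowSize)
}

func main() {
	now := time.Now()
	disconnecting := alloc{
		FollowupEvalID: "",
		RescheduleTime: now.Add(30 * time.Second), // minimum service delay, see below
	}
	fmt.Println(rescheduleNow(disconnecting, now, "eval-1", true)) // false: never "now"
}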

Once the previous problem was fixed and the correct reschedule policy was taken into account, a new bug became visible: disconnecting allocations don't have a fail time, so the alloc reconciler's now was used to calculate the next reschedule time. The minimum delay is 5 seconds for batch jobs and 30 seconds for services, so following the logic in the code the next reschedule time always ends up being at least 5 seconds from now, and disconnecting allocations were always set to rescheduleLater; but for disconnecting allocations the rescheduleLater group was ignored, as shown previously on line 461 of the reconciler.
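A quick sketch of that arithmetic (illustrative constants, assuming the same 1-second window as above) shows why a disconnecting alloc could only ever be classified as rescheduleLater:

package main

import (
	"fmt"
	"time"
)

const (
	rescheduleWindowSize = 1 * time.Second  // illustrative window
	minBatchDelay        = 5 * time.Second  // minimum batch reschedule delay
	minServiceDelay      = 30 * time.Second // minimum service reschedule delay
)

func main() {
	now := time.Now()
	for name, minDelay := range map[string]time.Duration{
		"batch":   minBatchDelay,
		"service": minServiceDelay,
	} {
		// With no fail time, the base for the delay is the reconciler's now,
		// so the next reschedule time is always at least minDelay away...
		next := now.Add(minDelay)
		// ...and therefore never inside the window: the alloc lands in
		// rescheduleLater, a group that was ignored for disconnecting allocs.
		fmt.Printf("%s: reschedule now? %v\n", name, next.Sub(now) <= rescheduleWindowSize)
	}
}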

The function filterByRescheduleable was modified to correctly process disconnecting allocations, and the untainted disconnecting allocations and the ones set to reconnect later were added to the result, to avoid unnecessary placements and to correctly replace the disconnecting allocs according to their reschedule policy.
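As a rough sketch of why keeping those allocs in the result matters (hypothetical types and a deliberately simplified placement count, not the real filterByRescheduleable or reconciler math):

package main

import "fmt"

// alloc stands in for a disconnecting allocation; ShouldReschedule is what
// its reschedule policy allows.
type alloc struct {
	ID               string
	ShouldReschedule bool
}

// oldFilter mirrors the buggy assumption: only reschedulable allocs are
// returned, the rest silently disappear from every result set.
func oldFilter(disconnecting []alloc) (untainted, rescheduleNow []alloc) {
	for _, a := range disconnecting {
		if a.ShouldReschedule {
			rescheduleNow = append(rescheduleNow, a)
		}
	}
	return untainted, rescheduleNow
}

// fixedFilter keeps non-reschedulable disconnecting allocs as untainted so
// they still count toward the group total.
func fixedFilter(disconnecting []alloc) (untainted, rescheduleNow []alloc) {
	for _, a := range disconnecting {
		if a.ShouldReschedule {
			rescheduleNow = append(rescheduleNow, a)
		} else {
			untainted = append(untainted, a)
		}
	}
	return untainted, rescheduleNow
}

// placements is a very rough stand-in for the reconciler's count: anything
// not covered by an existing or replaced alloc gets a new placement.
func placements(desired int, untainted, rescheduleNow []alloc) int {
	return desired - len(untainted) - len(rescheduleNow)
}

func main() {
	group := []alloc{{ID: "a1", ShouldReschedule: false}} // no-reschedule policy

	u, r := oldFilter(group)
	fmt.Println("old:", placements(1, u, r)) // 1: replaced despite the policy

	u, r = fixedFilter(group)
	fmt.Println("fixed:", placements(1, u, r)) // 0: still counted, no extra placement
}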

Once the previous change was done, new evaluations were created with the new reschedule time to replace the disconnecting allocs, but by the time such an evaluation was executed the allocs no longer qualified as disconnecting, so a new restriction had to be added in the updateByReschedulable function for allocs that were not disconnecting and had not failed yet: the unknown allocs.
To avoid adding replacements for these allocs every time the reconciler runs, a validation using the followupEvalID was added.
When the reschedule policy is taken into account, every disconnecting alloc generates 2 new evaluations: one for the next reschedule time, in order to place a replacement, and a second one to set the alloc as lost once the max_client_disconnect times out.
The way the code was written, the disconnecting alloc would get updated with the evaluation ID of the second generated evaluation, making the nextEvalID useless for verifying whether the unknown alloc being rescheduled was in fact a replacement for an old one. A change was put in place to overwrite the nextEvalID with the ID of the evaluation corresponding to the replacement. This change also made it a lot easier to determine, on reconnect, whether an alloc was a replacement of an older one.
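A minimal sketch (hypothetical field and function names) of that kind of guard: an unknown alloc is only treated as reschedulable when the evaluation being processed is the one that was created to place its replacement:

package main

import "fmt"

// Simplified stand-ins for the alloc fields the reconciler inspects.
type alloc struct {
	ID             string
	ClientStatus   string // e.g. "unknown"
	FollowupEvalID string // overwritten with the replacement eval's ID by the fix
}

// shouldReplaceUnknown only allows a replacement when the current eval is the
// follow-up eval created for it, instead of adding one on every reconciler pass.
func shouldReplaceUnknown(a alloc, currentEvalID string) bool {
	return a.ClientStatus == "unknown" && a.FollowupEvalID == currentEvalID
}

func main() {
	a := alloc{ID: "a1", ClientStatus: "unknown", FollowupEvalID: "replacement-eval"}

	fmt.Println(shouldReplaceUnknown(a, "replacement-eval"))  // true: this eval owns the replacement
	fmt.Println(shouldReplaceUnknown(a, "lost-timeout-eval")) // false: ignored on unrelated passes
}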

@Juanadelacuesta Juanadelacuesta force-pushed the b-gh-16515 branch 2 times, most recently from 1ab800f to a069806 on October 10, 2023 19:52
@Juanadelacuesta Juanadelacuesta marked this pull request as ready for review October 12, 2023 14:36
Member

@tgross tgross left a comment


The code changes in this PR seem to make sense @Juanadelacuesta, but unfortunately it looks like this fixes the bug while introducing a new one (or maybe I'm discovering a related bug here?).

I built Nomad from this branch for testing. I deployed a 2-node cluster and a job with the following relevant bits of the jobspec:

job "httpd" {
  reschedule {
    delay          = "2m"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }

  group "web" {
    max_client_disconnect = "5m"

    # ... etc
  }
}
Full job spec:
job "httpd" {

  reschedule {
    delay          = "2m"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }

  group "web" {

    max_client_disconnect = "5m"

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      resources {
        cpu    = 100
        memory = 100
      }

    }
  }
}

First, I waited for the deployment to finish. After I stopped the node the alloc was on and waited for the missed heartbeat, the alloc was marked unknown as expected, but it got rescheduled immediately; it didn't wait for the delay window:

$ nomad eval list
ID        Priority  Triggered By            Job ID  Namespace  Node ID   Status    Placement Failures
90de133f  50        max-disconnect-timeout  httpd   default    <none>    pending   false
f0b4cd16  50        node-update             httpd   default    d55ab857  complete  false
768e0dbe  50        deployment-watcher      httpd   default    <none>    complete  false
ef752be3  50        job-register            httpd   default    <none>    complete  false
$ nomad job status httpd
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
7baca89e  8910b70c  web         0        run      running  33s ago    21s ago
7abbdab0  d55ab857  web         0        run      unknown  1m48s ago  33s ago

If I start over and disable reschedules entirely, the bug fix works as intended.

job "httpd" {
  reschedule {
    delay          = "2m"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = false
  }

  group "web" {
    max_client_disconnect = "5m"

    # ... etc
  }
}

Then it's not immediately rescheduled, and I end up seeing the following evals in my list:

$ nomad eval list
ID        Priority  Triggered By            Job ID  Namespace  Node ID   Status    Placement Failures
38554c98  50        max-disconnect-timeout  httpd   default    <none>    pending   false
e8976a61  50        node-update             httpd   default    d55ab857  complete  false
89feb842  50        deployment-watcher      httpd   default    <none>    complete  false
5a25579f  50        job-register            httpd   default    <none>    complete  false

…ve been previously replaced can be replaced again
…y count untainted and reconnected using the adequate alloc replacement logic
… to be rescheduled and set as untainted the ones that don't
…locs, to avoid replacing them by empty follow up eval
Member

@tgross tgross left a comment


Nice work on this @Juanadelacuesta!

Code LGTM, and I've pulled it down for testing manually and everything checks out. :shipit:

@Juanadelacuesta Juanadelacuesta merged commit e8efe2d into main Oct 27, 2023
22 checks passed
@Juanadelacuesta Juanadelacuesta deleted the b-gh-16515 branch October 27, 2023 11:14
@Juanadelacuesta Juanadelacuesta added the backport/1.4.x, backport/1.5.x, and backport/1.6.x labels Oct 27, 2023
Juanadelacuesta added a commit that referenced this pull request Oct 27, 2023
…18701)

This PR fixes a long-lived bug where disconnecting allocations were never rescheduled according to their policy, but only because the group count was short. The default reschedule delay for services and batch jobs is 30 and 5 seconds respectively; in order to properly reschedule disconnected allocs, they need to be able to be rescheduled for later, a path that was not handled before. This PR introduces a way to handle such allocations.
Juanadelacuesta added a commit that referenced this pull request Oct 27, 2023
…18701)
Juanadelacuesta added a commit that referenced this pull request Oct 27, 2023
…18701) (#18899)
Juanadelacuesta added a commit that referenced this pull request Oct 27, 2023
…18701) (#18898)
Labels
backport/1.4.x (backport to 1.4.x release line), backport/1.5.x (backport to 1.5.x release line), backport/1.6.x (backport to 1.6.x release line)
Development

Successfully merging this pull request may close these issues.

Max_client_disconnect ignores no-reschedule policy.
2 participants