max_client_disconnect setting causes allocation states to hang in a non-terminal state when a client disconnects/reconnects #19729

Closed
louievandyke opened this issue Jan 12, 2024 · 7 comments

Comments

@louievandyke
Contributor

Nomad version

Output from nomad version

See below for tests on multiple versions.

Operating system and Environment details

Ubuntu Linux

Issue

We are running a job whose allocations show Desired = run and Status = failed. nomad system gc will not clean up these allocations because they are not in a terminal state and never transition to one.
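
For reference, this is roughly how the stuck allocations show up and how we try to clean them up from the CLI (a minimal sketch, using the job name from the spec below):

nomad job status holiday     # allocations stay at Desired=run, Status=failed
nomad system gc              # has no effect: only terminal allocations are garbage collected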

Reproduction steps

Create one Nomad server and one client.

I have tested the behavior with an example job.

On 1.4.14, I tried with restart attempts set to 3 or 0, and after some time the allocation's Desired state becomes stop.

job "holiday" {
  datacenters = ["dc1"]

  type = "service"

  constraint {
      distinct_hosts = true
  }

  update {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "15m"
    progress_deadline = "1h"
    auto_revert = false
    canary = 0
  }

  group "holiday_group" {
    count = 1
    max_client_disconnect = "1m"

    restart {
      attempts = 3
      interval = "1m"
      delay = "15s"
      mode = "fail"
    }

    reschedule {
      attempts = 0
      interval = "1m"
      delay = "15s"
      delay_function = "fibonacci"
      max_delay = "360s"
      unlimited = "true"
    }

    network {
      port "db" {
        to = 6379
      }
    }


    task "holiday_maintenance_pretask" {
      driver = "docker"

      lifecycle {
        hook = "prestart"
        sidecar = false
      }

      resources {
        memory = 50
      }


      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

    }

  }

}
root@nc1:/etc/nomad.d/job# nomad job status holiday
ID            = holiday
Name          = holiday
Submit Date   = 2023-12-28T18:16:03+05:30
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost  Unknown
holiday_group  0       0         0        4       0         0     0

Future Rescheduling Attempts
Task Group     Eval ID   Eval Time
holiday_group  0c26dcd2  16s from now

Latest Deployment
ID          = d37bc0c3
Status      = running
Description = Deployment is running

Deployed
Task Group     Desired  Placed  Healthy  Unhealthy  Progress Deadline
holiday_group  1        4       0        4          2023-12-28T19:16:03+05:30

Allocations
ID        Node ID   Task Group     Version  Desired  Status  Created    Modified
33a689fe  611b1a54  holiday_group  0        run      failed  33s ago    24s ago
83ea399d  611b1a54  holiday_group  0        stop     failed  1m7s ago   33s ago
a6843b3c  611b1a54  holiday_group  0        stop     failed  1m26s ago  1m7s ago
a6b7bc5c  611b1a54  holiday_group  0        stop     failed  1m45s ago  1m26s ago
root@nc1:/etc/nomad.d/job#

On 1.4.14 I used restart attempts = 3 and max_client_disconnect = "1m", then stopped the client agent. After 1 minute I can see the alloc's Desired/Status become (stop, lost) on the server. If I start the client agent again, the allocation tries to run on the same client node.
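
A rough sketch of that sequence, assuming the client agent runs as a systemd unit named nomad (an assumption on my side, not part of the original setup):

sudo systemctl stop nomad     # on the client: simulate the disconnect
nomad job status holiday      # on the server, after ~1m: Desired=stop, Status=lost
sudo systemctl start nomad    # reconnect the client; the alloc is retried on the same node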

root@ns1:/tmp# nomad job status holiday
ID            = holiday
Name          = holiday
Submit Date   = 2023-12-28T18:23:13+05:30
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost  Unknown
holiday_group  10      0         0        0       0         1     0

Placement Failure
Task Group "holiday_group":
  * No nodes were eligible for evaluation
  * No nodes are available in datacenter "dc1"

Latest Deployment
ID          = b8f23763
Status      = running
Description = Deployment is running

Deployed
Task Group     Desired  Placed  Healthy  Unhealthy  Progress Deadline
holiday_group  10       1       0        0          2023-12-28T19:23:13+05:30

Allocations
ID        Node ID   Task Group     Version  Desired  Status  Created    Modified
2dd0ab28  611b1a54  holiday_group  0        stop     lost    3m54s ago  9s ago
root@ns1:/tmp# 

On 1.5.10 I used restart attempts = 3, and after some time the allocation's Desired/Status stays at run, pending. Even when I tried with max_client_disconnect = "1m", I only see run/pending. nomad system gc is not able to clean the failed allocation because its state never moves to terminal.

root@nc1:/etc/nomad.d/job# nomad job status holiday
ID            = holiday
Name          = holiday
Submit Date   = 2023-12-28T18:39:27+05:30
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost  Unknown
holiday_group  0       1         0        0       0         0     0

Latest Deployment
ID          = 5baab18b
Status      = running
Description = Deployment is running

Deployed
Task Group     Desired  Placed  Healthy  Unhealthy  Progress Deadline
holiday_group  1        1       0        0          2023-12-28T19:39:27+05:30

Allocations
ID        Node ID   Task Group     Version  Desired  Status   Created    Modified
67f2bd6c  a066558f  holiday_group  0        run      pending  2m59s ago  13s ago
root@nc1:/etc/nomad.d/job#

On 1.7.2 I used restart attempts = 3, and the allocation's Desired/Status never changes to stop, even when tried with max_client_disconnect = "1m".

root@nc1:/etc/nomad.d/job# nomad job status holiday
ID            = holiday
Name          = holiday
Submit Date   = 2023-12-28T18:55:00+05:30
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost  Unknown
holiday_group  0       1         0        0       0         0     0

Latest Deployment
ID          = eb869f65
Status      = running
Description = Deployment is running

Deployed
Task Group     Desired  Placed  Healthy  Unhealthy  Progress Deadline
holiday_group  1        1       0        0          2023-12-28T19:55:00+05:30

Allocations
ID        Node ID   Task Group     Version  Desired  Status   Created   Modified
e2999ca8  57e214fd  holiday_group  0        run      pending  3m3s ago  14s ago
root@nc1:/etc/nomad.d/job#

Expected Result

After some time, the allocation's Desired state becomes stop.

Actual Result

The status for allocations does not change even when the count is fulfilled.

Job file (if appropriate)

see above

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@louievandyke louievandyke added type/bug hcc/cst Admin - internal labels Jan 12, 2024
@tgross
Member

tgross commented Jan 17, 2024

@louievandyke you've provided a lot of extraneous detail here, like garbage collection behavior (which seems correct to me given the allocation states). But it looks like the core issue you're reporting is simply that the max_client_disconnect field isn't being correctly respected from 1.5.10 onwards, inasmuch as an allocation with reschedule.attempts = 0 should be marked lost once the max_client_disconnect window expires.

I know that @Juanadelacuesta fixed a bug in this behavior in #18701 (which appears to be missing a changelog entry). It would be helpful to know if we're seeing the evaluations we expect.
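
One way to check that from the CLI (a sketch on my part; the eval subcommands below assume a reasonably recent Nomad version):

nomad eval list -job holiday     # list the evaluations created for the job
nomad eval status <eval-id>      # inspect one evaluation, including its trigger reason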

@vaiaro

vaiaro commented Jan 19, 2024

@tgross: Regarding the garbage collection behavior (which you say seems correct given the allocation states): the issue is that when running the command "nomad job status example", the allocation's Desired/Status show run, failed and never move to stop, complete.

Can you please tell us what we should do to move the allocation's Desired state to stop? nomad system gc will not clean the allocations because they are not in a terminal state and do not transition to terminal.

@tgross
Member

tgross commented Jan 19, 2024

@vaiaro the bug is that this isn't happening automatically as we'd expect. As a workaround, you could nomad alloc stop those allocations, so long as the node the allocation is on is connected.
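
A minimal sketch of that workaround (the allocation ID is a placeholder):

nomad job status holiday       # find the stuck allocation's ID
nomad alloc stop <alloc-id>    # stops it so it can reach a terminal state and be garbage collected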

@Juanadelacuesta
Member

@vaiaro I've been looking at this and I want to make sure I'm following the correct process:

  1. Set up a cluster with one server and one client.
  2. Spin up a job using the specs provided above and wait until it runs.
  3. Kill the nomad client and wait until the alloc Desired and Status are Stop and Lost respectively.
  4. Restart the client and see the Desired and Status get stuck on Run and Pending.

Am I missing anything?

@louievandyke
Contributor Author

@tgross I've done some more testing and have now reproduced this in a specific way. The issue is not that the client node encounters a heartbeat failure; instead, the max_client_disconnect setting seems to prevent the allocation's desired state from ever going terminal.

I’ve been able to reproduce this by killing the docker process on the node where the allocation for the below jobspec is running.

job "holiday" {
  datacenters = ["dc1"]
  type = "service"
  update {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "15m"
    progress_deadline = "1h"
    auto_revert = false
    canary = 0
  }
  group "holiday_group" {
    count = 1
    max_client_disconnect = "24h"
    restart {
      attempts = 1
      interval = "1m"
      delay = "15s"
      mode = "fail"
    }
    reschedule {
      interval = "20s"
      delay = "5s"
      delay_function = "fibonacci"
      max_delay = "360s"
      unlimited = "true"
    }
#    reschedule {
#      attempts  = 0
#      unlimited = false
#    }
    network {
      port "db" {
        to = 6379
      }
    }
    task "holiday_maintenance_pretask" {
      driver = "docker"
      resources {
        memory = 50
      }
      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }
    }
  }
}

screenshots of allocation state behavior on 1.4.12+ent

ubuntu@ip-172-31-30-149:~$ nomad --version
Nomad v1.4.12+ent (130da7d1c43269fd3a044228eb9bcde42b4cc9fc)
ubuntu@ip-172-31-30-149:~$ nomad job status holiday
ID            = holiday
Name          = holiday
Submit Date   = 2024-01-25T20:55:11Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost  Unknown
holiday_group  0       0         1        1       0         0     0

Latest Deployment
ID          = 2f0f868d
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group     Desired  Placed  Healthy  Unhealthy  Progress Deadline
holiday_group  1        1       1        0          2024-01-25T21:55:21Z

Allocations
ID        Node ID   Task Group     Version  Desired  Status   Created     Modified
af292ede  a56d51c0  holiday_group  1        run      running  24m22s ago  3m42s ago
ubuntu@ip-172-31-30-149:~$ sudo su
root@ip-172-31-30-149:/home/ubuntu# ps -ef | grep docker
root        5567       1  0 21:01 ?        00:00:00 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
root        5726    5567  0 21:01 ?        00:00:00 /usr/bin/docker-proxy -proto tcp -host-ip 172.31.30.149 -host-port 25280 -container-ip 172.17.0.2 -container-port 6379
root        5739    5567  0 21:01 ?        00:00:00 /usr/bin/docker-proxy -proto udp -host-ip 172.31.30.149 -host-port 25280 -container-ip 172.17.0.2 -container-port 6379
root        5808    1084  0 21:01 ?        00:00:00 /usr/local/bin/nomad docker_logger
root        6105    6097  0 21:06 pts/0    00:00:00 grep --color=auto docker
root@ip-172-31-30-149:/home/ubuntu# kill -9 5567

Nomad starts to restart the task

root@ip-172-31-30-149:/home/ubuntu# nomad job status hol
ID            = holiday
Name          = holiday
Submit Date   = 2024-01-25T20:55:11Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost  Unknown
holiday_group  0       1         0        1       0         0     0

Latest Deployment
ID          = 2f0f868d
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group     Desired  Placed  Healthy  Unhealthy  Progress Deadline
holiday_group  1        1       1        0          2024-01-25T21:55:21Z

Allocations
ID        Node ID   Task Group     Version  Desired  Status   Created     Modified
af292ede  a56d51c0  holiday_group  1        run      pending  26m49s ago  2s ago

I kill docker again...

root@ip-172-31-30-149:/home/ubuntu# ps -ef | grep docker
root        6185       1  1 21:07 ?        00:00:00 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
root        6449    6185  0 21:08 ?        00:00:00 /usr/bin/docker-proxy -proto tcp -host-ip 172.31.30.149 -host-port 25280 -container-ip 172.17.0.2 -container-port 6379
root        6462    6185  0 21:08 ?        00:00:00 /usr/bin/docker-proxy -proto udp -host-ip 172.31.30.149 -host-port 25280 -container-ip 172.17.0.2 -container-port 6379
root        6529    1084  0 21:08 ?        00:00:00 /usr/local/bin/nomad docker_logger
root        6549    6097  0 21:08 pts/0    00:00:00 grep --color=auto docker
root@ip-172-31-30-149:/home/ubuntu# kill -9 6185

A replacement Allocation ID is now spinning up and the old Allocation ID is at desired=stop and status=failed, i.e. terminal.

root@ip-172-31-30-149:/home/ubuntu# nomad job status hol
ID            = holiday
Name          = holiday
Submit Date   = 2024-01-25T20:55:11Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost  Unknown
holiday_group  0       1         0        2       0         0     0

Latest Deployment
ID          = 2f0f868d
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group     Desired  Placed  Healthy  Unhealthy  Progress Deadline
holiday_group  1        1       1        0          2024-01-25T21:55:21Z

Allocations
ID        Node ID   Task Group     Version  Desired  Status   Created     Modified
dd471b19  2b79acf1  holiday_group  1        run      pending  0s ago      0s ago
af292ede  a56d51c0  holiday_group  1        stop     failed   27m28s ago  0s ago

.
.
...

I ran through the same process on Nomad v1.5.10+ent, and the problem is that the old allocation will not go terminal until the max_client_disconnect duration has elapsed… odd, because it has nothing to do with a client RPC failure.

screenshots of reproduction on 1.5.10+ent

ubuntu@ip-172-31-27-5:~$ nomad --version
Nomad v1.5.10+ent
BuildDate 2023-10-30T14:57:35Z
Revision 7217ca6788edf2ec7245b064a0e5fc4c40356c4c
ubuntu@ip-172-31-27-5:~$ nomad job status holiday
ID            = holiday
Name          = holiday
Submit Date   = 2024-01-25T20:55:03Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost  Unknown
holiday_group  0       0         1        0       12        1     0

Latest Deployment
ID          = 11d1f2ec
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group     Desired  Placed  Healthy  Unhealthy  Progress Deadline
holiday_group  1        1       1        0          2024-01-25T21:55:19Z

Allocations
ID        Node ID   Task Group     Version  Desired  Status   Created     Modified
d4ad2a41  e062b420  holiday_group  9        run      running  10m30s ago  2m19s ago
ubuntu@ip-172-31-27-5:~$ sudo su
root@ip-172-31-27-5:/home/ubuntu# ps -ef | grep docker
root     1240058       1  0 21:03 ?        00:00:00 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
root     1240199 1240058  0 21:03 ?        00:00:00 /usr/bin/docker-proxy -proto tcp -host-ip 172.31.27.5 -host-port 28104 -container-ip 172.17.0.2 -container-port 6379
root     1240213 1240058  0 21:03 ?        00:00:00 /usr/bin/docker-proxy -proto udp -host-ip 172.31.27.5 -host-port 28104 -container-ip 172.17.0.2 -container-port 6379
root     1240290  213703  0 21:03 ?        00:00:00 /usr/local/bin/nomad docker_logger
root     1240498 1240490  0 21:06 pts/0    00:00:00 grep --color=auto docker
root@ip-172-31-27-5:/home/ubuntu# kill -9 1240058

Nomad starts to restart the task

root@ip-172-31-27-5:/home/ubuntu# nomad job status holiday
ID            = holiday
Name          = holiday
Submit Date   = 2024-01-25T20:55:03Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost  Unknown
holiday_group  0       1         0        0       12        1     0

Latest Deployment
ID          = 11d1f2ec
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group     Desired  Placed  Healthy  Unhealthy  Progress Deadline
holiday_group  1        1       1        0          2024-01-25T21:55:19Z

Allocations
ID        Node ID   Task Group     Version  Desired  Status   Created     Modified
d4ad2a41  e062b420  holiday_group  9        run      pending  14m13s ago  2s ago

I kill it again

root@ip-172-31-27-5:/home/ubuntu# ps -ef | grep docker
root     1240647       1  1 21:09 ?        00:00:00 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
root     1240875 1240647  0 21:09 ?        00:00:00 /usr/bin/docker-proxy -proto tcp -host-ip 172.31.27.5 -host-port 28104 -container-ip 172.17.0.2 -container-port 6379
root     1240888 1240647  0 21:09 ?        00:00:00 /usr/bin/docker-proxy -proto udp -host-ip 172.31.27.5 -host-port 28104 -container-ip 172.17.0.2 -container-port 6379
root     1240962  213703  0 21:09 ?        00:00:00 /usr/local/bin/nomad docker_logger
root     1240983 1240490  0 21:09 pts/0    00:00:00 grep --color=auto docker
root@ip-172-31-27-5:/home/ubuntu# kill -9 1240647

A replacement Allocation ID is spun up and the old Allocation ID is at desired=run and status=failed, i.e. NOT terminal, and it will stay that way for the duration of max_client_disconnect.

root@ip-172-31-27-5:/home/ubuntu# nomad job status holiday
ID            = holiday
Name          = holiday
Submit Date   = 2024-01-25T20:55:03Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost  Unknown
holiday_group  0       0         1        1       12        1     0

Latest Deployment
ID          = 11d1f2ec
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group     Desired  Placed  Healthy  Unhealthy  Progress Deadline
holiday_group  1        1       1        0          2024-01-25T21:55:19Z

Allocations
ID        Node ID   Task Group     Version  Desired  Status   Created    Modified
e1f27d68  a6135028  holiday_group  9        run      running  23s ago    7s ago
d4ad2a41  e062b420  holiday_group  9        run      failed   15m9s ago  23s ago
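
To confirm the old allocation really is non-terminal rather than a display quirk, something like this can be used (my own sketch; assumes jq is installed):

nomad alloc status -json d4ad2a41 | jq '{DesiredStatus, ClientStatus}'
# while stuck this returns: { "DesiredStatus": "run", "ClientStatus": "failed" }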

@Juanadelacuesta Juanadelacuesta self-assigned this Mar 19, 2024
Juanadelacuesta added a commit that referenced this issue Mar 26, 2024
… client disconnect (#20181)

Only ignore allocs on terminal states that are updated
---------

Co-authored-by: Tim Gross <[email protected]>
Juanadelacuesta added a commit that referenced this issue Mar 28, 2024
… client disconnect (#20181)

Only ignore allocs on terminal states that are updated
---------

Co-authored-by: Tim Gross <[email protected]>
Juanadelacuesta added a commit that referenced this issue Mar 28, 2024
… client disconnect (#20181) (#20253)

Only ignore allocs on terminal states that are updated
---------

Co-authored-by: Tim Gross <[email protected]>
Juanadelacuesta added a commit that referenced this issue Apr 2, 2024
…nts with max client disconnect into release/1.6.x (#20249)

* [gh-19729] Fix logic for updating terminal allocs on clients with max client disconnect (#20181)

Only ignore allocs on terminal states that are updated
---------

Co-authored-by: Tim Gross <[email protected]>

* fix: update test

---------

Co-authored-by: Juana De La Cuesta <[email protected]>
Co-authored-by: Tim Gross <[email protected]>
Co-authored-by: Juanadelacuesta <[email protected]>
@Juanadelacuesta
Member

Juanadelacuesta commented Apr 2, 2024

Fix merged and backported to 1.7, 1.6, and 1.5

Juanadelacuesta added a commit that referenced this issue Apr 2, 2024
…nts with max client disconnect into release/1.7.x (#20248)

* [gh-19729] Fix logic for updating terminal allocs on clients with max client disconnect (#20181)

Only ignore allocs on terminal states that are updated
---------

Co-authored-by: Tim Gross <[email protected]>

* [gh-19729] Fix logic for updating terminal allocs on clients with max client disconnect (#20181)

* fix: update test

---------

Co-authored-by: Juana De La Cuesta <[email protected]>
Co-authored-by: Tim Gross <[email protected]>
Co-authored-by: Juanadelacuesta <[email protected]>
philrenaud pushed a commit that referenced this issue Apr 18, 2024
… client disconnect (#20181)

Only ignore allocs on terminal states that are updated
---------

Co-authored-by: Tim Gross <[email protected]>

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 31, 2024