
fix: reject reconnecting agents with different resource pool configuration #9815

Merged: 18 commits merged into main from shreya/agentRestart on Aug 29, 2024

Conversation

@ShreyaLnuHpe (Contributor) commented Aug 13, 2024

Ticket

CM-186

Description

A resource pool change does not take effect if the agent restarts too quickly. With this change, when the master detects a resource pool change on reconnect, it discards the corresponding agent state and tells the agent to shut down (and restart), rather than complicating the restore process.

Error thrown by master:

[ERRO] change in agent resource pool was detected    component=agent error="resource pool has changed: pool1 -> pool2" id=agent1 resource-pool=pool1
[ERRO] agent crashed                                 address=127.0.0.1 component=agent error="resource pool has changed: pool1 -> pool2" id=agent1 resource-pool=pool1 started=true

Error thrown by agent before it dies:
[FATA] agent is past reconnect period, it must restart

Previously, the steps were as follows:

  1. The master starts and creates resource pools.
  2. The agent starts and connects to the master.
  3. The master assigns the agent to the resource pool specified in the configuration.
  4. The configuration is updated.
  5. The agent is stopped, and during this time, the master attempts to reconnect/restore the agent for a few seconds.
  6. If the agent restarts while the master is still trying to reconnect, the master does not reload the configuration and continues using the old resource pool.
  7. If the agent does not restart within those few seconds, the master drains/removes the agent. If the agent starts afterward, the master reads the updated configuration and attempts to connect the agent to the new resource pool.

After my code changes, the steps are now:

  1. The master starts and creates resource pools.
  2. The agent starts and connects to the master.
  3. The master assigns the agent to the resource pool specified in the configuration.
  4. The configuration is updated.
  5. The agent is stopped, and during this time, the master attempts to reconnect/restore the agent for a few seconds.
  6. If the agent tries to restart while the master is still attempting to reconnect, the master will detect the change in the resource pool, throw an error, and remove the agent, prompting a restart of the agent.
  7. If the agent does not restart within those few seconds, the master drains/removes the agent. If the agent starts afterward, the master reads the updated configuration and attempts to connect the agent to the new resource pool.
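The new behavior in step 6 boils down to a single comparison on reconnect. A minimal Go sketch (not the actual Determined code; the function and variable names are illustrative):

```go
package main

import "fmt"

// checkResourcePool is an illustrative stand-in for the master-side check:
// compare the pool recorded for the agent against the pool the reconnecting
// agent reports, and reject the stale state if they differ.
func checkResourcePool(recordedPool, reportedPool string) error {
	if recordedPool != reportedPool {
		return fmt.Errorf("resource pool has changed: %s -> %s", recordedPool, reportedPool)
	}
	return nil
}

func main() {
	// The agent restarted with pool2 while the master still records pool1.
	if err := checkResourcePool("pool1", "pool2"); err != nil {
		// The master logs the error, discards the agent state, and tells
		// the agent to shut down and restart.
		fmt.Println(err)
	}
}
```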

Test Plan

  • Unit test added
  • Automated E2E test

To test manually:

  1. Update the resource pool config and have the agent reconnect to the master.
  2. The agent state will be wiped, and the agent will be told it needs to restart.
  3. After another reconnection, the agent and master should work correctly.

Commands to make/run/test manually:
Build only master and agent to get their binaries:
make -C master build
make -C agent build

Run Docker to create a database (you can get this command via "Copy docker run" from an already-running container):

% docker run --hostname=d07083bced2f --mac-address=02:42:ac:11:00:02 --env=POSTGRES_DB=determined --env=POSTGRES_PASSWORD=postgres --env=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/postgresql/10/bin --env=GOSU_VERSION=1.12 --env=LANG=en_US.utf8 --env=PG_MAJOR=10 --env=PG_VERSION=10.14-1.pgdg90+1 --env=PGDATA=/var/lib/postgresql/data --volume=/Users/shreya/.postgres:/var/lib/postgresql/data --volume=/var/lib/postgresql/data -p 5432:5432 --restart=no --runtime=runc -d postgres:10.14

Run the master with master.yaml:
% master/build/determined-master --config-file determined/.circleci/devcluster/master.yaml

Run the agent with agent.yaml:
% agent/build/determined-agent --config-file determined/.circleci/devcluster/agent.yaml

Run the E2E test:
% pytest e2e_tests/tests/cluster/test_master_restart.py -k "test_agent_resource_pool_change" --log-cli-level=info --user-password=<INITIAL_USER_PASSWORD>

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@cla-bot cla-bot bot added the cla-signed label Aug 13, 2024

netlify bot commented Aug 13, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 2cbf48f
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/66cf8429294b5e00081ab3a3


codecov bot commented Aug 13, 2024

Codecov Report

Attention: Patch coverage is 33.33333% with 10 lines in your changes missing coverage. Please review.

Project coverage is 54.75%. Comparing base (ee269c8) to head (2cbf48f).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
master/internal/rm/agentrm/agent.go 0.00% 10 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9815   +/-   ##
=======================================
  Coverage   54.75%   54.75%           
=======================================
  Files        1261     1261           
  Lines      156333   156348   +15     
  Branches     3600     3598    -2     
=======================================
+ Hits        85604    85615   +11     
- Misses      70598    70602    +4     
  Partials      131      131           
Flag Coverage Δ
backend 45.22% <33.33%> (+<0.01%) ⬆️
harness 72.62% <ø> (ø)
web 54.53% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
agent/internal/agent.go 0.00% <ø> (ø)
agent/internal/options/options.go 30.35% <100.00%> (+1.26%) ⬆️
master/internal/rm/agentrm/agent_state.go 84.02% <100.00%> (+0.17%) ⬆️
master/pkg/aproto/master_message.go 0.00% <ø> (ø)
master/internal/rm/agentrm/agent.go 25.72% <0.00%> (+0.29%) ⬆️

... and 3 files with indirect coverage changes

@ShreyaLnuHpe ShreyaLnuHpe marked this pull request as ready for review August 14, 2024 22:02
@ShreyaLnuHpe ShreyaLnuHpe requested review from a team as code owners August 14, 2024 22:02
// If the agent's resource pool is empty in the configuration and the master has it set to default,
// the agent should not be restarted. However, if the agent's resource pool differs from the master's record,
// the agent should be restarted.
if !(agentStarted.ResourcePoolName == "" && a.resourcePoolName == defaultResourcePoolName) &&
Contributor:

maybe it makes sense to replace/rewrite "" -> defaultResourcePoolName earlier, when the agent tries to connect? There may be a lot of places that look at the ResourcePoolName field; having to do the same check in all of them may be redundant.

Contributor Author:

Updated it in DefaultOptions to eliminate the need to check it wherever an Agent is created. Let me know if you have any suggestions for improvement.
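A hedged sketch of that approach in Go (the struct, function, and default-pool name below are illustrative, not the real determined-agent code): normalize the empty pool once when options are built, so no later comparison needs the special case.

```go
package main

import "fmt"

// defaultResourcePoolName is an assumption here; the real constant lives in
// the master's codebase.
const defaultResourcePoolName = "default"

// Options is a stripped-down stand-in for the agent options struct.
type Options struct {
	ResourcePool string
}

// resolveResourcePool sketches the normalization described above: an empty
// resource pool is rewritten to the default name once, at construction time,
// so comparisons against the master's record never need the "" special case.
func resolveResourcePool(o *Options) {
	if o.ResourcePool == "" {
		o.ResourcePool = defaultResourcePoolName
	}
}

func main() {
	opts := &Options{ResourcePool: ""}
	resolveResourcePool(opts)
	fmt.Println(opts.ResourcePool) // prints "default"
}
```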

@ShreyaLnuHpe ShreyaLnuHpe requested review from ioga and salonig23 August 14, 2024 23:54
@ShreyaLnuHpe ShreyaLnuHpe marked this pull request as draft August 20, 2024 18:05
@ShreyaLnuHpe ShreyaLnuHpe marked this pull request as ready for review August 22, 2024 05:38
@ShreyaLnuHpe ShreyaLnuHpe requested review from a team as code owners August 22, 2024 05:38
@@ -92,4 +92,4 @@ stages:
agent_reconnect_attempts: 24
agent_reconnect_backoff: 5
container_auto_remove_disabled: true
artificial_slots: 4
artificial_slots: 4
Contributor:

It'd be good to configure your editor so it does not remove newlines at the end of files. We can afford the extra byte of data. :)

Contributor Author:

Noted! I will configure my editor better. Surprisingly, this .yaml file didn't have a trailing newline at the end, and I didn't add or remove one. I'll see what is wrong and correct it. Thanks!

Contributor:

@maxrussell maxrussell Aug 27, 2024

If you use VS Code, there's a setting you can toggle:
[screenshot of the VS Code setting]

For more context, this is a POSIX standard thing: all lines should be terminated with a newline. Generally, this means that a file's last line should have a \n at the end as well. Git and GitHub show us when a file does not comply with the standard.
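Concretely, the toggle corresponds to this settings.json fragment (both keys are standard VS Code settings): `files.insertFinalNewline` appends a final newline on save, and `files.trimFinalNewlines` collapses any extra blank lines at the end of the file.

```json
{
  "files.insertFinalNewline": true,
  "files.trimFinalNewlines": true
}
```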

Contributor Author:

Awesome, thanks!

Contributor:

@dannysauer dannysauer left a comment

I'm good with the parts infra owns, though "see note re: trailing newlines in files" ;)

@@ -4512,6 +4512,20 @@ workflows:
extra-pytest-flags: "--no-compare-stats"
collect-det-job-logs: false

- test-e2e:
name: test-e2e-managed-devcluster-resource-pools
Contributor Author:

@ShreyaLnuHpe ShreyaLnuHpe Aug 25, 2024
Option 1 (DONE): Create a separate E2E test for the .circleci/devcluster/multi-resource-pools.devcluster.yaml config file to allow for broader test coverage in the future.
Option 2: Add a parallel run within test-e2e-managed-devcluster for the new config file.
https://circleci.com/docs/parallelism-faster-jobs/

In both options, the overall execution time of the CircleCI test-e2e runs remains unaffected since they run in parallel with other E2E tests.
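Option 1 might look roughly like the following workflow entry (a hypothetical sketch: the parameter names below are assumed for illustration and are not taken from the repo's actual CircleCI config):

```yaml
- test-e2e:
    name: test-e2e-managed-devcluster-resource-pools
    # Hypothetical parameters: point the managed devcluster at the new
    # multi-pool config and run only the resource-pool test.
    devcluster-config: multi-resource-pools.devcluster.yaml
    mark: "test_agent_resource_pool_change"
```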

Contributor:

I think option 1 is good for now; when we need broader coverage we can decide how to split things up!

master/internal/rm/agentrm/agent_state_test.go (outdated; thread resolved)
Contributor:

@salonig23 salonig23 left a comment

LGTM, great work!

Contributor:

@maxrussell maxrussell left a comment

LGTM! Very minor suggestion about the logged message that's totally optional.

master/internal/rm/agentrm/agent.go (outdated; thread resolved)
master/internal/rm/agentrm/agent_state_test.go (outdated; thread resolved)
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Aug 28, 2024
@determined-ci determined-ci requested a review from a team August 28, 2024 19:38
@ShreyaLnuHpe ShreyaLnuHpe requested a review from tara-hpe August 28, 2024 19:40
@@ -391,6 +391,11 @@ If you are using static resource pools and launching agents by hand, you will ne
:ref:`agent configuration <agent-config-reference>` to specify which resource pool the agent should
join.

Note that to change an agent's assigned resource_pool after it has already joined one, you need to
Contributor:
Suggested change
Note that to change an agent's assigned resource_pool after it has already joined one, you need to
To change the resource pool an agent is assigned to after it has already joined one, you need to

@@ -391,6 +391,11 @@ If you are using static resource pools and launching agents by hand, you will ne
:ref:`agent configuration <agent-config-reference>` to specify which resource pool the agent should
join.

Note that to change an agent's assigned resource_pool after it has already joined one, you need to
update the :ref:`agent configuration <agent-config-reference>`. Make sure to drain the agents
Contributor:
Suggested change
update the :ref:`agent configuration <agent-config-reference>`. Make sure to drain the agents
update the :ref:`agent's configuration <agent-config-reference>`. Before making this change, ensure the agents are properly drained. Once the configuration is updated, restart the agent to connect it to the new resource pool.

@@ -391,6 +391,11 @@ If you are using static resource pools and launching agents by hand, you will ne
:ref:`agent configuration <agent-config-reference>` to specify which resource pool the agent should
join.

Note that to change an agent's assigned resource_pool after it has already joined one, you need to
update the :ref:`agent configuration <agent-config-reference>`. Make sure to drain the agents
properly before modifying the resource_pool. After making the changes, restart the agent to connect
Contributor:
Suggested change
properly before modifying the resource_pool. After making the changes, restart the agent to connect

Note that to change an agent's assigned resource_pool after it has already joined one, you need to
update the :ref:`agent configuration <agent-config-reference>`. Make sure to drain the agents
properly before modifying the resource_pool. After making the changes, restart the agent to connect
it to the new resource pool.
Contributor:
Suggested change
it to the new resource pool.

Contributor:

@tara-hpe tara-hpe left a comment

suggested edits

@determined-ci determined-ci requested a review from a team August 28, 2024 20:10
@ShreyaLnuHpe ShreyaLnuHpe requested a review from tara-hpe August 28, 2024 20:10
Contributor:

@tara-hpe tara-hpe left a comment

LGTM

@ShreyaLnuHpe ShreyaLnuHpe merged commit a605f00 into main Aug 29, 2024
84 of 98 checks passed
@ShreyaLnuHpe ShreyaLnuHpe deleted the shreya/agentRestart branch August 29, 2024 16:28
Labels
cla-signed documentation Improvements or additions to documentation
7 participants