fix: reject reconnecting agents with different resource pool configuration #9815
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9815   +/-   ##
=======================================
  Coverage   54.75%   54.75%
=======================================
  Files        1261     1261
  Lines      156333   156348   +15
  Branches     3600     3598    -2
=======================================
+ Hits        85604    85615   +11
- Misses      70598    70602    +4
  Partials      131      131
// If the agent's resource pool is empty in the configuration and the master has it set to default,
// the agent should not be restarted. However, if the agent's resource pool differs from the master's record,
// the agent should be restarted.
if !(agentStarted.ResourcePoolName == "" && a.resourcePoolName == defaultResourcePoolName) &&
Maybe it makes sense to rewrite `""` -> `defaultResourcePoolName` earlier, when the agent tries to connect? There may be a lot of places that look at the `ResourcePoolName` field, and repeating the same check in all of them would be redundant.
Updated it in `DefaultOptions` to eliminate the need for checking it wherever an `Agent` is created. Let me know if you have any suggestions for improvement.
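As an illustration of the approach discussed in this thread, here is a minimal sketch of defaulting the resource pool once, at option construction time. The `Options` struct, the `WithConfiguredPool` helper, and the value `"default"` for `defaultResourcePoolName` are assumptions made for the example; the actual `DefaultOptions` in the agent package may look different.

```go
package main

import "fmt"

// defaultResourcePoolName is an assumed value, used here only for illustration.
const defaultResourcePoolName = "default"

// Options is a hypothetical, trimmed-down stand-in for the agent's options.
type Options struct {
	ResourcePool string
}

// DefaultOptions pre-populates the resource pool so that an unset value in
// the agent configuration never shows up later as an empty string.
func DefaultOptions() Options {
	return Options{ResourcePool: defaultResourcePoolName}
}

// WithConfiguredPool (hypothetical helper) overlays a configured pool name
// onto the defaults. An empty name means "not configured" and leaves the
// default in place.
func (o Options) WithConfiguredPool(name string) Options {
	if name != "" {
		o.ResourcePool = name
	}
	return o
}

func main() {
	fmt.Println(DefaultOptions().WithConfiguredPool("").ResourcePool)
	fmt.Println(DefaultOptions().WithConfiguredPool("pool1").ResourcePool)
}
```

With the default applied in one place, later comparisons can treat the pool name as always non-empty instead of special-casing `""`.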
@@ -92,4 +92,4 @@ stages:
 agent_reconnect_attempts: 24
 agent_reconnect_backoff: 5
 container_auto_remove_disabled: true
-artificial_slots: 4
+artificial_slots: 4
It'd be good to configure your editor so it does not remove newlines at the end of files. We can afford the extra byte of data. :)
Noted! I will configure my editor better. Surprisingly, though, this .yaml file didn't have a trailing newline at the end, and I didn't add or remove one. I'll see what is wrong and correct it. Thanks!
If you use VS Code, there's this setting you can toggle:
For more context, this is a POSIX standard thing: all lines should be terminated with a newline. Generally, this means that a file's last line should have a `\n` at the end as well. Git and GitHub show us when a file does not comply with the standard.
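To make the rule concrete, here is a small sketch of a check for a missing final newline. The helper name `endsWithNewline` is made up for this example, and treating an empty file as compliant is an assumption of the sketch.

```go
package main

import (
	"bytes"
	"fmt"
)

// endsWithNewline reports whether data is empty or its last byte is '\n'.
// Treating an empty file as acceptable is a choice made for this sketch.
func endsWithNewline(data []byte) bool {
	return len(data) == 0 || bytes.HasSuffix(data, []byte("\n"))
}

func main() {
	fmt.Println(endsWithNewline([]byte("artificial_slots: 4\n")))
	fmt.Println(endsWithNewline([]byte("artificial_slots: 4")))
}
```

A check like this could run over file contents read with `os.ReadFile` before committing, though an editor setting is usually the simpler fix.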
Awesome, thanks!
I'm good with the parts infra owns, though "see note re: trailing newlines in files" ;)
@@ -4512,6 +4512,20 @@ workflows:
extra-pytest-flags: "--no-compare-stats"
collect-det-job-logs: false

- test-e2e:
name: test-e2e-managed-devcluster-resource-pools
Option 1 (DONE): Create a separate E2E test for the .circleci/devcluster/multi-resource-pools.devcluster.yaml config file to allow for broader test coverage in the future.
Option 2: Add a parallel run within test-e2e-managed-devcluster for the new config file.
https://circleci.com/docs/parallelism-faster-jobs/
In both options, the overall execution time of the CircleCI test-e2e runs remains unaffected since they run in parallel with other E2E tests.
I think option 1 is good for now; when we need broader coverage, we can decide how to split things up!
LGTM, great work!
LGTM! Very minor suggestion about the logged message that's totally optional.
@@ -391,6 +391,11 @@ If you are using static resource pools and launching agents by hand, you will ne
:ref:`agent configuration <agent-config-reference>` to specify which resource pool the agent should
join.

Note that to change an agent's assigned resource_pool after it has already joined one, you need to
Suggested change:
- Note that to change an agent's assigned resource_pool after it has already joined one, you need to
+ To change the resource pool an agent is assigned to after it has already joined one, you need to
@@ -391,6 +391,11 @@ If you are using static resource pools and launching agents by hand, you will ne
:ref:`agent configuration <agent-config-reference>` to specify which resource pool the agent should
join.

Note that to change an agent's assigned resource_pool after it has already joined one, you need to
update the :ref:`agent configuration <agent-config-reference>`. Make sure to drain the agents
Suggested change:
- update the :ref:`agent configuration <agent-config-reference>`. Make sure to drain the agents
+ update the :ref:`agent's configuration <agent-config-reference>`. Before making this change, ensure the agents are properly drained. Once the configuration is updated, restart the agent to connect it to the new resource pool.
@@ -391,6 +391,11 @@ If you are using static resource pools and launching agents by hand, you will ne
:ref:`agent configuration <agent-config-reference>` to specify which resource pool the agent should
join.

Note that to change an agent's assigned resource_pool after it has already joined one, you need to
update the :ref:`agent configuration <agent-config-reference>`. Make sure to drain the agents
properly before modifying the resource_pool. After making the changes, restart the agent to connect
Suggested change:
- properly before modifying the resource_pool. After making the changes, restart the agent to connect
Note that to change an agent's assigned resource_pool after it has already joined one, you need to
update the :ref:`agent configuration <agent-config-reference>`. Make sure to drain the agents
properly before modifying the resource_pool. After making the changes, restart the agent to connect
it to the new resource pool.
Suggested change:
- it to the new resource pool.
suggested edits
LGTM
Ticket
CM-186
Description
A resource pool change does not apply if the agent restarts too quickly. With this change, when the resource pool differs on reconnect, the master discards the corresponding agent state and tells the agent to shut down (and restart). This avoids overcomplicating the restore process.
Error thrown by master:
Error thrown by agent before it dies:
[FATA] agent is past reconnect period, it must restart
Previously, the steps were as follows:
After my code changes, the steps are now:
Test Plan
To test manually:
Commands to make/run/test manually:
Build only master and agent to get their binaries:
make -C master build
make -C agent build
Run a Docker container to create the database (you can get this command via ‘Copy the docker run’ from an already running Docker container):
% docker run --hostname=d07083bced2f --mac-address=02:42:ac:11:00:02 --env=POSTGRES_DB=determined --env=POSTGRES_PASSWORD=postgres --env=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/postgresql/10/bin --env=GOSU_VERSION=1.12 --env=LANG=en_US.utf8 --env=PG_MAJOR=10 --env=PG_VERSION=10.14-1.pgdg90+1 --env=PGDATA=/var/lib/postgresql/data --volume=/Users/shreya/.postgres:/var/lib/postgresql/data --volume=/var/lib/postgresql/data -p 5432:5432 --restart=no --runtime=runc -d postgres:10.14
Run the master with master.yaml:
% master/build/determined-master --config-file determined/.circleci/devcluster/master.yaml
Run the agent with agent.yaml:
% agent/build/determined-agent --config-file determined/.circleci/devcluster/agent.yaml
Run E2E test:
% pytest e2e_tests/tests/cluster/test_master_restart.py -k "test_agent_resource_pool_change" --log-cli-level=info --user-password=<INITIAL_USER_PASSWORD>
Checklist
docs/release-notes/
See Release Note for details.