fix: reject reconnecting agents with different resource pool configuration #9815
Changes from 17 commits
d3caacd
b4a24f7
61384cf
2a4d4f2
8076a1b
cbf6884
47bea5b
d93bb75
556220b
b3eb2d0
69920f5
27825d2
178a096
4877be6
5342281
6af1bd5
f329db4
2cbf48f
@@ -0,0 +1,111 @@
stages:
  - db:
      name: db
      port: 5434

  - master:
      pre:
        - sh: make -C tools prep-root
      config_file:
        security:
          initial_user_password: $INITIAL_USER_PASSWORD
        db:
          host: localhost
          port: 5434
          password: postgres
          user: postgres
          name: determined
        __internal:
          preemption_timeout: 60s
        checkpoint_storage:
          type: shared_fs
          host_path: /tmp
          storage_path: determined-cp
        log:
          level: debug
        root: tools/build
        cache:
          cache_dir: /tmp/determined-cache
        launch_error: false
        telemetry:
          enabled: false
        resource_manager:
          default_aux_resource_pool: default
          default_compute_resource_pool: default
          scheduler:
            fitting_policy: best
            type: fair_share
          type: agent
        resource_pools:
          - agent_reattach_enabled: true
            agent_reconnect_wait: 25s
            description: ''
            max_aux_containers_per_agent: 100
            pool_name: default
            provider: null
            task_container_defaults: null
          - pool_name: pool1
            max_slots: 8
            scheduler:
              type: priority
        scim:
          enabled: true
          auth:
            type: basic
            username: determined
            password: password

  - custom:
      name: proxy
      cmd: ["socat", "-d", "-d", "TCP-LISTEN:8081,reuseaddr,fork", "TCP:localhost:8080"]
      post:
        - conncheck:
            port: 8081

  - agent:
      name: agent1
      config_file:
        master_host: 127.0.0.1
        master_port: 8081
        agent_id: agent1
        container_master_host: $DOCKER_LOCALHOST
        agent_reconnect_attempts: 24
        agent_reconnect_backoff: 5
        container_auto_remove_disabled: true
        hooks:
          on_connection_lost: ["touch", "/tmp/agent1-connection-lost"]

  - agent:
      name: agent2
      config_file:
        master_host: 127.0.0.1
        master_port: 8081
        agent_id: agent2
        container_master_host: $DOCKER_LOCALHOST
        agent_reconnect_attempts: 24
        agent_reconnect_backoff: 5
        container_auto_remove_disabled: true

  - agent:
      name: agent10  # Copy of agent1, but with a different resource pool.
      config_file:
        master_host: 127.0.0.1
        master_port: 8081
        agent_id: agent1
        container_master_host: $DOCKER_LOCALHOST
        agent_reconnect_attempts: 24
        agent_reconnect_backoff: 5
        container_auto_remove_disabled: true
        resource_pool: pool1

  - agent:
      name: agent20  # Copy of agent1, but with an empty (default) resource pool.
      config_file:
        master_host: 127.0.0.1
        master_port: 8081
        agent_id: agent1
        container_master_host: $DOCKER_LOCALHOST
        agent_reconnect_attempts: 24
        agent_reconnect_backoff: 5
        container_auto_remove_disabled: true
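The last two agent stages intentionally reuse `agent_id: agent1` with a different (or empty) `resource_pool`, so the master can be exercised against reconnects whose pool configuration changed. A minimal Python sketch of that check, under the assumption that the master records each agent's pool at first connection and that an unset pool falls back to `default` (names here are illustrative; the actual implementation lives in the Go master, not in this file):

```python
# Hypothetical registry: agent_id -> resource pool recorded at first connection.
known_pools = {}


def connect_agent(agent_id, resource_pool=None):
    """Register an agent, rejecting reconnects whose resource pool changed."""
    # An empty/unset pool is treated as the default pool.
    pool = resource_pool or "default"
    if agent_id in known_pools and known_pools[agent_id] != pool:
        raise ValueError(
            f"agent {agent_id!r} reconnected with resource pool {pool!r}, "
            f"but was previously registered in {known_pools[agent_id]!r}"
        )
    known_pools[agent_id] = pool


# agent1 first connects with no pool set, i.e. the default pool...
connect_agent("agent1")
# ...so agent20's config (empty pool) is equivalent and is accepted.
connect_agent("agent1", resource_pool=None)
# agent10 reuses agent_id "agent1" but asks for pool1: rejected.
try:
    connect_agent("agent1", resource_pool="pool1")
except ValueError as err:
    print("rejected:", err)
```

This mirrors the scenario above: same `agent_id`, mismatched pool is rejected, while an empty pool is accepted as equivalent to `default`.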
@@ -58,6 +58,7 @@ log:
log:
  level: trace
  color: true
resource_pool: default
slot_type: auto
security:
  tls:
@@ -391,6 +391,11 @@ If you are using static resource pools and launching agents by hand, you will ne
:ref:`agent configuration <agent-config-reference>` to specify which resource pool the agent should
join.

Note that to change an agent's assigned ``resource_pool`` after it has already joined one, you need
to update the :ref:`agent configuration <agent-config-reference>`. Make sure to drain the agent
properly before modifying ``resource_pool``. After making the changes, restart the agent so that it
connects to the new resource pool.

Migrate to Resource Pools
-------------------------
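As a concrete illustration of the documented procedure, reassigning an agent from the default pool to ``pool1`` amounts to editing the agent's configuration file and restarting it after draining. A sketch only, showing just the relevant fields (the pool name ``pool1`` is taken from the devcluster config above; your values will differ):

```yaml
# agent.yaml -- drain the agent first, then restart it after this edit.
master_host: 127.0.0.1
master_port: 8080
resource_pool: pool1   # previously: default (or unset)
```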
Review comment:

Option 1 (DONE): Create a separate E2E test for the .circleci/devcluster/multi-resource-pools.devcluster.yaml config file to allow for broader test coverage in the future.

Option 2: Add a parallel run within test-e2e-managed-devcluster for the new config file (https://circleci.com/docs/parallelism-faster-jobs/).

In both options, the overall execution time of the CircleCI test-e2e runs remains unaffected, since they run in parallel with other E2E tests.

Reply:

I think option 1 is good for now; when we need broader coverage, we can decide how to split things up!