[RayCluster][Feature] skip suspending worker groups if the in-tree autoscaler is enabled #2748

Merged: 5 commits merged on Jan 17, 2025

Conversation

Contributor

@rueian rueian commented Jan 14, 2025

Why are these changes needed?

Skip suspending worker groups when the in-tree autoscaler is enabled, to prevent the Ray cluster from malfunctioning.

The old autoscaler cannot tell whether a worker group has been suspended on the KubeRay side, so it keeps making wrong scaling decisions once any worker group is suspended. For example:

  1. It always tries to scale up a suspended worker group to satisfy the min_workers requirement.
  2. It always tries to scale up a suspended worker group to serve the currently queued Ray tasks.

To prevent a Ray cluster from malfunctioning like that, we must forbid worker group suspension on all old Ray versions that don't recognize the suspend field in the CR. However, the controller cannot reject such a spec on its own; that requires the validation webhook. Without the validation webhook, the controller can only fire a warning event, InvalidRayClusterSpec, to let users know that this combination is forbidden.
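As a rough illustration of the check described above, here is a minimal sketch in Go. The types are simplified stand-ins that model only the two fields involved, and the names `validateRayClusterSpec` and `isTrue` are hypothetical; this is not the code in this PR. Emitting the InvalidRayClusterSpec warning event is left out of the sketch.

```go
package main

import "fmt"

// WorkerGroupSpec is a simplified stand-in (assumption) for the worker group
// part of a RayCluster spec. Only the fields needed for this check are modeled.
type WorkerGroupSpec struct {
	GroupName string
	Suspend   *bool // nil means "not set", treated as false here
}

// RayClusterSpec is a simplified stand-in (assumption) for the RayCluster spec.
type RayClusterSpec struct {
	EnableInTreeAutoscaling *bool // nil means "not set", treated as false here
	WorkerGroupSpecs        []WorkerGroupSpec
}

// isTrue reports whether an optional boolean is set and true.
func isTrue(b *bool) bool {
	return b != nil && *b
}

// validateRayClusterSpec (hypothetical name) returns an error when any worker
// group is suspended while the in-tree autoscaler is enabled. A controller
// could run this at the start of reconciliation and, on error, record a
// warning event instead of applying the suspension.
func validateRayClusterSpec(spec RayClusterSpec) error {
	if !isTrue(spec.EnableInTreeAutoscaling) {
		// Without the autoscaler, suspending worker groups is allowed.
		return nil
	}
	for _, wg := range spec.WorkerGroupSpecs {
		if isTrue(wg.Suspend) {
			return fmt.Errorf(
				"worker group %q cannot be suspended while the in-tree autoscaler is enabled",
				wg.GroupName,
			)
		}
	}
	return nil
}
```

Because the check is a pure function of the spec, it can be exercised with plain unit tests that cover the four combinations of autoscaler enabled or disabled and worker group suspended or not.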

The validation webhook implementation will be in another PR.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@rueian rueian force-pushed the autoscaler-workergroup-suspend branch from a68ca2e to dfd71b8 Compare January 14, 2025 20:12
@rueian rueian marked this pull request as ready for review January 14, 2025 22:06
@rueian rueian force-pushed the autoscaler-workergroup-suspend branch from dfd71b8 to 3e99db1 Compare January 15, 2025 02:30
@kevin85421
Member

cc @andrewsykim, would you mind reviewing this PR after #2643 is merged? Thanks

@kevin85421 kevin85421 self-assigned this Jan 17, 2025
@kevin85421
Member

discussed offline: validate RayCluster spec and RayJob spec

@andrewsykim
Collaborator

@rueian can you let me know once the validation logic is added?

…toscaler is enabled to prevent ray cluster from malfunctioning

Signed-off-by: Rueian <[email protected]>
…toscaler is enabled to prevent ray cluster from malfunctioning

Signed-off-by: Rueian <[email protected]>
@rueian rueian force-pushed the autoscaler-workergroup-suspend branch from 3e99db1 to cb8a866 Compare January 17, 2025 19:47
…toscaler is enabled to prevent ray cluster from malfunctioning

Signed-off-by: Rueian <[email protected]>
…toscaler is enabled to prevent ray cluster from malfunctioning

Signed-off-by: Rueian <[email protected]>
@rueian
Contributor Author

rueian commented Jan 17, 2025

Hi @kevin85421 and @andrewsykim,

This PR is ready and focuses only on validation on the RayCluster side. I will open another PR for validation on the RayJob side as well.

@rueian
Contributor Author

rueian commented Jan 17, 2025

Putting this suspend field behind a feature gate will also be a separate PR.

ray-operator/controllers/ray/raycluster_controller.go (outdated, resolved)
@@ -861,6 +861,47 @@ var _ = Context("Inside the default namespace", func() {
})
})

Describe("Suspend RayCluster worker group with Autoscaler enabled", Ordered, func() {
Member

A unit test seems sufficient if the validation happens at the very beginning of the reconciliation, so no other logic is involved.

Contributor Author

I think this test case could still be useful when we later add support for suspending worker groups in an autoscaler-enabled cluster. Do you think we should delete it for now?

Member

makes sense

ray-operator/controllers/ray/raycluster_controller_test.go (outdated, resolved)
ray-operator/controllers/ray/raycluster_controller_test.go (outdated, resolved)
…toscaler is enabled to prevent ray cluster from malfunctioning

Signed-off-by: Rueian <[email protected]>
@rueian rueian force-pushed the autoscaler-workergroup-suspend branch from 68eb153 to 3155e13 Compare January 17, 2025 21:24
@kevin85421 kevin85421 merged commit d86ea62 into ray-project:master Jan 17, 2025
23 of 24 checks passed
@kevin85421
Member

@andrewsykim I merged this PR for now to move forward. Feel free to open a follow-up PR if you have any comments.

3 participants