-
Notifications
You must be signed in to change notification settings - Fork 436
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RayCluster][Feature] skip suspending worker groups if the in-tree autoscaler is enabled #2748
[RayCluster][Feature] skip suspending worker groups if the in-tree autoscaler is enabled #2748
Conversation
a68ca2e
to
dfd71b8
Compare
dfd71b8
to
3e99db1
Compare
cc @andrewsykim, would you mind reviewing this PR after #2643 is merged? Thanks |
discussed offline: validate RayCluster spec and RayJob spec |
@rueian can you let me know once the validation logic is addeed? |
…toscaler is enabled to prevent ray cluster from malfunctioning Signed-off-by: Rueian <[email protected]>
…toscaler is enabled to prevent ray cluster from malfunctioning Signed-off-by: Rueian <[email protected]>
3e99db1
to
cb8a866
Compare
…toscaler is enabled to prevent ray cluster from malfunctioning Signed-off-by: Rueian <[email protected]>
…toscaler is enabled to prevent ray cluster from malfunctioning Signed-off-by: Rueian <[email protected]>
Hi @kevin85421 and @andrewsykim, This PR is ready and only focuses on validation on the RayCluster side. I will open another PR for validationon the RayJob side as well. |
Using the feature gate for this suspend field will also be another PR. |
ray-operator/controllers/ray/raycluster_controller_unit_test.go
Outdated
Show resolved
Hide resolved
@@ -861,6 +861,47 @@ var _ = Context("Inside the default namespace", func() { | |||
}) | |||
}) | |||
|
|||
Describe("Suspend RayCluster worker group with Autoscaler enabled", Ordered, func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A unit test seems sufficient if the validation happens at the very beginning of the reconciliation, so no other logic is involved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test case could still be useful when we later add the support to suspend worker groups in an autoscaler-enabled cluster. Do you think we should delete it for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense
…toscaler is enabled to prevent ray cluster from malfunctioning Signed-off-by: Rueian <[email protected]>
68eb153
to
3155e13
Compare
@andrewsykim I merged this PR for now to move forward. Feel free to open a follow up PR if you have any comments. |
Why are these changes needed?
Skip suspending worker groups if the in-tree autoscaler is enabled to prevent ray cluster from malfunctioning.
The old autoscaler can't know if a worker group has been suspended on the KubeRay side, therefore, it will keep making wrong scaling decisions if some of the worker groups have been suspended. Such as:
To prevent a ray cluster from malfunctioning like that, we must forbid the usage of work group suspension on all old Ray versions that don't recognize the
suspend
field in the CR. However, we are unable to forbid usage by the controller without the validation webhook. In the case without the validation webhook, we can only fire a warning eventInvalidRayClusterSpec
to let users know that this is forbidden.The validation webhook implementation will be in another PR.
Related issue number
Checks