Add threshold for scale down #1038

Open · wants to merge 4 commits into main
Conversation

roypaulin (Collaborator):

This adds a new field, scaleDownThreshold, to the metric's definition. It is used to control scale down, i.e., scale down happens only when the metric's value is lower than this new field's value.
Since the hpa does not natively support two thresholds, this works through a hack: scale down cannot happen while currentReplicas equals minReplicas, so the operator sets the hpa's minReplicas to the VerticaAutoscaler's targetSize as long as the metric's value is greater than scaleDownThreshold, which prevents scale down. The moment the value drops below scaleDownThreshold, the operator updates the hpa's minReplicas to its correct value (from the vas), which will trigger scale down.
This will not work as expected in combination with the scale-down stabilization window, because the new threshold is not part of the hpa. It is therefore recommended to set either a stabilization window or scaleDownThreshold for a stable scale-down experience.

NB: Using the scale-down stabilization window is also a very good way to get a stable scale-down process. A stabilization window that is long enough ensures the metric has stayed low long enough to scale down safely.
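For illustration, here is a minimal Go sketch of the minReplicas "pinning" described above. All function and parameter names are made up for this example and are not the PR's actual code:

```go
package main

import "fmt"

// pinnedMinReplicas returns the minReplicas value the operator would write to
// the hpa under the approach described above. All names are illustrative.
//   userMinReplicas    - minReplicas from the VerticaAutoscaler spec
//   targetSize         - current targetSize of the VerticaAutoscaler
//   metricValue        - latest observed value of the scaling metric
//   scaleDownThreshold - the new optional field (nil means "not set")
func pinnedMinReplicas(userMinReplicas, targetSize int32, metricValue float64, scaleDownThreshold *float64) int32 {
	if scaleDownThreshold == nil {
		// Field not set: leave minReplicas untouched, so the hpa scales
		// down based on its regular target, as it does today.
		return userMinReplicas
	}
	if metricValue >= *scaleDownThreshold {
		// Pin minReplicas to the current size: with currentReplicas equal
		// to minReplicas, the hpa cannot scale down (scale up still works).
		return targetSize
	}
	// Metric fell below scaleDownThreshold: restore the user-specified
	// minReplicas, which lets the hpa scale down.
	return userMinReplicas
}

func main() {
	threshold := 2.0
	fmt.Println(pinnedMinReplicas(1, 3, 5.0, &threshold)) // 3: scale down blocked
	fmt.Println(pinnedMinReplicas(1, 3, 1.0, &threshold)) // 1: scale down allowed
	fmt.Println(pinnedMinReplicas(1, 3, 1.0, nil))        // 1: default hpa behavior
}
```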

- script: make undeploy-prometheus || true
- script: if [ "$NEED_PROMETHEUS" = "true" ]; then make deploy-prometheus; fi
- script: make undeploy-prometheus-adapter || true
- script: if [ "$NEED_PROMETHEUS" = "true" ]; then make deploy-prometheus-adapter; fi
Collaborator:
Do we run "make undeploy-prometheus" and "make undeploy-prometheus-adapter" at the end of leg-12? If not, are all the generated prometheus objects in one namespace, and how can we make sure the test objects are cleaned up?

Collaborator Author:
prometheus and prometheus-adapter are at the same level as the operator, i.e., a new instance is created for each (leg-12) run and shared for the entire run (only one prometheus/adapter is used by all the test cases that need it). We are not explicitly cleaning them up because they will be removed with the kind cluster, just like the operator.

kind: TestStep
commands:
  - script: |
      echo "sleep 120 seconds before starting to verify metrics"
Collaborator:
Why do we need to sleep 120 seconds before getting the metrics?

Collaborator Author:
Because it may take some time for the metrics to show up in the adapter. I will make it 60s and see.

set -o errexit
set -o xtrace

kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
Collaborator:
Using a for-loop would make it easier to see how many connections we create.

kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
sleep 60
Collaborator:
What is this sleep for?

Collaborator Author:
Without this, the pod will immediately complete and those sessions won't stay active. I don't really.

name: vertica_sessions_running_total
target:
  type: AverageValue
  averageValue: 5
Collaborator:
If the user doesn't specify "scaleDownThreshold", will we not scale down any more? If so, we probably need to make "scaleDownThreshold" required or use a default value for it.

Collaborator Author:
We will scale down based on the threshold that is also used for scale up.

targetSize: 3
status:
  scalingCount: 1
  currentSize: 3
Collaborator:
Can we verify minReplicas and maxReplicas here as well?

@@ -173,13 +179,22 @@ type VerticaAutoscalerConditionType string
const (
	// TargetSizeInitialized indicates whether the operator has initialized targetSize in the spec
	TargetSizeInitialized VerticaAutoscalerConditionType = "TargetSizeInitialized"
	// ScalingActive indicates that the horizontal pod autoscaler can fetch the metric
	// and is ready for whenever scaling is needed.
	ScalingActive VerticaAutoscalerConditionType = "ScalingActive"
Collaborator:
How would this condition work? I cannot find where we assign the value to this condition.

Collaborator Author:
In verifyhpa_reconciler.go.

Collaborator Author:
It is set to true when the hpa is ready. The operator checks the hpa status to know that.
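For illustration, a rough sketch of that check using the standard autoscaling/v2 API. This is not copied from verifyhpa_reconciler.go; the function and package names are made up:

```go
package controllers

import (
	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
)

// isHpaScalingActive reports whether the hpa's own ScalingActive condition is
// true, i.e. the hpa can fetch the metric and is ready to scale when needed.
// The operator can then mirror this into the VerticaAutoscaler's
// ScalingActive condition. Sketch only, not the actual implementation.
func isHpaScalingActive(hpa *autoscalingv2.HorizontalPodAutoscaler) bool {
	for _, cond := range hpa.Status.Conditions {
		if cond.Type == autoscalingv2.ScalingActive {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```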

@cchen-vertica (Collaborator):

> This adds a new field scaleDownThreshold to the metric's definition. […]

Users will usually look at the hpa to see the current metrics, right? Then if they see that minReplicas has changed, won't that look strange? Could we add a new field in the hpa to indicate the original minReplicas?

@roypaulin (Collaborator Author):

@LiboYu2, @qindotguan, any comments or thoughts?

@roypaulin (Collaborator Author):

> Users will usually look at the hpa to see the current metrics, right? Then if they see that minReplicas has changed, won't that look strange? Could we add a new field in the hpa to indicate the original minReplicas?

We cannot add a new field to the hpa as it is a 3rd-party struct.
In general, users are going to refer to the CR; the underlying hpa is just a dependent object. Maybe I can add a field in the CR that shows both the current minReplicas and the user-specified minReplicas. I will see if it makes sense.
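For illustration, a purely hypothetical sketch of what such a CR status addition could look like. It is not part of this PR and all names are made up:

```go
package v1beta1

// ScaleDownStatus is a hypothetical status addition, sketched only to show
// the idea from the comment above; it is not part of this PR.
type ScaleDownStatus struct {
	// minReplicas as specified by the user in the VerticaAutoscaler spec.
	SpecifiedMinReplicas int32 `json:"specifiedMinReplicas,omitempty"`
	// minReplicas currently written to the owned hpa; equals targetSize
	// while scale down is being held back by scaleDownThreshold.
	CurrentMinReplicas int32 `json:"currentMinReplicas,omitempty"`
}
```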
