Add threshold for scale down #1038
base: main
Conversation
- script: make undeploy-prometheus || true
- script: if [ "$NEED_PROMETHEUS" = "true" ]; then make deploy-prometheus; fi
- script: make undeploy-prometheus-adapter || true
- script: if [ "$NEED_PROMETHEUS" = "true" ]; then make deploy-prometheus-adapter; fi
Do we run "make undeploy-prometheus" and "make undeploy-prometheus-adapter" at the end of leg-12? If not, are all the generated prometheus objects in one namespace, and how can we make sure the test objects are cleaned up?
prometheus and prometheus-adapter live at the same level as the operator, i.e., a new instance is created for each (leg-12) run and only one prometheus/adapter is shared by all the test cases that need it. We are not explicitly cleaning them up because they will be removed with the kind cluster, just like the operator.
kind: TestStep
commands:
  - script: |
      echo "sleep 120 seconds before starting to verify metrics"
Why do we need to sleep 120 seconds before getting the metrics?
Because it may take some time for the metrics to show up in the adapter. I will make it 60s and see.
set -o errexit
set -o xtrace

kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
Using a for-loop would make it easier to see how many connections we created.
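A minimal sketch of what the reviewer is suggesting: the repeated `kubectl exec` lines replaced by a loop with an explicit session count (the `NUM_SESSIONS` variable name and the `echo` placeholder are illustrative; the real step would run the commented-out `kubectl exec` line instead).

```shell
# Launch N concurrent long-running sessions; the count is explicit in one place.
NUM_SESSIONS=3
for i in $(seq 1 "$NUM_SESSIONS"); do
  # In the real test step this line would be:
  # kubectl exec svc/v-scale-down-threshold-pri1 -c server -- \
  #   vsql -U dbadmin -w topsecret -c "select sleep('240');" &
  echo "starting session $i"
done
```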
kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
sleep 60
What is this sleep for?
Without this, the test step will complete immediately and those sessions won't stay active.
name: vertica_sessions_running_total
target:
  type: AverageValue
  averageValue: 5
If the user doesn't specify "scaleDownThreshold", will we not scale down any more? If so, we probably need to make "scaleDownThreshold" required or use a default value for it.
We will scale down based on the threshold that is also used for scale up.
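A hypothetical manifest fragment showing how `scaleDownThreshold` could sit next to the existing target; the exact field placement and the value `2` are assumptions for illustration, based only on this PR's description:

```yaml
metrics:
  - name: vertica_sessions_running_total
    target:
      type: AverageValue
      averageValue: 5      # existing threshold, drives scale up
    scaleDownThreshold: 2  # hypothetical: scale down only when the metric drops below this
```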
targetSize: 3
status:
  scalingCount: 1
  currentSize: 3
Can we verify minReplicas and maxReplicas here as well?
@@ -173,13 +179,22 @@ type VerticaAutoscalerConditionType string
const (
	// TargetSizeInitialized indicates whether the operator has initialized targetSize in the spec
	TargetSizeInitialized VerticaAutoscalerConditionType = "TargetSizeInitialized"
	// ScalingActive indicates that the horizontal pod autoscaler can fetch the metric
	// and is ready for whenever scaling is needed.
	ScalingActive VerticaAutoscalerConditionType = "ScalingActive"
How would this condition work? I cannot find where we assign the value to this condition.
In verifyhpa_reconciler.go.
It is set to true when the hpa is ready. The operator checks the hpa status to know that.
Users will usually view the hpa to see current metrics, right? Then if they see that minReplicas has changed, won't that be confusing? Could we add a new field in the hpa to indicate the original minReplicas?
@LiboYu2, @qindotguan comments, thoughts?
We cannot add a new field to the hpa as it is a 3rd-party struct.
This adds a new field, `scaleDownThreshold`, to the metric's definition. It will be used to control scale down, i.e., scale down will happen only when the metric's value is lower than this new field's value. As the hpa does not natively support two thresholds, this works through a hack: we know that scaling cannot happen when currentReplicas is equal to minReplicas, so the operator sets the hpa minReplicas to the VerticaAutoscaler's targetSize as long as the metric's value is greater than scaleDownThreshold, thus preventing scale down. The moment the value drops below scaleDownThreshold, the operator updates the hpa minReplicas to its correct value (from the vas), which triggers scale down.
This will not work as expected together with the scale-down stabilization window, as the new threshold is not part of the hpa. So it is recommended to set either a stabilization window or scaleDownThreshold, but not both, for a stable scale-down experience.
NB: Using the scale-down stabilization window is also a very good way to get a stable scale-down process. A long enough stabilization window ensures the metric has been low for long enough to safely scale down.
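The minReplicas trick described above can be sketched as a pure decision function. This is not the actual operator code; the function name, parameters, and the hand-picked sample values are illustrative assumptions based on this PR's description.

```go
package main

import "fmt"

// desiredMinReplicas sketches the hack from the PR description: while the
// metric value is above scaleDownThreshold, pin the HPA's minReplicas to the
// current targetSize so the HPA cannot scale below it; once the metric drops
// below the threshold, restore the real minReplicas from the VerticaAutoscaler
// spec so the HPA is free to scale down.
func desiredMinReplicas(metricValue, scaleDownThreshold float64, targetSize, vasMinReplicas int32) int32 {
	if metricValue > scaleDownThreshold {
		// Scaling cannot happen when currentReplicas == minReplicas,
		// so pinning minReplicas to targetSize blocks scale down.
		return targetSize
	}
	// Metric is low enough: restore the spec value and allow scale down.
	return vasMinReplicas
}

func main() {
	fmt.Println(desiredMinReplicas(8, 5, 6, 3)) // metric above threshold: pinned to targetSize (6)
	fmt.Println(desiredMinReplicas(2, 5, 6, 3)) // metric below threshold: restored to vas value (3)
}
```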