Fix scale back to 0 with 300 requests scenario #76

Merged: 28 commits from test-alpe-fixes into main on Feb 9, 2024

Conversation

@samos123 (Contributor) commented on Feb 7, 2024

Fixes #73 and includes PR #70 and other changes from Alex.

  • Remove compareScales function
  • Refactor SetDesiredScale so it is simpler to understand, by storing the lastSuccessfulScale time
  • Improve test coverage for scaler.go

Co-author: @alpe

@samos123 samos123 changed the title Do not merge: Test alpe fixes on top of my PR Do not merge: Test alpe fixes #75 on top of #72 Feb 7, 2024
@samos123 samos123 changed the title Do not merge: Test alpe fixes #75 on top of #72 Fix scale back from 0 when there are more requests Feb 7, 2024
@samos123 samos123 changed the title Fix scale back from 0 when there are more requests Fix scale back from 0 with 300 requests scenario Feb 7, 2024
@samos123 samos123 changed the title Fix scale back from 0 with 300 requests scenario Fix scale back to 0 with 300 requests scenario Feb 7, 2024
@samos123 samos123 requested a review from nstogner February 7, 2024 17:28
pkg/deployments/scaler.go (outdated review thread, resolved)
The recursive compareScales call seems to only set the current replicas, so we can simply do that directly instead.
@samos123 samos123 requested a review from alpe February 8, 2024 06:33

@alpe (Contributor) left a comment:

Good refactoring, and simpler to read. I added some comments because I think the logic has changed for scale down. That is not necessarily bad, but it should be revisited from a product view.

@@ -93,6 +93,11 @@ func (a *Autoscaler) Start() {
	}

	for deploymentName, waitCount := range stats.ActiveRequests {
		// TODO remove this check and ensure only stats for deployments with models are returned
		if !a.Deployments.HasModel(deploymentName) {
			log.Printf("Deployment: %v has no model annotations, skipping", deploymentName)

Contributor:

Good to exclude them; I was already wondering about this log.

Personal preference: do we need this log output? It can create a lot of noise.

Nit (here and elsewhere): prefer %q or %s in format strings.

Contributor (author):

I think we can remove this log once we remove this if statement. Without a line saying the deployment is ignored, the logs are confusing: earlier in the log we show stats being aggregated for e.g. the kubernetes service, which might make people think Lingo is active on that service.


	if s.scaleDownTimer == nil {
		s.scaleDownTimer = time.AfterFunc(s.scaleDownDelay, func() {
			if time.Since(s.lastScaleDown) >= s.scaleDownDelay {

Contributor:

🤔 The lastScaleDown value is set in the constructor. This ensures the scale-down delay is taken into account on node startup and right after the last successful scale down.
There is no delay if a running node becomes leader or if the last scale down was some time ago: a scale-down call would execute immediately because the condition is always true.
With the old logic, the first call to scale down triggered a timer. In order to achieve the same, we can leave lastScaleDown unset and only set it on the first call to scale down (a sketch of this approach follows the list below). You can check for "unset" with s.lastScaleDown.IsZero().
It should be cleared (unset) on

  • desiredScale >= currentScale
  • successful scale down
  • losing the leader role
  • probably with AtLeastOne()
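
For illustration, a minimal self-contained sketch of the approach described above; the type, field names (lastScaleDown, scaleDownDelay, scaleFunc), and the maybeScaleDown helper are placeholders for this comment, not the PR's actual code:

package main

import (
	"fmt"
	"time"
)

// Minimal stand-in for the scaler; names here are illustrative only.
type scaler struct {
	currentScale   int32
	desiredScale   int32
	scaleDownDelay time.Duration
	lastScaleDown  time.Time // zero value means "no scale-down pending"
	scaleFunc      func(n int32, atLeastOne bool) error
}

// maybeScaleDown applies the delay described above: the first scale-down
// request only starts the window, the scale happens once the window has
// elapsed, and the marker is cleared whenever desired >= current.
func (s *scaler) maybeScaleDown() {
	if s.desiredScale >= s.currentScale {
		s.lastScaleDown = time.Time{}
		return
	}
	if s.lastScaleDown.IsZero() {
		s.lastScaleDown = time.Now()
		return
	}
	if time.Since(s.lastScaleDown) >= s.scaleDownDelay {
		if err := s.scaleFunc(s.desiredScale, false); err != nil {
			fmt.Printf("scale down failed: %v\n", err)
			return
		}
		s.lastScaleDown = time.Time{}
	}
}

func main() {
	s := &scaler{
		currentScale:   3,
		scaleDownDelay: time.Second,
		scaleFunc: func(n int32, _ bool) error {
			fmt.Println("scaling to", n)
			return nil
		},
	}
	s.maybeScaleDown()                  // starts the delay window
	time.Sleep(1100 * time.Millisecond) // wait out scaleDownDelay
	s.maybeScaleDown()                  // window elapsed: scales to 0
}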

Contributor:

Here is my code I was working on in parallel:

	if s.currentScale < s.desiredScale {
		// Scale up immediately.
		go s.scaleFunc(s.desiredScale, false)
		s.desiredScaleDownStart = time.Time{}
	} else if s.currentScale == s.desiredScale {
		s.desiredScaleDownStart = time.Time{}
	} else {
		if s.desiredScaleDownStart.IsZero() {
			s.desiredScaleDownStart = time.Now()
		} else if time.Since(s.desiredScaleDownStart) >= s.scaleDownDelay {
			go s.scaleFunc(s.desiredScale, false)
		}
	}

Contributor (author):

Thanks for taking the time to do a deeper analysis of how it differs. I agree that setting it to the zero value is cleaner, and I have fixed it.

@nstogner regarding your implementation: it's important that we always check and log the error from s.scaleFunc, because otherwise it will be hard to debug a scaling issue. Was there any reason we weren't originally doing error checks, and any reason why your implementation does not? Just making sure I'm not doing something useless with the error checking.
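
For context, a minimal sketch of the kind of error check described here, assuming (as the comment implies) that the scale function returns an error; the function name and signature are placeholders, not the PR's actual code:

package main

import (
	"errors"
	"log"
	"time"
)

// scaleAsync runs scaleFunc in a goroutine and logs any error instead of
// discarding it, so a failed scale operation is visible in the logs.
func scaleAsync(scaleFunc func(replicas int32, atLeastOne bool) error, replicas int32) {
	go func() {
		if err := scaleFunc(replicas, false); err != nil {
			log.Printf("scaling to %d replicas failed: %v", replicas, err)
		}
	}()
}

func main() {
	scaleAsync(func(int32, bool) error { return errors.New("boom") }, 0)
	time.Sleep(100 * time.Millisecond) // give the goroutine time to log
}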

pkg/deployments/manager.go (review thread resolved)
@@ -93,6 +93,11 @@ func (a *Autoscaler) Start() {
	}

	for deploymentName, waitCount := range stats.ActiveRequests {
		// TODO remove this check and ensure only stats for deployments with models are returned
		if !a.Deployments.HasModel(deploymentName) {

Contributor:

What scenario leads Lingo to fall into this state?

Contributor (author):

Lingo right now has the autoscaler active on all deployments, because ActiveRequests returns the active requests for all deployments. This if statement can be removed once we fix #59, use Pod endpoints, and only add endpoints of Deployments that have a lingo model annotation.

Contributor (author):

You can verify this by running Lingo and checking the logs. Note that I added a TODO item saying this check should eventually be removed, but I don't want to do that as part of this PR.

Contributor (author):

To answer your question, this happens in all scenarios.

pkg/deployments/manager.go (outdated review thread, resolved)
pkg/deployments/scaler.go (outdated review thread, resolved)
pkg/deployments/scaler.go (review thread resolved)

func (s *scaler) compareScales(current, desired int32) {
	log.Printf("SetDesiredScale(%v), current: %v, min: %v, max: %v", n, s.currentScale, s.minScale, s.maxScale)
	nMinMax := s.applyMinMax(n)

Contributor:

No need for the local var; it's clearer to do s.desiredScale = s.applyMinMax(n).

Contributor (author):

But s.applyMinMax calls Lock(), which is why I did this. Is the local var not needed even though applyMinMax also calls Lock()?

Contributor (author):

I'm pretty sure this would result in a deadlock unless I use the local var. See the code example below, which simulates what would happen if I called s.desiredScale = s.applyMinMax(n) while already holding the Lock():


package main

import (
	"fmt"
	"sync"
)

func main() {
	mtx := sync.Mutex{}
	mtx.Lock()
	fmt.Println("Locked")
	mtx.Lock() // deadlocks: sync.Mutex is not reentrant
	fmt.Println("Double locked")
}
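
When run, this prints "Locked" and then blocks forever on the second mtx.Lock(), since sync.Mutex is not reentrant; the Go runtime then reports a fatal "all goroutines are asleep - deadlock!" error because no goroutine can make progress.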

Contributor:

You can call it before the Lock()

Contributor (author):

But I shouldn't modify s.desiredScale before calling Lock(), should I?

Contributor (author):

After a quick chat, I fixed this by removing the lock from applyMinMax and moving the applyMinMax call inside the lock in SetDesiredScale.
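
A minimal sketch of the shape of that fix, with placeholder type and field names rather than the PR's actual code: applyMinMax no longer locks, so SetDesiredScale can call it while already holding the mutex.

package main

import (
	"fmt"
	"sync"
)

type scaler struct {
	mtx          sync.Mutex
	minScale     int32
	maxScale     int32
	desiredScale int32
}

// applyMinMax clamps n to [minScale, maxScale]. It takes no lock itself;
// callers are expected to already hold s.mtx.
func (s *scaler) applyMinMax(n int32) int32 {
	if n < s.minScale {
		return s.minScale
	}
	if n > s.maxScale {
		return s.maxScale
	}
	return n
}

func (s *scaler) SetDesiredScale(n int32) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	// Safe: applyMinMax does not try to re-acquire the (non-reentrant) mutex.
	s.desiredScale = s.applyMinMax(n)
}

func main() {
	s := &scaler{minScale: 1, maxScale: 3}
	s.SetDesiredScale(5)
	fmt.Println(s.desiredScale) // prints 3
}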



pkg/deployments/scaler_test.go (outdated review thread, resolved)
pkg/deployments/scaler.go (outdated review thread, resolved)
@samos123 samos123 requested review from nstogner and alpe February 8, 2024 16:59

@nstogner (Contributor) left a comment:

LGTM - added one small suggestion in tests

s.SetDesiredScale(3)
time.Sleep(1 * time.Second)
mockScaleMtx.Lock() // Ensure consistency of the checked state
if scaleFuncCalled != false {

Contributor:

You could also use require.True(t, scaleFuncCalled) - https://pkg.go.dev/github.com/stretchr/testify/require#True
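
For illustration, a hedged sketch of such a testify assertion in a self-contained test; since the quoted check fails when scaleFuncCalled is true, require.False matches that particular spot, with require.True as the counterpart once a call is expected. The test name, package, and variables below are placeholders mirroring the quoted snippet, not the real test file:

package scaler_test

import (
	"sync"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestScaleFuncNotCalledYet(t *testing.T) {
	var mockScaleMtx sync.Mutex
	scaleFuncCalled := false

	mockScaleMtx.Lock() // ensure consistency of the checked state
	require.False(t, scaleFuncCalled, "scaleFunc should not have been called yet")
	mockScaleMtx.Unlock()
}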

@samos123 samos123 merged commit 5227052 into main Feb 9, 2024
6 checks passed
@samos123 samos123 deleted the test-alpe-fixes branch February 9, 2024 23:12

Successfully merging this pull request may close these issues: Scale to 0 not working with replicas 3