Fix scale back to 0 with 300 requests scenario #76

Merged: 28 commits from test-alpe-fixes into main on Feb 9, 2024

Conversation

@samos123 (Contributor) commented on Feb 7, 2024

Fixes #73 and includes PR #70 and other changes from Alex.

  • Remove compareScales function
  • Refactor SetDesiredScale so it is simpler to understand, by storing the lastSuccessfulScale time
  • Improve test coverage for scaler.go

Co-author: @alpe

@samos123 samos123 changed the title Do not merge: Test alpe fixes on top of my PR Do not merge: Test alpe fixes #75 on top of #72 Feb 7, 2024
@samos123 samos123 changed the title Do not merge: Test alpe fixes #75 on top of #72 Fix scale back from 0 when there are more requests Feb 7, 2024
@samos123 samos123 changed the title Fix scale back from 0 when there are more requests Fix scale back from 0 with 300 requests scenario Feb 7, 2024
@samos123 samos123 changed the title Fix scale back from 0 with 300 requests scenario Fix scale back to 0 with 300 requests scenario Feb 7, 2024
@samos123 samos123 requested a review from nstogner February 7, 2024 17:28
pkg/deployments/scaler.go (outdated review thread, resolved)
The recursive compareScales call seems to only set the current replicas, so we can simply do that directly instead.
@samos123 samos123 requested a review from alpe February 8, 2024 06:33

@alpe (Contributor) left a comment:

Good refactoring, and simpler to read. I added some comments because I think the logic has changed for scale down. That is not necessarily bad, but it should be revisited from a product view.

@@ -93,6 +93,11 @@ func (a *Autoscaler) Start() {
	}

	for deploymentName, waitCount := range stats.ActiveRequests {
		// TODO remove this check and ensure only stats for deployments with models are returned
		if !a.Deployments.HasModel(deploymentName) {
			log.Printf("Deployment: %v has no model annotations, skipping", deploymentName)

Contributor:

Good to exclude them; I was already wondering about this log.

Personal preference: do we need this log output? It can create a lot of noise.

Nit (here and elsewhere): prefer %q or %s in format strings.

Contributor (author):

I think we can remove this log once we remove this if statement. Without a line saying the deployment is ignored, the logs are confusing: earlier in the log we show stats being aggregated for e.g. the kubernetes service, which might make people think Lingo is active on that service.


	if s.scaleDownTimer == nil {
		s.scaleDownTimer = time.AfterFunc(s.scaleDownDelay, func() {
			if time.Since(s.lastScaleDown) >= s.scaleDownDelay {

Contributor:

🤔 The lastScaleDown value is set in the constructor. This ensures the scale-down delay is taken into account on node startup and right after the last successful scale down.
There is no delay if a running node becomes leader or if the last scale down was some time ago: a scale-down call would execute immediately because the condition is always true.
With the old logic, the first call to scale down triggered a timer. In order to achieve the same, we can leave lastScaleDown unset and only set it on the first call to scale down (a sketch of this approach follows the list below). You can check for "unset" with s.lastScaleDown.IsZero().
It should be cleared (unset) on

  • desiredScale >= currentScale
  • successful scale down
  • losing the leader role
  • probably with AtLeastOne()
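
For illustration, a minimal self-contained sketch of the approach described above; the type, field names (lastScaleDown, scaleDownDelay, scaleFunc), and the maybeScaleDown helper are placeholders for this comment, not the PR's actual code:

package main

import (
	"fmt"
	"time"
)

// Minimal stand-in for the scaler; names here are illustrative only.
type scaler struct {
	currentScale   int32
	desiredScale   int32
	scaleDownDelay time.Duration
	lastScaleDown  time.Time // zero value means "no scale-down pending"
	scaleFunc      func(n int32, atLeastOne bool) error
}

// maybeScaleDown applies the delay described above: the first scale-down
// request only starts the window, the scale happens once the window has
// elapsed, and the marker is cleared whenever desired >= current.
func (s *scaler) maybeScaleDown() {
	if s.desiredScale >= s.currentScale {
		s.lastScaleDown = time.Time{}
		return
	}
	if s.lastScaleDown.IsZero() {
		s.lastScaleDown = time.Now()
		return
	}
	if time.Since(s.lastScaleDown) >= s.scaleDownDelay {
		if err := s.scaleFunc(s.desiredScale, false); err != nil {
			fmt.Printf("scale down failed: %v\n", err)
			return
		}
		s.lastScaleDown = time.Time{}
	}
}

func main() {
	s := &scaler{
		currentScale:   3,
		scaleDownDelay: time.Second,
		scaleFunc: func(n int32, _ bool) error {
			fmt.Println("scaling to", n)
			return nil
		},
	}
	s.maybeScaleDown()                  // starts the delay window
	time.Sleep(1100 * time.Millisecond) // wait out scaleDownDelay
	s.maybeScaleDown()                  // window elapsed: scales to 0
}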

Contributor:

Here is my code I was working on in parallel:

	if s.currentScale < s.desiredScale {
		// Scale up immediately.
		go s.scaleFunc(s.desiredScale, false)
		s.desiredScaleDownStart = time.Time{}
	} else if s.currentScale == s.desiredScale {
		s.desiredScaleDownStart = time.Time{}
	} else {
		if s.desiredScaleDownStart.IsZero() {
			s.desiredScaleDownStart = time.Now()
		} else if time.Since(s.desiredScaleDownStart) >= s.scaleDownDelay {
			go s.scaleFunc(s.desiredScale, false)
		}
	}

Contributor (author):

Thanks for taking the time to do a deeper analysis of how it differs. I agree that setting it to the zero value is cleaner, and I have fixed it.

@nstogner regarding your implementation: it's important that we always check and log the error from s.scaleFunc, because otherwise it will be hard to debug a scaling issue. Was there any reason we weren't originally doing error checks, and any reason why your implementation does not? Just making sure I'm not doing something useless with the error checking.
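
For context, a minimal sketch of the kind of error check described here, assuming (as the comment implies) that the scale function returns an error; the function name and signature are placeholders, not the PR's actual code:

package main

import (
	"errors"
	"log"
	"time"
)

// scaleAsync runs scaleFunc in a goroutine and logs any error instead of
// discarding it, so a failed scale operation is visible in the logs.
func scaleAsync(scaleFunc func(replicas int32, atLeastOne bool) error, replicas int32) {
	go func() {
		if err := scaleFunc(replicas, false); err != nil {
			log.Printf("scaling to %d replicas failed: %v", replicas, err)
		}
	}()
}

func main() {
	scaleAsync(func(int32, bool) error { return errors.New("boom") }, 0)
	time.Sleep(100 * time.Millisecond) // give the goroutine time to log
}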

pkg/deployments/manager.go (review thread resolved)
@@ -93,6 +93,11 @@ func (a *Autoscaler) Start() {
	}

	for deploymentName, waitCount := range stats.ActiveRequests {
		// TODO remove this check and ensure only stats for deployments with models are returned
		if !a.Deployments.HasModel(deploymentName) {

Contributor:

What scenario leads Lingo to fall into this state?

Contributor (author):

Lingo right now has the autoscaler active on all deployments, because ActiveRequests returns the active requests for all deployments. This if statement can be removed once we fix #59, use Pod endpoints, and only add endpoints of Deployments that have a lingo model annotation.

Contributor (author):

You can verify this by running Lingo and checking the logs. Note that I added a TODO item saying this check should eventually be removed, but I don't want to do that as part of this PR.

Contributor (author):

To answer your question, this happens in all scenarios.

pkg/deployments/manager.go (outdated review thread, resolved)
pkg/deployments/scaler.go (outdated review thread, resolved)
pkg/deployments/scaler.go (review thread resolved)

func (s *scaler) compareScales(current, desired int32) {
	log.Printf("SetDesiredScale(%v), current: %v, min: %v, max: %v", n, s.currentScale, s.minScale, s.maxScale)
	nMinMax := s.applyMinMax(n)

Contributor:

No need for the local var; it's clearer to do s.desiredScale = s.applyMinMax(n).

Contributor (author):

But s.applyMinMax calls Lock(), which is why I did this. Is the local var not needed even though applyMinMax also calls Lock()?

Contributor (author):

I'm pretty sure this would result in a deadlock unless I use the local var. See the code example below, which simulates what would happen if I called s.desiredScale = s.applyMinMax(n) while already holding the Lock():


package main

import (
	"fmt"
	"sync"
)

func main() {
	mtx := sync.Mutex{}
	mtx.Lock()
	fmt.Println("Locked")
	mtx.Lock() // deadlocks: sync.Mutex is not reentrant
	fmt.Println("Double locked")
}
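
When run, this prints "Locked" and then blocks forever on the second mtx.Lock(), since sync.Mutex is not reentrant; the Go runtime then reports a fatal "all goroutines are asleep - deadlock!" error because no goroutine can make progress.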

Contributor:

You can call it before the Lock()

Contributor (author):

But I shouldn't modify s.desiredScale before calling Lock(), should I?

Contributor (author):

After a quick chat, I fixed this by removing the lock from applyMinMax and moving the applyMinMax call inside the lock in SetDesiredScale.
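
A minimal sketch of the shape of that fix, with placeholder type and field names rather than the PR's actual code: applyMinMax no longer locks, so SetDesiredScale can call it while already holding the mutex.

package main

import (
	"fmt"
	"sync"
)

type scaler struct {
	mtx          sync.Mutex
	minScale     int32
	maxScale     int32
	desiredScale int32
}

// applyMinMax clamps n to [minScale, maxScale]. It takes no lock itself;
// callers are expected to already hold s.mtx.
func (s *scaler) applyMinMax(n int32) int32 {
	if n < s.minScale {
		return s.minScale
	}
	if n > s.maxScale {
		return s.maxScale
	}
	return n
}

func (s *scaler) SetDesiredScale(n int32) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	// Safe: applyMinMax does not try to re-acquire the (non-reentrant) mutex.
	s.desiredScale = s.applyMinMax(n)
}

func main() {
	s := &scaler{minScale: 1, maxScale: 3}
	s.SetDesiredScale(5)
	fmt.Println(s.desiredScale) // prints 3
}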



pkg/deployments/scaler_test.go (outdated review thread, resolved)
pkg/deployments/scaler.go (outdated review thread, resolved)
@samos123 samos123 requested review from nstogner and alpe February 8, 2024 16:59

@nstogner (Contributor) left a comment:

LGTM - added one small suggestion in tests

s.SetDesiredScale(3)
time.Sleep(1 * time.Second)
mockScaleMtx.Lock() // Ensure consistency of the checked state
if scaleFuncCalled != false {

Contributor:

You could also use require.True(t, scaleFuncCalled) - https://pkg.go.dev/github.com/stretchr/testify/require#True
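
For illustration, a hedged sketch of such a testify assertion in a self-contained test; since the quoted check fails when scaleFuncCalled is true, require.False matches that particular spot, with require.True as the counterpart once a call is expected. The test name, package, and variables below are placeholders mirroring the quoted snippet, not the real test file:

package scaler_test

import (
	"sync"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestScaleFuncNotCalledYet(t *testing.T) {
	var mockScaleMtx sync.Mutex
	scaleFuncCalled := false

	mockScaleMtx.Lock() // ensure consistency of the checked state
	require.False(t, scaleFuncCalled, "scaleFunc should not have been called yet")
	mockScaleMtx.Unlock()
}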

@samos123 samos123 merged commit 5227052 into main Feb 9, 2024
6 checks passed
@samos123 samos123 deleted the test-alpe-fixes branch February 9, 2024 23:12

Successfully merging this pull request may close these issues: Scale to 0 not working with replicas 3