Feature request: custom autoscaling parameters #227

Open
Varun2101 opened this issue Jan 9, 2024 · 2 comments

Comments

@Varun2101

Hi, I'm working on deploying a private language model to production through Replicate. Requests come in sporadically, so provisioning always-on servers is not feasible for me, but I would like requests to be handled at my max concurrency for increased speed. Currently I face a cold start of 2-2.5 minutes for each instance, and each instance terminates after 1 minute of idling, which can lead to frustrating delays that are longer than necessary.
Would it be possible to add either of these functionalities?

  1. API to force boot n instances together: reduces the spread of boot time, more control to start the boot process early before requests actually need to be processed
  2. Custom idle time limits: this needs to be at least as long as the boot time. I wouldn't mind having to pay for some extra uptime if it meant I don't have stop-start behaviour in the middle of a chunk being processed.

Currently I'm attempting a workaround for no.1 by burst-pinging the model early with the default input n times, but the short idle time means that there's still a good chance that the instances get terminated before I send any actual requests. Let me know if you have a better solution.
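For reference, the burst-ping workaround described above could be sketched roughly as follows. This is a minimal sketch, not an official Replicate feature: `burst_warm` and the `ping` callable are hypothetical names, and `ping` is assumed to wrap whatever call sends the default input to the model (for example a thin wrapper around the Python client's `replicate.run`). Whether n concurrent requests actually cause n instances to boot is up to the platform's autoscaler.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def burst_warm(ping: Callable[[], object], n: int) -> List[object]:
    """Send n pings concurrently and return their results in submission order.

    `ping` is a hypothetical zero-argument callable that sends one cheap
    request to the model; it is an assumption, not part of any Replicate API.
    """
    if n <= 0:
        return []
    # One worker per ping so that all requests are in flight at the same time.
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(ping) for _ in range(n)]
        # result() blocks until each ping finishes and re-raises its errors.
        return [f.result() for f in futures]
```

Calling `burst_warm(ping, n)` shortly before the real batch is expected keeps the boot times overlapping rather than staggered, but it does not stop instances from idling out if the real requests arrive more than a minute later.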
Thanks!

@mattt
Contributor

mattt commented Jan 30, 2024

Hi, @Varun2101. Thanks for sharing this feedback.

To your second point, you can get more control over the behavior of a model on Replicate by creating a deployment. I don't believe we provide a way to configure the timing for autoscaling a deployment, but that's something we've discussed.

@nathan-eagle

I've built a workaround that simply pings the model (to minimize tokens, I ask it to respond with a single character). Each ping takes 0.1 seconds of runtime, and I set the ping frequency to once per minute. After 15 minutes of user inactivity, the pings stop. While it doesn't solve Replicate's initial cold boot issue, this keeps the model warm while a user is active at negligible cost - far cheaper than creating a dedicated deployment.
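A minimal sketch of that keep-warm logic, assuming a 60-second ping interval and a 15-minute inactivity cutoff as described. The `KeepWarm` class, its method names, and the `ping` callable are hypothetical and only illustrate the idea; `ping` is assumed to send the single-character prompt to the model.

```python
import time
from typing import Callable, Optional


class KeepWarm:
    """Ping a model at a fixed interval while a user has been active recently."""

    def __init__(
        self,
        ping: Callable[[], object],        # hypothetical: sends one tiny request
        interval_s: float = 60.0,          # ping frequency (1 minute)
        idle_limit_s: float = 900.0,       # stop after 15 minutes of inactivity
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ping = ping
        self.interval_s = interval_s
        self.idle_limit_s = idle_limit_s
        self.clock = clock
        self.last_activity = clock()
        self.last_ping: Optional[float] = None

    def touch(self) -> None:
        """Record user activity; call this whenever a real request is made."""
        self.last_activity = self.clock()

    def tick(self) -> bool:
        """Send a ping if one is due. Returns True if a ping was sent."""
        now = self.clock()
        if now - self.last_activity >= self.idle_limit_s:
            return False  # user is inactive: let the instance scale down
        if self.last_ping is not None and now - self.last_ping < self.interval_s:
            return False  # the previous ping is still recent enough
        self.ping()
        self.last_ping = now
        return True
```

In practice `tick()` would be called from a background loop or scheduler every few seconds, with `touch()` called on each real user request. The clock is injectable only so the timing logic can be checked without waiting.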


3 participants