Hi, I'm working on deploying a private language model to production through Replicate. Requests come in sporadically, so provisioning always-on servers isn't feasible for me, but I'd still like requests to be handled at my maximum concurrency for speed. Currently I face 2-2.5 minutes of cold start for each instance, and instances terminate after just 1 minute of idle time, which leads to frustrating delays that are longer than necessary.
Would it be possible to add either of these functionalities?
1. An API to force-boot n instances together: this reduces the spread of boot times and gives more control to start the boot process early, before requests actually need to be processed.
2. Custom idle time limits: the limit needs to be at least as long as the boot time. I wouldn't mind paying for some extra uptime if it meant avoiding stop-start behaviour in the middle of a chunk being processed.
Currently I'm attempting a workaround for no. 1 by burst-pinging the model early with the default input n times (sketch below), but the short idle time means there's still a good chance the instances get terminated before I send any actual requests. Let me know if you have a better solution.
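For concreteness, this is roughly what my burst-ping looks like with the Python client (a sketch; the model identifier and input are placeholders for my private model):

```python
# Sketch of the burst-ping workaround: fire n concurrent "warm-up"
# predictions so Replicate boots n instances ahead of real traffic.
# The model identifier and input below are placeholders.
from concurrent.futures import ThreadPoolExecutor

import replicate

N = 4  # target number of warm instances

def ping(_):
    # A minimal prediction whose only purpose is to trigger a boot.
    return replicate.run(
        "owner/private-model:version-id",  # placeholder identifier
        input={"prompt": "Respond with a single character."},
    )

with ThreadPoolExecutor(max_workers=N) as pool:
    list(pool.map(ping, range(N)))
```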
Thanks!
To your second point, you can get more control over the behavior of a model on Replicate by creating a deployment. I don't believe we provide a way to configure the timing for autoscaling a deployment, but that's something we've discussed.
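For anyone else reading: calling a model through a deployment looks roughly like the sketch below (the deployment name and input are placeholders, and the exact client calls may differ by client version):

```python
# Sketch: calling a model through a Replicate deployment instead of
# the public model endpoint. The deployment name is a placeholder.
import replicate

deployment = replicate.deployments.get("acme/my-llm-deployment")

prediction = deployment.predictions.create(
    input={"prompt": "Hello"},
)
prediction.wait()  # block until the prediction finishes
print(prediction.output)
```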
I've built a workaround that simply pings the model (to minimize tokens, I ask it to respond with a single character). Each ping takes about 0.1 seconds of runtime, and I set the ping frequency to 1 minute. After 15 minutes of inactivity the pings stop. While it doesn't solve Replicate's initial cold-boot issue, this keeps the model warm while a user is active at negligible cost, far cheaper than creating a dedicated deployment.
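The gist, as a sketch (the model identifier is a placeholder, and updating the last-activity timestamp from the real request path is elided):

```python
# Sketch of the keep-warm pinger: after a real user request, ping the
# model once a minute until 15 minutes have passed without activity.
# The model identifier is a placeholder.
import time

import replicate

PING_INTERVAL = 60          # seconds between pings
INACTIVITY_LIMIT = 15 * 60  # stop pinging after 15 idle minutes

last_user_request = time.monotonic()  # updated by the real request path

def keep_warm():
    while time.monotonic() - last_user_request < INACTIVITY_LIMIT:
        # Minimal-token ping: ask for a single character back.
        replicate.run(
            "owner/private-model:version-id",  # placeholder identifier
            input={"prompt": "Respond with a single character."},
        )
        time.sleep(PING_INTERVAL)

keep_warm()
```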