
[question] huggingface-pytorch-inference image - increase Job Queue Size #4478

Open
anshgandhi opened this issue Jan 4, 2025 · 2 comments

@anshgandhi
Hello Team 👋🏼

Concise Description:
Getting a 503 'ServiceUnavailableException' from a SageMaker Realtime Inference endpoint when calling it with multiple concurrent requests. Trying to find a way to increase the Job Queue Size, as suggested in the error message.

DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:2.1.0-transformers4.37.0-gpu-py310-cu118-ubuntu20.04

Current behavior:
SageMaker returns a 503 error when sending multiple requests to the model:
No worker is available to serve request for model: model. Consider increasing job queue size.

Additional context:
Is there a document I could refer to that explains how to increase the Job Queue Size?

@bencrabtree
Thanks for your comment. Here's the documentation that shows how to increase job_queue_size, under the FAQ entry "Q: What are the common tunable environment variables for SageMaker AI containers?": https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-troubleshooting.html.

Multi Model Server (MMS)

job_queue_size: This parameter is useful to tune when the inference request payloads are large; larger payloads increase the heap memory consumption of the JVM in which this queue is maintained. Ideally you want to keep the JVM's heap memory requirements low and let the Python workers allot more memory for actual model serving, since the JVM is only used to receive HTTP requests, queue them, and dispatch them to the Python-based workers for inference. If you increase job_queue_size, you might end up increasing the JVM's heap memory consumption and ultimately taking memory away from the host that could have been used by the Python workers. Therefore, exercise caution when tuning this parameter as well.
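A minimal sketch of one way to pass this variable to the container when deploying with the SageMaker Python SDK. The role ARN, model artifact path, queue size value, and instance type below are placeholders, and it assumes the DLC forwards entries from the env dict to the MMS configuration as the linked doc describes:

```python
from sagemaker.huggingface import HuggingFaceModel

# Placeholder execution role -- substitute your own.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

model = HuggingFaceModel(
    model_data="s3://my-bucket/model.tar.gz",  # assumed artifact location
    role=role,
    image_uri=(
        "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:"
        "2.1.0-transformers4.37.0-gpu-py310-cu118-ubuntu20.04"
    ),
    env={
        # MMS tunable from the linked troubleshooting doc; assumed to be
        # forwarded by the container to the model server configuration.
        "job_queue_size": "500",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",  # placeholder instance type
)
```

Since the environment is fixed at model creation, changing the value on an existing endpoint would mean creating a new model with the updated env and updating the endpoint to a new endpoint config that references it.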

@anshgandhi
Author

Thanks @bencrabtree - I wasn't sure whether these variables would also work for Realtime Inference. Will give them a try :)
