How to access a MIG Device ID programmatically #6
Comments
Hi @HamidShojanazeri, you would most likely need to use NVML (which has Python bindings) to query the state of the device and get the available MIG devices. An example of a golang-based implementation can be found here. Calling in @klueska, who should be able to comment on whether anything else is required beyond extracting the relevant UUIDs.
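A minimal sketch of such a query using the pynvml bindings (`pip install nvidia-ml-py`): the `nvml` parameter defaults to the real module but is injectable purely so the traversal logic can be exercised without a GPU; the function name is our own, not part of the library.

```python
def list_mig_uuids(nvml=None):
    """Enumerate the UUIDs of all MIG devices visible to NVML.

    `nvml` defaults to the pynvml module; it is injectable so the
    traversal logic can be tested on a machine without a GPU.
    """
    if nvml is None:
        import pynvml as nvml  # official NVML Python bindings
    nvml.nvmlInit()
    try:
        uuids = []
        for i in range(nvml.nvmlDeviceGetCount()):
            gpu = nvml.nvmlDeviceGetHandleByIndex(i)
            try:
                current, _pending = nvml.nvmlDeviceGetMigMode(gpu)
            except nvml.NVMLError:
                continue  # GPU does not support MIG at all
            if current != nvml.NVML_DEVICE_MIG_ENABLE:
                continue  # MIG not enabled on this GPU
            for j in range(nvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
                try:
                    mig = nvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, j)
                except nvml.NVMLError:
                    continue  # MIG slot not populated
                uuids.append(nvml.nvmlDeviceGetUUID(mig))
        return uuids
    finally:
        nvml.nvmlShutdown()
```

On real hardware you would simply call `list_mig_uuids()` with no argument.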
Hello @HamidShojanazeri, JFYI, here is the way I used to discover the MIG instances (or the full GPU):
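The original snippet was not preserved here; as an illustrative sketch (not the author's code), one common way to discover MIG instances is to parse `nvidia-smi -L`, which prints one UUID per GPU and per MIG device. The function and regex names below are hypothetical:

```python
import re
import subprocess

# Matches the UUID on "MIG ... Device N: (UUID: MIG-...)" lines. Covers
# both the older MIG-GPU-<gpu-uuid>/<gi>/<ci> form and the newer opaque
# MIG-<uuid> form, depending on the driver version.
_MIG_UUID = re.compile(r"UUID:\s*(MIG-[^)]+)\)")

def discover_mig_uuids(listing=None):
    """Return MIG device UUIDs parsed from `nvidia-smi -L` output.

    Pass a captured `listing` string to parse without running the tool.
    """
    if listing is None:
        listing = subprocess.check_output(["nvidia-smi", "-L"], text=True)
    return _MIG_UUID.findall(listing)
```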
MIG GPUs don't support peer-to-peer communication. The MLPerf/PyTorch code I was using was failing to deploy on multiple MIG instances; I struggled and hacked my way to get the launcher to correctly take the MIG instances into account, only to find out that the code would crash because p2p communication could not be enabled. So double-check that you're not in the same situation.
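That check can be done up front with `torch.cuda.can_device_access_peer`; a sketch under the assumption that a False result should route the launcher away from P2P (the helper name is ours, and `cuda` is injectable only so the logic can run without a GPU):

```python
def p2p_usable(dev_a, dev_b, cuda=None):
    """True only if both CUDA devices are visible and can enable P2P.

    MIG instances never support peer-to-peer, so a multi-device launcher
    should take the False path (e.g. one worker process per MIG slice).
    """
    if cuda is None:
        import torch  # deferred so the helper can be defined without PyTorch
        cuda = torch.cuda
    if not cuda.is_available() or cuda.device_count() <= max(dev_a, dev_b):
        return False
    return cuda.can_device_access_peer(dev_a, dev_b)
```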
Thanks @elezar for the pointer. Yes, I think that should give me access from a Python script. I wonder if there is any documentation or tutorial for the Python bindings of NVML.
Thanks @kpouget for sharing; that would be great for training, I think. In this case, TorchServe is a serving solution with a Python backend, and we need to access the MIG devices from the Python handlers to assign them dynamically to the workers.
@kpouget Thanks for highlighting it. For this use case we will not use p2p communication for running inference, but this may come up in some other workflows for us in the future.
I'm pretty sure all of the Python bindings for NVML are named the same as their C counterparts from this API documentation: https://docs.nvidia.com/deploy/nvml-api/index.html
Actually, this page explains where and how they differ (which is minimal):
Thanks @klueska for the prompt response, I will give it a shot and update you.
Hi @klueska, I am looking into an issue of assigning a GPU that has been partitioned by MIG inside a Python script where I want to run a PyTorch model.
We typically do it this way in TorchServe, and now if an A100 GPU is partitioned into 2 GPUs such as "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b/0/0" and "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b/1/0", what would be a good way to handle it? Is there any tool available that provides this info?
This MIG GPU ID is not available through the CUDA utilities in PyTorch.
I appreciate your thoughts.
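One way this is often handled in a Python serving stack (a sketch, not TorchServe's actual mechanism): discover the MIG UUIDs once, then pin each worker to exactly one slice through `CUDA_VISIBLE_DEVICES`, which accepts MIG identifiers; inside the worker the slice then appears as plain `cuda:0`. Both helper names below are hypothetical:

```python
import os

def assign_slices(worker_ids, mig_uuids):
    """Round-robin MIG slices across workers (hypothetical helper)."""
    return {w: mig_uuids[i % len(mig_uuids)] for i, w in enumerate(worker_ids)}

def worker_env(mig_uuid):
    """Environment for a worker process pinned to a single MIG slice."""
    return dict(os.environ, CUDA_VISIBLE_DEVICES=mig_uuid)
```

Spawning each worker with `subprocess.Popen(cmd, env=worker_env(uuid))` keeps CUDA in that process from ever seeing the sibling slices.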