
How to access a MIG Device ID programmatically #6

Open
HamidShojanazeri opened this issue Sep 22, 2021 · 9 comments
@HamidShojanazeri

Hi @klueska, I am looking into how to assign a GPU that has been partitioned by MIG from inside a Python script where we want to run a PyTorch model.

We typically do it this way in TorchServe. Now, if an A100 GPU is partitioned into two MIG devices such as "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b/0/0" and "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b/1/0", what would be a good way to handle it? Is there any tool available that provides this info?

This MIG device ID is not available through the CUDA utilities in PyTorch.

I appreciate your thoughts.

@elezar
Member

elezar commented Sep 22, 2021

Hi @HamidShojanazeri, you would most likely need to use NVML (which has Python bindings) to query the state of the device and get the available MIG devices.

An example of a golang-based implementation can be found here. Calling GetUUID() on each of the handles should return the relevant device IDs.

@klueska should be able to comment as to whether anything else is required other than extracting the relevant UUIDs.
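For illustration, a minimal Python sketch of that approach using the pynvml bindings (the nvidia-ml-py package); it assumes MIG mode is already enabled on at least one GPU and that the installed driver exposes the MIG NVML calls:

# Sketch: enumerate the UUIDs of all MIG devices visible via NVML.
# Assumes the nvidia-ml-py package (imported as pynvml) is installed.
import pynvml

def list_mig_uuids():
    pynvml.nvmlInit()
    uuids = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            parent = pynvml.nvmlDeviceGetHandleByIndex(i)
            try:
                current_mode, _pending = pynvml.nvmlDeviceGetMigMode(parent)
            except pynvml.NVMLError:
                continue  # GPU is not MIG capable
            if current_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
                continue
            for j in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(parent)):
                try:
                    mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, j)
                except pynvml.NVMLError:
                    continue  # no MIG device created at this index
                uuid = pynvml.nvmlDeviceGetUUID(mig)
                uuids.append(uuid.decode() if isinstance(uuid, bytes) else uuid)
    finally:
        pynvml.nvmlShutdown()
    return uuids

if __name__ == "__main__":
    print(list_mig_uuids())

The returned strings can then be used, for example, as values for CUDA_VISIBLE_DEVICES to pin a process to a single MIG instance.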

@kpouget

kpouget commented Sep 22, 2021

Hello @HamidShojanazeri, JFYI, here is the way I used to discover the MIG instances (or the full GPUs):

NB_GPUS=$(nvidia-smi -L | grep "UUID: MIG-GPU" | wc -l)
if [[ "$NB_GPUS" == 0 ]]; then
    ALL_GPUS=$(nvidia-smi -L | grep "UUID: GPU" | cut -d" " -f5 | cut -d')' -f1)

    echo "No MIG GPU available, using the full GPUs ($ALL_GPUS)."
else
    ALL_GPUS=$(nvidia-smi -L | grep "UUID: MIG-GPU" | cut -d" " -f8 | cut -d')' -f1)
    echo "Found $NB_GPU MIG instances: $ALL_GPUS"
fi
...
CMD=('python' ...)
ARGS=(train.py ...)

declare -a pids

trap "date; echo failed :(; exit 1" ERR

for gpu in $(echo "$ALL_GPUS"); do
    export CUDA_VISIBLE_DEVICES=$gpu  # <--- assignment is done here

    dest=/tmp/ssd_$(echo $gpu | sed 's|/|_|g').log

    "${CMD[@]}" "${ARGS[@]}" > "$dest" &
    pids+=($!)
done

echo "$(date): starting waiting for $NB_GPU executions: ${pids[@]}"

wait

@kpouget

kpouget commented Sep 22, 2021

MIG GPUs don't support peer-to-peer communication. The MLPerf/PyTorch code I was using was failing to deploy on multiple MIG instances; I struggled and hacked my way to get the launcher to correctly take the MIG instances into account ... only to find out that the code would crash because the p2p communication could not be enabled.

So double-check that you're not in the same situation.
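A quick sanity check with PyTorch (assuming it is installed) that reports whether peer access is available between the devices currently visible to the process; on MIG instances this is expected to come back false, and typically only a single MIG device is visible per process anyway:

# Sketch: report peer-to-peer accessibility between the visible CUDA devices.
import torch

if torch.cuda.is_available():
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"peer access {i} -> {j}: {ok}")
else:
    print("No CUDA devices visible")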

@HamidShojanazeri
Author


Thanks @elezar for the pointer, yes, I think that should give me access from a Python script. I wonder if there is any documentation/tutorial for the Python bindings of NVML.

@HamidShojanazeri
Author


Thanks @kpouget for sharing, that would be great for training, I think. In this case, TorchServe is a serving solution with a Python backend, and we need to access the MIG devices from the Python handlers to assign them dynamically to the workers.
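A rough sketch of what such a handler-side assignment could look like; the MIG UUID below is just the example value from this thread, and the environment variable has to be set before CUDA (i.e. torch.cuda) is initialized in that worker process:

# Sketch: pin a worker process to one MIG instance by setting
# CUDA_VISIBLE_DEVICES before CUDA is initialized in this process.
import os

mig_uuid = "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b/0/0"  # example value from above
os.environ["CUDA_VISIBLE_DEVICES"] = mig_uuid

import torch  # imported only after CUDA_VISIBLE_DEVICES is set

print(torch.cuda.device_count())  # should report exactly one visible device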

@HamidShojanazeri
Author


@kpouget Thanks for highlighting it. For this use case we will not use p2p communication for running inference, but this may come up in some other workflows for us in the future.

@klueska
Contributor

klueska commented Sep 22, 2021

I'm pretty sure all of the Python bindings for NVML are named the same as their C counterparts in this API documentation: https://docs.nvidia.com/deploy/nvml-api/index.html

@klueska
Contributor

klueska commented Sep 22, 2021

Actually, this page explains where / how they differ (which is minimal):
https://pythonhosted.org/nvidia-ml-py/
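For example (a small sketch, assuming the nvidia-ml-py / pynvml package is installed), the C call nvmlDeviceGetName(device, name, length) keeps its name in Python but returns the value directly and raises pynvml.NVMLError on failure instead of returning a status code:

# Sketch: the Python bindings mirror the C API names; return values come back
# directly and errors surface as pynvml.NVMLError exceptions.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    print(pynvml.nvmlDeviceGetName(handle))
    print(pynvml.nvmlDeviceGetUUID(handle))
finally:
    pynvml.nvmlShutdown()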

@HamidShojanazeri
Author

Thanks @klueska for the prompt response, I will give it a shot and update you.
