How to access a MIG Device ID programmatically #6
Comments
Hi @HamidShojanazeri, you would most likely need to use NVML (which has Python bindings) to query the state of the device and get the available MIG devices. An example of a golang-based implementation can be found here. Calling in @klueska, who should be able to comment on whether anything else is required beyond extracting the relevant UUIDs.
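A minimal sketch of such a query using the pynvml bindings (`pip install nvidia-ml-py`): the `nvml` parameter defaults to the real module but is injectable purely so the traversal logic can be exercised without a GPU; the function name is our own, not part of the library.

```python
def list_mig_uuids(nvml=None):
    """Enumerate the UUIDs of all MIG devices visible to NVML.

    `nvml` defaults to the pynvml module; it is injectable so the
    traversal logic can be tested on a machine without a GPU.
    """
    if nvml is None:
        import pynvml as nvml  # official NVML Python bindings
    nvml.nvmlInit()
    try:
        uuids = []
        for i in range(nvml.nvmlDeviceGetCount()):
            gpu = nvml.nvmlDeviceGetHandleByIndex(i)
            try:
                current, _pending = nvml.nvmlDeviceGetMigMode(gpu)
            except nvml.NVMLError:
                continue  # GPU does not support MIG at all
            if current != nvml.NVML_DEVICE_MIG_ENABLE:
                continue  # MIG not enabled on this GPU
            for j in range(nvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
                try:
                    mig = nvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, j)
                except nvml.NVMLError:
                    continue  # MIG slot not populated
                uuids.append(nvml.nvmlDeviceGetUUID(mig))
        return uuids
    finally:
        nvml.nvmlShutdown()
```

On real hardware you would simply call `list_mig_uuids()` with no argument.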
Hello @HamidShojanazeri, JFYI, here is the way I used to discover the MIG instances (or the full GPU):
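The original snippet was not preserved here; as an illustrative sketch (not the author's code), one common way to discover MIG instances is to parse `nvidia-smi -L`, which prints one UUID per GPU and per MIG device. The function and regex names below are hypothetical:

```python
import re
import subprocess

# Matches the UUID on "MIG ... Device N: (UUID: MIG-...)" lines. Covers
# both the older MIG-GPU-<gpu-uuid>/<gi>/<ci> form and the newer opaque
# MIG-<uuid> form, depending on the driver version.
_MIG_UUID = re.compile(r"UUID:\s*(MIG-[^)]+)\)")

def discover_mig_uuids(listing=None):
    """Return MIG device UUIDs parsed from `nvidia-smi -L` output.

    Pass a captured `listing` string to parse without running the tool.
    """
    if listing is None:
        listing = subprocess.check_output(["nvidia-smi", "-L"], text=True)
    return _MIG_UUID.findall(listing)
```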
MIG GPUs don't support peer-to-peer communication. The MLPerf/PyTorch code I was using was failing to deploy on multiple MIG instances; I struggled and hacked my way to get the launcher to correctly take the MIG instances into account, only to find out that the code would crash because p2p communication could not be enabled. So double-check that you're not in the same situation.
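That check can be done up front with `torch.cuda.can_device_access_peer`; a sketch under the assumption that a False result should route the launcher away from P2P (the helper name is ours, and `cuda` is injectable only so the logic can run without a GPU):

```python
def p2p_usable(dev_a, dev_b, cuda=None):
    """True only if both CUDA devices are visible and can enable P2P.

    MIG instances never support peer-to-peer, so a multi-device launcher
    should take the False path (e.g. one worker process per MIG slice).
    """
    if cuda is None:
        import torch  # deferred so the helper can be defined without PyTorch
        cuda = torch.cuda
    if not cuda.is_available() or cuda.device_count() <= max(dev_a, dev_b):
        return False
    return cuda.can_device_access_peer(dev_a, dev_b)
```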
Thanks @elezar for the pointer. Yes, I think that should give me access from a Python script. I wonder if there is any documentation or tutorial for the Python bindings of NVML.
Thanks @kpouget for sharing; that would be great for training, I think. In this case, TorchServe is a serving solution with a Python backend, and we need to access the MIG devices from the Python handlers to assign them dynamically to the workers.
@kpouget Thanks for highlighting it. For this use case we will not use p2p communication for running inference, but this may come up in some other workflows for us in the future.
I'm pretty sure all of the Python bindings for NVML are named the same as their C counterparts from this API documentation: https://docs.nvidia.com/deploy/nvml-api/index.html
Actually, this page explains where and how they differ (which is minimal):
Thanks @klueska for the prompt response, I will give it a shot and update you.
Hi @klueska, I am looking into an issue of assigning a GPU that has been partitioned by MIG inside a Python script where I want to run a PyTorch model.
We typically do it this way in TorchServe, and now if an A100 GPU is partitioned into 2 GPUs such as "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b/0/0" and "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b/1/0", what would be a good way to handle it? Is there any tool available that provides this info?
This MIG GPU ID is not available through the CUDA utilities in PyTorch.
I appreciate your thoughts.
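One way this is often handled in a Python serving stack (a sketch, not TorchServe's actual mechanism): discover the MIG UUIDs once, then pin each worker to exactly one slice through `CUDA_VISIBLE_DEVICES`, which accepts MIG identifiers; inside the worker the slice then appears as plain `cuda:0`. Both helper names below are hypothetical:

```python
import os

def assign_slices(worker_ids, mig_uuids):
    """Round-robin MIG slices across workers (hypothetical helper)."""
    return {w: mig_uuids[i % len(mig_uuids)] for i, w in enumerate(worker_ids)}

def worker_env(mig_uuid):
    """Environment for a worker process pinned to a single MIG slice."""
    return dict(os.environ, CUDA_VISIBLE_DEVICES=mig_uuid)
```

Spawning each worker with `subprocess.Popen(cmd, env=worker_env(uuid))` keeps CUDA in that process from ever seeing the sibling slices.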