
docker: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" #857

Open · henryli001 opened this issue Jan 11, 2025 · 0 comments


A standalone Docker container running in an Azure Linux 2.0 VM with the NVIDIA Container Toolkit installed loses access to its GPUs and throws the error "Failed to initialize NVML: Unknown Error" after the container has been running for a while. The symptom is similar to the one described in the known issue #48 and can be reproduced by running systemctl daemon-reload on the host.
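
For reference, a minimal reproduction sketch of the sequence described above (the image tag and container name are placeholders, and this assumes Docker is configured to use the NVIDIA Container Toolkit runtime):

```shell
# Start a long-running GPU container; the image tag is an example, any CUDA image works.
docker run -d --rm --gpus all --name gpu-test \
  nvidia/cuda:12.4.0-base-ubuntu22.04 sleep infinity

# GPU access works right after startup.
docker exec gpu-test nvidia-smi

# On the host, reload the systemd manager configuration.
sudo systemctl daemon-reload

# The same command now fails with "Failed to initialize NVML: Unknown Error".
docker exec gpu-test nvidia-smi
```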

The issue does not show up if I explicitly pass --device= for each NVIDIA device node on the system in the docker run command. However, this is not a sustainable solution, because the number of NVIDIA device nodes may change with the machine's configuration. Is there a better way to let the container automatically access all the NVIDIA devices without explicitly setting --device= for each device node?
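
For context, the current workaround looks roughly like the following; the set of device nodes is an example and varies with the number of GPUs and the driver configuration:

```shell
# Explicitly pass each NVIDIA device node so the container keeps access to them
# even after the host cgroup configuration is reloaded.
docker run --rm --gpus all \
  --device=/dev/nvidiactl \
  --device=/dev/nvidia-uvm \
  --device=/dev/nvidia-uvm-tools \
  --device=/dev/nvidia0 \
  nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

Enumerating the nodes by hand is what I would like to avoid, since additional /dev/nvidiaN nodes appear or disappear as the GPU configuration changes.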
