
docker: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" #857

Open · henryli001 opened this issue Jan 11, 2025 · 0 comments


A standalone Docker container running in an Azure Linux 2.0 VM with the NVIDIA Container Toolkit installed loses access to its GPUs and throws the error "Failed to initialize NVML: Unknown Error" after the container has been running for a while. The symptom is similar to the one described in the known issue #48 and can be reproduced by running systemctl daemon-reload on the host.
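
For reference, a minimal reproduction sketch of the sequence described above (the image tag and container name are placeholders, and this assumes Docker is configured to use the NVIDIA Container Toolkit runtime):

```shell
# Start a long-running GPU container; the image tag is an example, any CUDA image works.
docker run -d --rm --gpus all --name gpu-test \
  nvidia/cuda:12.4.0-base-ubuntu22.04 sleep infinity

# GPU access works right after startup.
docker exec gpu-test nvidia-smi

# On the host, reload the systemd manager configuration.
sudo systemctl daemon-reload

# The same command now fails with "Failed to initialize NVML: Unknown Error".
docker exec gpu-test nvidia-smi
```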

The issue does not show up if I explicitly pass --device= for each NVIDIA device node on the system in the docker run command. However, this is not a sustainable solution, because the number of NVIDIA device nodes may change with the machine's configuration. Is there a better way to let the container automatically access all the NVIDIA devices without explicitly setting --device= for each device node?
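
For context, the current workaround looks roughly like the following; the set of device nodes is an example and varies with the number of GPUs and the driver configuration:

```shell
# Explicitly pass each NVIDIA device node so the container keeps access to them
# even after the host cgroup configuration is reloaded.
docker run --rm --gpus all \
  --device=/dev/nvidiactl \
  --device=/dev/nvidia-uvm \
  --device=/dev/nvidia-uvm-tools \
  --device=/dev/nvidia0 \
  nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

Enumerating the nodes by hand is what I would like to avoid, since additional /dev/nvidiaN nodes appear or disappear as the GPU configuration changes.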
