Startup order #11
Comments
Typically kernel modules are loaded by the |
Nvidia GPU drivers (kernel modules) are only loaded when used, and unloaded when not in use. To solve that issue, a persistence daemon is used. Details can be found at https://download.nvidia.com/XFree86/Linux-x86_64/396.51/README/nvidia-persistenced.html We need to make sure that the daemon is started first on a server, so that the GPU kernel module is loaded even when it is not being used.
The nvidia kernel module is most definitely not loaded and unloaded across each use. If it is not loaded at system boot, then running one of the nvidia utilities (e.g. nvidia-smi or nvidia-persistenced) will load the kernel module before it runs. It will not unload it once it is done, though. The module remains loaded until the system shuts down or a user explicitly unloads it (via rmmod, for example). The page you linked is about whether the GPU is kept in persistence mode. This has nothing to do with whether the module is loaded, but rather with whether the (always loaded) module keeps GPU state alive across operations. Without the persistenced service (or the old persistence mode being enabled), the driver tears down GPU state after each operation, making it very slow to respond. With persistenced, this state is kept alive and the driver is much more responsive. In any case, it seems that your system is not loading the module during sysinit and is instead relying on the persistenced service to do it for you. I would recommend adding the nvidia module to the set of preloaded modules, as is commonly done on other systems (e.g. the Nvidia DGX systems that this nvidia-mig-manager.service was built for and is tested on).
You are right that the modules are indeed autoloaded, but there seems to be a slight delay, and nvidia-mig-manager.service sometimes misses them when it runs, since it is a oneshot service. By the time I log in to check, I do see the modules loaded, and if I manually run nvidia-mig-manager.service again, it works fine. Maybe we could have nvidia-mig-manager.service wait and retry.
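To make the "wait and retry" idea concrete, here is a minimal sketch of polling for the module before the oneshot work starts. It is not how nvidia-mig-manager actually behaves. The function names `is_module_loaded` and `wait_for_module`, the timeout and interval defaults, and the injectable `sleep` parameter are all hypothetical and only for illustration. It assumes the standard Linux `/proc/modules` layout, where each line begins with the module name.

```python
import time
from pathlib import Path


def is_module_loaded(name: str, modules_file: str = "/proc/modules") -> bool:
    """Return True if `name` is listed in a /proc/modules-style file."""
    try:
        text = Path(modules_file).read_text()
    except OSError:
        # An unreadable or missing file is treated as "not loaded".
        return False
    # The first whitespace-separated field of each line is the module name.
    return any(
        line.split()[0] == name for line in text.splitlines() if line.strip()
    )


def wait_for_module(
    name: str,
    timeout: float = 30.0,       # hypothetical default, tune per system
    interval: float = 1.0,       # hypothetical polling interval
    modules_file: str = "/proc/modules",
    sleep=time.sleep,            # injectable so the loop can be exercised in tests
) -> bool:
    """Poll until the module shows up or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if is_module_loaded(name, modules_file):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Exit non-zero if the module never appears, so a caller can react.
    raise SystemExit(0 if wait_for_module("nvidia") else 1)
```

A script like this could run as a step before the oneshot service does its real work. Preloading the module at boot, as suggested above, would make the polling unnecessary.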
Would it cause any potential issues if we were to move nvidia-persistenced.service from "Before" to "After"? Unlike nvidia-mig-manager.service (oneshot), persistenced is a daemon, so it can catch up even if it starts before mig-manager.
In general, the
The reason for this is because these services become clients of the GPU, prohibiting the Unfortunately, the only dependencies these services have in the systemd dependency graph are on the Likewise, the The right way to do this would be to have all of the |
Based on https://github.com/NVIDIA/mig-parted/blob/main/deployments/systemd/nvidia-mig-manager.service#L19, nvidia-mig-manager.service starts before nvidia-persistenced.service. This causes a problem because nvidia-persistenced.service is responsible for loading the nvidia kernel modules on a server, so it needs to start first. Otherwise, nvidia-mig-manager.service won't be able to create the MIG configuration, because the nvidia drivers are not yet loaded.