Test the ecosystem integration against GPU containers #38
Comments
I see, and it would be quite a useful feature to allow custom images... I think it could be very feasible for the GPU testing, which already runs a custom docker image, but how much do you think is needed also for base CPU testing? One more point, and it is thinking aloud rather than complaining... we are talking about two kinds of users: (a) those who heavily rely on containers/docker [probably corporate users] and (b) casual users relying mostly on the PyPI/Conda registries... so I would say that we'd like to serve both; for Nemo, for example, I would include both kinds of testing, (a) and (b) 🐰
CPU testing should be done only on CPU instances when possible, to avoid incurring GPU runtime costs. The containers are mostly tailored towards GPUs, since the primary use case for containers is multi-GPU or multi-node deployments for training. For Nemo, the vast majority of users can get by with a conda env and no docker, so I do agree that this is a niche problem for a subset of users. That subset does, however, include the entire Nemo research team plus a few external research teams, who build containers and run multi-node jobs on the clusters. So a breakage of support usually means we hold off upgrading our containers for periods of 1-2 months. Also, the CI tests running in Nemo are in the container, but we simulate the install environment of the user - i.e. we use a bare-bones pytorch base container and then follow regular pip and conda install steps. Of course the torch environment is based on a container, which is not what normal users will face, but it is still close to real-world install scenarios.
🚀 Feature
Nemo runs its own CI on a PyTorch container from NGC (versioned as YY.MM), and these containers are generally available on other cloud providers too. Note that it usually takes at least one month after a public PyTorch release before the next container actually ships that release. By "actually ships", I mean that the current container carries an alpha build of PyTorch with some cherry-picked changes, rather than the actual full public release.
This creates cases where improper version checking (using distutils instead of packaging.version.Version) fails these alpha-version comparisons and makes PTL inside the container pick incorrect code paths. So the ecosystem CI will work fine... but when you run it on a PyTorch container released by Nvidia (i.e. on most cloud providers) it may fail (and not just Nemo - anything that uses PTL and hits that code path).
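To illustrate the failure mode, here is a minimal sketch assuming an NGC-style version string such as `1.10.0a0+0aef44c` (the hash is made up): `packaging.version.Version` treats the alpha build as a pre-release of 1.10.0, while `distutils.version.LooseVersion` compares it as newer than 1.10.0, so a `torch >= 1.10.0` gate would open against an incomplete alpha build.

```python
from distutils.version import LooseVersion  # deprecated; removed in Python 3.12
from packaging.version import Version

# Hypothetical NGC-style build string; the container ships something similar
# ("<next release>a0+<git hash>") before the public release exists.
container_torch = "1.10.0a0+0aef44c"

# packaging follows PEP 440: the alpha is a pre-release of 1.10.0,
# so a "torch >= 1.10.0" feature gate correctly stays closed.
print(Version(container_torch) >= Version("1.10.0"))            # False

# LooseVersion compares component lists lexically, so the same gate opens
# and the new-torch code path is taken against an alpha build.
print(LooseVersion(container_torch) >= LooseVersion("1.10.0"))  # True
```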
So maybe, in a separate test prior to release, run the ecosystem CI against the latest public NGC PyTorch container (or really any cloud container which has PyTorch built into it). Of course this is a big task, so it's just a suggestion.
Motivation
For a current example of exactly how we have to patch for such an issue right now (w.r.t. PyTorch 1.10, NGC container 21.01 and PyTorch Lightning 1.5.9), see https://github.com/NVIDIA/NeMo/blob/8e15ba43ba0a17b456d3bfa09444574ef1faa301/Jenkinsfile#L70-L76 - a workaround needed due to an issue regarding torchtext.
For an extreme case of exactly how bad things can become: we had to adaptively install torch, PTL and Nemo dependencies based on whether the install occurred inside a container or not: https://github.com/NVIDIA/NeMo/blob/r1.0.0rc1/setup.py#L107-L146
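A minimal sketch of the kind of container-conditional dependency selection described above; the detection heuristic, the `NVIDIA_PYTORCH_VERSION` env var check, and the pins are illustrative assumptions, not the actual NeMo setup.py logic:

```python
import os


def running_inside_container() -> bool:
    """Heuristic container check (illustrative only)."""
    # Docker typically creates /.dockerenv; NGC PyTorch images also export
    # NVIDIA_PYTORCH_VERSION (assumption here - verify for your image).
    return os.path.exists("/.dockerenv") or "NVIDIA_PYTORCH_VERSION" in os.environ


install_requires = ["numpy", "packaging"]
if running_inside_container():
    # The NGC container already ships a (possibly alpha) torch build,
    # so avoid pinning torch and let the preinstalled build win.
    install_requires += ["pytorch-lightning"]
else:
    # Outside a container, pin to public releases known to work together.
    install_requires += ["torch>=1.10.0,<1.11", "pytorch-lightning>=1.5.0,<1.6"]
```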
Pitch
Maybe test the ecosystem CI (or even just PTL alone) on the latest public NGC PyTorch container (or really any cloud container which has PyTorch built into it). Of course this is a big task, so it's just a suggestion.
Alternatives
Apart from manually patching the PTL source at install time, we haven't found any better solution than to wait it out for a month or two until the container actually contains the latest code from the latest torch release.