Test the ecosystem integration against GPU containers #38

Open
titu1994 opened this issue Feb 11, 2022 · 2 comments
Labels
enhancement New feature or request RFC Ready for Comments

Comments

@titu1994

🚀 Feature

NeMo runs its own CI on a PyTorch container from NGC (versioned as YY.MM), and these containers are generally available on other cloud providers too. Note that once PyTorch has a public release, it usually takes at least one month for the next container to actually ship that release. Until then, the current container carries an alpha build of PyTorch with some cherry-picked changes rather than the full public release.

This can lead to cases where improper version checking (using distutils instead of packaging.version.Version) mishandles these alpha versions in comparisons and causes PTL inside the container to pick incorrect code paths. So the ecosystem CI will work fine ... but when you run it on a PyTorch container released by NVIDIA (i.e. on most cloud providers) it may fail (and not just NeMo, anything that uses PTL and hits that code path).
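To illustrate the version comparison problem described above, here is a minimal sketch using packaging.version.Version. The version string and the `at_least` helper are hypothetical and only for illustration; they are not taken from PTL or from any specific NGC container.

```python
from packaging.version import Version


def at_least(installed: str, required: str, use_base_version: bool = False) -> bool:
    """Return True if `installed` satisfies `>= required` (hypothetical helper).

    With use_base_version=True, pre-release and local segments are dropped
    before comparing, so "1.11.0a0+abc" is treated as "1.11.0".
    """
    version = Version(installed)
    if use_base_version:
        version = Version(version.base_version)
    return version >= Version(required)


# Hypothetical alpha build string of the kind a container might report.
container_torch = "1.11.0a0+b6df043"

# Under PEP 440 ordering, an alpha pre-release sorts before the final release.
strict = at_least(container_torch, "1.11.0")

# Comparing only the release segment treats the alpha as the final release.
relaxed = at_least(container_torch, "1.11.0", use_base_version=True)

print(f"strict={strict}, relaxed={relaxed}")
```

Which of the two comparisons is right depends on whether the alpha build already contains the feature being gated, so this only shows that the choice has to be made explicitly.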

So maybe, as a separate test prior to release, run the ecosystem CI on the latest public NGC PyTorch container (or really any cloud container that has PyTorch built in). Of course this is a big task, so it's just a suggestion.

Motivation

For a current example of how we have to patch for such an issue right now (with PyTorch 1.10, NGC Container 21.01 and PyTorch Lightning 1.5.9), see https://github.com/NVIDIA/NeMo/blob/8e15ba43ba0a17b456d3bfa09444574ef1faa301/Jenkinsfile#L70-L76 . The patch is needed because of an issue with torchtext.

For an extreme case of how bad things can become: we had to adaptively install torch, PTL and NeMo dependencies based on whether the install occurred inside a container or not. See https://github.com/NVIDIA/NeMo/blob/r1.0.0rc1/setup.py#L107-L146

Pitch

Maybe test the ecosystem CI (or even just PTL alone) on the latest public NGC PyTorch container (or really any cloud container that has PyTorch built in). Of course this is a big task, so it's just a suggestion.

Alternatives

Apart from manually patching the PTL source at install time, we haven't found any better solution than to wait a month or two until the container actually contains the code from the latest torch release.

@Borda
Member

Borda commented Feb 13, 2022

> Maybe test the ecosystem CI (or just even PTL alone) on the latest public NGC pytorch container (or really any cloud container which has pytorch built into it). Ofc this is a big task so it's just a suggestion.

I see, and it would be quite a useful feature to allow custom images... I think it could be very feasible for the GPU testing, which already runs on a custom Docker image, but how much do you think it is also needed for the base CPU testing?

One more point, and it is more of a thought than a complaint... we are talking about two kinds of users: (a) those who rely heavily on containers/Docker [probably some corporate users] and (b) casual users who mostly use the PyPI/Conda registry... so I would say that we would like to serve both, so for the example of NeMo I would include testing for both a and b 🐰

@Borda Borda added the RFC Ready for Comments label Feb 13, 2022
@titu1994
Author

CPU testing should be done only on CPU instances when possible, to avoid incurring GPU runtime costs. The containers are mostly tailored towards GPUs, since the primary use case for containers is multi-GPU or multi-node deployments for training.

For NeMo, the vast majority of users can get by with a conda env and no Docker, so I do agree that it's a niche problem for a subset of users.

That subset of users does include the entire NeMo research team plus a few external research teams, who build containers and run multi-node jobs on the clusters. So a breakage of support usually means we hold off on upgrading our containers for 1-2 months.

Also, the CI tests in NeMo run in the container, but we simulate the user's install environment, i.e. we use a bare-bones PyTorch base container and then follow the regular pip and conda install steps. Of course the torch environment is based on a container, which is not what normal users will face, but it's still close to real-world install scenarios.

@stale stale bot added the wontfix This will not be worked on label Apr 16, 2022
@Lightning-AI Lightning-AI deleted a comment from stale bot Apr 16, 2022
@stale stale bot removed the wontfix This will not be worked on label Apr 16, 2022
2 participants