Test the ecosystem integration against GPU containers #38
Comments
I see, and it would be quite a useful feature to allow custom images... I think it could be very feasible for the GPU testing, which already runs a custom docker image, but how much do you think is needed also for base CPU testing? One more point, and it is thinking aloud rather than complaining... we are talking about two kinds of users: (a) those who heavily rely on containers/docker [probably corporate users] and (b) casual users relying mostly on the PyPI/Conda registries... so I would say that we'd like to serve both; for Nemo, for example, I would include both kinds of testing, (a) and (b) 🐰
CPU testing should be done only on CPU instances when possible, to avoid incurring GPU runtime costs. The containers are mostly tailored towards GPUs, since the primary use case for containers is multi-GPU or multi-node deployments for training. For Nemo, the vast majority of users can get by with a conda env and no docker, so I do agree that this is a niche problem for a subset of users. That subset does, however, include the entire Nemo research team plus a few external research teams, who build containers and run multi-node jobs on the clusters. So a breakage of support usually means we hold off upgrading our containers for periods of 1-2 months. Also, the CI tests running in Nemo are in the container, but we simulate the install environment of the user - i.e. we use a bare-bones pytorch base container and then follow regular pip and conda install steps. Of course the torch environment is based on a container, which is not what normal users will face, but it is still close to real-world install scenarios.
🚀 Feature
Nemo runs its own CI on a PyTorch container from NGC (versioned as YY.MM), and these containers are generally available on other cloud providers too. Note that it usually takes at least one month after a public PyTorch release before the next container actually ships that release. By "actually ships", I mean that the current container carries an alpha build of PyTorch with some cherry-picked changes, rather than the actual full public release.
This creates cases where improper version checking (using distutils instead of packaging.version.Version) fails these alpha-version comparisons and makes PTL inside the container pick incorrect code paths. So the ecosystem CI will work fine... but when you run it on a PyTorch container released by Nvidia (i.e. on most cloud providers) it may fail (and not just Nemo - anything that uses PTL and hits that code path).
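To illustrate the failure mode, here is a minimal sketch assuming an NGC-style version string such as `1.10.0a0+0aef44c` (the hash is made up): `packaging.version.Version` treats the alpha build as a pre-release of 1.10.0, while `distutils.version.LooseVersion` compares it as newer than 1.10.0, so a `torch >= 1.10.0` gate would open against an incomplete alpha build.

```python
from distutils.version import LooseVersion  # deprecated; removed in Python 3.12
from packaging.version import Version

# Hypothetical NGC-style build string; the container ships something similar
# ("<next release>a0+<git hash>") before the public release exists.
container_torch = "1.10.0a0+0aef44c"

# packaging follows PEP 440: the alpha is a pre-release of 1.10.0,
# so a "torch >= 1.10.0" feature gate correctly stays closed.
print(Version(container_torch) >= Version("1.10.0"))            # False

# LooseVersion compares component lists lexically, so the same gate opens
# and the new-torch code path is taken against an alpha build.
print(LooseVersion(container_torch) >= LooseVersion("1.10.0"))  # True
```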
So maybe, in a separate test prior to release, run the ecosystem CI against the latest public NGC PyTorch container (or really any cloud container which has PyTorch built into it). Of course this is a big task, so it's just a suggestion.
Motivation
For a current example of exactly how we have to patch for such an issue right now (w.r.t. PyTorch 1.10, NGC container 21.01 and PyTorch Lightning 1.5.9), see https://github.com/NVIDIA/NeMo/blob/8e15ba43ba0a17b456d3bfa09444574ef1faa301/Jenkinsfile#L70-L76 - a workaround needed due to an issue regarding torchtext.
For an extreme case of exactly how bad things can become: we had to adaptively install torch, PTL and Nemo dependencies based on whether the install occurred inside a container or not: https://github.com/NVIDIA/NeMo/blob/r1.0.0rc1/setup.py#L107-L146
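A minimal sketch of the kind of container-conditional dependency selection described above; the detection heuristic, the `NVIDIA_PYTORCH_VERSION` env var check, and the pins are illustrative assumptions, not the actual NeMo setup.py logic:

```python
import os


def running_inside_container() -> bool:
    """Heuristic container check (illustrative only)."""
    # Docker typically creates /.dockerenv; NGC PyTorch images also export
    # NVIDIA_PYTORCH_VERSION (assumption here - verify for your image).
    return os.path.exists("/.dockerenv") or "NVIDIA_PYTORCH_VERSION" in os.environ


install_requires = ["numpy", "packaging"]
if running_inside_container():
    # The NGC container already ships a (possibly alpha) torch build,
    # so avoid pinning torch and let the preinstalled build win.
    install_requires += ["pytorch-lightning"]
else:
    # Outside a container, pin to public releases known to work together.
    install_requires += ["torch>=1.10.0,<1.11", "pytorch-lightning>=1.5.0,<1.6"]
```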
Pitch
Maybe test the ecosystem CI (or even just PTL alone) on the latest public NGC PyTorch container (or really any cloud container which has PyTorch built into it). Of course this is a big task, so it's just a suggestion.
Alternatives
Apart from manually patching the PTL source at install time, we haven't found any better solution than to wait it out for a month or two until the container actually contains the latest code from the latest torch release.