re-think modules organization and structure #36

Closed
fmalmeida opened this issue Nov 11, 2021 · 15 comments · Fixed by #51
Labels
enhancement New feature or request good first issue Good for newcomers stalled Implementation is slow or difficult

Comments

@fmalmeida
Owner

fmalmeida commented Nov 11, 2021

It would be nice if the pipeline were also capable of running with singularity and conda, instead of only docker, which would be more similar to what is found in nf-core modules. To do that, the following should be addressed:

  • The pipeline must be reconfigured so it does not depend on a docker image containing the databases
    • Ideally, the pipeline would have a command that creates the required databases in a directory chosen by the user
    • Then this database directory is loaded into the pipeline with a parameter, e.g. --databases
    • The pipeline would check the available databases and skip a process if its database is not available (see the sketch after this list)
  • Finally, instead of creating a huge docker image from scratch, it would be ideal for each module to use a conda package (if available) or, when the tool is not available in conda, a docker image containing only that tool.
    • Any custom docker image (if necessary) must not create files with root permissions, so that it can also be used with singularity if the user desires, giving more flexibility than docker alone.
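As a rough sketch of that last bullet, a minimal example assuming --databases points to a directory with one subdirectory per tool (the KOFAMSCAN module, channel and directory names are hypothetical):

```groovy
// main.nf -- minimal sketch; KOFAMSCAN would be a previously included module
params.databases = false

workflow {
    // hypothetical layout: <databases>/kofamscan holds the KOfamscan profiles
    kofam_db = file( "${params.databases}/kofamscan" )

    if( kofam_db.exists() ) {
        KOFAMSCAN( annotations_ch, kofam_db )   // run the step normally
    }
    else {
        log.warn "KOfamscan database not found under ${params.databases} -- skipping this step"
    }
}
```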

These changes are related and would also affect the following issues, making them possible or easier to implement:

@fmalmeida fmalmeida added enhancement New feature or request good first issue Good for newcomers labels Nov 11, 2021
@fmalmeida fmalmeida changed the title rethink modules organization and structure re-think modules organization and structure Nov 11, 2021
@abhi18av
Contributor

Adding this here, since the answer to the question above will decide whether the following information is relevant or not.

include { unicycler } from './modules/assembly/unicycler.nf' params(outdir: params.outdir,

If we continue to use custom modules, then the allocation of module-level parameters could be simplified by using addParams(params.MODULE_NAME) in the module import statement, as shown here:

https://github.com/mtb-bioinformatics/mtbseq-nf/blob/a459f5d95432f088a3e4d94cab415f88d22916a1/workflows/batch_analysis.nf#L4

And then collect all the pipeline parameters within the config file itself, as shown here.

https://github.com/mtb-bioinformatics/mtbseq-nf/blob/a459f5d95432f088a3e4d94cab415f88d22916a1/conf/global_params.config#L137
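In short, the pattern looks roughly like this (the module path and parameter names below are illustrative, not taken from bacannot):

```groovy
// conf/global_params.config -- all parameters collected in one place, grouped per module
params {
    unicycler {
        outdir = 'results'
        cpus   = 4
    }
}
```

```groovy
// workflows/batch_analysis.nf -- the grouped map is injected at import time
include { UNICYCLER } from '../modules/assembly/unicycler.nf' addParams( params.unicycler )
```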

@fmalmeida
Owner Author

Hi @abhi18av,

Thank you for pointing that out. However, I believe this one is not relevant, because I don't actually use module-level parameters. I only use the parameters that are set globally for the workflow.

This was a mistake: I thought I had to declare the parameters I wanted to use, but I figured out I didn't. These declarations have been removed in the new 3.x version of the pipeline.

@fmalmeida
Owner Author

fmalmeida commented Nov 19, 2021

Note on the issue:

I will try to address this issue in branch https://github.com/fmalmeida/bacannot/tree/remodeling.

Steps:

  • add a parameter and a workflow to download the required databases into a path selected by the user (see the sketch after this list)
  • add a parameter to load the downloaded databases into the pipeline
  • reconfigure the modules to access the databases provided as input
  • reconfigure the modules to have conda, docker and singularity profiles
  • go to issue Add bakta #16
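A rough sketch of the first two steps, assuming hypothetical --get_dbs/--output parameter names and a single example database:

```groovy
// main.nf -- hedged sketch; parameter names are hypothetical
params.get_dbs   = false
params.output    = 'bacannot_dbs'
params.databases = false              // later runs point here to reuse the downloads

process GET_AMRFINDER_DB {
    publishDir params.output, mode: 'copy'

    output:
    path 'amrfinder_db'

    script:
    """
    amrfinder_update -d amrfinder_db
    """
}

workflow {
    if( params.get_dbs ) GET_AMRFINDER_DB()
}
```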

@fmalmeida
Owner Author

Found a small hindrance while updating the pipeline in the https://github.com/fmalmeida/bacannot/tree/remodeling branch that needs to be addressed before continuing:

  • Ideally, each module would use its specific conda package and quay.io image to allow easy conversion between the docker, conda and singularity profiles. For example, this is possible for the PROKKA module.
  • However, a few modules depend on more than one conda package at once, and therefore no quay.io image exists for them. So, which would be the best option?
    • Try to split these modules into smaller ones to avoid requiring multiple packages? Would that be possible for all of the "problematic" modules?
    • Or create a small docker image, called bacannot:misc_tools, holding all the conda packages used by the modules that cannot use a quay.io image. All these "problematic" modules would then share the same "misc_tools" image and conda "env.yml" file (a rough sketch of such an image is shown below).

Need to think about this.
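For reference, the second option could be as simple as the sketch below, where misc_tools.yml would be the shared conda "env.yml" listing the packages of all the "problematic" modules (base image and file names are illustrative):

```dockerfile
# hedged sketch of a possible bacannot:misc_tools image
FROM continuumio/miniconda3:4.10.3
COPY misc_tools.yml /tmp/misc_tools.yml
# install everything into the base env so the tools are on PATH for docker and singularity alike
RUN conda env update -n base -f /tmp/misc_tools.yml && conda clean -afy
```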

@fmalmeida
Owner Author

fmalmeida commented Dec 17, 2021

Decided to always use biocontainers images whenever possible, and to create a miscellaneous image that hosts the tools for the modules that have more than one conda dependency and, therefore, do not have a biocontainer.
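The resulting container configuration would then look something like this (image tags are illustrative; SOME_MISC_STEP is a hypothetical module name):

```groovy
// conf/docker.config -- hedged sketch
process {
    withName: 'PROKKA' {
        container = 'quay.io/biocontainers/prokka:1.14.6--pl5262hdfd78af_0'
    }
    withName: 'SOME_MISC_STEP' {
        container = 'fmalmeida/bacannot:misc_tools'
    }
}
```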

@abhi18av
Contributor

Definitely a great step forward! 👍

Might I suggest cross-publishing it on the Docker Hub and quay.io registries, so that there's a fallback in case you run into docker rate limits.

@fmalmeida
Owner Author

fmalmeida commented Dec 17, 2021

What do you mean by cross-publishing it? Can I upload images to quay.io?

@abhi18av
Contributor

abhi18av commented Dec 17, 2021

Oh, nothing fancy - just that you'll need to tag the same image twice: first for pushing to Docker Hub, and then again for pushing it to quay.io.

https://docs.docker.com/engine/reference/commandline/push/

https://docs.quay.io/solution/getting-started.html
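Roughly (the image name is illustrative):

```sh
# build once, tag once per registry, push to both
docker tag bacannot:misc_tools fmalmeida/bacannot:misc_tools
docker push fmalmeida/bacannot:misc_tools

docker tag bacannot:misc_tools quay.io/fmalmeida/bacannot:misc_tools
docker push quay.io/fmalmeida/bacannot:misc_tools
```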

@fmalmeida
Owner Author

fmalmeida commented Jan 19, 2022

The hindrance mentioned in this comment, which was thought to be solved in this other comment, is actually yet to be solved.

While implementing the changes, we saw that switching to biocontainers would not cover the entirety of the modules, and we would still need two or three different images to host the other tools and programming languages required.

After some thought, we saw that having "half" from biocontainers and "half" from custom docker images would not be the most standardized option ...

So we are yet to decide on the dilemma again:

  1. Either focus on custom docker images, say one for each "main dependency": an image for tools that require R, another for python 3.6, another for python > 3.6, etc. With that, only 3 or 4 docker images would need to be downloaded ...
    • These docker images could also be used with singularity, since they would only have the conda packages installed inside them
    • We would also provide a conda profile that describes the dependencies of each module, still allowing users to choose docker, singularity or conda.
  2. Or try to completely remodel the pipeline so the modules are smaller and have fewer dependencies, allowing at least 90% of the pipeline to run with biocontainers
    • The problem with this option is the effort it would require and the time it would take, depending on how much time we could spend on it.

... 🤔 ...

@abhi18av
Contributor

Hi @fmalmeida,

Unless I'm mistaken, these are all of the docker images used in the pipeline, right?

https://github.com/fmalmeida/bacannot/blob/develop/conf/docker.config

While implementing the changes, we saw that switching to biocontainers would not cover the entirety of the modules, and we would still need two or three different images to host the other tools and programming languages required.

Also, if possible, could we discuss this further using some example modules? It's a bit too abstract for me atm 😅

@fmalmeida
Owner Author

fmalmeida commented Jan 20, 2022

Hi @abhi18av,

It is nothing too fancy. It is just that some modules, such as the one that renders the rmarkdown reports or the one for the genome browser, have a few dependencies (scripts or multiple libraries) that would not be available inside the biocontainers images, which are designed to contain only the tool (or conda package) itself. There are also modules such as digIS whose tool is not available in conda.

So, even after changing the ones that could be changed to biocontainers, some modules would still require non-biocontainer images like bacannot:server, bacannot:renv and bacannot:jbrowse, and we would end up with a mixture of biocontainers and custom images.

The "problem" is not actually a real problem; it is just that, conceptually, I am not much of a fan of such mixtures, I like things more standardized 😅

Just to be clear, I am still going to refactor the modules to read the databases from outside the containers, as suggested when opening the issue. But instead of changing everything to biocontainers, I am thinking of:

  1. Having the pipeline read the databases from outside the docker images, with a module that prepares such a directory with the required databases.
  2. Then creating smaller custom docker images based on the main dependency (or conda-package conflict), like bacannot:renv, bacannot:py36_env, bacannot:perl_env, etc., with their related tools inside them.
    • This would create something like 4 or 5 images, with all the dependencies but without conflicts or database files, which would allow them to be used with singularity
    • 4 or 5 is already the number of images required, which is why I am more favourable to this option. But they would be smaller because the databases will live outside.
  3. Then, trying to create a config file for conda; since the conda directive can set more than one package per module, this should be possible in theory (see the sketch below).
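For point 3, the conda directive accepts several packages at once, so the config could look roughly like this (module names and package pins are illustrative):

```groovy
// conf/conda.config -- hedged sketch
profiles {
    conda {
        process {
            withName: 'MLST' {
                conda = 'bioconda::mlst=2.19.0 bioconda::blast=2.12.0'
            }
        }
    }
}
```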

And yes, those are the docker images used.

The point is just that, instead of pointing 60-70% of the modules to biocontainers, I am leaning towards creating these custom smaller images, which, for me, would be easier to maintain.

😁

@abhi18av
Contributor

Hi Felipe,

Thanks for sharing the full context!

For this, allow me to respond after taking a deeper look at it today; by tomorrow I'll share my thoughts (if you still need them 😅). Perhaps we can find an optimal strategy for this one.

@fmalmeida
Owner Author

Hi Abhinav,

Thanks for the interest.

For this issue specifically, we have discussed it here in the lab and decided to go on as described for now. That approach is already established for our local pipelines, so it follows the lab's standards and requires fewer changes for the time being.

But surely I'd like to hear your thoughts; maybe I could use them to find the optimal strategy, as you said, for future releases or future pipelines.

😄

@abhi18av
Contributor

For this issue specifically, we have discussed it here in the lab and decided to go on as described for now. That approach is already established for our local pipelines, so it follows the lab's standards and requires fewer changes for the time being.

Ah, okay. Yeah, maintenance burden is an important factor 👍

Then I'll simply summarize my suggestions here:

  • An overall preference for small docker images should be given (like you plan). I have seen that on some clouds (like Azure) larger docker images are problematic unless you move them to the native container registry (like ACR).

  • You can combine the docker.registry and process.ext options to allow people to change the base name of the registry dynamically.

The second point could be used in combination with the process selectors to add this dynamic behavior (a rough sketch is shown below).
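For the docker.registry part, a minimal sketch could be (image tag illustrative; the process.ext combination would hang off the same selectors):

```groovy
// nextflow.config -- hedged sketch
params.registry = 'quay.io'           // e.g. --registry docker.io as a fallback
docker.registry = params.registry     // prepended to container names that lack a registry

process {
    withName: 'PROKKA' {
        container = 'biocontainers/prokka:1.14.6--pl5262hdfd78af_0'
    }
}
```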

Hope this helps (in some way 😉 )!

@fmalmeida
Owner Author

Many thanks for the suggestions!

About the first one, we were already facing some problems with the image sizes, which also helped trigger this issue 😅

And about the second one, I just loved the idea. I didn't know about these options and how useful they could be.

Thanks again for these nice resources 😄

@fmalmeida fmalmeida linked a pull request Feb 14, 2022 that will close this issue
@fmalmeida fmalmeida linked a pull request Mar 28, 2022 that will close this issue