Merge pull request #280 from vib-singlecell-nf/develop

Develop
vib-singlecell-nf · Dec 14, 2020 · 91e5724
2 parents 6beddf1 + f8b01a9
commit 91e5724

Showing 8 changed files with 97 additions and 152 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,31 @@
+*checkpoint.ipynb
+*checkpoint*
+*checkpoint.py
+*.test.ipynb
+*.csv
+*.loom
+*.pickle
+*.pyc
+*.html
+*egg*
+.vscode
+.nextflow
+.nextflow*
+data
+refdata
+work
+out/notebooks
+src/scenic/out
+src/scenic/notebooks
+src/scenic/data
+refdata
+data/10x/tiny
+work/
+out/
+tests/
+debug/
+*.swp
+*.swo
+docs/_build/
+src/*/.git
+
diff --git a/README.rst b/README.rst
@@ -8,7 +8,8 @@ A repository of pipelines for single-cell data analysis in Nextflow DSL2.
 
 **Full documentation** is available on `Read the Docs <https://vsn-pipelines.readthedocs.io/en/latest/>`_, or take a look at the `Quick Start <https://vsn-pipelines.readthedocs.io/en/latest/getting-started.html#quick-start>`_ guide.
 
-This main repo contains multiple workflows for analyzing single cell transcriptomics data, and depends on a number of tools, which are organized into submodules within the VIB-Singlecell-NF_ organization.
+This main repo contains multiple workflows for analyzing single cell transcriptomics data, and depends on a number of tools, which are organized into subfolders within the ``src/`` directory.
+The VIB-Singlecell-NF_ organization contains this main repo along with a collection of example runs (`VSN-Pipelines-examples <https://vsn-pipelines-examples.readthedocs.io/en/latest/>`_).
 Currently available workflows are listed below.
 
 If VSN-Pipelines is useful for your research, consider citing:
@@ -109,6 +110,7 @@ Sample Aggregation Workflows
 
 
 ---
+
 In addition, the pySCENIC_ implementation of the SCENIC_ workflow is integrated here and can be run in conjunction with any of the above workflows.
 The output of each of the main workflows is a loom_-format file, which is ready for import into the interactive single-cell web visualization tool SCope_.
 In addition, data is also output in h5ad format, and reports are generated for the major pipeline steps.

diff --git a/docs/development.rst b/docs/development.rst
@@ -4,6 +4,8 @@ Development
 Create module
 -------------
 
+Tool-based modules are located in ``src/<tool-name>``, and each module has a specific structure for scripts and Nextflow processes (see `Repository structure`_ below).
+
 Case study: Add `Harmony`
 *************************
 
@@ -19,40 +21,42 @@ Links:
 
 Steps:
 
-#. Ask the `VIB-SingleCell-NF` administrators to create a new repository (in this case: ``harmony``) or create one on your GitHub account that could be brought into the `VIB-SingleCell-NF` organization.
-
-    When using your own repo, you MUST start from the `template repository`_ in the vib-singlecell-nf organisation. Click the green "Use this template" button and provide a name for your new repo. Make sure the "Include all branches" checkbox is checked.
-
-    .. _`template repository`: https://github.com/vib-singlecell-nf/template
-
 #. Create a new issue on ``vsn-pipelines`` GitHub repository explaining which module you are going to add (e.g.: `Add Harmony batch correction method`).
 
-
-#. `Fork the`_ ``vsn-pipelines`` repository to your own GitHub account.
+#. `Fork the`_ ``vsn-pipelines`` repository to your own GitHub account (if you are an external collaborator).
 
     .. _`Fork the`: https://help.github.com/en/github/getting-started-with-github/fork-a-repo
 
-#. From your ``vsn-pipelines`` GitHub repository, create a new branch called ``feature/[github-issue-id]-[description]``.
+#. From your local copy of the ``vsn-pipelines`` GitHub repository, create a new branch called ``feature/[github-issue-id]-[description]``.
 
     In this case,
 
     - ``[github-issue-id] = 115``
     - ``[description] = add_harmony_batch_correction_method``
 
+   It is highly recommended to start from the ``develop`` branch:
+
     .. code:: bash
 
+        git checkout develop
+        git fetch
+        git pull
         git checkout -b feature/115-add_harmony_batch_correction_method
 
-#. From within the ``src`` directory of the ``vsn-pipelines`` repo, run the ``add_new_submodule.sh`` script.
+#. Use the `template repository`_ in the vib-singlecell-nf organisation to create the framework for the new module in ``src/<tool-name>``:
 
     .. code:: bash
 
-        ./add_new_submodule.sh [git-repo-url] -d
+        git clone --depth=1 https://github.com/vib-singlecell-nf/template.git src/harmony
 
-    ``[git-repo-url]`` = https://github.com/vib-singlecell-nf/harmony.git (Git Repository URL from `VSN-SingleCell-NF` or from your GitHub account)
-    ``-d`` tracks the develop branch of the new repository, which is where you should work until the module is working.
+    .. _`template repository`: https://github.com/vib-singlecell-nf/template
 
-    If you are using VSCode and you don't see the new submodule appearing in ``SOURCE CONTROL PROVIDERS``, open any file from ``src/harmony`` (e.g.: LICENSE)
+#. You can now start editing the files of the tool module located in ``src/<tool-name>``.
+   Optionally, you can delete the ``.git`` directory in the new module to avoid confusion in future local development:
+
+    .. code:: bash
+
+        rm -rf src/harmony/.git
 
 
 #. Create the Dockerfile recipe
@@ -81,11 +85,11 @@ Steps:
             apt-get clean
 
 
-#. Update the ``nextflow.config`` file to create the ``harmony.config`` configuration file.
+#. Rename the ``nextflow.config`` file to create the ``harmony.config`` configuration file.
 
-    * Each process's options should be in their own level. With a single proccess, you do not need one extra level.
+    * Each process's options should be in their own level. With a single process, you do not need one extra level.
 
-    .. code:: dockerfile
+    .. code:: groovy
 
         params {
             sc {
@@ -225,7 +229,7 @@ Steps:
 
 
 
-#. Create the Nextflow process that will run the Harmony R script defined in 7.
+#. Create the Nextflow process that will run the Harmony R script defined in the previous step.
 
     .. code:: groovy
 
@@ -260,9 +264,9 @@ Steps:
         }
 
 
-#. Create a Nextflow module that will call the Nextflow process defined in 8. and perform some other tasks (dimensionality reduction, cluster identification, marker genes identification and report generation)
+#. Create a Nextflow "subworkflow" that will call the Nextflow process defined in the previous step and perform some other tasks (dimensionality reduction, cluster identification, marker genes identification and report generation)
 
-    This step is not required. However it this step is skipped, the code would still need to added into the main ``harmony`` workflow (`workflows/harmony.nf`, see step 10)
+    This step is not required. However, if this step is skipped, the code would still need to be added to the main ``harmony`` workflow (`workflows/harmony.nf`, see the next step).
 
     .. code:: groovy
 
@@ -408,7 +412,7 @@ Steps:
 
         }
 
-#. In the ``vsn-pipelines``, create a new main workflow called ``harmony.nf`` under ``workflows``
+#. In the ``vsn-pipelines`` repository, create a new main workflow called ``harmony.nf`` under ``workflows/``:
 
     .. code:: groovy
 
@@ -599,7 +603,20 @@ Steps:
 
 
 
-#. Add a new Nextflow profile in ``nextflow.config`` of the ``vsn-pipelines`` repository
+#. Add a new Nextflow profile in the ``profiles`` section of the main ``nextflow.config`` of the ``vsn-pipelines`` repository:
+
+    .. code:: groovy
+
+        profiles {
+
+            harmony {
+                includeConfig 'src/scanpy/scanpy.config'
+                includeConfig 'src/harmony/harmony.config'
+            }
+            ...
+        }
+
+#. Finally add a new entry in ``main.nf`` of the ``vsn-pipelines`` repository
 
     .. code:: groovy
 
@@ -624,29 +641,14 @@ Steps:
 
         }
 
-#. Finally add a new entry in main.nf of the ``vsn-pipelines`` repository
+    You should now be able to configure (``nextflow config ...``) and run the ``harmony`` pipeline (``nextflow run ...``).
 
-    .. code:: groovy
-
-        harmony {
-            includeConfig 'src/scanpy/scanpy.config'
-            includeConfig 'src/harmony/harmony.config'
-        }
-
-    You should now be able to configure (``nextflow config``) and run the ``harmony`` pipeline (``nextflow run``).
-
-#. After confirming that your module is functional, you should merge your changes in the tool repo into the ``master`` branch.
+#. After confirming that your module is functional, you should create a pull request to merge your changes into the ``develop`` branch.
 
     - Make sure you have removed all references to ``TEMPLATE`` in your repository
     - Include some basic documentation for your module so people know what it does and how to use it.
 
-#. Once merged into ``master`` you should update the submodule in the ``vsn-pipelines`` repo to point to the correct branch
-
-    .. code:: bash
-
-        git submodule set-branch --default src/harmony
-
-#. Finally, add your new and updated files alongside the updated ``.gitmodules`` file and ``src/harmony`` files to a new commit and submit a pull request on the ``vsn-pipelines`` repo to have your new module integrated.
+   The pull request will be reviewed and accepted once it is confirmed to be working. Once the ``develop`` branch is merged into ``master``, the new tool will be part of the new release of VSN Pipelines!
 
 Repository structure
 --------------------

diff --git a/docs/features.rst b/docs/features.rst
@@ -55,14 +55,14 @@ Finally run the pipeline,
 
 Set the seed
 ------------
-Some steps in the pipelines are nondeterministic. In order to have reproducible results, a seed is set by default to:
+Some steps in the pipelines are non-deterministic. In order to have reproducible results, a seed is set by default to:
 
 .. code:: groovy
 
     workflow.manifest.version.replaceAll("\\.","").toInteger()
 
-The seed is a number derived from the the version of the pipeline used at the time of the analysis run.
-To override the seed (integer) you have edit the nextflow.config file with:
+The seed is a number derived from the version of the pipeline used at the time of the analysis run.
+To override the seed (integer), you have to edit the ``nextflow.config`` file with:
 
 .. code:: groovy
 
@@ -154,19 +154,19 @@ Two methods (``params.sc.cell_annotate.method``) are available:
 
 If you have a single file containing the metadata information of all your samples, use ``aio`` method otherwise use ``obo``.
 
-For both methods, here are the mandatory params to set:
+For both methods, here are the mandatory parameters to set:
 
 - ``off`` should be set to ``h5ad``
 - ``method`` choose either ``obo`` or ``aio``
 - ``annotationColumnNames`` is an array of column names from ``cellMetaDataFilePath`` containing different annotation metadata to add.
 
-If ``aio`` used, the following additional params are required:
+If ``aio`` is used, the following additional parameters are required:
 
 - ``cellMetaDataFilePath`` is a file path pointing to a single .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
 - ``indexColumnName`` is the column name from ``cellMetaDataFilePath`` containing the cell IDs information. This column **can** have unique values; if it's not the case, it's important that the combination of the values from the ``indexColumnName`` and the ``sampleColumnName`` are unique. 
-- ``sampleColumnName`` is the column name from ``cellMetaDataFilePath`` containing the sample ID/name information. Make sur that the values from this column match the samples IDs inferred from the data files. To know how those are inferred, please read the `Input Data Formats`_ section.
+- ``sampleColumnName`` is the column name from ``cellMetaDataFilePath`` containing the sample ID/name information. Make sure that the values from this column match the sample IDs inferred from the data files. To learn how those are inferred, please read the `Input Data Formats`_ section.
 
-If ``obo`` is used, the following params are required:
+If ``obo`` is used, the following parameters are required:
 
 - ``cellMetaDataFilePath``
 
@@ -267,7 +267,7 @@ Two methods (``params.sc.cell_filter.method``) are available:
 
 If you have a single file containing the metadata information of all your samples, use ``external`` method otherwise use ``internal``.
 
-For both methods, here are the mandatory params to set:
+For both methods, here are the mandatory parameters to set:
 
 - ``off`` should be set to ``h5ad``
 - ``method`` choose either ``internal`` or ``external``
@@ -276,20 +276,20 @@ For both methods, here are the mandatory params to set:
   - ``id`` is a short identifier for the filter
   - ``valuesToKeepFromFilterColumn`` is an array of values from the ``filterColumnName`` that should be kept (other values will be filtered out).
 
-If ``internal`` used, the following additional params are required:
+If ``internal`` is used, the following additional parameters are required:
 
 - ``filters`` is a List of Maps where each Map is required to have the following parameters:
 
   - ``sampleColumnName`` is the column name containing the sample ID/name information. It should exist in the ``obs`` column attribute of the h5ad.
   - ``filterColumnName`` is the column name that will be used to filter out cells.  It should exist in the ``obs`` column attribute of the h5ad.
 
-If ``external`` used, the following additional params are required:
+If ``external`` is used, the following additional parameters are required:
 
 - ``filters`` is a List of Maps where each Map is required to have the following parameters:
 
   - ``cellMetaDataFilePath`` is a file path pointing to a single .tsv file (with header) with at least 3 columns: a column containing all the cell IDs, another containing the sample ID/name information, and a column to use for the filtering.
   - ``indexColumnName`` is the column name from ``cellMetaDataFilePath`` containing the cell IDs information. This column **must** have unique values. 
-  - `optional` ``sampleColumnName`` is the column name from ``cellMetaDataFilePath`` containing the sample ID/name information. Make sur that the values from this column match the samples IDs inferred from the data files. To know how those are inferred, please read the `Input Data Formats`_ section.
+  - `optional` ``sampleColumnName`` is the column name from ``cellMetaDataFilePath`` containing the sample ID/name information. Make sure that the values from this column match the sample IDs inferred from the data files. To learn how those are inferred, please read the `Input Data Formats`_ section.
   - `optional` ``filterColumnName`` is the column name from ``cellMetaDataFilePath`` which will be used to filter out cells.
 
 
@@ -348,8 +348,8 @@ If you want to apply custom parameters for some specific samples and have a "gen
         }
     }
 
-Using this config, the param ``params.sc.scanpy.cellFilterMinNGenes`` will be applied with a threshold value of ``600`` to ``1k_pbmc_v2_chemistry``.  The rest of the samples will use the value ``800`` to filter the cells having less than that number of genes.
-This strategy can be applied to any other paramameter of the config.
+Using this config, the parameter ``params.sc.scanpy.cellFilterMinNGenes`` will be applied with a threshold value of ``600`` to ``1k_pbmc_v2_chemistry``.  The rest of the samples will use the value ``800`` to filter the cells having less than that number of genes.
+This strategy can be applied to any other parameter of the config.
 
 
 Parameter exploration
@@ -437,4 +437,4 @@ The following command, will create a Nextflow config which the pipeline will und
        -profile min,[data-profile],scanpy_data_transformation,scanpy_normalization,[...],singularity > nextflow.config
 
 - ``[data-profile]``: Can be one of the different possible data profiles e.g.: ``h5ad``
-- ``[...]``: Can be other profiles like ``bbknn``, ``harmony``, ``pcacv``, ...
+- ``[...]``: Can be other profiles like ``bbknn``, ``harmony``, ``pcacv``, ...
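
Review note on the ``aio`` requirements documented in ``docs/features.rst``: the text states that when ``indexColumnName`` values are not unique on their own, the combination of ``indexColumnName`` and ``sampleColumnName`` must be unique. A minimal Python sketch of a pre-flight check for that rule follows. It assumes a tab-separated metadata file with a header row; the function name ``find_duplicate_cell_keys`` and the column names and rows in the example are hypothetical and are not part of VSN-Pipelines.

```python
import csv
import io
from collections import Counter


def find_duplicate_cell_keys(tsv_text, index_column, sample_column):
    """Return the (cell ID, sample) pairs that occur more than once."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter((row[index_column], row[sample_column]) for row in reader)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical metadata: the same barcode appears in two samples, which is
# acceptable because each (cell ID, sample) combination is still unique.
example = (
    "barcode\tsample\tcell_type\n"
    "AAAC-1\tsampleA\tT cell\n"
    "AAAC-1\tsampleB\tB cell\n"
    "AAAG-1\tsampleA\tNK cell\n"
)
duplicates = find_duplicate_cell_keys(example, "barcode", "sample")
print(duplicates)  # an empty list means every combination is unique
```

Running such a check before launching the annotation workflow would surface ambiguous rows early, rather than during the pipeline run.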
diff --git a/docs/getting-started.rst b/docs/getting-started.rst
@@ -126,6 +126,6 @@ The pipelines will generate 3 types of results in the output directory (`params.
 
     - See the example output report from the 1k PBMC data `here <http://htmlpreview.github.io/?https://github.com/vib-singlecell-nf/vsn-pipelines/blob/master/notebooks/10x_PBMC.merged_report.html>`_
 
-- ``pipeline_reports``: nextflow dag, execution, timeline, and trace reports
+- ``pipeline_reports``: Nextflow dag, execution, timeline, and trace reports
 
 If you would like to use the pipelines on a custom dataset, please see the `pipelines <./pipelines.html>`_ section below.
diff --git a/docs/pipelines.rst b/docs/pipelines.rst
@@ -8,7 +8,7 @@ This pipeline can be configured and run on custom data with a few steps.
 The recommended method is to first run ``nextflow config ...`` to generate a complete config file (with the default parameters) in your working directory.
 The tool-specific parameters, as well as Docker/Singularity profiles, are included when specifying the appropriate profiles to ``nextflow config``.
 
-1. First, update to the latest pipeline version (this will update the nextflow cache of the repository, typically located in ``~/.nextflow/assets/vib-singlecell-nf/``)::
+1. First, update to the latest pipeline version (this will update the Nextflow cache of the repository, typically located in ``~/.nextflow/assets/vib-singlecell-nf/``)::
 
     nextflow pull vib-singlecell-nf/vsn-pipelines
 
@@ -502,14 +502,14 @@ The output is a loom file with the results embedded.
 Utility Pipelines
 *****************
 
-Contrary to the aformentioned pipelines, these are not end-to-end. They are used to perfom small incremental processing steps.
+Contrary to the aforementioned pipelines, these are not end-to-end. They are used to perform small incremental processing steps.
 
 **cell_annotate**
 -----------------
 
 Runs the ``cell_annotate`` workflow which will perform a cell-based annotation of the data using a set of provided .tsv metadata files.
 We show a use case below with 10x Genomics data where it will annotate different samples using the ``obo`` method. For more information
-about this cell-based annotation feautre please visit `Cell-based metadata annotation`_ section.
+about this cell-based annotation feature, please visit the `Cell-based metadata annotation`_ section.
 
 .. _`Cell-based metadata annotation`: https://vsn-pipelines.readthedocs.io/en/latest/features.html#cell-based-metadata-annotation
 
@@ -561,7 +561,7 @@ Now we can run it with the following command:
 
 Runs the ``cell_annotate_filter`` workflow which will perform a cell-based annotation of the data using a set of provided .tsv metadata files followed by a cell-based filtering.
 We show a use case below with 10x Genomics data where it will annotate different samples using the ``obo`` method. For more information
-about this cell-based annotation feautre please visit `Cell-based metadata annotation`_ section and `Cell-based metadata filtering`_ section.
+about this cell-based annotation feature, please visit the `Cell-based metadata annotation`_ section and the `Cell-based metadata filtering`_ section.
 
 .. _`Cell-based metadata filtering`: https://vsn-pipelines.readthedocs.io/en/latest/features.html#cell-based-metadata-filtering
 
@@ -752,7 +752,7 @@ In the generated .config file, make sure the ``file_paths`` parameter is set wit
 
 - The ``suffix`` parameter is used to infer the sample name from the file paths (it is removed from the input file path to derive a sample name).
 
-In case there are multiple .h5ad files that need to be processed with different suffixes, the multi-labelled strategy should be used to define the h5ad param::
+In case there are multiple .h5ad files that need to be processed with different suffixes, the multi-labelled strategy should be used to define the h5ad parameter::
 
     [...]
     data {
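
Review note on the ``suffix`` parameter described in ``docs/pipelines.rst``: the docs say the suffix is removed from the input file path to derive a sample name. The Python sketch below illustrates that idea only. It assumes the name is taken from the final path component, which the docs do not state explicitly; the function ``infer_sample_name`` and the example path are hypothetical and do not reproduce the pipeline's actual implementation.

```python
from pathlib import PurePosixPath


def infer_sample_name(file_path, suffix):
    """Strip `suffix` from the file name; fail loudly if it does not match."""
    # Assumption: only the last path component contributes to the sample name.
    name = PurePosixPath(file_path).name
    if not name.endswith(suffix):
        raise ValueError(f"{file_path!r} does not end with suffix {suffix!r}")
    return name[: len(name) - len(suffix)]


# Hypothetical input file and suffix.
sample = infer_sample_name("data/sampleA_processed.h5ad", "_processed.h5ad")
print(sample)  # sampleA
```

Files whose names end with different suffixes cannot share one ``suffix`` value, which is why the docs point to the multi-labelled strategy for that case.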

diff --git a/nextflow.config b/nextflow.config
@@ -3,7 +3,7 @@ manifest {
     name = 'vib-singlecell-nf/vsn-pipelines'
     description = 'A repository of pipelines for single-cell data in Nextflow DSL2'
     homePage = 'https://github.com/vib-singlecell-nf/vsn-pipelines'
-    version = '0.23.0'
+    version = '0.24.0'
     mainScript = 'main.nf'
     defaultBranch = 'master'
     nextflowVersion = '!20.04.1' // with ! prefix, stop execution if current version does not match required version.
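
Review note on the version bump: ``docs/features.rst`` documents the default seed as ``workflow.manifest.version.replaceAll("\\.","").toInteger()``, so changing ``version`` from ``0.23.0`` to ``0.24.0`` also changes the default seed. The Python sketch below mirrors that expression for illustration; the function name ``default_seed`` is hypothetical, and the Groovy expression in the docs remains the reference.

```python
def default_seed(manifest_version):
    """Drop the dots from the version string and parse the rest as a decimal integer."""
    # Python equivalent of the Groovy expression replaceAll + toInteger.
    return int(manifest_version.replace(".", ""))


old_seed = default_seed("0.23.0")
new_seed = default_seed("0.24.0")
print(old_seed, new_seed)  # 230 240
```

Results from non-deterministic steps may therefore differ between runs made with 0.23.0 and 0.24.0 unless the seed is overridden as described in ``docs/features.rst``.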