From fa5c192c652ede170a124023ea0ae0178d047c3e Mon Sep 17 00:00:00 2001 From: Chris Flerin Date: Fri, 11 Dec 2020 12:03:04 +0100 Subject: [PATCH 1/5] Doc updates - typo fixes, edit for clarity --- README.rst | 4 +++- docs/development.rst | 4 ++-- docs/features.rst | 28 ++++++++++++++-------------- docs/getting-started.rst | 2 +- docs/pipelines.rst | 10 +++++----- 5 files changed, 25 insertions(+), 23 deletions(-) diff --git a/README.rst b/README.rst index 52ab9363..4be64d21 100644 --- a/README.rst +++ b/README.rst @@ -8,7 +8,8 @@ A repository of pipelines for single-cell data analysis in Nextflow DSL2. **Full documentation** is available on `Read the Docs `_, or take a look at the `Quick Start `_ guide. -This main repo contains multiple workflows for analyzing single cell transcriptomics data, and depends on a number of tools, which are organized into submodules within the VIB-Singlecell-NF_ organization. +This main repo contains multiple workflows for analyzing single cell transcriptomics data, and depends on a number of tools, which are organized into subfolders within the ``src/`` directory. +The VIB-Singlecell-NF_ organization contains this main repo along with a collection of example runs (`VSN-Pipelines-examples `_). Currently available workflows are listed below. If VSN-Pipelines is useful for your research, consider citing: @@ -109,6 +110,7 @@ Sample Aggregation Workflows --- + In addition, the pySCENIC_ implementation of the SCENIC_ workflow is integrated here and can be run in conjunction with any of the above workflows. The output of each of the main workflows is a loom_-format file, which is ready for import into the interactive single-cell web visualization tool SCope_. In addition, data is also output in h5ad format, and reports are generated for the major pipeline steps. diff --git a/docs/development.rst b/docs/development.rst index 2cda7cbb..67d0e939 100644 --- a/docs/development.rst +++ b/docs/development.rst @@ -83,7 +83,7 @@ Steps: #. 
Update the ``nextflow.config`` file to create the ``harmony.config`` configuration file. - * Each process's options should be in their own level. With a single proccess, you do not need one extra level. + * Each process's options should be in their own level. With a single process, you do not need one extra level. .. code:: dockerfile @@ -624,7 +624,7 @@ Steps: } -#. Finally add a new entry in main.nf of the ``vsn-pipelines`` repository +#. Finally add a new entry in ``main.nf`` of the ``vsn-pipelines`` repository .. code:: groovy diff --git a/docs/features.rst b/docs/features.rst index 37a900bd..e6fa68cc 100644 --- a/docs/features.rst +++ b/docs/features.rst @@ -55,14 +55,14 @@ Finally run the pipeline, Set the seed ------------ -Some steps in the pipelines are nondeterministic. In order to have reproducible results, a seed is set by default to: +Some steps in the pipelines are non-deterministic. In order to have reproducible results, a seed is set by default to: .. code:: groovy workflow.manifest.version.replaceAll("\\.","").toInteger() -The seed is a number derived from the the version of the pipeline used at the time of the analysis run. -To override the seed (integer) you have edit the nextflow.config file with: +The seed is a number derived from the version of the pipeline used at the time of the analysis run. +To override the seed (integer) you have edit the ``nextflow.config`` file with: .. code:: groovy @@ -154,19 +154,19 @@ Two methods (``params.sc.cell_annotate.method``) are available: If you have a single file containing the metadata information of all your samples, use ``aio`` method otherwise use ``obo``. -For both methods, here are the mandatory params to set: +For both methods, here are the mandatory parameters to set: - ``off`` should be set to ``h5ad`` - ``method`` choose either ``obo`` or ``aio`` - ``annotationColumnNames`` is an array of columns names from ``cellMetaDataFilePath`` containing different annotation metadata to add. 
-If ``aio`` used, the following additional params are required: +If ``aio`` used, the following additional parameters are required: - ``cellMetaDataFilePath`` is a file path pointing to a single .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column. - ``indexColumnName`` is the column name from ``cellMetaDataFilePath`` containing the cell IDs information. This column **can** have unique values; if it's not the case, it's important that the combination of the values from the ``indexColumnName`` and the ``sampleColumnName`` are unique. -- ``sampleColumnName`` is the column name from ``cellMetaDataFilePath`` containing the sample ID/name information. Make sur that the values from this column match the samples IDs inferred from the data files. To know how those are inferred, please read the `Input Data Formats`_ section. +- ``sampleColumnName`` is the column name from ``cellMetaDataFilePath`` containing the sample ID/name information. Make sure that the values from this column match the samples IDs inferred from the data files. To know how those are inferred, please read the `Input Data Formats`_ section. -If ``obo`` is used, the following params are required: +If ``obo`` is used, the following parameters are required: - ``cellMetaDataFilePath`` @@ -267,7 +267,7 @@ Two methods (``params.sc.cell_filter.method``) are available: If you have a single file containing the metadata information of all your samples, use ``external`` method otherwise use ``internal``. 
-For both methods, here are the mandatory params to set: +For both methods, here are the mandatory parameters to set: - ``off`` should be set to ``h5ad`` - ``method`` choose either ``internal`` or ``external`` @@ -276,20 +276,20 @@ For both methods, here are the mandatory params to set: - ``id`` is a short identifier for the filter - ``valuesToKeepFromFilterColumn`` is array of values from the ``filterColumnName`` that should be kept (other values will be filtered out). -If ``internal`` used, the following additional params are required: +If ``internal`` used, the following additional parameters are required: - ``filters`` is a List of Maps where each Map is required to have the following parameters: - ``sampleColumnName`` is the column name containing the sample ID/name information. It should exist in the ``obs`` column attribute of the h5ad. - ``filterColumnName`` is the column name that will be used to filter out cells. It should exist in the ``obs`` column attribute of the h5ad. -If ``external`` used, the following additional params are required: +If ``external`` used, the following additional parameters are required: - ``filters`` is a List of Maps where each Map is required to have the following parameters: - ``cellMetaDataFilePath`` is a file path pointing to a single .tsv file (with header) with at least 3 columns: a column containing all the cell IDs, another containing the sample ID/name information, and a column to use for the filtering. - ``indexColumnName`` is the column name from ``cellMetaDataFilePath`` containing the cell IDs information. This column **must** have unique values. - - `optional` ``sampleColumnName`` is the column name from ``cellMetaDataFilePath`` containing the sample ID/name information. Make sur that the values from this column match the samples IDs inferred from the data files. To know how those are inferred, please read the `Input Data Formats`_ section. 
+ - `optional` ``sampleColumnName`` is the column name from ``cellMetaDataFilePath`` containing the sample ID/name information. Make sure that the values from this column match the samples IDs inferred from the data files. To know how those are inferred, please read the `Input Data Formats`_ section. - `optional` ``filterColumnName`` is the column name from ``cellMetaDataFilePath`` which be used to filter out cells. @@ -348,8 +348,8 @@ If you want to apply custom parameters for some specific samples and have a "gen } } -Using this config, the param ``params.sc.scanpy.cellFilterMinNGenes`` will be applied with a threshold value of ``600`` to ``1k_pbmc_v2_chemistry``. The rest of the samples will use the value ``800`` to filter the cells having less than that number of genes. -This strategy can be applied to any other paramameter of the config. +Using this config, the parameter ``params.sc.scanpy.cellFilterMinNGenes`` will be applied with a threshold value of ``600`` to ``1k_pbmc_v2_chemistry``. The rest of the samples will use the value ``800`` to filter the cells having less than that number of genes. +This strategy can be applied to any other parameter of the config. Parameter exploration @@ -437,4 +437,4 @@ The following command, will create a Nextflow config which the pipeline will und -profile min,[data-profile],scanpy_data_transformation,scanpy_normalization,[...],singularity > nextflow.config - ``[data-profile]``: Can be one of the different possible data profiles e.g.: ``h5ad`` -- ``[...]``: Can be other profiles like ``bbknn``, ``harmony``, ``pcacv``, ... \ No newline at end of file +- ``[...]``: Can be other profiles like ``bbknn``, ``harmony``, ``pcacv``, ... diff --git a/docs/getting-started.rst b/docs/getting-started.rst index c91e2eae..6c8cd24b 100644 --- a/docs/getting-started.rst +++ b/docs/getting-started.rst @@ -126,6 +126,6 @@ The pipelines will generate 3 types of results in the output directory (`params. 
- See the example output report from the 1k PBMC data `here `_ -- ``pipeline_reports``: nextflow dag, execution, timeline, and trace reports +- ``pipeline_reports``: Nextflow dag, execution, timeline, and trace reports If you would like to use the pipelines on a custom dataset, please see the `pipelines <./pipelines.html>`_ section below. diff --git a/docs/pipelines.rst b/docs/pipelines.rst index 8e69cc6e..9178c23f 100644 --- a/docs/pipelines.rst +++ b/docs/pipelines.rst @@ -8,7 +8,7 @@ This pipeline can be configured and run on custom data with a few steps. The recommended method is to first run ``nextflow config ...`` to generate a complete config file (with the default parameters) in your working directory. The tool-specific parameters, as well as Docker/Singularity profiles, are included when specifying the appropriate profiles to ``nextflow config``. -1. First, update to the latest pipeline version (this will update the nextflow cache of the repository, typically located in ``~/.nextflow/assets/vib-singlecell-nf/``):: +1. First, update to the latest pipeline version (this will update the Nextflow cache of the repository, typically located in ``~/.nextflow/assets/vib-singlecell-nf/``):: nextflow pull vib-singlecell-nf/vsn-pipelines @@ -502,14 +502,14 @@ The output is a loom file with the results embedded. Utility Pipelines ***************** -Contrary to the aformentioned pipelines, these are not end-to-end. They are used to perfom small incremental processing steps. +Contrary to the aformentioned pipelines, these are not end-to-end. They are used to perform small incremental processing steps. **cell_annotate** ----------------- Runs the ``cell_annotate`` workflow which will perform a cell-based annotation of the data using a set of provided .tsv metadata files. We show a use case here below with 10x Genomics data were it will annotate different samples using the ``obo`` method. 
For more information -about this cell-based annotation feautre please visit `Cell-based metadata annotation`_ section. +about this cell-based annotation feature please visit `Cell-based metadata annotation`_ section. .. _`Cell-based metadata annotation`: https://vsn-pipelines.readthedocs.io/en/latest/features.html#cell-based-metadata-annotation @@ -561,7 +561,7 @@ Now we can run it with the following command: Runs the ``cell_annotate_filter`` workflow which will perform a cell-based annotation of the data using a set of provided .tsv metadata files following by a cell-based filtering. We show a use case here below with 10x Genomics data were it will annotate different samples using the ``obo`` method. For more information -about this cell-based annotation feautre please visit `Cell-based metadata annotation`_ section and `Cell-based metadata filtering`_ section. +about this cell-based annotation feature please visit `Cell-based metadata annotation`_ section and `Cell-based metadata filtering`_ section. .. _`Cell-based metadata filtering`: https://vsn-pipelines.readthedocs.io/en/latest/features.html#cell-based-metadata-filtering @@ -752,7 +752,7 @@ In the generated .config file, make sure the ``file_paths`` parameter is set wit - The ``suffix`` parameter is used to infer the sample name from the file paths (it is removed from the input file path to derive a sample name). -In case there are multiple .h5ad files that need to be processed with different suffixes, the multi-labelled strategy should be used to define the h5ad param:: +In case there are multiple .h5ad files that need to be processed with different suffixes, the multi-labelled strategy should be used to define the h5ad parameter:: [...] 
data { From b9bd605e7eb05436d44decd259be6d213098865f Mon Sep 17 00:00:00 2001 From: Chris Flerin Date: Fri, 11 Dec 2020 13:04:25 +0100 Subject: [PATCH 2/5] Updated dev docs to remove submodule references --- docs/development.rst | 80 +++++++++++++++++++++++--------------------- 1 file changed, 41 insertions(+), 39 deletions(-) diff --git a/docs/development.rst b/docs/development.rst index 67d0e939..9018ff90 100644 --- a/docs/development.rst +++ b/docs/development.rst @@ -4,6 +4,8 @@ Development Create module ------------- +Tool-based modules are located in ``src/``, and each module has a specific structure for scripts and Nextflow processes (see `Repository structure`_ below). + Case study: Add `Harmony` ************************* @@ -19,40 +21,42 @@ Links: Steps: -#. Ask the `VIB-SingleCell-NF` administrators to create a new repository (in this case: ``harmony``) or create one on your GitHub account that could be brought into the `VIB-SingleCell-NF` organization. - - When using your own repo, you MUST start from the `template repository`_ in the vib-singlecell-nf organisation. Click the green "Use this template" button and provide a name for your new repo. Make sure the "Include all branches" checkbox is checked. - - .. _`template repository`: https://github.com/vib-singlecell-nf/template - #. Create a new issue on ``vsn-pipelines`` GitHub repository explaining which module you are going to add (e.g.: `Add Harmony batch correction method`). - -#. `Fork the`_ ``vsn-pipelines`` repository to your own GitHub account. +#. `Fork the`_ ``vsn-pipelines`` repository to your own GitHub account (if you are an external collaborator). .. _`Fork the`: https://help.github.com/en/github/getting-started-with-github/fork-a-repo -#. From your ``vsn-pipelines`` GitHub repository, create a new branch called ``feature/[github-issue-id]-[description]``. +#. 
From your local copy of ``vsn-pipelines`` GitHub repository, create a new branch called ``feature/[github-issue-id]-[description]``. In this case, - ``[github-issue-id] = 115`` - ``[description] = add_harmony_batch_correction_method`` + It is highly recommended to start from the ``develop`` branch: + .. code:: bash + git checkout develop + git fetch + git pull git checkout -b feature/115-add_harmony_batch_correction_method -#. From within the ``src`` directory of the ``vsn-pipelines`` repo, run the ``add_new_submodule.sh`` script. +#. Use the `template repository`_ in the vib-singlecell-nf organisation to create the framework for the new module in ``src/``: .. code:: bash - ./add_new_submodule.sh [git-repo-url] -d + git clone --depth=1 https://github.com/vib-singlecell-nf/template.git src/harmony + + .. _`template repository`: https://github.com/vib-singlecell-nf/template + +#. Now, you can start to edit file in the tool module that is now located in ``src/``. + Optionally, you can delete the ``.git`` directory in the new module to avoid confusion in future local development: - ``[git-repo-url]`` = https://github.com/vib-singlecell-nf/harmony.git (Git Repository URL from `VSN-SingleCell-NF` or from your GitHub account) - ``-d`` tracks the develop branch of the new repository, which is where you should work until the module is working. + .. code:: bash - If you are using VSCode and you don't see the new submodule appearing in ``SOURCE CONTROL PROVIDERS``, open any file from ``src/harmony`` (e.g.: LICENSE) + rm -rf src/harmony/.git #. Create the Dockerfile recipe @@ -81,11 +85,11 @@ Steps: apt-get clean -#. Update the ``nextflow.config`` file to create the ``harmony.config`` configuration file. +#. Rename the ``nextflow.config`` file to create the ``harmony.config`` configuration file. * Each process's options should be in their own level. With a single process, you do not need one extra level. - .. code:: dockerfile + .. 
code:: groovy params { sc { @@ -225,7 +229,7 @@ Steps: -#. Create the Nextflow process that will run the Harmony R script defined in 7. +#. Create the Nextflow process that will run the Harmony R script defined in the previous step. .. code:: groovy @@ -260,9 +264,9 @@ Steps: } -#. Create a Nextflow module that will call the Nextflow process defined in 8. and perform some other tasks (dimensionality reduction, cluster identification, marker genes identification and report generation) +#. Create a Nextflow "subworkflow" that will call the Nextflow process defined in the previous step and perform some other tasks (dimensionality reduction, cluster identification, marker genes identification and report generation) - This step is not required. However it this step is skipped, the code would still need to added into the main ``harmony`` workflow (`workflows/harmony.nf`, see step 10) + This step is not required. However it this step is skipped, the code would still need to added into the main ``harmony`` workflow (`workflows/harmony.nf`, see the next step) .. code:: groovy @@ -408,7 +412,7 @@ Steps: } -#. In the ``vsn-pipelines``, create a new main workflow called ``harmony.nf`` under ``workflows`` +#. In the ``vsn-pipelines``, create a new main workflow called ``harmony.nf`` under ``workflows/``: .. code:: groovy @@ -599,7 +603,20 @@ Steps: -#. Add a new Nextflow profile in ``nextflow.config`` of the ``vsn-pipelines`` repository +#. Add a new Nextflow profile in the ``profiles`` section of the main ``nextflow.config`` of the ``vsn-pipelines`` repository: + + .. code:: groovy + + profiles { + + harmony { + includeConfig 'src/scanpy/scanpy.config' + includeConfig 'src/harmony/harmony.config' + } + ... + } + +#. Finally add a new entry in ``main.nf`` of the ``vsn-pipelines`` repository .. code:: groovy @@ -624,29 +641,14 @@ Steps: } -#. Finally add a new entry in ``main.nf`` of the ``vsn-pipelines`` repository - - .. 
code:: groovy - - harmony { - includeConfig 'src/scanpy/scanpy.config' - includeConfig 'src/harmony/harmony.config' - } - - You should now be able to configure (``nextflow config``) and run the ``harmony`` pipeline (``nextflow run``). + You should now be able to configure (``nextflow config ...``) and run the ``harmony`` pipeline (``nextflow run ...``). -#. After confirming that your module is functional, you should merge your changes in the tool repo into the ``master`` branch. +#. After confirming that your module is functional, you should create a pull request to merge your changes into the ``develop`` branch. - Make sure you have removed all references to ``TEMPLATE`` in your repository - Include some basic documentation for your module so people know what it does and how to use it. -#. Once merged into ``master`` you should update the submodule in the ``vsn-pipelines`` repo to point to the correct branch - - .. code:: bash - - git submodule set-branch --default src/harmony - -#. Finally, add your new and updated files alongside the updated ``.gitmodules`` file and ``src/harmony`` files to a new commit and submit a pull request on the ``vsn-pipelines`` repo to have your new module integrated. + The pull request will be reviewed and accepted once it is confirmed to be working. Once the ``develop`` branch is merged into ``master``, the new tool will be part of the new release of VSN Pipelines! 
 Repository structure
 --------------------

From db0f5e918b45e289ad76fd2e82b91d96536822f9 Mon Sep 17 00:00:00 2001
From: Chris Flerin
Date: Fri, 11 Dec 2020 13:06:57 +0100
Subject: [PATCH 3/5] Add .gitignore

---
 .gitignore | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
 create mode 100644 .gitignore

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 00000000..2d12e372
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,31 @@
+*checkpoint.ipynb
+*checkpoint*
+*checkpoint.py
+*.test.ipynb
+*.csv
+*.loom
+*.pickle
+*.pyc
+*.html
+*egg*
+.vscode
+.nextflow
+.nextflow*
+data
+refdata
+work
+out/notebooks
+src/scenic/out
+src/scenic/notebooks
+src/scenic/data
+refdata
+data/10x/tiny
+work/
+out/
+tests/
+debug/
+*.swp
+*.swo
+docs/_build/
+src/*/.git
+

From dc6975a19f1f8c89f271ed27f3ddea090060ca93 Mon Sep 17 00:00:00 2001
From: Chris Flerin
Date: Fri, 11 Dec 2020 13:07:17 +0100
Subject: [PATCH 4/5] Remove script to add submodule

---
 src/add_new_submodule.sh | 90 ----------------------------------------
 1 file changed, 90 deletions(-)
 delete mode 100755 src/add_new_submodule.sh

diff --git a/src/add_new_submodule.sh b/src/add_new_submodule.sh
deleted file mode 100755
index a8b5c5ab..00000000
--- a/src/add_new_submodule.sh
+++ /dev/null
@@ -1,90 +0,0 @@
-#!/bin/bash
-
-function usage {
-    echo "usage: $0 repository-url [-cdh] [-b branch/tag]"
-    echo " -b branch/tag link to branch/tag"
-    echo " -c add module and generate commit"
-    echo " -d link to develop branch"
-    echo " -h display help"
-    exit 1
-}
-
-if [[ $# -eq 0 ]]; then
-    usage
-fi
-
-if [ `basename $PWD` != 'src' ]; then
-    echo "ERROR: You must be in the vsn-pipelines src directory to run this script"
-    exit 1
-fi
-
-unset BRANCH
-unset COMMIT
-declare -a ARGS
-
-while [ $# -gt 0 ]
-do
-    unset OPTIND
-    unset OPTARG
-    while getopts ':b:cdh' c
-    do
-        case $c in
-            b) if [[ ! -z "$BRANCH" ]]; then
-                   echo "ERROR: Cannot specify -b and -d"
-                   exit 1
-               fi
-               BRANCH=$OPTARG
-               ;;
-            c) COMMIT=true
-               ;;
-            d) if [[ ! -z "$BRANCH" ]]; then
-                   echo "ERROR: Cannot specify -b and -d"
-                   exit 1
-               fi
-               BRANCH=develop
-               ;;
-            h) usage
-               ;;
-            :) echo "$0: -$OPTARG needs a value" >&2;
-               exit 1
-               ;;
-            \?) echo "$0: unknown option -$OPTARG" >&2;
-               exit 1
-               ;;
-        esac
-    done
-    shift $((OPTIND-1))
-    ARGS+=($1)
-    shift
-done
-
-URL=${ARGS[0]}
-REPO_NAME=`echo ${URL:0:-4} | sed 's!.*/!!'`
-
-if [[ $URL =~ http.*:\/\/github.com\/vib-singlecell-nf\/.*\.git ]]; then
-    echo "WARNING: ${URL} is a http(s) github repository! Using the the following SSH URL..."
-    URL=`echo ${URL} | sed 's!http.*github.com/!git@github.com:!'`
-    echo " ${URL}"
-fi
-
-if [[ ! $URL =~ git@github.com:vib-singlecell-nf\/.*\.git ]]; then
-    echo "ERROR: ${URL} is not a valid SSH address for a vib-singlecell-nf github repository! "
-    exit 1
-fi
-
-if [[ ! -z "$BRANCH" ]]; then
-    echo "Adding requested submodule on ${BRANCH} branch..."
-    git submodule add --branch $BRANCH $URL
-else
-    echo "Adding requested submodule..."
-    git submodule add $URL
-fi
-
-echo "Updating submodules..."
-git submodule update --init --recursive
-
-if [[ $COMMIT == 'true' ]]; then
-    echo '-c passed. Adding module and commiting.'
-    git add ./${REPO_NAME}
-    git commit -m "Add ${REPO_NAME} submodule"
-fi
\ No newline at end of file

From f8b01a925dd81ecf66bbfce3d0ca12eeef148e0d Mon Sep 17 00:00:00 2001
From: Chris Flerin
Date: Fri, 11 Dec 2020 13:28:08 +0100
Subject: [PATCH 5/5] Update version to 0.24.0

---
 nextflow.config | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/nextflow.config b/nextflow.config
index 67b1571f..5580b47e 100644
--- a/nextflow.config
+++ b/nextflow.config
@@ -3,7 +3,7 @@ manifest {
     name = 'vib-singlecell-nf/vsn-pipelines'
     description = 'A repository of pipelines for single-cell data in Nextflow DSL2'
     homePage = 'https://github.com/vib-singlecell-nf/vsn-pipelines'
-    version = '0.23.0'
+    version = '0.24.0'
     mainScript = 'main.nf'
     defaultBranch = 'master'
     nextflowVersion = '!20.04.1' // with ! prefix, stop execution if current version does not match required version.
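
Note on PATCH 5/5: ``docs/features.rst`` (touched in PATCH 1/5) states that the default seed is derived from the manifest version via ``workflow.manifest.version.replaceAll("\\.","").toInteger()``, so the version bump should also change the default seed. The sketch below only mirrors that expression in POSIX shell to make the effect visible; ``seed_from_version`` is a hypothetical helper that is not part of vsn-pipelines, and it assumes the Groovy integer parse simply ignores the leading zero of the digit string.

```shell
#!/bin/sh
# Sketch only: shell re-statement of the documented default-seed expression.
# seed_from_version is a hypothetical name used for illustration.
seed_from_version() {
    # Drop the dots ("0.24.0" -> "0240"), then strip leading zeros ("0240" -> "240"),
    # which is the value a decimal integer parse of the digit string would give.
    digits=$(printf '%s' "$1" | tr -d '.' | sed 's/^0*//')
    # An all-zero version would leave an empty string; report 0 in that case.
    printf '%s\n' "${digits:-0}"
}

# Default seed before and after the version bump in this patch series.
echo "0.23.0 -> $(seed_from_version 0.23.0)"
echo "0.24.0 -> $(seed_from_version 0.24.0)"
```

Under these assumptions, runs made with 0.23.0 and 0.24.0 would not share a default seed, so results meant to be compared across the two releases may need the seed overridden in ``nextflow.config`` as described in ``docs/features.rst``.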