Merge pull request #2 from JonathonMifsud/update-v1.0.4

Update v1.0.4
JonathonMifsud · Feb 14, 2024 · 71aa4e7 · 71aa4e7
2 parents e14c68a + d46bdce
commit 71aa4e7
Show file tree

Hide file tree

Showing 39 changed files with 1,269 additions and 509 deletions.
diff --git a/README.md b/README.md
@@ -23,6 +23,7 @@ NOTE: This pipeline relies on several databases, modules and taxonomy files that
     - [Storage](#storage)
   - [Troubleshooting](#troubleshooting)
   - [Installing Anaconda](#installing-anaconda)
+  - [Acknowledgments](#acknowledgments)
   - [How to cite this repo?](#how-to-cite-this-repo)
 
 --------------------
@@ -42,7 +43,7 @@ https://www.ibm.com/aspera/connect/
 
 Each general task you want to run is associated with a .sh (shell) and .pbs script. The .sh script works as a wrapper, passing parameters and variables to the .pbs script. After setting up, you usually don't need to edit the .pbs script.
 
-If you are unsure about what variables/files need to be passed to a script, refer to the .sh script.
+If you are unsure about what variables/files to use just call the help flag `-h` e.g., `_pipeline_download_sra.sh -h` (from verison 1.0.4 onwards)
 
 The scripts are designed to process batches, so they require a list of filenames to run.
 
@@ -58,14 +59,9 @@ The standard pipeline follows these steps:
 3. Check the that the raw reads have downloaded by looking in `/scratch/^your_root_project^/^your_project^/raw_reads` . You can use the `check_sra_downloads.sh` script to do this! Re-download any that are missing (make a new file with the accessions) 
 4. Run read trimming, assembly and calculate contig abundance e.g, `JCOM_pipeline_trim_assembly_abundance.sh`. Trimming is currently setup for TruSeq3 PE and SE illumania libs and will also trim nextera. Check that all contigs are non-zero in size in `/project/^your_root_project^/^your_project^/contigs/final_contigs/`
 5. Run blastxRdRp and blastxRVDB (these can be run simultaneously)
-6. Concatenate all the RVDB and RdRp contigs across all libraries using cat, etc.
-The reason I do this is that it is expensive to run NT / NR blasts for each contig file because the giant databases have to be loaded in each time. Instead you concatentate all of the blast contigs together and run it once. 
-E.g, `cat *_blastcontigs.fasta > combined.contigs.fa`
-7. Move `combined.contigs.fa` to the `/project/^your_root_project^/^your_project^/contigs/final_contigs/` i.e. the input location for blasts. 
-8. Create an input accession file containing a single line with the word `combined` 
-9. Run blastnr and blastnt using this input file. As you are running this on the combined contigs there should only be one subjob in the array! `JCOM_pipeline_blastnr.sh` `JCOM_pipeline_blastnt.sh`
-10. Run the readcount script `JCOM_pipeline_readcount.sh`
-11. Generate a summary table (Anaconda is needed - see below). The summary table script will create several files inside `/project/^your_root_project^/^your_project^/blast_results/summary_table_creation`. The csv files are the summary tables - if another format or summary would suit you best let me know and we can sit down and develop it. You can specify accessions if you only want to run the summary table on a subset of runs -f as normal. IMPORTANT check both the logs files generated in the logs folder `summary_table_creation_TODAY_stderr.txt` and `summary_table_creation_TODAY_stout.txt` as this will let you know if any of the inputs were missing etc. 
+6. Run blastnr and blastnt (these can be run simultaneously). Given an accession file this command will combine the blastcontigs from RdRp and RVDB and use it as input for nr and nt. As such you will notice there is only a single job ran for each instead of an array `JCOM_pipeline_blastnr.sh` `JCOM_pipeline_blastnt.sh`. Output is named after the -f file.
+7.  Run the readcount script `JCOM_pipeline_readcount.sh`
+8.  Generate a summary table (Anaconda is needed - see below). The summary table script will create several files inside `/project/^your_root_project^/^your_project^/blast_results/summary_table_creation`. The csv files are the summary tables - if another format or summary would suit you best let me know and we can sit down and develop it. You can specify accessions if you only want to run the summary table on a subset of runs -f as normal. IMPORTANT check both the logs files generated in the logs folder `summary_table_creation_TODAY_stderr.txt` and `summary_table_creation_TODAY_stout.txt` as this will let you know if any of the inputs were missing etc. 
 
 The large files e.g., raw and trimmed reads and abundance files are stored in `/scratch/` while the smaller files tend to be in /project/
 
@@ -99,7 +95,9 @@ Then to load it: `source ~/.bashrc`
 
 ### Common Flags
 
-Note: Flags can vary between scripts, so always check the individual `.sh` scripts. However, the common flags are as follows:
+Note: Flags can vary between scripts, If you are unsure about what variables/files to use just call the help flag `-h` e.g., `_pipeline_download_sra.sh -h` (from verison 1.0.4 onwards)
+
+However, the common flags are as follows:
 
 `-f` used to specify a file that contains the SRA run accessions to be processed. This option is followed by a string containing the complete path to a file containing accessions one per line. I typically store these files in `/project/your_root/your_project/accession_lists/`. If this option is not provided, most of the scripts in the pipeline will fail or excetue other behaviours (e.g., see the -f in `trim_assembly_abundance.sh`), as such I always recommmend setting the -f where able so you can better keep track of the libraries you are running. NOTE: The max number of SRAs I would put in a accession file is 1000. If you have more than this create two files and run the download_script twice. The limit is enforced by Artemis/PBS. 
 
@@ -173,12 +171,17 @@ Close and re-open your terminal. You should now have access to the conda command
 Conda will sometimes require more memory than the head node can provide causing memory issues when running `conda install` or `conda env create`. To get around this we can create a interactive environment using the following:
 `qsub -I -l select=1:ncpus=4:mem=20GB -l walltime=4:00:00 -M ^your_email^@uni.sydney.edu.au -P ^your_root_project^ -q defaultQ  -j oe`
 
-To create the environments run the following. Note the .yml can be found here or in the environments folder in your main dir (same level as the scripts / blast_results)
+To create the environments run the following:
 `conda env create -f /project/^your_root_project^/^your_project^/environments/ccmetagen_env.yml`
 `conda env create -f /project/^your_root_project^/^your_project^/environments/project_pipeline.yml`
 `conda env create -f /project/^your_root_project^/^your_project^/environments/r_env.yml`
 
 --------------------
 
+## Acknowledgments
+I'd like to acknowledge past and present members of the Holmes Lab for their contributions to the development of this pipeline. Extra special thanks goes to Susana Ortiz-Baez, Vince Costa, Erin Harvey, Lauren Lim, Mary Petrone and Sabrina Sadiq for suggestions and/or bug reports!
+
+--------------------
+
 ## How to cite this repo?
 If this repo was somehow useful a citation would be greatly appeciated! Please cite as Mifsud, J.C.O. (2023) BatchArtemisSRAMiner v1.X.X. Zenodo. https://doi.org/10.5281/zenodo.10020164. Available at: https://github.com/JonathonMifsud/BatchArtemisSRAMiner/ .You can also get a reference file if you click on the doi badge at the top of the repo or visit this link https://zenodo.org/record/8417951
diff --git a/docs/images/Screen Shot 2023-10-18 at 7.29.54 am.png b/docs/images/Screen Shot 2023-10-18 at 7.29.54 am.png
diff --git a/docs/images/Screen Shot 2023-10-18 at 7.30.30 am.png b/docs/images/Screen Shot 2023-10-18 at 7.30.30 am.png
diff --git a/docs/images/Screen Shot 2023-10-18 at 7.32.22 am.png b/docs/images/Screen Shot 2023-10-18 at 7.32.22 am.png
diff --git a/scripts/JCOM_pipeline_blastn_custom.pbs b/scripts/JCOM_pipeline_blastn_custom.pbs
@@ -8,8 +8,8 @@
 # script to blastn and extract positive virus contigs
 
 #PBS -N blastn_custom_array
-#PBS -l select=1:ncpus=6:mem=20GB
-#PBS -l walltime=24:00:00
+#PBS -l select=1:ncpus=12:mem=120GB
+#PBS -l walltime=48:00:00
 #PBS -M [email protected]
 #PBS -m abe
 
@@ -33,9 +33,8 @@ function blastToFasta {
     rm "$outpath"/"$library_id""_temp_contig_names.txt"
 }
 
-
+# Setting variables - you shouldn't really need to change these other than resource allocations
 library_id="$(basename -- "$input")"
-
 outpath="$wd"
 CPU=6
 

diff --git a/scripts/JCOM_pipeline_blastn_custom.sh b/scripts/JCOM_pipeline_blastn_custom.sh
@@ -16,7 +16,21 @@ queue="defaultQ"
 project="JCOM_pipeline_virome"
 root_project="jcomvirome"
 
-while getopts "i:d:p:r:" 'OPTKEY'; do
+show_help() {
+    echo "Usage: $0 [-i input] [-d db] [-h]"
+    echo "  -i input: Input fasta file to blast, provide the full path. (Required)"
+    echo "  -d db: Blast+ database for blastn. (Required)"
+    echo "  -h: Display this help message."
+    echo ""
+    echo "  Example:"
+    echo "  $0 -i /project/$root_project/$project/contigs/final_contigs/mylib.contigs.fa -d /scratch/VELAB/Databases/Blast/nt.^MONTH^-^YEAR^/nt"
+    echo ""
+    echo " Check the Github page for more information:"
+    echo " https://github.com/JonathonMifsud/BatchArtemisSRAMiner "
+    exit 1
+}
+
+while getopts "i:d:p:r:h" 'OPTKEY'; do
     case "$OPTKEY" in
     'i')
         #
@@ -34,47 +48,56 @@ while getopts "i:d:p:r:" 'OPTKEY'; do
         #
         root_project="$OPTARG"
         ;;
+    'h')
+        #
+        show_help
+        ;;    
     '?')
         echo "INVALID OPTION -- ${OPTARG}" >&2
-        exit 1
+        echo ""
+        show_help
         ;;
     ':')
         echo "MISSING ARGUMENT for option -- ${OPTARG}" >&2
-        exit 1
+        echo ""
+        show_help
         ;;
     *)
         # Handle invalid flags here
         echo "Invalid option: -$OPTARG" >&2
-        exit 1
+        echo ""
+        show_help
         ;;    
     esac
 done
 shift $((OPTIND - 1))
 
 if [ "$input" = "" ]; then
-    echo "No input string entered."
-    exit 1
+    echo "ERROR: No input string entered."
+    echo ""
+    show_help
 fi
 
 if [ "$db" = "" ]; then
-    echo "No database specified. Use -d option to specify the database. e.g., -d /scratch/VELAB/Databases/Blast/nt.Jul-2023/nt"
-    exit 1
+    echo "ERROR: No database specified. Use -d option to specify the database. e.g., -d /scratch/VELAB/Databases/Blast/nt.Jul-2023/nt"
+    echo ""
+    show_help
 fi
 
 if [ "$project" = "" ]; then
-    echo "No project string entered. Use e.g, -p JCOM_pipeline_virome"
+    echo "ERROR: No project string entered. Use e.g, -p JCOM_pipeline_virome"
     exit 1
 fi
 
 if [ "$root_project" = "" ]; then
-    echo "No root project string entered. Use e.g., -r VELAB or -r jcomvirome"
+    echo "ERROR: No root project string entered. Use e.g., -r VELAB or -r jcomvirome"
     exit 1
 fi
 
 input_basename=$(basename "$input")
 
-qsub -o "/project/$root_project/$project/logs/blastn_'$input_basename'_$(date '+%Y%m%d')_stout.txt" \
-    -e "/project/$root_project/$project/logs/blastn_'$input_basename'_$(date '+%Y%m%d')_stderr.txt" \
+qsub -o /project/"$root_project"/"$project"/logs/blastn_"$input_basename"_"$(date '+%Y%m%d')"_stout.txt \
+    -e /project/"$root_project"/"$project"/logs/blastn_"$input_basename"_"$(date '+%Y%m%d')"_stderr.txt \
     -v "input=$input,db=$db,wd=$wd" \
     -q "defaultQ" \
     -l "walltime=48:00:00" \

diff --git a/scripts/JCOM_pipeline_blastnr.pbs b/scripts/JCOM_pipeline_blastnr.pbs
@@ -18,34 +18,75 @@
 module load blast+
 module load diamond/2.1.6
 
+
+# concatenate all of the RdRp and RVDB blastcontigs into a file to blastx against the nr database
+function concatenateBlastContigs {
+    if [[ -z "$file_of_accessions" ]]; then
+        rdrp_blastcontigs_file=("$wd"/*_rdrp_blastcontigs.fasta)
+        rvdb_blastcontigs_file=("$wd"/*_RVDB_blastcontigs.fasta)
+    else
+        accession_ids=$(cat "$file_of_accessions")
+        rdrp_blastcontigs_file=()
+        rvdb_blastcontigs_file=()
+        missing_files=()
+
+        # Iterate over each accession ID and find corresponding files
+        for id in $accession_ids; do
+            rdrp_blastcontigs="$wd"/"${id}"_rdrp_blastcontigs.fasta
+            rvdb_blastcontigs="$wd"/"${id}"_RVDB_blastcontigs.fasta
+
+            # Check if the RdRp blastcontigs file exists
+            if [ -f "$rdrp_blastcontigs" ]; then
+                rdrp_blastcontigs_file+=("$rdrp_blastcontigs")
+            else
+                missing_files+=("RdRp blastcontigs file missing for accession ID: $id")
+            fi
+
+            # Check if the RVDB blastcontigs file exists
+            if [ -f "$rvdb_blastcontigs" ]; then
+                rvdb_blastcontigs_file+=("$rvdb_blastcontigs")
+            else
+                missing_files+=("RVDB blastcontigs file missing for accession ID: $id")
+            fi
+        done
+
+        # Print missing files if any
+        if [ ${#missing_files[@]} -gt 0 ]; then
+            echo "${missing_files[@]}"
+        fi
+
+        cat "${rdrp_blastcontigs_file[@]}" >"$wd"/"$file_of_accessions_name"_combined_rdrp_blastcontigs_forNR.fasta # Concatenate all RdRp blast contig files into one file
+        cat "${rvdb_blastcontigs_file[@]}" >"$wd"/"$file_of_accessions_name"_combined_RVDB_blastcontigs_forNR.fasta # Concatenate all RVDB blast contig files into one file
+        cat "$wd"/"$file_of_accessions_name"_combined_rdrp_blastcontigs_forNR.fasta "$wd"/"$file_of_accessions_name"_combined_RVDB_blastcontigs_forNR.fasta >"$inpath"/"$file_of_accessions_name"_blastcontigs_forNR.fasta 
+        rm "$wd"/"$file_of_accessions_name"_combined_rdrp_blastcontigs_forNR.fasta "$wd"/"$file_of_accessions_name"_combined_RVDB_blastcontigs_forNR.fasta
+    fi
+}
+
 # blastx
 function BlastxNR {
-    diamond blastx -q "$inpath"/"$library_id".contigs.fa -d "$db" -o "$outpath"/"$library_id"_nr_blastx_results.txt "$diamond_para" -f 6 qseqid qlen sseqid stitle staxids pident length evalue
+    diamond blastx -q "$inpath"/"$file_of_accessions_name"_blastcontigs_forNR.fasta -d "$db" -o "$outpath"/"$file_of_accessions_name"_nr_blastx_results.txt "$diamond_para" -f 6 qseqid qlen sseqid stitle staxids pident length evalue
 }
 
 #tool to extract contigs from assembly Blast to fasta
 function blastToFasta {
-    grep -i ".*" "$outpath"/"$library_id"_nr_blastx_results.txt | cut -f1 | uniq > "$outpath"/"$library_id""_temp_contig_names.txt" #by defult this will grab the contig name from every blast result line as I commonly use a custom protein database containing only viruses
-	grep -A1 -I -Ff "$outpath"/"$library_id""_temp_contig_names.txt" "$inpath"/"$library_id".contigs.fa > "$outpath"/"$library_id"_nr_blastcontigs.fasta
-    sed -i 's/--//' "$outpath"/"$library_id"_nr_blastcontigs.fasta # remove -- from the contigs
-    sed -i '/^[[:space:]]*$/d' "$outpath"/"$library_id"_nr_blastcontigs.fasta # remove the white space
-    sed --posix -i "/^\>/ s/$/"_$library_id"/" "$outpath"/"$library_id"_nr_blastcontigs.fasta # annotate the contigs
-    rm "$outpath"/"$library_id""_temp_contig_names.txt"
+    grep -i ".*" "$outpath"/"$file_of_accessions_name"_nr_blastx_results.txt | cut -f1 | uniq > "$outpath"/"$file_of_accessions_name""_temp_nr_contig_names.txt" #by defult this will grab the contig name from every blast result line as I commonly use a custom protein database containing only viruses
+	grep -A1 -I -Ff "$outpath"/"$file_of_accessions_name""_temp_nr_contig_names.txt" "$inpath"/"$file_of_accessions_name"_blastcontigs_forNR.fasta > "$outpath"/"$file_of_accessions_name"_nr_blastcontigs.fasta
+    sed -i 's/--//' "$outpath"/"$file_of_accessions_name"_nr_blastcontigs.fasta # remove -- from the contigs
+    sed -i '/^[[:space:]]*$/d' "$outpath"/"$file_of_accessions_name"_nr_blastcontigs.fasta # remove the white space
+    sed --posix -i "/^\>/ s/$/"_$file_of_accessions_name"/" "$outpath"/"$file_of_accessions_name"_nr_blastcontigs.fasta # annotate the contigs
+    rm "$outpath"/"$file_of_accessions_name""_temp_nr_contig_names.txt"
+    rm "$inpath"/"$file_of_accessions_name"_blastcontigs_forNR.fasta # remove the blastcontigs file
 }
 
-# read in list of file names or accessions for example could be several fastq.gz files (paired or single) or just the accession id's
-readarray -t myarray < "$file_of_accessions"
-export library_run=${myarray["$PBS_ARRAY_INDEX"]}
-library_run_without_path="$(basename -- $library_run)"
-library_id=$(echo $library_run_without_path | sed 's/\.contigs.fa//g')
-
-# paths 
+# Setting variables - you shouldn't really need to change these
+file_of_accessions_name=$(basename -- "$file_of_accessions")
 wd=/project/"$root_project"/"$project"/blast_results
 inpath=/project/"$root_project"/"$project"/contigs/final_contigs   # location of reads and filenames
 outpath=/project/"$root_project"/"$project"/blast_results        # location of megahit output
 
 # cd working dir
 cd "$wd" || exit
 
+concatenateBlastContigs
 BlastxNR
 blastToFasta