Skip to content

Commit

Permalink
Merge pull request #2 from JonathonMifsud/update-v1.0.4
Browse files Browse the repository at this point in the history
Update v1.0.4
  • Loading branch information
JonathonMifsud authored Feb 14, 2024
2 parents e14c68a + d46bdce commit 71aa4e7
Show file tree
Hide file tree
Showing 39 changed files with 1,269 additions and 509 deletions.
25 changes: 14 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ NOTE: This pipeline relies on several databases, modules and taxonomy files that
- [Storage](#storage)
- [Troubleshooting](#troubleshooting)
- [Installing Anaconda](#installing-anaconda)
- [Acknowledgments](#acknowledgments)
- [How to cite this repo?](#how-to-cite-this-repo)

--------------------
Expand All @@ -42,7 +43,7 @@ https://www.ibm.com/aspera/connect/

Each general task you want to run is associated with a .sh (shell) and .pbs script. The .sh script works as a wrapper, passing parameters and variables to the .pbs script. After setting up, you usually don't need to edit the .pbs script.

If you are unsure about what variables/files need to be passed to a script, refer to the .sh script.
If you are unsure about what variables/files to use just call the help flag `-h` e.g., `_pipeline_download_sra.sh -h` (from verison 1.0.4 onwards)

The scripts are designed to process batches, so they require a list of filenames to run.

Expand All @@ -58,14 +59,9 @@ The standard pipeline follows these steps:
3. Check the that the raw reads have downloaded by looking in `/scratch/^your_root_project^/^your_project^/raw_reads` . You can use the `check_sra_downloads.sh` script to do this! Re-download any that are missing (make a new file with the accessions)
4. Run read trimming, assembly and calculate contig abundance e.g, `JCOM_pipeline_trim_assembly_abundance.sh`. Trimming is currently setup for TruSeq3 PE and SE illumania libs and will also trim nextera. Check that all contigs are non-zero in size in `/project/^your_root_project^/^your_project^/contigs/final_contigs/`
5. Run blastxRdRp and blastxRVDB (these can be run simultaneously)
6. Concatenate all the RVDB and RdRp contigs across all libraries using cat, etc.
The reason I do this is that it is expensive to run NT / NR blasts for each contig file because the giant databases have to be loaded in each time. Instead you concatentate all of the blast contigs together and run it once.
E.g, `cat *_blastcontigs.fasta > combined.contigs.fa`
7. Move `combined.contigs.fa` to the `/project/^your_root_project^/^your_project^/contigs/final_contigs/` i.e. the input location for blasts.
8. Create an input accession file containing a single line with the word `combined`
9. Run blastnr and blastnt using this input file. As you are running this on the combined contigs there should only be one subjob in the array! `JCOM_pipeline_blastnr.sh` `JCOM_pipeline_blastnt.sh`
10. Run the readcount script `JCOM_pipeline_readcount.sh`
11. Generate a summary table (Anaconda is needed - see below). The summary table script will create several files inside `/project/^your_root_project^/^your_project^/blast_results/summary_table_creation`. The csv files are the summary tables - if another format or summary would suit you best let me know and we can sit down and develop it. You can specify accessions if you only want to run the summary table on a subset of runs -f as normal. IMPORTANT check both the logs files generated in the logs folder `summary_table_creation_TODAY_stderr.txt` and `summary_table_creation_TODAY_stout.txt` as this will let you know if any of the inputs were missing etc.
6. Run blastnr and blastnt (these can be run simultaneously). Given an accession file this command will combine the blastcontigs from RdRp and RVDB and use it as input for nr and nt. As such you will notice there is only a single job ran for each instead of an array `JCOM_pipeline_blastnr.sh` `JCOM_pipeline_blastnt.sh`. Output is named after the -f file.
7. Run the readcount script `JCOM_pipeline_readcount.sh`
8. Generate a summary table (Anaconda is needed - see below). The summary table script will create several files inside `/project/^your_root_project^/^your_project^/blast_results/summary_table_creation`. The csv files are the summary tables - if another format or summary would suit you best let me know and we can sit down and develop it. You can specify accessions if you only want to run the summary table on a subset of runs -f as normal. IMPORTANT check both the logs files generated in the logs folder `summary_table_creation_TODAY_stderr.txt` and `summary_table_creation_TODAY_stout.txt` as this will let you know if any of the inputs were missing etc.

The large files e.g., raw and trimmed reads and abundance files are stored in `/scratch/` while the smaller files tend to be in /project/

Expand Down Expand Up @@ -99,7 +95,9 @@ Then to load it: `source ~/.bashrc`

### Common Flags

Note: Flags can vary between scripts, so always check the individual `.sh` scripts. However, the common flags are as follows:
Note: Flags can vary between scripts, If you are unsure about what variables/files to use just call the help flag `-h` e.g., `_pipeline_download_sra.sh -h` (from verison 1.0.4 onwards)

However, the common flags are as follows:

`-f` used to specify a file that contains the SRA run accessions to be processed. This option is followed by a string containing the complete path to a file containing accessions one per line. I typically store these files in `/project/your_root/your_project/accession_lists/`. If this option is not provided, most of the scripts in the pipeline will fail or excetue other behaviours (e.g., see the -f in `trim_assembly_abundance.sh`), as such I always recommmend setting the -f where able so you can better keep track of the libraries you are running. NOTE: The max number of SRAs I would put in a accession file is 1000. If you have more than this create two files and run the download_script twice. The limit is enforced by Artemis/PBS.

Expand Down Expand Up @@ -173,12 +171,17 @@ Close and re-open your terminal. You should now have access to the conda command
Conda will sometimes require more memory than the head node can provide causing memory issues when running `conda install` or `conda env create`. To get around this we can create a interactive environment using the following:
`qsub -I -l select=1:ncpus=4:mem=20GB -l walltime=4:00:00 -M ^your_email^@uni.sydney.edu.au -P ^your_root_project^ -q defaultQ -j oe`

To create the environments run the following. Note the .yml can be found here or in the environments folder in your main dir (same level as the scripts / blast_results)
To create the environments run the following:
`conda env create -f /project/^your_root_project^/^your_project^/environments/ccmetagen_env.yml`
`conda env create -f /project/^your_root_project^/^your_project^/environments/project_pipeline.yml`
`conda env create -f /project/^your_root_project^/^your_project^/environments/r_env.yml`

--------------------

## Acknowledgments
I'd like to acknowledge past and present members of the Holmes Lab for their contributions to the development of this pipeline. Extra special thanks goes to Susana Ortiz-Baez, Vince Costa, Erin Harvey, Lauren Lim, Mary Petrone and Sabrina Sadiq for suggestions and/or bug reports!

--------------------

## How to cite this repo?
If this repo was somehow useful a citation would be greatly appeciated! Please cite as Mifsud, J.C.O. (2023) BatchArtemisSRAMiner v1.X.X. Zenodo. https://doi.org/10.5281/zenodo.10020164. Available at: https://github.com/JonathonMifsud/BatchArtemisSRAMiner/ .You can also get a reference file if you click on the doi badge at the top of the repo or visit this link https://zenodo.org/record/8417951
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
7 changes: 3 additions & 4 deletions scripts/JCOM_pipeline_blastn_custom.pbs
Original file line number Diff line number Diff line change
Expand Up @@ -8,8 +8,8 @@
# script to blastn and extract positive virus contigs

#PBS -N blastn_custom_array
#PBS -l select=1:ncpus=6:mem=20GB
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=12:mem=120GB
#PBS -l walltime=48:00:00
#PBS -M [email protected]
#PBS -m abe

Expand All @@ -33,9 +33,8 @@ function blastToFasta {
rm "$outpath"/"$library_id""_temp_contig_names.txt"
}


# Setting variables - you shouldn't really need to change these other than resource allocations
library_id="$(basename -- "$input")"

outpath="$wd"
CPU=6

Expand Down
47 changes: 35 additions & 12 deletions scripts/JCOM_pipeline_blastn_custom.sh
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,21 @@ queue="defaultQ"
project="JCOM_pipeline_virome"
root_project="jcomvirome"

while getopts "i:d:p:r:" 'OPTKEY'; do
show_help() {
echo "Usage: $0 [-i input] [-d db] [-h]"
echo " -i input: Input fasta file to blast, provide the full path. (Required)"
echo " -d db: Blast+ database for blastn. (Required)"
echo " -h: Display this help message."
echo ""
echo " Example:"
echo " $0 -i /project/$root_project/$project/contigs/final_contigs/mylib.contigs.fa -d /scratch/VELAB/Databases/Blast/nt.^MONTH^-^YEAR^/nt"
echo ""
echo " Check the Github page for more information:"
echo " https://github.com/JonathonMifsud/BatchArtemisSRAMiner "
exit 1
}

while getopts "i:d:p:r:h" 'OPTKEY'; do
case "$OPTKEY" in
'i')
#
Expand All @@ -34,47 +48,56 @@ while getopts "i:d:p:r:" 'OPTKEY'; do
#
root_project="$OPTARG"
;;
'h')
#
show_help
;;
'?')
echo "INVALID OPTION -- ${OPTARG}" >&2
exit 1
echo ""
show_help
;;
':')
echo "MISSING ARGUMENT for option -- ${OPTARG}" >&2
exit 1
echo ""
show_help
;;
*)
# Handle invalid flags here
echo "Invalid option: -$OPTARG" >&2
exit 1
echo ""
show_help
;;
esac
done
shift $((OPTIND - 1))

if [ "$input" = "" ]; then
echo "No input string entered."
exit 1
echo "ERROR: No input string entered."
echo ""
show_help
fi

if [ "$db" = "" ]; then
echo "No database specified. Use -d option to specify the database. e.g., -d /scratch/VELAB/Databases/Blast/nt.Jul-2023/nt"
exit 1
echo "ERROR: No database specified. Use -d option to specify the database. e.g., -d /scratch/VELAB/Databases/Blast/nt.Jul-2023/nt"
echo ""
show_help
fi

if [ "$project" = "" ]; then
echo "No project string entered. Use e.g, -p JCOM_pipeline_virome"
echo "ERROR: No project string entered. Use e.g, -p JCOM_pipeline_virome"
exit 1
fi

if [ "$root_project" = "" ]; then
echo "No root project string entered. Use e.g., -r VELAB or -r jcomvirome"
echo "ERROR: No root project string entered. Use e.g., -r VELAB or -r jcomvirome"
exit 1
fi

input_basename=$(basename "$input")

qsub -o "/project/$root_project/$project/logs/blastn_'$input_basename'_$(date '+%Y%m%d')_stout.txt" \
-e "/project/$root_project/$project/logs/blastn_'$input_basename'_$(date '+%Y%m%d')_stderr.txt" \
qsub -o /project/"$root_project"/"$project"/logs/blastn_"$input_basename"_"$(date '+%Y%m%d')"_stout.txt \
-e /project/"$root_project"/"$project"/logs/blastn_"$input_basename"_"$(date '+%Y%m%d')"_stderr.txt \
-v "input=$input,db=$db,wd=$wd" \
-q "defaultQ" \
-l "walltime=48:00:00" \
Expand Down
69 changes: 55 additions & 14 deletions scripts/JCOM_pipeline_blastnr.pbs
Original file line number Diff line number Diff line change
Expand Up @@ -18,34 +18,75 @@
module load blast+
module load diamond/2.1.6


# concatenate all of the RdRp and RVDB blastcontigs into a file to blastx against the nr database
function concatenateBlastContigs {
if [[ -z "$file_of_accessions" ]]; then
rdrp_blastcontigs_file=("$wd"/*_rdrp_blastcontigs.fasta)
rvdb_blastcontigs_file=("$wd"/*_RVDB_blastcontigs.fasta)
else
accession_ids=$(cat "$file_of_accessions")
rdrp_blastcontigs_file=()
rvdb_blastcontigs_file=()
missing_files=()

# Iterate over each accession ID and find corresponding files
for id in $accession_ids; do
rdrp_blastcontigs="$wd"/"${id}"_rdrp_blastcontigs.fasta
rvdb_blastcontigs="$wd"/"${id}"_RVDB_blastcontigs.fasta

# Check if the RdRp blastcontigs file exists
if [ -f "$rdrp_blastcontigs" ]; then
rdrp_blastcontigs_file+=("$rdrp_blastcontigs")
else
missing_files+=("RdRp blastcontigs file missing for accession ID: $id")
fi

# Check if the RVDB blastcontigs file exists
if [ -f "$rvdb_blastcontigs" ]; then
rvdb_blastcontigs_file+=("$rvdb_blastcontigs")
else
missing_files+=("RVDB blastcontigs file missing for accession ID: $id")
fi
done

# Print missing files if any
if [ ${#missing_files[@]} -gt 0 ]; then
echo "${missing_files[@]}"
fi

cat "${rdrp_blastcontigs_file[@]}" >"$wd"/"$file_of_accessions_name"_combined_rdrp_blastcontigs_forNR.fasta # Concatenate all RdRp blast contig files into one file
cat "${rvdb_blastcontigs_file[@]}" >"$wd"/"$file_of_accessions_name"_combined_RVDB_blastcontigs_forNR.fasta # Concatenate all RVDB blast contig files into one file
cat "$wd"/"$file_of_accessions_name"_combined_rdrp_blastcontigs_forNR.fasta "$wd"/"$file_of_accessions_name"_combined_RVDB_blastcontigs_forNR.fasta >"$inpath"/"$file_of_accessions_name"_blastcontigs_forNR.fasta
rm "$wd"/"$file_of_accessions_name"_combined_rdrp_blastcontigs_forNR.fasta "$wd"/"$file_of_accessions_name"_combined_RVDB_blastcontigs_forNR.fasta
fi
}

# blastx
function BlastxNR {
diamond blastx -q "$inpath"/"$library_id".contigs.fa -d "$db" -o "$outpath"/"$library_id"_nr_blastx_results.txt "$diamond_para" -f 6 qseqid qlen sseqid stitle staxids pident length evalue
diamond blastx -q "$inpath"/"$file_of_accessions_name"_blastcontigs_forNR.fasta -d "$db" -o "$outpath"/"$file_of_accessions_name"_nr_blastx_results.txt "$diamond_para" -f 6 qseqid qlen sseqid stitle staxids pident length evalue
}

#tool to extract contigs from assembly Blast to fasta
function blastToFasta {
grep -i ".*" "$outpath"/"$library_id"_nr_blastx_results.txt | cut -f1 | uniq > "$outpath"/"$library_id""_temp_contig_names.txt" #by defult this will grab the contig name from every blast result line as I commonly use a custom protein database containing only viruses
grep -A1 -I -Ff "$outpath"/"$library_id""_temp_contig_names.txt" "$inpath"/"$library_id".contigs.fa > "$outpath"/"$library_id"_nr_blastcontigs.fasta
sed -i 's/--//' "$outpath"/"$library_id"_nr_blastcontigs.fasta # remove -- from the contigs
sed -i '/^[[:space:]]*$/d' "$outpath"/"$library_id"_nr_blastcontigs.fasta # remove the white space
sed --posix -i "/^\>/ s/$/"_$library_id"/" "$outpath"/"$library_id"_nr_blastcontigs.fasta # annotate the contigs
rm "$outpath"/"$library_id""_temp_contig_names.txt"
grep -i ".*" "$outpath"/"$file_of_accessions_name"_nr_blastx_results.txt | cut -f1 | uniq > "$outpath"/"$file_of_accessions_name""_temp_nr_contig_names.txt" #by defult this will grab the contig name from every blast result line as I commonly use a custom protein database containing only viruses
grep -A1 -I -Ff "$outpath"/"$file_of_accessions_name""_temp_nr_contig_names.txt" "$inpath"/"$file_of_accessions_name"_blastcontigs_forNR.fasta > "$outpath"/"$file_of_accessions_name"_nr_blastcontigs.fasta
sed -i 's/--//' "$outpath"/"$file_of_accessions_name"_nr_blastcontigs.fasta # remove -- from the contigs
sed -i '/^[[:space:]]*$/d' "$outpath"/"$file_of_accessions_name"_nr_blastcontigs.fasta # remove the white space
sed --posix -i "/^\>/ s/$/"_$file_of_accessions_name"/" "$outpath"/"$file_of_accessions_name"_nr_blastcontigs.fasta # annotate the contigs
rm "$outpath"/"$file_of_accessions_name""_temp_nr_contig_names.txt"
rm "$inpath"/"$file_of_accessions_name"_blastcontigs_forNR.fasta # remove the blastcontigs file
}

# read in list of file names or accessions for example could be several fastq.gz files (paired or single) or just the accession id's
readarray -t myarray < "$file_of_accessions"
export library_run=${myarray["$PBS_ARRAY_INDEX"]}
library_run_without_path="$(basename -- $library_run)"
library_id=$(echo $library_run_without_path | sed 's/\.contigs.fa//g')

# paths
# Setting variables - you shouldn't really need to change these
file_of_accessions_name=$(basename -- "$file_of_accessions")
wd=/project/"$root_project"/"$project"/blast_results
inpath=/project/"$root_project"/"$project"/contigs/final_contigs # location of reads and filenames
outpath=/project/"$root_project"/"$project"/blast_results # location of megahit output

# cd working dir
cd "$wd" || exit

concatenateBlastContigs
BlastxNR
blastToFasta
Loading

0 comments on commit 71aa4e7

Please sign in to comment.