Skip to content

emo-bon/metaGOflow-data-products-RO-crate-example

Repository files navigation

MetaGOflow-Data-Products-RO-Crate

This code builds the metagoflow-data-products-ro-crate's which contain the MetaGOflow workflow results for the analysis of one EMO BON sampling event.

This code uses Data Version Control (dvc) to store files in a LFS S3 store and uses the .dvc stub files as the payload. The (prototype) ro-crate holding repository metaGOflow-rocrates-dvc is a git submodule of this repository and contains the dvc root.

This ro-crate is specifically for the FAIR-EASE Biodiversity Use Case VRE data products - it does not include all of the MetaGOflow output.

How to create RO-Crate:

$ create-ro-crate.py target_directory yaml_configuration

where:

  • target_directory is the toplevel output directory of MetaGOflow. Note that the name of the directory cannot have a "." in it!
  • yaml_configuration is a YAML file of metadata specific to this ro-crate. Template
  • -d Provide debugging output (Default: False)

NB. this script will automatically concatenate InterProScan files if they are chunked. It will also include all the sequence-categorisation/*.gz bins by dynamically writing the ro-crate-metadata.json file. (Note: 16/10/2024 newer versions of MGF concatenate chunked IPS files during exectution.)

Test example of Data Products RO-Crate

Workflow execution:
CWLtool
all steps
HWLTKDRXY.UDI210 data
HWLTKDRXY.UDI210-results.tar.bz2 results tarball

conda activate metaGOflow
export DATA_FORWARD="DBB_AAADOSDA_1_1_HWLTKDRXY.UDI210_clean.fastq.gz"
export DATA_REVERSE="DBB_AAADOSDA_1_2_HWLTKDRXY.UDI210_clean.fastq.gz"
./run_wf.sh -n run -d  HWLTKDRXY.UDI210 \
-f input_data/${DATA_FORWARD} \
-r input_data/${DATA_REVERSE}

cwltool --debug ${SINGULARITY} --outdir ${OUT_DIR_FINAL} ${CWL} ${RENAMED_YML}
$ tree
.
├── DBB_AAADOSDA_1_1_HWLTKDRXY.UDI210_clean 
│   ├── GC-distribution.out.sub-set > discard
│   ├── GC-distribution.out.sub-set_bin > discard
│   ├── GC-distribution.out.sub-set_pcbin > discard
│   ├── nucleotide-distribution.out.sub-set > discard
│   ├── seq-length.out.sub-set > discard
│   ├── seq-length.out.sub-set_bin > discard
│   ├── seq-length.out.sub-set_pcbin > discard
│   └── summary.out > include
├── DBB_AAADOSDA_1_1_HWLTKDRXY.UDI210_clean.fastq.gz.sha1 > discard
├── DBB_AAADOSDA_1_1_HWLTKDRXY.UDI210_clean.fastq.trimmed.fasta > discard (?)
├── DBB_AAADOSDA_1_2_HWLTKDRXY.UDI210_clean 
│   ├── GC-distribution.out.sub-set > discard
│   ├── GC-distribution.out.sub-set_bin > discard
│   ├── GC-distribution.out.sub-set_pcbin > discard
│   ├── nucleotide-distribution.out.sub-set > discard
│   ├── seq-length.out.sub-set > discard
│   ├── seq-length.out.sub-set_bin > disacard
│   ├── seq-length.out.sub-set_pcbin > discard
│   └── summary.out > include
├── DBB_AAADOSDA_1_2_HWLTKDRXY.UDI210_clean.fastq.gz.sha1 > discard
├── DBB_AAADOSDA_1_2_HWLTKDRXY.UDI210_clean.fastq.trimmed.fasta > discard (?)
├── DBB.merged_CDS.faa > upload to MGnify
├── DBB.merged_CDS.ffn > upload to MGnify
├── DBB.merged.cmsearch.all.tblout.deoverlapped > discard (?)
├── DBB.merged.fasta > upload to MGnify
├── DBB.merged.motus.tsv > discard
├── DBB.merged.unfiltered_fasta > discard
├── fastp.html > include
├── final.contigs.fa > include + upload to MGnify
├── functional-annotation 
│   ├── DBB.merged_CDS.I5.tsv.chunks > discard
│   ├── DBB.merged_CDS.I5.tsv.gz > include + upload to MGnify
│   ├── DBB.merged.hmm.tsv.chunks > discard
│   ├── DBB.merged.hmm.tsv.gz > include + upload to MGnify
│   ├── DBB.merged.summary.go > include + upload to MGnify
│   ├── DBB.merged.summary.go_slim > include + upload to MGnify
│   ├── DBB.merged.summary.ips > include + upload to MGnify
│   ├── DBB.merged.summary.ko > include + upload to MGnify
│   ├── DBB.merged.summary.pfam > include + upload to MGnify
│   ├── stats
│   │   ├── go.stats > include
│   │   ├── interproscan.stats > include
│   │   ├── ko.stats > include
│   │   ├── orf.stats > include
│   │   └── pfam.stats > include
│   └── temp > discard
├── merged_qc
│   ├── GC-distribution.out.sub-set > discard
│   ├── GC-distribution.out.sub-set_bin > discard
│   ├── GC-distribution.out.sub-set_pcbin > discard
│   ├── nucleotide-distribution.out.sub-set > discard
│   ├── seq-length.out.sub-set > discard
│   ├── seq-length.out.sub-set_bin > discard
│   ├── seq-length.out.sub-set_pcbin > discard
│   └── summary.out > discard
├── qc_summary > discard
├── qc_summary_2 > discard
├── RNA-counts > include
├── sequence-categorisation
│   ├── 5_8S.fa.gz > include
│   ├── alpha_tmRNA.RF01849.fasta.gz > include
│   ├── Bacteria_large_SRP.RF01854.fasta.gz > include
│   ├── Bacteria_small_SRP.RF00169.fasta.gz > include
│   ├── cyano_tmRNA.RF01851.fasta.gz > include
│   ├── LSU.fasta.chunks > discard
│   ├── LSU.fasta.gz > include
│   ├── LSU_rRNA_archaea.RF02540.fa.gz > include
│   ├── LSU_rRNA_bacteria.RF02541.fa.gz > include
│   ├── LSU_rRNA_eukarya.RF02543.fa.gz > include
│   ├── Metazoa_SRP.RF00017.fasta.gz > include
│   ├── Protozoa_SRP.RF01856.fasta.gz > include
│   ├── RNase_MRP.RF00030.fasta.gz > include
│   ├── RNaseP_bact_a.RF00010.fasta.gz > include
│   ├── RNaseP_nuc.RF00009.fasta.gz > include
│   ├── SSU.fasta.chunks > discard
│   ├── SSU.fasta.gz > include
│   ├── SSU_rRNA_archaea.RF01959.fa.gz > include
│   ├── SSU_rRNA_bacteria.RF00177.fa.gz > include
│   ├── SSU_rRNA_eukarya.RF01960.fa.gz > include
│   ├── tmRNA.RF00023.fasta.gz > include
│   ├── tRNA.RF00005.fasta.gz > include
│   └── tRNA-Sec.RF01852.fasta.gz > include
└── taxonomy-summary
    ├── LSU
    │   ├── DBB.merged_LSU.fasta.mseq.gz > include + upload to MGnify
    │   ├── DBB.merged_LSU.fasta.mseq_hdf5.biom > include + upload to MGnify
    │   ├── DBB.merged_LSU.fasta.mseq_json.biom > include + upload to MGnify
    │   ├── DBB.merged_LSU.fasta.mseq.tsv > include + upload to MGnify
    │   ├── DBB.merged_LSU.fasta.mseq.txt > include + upload to MGnify
    │   └── krona.html > include + upload to MGnify
    └── SSU
        ├── DBB.merged_SSU.fasta.mseq.gz > include + upload to MGnify
        ├── DBB.merged_SSU.fasta.mseq_hdf5.biom > include + upload to MGnify
        ├── DBB.merged_SSU.fasta.mseq_json.biom > include + upload to MGnify
        ├── DBB.merged_SSU.fasta.mseq.tsv > include + upload to MGnify
        ├── DBB.merged_SSU.fasta.mseq.txt > include + upload to MGnify
        └── krona.html > include + upload to MGnify

10 directories, 88 files

$ du -hs ./*
14G	./HWLTKDRXY.UDI210-results.tar.bz2
4.0K	./prov
54G	./results
12K	./run.yml
1.8T	./tmp

Shield: CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License.

CC BY 4.0

About

Test repo for MetaGOflow outputs

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •