Skip to content

Latest commit

 

History

History
82 lines (63 loc) · 2.07 KB

EE282_HW3.md

File metadata and controls

82 lines (63 loc) · 2.07 KB

EE282 HW#3

To download the all chromosomes fasta file and md5sum fileonto a new folder:

mkdir melanogaster_data
cd melanogaster_data
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/fasta/dmel-all-chromosome-r6.24.fasta.gz
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/fasta/md5sum.txt

File integrity:

md5sum *.gz > md5check_genome.txt
diff md5{sum,check_genome}.txt
mv md5sum.txt md5sum_genome.txt

Make sure the chromosomes file does not appear in the output, but the files you did not download will because they are in the md5sum file.

To calculate number of nucleotides, number of N's, and number of sequences:

module load jje/jjeutils/0.1a
module load jje/kent/2014.02.19
faSize dmel-all-chromosome-r6.24.fasta.gz

To download the gtf annotation file and md5sum:

wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/gtf/dmel-all-r6.24.gtf.gz
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/gtf/md5sum.txt

Annotation file integrity:

md5sum *.gz > md5check_annotation.txt
diff md5{sum,check_annotation}.txt
mv md5sum.txt md5sum_annotation.txt

Print a summary report with the following information:

  1. Total number of features of each type, sorted from the most common to the least common
zcat dmel-all-r6.24.gtf.gz \
| awk '{print $3}' \
| sort \
| uniq -c \
| sort -rn > summary_features.txt
  1. Total number of genes per chromosome arm (X, Y, 2L, 2R, 3L, 3R, 4)
zcat dmel-all-r6.24.gtf.gz \
| awk '{print $1 "\t" $3}' \
| grep 'gene' \
| awk '{print $1}' \
| sort \
| uniq -c \
| sort -rn > summary_genes.txt

Comments

Good job.

md5sum -c is very useful. Try the following (provided you maintained the filenames):

$ wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/gtf/md5sum.txt
$ wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/gtf/dmel-all-r6.24.gtf.gz
$ md5sum dmel-all-r6.24.gtf.gz 
5cd5dcfbfff952ea7ce89e26cba89bbd  dmel-all-r6.24.gtf.gz

$ md5sum -c md5sum.txt
dmel-all-r6.24.gtf.gz: OK