To download the all chromosomes fasta file and md5sum fileonto a new folder:
mkdir melanogaster_data
cd melanogaster_data
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/fasta/dmel-all-chromosome-r6.24.fasta.gz
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/fasta/md5sum.txt
File integrity:
md5sum *.gz > md5check_genome.txt
diff md5{sum,check_genome}.txt
mv md5sum.txt md5sum_genome.txt
Make sure the chromosomes file does not appear in the output, but the files you did not download will because they are in the md5sum file.
To calculate number of nucleotides, number of N's, and number of sequences:
module load jje/jjeutils/0.1a
module load jje/kent/2014.02.19
faSize dmel-all-chromosome-r6.24.fasta.gz
To download the gtf annotation file and md5sum:
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/gtf/dmel-all-r6.24.gtf.gz
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/gtf/md5sum.txt
Annotation file integrity:
md5sum *.gz > md5check_annotation.txt
diff md5{sum,check_annotation}.txt
mv md5sum.txt md5sum_annotation.txt
Print a summary report with the following information:
- Total number of features of each type, sorted from the most common to the least common
zcat dmel-all-r6.24.gtf.gz \
| awk '{print $3}' \
| sort \
| uniq -c \
| sort -rn > summary_features.txt
- Total number of genes per chromosome arm (X, Y, 2L, 2R, 3L, 3R, 4)
zcat dmel-all-r6.24.gtf.gz \
| awk '{print $1 "\t" $3}' \
| grep 'gene' \
| awk '{print $1}' \
| sort \
| uniq -c \
| sort -rn > summary_genes.txt
Good job.
md5sum -c
is very useful. Try the following (provided you maintained the filenames):
$ wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/gtf/md5sum.txt
$ wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/gtf/dmel-all-r6.24.gtf.gz
$ md5sum dmel-all-r6.24.gtf.gz
5cd5dcfbfff952ea7ce89e26cba89bbd dmel-all-r6.24.gtf.gz
$ md5sum -c md5sum.txt
dmel-all-r6.24.gtf.gz: OK