Spaces instead of underscores in FASTA download #14

peterjc · 2022-07-06T10:05:59Z

http://oomycetedb.cgrb.oregonstate.edu/search.html says:

There are multiple releases of the database. Usually, you should use the latest release, unless you are trying to reproduce a previous analysis. You can download the entire database here or just a subset.

The FASTA header format

The database is a FASTA file with headers in the following format:

name=Aphanomyces_invadans|strain=NJM9701|ncbi_acc=KX405005|ncbi_taxid=157072|oodb_id=13|taxonomy=cellular_organisms;Eukaryota;Stramenopiles;Oomycetes;Saprolegniales;Saprolegniaceae;Aphanomyces_invadans

The following fields are present:

“name”: The binomial species name of the organism with spaces replaced by underscores.
“strain”: The name of the strain/isolate if available. If the strain is not available, it is left empty.
“ncbi_acc”: The NCBI accession number for the sequence submitted to GenBank. Note that the version number (the number at the end, after the period) is not included.
“ncbi_taxid”: The NCBI taxonomy id. This can be looked up using the NCBI accession number.
“oodb_id”: This is the unique numeric ID specific to OomyceteDB.
“taxonomy”: The taxonomic classification separated by semicolons. This classification is curated by us and is not the taxonomic classification from NCBI associated with the NCBI taxid.
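
For reference, a header in the documented format can be split into its fields with a few lines of Python. This is only a sketch under the assumption that fields are separated by "|" and keys from values by the first "="; the function name parse_header is hypothetical and not part of OomyceteDB.

```python
def parse_header(header):
    """Split an OomyceteDB-style FASTA header into a dict of fields.

    Hypothetical helper: assumes "|" separates fields and the first "="
    in each field separates the key from the value.
    """
    # Drop the leading ">" if a raw FASTA header line was passed in.
    header = header.strip().lstrip(">")
    fields = {}
    for part in header.split("|"):
        key, _, value = part.partition("=")
        fields[key] = value
    # The taxonomy value is a semicolon-separated lineage.
    fields["taxonomy"] = fields.get("taxonomy", "").split(";")
    return fields


example = (
    "name=Aphanomyces_invadans|strain=NJM9701|ncbi_acc=KX405005|"
    "ncbi_taxid=157072|oodb_id=13|taxonomy=cellular_organisms;Eukaryota;"
    "Stramenopiles;Oomycetes;Saprolegniales;Saprolegniaceae;"
    "Aphanomyces_invadans"
)
record = parse_header(example)
print(record["name"], record["strain"], record["oodb_id"])
```

An empty strain (as described for records without one) simply comes back as an empty string.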

However, the downloads I made in the last few days still have spaces in the species name, in both the name= and taxonomy= fields, contrary to the documented example. For instance:

$ grep NJM9701 oomycetedb_whole_2022_06_28_08_47_34.fa 
>name=Aphanomyces invadans|strain=NJM9701|ncbi_acc=KX405005|ncbi_taxid=157072|oodb_id=13|taxonomy=cellular_organisms;Eukaryota;Stramenopiles;Oomycetes;Saprolegniales;Saprolegniaceae;Aphanomyces invadans

$ grep NJM9701 oomycetedb_whole_2022_07_06_03_05_06.fa 
>name=Aphanomyces invadans|strain=NJM9701|ncbi_acc=KX405005|ncbi_taxid=157072|oodb_id=13|taxonomy=cellular_organisms;Eukaryota;Stramenopiles;Oomycetes;Saprolegniales;Saprolegniaceae;Aphanomyces invadans

This matters because most FASTA parsers take the first word of the header as the identifier, i.e. they break at the first space, so the identifier for this record is truncated to name=Aphanomyces.
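
To illustrate the problem, here is a minimal sketch in plain Python of how a typical parser derives the identifier, plus a possible stop-gap that rewrites the header before parsing. The function names fasta_id and underscore_header are hypothetical, and replacing every space with an underscore is my assumption about the intended format; it would also alter any strain names that contain spaces.

```python
def fasta_id(header_line):
    """Return the identifier as many FASTA parsers would:
    the header text up to the first whitespace."""
    text = header_line.strip().lstrip(">")
    return text.split(None, 1)[0] if text else ""


def underscore_header(header_line):
    """Stop-gap: replace spaces with underscores in a FASTA header line.

    Sequence lines (not starting with ">") are returned unchanged.
    """
    line = header_line.rstrip("\n")
    if not line.startswith(">"):
        return line
    return line.replace(" ", "_")


downloaded = (
    ">name=Aphanomyces invadans|strain=NJM9701|ncbi_acc=KX405005|"
    "ncbi_taxid=157072|oodb_id=13|taxonomy=cellular_organisms;Eukaryota;"
    "Stramenopiles;Oomycetes;Saprolegniales;Saprolegniaceae;"
    "Aphanomyces invadans"
)

print(fasta_id(downloaded))                     # name=Aphanomyces
print(fasta_id(underscore_header(downloaded)))  # full header, with underscores
```

With the spaces in place, every sequence from the same genus would share one truncated identifier, so fixing this in the download itself would be preferable to each user patching the file.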

peterjc mentioned this issue Jul 6, 2022

Missing spaces after periods in species names #20

Open
