Spaces instead of underscores in FASTA download #14

Open
peterjc opened this issue Jul 6, 2022 · 0 comments

peterjc commented Jul 6, 2022

http://oomycetedb.cgrb.oregonstate.edu/search.html says:

There are multiple releases of the database. Usually, you should use the latest release, unless you are trying to reproduce a previous analysis. You can download the entire database here or just a subset.

The FASTA header format

The database is a FASTA file with headers in the following format:

name=Aphanomyces_invadans|strain=NJM9701|ncbi_acc=KX405005|ncbi_taxid=157072|oodb_id=13|taxonomy=cellular_organisms;Eukaryota;Stramenopiles;Oomycetes;Saprolegniales;Saprolegniaceae;Aphanomyces_invadans

The following fields are present:

“name”: The binomial species name of the organism with spaces replaced by underscores.
“strain”: The name of the strain/isolate if available. If the strain is not available, it is left empty.
“ncbi_acc”: The NCBI accession number for the sequence submitted to GenBank. Note that the version number (the number at the end, after the period) is not included.
“ncbi_taxid”: The NCBI taxonomy id. This can be looked up using the NCBI accession number.
“oodb_id”: This is the unique numeric ID specific to OomyceteDB.
“taxonomy”: The taxonomic classification separated by semicolons. This classification is curated by us and is not the taxonomic classification from NCBI associated with the NCBI taxid.
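
To make the documented format concrete, here is a minimal sketch of splitting such a header into its fields. It is not code from OomyceteDB; the function name `parse_header` is hypothetical, and it assumes the separators seen in the example (`|` between fields, `=` between key and value, `;` within the taxonomy).

```python
def parse_header(header: str) -> dict:
    """Split an OomyceteDB-style FASTA header into a dict of fields.

    Assumes '|' separates fields, '=' separates key from value, and
    ';' separates taxonomy ranks, as in the documented example.
    """
    fields = {}
    # Drop the leading '>' (if any) and surrounding whitespace.
    text = header.strip()
    if text.startswith(">"):
        text = text[1:]
    for part in text.split("|"):
        # partition() splits at the first '=' only, so values may contain '='.
        key, _, value = part.partition("=")
        fields[key] = value
    # Turn the taxonomy string into a list of ranks, if the field is present.
    if "taxonomy" in fields:
        fields["taxonomy"] = fields["taxonomy"].split(";")
    return fields


example = (
    "name=Aphanomyces_invadans|strain=NJM9701|ncbi_acc=KX405005|"
    "ncbi_taxid=157072|oodb_id=13|taxonomy=cellular_organisms;Eukaryota;"
    "Stramenopiles;Oomycetes;Saprolegniales;Saprolegniaceae;Aphanomyces_invadans"
)
record = parse_header(example)
print(record["name"], record["oodb_id"], record["taxonomy"][-1])
```

All values are kept as strings, so an empty strain stays an empty string rather than being dropped.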

However, the downloads I made over the last few days still have spaces in the species name, in both the name= and taxonomy= fields, contrary to the documented example. For example:

$ grep NJM9701 oomycetedb_whole_2022_06_28_08_47_34.fa 
>name=Aphanomyces invadans|strain=NJM9701|ncbi_acc=KX405005|ncbi_taxid=157072|oodb_id=13|taxonomy=cellular_organisms;Eukaryota;Stramenopiles;Oomycetes;Saprolegniales;Saprolegniaceae;Aphanomyces invadans
$ grep NJM9701 oomycetedb_whole_2022_07_06_03_05_06.fa 
>name=Aphanomyces invadans|strain=NJM9701|ncbi_acc=KX405005|ncbi_taxid=157072|oodb_id=13|taxonomy=cellular_organisms;Eukaryota;Stramenopiles;Oomycetes;Saprolegniales;Saprolegniaceae;Aphanomyces invadans

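This matters because most FASTA parsers take the first word of the header as the sequence identifier, i.e. they break at the first space. With the current downloads, the identifier would be truncated to name=Aphanomyces.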
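
A minimal sketch of the problem, in plain Python with no FASTA library: `fasta_id` mimics the common convention of taking everything after `>` up to the first whitespace as the identifier, and `underscore_header` is a possible user-side workaround. Both function names are hypothetical, and replacing every space with an underscore is an assumption that may not suit headers where a space is meaningful (for example in a strain name).

```python
def fasta_id(header_line: str) -> str:
    """Return the identifier the way many FASTA parsers do:
    the text after '>' up to the first whitespace."""
    text = header_line.strip()
    if text.startswith(">"):
        text = text[1:]
    parts = text.split(None, 1)
    return parts[0] if parts else ""


def underscore_header(header_line: str) -> str:
    """Replace spaces with underscores in a header line (workaround sketch).

    Lines that do not start with '>' (sequence lines) are returned unchanged.
    """
    if not header_line.startswith(">"):
        return header_line
    return header_line.rstrip().replace(" ", "_")


downloaded = (
    ">name=Aphanomyces invadans|strain=NJM9701|ncbi_acc=KX405005|"
    "ncbi_taxid=157072|oodb_id=13|taxonomy=cellular_organisms;Eukaryota;"
    "Stramenopiles;Oomycetes;Saprolegniales;Saprolegniaceae;Aphanomyces invadans"
)

print(fasta_id(downloaded))                     # name=Aphanomyces
print(fasta_id(underscore_header(downloaded)))  # full header, underscores restored
```

With spaces in the species name, every record from the same genus would share a truncated identifier, which is why the underscore form described in the documentation is needed.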
