Add BLASTDBv5 datatype (for blast >=2.8.1) #9939

Merged on Sep 8, 2020 (2 commits)


abretaud
Contributor

Add support for the new version of BLAST databases, introduced in 2.8.1 and now the default in 2.10.0 (https://www.ncbi.nlm.nih.gov/books/NBK131777/)

Should fix peterjc/galaxy_blast#124 (ping @peterjc)
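
As a rough illustration of how the two formats differ on disk, the sketch below guesses the database version from file names. It assumes that v5 databases carry an extra LMDB file ending in `.pdb` (protein) or `.ndb` (nucleotide) next to the usual index files; the function name `guess_blastdb_version` is hypothetical, and this is not necessarily the check Galaxy itself performs.

```python
from typing import Iterable, Optional

# Assumption: v5 databases add an LMDB file (.pdb for protein, .ndb for nucleotide).
V5_MARKERS = (".pdb", ".ndb")
# Index or alias files expected in both v4 and v5 databases.
CORE_MARKERS = (".pin", ".nin", ".pal", ".nal")


def guess_blastdb_version(prefix: str, filenames: Iterable[str]) -> Optional[int]:
    """Return 5, 4, or None (no database found) for the given path prefix."""
    names = set(filenames)
    if any(prefix + ext in names for ext in V5_MARKERS):
        return 5
    if any(prefix + ext in names for ext in CORE_MARKERS):
        return 4
    return None
```

Calling it with the output of `os.listdir()` and the database prefix would be one way to use it; a file-name heuristic like this cannot replace inspecting the database with the BLAST+ tools.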

@galaxybot galaxybot added this to the 20.09 milestone Jun 30, 2020
@peterjc
Contributor

peterjc commented Jun 30, 2020

Have you any thoughts on whether we need to update this and the datatypes on the Tool Shed, or just Galaxy itself, or both? If we do it on the Tool Shed, we could extend the existing data package or add a second one.

I don't think that will make much difference, but using the Tool Shed will be required for rolling this out to older versions of Galaxy. So I think we should do the Tool Shed first, then add them to Galaxy itself once stable and proven. Update: Or both at once.

@abretaud
Contributor Author

I'd prefer focusing on maintaining the datatypes only within the Galaxy code, just as you proposed in peterjc/galaxy_blast#124 (comment). That's also what is written in the IUC standards.

Not sure what to do for older Galaxy versions... I guess Usegalaxy.* instances, and other big instances are often up-to-date with latest releases, so probably not a problem in most cases? This kind of low-impact PR can probably be backported to 20.05, and maybe previous releases too if needed

@peterjc
Contributor

peterjc commented Jun 30, 2020

If the Galaxy team are willing to accept backports to older releases, that'd help minimise the need to do this via the Tool Shed.

@bgruening
Member

@peterjc will this work for you? #9632

@bernt-matthias
Contributor

Do the newer BLAST versions still work with old databases? Otherwise the data table / data manager might need an update: https://github.com/peterjc/galaxy_blast/tree/master/data_managers/ncbi_blastdb

@abretaud
Contributor Author

Yes, old databases should be usable by newer versions.

@bernt-matthias
Contributor

Coincidentally I'm struggling with the blastdb data types (#9885). Somehow it seems impossible to create a test for basic composite data types.

From a quick look at the NCBI tool wrappers, it seemed to me that this is untested. Any plans to add such tests? Is this working in practice (for my work I do not use BLAST)?

@peterjc
Contributor

peterjc commented Jun 30, 2020

Commented on #9885: as far as I know, upload of BLAST DB files never worked, with the knock-on effect of preventing their direct use in tests with planemo. One workaround is testing a workflow instead. Another workaround, which the BLAST+ wrapper tool suite uses, is using test databases via a *.loc file.

@abretaud
Contributor Author

Resuming work on peterjc/galaxy_blast#123
So what's up here? I think the failing test is unrelated
Is there anything I should do about tests for basic composite data types? If I understand correctly it's not specific to these new datatypes?

@peterjc
Contributor

peterjc commented Aug 21, 2020

I'm catching up on some BLAST+ wrapper work (prompted by @abretaud etc), and on peterjc/galaxy_blast#129 will declare the old BLAST datatype definitions on the Tool Shed obsolete and stop using them.

Having the NCBI BLAST Database v5 format directly in Galaxy is preferable (i.e. this pull request or one like it).

Likewise adding the NCBI BLAST XML v2 format - peterjc/galaxy_blast#65 - but that can be done separately.

@FredericBGA
Contributor

Hi,
I've found a big difference in the way V5 indexes work compared to V4. I've sent an email to nlm-support.

blastdbcmd -version
blastdbcmd: 2.10.0+
Package: blast 2.10.0, build Apr 10 2020 10:18:15

makeblastdb -version
makeblastdb: 2.10.0+
Package: blast 2.10.0, build Apr 10 2020 10:18:15

Sample fasta file (with the Uniprot header : >db|UniqueIdentifier|EntryName)
https://www.uniprot.org/uniprot/Q8I6R7.fasta

Blastdb (version 4):

makeblastdb -max_file_sz 4GB -parse_seqids -hash_index -dbtype prot -out Q8I6R7 -in Q8I6R7.fasta -title Q8I6R7 -blastdb_version 4

Search with UniqueIdentifier:

blastdbcmd -entry Q8I6R7 -db Q8I6R7
Q8I6R7 Acanthoscurrin-2 (Fragment) OS=Acanthoscurria gomesiana OX=115339 GN=acantho2 PE=1 SV=1
DVYKGGGGGRYGGGRYGGGGGYGGGLGGGGLGGGGLGGGKGLGGGGLGGGGLGGGGLGGGGLGGGKGLGGGGLGGGGLGG
GGLGGGGLGGGKGLGGGGLGGGGLGGGRGGYGGGGYGGGYGGGYGGGKYKG

Search with EntryName:

blastdbcmd -entry ACN2_ACAGO -db Q8I6R7
Q8I6R7 Acanthoscurrin-2 (Fragment) OS=Acanthoscurria gomesiana OX=115339 GN=acantho2 PE=1 SV=1
DVYKGGGGGRYGGGRYGGGGGYGGGLGGGGLGGGGLGGGKGLGGGGLGGGGLGGGGLGGGGLGGGKGLGGGGLGGGGLGG
GGLGGGGLGGGKGLGGGGLGGGGLGGGRGGYGGGGYGGGYGGGYGGGKYKG

Both Q8I6R7 and ACN2_ACAGO work.

Blastdb (version 5):

makeblastdb -max_file_sz 4GB -parse_seqids -hash_index -dbtype prot -out Q8I6R7_V5 -in Q8I6R7.fasta -title Q8I6R7_V5 -blastdb_version 5

Search with UniqueIdentifier:

blastdbcmd -entry Q8I6R7 -db Q8I6R7_V5
Q8I6R7 Acanthoscurrin-2 (Fragment) OS=Acanthoscurria gomesiana OX=115339 GN=acantho2 PE=1 SV=1
DVYKGGGGGRYGGGRYGGGGGYGGGLGGGGLGGGGLGGGKGLGGGGLGGGGLGGGGLGGGGLGGGKGLGGGGLGGGGLGG
GGLGGGGLGGGKGLGGGGLGGGGLGGGRGGYGGGGYGGGYGGGYGGGKYKG

Search with EntryName:

blastdbcmd -entry ACN2_ACAGO -db Q8I6R7_V5
Error: [blastdbcmd] Entry not found: ACN2_ACAGO
Error: [blastdbcmd] Entry or entries not found in BLAST database

Using the V5 indexes, we are now unable to use EntryName as the argument for -entry.

You need to be aware of this; it was very convenient to be able to use both EntryName and UniqueIdentifier.
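
One possible workaround, sketched here and not something proposed in this thread, is to keep a lookup table from EntryName to UniqueIdentifier built from the FASTA headers, so that the identifier can still be passed to `-entry`. The sketch assumes headers of the form `>db|UniqueIdentifier|EntryName`; the function name `entry_name_to_accession` is hypothetical.

```python
from typing import Dict, Iterable


def entry_name_to_accession(fasta_lines: Iterable[str]) -> Dict[str, str]:
    """Map EntryName -> UniqueIdentifier for '>db|UniqueIdentifier|EntryName ...' headers."""
    mapping: Dict[str, str] = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if not tokens:
            continue  # empty header
        parts = tokens[0].split("|")
        # Only accept the three-field pipe-separated form; skip anything else.
        if len(parts) == 3 and parts[1] and parts[2]:
            mapping[parts[2]] = parts[1]
    return mapping


# Example with a header following the pattern above:
lookup = entry_name_to_accession([">db|Q8I6R7|ACN2_ACAGO Acanthoscurrin-2 (Fragment)"])
print(lookup["ACN2_ACAGO"])
```

The value looked up this way could then be used as the `-entry` argument for a V5 database.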

@peterjc
Contributor

peterjc commented Sep 7, 2020

Thanks @FredericBGA - that does look like an NCBI BLAST+ bug, hopefully something that can be fixed and not a design limitation of the V5 DB format.

@abretaud
Contributor Author

abretaud commented Sep 7, 2020

Yep I don't think it's blocking this PR, let's hope they'll fix it in a future version.
Any chance to get this merged so we can move on with peterjc/galaxy_blast#123?
There's a selenium test failing but I think it's unrelated

@FredericBGA
Contributor

If you use >db|UniqueIdentifier|EntryName as FASTA headers and you can set -blastdb_version 4, this is not yet an issue.
It will become an issue when version 4 indexes are deprecated.
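
For scripted use, pinning the database version could look roughly like the sketch below. It only assembles the argument list, using flags that appear in the makeblastdb commands earlier in this thread; the helper name `makeblastdb_command` and its defaults are hypothetical.

```python
from typing import List


def makeblastdb_command(fasta: str, out: str, dbtype: str = "prot", db_version: int = 4) -> List[str]:
    """Build a makeblastdb argument list that pins the database format version."""
    if db_version not in (4, 5):
        raise ValueError(f"unsupported BLAST DB version: {db_version}")
    return [
        "makeblastdb",
        "-parse_seqids",
        "-dbtype", dbtype,
        "-in", fasta,
        "-out", out,
        "-title", out,
        "-blastdb_version", str(db_version),
    ]


cmd = makeblastdb_command("Q8I6R7.fasta", "Q8I6R7")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed.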

@abretaud
Contributor Author

abretaud commented Sep 7, 2020

People will still be able to use older tool versions if they want, and the v4 datatypes will still be there, so this should not break anything for most people

@peterjc
Contributor

peterjc commented Sep 7, 2020

It has crossed my mind that we could tweak the class definition and use the same datatype for both V4 and V5 databases, on the assumption that most tools will eventually transition. There would be pain during the transition though...

@FredericBGA
Contributor

I've got an answer from NCBI: expected behavior for dbV5

Hi,

Thanks for following up.

As far as I can tell, this is expected behavior for dbV5 since the defline parsing is limited to the first string. Because of this, it is not aware of the pipe separated locus_id field since that field is not indexed and cannot retrieve records given the locus_tag input.

The following is my test on a custom-generated test db as well as our swissprot production database.

$ efetch -db protein -id p12345 -format fasta

sp|P12345.2|AATM_RABIT RecName: Full=Aspartate aminotransferase, mitochondrial; Short=mAspAT; AltName: Full=Fatty acid-binding protein; Short=FABP-1; AltName: Full=Glutamate oxaloacetate transaminase 2; AltName: Full=Kynurenine aminotransferase 4; AltName: Full=Kynurenine aminotransferase IV; AltName: Full=Kynurenine--oxoglutarate transaminase 4; AltName: Full=Kynurenine--oxoglutarate transaminase IV; AltName: Full=Plasma membrane-associated fatty acid-binding protein; Short=FABPpm; AltName: Full=Transaminase A; Flags: Precursor
MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYKRDTNSKKMNLGVGAYRDD
NGKPYVLPSVRKAEAQIAAKGLDKEYLPIGGLAEFCRASAELALGENSEVVKSGRFVTVQTISGTGALRI
GASFLQRFFKFSRDVFLPKPSWGNHTPIFRDAGMQLQSYRYYDPKTCGFDFTGALEDISKIPEQSVLLLH
ACAHNPTGVDPRPEQWKEIATVVKKRNLFAFFDMAYQGFASGDGDKDAWAVRHFIEQGINVCLCQSYAKN
MGLYGERVGAFTVICKDADEAKRVESQLKILIRPMYSNPPIHGARIASTILTSPDLRKQWLQEVKGMADR
IIGMRTQLVSNLKKEGSTHSWQHITDQIGMFCFTGLKPEQVERLTKEFSIYMTKDGRISVAGVTSGNVGY
LAHAIHQVTK

$ efetch -db protein -id p12345 -format fasta | makeblastdb -dbtype prot -parse_seqids -out x2 -title "p12345 stdin" -in -

$ blastdbcmd -db x2 -entry AATM_RABIT
Error: [blastdbcmd] Entry not found: AATM_RABIT
Error: [blastdbcmd] Entry or entries not found in BLAST database

$ blastdbcmd -db swissprot -entry AATM_RABIT
Error: [blastdbcmd] Entry not found: AATM_RABIT
Error: [blastdbcmd] Entry or entries not found in BLAST database

$ blastdbcmd -db swissprot -entry p12345

P12345.2 RecName: Full=Aspartate aminotransferase, mitochondrial; Short=mAspAT; AltName: Full=Fatty acid-binding protein; Short=FABP-1; AltName: Full=Glutamate oxaloacetate transaminase 2; AltName: Full=Kynurenine aminotransferase 4; AltName: Full=Kynurenine aminotransferase IV; AltName: Full=Kynurenine--oxoglutarate transaminase 4; AltName: Full=Kynurenine--oxoglutarate transaminase IV; AltName: Full=Plasma membrane-associated fatty acid-binding protein; Short=FABPpm; AltName: Full=Transaminase A; Flags: Precursor [Oryctolagus cuniculus]
MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYKRDTNSKKMNLGVGAYRDDNGKPYVLPSV
RKAEAQIAAKGLDKEYLPIGGLAEFCRASAELALGENSEVVKSGRFVTVQTISGTGALRIGASFLQRFFKFSRDVFLPKP
SWGNHTPIFRDAGMQLQSYRYYDPKTCGFDFTGALEDISKIPEQSVLLLHACAHNPTGVDPRPEQWKEIATVVKKRNLFA
FFDMAYQGFASGDGDKDAWAVRHFIEQGINVCLCQSYAKNMGLYGERVGAFTVICKDADEAKRVESQLKILIRPMYSNPP
IHGARIASTILTSPDLRKQWLQEVKGMADRIIGMRTQLVSNLKKEGSTHSWQHITDQIGMFCFTGLKPEQVERLTKEFSI
YMTKDGRISVAGVTSGNVGYLAHAIHQVTK

$ blastdbcmd -db x2 -entry p12345

P12345.2 RecName: Full=Aspartate aminotransferase, mitochondrial; Short=mAspAT; AltName: Full=Fatty acid-binding protein; Short=FABP-1; AltName: Full=Glutamate oxaloacetate transaminase 2; AltName: Full=Kynurenine aminotransferase 4; AltName: Full=Kynurenine aminotransferase IV; AltName: Full=Kynurenine--oxoglutarate transaminase 4; AltName: Full=Kynurenine--oxoglutarate transaminase IV; AltName: Full=Plasma membrane-associated fatty acid-binding protein; Short=FABPpm; AltName: Full=Transaminase A; Flags: Precursor
MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYKRDTNSKKMNLGVGAYRDDNGKPYVLPSV
RKAEAQIAAKGLDKEYLPIGGLAEFCRASAELALGENSEVVKSGRFVTVQTISGTGALRIGASFLQRFFKFSRDVFLPKP
SWGNHTPIFRDAGMQLQSYRYYDPKTCGFDFTGALEDISKIPEQSVLLLHACAHNPTGVDPRPEQWKEIATVVKKRNLFA
FFDMAYQGFASGDGDKDAWAVRHFIEQGINVCLCQSYAKNMGLYGERVGAFTVICKDADEAKRVESQLKILIRPMYSNPP
IHGARIASTILTSPDLRKQWLQEVKGMADRIIGMRTQLVSNLKKEGSTHSWQHITDQIGMFCFTGLKPEQVERLTKEFSI
YMTKDGRISVAGVTSGNVGYLAHAIHQVTK

Regards,
NCBI User Services

@mvdbeek mvdbeek modified the milestones: 20.09, 21.01 Sep 8, 2020
@jmchilton jmchilton merged commit 511c3e6 into galaxyproject:dev Sep 8, 2020
@jmchilton
Member

Awesome - thanks for the discussion all. This should be good regardless of the blast issues being resolved right?

@peterjc
Contributor

peterjc commented Sep 8, 2020

Thanks all. Still trying to catch up on the BLAST+ wrapper backlog, but having this in the next Galaxy release will help a lot later.

peterjc added a commit to peterjc/galaxy_blast that referenced this pull request Sep 9, 2020
Need to wait for new v5 BLAST DB datatypes to be
in a released version of Galaxy before using them.
See galaxyproject/galaxy#9939
peterjc added a commit to peterjc/galaxy_blast that referenced this pull request Sep 10, 2020
In preparation for wrapping NCBI BLAST+ 2.10, which
adds support for setting the preferred DB version.

Need to wait for new v5 BLAST DB datatypes to be
in a released version of Galaxy before using them.
See galaxyproject/galaxy#9939
@abretaud
Contributor Author

Thanks for the merge @jmchilton !
Would it be possible to backport this to 20.05?

@mvdbeek mvdbeek modified the milestones: 21.01, 20.09 Sep 17, 2020
Development

Successfully merging this pull request may close these issues.

Support for BLAST DB v5