
makeblastdb -max_file_sz #63

Open

bgruening opened this issue Jun 9, 2015 · 30 comments
@bgruening
Contributor

makeblastdb has a -max_file_sz option, which defaults to 1GB. Can we increase this default to 10GB, or should we expose it as a parameter?
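For context, the flag would be passed along these lines (a hypothetical invocation; the input file and database name are placeholders):

makeblastdb -in sequences.fasta -dbtype nucl -out mydb -max_file_sz 10GB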

@bgruening changed the title from "makeblastdb --maxfilesize" to "makeblastdb -max_file_sz" on Jun 9, 2015
@peterjc
Owner

peterjc commented Jun 10, 2015

If you make databases that size, will it start to split them? If so, we may need to update the BLAST datatype to look for the alternative filenames for each chunk, plus the alias file (*.nal or *.pal).
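For illustration, the alias file is a small text file listing the volumes; a minimal sketch, with an assumed database name, would look roughly like:

TITLE mydb
DBLIST mydb.00 mydb.01 mydb.02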

@bgruening
Contributor Author

I need to figure this out.
This is still failing with "BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID|18349221".

@dpryan79

dpryan79 commented Jan 10, 2017

One of our users is also now running into this. @bgruening do you recall if increasing -max_file_sz fixed this? I'm trying 10GB now, but this isn't exactly a quick process.

Update: -max_file_sz '10GB' causes the same error to occur, just later on. Oddly, none of the files are bigger than ~2GB. I'll try 100GB and see what happens, but I suspect this is just a BLAST bug.

@bgruening
Contributor Author

I think this fixed it for me, yes.

@dpryan79

dpryan79 commented Jan 10, 2017

It looks like setting -max_file_sz >2GB is either ignored or otherwise capped at 2GB. Either way, going up to 100GB still produces this error on the dataset in question here. I'm trying 2.6.0+ to see if the issue is resolved there (in that version, -max_file_sz produces an error message if you input something greater than 2GB, which is an improvement over the older behavior).

Anyway, unless this is already fixed in 2.6.0+, I guess this is just a BLAST issue and not related to the wrapper.

@dpryan79

Final update from me: this still happens in 2.6.0+. It looks like this occurs whenever the .nhr file is huge, causing multiple volumes to be written along with a .nal alias file. I'll try to track down where to report BLAST bugs and report this there.

@anilthanki

I am facing the same issue with a 9.6 GB FASTA file. Did anyone manage to fix it?

I am using protein sequences with the BLAST Galaxy wrapper, version 0.3.0.

My error is "BLAST Database creation error: Error: Duplicate seq_ids are found:
GNL|BL_ORD_ID:3299542"

@peterjc
Owner

peterjc commented Nov 29, 2018

@anilthanki your error is different: the wrapper checks for duplicate sequence IDs and aborts if it finds any. BLAST+ itself copes fine, but with many of the output formats, including the tabular default we use, duplicates become very difficult to distinguish and will most likely break your analysis.
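(For anyone wanting to pre-check an input file, a quick shell sketch that prints any duplicated identifiers, assuming the ID is the first word of each FASTA header line:)

grep '^>' input.fasta | awk '{print $1}' | sort | uniq -d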

@peterjc
Owner

peterjc commented Nov 29, 2018

The way the BLAST databases are defined in Galaxy as composite data types assumes a single file (no .nal or .pal alias pointing at chunks).

This indirectly limits the database size, since large databases are split into chunks.

Fixing this would be hard (and complicated to deploy now that the data types live in Galaxy itself; I am not sure what would happen if the Tool Shed-defined data type were different).

A workaround is to define the database outside Galaxy and add it to the *.loc file instead.
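For example, a Galaxy admin might add a line like the following to tool-data/blastdb.loc (or blastdb_p.loc for protein databases). This is only a sketch: the three tab-separated columns (unique ID, display name, path prefix of the database files) follow the standard *.loc sample layout, and the paths shown are placeholders:

nt_2018	NCBI nt (2018)	/data/blast/nt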

@anilthanki

@peterjc I checked again and there are no duplicates. I think it's something to do with the size of the input.

@peterjc
Owner

peterjc commented Nov 29, 2018

Strange, but possible. I hadn't checked whether the wording matched my script's error message.

Can you reproduce the error calling makeblastdb at the command line outside of Galaxy?

@anilthanki

I cannot reproduce the error on the command line on my local machine; I tried with and without the -max_file_sz parameter.

Any tips on creating the database on the command line without indexing, so that it creates only one file that I can upload to Galaxy for the rest of the analysis?

@peterjc
Owner

peterjc commented Nov 29, 2018

The BLAST databases datatype in Galaxy does not support upload into Galaxy: the expectation is that you upload the FASTA file and run makeblastdb within Galaxy, or that the Galaxy admin adds the database to the *.loc file.

Reproducing this outside Galaxy would be really instructive: does the failing command line string Galaxy used (read this via a failed makeblastdb history entry) fail in the same way outside Galaxy?

@anilthanki

anilthanki commented Nov 30, 2018

Yes, I tried running the same command as Galaxy on my local machine, and it was failing because of the -hash_index parameter. So I tried without indexing, and it worked fine on the command line and in Galaxy.
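(For the record, the working form was along these lines; a sketch with placeholder paths, simply dropping -hash_index from the Galaxy-generated command:)

makeblastdb -in proteins.fasta -dbtype prot -out proteins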

@peterjc
Owner

peterjc commented Nov 30, 2018

So makeblastdb ... -hash_index was causing the "BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID:3299542" error? If so, that is good to clarify, but does seem to be unrelated to the original -max_file_sz problem.

Quoting the command line help:

$ makeblastdb -help
...
-max_file_sz <String>
  Maximum file size for BLAST database files
  Default = `1GB'
...

Given the discussion above, it sounds like using a larger value here would be useful (since in the Galaxy context we don't currently cope with chunked databases).

@nathanweeks

It looks like the makeblastdb -max_file_sz limit was increased to 4GB in BLAST+ 2.8.0:

  • The 2GB output file size limit for makeblastdb has been increased to 4 GB.

@KinogaMIchael

Has anyone found a solution to this? I've tried everything written here and nothing seems to work. I'm getting the same error:
<BLAST Database creation error: Error: Duplicate seq_ids are found:
DBJ|LC456629.1>
I have a 16 GB FASTA file. Anyone? Someone?

@peterjc
Owner

peterjc commented May 7, 2021

@KinogaMIchael Does explicitly deduplicating your FASTA file first help?

It sounds like our check via https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/check_no_duplicates.py thinks the file is OK, only for BLAST itself to complain about a duplicate (the error message wording is different).

The discussion on this issue was about changing -max_file_sz which may or may not be related. The wrapper https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/ncbi_makeblastdb.xml does not currently set this value. Are you able to try editing the wrapper to add this to the command line?

Or better, are you able to try running the same makeblastdb command at the terminal? And adding -max_file_sz?
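(A minimal way to deduplicate by ID, keeping only the first record for each identifier; an awk sketch that assumes the ID is the first word of the header line:)

awk '/^>/{keep = !seen[$1]++} keep' input.fasta > deduplicated.fasta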

@annabeldekker

Dear @peterjc,

I have the same problem, using a FASTA file of around 14GB. I did try this in the terminal, but adding -max_file_sz '20GB' gives me the following: "BLAST options error: max_file_sz must be < 4 GiB".
Adding this to the XML wrapper will thus not make a difference.
I'm curious about other solutions. Thanks in advance!

Cheers,
Annabel

@peterjc
Owner

peterjc commented May 11, 2021

@annabeldekker and are there duplicated identifiers in your input files?

@annabeldekker

@peterjc
Hi, I checked and there aren't!

@peterjc
Owner

peterjc commented May 11, 2021

Yet you still get a message like "BLAST Database creation error: Error: Duplicate seq_ids are found: ..." when calling makeblastdb at the command line outside of Galaxy?

@annabeldekker

annabeldekker commented May 11, 2021

Exactly: when I call this outside of Galaxy I get the same duplicate seq_ids error, even though there are no duplicates in the file.
I'm using version 2.10.1, by the way.

@peterjc
Owner

peterjc commented May 11, 2021

OK, good. So it isn't my fault 😉

Please email the NCBI team at blast-help (at) ncbi.nlm.nih.gov with a reproducible example (and, to avoid confusing them, I suggest not mentioning Galaxy). If you get a reference number, it would be useful to log it here.

@annabeldekker

Thanks @peterjc, we will keep you all updated.

@annabeldekker

We circumvented the error by using the -parse_seqids option, which feels a bit odd. It still seems like a BLAST bug, but at least now it runs without issues. @KinogaMIchael maybe you could try that as well!

@peterjc
Owner

peterjc commented May 11, 2021

That does still sound like a BLAST bug, and worth reporting including the use of -parse_seqids as a possible workaround.

@KinogaMIchael

KinogaMIchael commented May 18, 2021

@peterjc deduplicating the FASTA file doesn't help; nothing helped. @annabeldekker I think it's a BLAST bug. I tried this in my terminal: makeblastdb -in /home/Virusdb/viruses.fa -parse_seqids -blastdb_version 5 -title "virusdb" -dbtype nucl -max_file_sz 4GB and still got the same error.

@peterjc
Owner

peterjc commented May 18, 2021

@KinogaMIchael this does sound like a BLAST bug, please do report it to the email address requested.

@xiongqian123456789

Hello! I also ran into this problem when building a database with makeblastdb.
I have tried different versions of BLAST to make the database, and hit these problems:

  1. BLAST Database creation error: Multi-letters chain PDB id is not supported in v4 BLAST DB
  2. Error: mdb_env_open: Function not implemented

Finally, with blast-2.5.0+, the makeblastdb command would run, but a new problem came up:

Command: nohup /newlustre/home/xiongqian/software/ncbi-blast-2.5.0+/bin/makeblastdb -in nt -dbtype nucl -out nt -parse_seqids -max_file_sz 2GB &
Error:
file: /newlustre/home/xiongqian/database/NT/nt.44.nog
file: /newlustre/home/xiongqian/database/NT/nt.45.nin
file: /newlustre/home/xiongqian/database/NT/nt.45.nhr
file: /newlustre/home/xiongqian/database/NT/nt.45.nsq
file: /newlustre/home/xiongqian/database/NT/nt.45.nsi
file: /newlustre/home/xiongqian/database/NT/nt.45.nsd
file: /newlustre/home/xiongqian/database/NT/nt.45.nog
file: /newlustre/home/xiongqian/database/NT/nt.nal
BLAST Database creation error: Error: Duplicate seq_ids are found:
LCL|6O9K_A

Has the duplicate seq_ids error been solved?
