
makeblastdb -max_file_sz #63

Open

bgruening opened this issue Jun 9, 2015 · 30 comments
@bgruening
Contributor

makeblastdb has a -max_file_sz option, which defaults to 1GB. Can we increase this default to 10GB, or should we expose it as a parameter?
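For context, the flag would be passed along these lines (a hypothetical invocation; the input file and database name are placeholders):

makeblastdb -in sequences.fasta -dbtype nucl -out mydb -max_file_sz 10GB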

@bgruening changed the title from "makeblastdb --maxfilesize" to "makeblastdb -max_file_sz" on Jun 9, 2015
@peterjc
Owner

peterjc commented Jun 10, 2015

If you make databases that size, will it start to split them? If so, we may need to update the BLAST datatype to look for the alternative filenames for each chunk, plus the alias file (*.nal or *.pal).
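For illustration, the alias file is a small text file listing the volumes; a minimal sketch, with an assumed database name, would look roughly like:

TITLE mydb
DBLIST mydb.00 mydb.01 mydb.02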

@bgruening
Contributor Author

I need to figure this out.
This is still failing with "BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID|18349221".

@dpryan79

dpryan79 commented Jan 10, 2017

One of our users is also now running into this. @bgruening do you recall if increasing -max_file_sz fixed this? I'm trying 10GB now, but this isn't exactly a quick process.

Update: -max_file_sz '10GB' causes the same error to occur, just later on. Oddly, none of the files are bigger than ~2GB. I'll try 100GB and see what happens, but I suspect this is just a BLAST bug.

@bgruening
Contributor Author

I think this fixed it for me, yes.

@dpryan79

dpryan79 commented Jan 10, 2017

It looks like setting -max_file_sz >2GB is either ignored or otherwise capped at 2GB. Either way, going up to 100GB still produces this error on the dataset in question here. I'm trying 2.6.0+ to see if the issue is resolved there (in that version, -max_file_sz produces an error message if you input something greater than 2GB, which is an improvement over the older behavior).

Anyway, unless this is already fixed in 2.6.0+, I guess this is just a BLAST issue and not related to the wrapper.

@dpryan79

Final update from me: this still happens in 2.6.0+. It looks like this occurs whenever the .nhr file is huge, causing multiple volumes to be written along with a .nal alias file. I'll try to track down where to report BLAST bugs and report this there.

@anilthanki

I am facing the same issue with a 9.6 GB FASTA file. Did anyone manage to fix it?

I am using protein sequences with the BLAST Galaxy wrapper, version 0.3.0.

My error is "BLAST Database creation error: Error: Duplicate seq_ids are found:
GNL|BL_ORD_ID:3299542"

@peterjc
Owner

peterjc commented Nov 29, 2018

@anilthanki your error is different: the wrapper checks for duplicate sequence IDs and aborts if it finds any. BLAST+ itself copes fine, but with many of the output formats, including the tabular default we use, duplicates become very difficult to distinguish and will most likely break your analysis.
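(For anyone wanting to pre-check an input file, a quick shell sketch that prints any duplicated identifiers, assuming the ID is the first word of each FASTA header line:)

grep '^>' input.fasta | awk '{print $1}' | sort | uniq -d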

@peterjc
Owner

peterjc commented Nov 29, 2018

The way the BLAST databases are defined in Galaxy as composite data types assumes a single file (no .nal or .pal alias pointing at chunks).

This indirectly limits the database size, since large databases are split into chunks.

Fixing this would be hard (and complicated to deploy now that the data types live in Galaxy itself; I am not sure what would happen if the Tool Shed-defined data type were different).

A workaround is to define the database outside Galaxy and add it to the *.loc file instead.
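For example, a Galaxy admin might add a line like the following to tool-data/blastdb.loc (or blastdb_p.loc for protein databases). This is only a sketch: the three tab-separated columns (unique ID, display name, path prefix of the database files) follow the standard *.loc sample layout, and the paths shown are placeholders:

nt_2018	NCBI nt (2018)	/data/blast/nt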

@anilthanki

@peterjc I checked again and there are no duplicates. I think it's something to do with the size of the input.

@peterjc
Owner

peterjc commented Nov 29, 2018

Strange, but possible. I hadn't checked whether the wording matched my script's error message.

Can you reproduce the error calling makeblastdb at the command line outside of Galaxy?

@anilthanki

I cannot reproduce the error on the command line on my local machine; I tried with and without the -max_file_sz parameter.

Any tips on creating the database on the command line without indexing, so that it creates only one file that I can upload to Galaxy for the rest of the analysis?

@peterjc
Owner

peterjc commented Nov 29, 2018

The BLAST databases datatype in Galaxy does not support upload into Galaxy: the expectation is that you upload the FASTA file and run makeblastdb within Galaxy, or that the Galaxy admin adds the database to the *.loc file.

Reproducing this outside Galaxy would be really instructive: does the failing command line string Galaxy used (read this via a failed makeblastdb history entry) fail in the same way outside Galaxy?

@anilthanki

anilthanki commented Nov 30, 2018

Yes, I tried running the same command as Galaxy on my local machine, and it was failing because of the -hash_index parameter. So I tried without indexing, and it worked fine on the command line and in Galaxy.
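(For the record, the working form was along these lines; a sketch with placeholder paths, simply dropping -hash_index from the Galaxy-generated command:)

makeblastdb -in proteins.fasta -dbtype prot -out proteins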

@peterjc
Owner

peterjc commented Nov 30, 2018

So makeblastdb ... -hash_index was causing the "BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID:3299542" error? If so, that is good to clarify, but does seem to be unrelated to the original -max_file_sz problem.

Quoting the command line help:

$ makeblastdb -help
...
-max_file_sz <String>
  Maximum file size for BLAST database files
  Default = `1GB'
...

Given the discussion above, it sounds like using a larger value here would be useful (since in the Galaxy context we don't currently cope with chunked databases).

@nathanweeks

It looks like the makeblastdb -max_file_sz limit was increased to 4GB in BLAST+ 2.8.0:

  • The 2GB output file size limit for makeblastdb has been increased to 4 GB.

@KinogaMIchael

Has anyone found a solution to this? I've tried everything written here and nothing seems to work. I'm getting the same error:
<BLAST Database creation error: Error: Duplicate seq_ids are found:
DBJ|LC456629.1>
I have a 16 GB FASTA file. Anyone? Someone?

@peterjc
Owner

peterjc commented May 7, 2021

@KinogaMIchael Does explicitly deduplicating your FASTA file first help?

It sounds like our check via https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/check_no_duplicates.py thinks the file is OK, only for BLAST itself to complain about a duplicate (the error message wording is different).

The discussion on this issue was about changing -max_file_sz which may or may not be related. The wrapper https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/ncbi_makeblastdb.xml does not currently set this value. Are you able to try editing the wrapper to add this to the command line?

Or better, are you able to try running the same makeblastdb command at the terminal? And adding -max_file_sz?
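(A minimal way to deduplicate by ID, keeping only the first record for each identifier; an awk sketch that assumes the ID is the first word of the header line:)

awk '/^>/{keep = !seen[$1]++} keep' input.fasta > deduplicated.fasta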

@annabeldekker

Dear @peterjc,

I have the same problem, using a FASTA file of around 14GB. I did try this in the terminal, but adding -max_file_sz '20GB' gives me the following: "BLAST options error: max_file_sz must be < 4 GiB".
Adding this to the XML wrapper will thus not make a difference.
I'm curious about other solutions. Thanks in advance!

Cheers,
Annabel

@peterjc
Owner

peterjc commented May 11, 2021

@annabeldekker and are there duplicated identifiers in your input files?

@annabeldekker

@peterjc
Hi, I checked and there aren't!

@peterjc
Owner

peterjc commented May 11, 2021

Yet you still get a message like "BLAST Database creation error: Error: Duplicate seq_ids are found: ..." when calling makeblastdb at the command line outside of Galaxy?

@annabeldekker

annabeldekker commented May 11, 2021

Exactly: when I call this outside of Galaxy I get the same duplicate seq_ids error, even though there are no duplicates in the file.
I'm using version 2.10.1, by the way.

@peterjc
Owner

peterjc commented May 11, 2021

OK, good. So it isn't my fault 😉

Please email the NCBI team at blast-help (at) ncbi.nlm.nih.gov with a reproducible example (and, to avoid confusing them, I suggest not mentioning Galaxy). If you get a reference number, it would be useful to log it here.

@annabeldekker

Thanks @peterjc, we will keep you all updated.

@annabeldekker

We circumvented the error by using the -parse_seqids option, which feels a bit odd. It still seems like a BLAST bug, but at least now it runs without issues. @KinogaMIchael maybe you could try that as well!

@peterjc
Owner

peterjc commented May 11, 2021

That does still sound like a BLAST bug, and worth reporting including the use of -parse_seqids as a possible workaround.

@KinogaMIchael

KinogaMIchael commented May 18, 2021

@peterjc deduplicating the FASTA file doesn't help; nothing helped. @annabeldekker I think it's a BLAST bug. I tried this in my terminal: makeblastdb -in /home/Virusdb/viruses.fa -parse_seqids -blastdb_version 5 -title "virusdb" -dbtype nucl -max_file_sz 4GB and still got the same error.

@peterjc
Owner

peterjc commented May 18, 2021

@KinogaMIchael this does sound like a BLAST bug, please do report it to the email address requested.

@xiongqian123456789

Hello! I also ran into this problem when building a database with makeblastdb.
I have tried different versions of BLAST to make the database, and hit these problems:

  1. BLAST Database creation error: Multi-letters chain PDB id is not supported in v4 BLAST DB
  2. Error: mdb_env_open: Function not implemented

Finally, with blast-2.5.0+, the makeblastdb command would run, but a new problem came up:

Command: nohup /newlustre/home/xiongqian/software/ncbi-blast-2.5.0+/bin/makeblastdb -in nt -dbtype nucl -out nt -parse_seqids -max_file_sz 2GB &
Error:
file: /newlustre/home/xiongqian/database/NT/nt.44.nog
file: /newlustre/home/xiongqian/database/NT/nt.45.nin
file: /newlustre/home/xiongqian/database/NT/nt.45.nhr
file: /newlustre/home/xiongqian/database/NT/nt.45.nsq
file: /newlustre/home/xiongqian/database/NT/nt.45.nsi
file: /newlustre/home/xiongqian/database/NT/nt.45.nsd
file: /newlustre/home/xiongqian/database/NT/nt.45.nog
file: /newlustre/home/xiongqian/database/NT/nt.nal
BLAST Database creation error: Error: Duplicate seq_ids are found:
LCL|6O9K_A

Has the duplicate seq_ids error been solved?
