makeblastdb -max_file_sz #63

makeblastdb has a -max_file_sz option, set by default to 1GB. Can we increase this limit by default to 10GB, or should we offer it as a parameter?

Comments
If you make databases that size, will it start to split them? If so, we may need to update the BLAST datatype to look for the alternative filenames for each chunk, plus the alias file.
I need to figure this out.
One of our users is also now running into this. @bgruening do you recall if increasing -max_file_sz was what fixed this for you?
I think this fixed it for me, yes.
It looks like setting -max_file_sz higher doesn't always avoid this. Anyway, unless this is already fixed in 2.6.0+, I guess this is just a BLAST issue and not related to the wrapper.
Final update from me: this still happens in 2.6.0+. It looks like it occurs whenever the .nhr file is huge, so that multiple files have to be written along with a .nal alias file. I'll try to track down where to report BLAST bugs and report this there.
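To make that concrete, here is a rough sketch of the behaviour being described (file names and sizes are invented for illustration): once the data no longer fits under -max_file_sz, makeblastdb writes numbered volumes plus a .nal alias file rather than the single-file layout the Galaxy composite datatype expects.

```sh
# Hypothetical example: build a large nucleotide database with a 1GB volume cap
makeblastdb -in huge.fasta -dbtype nucl -out huge_db -max_file_sz 1GB

# With a multi-gigabyte input this typically produces numbered volumes
# plus an alias file, instead of a single huge_db.nhr/.nin/.nsq set:
ls huge_db*
# huge_db.00.nhr  huge_db.00.nin  huge_db.00.nsq
# huge_db.01.nhr  huge_db.01.nin  huge_db.01.nsq
# huge_db.nal
```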
I am facing the same issue with a 9.6 GB FASTA file. Did anyone manage to fix it? I am using protein sequences on BLAST Galaxy Version 0.3.0. My error is "BLAST Database creation error: Error: Duplicate seq_ids are found: ..."
@anilthanki your error is different: the wrapper checks for duplicate sequence IDs and aborts if it finds any. BLAST+ itself copes fine, but with many of the output formats, including the tabular default we use, duplicates become very difficult to distinguish and will most likely break your analysis.
The way the BLAST databases are defined in Galaxy as composite datatypes assumes a single file (no .nal or .pal alias pointing at chunks). This indirectly limits the database size, because large databases are split into chunks. Fixing this would be hard (and complicated to deploy now that the datatypes live in Galaxy itself - I'm not sure what would happen if the Tool Shed defined datatype was different). The workaround is to build the database outside Galaxy and add it to the *.loc file instead.
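For anyone taking that workaround, a minimal sketch of what a protein blastdb_p.loc entry might look like (the ID, caption and path below are invented; check your instance's tool-data directory for the real file and its column order):

```sh
# Hypothetical entry in tool-data/blastdb_p.loc
# (tab-separated columns: unique ID, display caption, path to the database base name)
my_prot_db	My big protein DB (built outside Galaxy)	/data/blastdb/my_prot_db
```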
@peterjc I checked again and there are no duplicates. I think it's something to do with the size of the input.
Strange, but possible. I didn't check whether the wording matched my script's error message. Can you reproduce the error by calling makeblastdb at the command line outside of Galaxy?
I cannot reproduce the error at the command line on my local machine; I tried it with and without the -max_file_sz parameter. Any tips on creating the database at the command line without indexing, so that it creates only one file that I can upload to Galaxy for the rest of the analysis?
The BLAST database datatype in Galaxy does not support upload into Galaxy - the expectation is that you upload the FASTA file and run makeblastdb within Galaxy, or that the Galaxy admin adds the database to the *.loc file. Reproducing this outside Galaxy would be really instructive - does the failing command line Galaxy used (read this via a failed makeblastdb history entry) fail in the same way outside Galaxy?
Yes, I tried running the same command as Galaxy on my local machine and it was failing because of the "-hash_index" parameter. So I tried without indexing and it worked fine at the command line and in Galaxy.
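In other words (a hedged sketch; the file names are placeholders), dropping -hash_index from an otherwise identical command avoided the failure:

```sh
# Failing, roughly as generated by the wrapper at the time:
makeblastdb -in input.fasta -dbtype prot -out my_db -hash_index -max_file_sz 1GB

# Working: the same command with the hash index generation dropped
makeblastdb -in input.fasta -dbtype prot -out my_db -max_file_sz 1GB
```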
Quoting the command line help:
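(The quoted help text seems to have been lost from this copy of the thread; from memory the relevant excerpt of makeblastdb -help reads roughly as below, though the exact wording and default vary between BLAST+ releases.)

```sh
$ makeblastdb -help | grep -A 2 max_file_sz
 -max_file_sz <String>
   Maximum file size for BLAST database files
   Default = `1GB'
```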
Given the discussion above, it sounds like using a larger value here would be useful (since in the Galaxy context we don't currently cope with chunked databases).
Has anyone found a solution to this? I've tried everything written here but nothing seems to work. I'm getting the same error.
@KinogaMIchael Does explicitly deduplicating your FASTA file first help? It sounds like our check via https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/check_no_duplicates.py thinks the file is OK, only for BLAST itself to complain about a duplicate (the error message wording is different). The discussion on this issue was about changing the -max_file_sz setting. Or better, are you able to try running the same makeblastdb command at the command line outside of Galaxy?
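If it helps, a quick hedged way to check for duplicate identifiers yourself (this assumes standard FASTA headers where the ID is the first whitespace-separated word after '>'):

```sh
# List any FASTA identifiers that appear more than once; no output means no duplicates
grep '^>' input.fasta | cut -d' ' -f1 | sort | uniq -d
```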
Dear @peterjc, I have the same problem, using a FASTA file of around 14 GB. I did try this in the terminal, but adding -max_file_sz '20GB' gives me the following: "BLAST options error: max_file_sz must be < 4 GiB". Cheers,
@annabeldekker and are there duplicated identifiers in your input files? |
@peterjc No, I checked and there are no duplicated identifiers in the input file.
Yet you still get a message like "BLAST Database creation error: Error: Duplicate seq_ids are found"?
Exactly, when I call this outside of Galaxy I get the same Duplicate seq_ids error, even though there are no duplicates in the file.
OK, good. So it isn't my fault 😉 Please email the NCBI team at blast-help (at) ncbi.nlm.nih.gov with a reproducible example (and, to avoid confusing them, I suggest you don't mention Galaxy). If you get a reference number it would be useful to log it here.
Thanks, @peterjc we will keep you guys updated |
We circumvented the error by using the '-parse_seqids' option, which feels a bit odd. It still seems like a BLAST bug, but at least now it runs without issues. @KinogaMIchael maybe you could try that as well!
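For reference, a rough sketch of that workaround (file names are placeholders, and this only reflects what was reported above, not an officially documented fix):

```sh
# Adding -parse_seqids reportedly avoided the spurious "Duplicate seq_ids" failure here
makeblastdb -in big_proteins.fasta -dbtype prot -out big_proteins \
    -parse_seqids -max_file_sz 1GB
```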
That does still sound like a BLAST bug, and worth reporting, including the fact that using -parse_seqids works around it.
@peterjc deduplicating the FASTA file doesn't help - nothing helped. @annabeldekker I think it's a BLAST bug; I tried this in my terminal.
@KinogaMIchael this does sound like a BLAST bug, please do report it to the email address given above.
Hello! I also hit this problem when running makeblastdb to build a database.
And finally, with blast-2.5.0+ I could run the makeblastdb command, but a new problem came up. Command: `nohup /newlustre/home/xiongqian/software/ncbi-blast-2.5.0+/bin/makeblastdb -in nt -dbtype nucl -out nt -parse_seqids -max_file_sz 2GB &` Has the Duplicate seq_ids error been solved?