foldseek cluster crashes with --cluster-reassign 1 #399

Open
shiraz-shah opened this issue Dec 17, 2024 · 12 comments

Comments

@shiraz-shah

shiraz-shah commented Dec 17, 2024

Amazing software, guys! More documentation would be helpful, though!!

Expected Behavior

That clustering works with cluster reassignment enabled. Clustering works fine with it disabled.

Current Behavior

foldseek cluster DB C tmp --cluster-reassign 1
Crashes with error:
awk: fatal: cannot open file 'tmp/9215817526405491371/seq_seeds_ca.index' for reading: No such file or directory

Steps to Reproduce (for bugs)

Make a foldseek database composed only of amino acid and 3di sequences, i.e.:

DB
DB.dbtype
DB_h
DB_h.dbtype
DB_h.index
DB.index
DB_ss
DB_ss.dbtype
DB_ss.index

Foldseek Output (for bugs)

awk: fatal: cannot open file 'tmp/9215817526405491371/seq_seeds_ca.index' for reading: No such file or directory

Context

  • 3di sequences were computed using ProstT5 in Python to obtain GPU acceleration
  • foldseek AA/3di database was created using generate_foldseek_db.py

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • foldseek Version: 9.427df8a
  • Server specifications: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz, 64GB RAM
  • Operating system and version: Ubuntu Server 22.04
martin-steinegger added a commit that referenced this issue Jan 1, 2025
@martin-steinegger
Collaborator

Yes, documentation is indeed a weak spot. What do you need better documented?
The issue you mentioned is a bug in cluster reassignment for databases without Calpha coordinates. I fixed it.

@shiraz-shah
Author

shiraz-shah commented Jan 3, 2025

Amazing! Thank you, Martin.

I missed documentation on the following areas (though maybe I wasn't looking in the right places):

  1. How to properly use ProstT5 for generating 3di representations of aa sequences

  2. How ProstT5 GPU acceleration is achieved. I could not get it to work with foldseek alone. Used Python instead.

  3. Provide the users with some real world statistics on accuracy tradeoffs when searching with experimental PDBs, vs. alphafold PDBs, vs. ProstT5-predicted 3dis, vs. sequence alone.

  4. Overview of the types of databases that foldseek supports (structure vs. faa + 3di vs faa only, etc etc.) and what is the right way to create them with e.g. createtsv, ffindex, ProstT5, etc.

  5. Provide better intuition of which operations (easy workflows, vs. alignment vs clustering) are allowed for what types of databases. Right now one often encounters a wrong db type error with no info on what db type is expected. (While the mmseqs conventions are familiar, each input data type could eventually benefit from its own series of operations. E.g. pdb-search, pdb-cluster, 3di-search, 3di-cluster, etc etc.)

  6. Better explanation of database structure and content, and how to retrieve desired data from databases. Especially result databases seem to contain a wealth of information that could be useful to access dynamically (with e.g. ffindex) instead of having to generate a giant tsv flat file to filter manually. E.g. retrieval of alignment results for specific queries would be especially useful.

  7. The search result createtsv output fields are not properly documented, at least for the 3di+aa-only search (I didn't try PDB or CIF). The README says query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits but to me it looks like query,target,bits,fident,evalue,qstart,qend,qlen,tstart,tend,tlen. I like this combination of output fields because it's useful downstream. But do confirm whether these are really the fields and update the documentation accordingly.
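Regarding item 6, one stopgap until indexed access is documented is to filter the flat TSV by query ID. The following is only a minimal sketch: it assumes the result file is tab-separated with the query identifier in column 1 (which holds for both field orders quoted in item 7), and `hits_for_query` is a hypothetical helper name, not a foldseek command.

```shell
# Print every row of a tab-separated result file whose first column
# equals the given query ID.
# Assumption: column 1 holds the query identifier.
hits_for_query() {
    query_id=$1
    tsv_file=$2
    # Appending "" forces a string comparison, so numeric-looking IDs
    # such as "1" and "1.0" are not treated as equal.
    awk -F '\t' -v q="$query_id" '($1 "") == (q "")' "$tsv_file"
}

# Usage (file name is a placeholder):
#   hits_for_query my_query_id results.tsv
```

This still scans the whole file on every call, so it does not replace real random access into the result database; it only avoids filtering by hand.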

Your team has done an excellent job of writing software that is versatile, computationally efficient and, not least, revolutionary. The field craves such tools. Currently, however, the lack of documentation is the bottleneck for widespread adoption. Everybody wants to use this! Thank you again for this.

@martin-steinegger
Collaborator

Thank you for the feedback. We will add this documentation.
For (2): We are working on better ProstT5 support in foldseek. We plan to push it in the next few days.

@martin-steinegger
Collaborator

martin-steinegger commented Jan 6, 2025

@shiraz-shah we now have a static GPU binary that works with ProstT5. We also reworked the documentation. Could you please give it a try? You need to redownload the weights, though.

wget https://mmseqs.com/foldseek/foldseek-linux-gpu.tar.gz; tar xvfz foldseek-linux-gpu.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

@shiraz-shah
Author

shiraz-shah commented Jan 10, 2025 via email

@martin-steinegger
Collaborator

Great that ProstT5 works smoothly.

Also, thank you for pointing out the problem. We’ve updated the documentation to clarify that GPU searches require a padded database, created with makepaddedseqdb (see “Pad database for fast GPU search”).

Our easy-search workflow calls makepaddedseqdb internally, whereas the search workflow requires you to pad the database explicitly.
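The two steps for the search workflow can be sketched as a small wrapper. This is an illustration under stated assumptions, not an official recipe: `gpu_search` and the `FOLDSEEK` override variable are hypothetical names, the `_pad` suffix is just a naming choice, and the sketch assumes that only the target database needs padding.

```shell
# Pad the target database, then run a GPU search against the padded copy.
# FOLDSEEK may be overridden (e.g. FOLDSEEK=echo) to preview the commands
# without running them; this variable is an assumption of this sketch.
gpu_search() {
    query_db=$1
    target_db=$2
    result_db=$3
    tmp_dir=$4
    fs=${FOLDSEEK:-foldseek}

    # The search workflow does not pad the database itself, so do it first.
    "$fs" makepaddedseqdb "$target_db" "${target_db}_pad" || return 1

    # Search against the padded database with the GPU enabled.
    "$fs" search "$query_db" "${target_db}_pad" "$result_db" "$tmp_dir" --gpu 1
}

# Usage (database names are placeholders):
#   gpu_search queryDB targetDB resultDB tmp
```

Running it once with `FOLDSEEK=echo` prints the two commands, which is a cheap way to check the database names before starting a long job.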

@shiraz-shah
Author

OK, Martin, I just tested this with my data set, and here's what it says:

> foldseek makepaddedseqdb vOTUs vOTUs_pad
makepaddedseqdb vOTUs vOTUs_pad 

MMseqs Version:          	2a7d682841e78deb511ada92fa5caa1f8f183f14
Substitution matrix      	aa:3di.out,nucl:3di.out
Mask residues            	1
Mask residues probability	0.9
Write lookup file        	1
Threads                  	16
Verbosity                	3
Cluster search           	0

Failed to execute vOTUs_pad.sh with error 2.

It writes a file vOTUs_pad.sh, but executing that file does nothing.

Any ideas?

@milot-mirdita
Member

Please update the binary again. We fixed this bug here in b2e41c1. But use the latest commit anyway. That one should be close to release candidate status for the next release.

@shiraz-shah
Author

shiraz-shah commented Jan 10, 2025

OK, that looks better. But now it says:

> foldseek makepaddedseqdb vOTUs vOTUs_pad
makepaddedseqdb vOTUs vOTUs_pad 

MMseqs Version:          	12b76f35bfcdde7f23f47109b8fbfad219427e52
Substitution matrix      	aa:3di.out,nucl:3di.out
Mask residues            	1
Mask residues probability	0.9
Write lookup file        	1
Threads                  	16
Verbosity                	3
Cluster search           	0

lndb vOTUs_h vOTUs_pad_tmp_ss_h 

Time for processing: 0h 0m 0s 0ms
lndb vOTUs_ss vOTUs_pad_tmp_ss 

Time for processing: 0h 0m 0s 0ms
makepaddedseqdb vOTUs_pad_tmp_ss vOTUs_pad_ss --sub-mat 'aa:3di.out,nucl:3di.out' --score-bias 0 --mask 1 --mask-prob 0.9 --write-lookup 1 --threads 16 -v 3 

[=================================================================] 100.00% 1.12M 4s 116ms      
Time for merging to vOTUs_pad_ss: 0h 0m 0s 167ms
Time for merging to vOTUs_pad_ss_h: 0h 0m 0s 67ms
Time for processing: 0h 0m 5s 270ms
rmdb vOTUs_pad_tmp_ss 

Time for processing: 0h 0m 0s 0ms
rmdb vOTUs_pad_tmp_ss_h 

Time for processing: 0h 0m 0s 0ms
renamedbkeys vOTUs_pad_ss.gpu_mapping1 vOTUs vOTUs_pad --subdb-mode 1 --threads 16 -v 3 

Time for merging to vOTUs_pad: 0h 0m 0s 72ms
Time for merging to vOTUs_pad_h: 0h 0m 0s 66ms
Time for processing: 0h 0m 1s 321ms
renamedbkeys vOTUs_pad_ss.gpu_mapping1 vOTUs_ca vOTUs_pad_ca --subdb-mode 1 --threads 16 -v 3 

Key 711779 not found in database
/data/shiraz/T5/foldseek_gpu/vOTUs_pad.sh: 43: fail: not found

FYI, I can see that the input vOTUs and vOTUs_ss have exactly the same number of lines. Also, the above command seems to succeed in generating vOTUs_pad and vOTUs_pad_ss, and both appear to have the correct size (the same number of MB as vOTUs and vOTUs_ss).

However, when I run foldseek search --gpu 1 .. I get the same error as before (Database vOTUs_ss is not a valid GPU database).

@martin-steinegger
Collaborator

It stops at the step where it attempts to rename the Calpha database. Does your vOTUs dataset include Calphas, or were the 3di sequences predicted using ProstT5? Could you please share the earlier step you used to generate vOTUs?
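A quick way to see whether the Calpha database is out of sync with the main database is to compare the keys in their index files. A minimal sketch, assuming the usual MMseqs-style `.index` layout of tab-separated key, offset and length; `missing_keys` is a hypothetical helper name, not a foldseek command.

```shell
# Print keys that occur in the first index file but not in the second.
# Assumption: both files are tab-separated with the key in column 1.
missing_keys() {
    # awk reads the second argument first and records its keys,
    # then prints keys of the first argument that were never recorded.
    awk -F '\t' '
        FILENAME == ARGV[1] { seen[$1]; next }
        !($1 in seen)       { print $1 }
    ' "$2" "$1"
}

# Usage: keys present in vOTUs.index but absent from vOTUs_ca.index
#   missing_keys vOTUs.index vOTUs_ca.index
```

If this prints any keys (for example the one named in the "Key ... not found in database" message), the `_ca` files do not belong to the current database.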

@shiraz-shah
Author

It's ProstT5 only, no Calpha. I generated it like this as per your new instructions:

foldseek databases ProstT5 weights tmp
foldseek createdb vOTUs.faa vOTUs --prostt5-model weights --gpu 1

@martin-steinegger
Collaborator

I just tested it with the latest version on my own database and it worked; see the log below. I noticed that your database includes a vOTUs_ca. I'm unsure where this comes from, as my ProstT5-generated database does not include a _ca database. Could it be that you previously generated a vOTUs database using real structures? You could try deleting the Calpha part with rm -f vOTUs_ca* and rerunning makepaddedseqdb.

createdb sample.fasta q --prostt5-model prostt5
q exists and will be overwritten
q exists and will be overwritten
createdb sample.fasta q --prostt5-model prostt5

MMseqs Version:             	591cd2ccbef8e7907155870ddeb5551774623cda
Use GPU                     	0
Path to ProstT5             	prostt5
Chain name mode             	0
Createdb extraction mode    	0
Interface distance threshold	8
Write mapping file          	0
Mask b-factor threshold     	0
Coord store mode            	2
Write lookup file           	1
Input format                	0
File Inclusion Regex        	.*
File Exclusion Regex        	^$
Threads                     	8
Verbosity                   	3

Converting sequences

Time for merging to q_h: 0h 0m 0s 5ms
Time for merging to q: 0h 0m 0s 4ms
Database type: Aminoacid
Metal
CPU
[=================================================================] 100.00% 15 1m 18s 576ms
Time for merging to q_ss: 0h 0m 0s 3ms
Time for merging to q_ss_tmp: 0h 0m 0s 246ms
Time for processing: 0h 1m 19s 245ms


makepaddedseqdb q q_pad

MMseqs Version:          	591cd2ccbef8e7907155870ddeb5551774623cda
Substitution matrix      	aa:3di.out,nucl:3di.out
Mask residues            	1
Mask residues probability	0.9
Write lookup file        	1
Threads                  	8
Verbosity                	3
Cluster search           	0

lndb q_h q_pad_tmp_ss_h

Time for processing: 0h 0m 0s 2ms
lndb q_ss q_pad_tmp_ss

Time for processing: 0h 0m 0s 2ms
makepaddedseqdb q_pad_tmp_ss q_pad_ss --sub-mat 'aa:3di.out,nucl:3di.out' --score-bias 0 --mask 1 --mask-prob 0.9 --write-lookup 1 --threads 8 -v 3

[=================================================================] 100.00% 15 0s 1ms
Time for merging to q_pad_ss: 0h 0m 0s 4ms
Time for merging to q_pad_ss_h: 0h 0m 0s 3ms
Time for processing: 0h 0m 0s 24ms
rmdb q_pad_tmp_ss

Time for processing: 0h 0m 0s 1ms
rmdb q_pad_tmp_ss_h

Time for processing: 0h 0m 0s 1ms
renamedbkeys q_pad_ss.gpu_mapping1 q q_pad --subdb-mode 1 --threads 8 -v 3

Time for merging to q_pad: 0h 0m 0s 0ms
Time for merging to q_pad_h: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 7ms
q_pad_h exists and will be overwritten
renamedbkeys q_pad_ss.gpu_mapping1 q_h q_pad_h --subdb-mode 1 --threads 8 -v 3

Time for merging to q_pad_h: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 3ms
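The cleanup suggested above can be made a little more cautious than a bare `rm -f vOTUs_ca*`. The following sketch prints each file before deleting it and matches only `<prefix>_ca` and `<prefix>_ca.*`, so unrelated files that merely start with `_ca` are left alone. `drop_ca` is a hypothetical helper name, and the narrower pattern is a choice of this sketch rather than something foldseek requires.

```shell
# Remove the Calpha component of a database prefix, reporting each file.
drop_ca() {
    prefix=$1
    for f in "${prefix}_ca" "${prefix}_ca".*; do
        # An unmatched glob stays literal; skip it if nothing exists.
        [ -e "$f" ] || continue
        echo "removing $f"
        rm -f -- "$f"
    done
}

# Usage, followed by the padding step from this thread:
#   drop_ca vOTUs
#   foldseek makepaddedseqdb vOTUs vOTUs_pad
```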
