`foldseek cluster` crashes with `--cluster-reassign 1` #399

shiraz-shah · 2024-12-17T17:46:25Z

Amazing software, guys! More documentation would be helpful, though!!

Expected Behavior

That clustering works with cluster reassignment enabled. Clustering works fine with it disabled.

Current Behavior

foldseek cluster DB C tmp --cluster-reassign 1
Crashes with error:
`awk: fatal: cannot open file 'tmp/9215817526405491371/seq_seeds_ca.index' for reading: No such file or directory`

Steps to Reproduce (for bugs)

Make foldseek database composed of only amino acid sequence and 3di sequences, i.e.:

DB
DB.dbtype
DB_h
DB_h.dbtype
DB_h.index
DB.index
DB_ss
DB_ss.dbtype
DB_ss.index
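
To check quickly which companion databases a prefix actually has, a small helper along these lines may be useful. It is only a sketch: `db_components` is a hypothetical name, and the suffix meanings (`_h` headers, `_ss` 3Di, `_ca` C-alpha coordinates) are assumptions inferred from the file names in this thread.

```shell
# Report which companion databases exist for a database prefix.
# Only the .index file is checked, so this says nothing about
# whether the data files themselves are complete.
# Suffix meanings are assumptions based on the names in this thread:
#   (none) = amino acids, _h = headers, _ss = 3Di, _ca = C-alpha coordinates.
db_components() {
    prefix="$1"
    for suffix in "" _h _ss _ca; do
        if [ -f "${prefix}${suffix}.index" ]; then
            echo "${prefix}${suffix}: present"
        else
            echo "${prefix}${suffix}: missing"
        fi
    done
}

# Example: db_components DB
```

For the database listed above, this would report `DB_ca` as missing, which matches the `seq_seeds_ca.index` error.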

Foldseek Output (for bugs)

awk: fatal: cannot open file 'tmp/9215817526405491371/seq_seeds_ca.index' for reading: No such file or directory

Context

3di sequences were computed using ProstT5 in Python to obtain GPU acceleration
foldseek AA/3di database was created using generate_foldseek_db.py

Your Environment

Include as many relevant details about the environment you experienced the bug in.

foldseek Version: 9.427df8a
Server specifications: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz 64GB RAM
Operating system and version: Ubuntu Server 22.04


martin-steinegger · 2025-01-01T08:43:34Z

Yes, documentation is indeed a weak spot. What do you need better documented?
The issue you mentioned is a bug in the reassign without Calpha. I fixed it.

shiraz-shah · 2025-01-03T08:21:59Z

Amazing! Thank you, Martin.

I missed documentation on the following areas (though maybe I wasn't looking in the right places):

1. How to properly use ProstT5 for generating 3di representations of aa sequences
2. How ProstT5 GPU acceleration is achieved. I could not get it to work with foldseek alone. Used Python instead.
3. Provide the users with some real world statistics on accuracy tradeoffs when searching with experimental PDBs, vs. alphafold PDBs, vs. ProstT5-predicted 3dis, vs. sequence alone.
4. Overview of the types of databases that foldseek supports (structure vs. faa + 3di vs faa only, etc etc.) and what is the right way to create them with e.g. createtsv, ffindex, ProstT5, etc.
5. Provide better intuition of which operations (easy workflows, vs. alignment vs clustering) are allowed for what types of databases. Right now one often encounters a wrong db type error with no info on what db type is expected. (While the mmseqs conventions are familiar, each input data type could eventually benefit from its own series of operations. E.g. pdb-search, pdb-cluster, 3di-search, 3di-cluster, etc etc.)
6. Better explanation of database structure and content, and how to retrieve desired data from databases. Especially result databases seem to contain a wealth of information that could be useful to access dynamically (with e.g. ffindex) instead of having to generate a giant tsv flat file to filter manually. E.g. retrieval of alignment results for specific queries would be especially useful.
7. The search result createtsv output fields are not properly documented, at least for the 3di+aa-only search (I didn't try PDB or CIF). The README says query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits but to me it looks like query,target,bits,fident,evalue,qstart,qend,qlen,tstart,tend,tlen. I like this combination of output fields because it's useful downstream. But do confirm whether these are really the fields and update the documentation accordingly.
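
Until the field order is confirmed, one way to inspect the hits for a single query is to label the columns on the fly. The sketch below assumes the unconfirmed column order reported above (query, target, bits, fident, evalue, qstart, qend, qlen, tstart, tend, tlen); `hits_for_query` and `RESULT_COLUMNS` are hypothetical names, and the order should be adjusted once the maintainers confirm it.

```shell
# ASSUMPTION: column order as observed by the poster, not confirmed by the README.
RESULT_COLUMNS="query target bits fident evalue qstart qend qlen tstart tend tlen"

# Print every hit of one query from a tab-separated result file as name=value pairs.
hits_for_query() {
    tsv="$1"; query="$2"
    awk -F '\t' -v q="$query" -v names="$RESULT_COLUMNS" '
        BEGIN { n = split(names, col, " ") }
        # Appending "" forces a string comparison, so IDs such as 1 and 1.0 stay distinct.
        ($1 "") == (q "") {
            line = ""
            for (i = 1; i <= NF; i++) {
                # Columns beyond the assumed list get a generic label.
                label = (i <= n) ? col[i] : "col" i
                line = line (i > 1 ? " " : "") label "=" $i
            }
            print line
        }' "$tsv"
}

# Example: hits_for_query result.tsv my_query_id
```

This still scans the flat file, so it does not replace indexed access to the result database; it only makes the assumed field layout explicit.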

Your team has done an excellent job of writing a piece of software that is versatile, computationally efficient and, not least, revolutionary. The field craves such tools. Currently, however, the lack of documentation is the bottleneck for widespread adoption. Everybody wants to use this! Thank you again for this.

martin-steinegger · 2025-01-03T17:29:08Z

Thank you for the feedback. We will add this documentation.
For (2): We are working on better foldseek ProstT5 support. We plan to push it in the next few days.

martin-steinegger · 2025-01-06T10:07:27Z

@shiraz-shah we now have a static GPU binary that works with ProstT5. We also reworked the documentation. Could you please give it a try? You need to redownload the weights, though.

wget https://mmseqs.com/foldseek/foldseek-linux-gpu.tar.gz; tar xvfz foldseek-linux-gpu.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

shiraz-shah · 2025-01-10T07:42:56Z

Martin, GPU inference worked great for generating a 3di database directly from aa sequences! Amazing!! What about `search` though? When I did: `foldseek search DB DB aln tmp --gpu 1 --prefilter-mode 1` I got: `Database vOTUs_ss is not a valid GPU database`

martin-steinegger · 2025-01-10T13:22:11Z

Great that ProstT5 works smoothly.

Also, thank you for pointing out the problem. We've updated the documentation to clarify that GPU searches need a padded database created with makepaddedseqdb (see "Pad database for fast GPU search").

Our easy-search workflow calls makepaddedseqdb internally, but the search workflow requires you to pad the database explicitly.
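
The two explicit steps can be sketched as a small wrapper. This is a hedged illustration, not an official workflow: `gpu_search` and the `FOLDSEEK` variable are hypothetical, the `_pad` suffix is just a naming choice, and it assumes that only the target database needs padding.

```shell
# Command used to invoke foldseek. Set FOLDSEEK="echo foldseek" for a dry run
# that prints the commands instead of executing them.
: "${FOLDSEEK:=foldseek}"

# Pad the target database, then search against the padded copy on the GPU.
# Arguments after the first four are passed through to `foldseek search`.
gpu_search() {
    query="$1"; target="$2"; result="$3"; tmp="$4"
    shift 4
    $FOLDSEEK makepaddedseqdb "$target" "${target}_pad" || return 1
    $FOLDSEEK search "$query" "${target}_pad" "$result" "$tmp" --gpu 1 "$@"
}

# Example (dry run):
#   FOLDSEEK="echo foldseek" gpu_search DB DB aln tmp --prefilter-mode 1
```

Padding is a one-time cost per target database, so in practice the `makepaddedseqdb` step would be run once and the padded database reused across searches.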

shiraz-shah · 2025-01-10T18:03:24Z

OK, Martin, I just tested this with my data set, and here's what it says:

> foldseek makepaddedseqdb vOTUs vOTUs_pad
makepaddedseqdb vOTUs vOTUs_pad 

MMseqs Version:          	2a7d682841e78deb511ada92fa5caa1f8f183f14
Substitution matrix      	aa:3di.out,nucl:3di.out
Mask residues            	1
Mask residues probability	0.9
Write lookup file        	1
Threads                  	16
Verbosity                	3
Cluster search           	0

Failed to execute vOTUs_pad.sh with error 2.

It writes a file vOTUs_pad.sh, but executing that file does nothing.

Any ideas?

milot-mirdita · 2025-01-10T18:06:42Z

Please update the binary again. We fixed this bug here in b2e41c1. But use the latest commit anyway. That one should be close to release candidate status for the next release.

shiraz-shah · 2025-01-10T18:41:57Z

OK, that looks better. But now it says:

> foldseek makepaddedseqdb vOTUs vOTUs_pad
makepaddedseqdb vOTUs vOTUs_pad 

MMseqs Version:          	12b76f35bfcdde7f23f47109b8fbfad219427e52
Substitution matrix      	aa:3di.out,nucl:3di.out
Mask residues            	1
Mask residues probability	0.9
Write lookup file        	1
Threads                  	16
Verbosity                	3
Cluster search           	0

lndb vOTUs_h vOTUs_pad_tmp_ss_h 

Time for processing: 0h 0m 0s 0ms
lndb vOTUs_ss vOTUs_pad_tmp_ss 

Time for processing: 0h 0m 0s 0ms
makepaddedseqdb vOTUs_pad_tmp_ss vOTUs_pad_ss --sub-mat 'aa:3di.out,nucl:3di.out' --score-bias 0 --mask 1 --mask-prob 0.9 --write-lookup 1 --threads 16 -v 3 

[=================================================================] 100.00% 1.12M 4s 116ms      
Time for merging to vOTUs_pad_ss: 0h 0m 0s 167ms
Time for merging to vOTUs_pad_ss_h: 0h 0m 0s 67ms
Time for processing: 0h 0m 5s 270ms
rmdb vOTUs_pad_tmp_ss 

Time for processing: 0h 0m 0s 0ms
rmdb vOTUs_pad_tmp_ss_h 

Time for processing: 0h 0m 0s 0ms
renamedbkeys vOTUs_pad_ss.gpu_mapping1 vOTUs vOTUs_pad --subdb-mode 1 --threads 16 -v 3 

Time for merging to vOTUs_pad: 0h 0m 0s 72ms
Time for merging to vOTUs_pad_h: 0h 0m 0s 66ms
Time for processing: 0h 0m 1s 321ms
renamedbkeys vOTUs_pad_ss.gpu_mapping1 vOTUs_ca vOTUs_pad_ca --subdb-mode 1 --threads 16 -v 3 

Key 711779 not found in database
/data/shiraz/T5/foldseek_gpu/vOTUs_pad.sh: 43: fail: not found

FYI, I can see that the input vOTUs and vOTUs_ss are the exact same length (number of lines). Also, it seems the above command succeeds in generating vOTUs_pad and vOTUs_pad_ss. They also appear to have the correct size (same number of MBs as vOTUs and vOTUs_ss).

However when I run foldseek search --gpu 1 .. I get the same error as before (Database vOTUs_ss is not a valid GPU database)
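
Comparing line counts does not show whether two databases hold the same keys, which is what the `Key 711779 not found in database` message is about. A possible check, assuming MMseqs2-style `.index` files (tab-separated, key in the first column), is sketched below; `missing_keys` is a hypothetical helper.

```shell
# Print keys that occur in the first .index file but not in the second.
# ASSUMPTION: both files are tab-separated with the entry key in column 1.
missing_keys() {
    # The second argument is read first to collect its keys, then the first
    # argument is scanned for keys that were never seen.
    awk -F '\t' '
        FILENAME == ARGV[1] { seen[$1] = 1; next }
        !($1 in seen)       { print $1 }
    ' "$2" "$1"
}

# Example: missing_keys vOTUs.index vOTUs_ca.index
```

Any key printed by the example call is present in the sequence database but absent from the `_ca` database, which would be consistent with the two having been built from different inputs.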

martin-steinegger · 2025-01-10T18:58:08Z

It stops at the step where it attempts to rename the Calpha database. Does your vOTUs dataset include Calphas, or were the 3Di sequences predicted using ProstT5? Could you please share the steps you used to generate the vOTUs database?

shiraz-shah · 2025-01-10T19:21:41Z

It's ProstT5 only, no Calpha. I generated it like this as per your new instructions:

foldseek databases ProstT5 weights tmp
foldseek createdb vOTUs.faa vOTUs --prostt5-model weights --gpu 1

martin-steinegger · 2025-01-11T09:22:27Z

I just tested it with the latest version on my database and it worked; see the log below. I noticed that your database includes a vOTUs_ca. I'm unsure where this comes from, as my ProstT5-generated database does not include a _ca database. Could it be that you previously generated a vOTUs database using real structures? You could try deleting the Calpha part with rm -f vOTUs_ca* and rerunning makepaddedseqdb.

createdb sample.fasta q --prostt5-model prostt5
q exists and will be overwritten
q exists and will be overwritten
createdb sample.fasta q --prostt5-model prostt5

MMseqs Version:             	591cd2ccbef8e7907155870ddeb5551774623cda
Use GPU                     	0
Path to ProstT5             	prostt5
Chain name mode             	0
Createdb extraction mode    	0
Interface distance threshold	8
Write mapping file          	0
Mask b-factor threshold     	0
Coord store mode            	2
Write lookup file           	1
Input format                	0
File Inclusion Regex        	.*
File Exclusion Regex        	^$
Threads                     	8
Verbosity                   	3

Converting sequences

Time for merging to q_h: 0h 0m 0s 5ms
Time for merging to q: 0h 0m 0s 4ms
Database type: Aminoacid
Metal
CPU
[=================================================================] 100.00% 15 1m 18s 576ms
Time for merging to q_ss: 0h 0m 0s 3ms
Time for merging to q_ss_tmp: 0h 0m 0s 246ms
Time for processing: 0h 1m 19s 245ms


makepaddedseqdb q q_pad

MMseqs Version:          	591cd2ccbef8e7907155870ddeb5551774623cda
Substitution matrix      	aa:3di.out,nucl:3di.out
Mask residues            	1
Mask residues probability	0.9
Write lookup file        	1
Threads                  	8
Verbosity                	3
Cluster search           	0

lndb q_h q_pad_tmp_ss_h

Time for processing: 0h 0m 0s 2ms
lndb q_ss q_pad_tmp_ss

Time for processing: 0h 0m 0s 2ms
makepaddedseqdb q_pad_tmp_ss q_pad_ss --sub-mat 'aa:3di.out,nucl:3di.out' --score-bias 0 --mask 1 --mask-prob 0.9 --write-lookup 1 --threads 8 -v 3

[=================================================================] 100.00% 15 0s 1ms
Time for merging to q_pad_ss: 0h 0m 0s 4ms
Time for merging to q_pad_ss_h: 0h 0m 0s 3ms
Time for processing: 0h 0m 0s 24ms
rmdb q_pad_tmp_ss

Time for processing: 0h 0m 0s 1ms
rmdb q_pad_tmp_ss_h

Time for processing: 0h 0m 0s 1ms
renamedbkeys q_pad_ss.gpu_mapping1 q q_pad --subdb-mode 1 --threads 8 -v 3

Time for merging to q_pad: 0h 0m 0s 0ms
Time for merging to q_pad_h: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 7ms
q_pad_h exists and will be overwritten
renamedbkeys q_pad_ss.gpu_mapping1 q_h q_pad_h --subdb-mode 1 --threads 8 -v 3

Time for merging to q_pad_h: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 3ms
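
As a more cautious variant of the `rm -f vOTUs_ca*` suggestion above, the leftover Calpha files could be moved aside instead of deleted. This is only a sketch: `stash_calpha` is a hypothetical helper, and the backup directory should not itself match the `PREFIX_ca*` pattern.

```shell
# Move leftover C-alpha files (PREFIX_ca*) into a backup directory
# instead of deleting them. Prints each file that was moved.
stash_calpha() {
    prefix="$1"; backup="$2"
    mkdir -p "$backup" || return 1
    for f in "${prefix}"_ca*; do
        [ -e "$f" ] || continue   # the glob matched nothing
        mv "$f" "$backup/" && echo "moved $f"
    done
}

# Example:
#   stash_calpha vOTUs ca_backup && foldseek makepaddedseqdb vOTUs vOTUs_pad
```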

martin-steinegger added a commit that referenced this issue Jan 1, 2025

Fix issue #399

0d8d966
