Update RNAformer to output base pairs and structure in dot-bracket notation (#1571)

* Initial tool with text input

* Added test-data and support for FASTA input

* Add .shed.yml

* Update .shed.yml

* Update help section

* Include model and configuration files for RNAformer

* Update tools/rna_tools/rnaformer/.shed.yml

Co-authored-by: Björn Grüning <[email protected]>

* Update tools/rna_tools/rnaformer/infer_rnaformer.xml

Co-authored-by: Björn Grüning <[email protected]>

* Remove check and download of model and configuration files

* Download the model and configuration files if needed rather than include it in test-data.

* Replace requirements with correct version of RNAformer conda package

* Update to use RNAformer 0.0.1

* Update infer_rnaformer.xml

* Add checks when attempting to download model and config files

* Try without downloading model and config files

* Revert to downloading model and config files

* Add logging

* Test configuration of tool and update shed

* Update shed and test data

* Update tools/rna_tools/rnaformer/.shed.yml

Co-authored-by: Björn Grüning <[email protected]>

* Revert to original functionality

* Download model files before tool script is run

* Simplify requirements

* Update tools/rna_tools/rnaformer/.shed.yml

* Check that all given sequences are RNA and exit early if not

* Update tools/rna_tools/rnaformer/infer_rnaformer.xml

Co-authored-by: Björn Grüning <[email protected]>

* Update tools/rna_tools/rnaformer/infer_rnaformer.xml

Co-authored-by: Björn Grüning <[email protected]>

* Sanitize text input to only allow RNA bases and commas

* Update test-data to only include base pair positions in output

* Add dot-bracket and formatted output

* Add dot-bracket notation and base pairs to output [WIP]

* Add dot-bracket notation and base pairs to output [WIP]

* bump tool version

---------

Co-authored-by: Björn Grüning <[email protected]>
ivelet and bgruening authored Jan 26, 2025
1 parent c1c3b6d commit 81ae3a5
Showing 5 changed files with 45 additions and 14 deletions.
33 changes: 25 additions & 8 deletions tools/rna_tools/rnaformer/infer_rnaformer.xml
@@ -1,4 +1,4 @@
-<tool id="infer_rnaformer" name="@EXECUTABLE@" version="@TOOL_VERSION@" profile="22.05">
+<tool id="infer_rnaformer" name="@EXECUTABLE@" version="@TOOL_VERSION@+galaxy1" profile="22.05">
<description>Predict the secondary structure of an RNA with RNAformer</description>
<macros>
<import>macros.xml</import>
@@ -112,6 +112,8 @@ if torch.cuda.is_available():
model.eval()
predicted_structures = []
+orig_seq = ""
for sequence in sequences:
with torch.no_grad():
device = "cpu"
@@ -124,6 +126,8 @@ for sequence in sequences:
length = len(sequence)
src_seq = torch.LongTensor(list(map(seq_stoi.get, sequence)))
+orig_seq = sequence
sample = {}
sample['src_seq'] = src_seq.clone()
sample['length'] = torch.LongTensor([length])[0]
@@ -141,10 +145,12 @@ for sequence in sequences:
pos1_id = pos_id[0].cpu().tolist()
pos2_id = pos_id[1].cpu().tolist()
predicted_structure = f"Pairing index 1: {pos1_id} \nPairing index 2: {pos2_id}"
-print(predicted_structure)
+pairs = [[a, b] for a, b in zip(pos1_id, pos2_id)]
seqlen = len(sample['src_seq'])
dot_bracket =['.'] * seqlen
+pk_count = 0
+pk_list = []
for i in range(len(pos1_id)):
open_index = pos1_id[i]
close_index = pos2_id[i]
@@ -153,8 +159,18 @@ for sequence in sequences:
if dot_bracket[open_index] == '.' and dot_bracket[close_index] == '.':
dot_bracket[open_index] = '('
dot_bracket[close_index] = ')'
+else:
+## pseudoknots or multiplets present in structure- cannot represent with dot-bracket
+pk_count += 1
+pk_list.append(pairs[i])
dot_bracket_str_pred = ''.join(dot_bracket)
+print(f"{orig_seq}")
+print(f"Base pairs: {str(pairs)}")
+print(f"Predicted Structure: {dot_bracket_str_pred}")
+if pk_count > 0:
+print(f"NOTE: {pk_count} pseudoknots and/or multiplets present in predicted structure excluded from dot-bracket notation: {pk_list}")
]]></configfile>
@@ -186,7 +202,7 @@ for sequence in sequences:
<test>
<param name="input_type" value="False"/>
<param name="rna_input_string" value="GCCCGCAUGGUGAAAUCGGUAAACACAUCGCACUAAUGCGCCGCCUCUGGCUUGCCGGUUCAAGUCCGGCUGCGGGCACCA"/>
-<output name="output" file="rna_2d_pred_text.txt"/>
+<output name="output" file="rna_2d_pred_out.txt"/>
</test>
<test>
<param name="input_type" value="True"/>
@@ -196,17 +212,18 @@ for sequence in sequences:
</tests>
<help><![CDATA[
**RNAformer**
-This tool reads RNA sequences and predicts their secondary structure using RNAformer.
+This tool reads RNA sequences and predicts their secondary structure using RNAformer. Note that unlike conventional methods, RNAformer is capable of predicting all possible secondary structure motifs, including pseudoknots and multiplets. These cannot be represented in dot-bracket notation and thus the output will be partial in these cases, excluding these which will be noted in the output file below the (partial) dot-bracket structure.
**Input format**
RNAformer requires one or more RNA sequences either as a single FASTA file or as plain text.
**Outputs**
-- Predicted secondary structure as a text file in the following formats:
-- base pair positions
-- dot-bracket notation
+- Predicted secondary structure as a text file containing the following:
+- RNA input sequence
+- Base pairs of predicted secondary structure
+- Predicted secondary structure in dot-bracket notation
]]></help>
<expand macro="citations" />
-</tool>
+</tool>
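Note on the conversion loop in this file: it writes a bracket pair only when both positions are still '.', and the pair list in the test data below holds both orientations of each contact (for example [0, 76] and [76, 0]), so the mirrored copies land in the NOTE list as well. The following is a minimal sketch of one possible alternative, not the code in this commit. The function name `pairs_to_dot_bracket` is hypothetical, and the tie-break for multiplets (the pair that sorts first wins) is an arbitrary assumption.

```python
def pairs_to_dot_bracket(pairs, seqlen):
    """Return (dot_bracket, leftover) for a list of [i, j] index pairs.

    Hypothetical helper for illustration only; it is not part of RNAformer
    or of the Galaxy wrapper.
    """
    # Store each pair as (low, high) and drop mirrored duplicates, since a
    # symmetric contact matrix reports both [i, j] and [j, i].
    unique = sorted({(min(a, b), max(a, b)) for a, b in pairs if a != b})

    dot_bracket = ['.'] * seqlen
    accepted = []   # pairs written into the dot-bracket string
    leftover = []   # multiplets and crossing (pseudoknot-like) pairs
    for i, j in unique:
        # A position that already carries a bracket signals a multiplet.
        occupied = dot_bracket[i] != '.' or dot_bracket[j] != '.'
        # Two pairs cross when exactly one end of one lies inside the other.
        crossing = any(k < i < l < j or i < k < j < l for k, l in accepted)
        if occupied or crossing:
            leftover.append((i, j))
        else:
            dot_bracket[i], dot_bracket[j] = '(', ')'
            accepted.append((i, j))
    return ''.join(dot_bracket), leftover


# Toy input (made up, not tool output): three nested pairs with mirrored
# copies, one multiplet at position 2, and one crossing pair.
toy_pairs = [[0, 9], [1, 8], [2, 7], [9, 0], [8, 1], [7, 2], [2, 6], [4, 11]]
structure, excluded = pairs_to_dot_bracket(toy_pairs, 12)
print(structure)
print(excluded)
```

Because each contact is stored once and crossing pairs are rejected, the returned string is always balanced, and the leftover list contains only pairs that dot-bracket notation with a single bracket type cannot express.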
4 changes: 4 additions & 0 deletions tools/rna_tools/rnaformer/test-data/fasta_input_false1.fa
@@ -0,0 +1,4 @@
+>Anolis_caro_chrUn_GL343590.trna2_AlaAGC (218800-218872) Ala (AGC) 73 bp Sc: 49.55
+TGGGAATTAGCTCAAATGGAAGAGCGCTCGCTTAGCATGTGAGAGGTAGTGGGATCGATGCCCACATTCTCCA
+>Anolis_caro_chrUn_GL343207.trna3_AlaAGC (1513626-1513698) Ala (AGC) 73 bp Sc: 56.15
+GGGAATTAGCTCAAATGGAAGAGCGCTCGCTTAGCATGCGAGAGGTAGCGGGATTGATGCCCGCATTCTCCA
12 changes: 8 additions & 4 deletions tools/rna_tools/rnaformer/test-data/rna_2d_pred_FASTA.txt
@@ -1,4 +1,8 @@
-Pairing index 1: [0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 8, 9, 9, 10, 11, 12, 13, 13, 17, 18, 20, 21, 21, 22, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 48, 49, 50, 51, 52, 53, 54, 55, 57, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71]
-Pairing index 2: [71, 70, 69, 68, 67, 66, 65, 13, 14, 20, 47, 22, 24, 44, 23, 22, 21, 7, 20, 54, 55, 7, 12, 45, 8, 11, 10, 9, 43, 42, 41, 40, 39, 38, 37, 36, 32, 31, 30, 29, 28, 27, 26, 25, 9, 21, 64, 63, 62, 61, 60, 57, 17, 18, 53, 52, 51, 50, 49, 48, 6, 5, 4, 3, 2, 1, 0]
-Pairing index 1: [0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 8, 9, 9, 10, 11, 12, 13, 13, 14, 20, 20, 21, 21, 22, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 48, 49, 50, 51, 52, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71]
-Pairing index 2: [71, 70, 69, 68, 67, 66, 65, 13, 14, 20, 22, 24, 44, 23, 22, 21, 7, 20, 7, 7, 13, 12, 45, 8, 11, 10, 9, 43, 42, 41, 40, 39, 38, 37, 32, 31, 30, 29, 28, 27, 26, 25, 9, 21, 64, 63, 62, 61, 60, 52, 51, 50, 49, 48, 6, 5, 4, 3, 2, 1, 0]
+UGGGAAUUAGCUCAAAUGGUAGAGCGCUCGCUUAGCAUGUGAGAGGUAGUGGGAUCGAUGCCCACAUUCUCCA
+Base pairs: [[0, 71], [1, 70], [2, 69], [3, 68], [4, 67], [5, 66], [6, 65], [7, 13], [7, 14], [7, 20], [7, 47], [8, 22], [9, 24], [9, 44], [10, 23], [11, 22], [12, 21], [13, 7], [13, 20], [17, 54], [18, 55], [20, 7], [21, 12], [21, 45], [22, 8], [22, 11], [23, 10], [24, 9], [25, 43], [26, 42], [27, 41], [28, 40], [29, 39], [30, 38], [31, 37], [32, 36], [36, 32], [37, 31], [38, 30], [39, 29], [40, 28], [41, 27], [42, 26], [43, 25], [44, 9], [45, 21], [48, 64], [49, 63], [50, 62], [51, 61], [52, 60], [53, 57], [54, 17], [55, 18], [57, 53], [60, 52], [61, 51], [62, 50], [63, 49], [64, 48], [65, 6], [66, 5], [67, 4], [68, 3], [69, 2], [70, 1], [71, 0]]
+Predicted Structure: (((((((((((.()...((..))))((((((((...))))))))....(((((()).)..)))))))))))).
+NOTE: 39 pseudoknots and/or multiplets present in predicted structure excluded from dot-bracket notation: [[7, 14], [7, 20], [7, 47], [9, 44], [11, 22], [13, 7], [13, 20], [20, 7], [21, 12], [21, 45], [22, 8], [22, 11], [23, 10], [24, 9], [36, 32], [37, 31], [38, 30], [39, 29], [40, 28], [41, 27], [42, 26], [43, 25], [44, 9], [45, 21], [54, 17], [55, 18], [57, 53], [60, 52], [61, 51], [62, 50], [63, 49], [64, 48], [65, 6], [66, 5], [67, 4], [68, 3], [69, 2], [70, 1], [71, 0]]
+GGGGAAUUAGCUCAAAUGGUAGAGCGCUCGCUUAGCAUGCGAGAGGUAGCGGGAUUGAUGCCCGCAUUCUCCA
+Base pairs: [[0, 71], [1, 70], [2, 69], [3, 68], [4, 67], [5, 66], [6, 65], [7, 13], [7, 14], [7, 20], [8, 22], [9, 24], [9, 44], [10, 23], [11, 22], [12, 21], [13, 7], [13, 20], [14, 7], [20, 7], [20, 13], [21, 12], [21, 45], [22, 8], [22, 11], [23, 10], [24, 9], [25, 43], [26, 42], [27, 41], [28, 40], [29, 39], [30, 38], [31, 37], [36, 32], [37, 31], [38, 30], [39, 29], [40, 28], [41, 27], [42, 26], [43, 25], [44, 9], [45, 21], [48, 64], [49, 63], [50, 62], [51, 61], [52, 60], [60, 52], [61, 51], [62, 50], [63, 49], [64, 48], [65, 6], [66, 5], [67, 4], [68, 3], [69, 2], [70, 1], [71, 0]]
+Predicted Structure: (((((((((((.().......))))((((((()...()))))))....(((((.......)))))))))))).
+NOTE: 36 pseudoknots and/or multiplets present in predicted structure excluded from dot-bracket notation: [[7, 14], [7, 20], [9, 44], [11, 22], [13, 7], [13, 20], [14, 7], [20, 7], [20, 13], [21, 12], [21, 45], [22, 8], [22, 11], [23, 10], [24, 9], [37, 31], [38, 30], [39, 29], [40, 28], [41, 27], [42, 26], [43, 25], [44, 9], [45, 21], [60, 52], [61, 51], [62, 50], [63, 49], [64, 48], [65, 6], [66, 5], [67, 4], [68, 3], [69, 2], [70, 1], [71, 0]]
4 changes: 4 additions & 0 deletions tools/rna_tools/rnaformer/test-data/rna_2d_pred_out.txt
@@ -0,0 +1,4 @@
+GCCCGCAUGGUGAAAUCGGUAAACACAUCGCACUAAUGCGCCGCCUCUGGCUUGCCGGUUCAAGUCCGGCUGCGGGCACCA
+Base pairs: [[0, 76], [1, 75], [2, 74], [3, 73], [4, 72], [5, 71], [6, 70], [7, 13], [7, 21], [8, 23], [9, 25], [10, 24], [11, 23], [12, 22], [13, 7], [17, 59], [18, 60], [21, 7], [21, 13], [22, 12], [23, 8], [23, 11], [24, 10], [25, 9], [39, 30], [40, 29], [42, 50], [43, 49], [44, 48], [48, 44], [49, 43], [50, 42], [53, 69], [54, 68], [55, 67], [56, 66], [57, 65], [58, 62], [59, 17], [60, 18], [62, 58], [65, 57], [66, 56], [67, 55], [68, 54], [69, 53], [70, 6], [71, 5], [72, 4], [73, 3], [74, 2], [75, 1], [76, 0]]
+Predicted Structure: (((((((((((.()...((...))))...))........((.(((...)))..(((((()).)..))))))))))))....
+NOTE: 28 pseudoknots and/or multiplets present in predicted structure excluded from dot-bracket notation: [[7, 21], [11, 23], [13, 7], [21, 7], [21, 13], [22, 12], [23, 8], [23, 11], [24, 10], [25, 9], [48, 44], [49, 43], [50, 42], [59, 17], [60, 18], [62, 58], [65, 57], [66, 56], [67, 55], [68, 54], [69, 53], [70, 6], [71, 5], [72, 4], [73, 3], [74, 2], [75, 1], [76, 0]]
6 changes: 4 additions & 2 deletions tools/rna_tools/rnaformer/test-data/rna_2d_pred_text.txt
@@ -1,2 +1,4 @@
-Pairing index 1: [0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 12, 13, 17, 18, 21, 21, 22, 23, 23, 24, 25, 39, 40, 42, 43, 44, 48, 49, 50, 53, 54, 55, 56, 57, 58, 59, 60, 62, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76]
-Pairing index 2: [76, 75, 74, 73, 72, 71, 70, 13, 21, 23, 25, 24, 23, 22, 7, 59, 60, 7, 13, 12, 8, 11, 10, 9, 30, 29, 50, 49, 48, 44, 43, 42, 69, 68, 67, 66, 65, 62, 17, 18, 58, 57, 56, 55, 54, 53, 6, 5, 4, 3, 2, 1, 0]
+GCCCGCAUGGUGAAAUCGGUAAACACAUCGCACUAAUGCGCCGCCUCUGGCUUGCCGGUUCAAGUCCGGCUGCGGGCACCA
+Base pairs: [[0, 76], [1, 75], [2, 74], [3, 73], [4, 72], [5, 71], [6, 70], [7, 13], [7, 21], [8, 23], [9, 25], [10, 24], [11, 23], [12, 22], [13, 7], [17, 59], [18, 60], [21, 7], [21, 13], [22, 12], [23, 8], [23, 11], [24, 10], [25, 9], [39, 30], [40, 29], [42, 50], [43, 49], [44, 48], [48, 44], [49, 43], [50, 42], [53, 69], [54, 68], [55, 67], [56, 66], [57, 65], [58, 62], [59, 17], [60, 18], [62, 58], [65, 57], [66, 56], [67, 55], [68, 54], [69, 53], [70, 6], [71, 5], [72, 4], [73, 3], [74, 2], [75, 1], [76, 0]]
+Predicted Structure: (((((((((((.()...((...))))...))........((.(((...)))..(((((()).)..))))))))))))....
+NOTE: 28 pseudoknots and/or multiplets present in predicted structure excluded from dot-bracket notation: [[7, 21], [11, 23], [13, 7], [21, 7], [21, 13], [22, 12], [23, 8], [23, 11], [24, 10], [25, 9], [48, 44], [49, 43], [50, 42], [59, 17], [60, 18], [62, 58], [65, 57], [66, 56], [67, 55], [68, 54], [69, 53], [70, 6], [71, 5], [72, 4], [73, 3], [74, 2], [75, 1], [76, 0]]
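Note on reading the new output format: each prediction is written as a sequence line, a "Base pairs:" line, a "Predicted Structure:" line, and an optional "NOTE:" line. A minimal parsing sketch follows, assuming the layout is exactly as in the test-data files above. The function name `parse_rnaformer_output` and the record keys are hypothetical and not part of the tool.

```python
import ast


def parse_rnaformer_output(text):
    """Parse the tool's text output into a list of per-sequence records.

    Hypothetical helper; assumes every record starts with its sequence line.
    """
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Base pairs:"):
            # The value after the first colon is a Python-style list literal.
            records[-1]["pairs"] = ast.literal_eval(line.split(":", 1)[1].strip())
        elif line.startswith("Predicted Structure:"):
            records[-1]["structure"] = line.split(":", 1)[1].strip()
        elif line.startswith("NOTE:"):
            # The excluded pairs are the list literal after the last ": ".
            records[-1]["excluded"] = ast.literal_eval(line.rsplit(": ", 1)[1])
        else:
            # Any other non-empty line is taken to be an input sequence.
            records.append(
                {"sequence": line, "pairs": [], "structure": "", "excluded": []}
            )
    return records


# Toy text in the same layout (made up, not real tool output).
toy_output = (
    "GCAUGC\n"
    "Base pairs: [[0, 5], [1, 4], [5, 0], [4, 1]]\n"
    "Predicted Structure: ((..))\n"
    "NOTE: 2 pseudoknots and/or multiplets present in predicted structure "
    "excluded from dot-bracket notation: [[5, 0], [4, 1]]\n"
)
for record in parse_rnaformer_output(toy_output):
    print(record["sequence"], record["structure"], len(record["pairs"]))
```

`ast.literal_eval` is used instead of `eval` so that only list and number literals are accepted. A file that does not begin with a sequence line would raise an `IndexError` in this sketch.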
