Unexpected behavior on converting between gfa and xg #4

Open
6br opened this issue Jun 30, 2019 · 6 comments

6br commented Jun 30, 2019

I tried bin/xg built from commit 6871a1e011954483e01ace8a517a78ba1a57b7d9.

The input file test.gfa is as follows.

H       VN:Z:1.0
S       1       CAAATAAG
S       2       A
S       3       G
S       4       T
S       5       C
S       6       TTG
S       7       A
S       8       G
S       9       AAATTTTCTGGAGTTCTAT
S       10      A
S       11      T
S       12      ATAT
S       13      A
S       14      T
S       15      CCAACTCTCTG
L       1       +       2       +       0M
L       1       +       3       +       0M
L       2       +       4       +       0M
L       2       +       5       +       0M
L       3       +       4       +       0M
L       3       +       5       +       0M
L       4       +       6       +       0M
L       5       +       6       +       0M
L       6       +       7       +       0M
L       6       +       8       +       0M
L       7       +       9       +       0M
L       8       +       9       +       0M
L       9       +       10      +       0M
L       9       +       11      +       0M
L       10      +       12      +       0M
L       11      +       12      +       0M
L       12      +       13      +       0M
L       12      +       14      +       0M
L       13      +       15      +       0M
L       14      +       15      +       0M
P       x       1+,3+,5+,6+,8+,9+,11+,12+,14+,15+       *,*,*,*,*,*,*,*,*
P       y       1+,2+,5+,6+,8+,9+,11+,12+,14+,15+       *,*,*,*,*,*,*,*,*
P       z       1+,2+,5+,6+,7+,9+,11+,12+,14+,15+       *,*,*,*,*,*,*,*,*

I ran the following commands in a shell.

$ bin/xg -o test.xg -g test.gfa
$ bin/xg -i test.xg --gfa-out

In the output, the final node 15+ on path z was truncated.

P       x       1+,3+,5+,6+,8+,9+,11+,12+,14+,15+       8M,1M,1M,3M,1M,19M,1M,4M,1M,11M
P       y       1+,2+,5+,6+,8+,9+,11+,12+,14+,15+       8M,1M,1M,3M,1M,19M,1M,4M,1M,11M
P       z       1+,2+,5+,6+,7+,9+,11+,12+,14+   8M,1M,1M,3M,1M,19M,1M,4M,1M

ekg (Member) commented Aug 21, 2019

Your input is incorrect. There are only 9 * elements, but 10 path elements.

It's annoying that we have to keep these two lists in sync. Maybe we can fix that in rGFA.

ekg closed this as completed Aug 21, 2019

6br (Author) commented Aug 22, 2019

I don't think it is incorrect, because it follows the GFA1 spec. According to the spec, the 4th column holds the overlaps between consecutive nodes on a path. As long as the path is linear, the number of overlaps is len(nodes) - 1, so it is natural that there are 9 elements for 10 nodes. The example at the end of https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md is similar.
I hope this kind of ambiguity can be resolved in rGFA.
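The counting argument above can be checked mechanically. The following is a minimal Python sketch, not part of xg, gfakluge, or pygfa; the function name check_path_line is hypothetical, and it assumes fields are separated by whitespace (tabs in a real GFA file) and that a lone "*" means the overlaps are unspecified.

```python
def check_path_line(line):
    """Inspect one GFA1 P line (hypothetical helper, not from any GFA library).

    Returns (path_name, n_segments, n_overlaps, consistent), where
    consistent means n_overlaps == n_segments - 1, the count expected
    for a linear path. A lone "*" is treated as "unspecified" and
    therefore consistent (assumption).
    """
    fields = line.split()  # assumption: whitespace-separated fields
    if len(fields) < 4 or fields[0] != "P":
        raise ValueError("not a GFA1 P line: %r" % line)
    name = fields[1]
    segments = fields[2].split(",")
    overlaps = fields[3]
    if overlaps == "*":
        return name, len(segments), 0, True
    n_overlaps = len(overlaps.split(","))
    return name, len(segments), n_overlaps, n_overlaps == len(segments) - 1


# Path x from the input file: 10 segments, 9 overlap entries.
input_x = "P x 1+,3+,5+,6+,8+,9+,11+,12+,14+,15+ *,*,*,*,*,*,*,*,*"
# Path z as written back by xg: 9 segments, 9 entries.
output_z = "P z 1+,2+,5+,6+,7+,9+,11+,12+,14+ 8M,1M,1M,3M,1M,19M,1M,4M,1M"

print(check_path_line(input_x))
print(check_path_line(output_z))
```

Under this reading, the input lines pass the check, while the lines written by xg carry one entry per segment (segment lengths rather than overlaps between neighbours) and do not.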

ekg (Member) commented Aug 22, 2019 via email

ekg reopened this Aug 22, 2019

ekg (Member) commented Aug 22, 2019

Thanks for pointing this out. I guess the overlaps are being stored in the path because they aren't determined based on the graph topology of an assembly graph.

This does mean that all our GFA P lines are broken. But the fact that we weren't using these fields for any purpose also indicates how useless they were for our applications. In graphs with paths, these overlap/cigar descriptions are hugely expensive. I would love to get rid of them or make them optional. Perhaps *,*,*... is the best we can do. It's a required field. But, what tools actually use it? As far as I know, only variation graph tools care about the paths.
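For illustration, a placeholder overlaps field of the *,*,*... kind can be derived from the segment list alone. This is a small Python sketch under stated assumptions: the helper name placeholder_overlaps is hypothetical, it emits one "*" per pair of adjacent segments (segments minus one, following the count discussed in this thread), and falling back to a single "*" for a one-segment path is an assumption.

```python
def placeholder_overlaps(segment_field):
    """Build a placeholder overlaps field for a GFA1 P line.

    segment_field is the comma-separated segment list, e.g. "1+,2+,5+".
    Emits one "*" per adjacent pair of segments. For a single-segment
    path there are no pairs, so a lone "*" is returned (assumption).
    """
    n_segments = len(segment_field.split(","))
    if n_segments <= 1:
        return "*"
    return ",".join(["*"] * (n_segments - 1))


segments_z = "1+,2+,5+,6+,7+,9+,11+,12+,14+,15+"
print("\t".join(["P", "z", segments_z, placeholder_overlaps(segments_z)]))
```

Applied to path z from the original report, this reproduces the nine-element field used in the input file.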

ekg (Member) commented Aug 22, 2019

That said, the current setup of the gfakluge parser used by xg should work for the correct format and correctly parses your example.

6br (Author) commented Aug 22, 2019

Thank you for considering my comment. We ran into this problem because https://github.com/graph-genome/vgbrowser currently uses pygfa to exchange data between the graph genome browser and xg, with GFA files as the intermediate format. Since pygfa raises errors on such discrepancies in GFA records, I feel that pygfa's restrictions are a little too strict for practical use cases. Therefore, I would appreciate it if we could replace our current implementation with direct communication with a lightweight xg server. If that goes well, we would no longer be affected by differences between GFA parsers.
