-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sub-path support #29
Comments
There's probably a method (that could have a default implementation) or flag needed in the above to walk up all the way to the "root" source path and/or compute the equivalent to rGFA's |
In this formulation, I think we definitely need a set of functions which operate on either the path itself if not a subpath, or the parent path if there is one. When we get the path name and offset for output (or read them in as input), we want to automatically articulate things in terms of the base path, and interpret them in terms of the subpaths we actually happen to have available. Something like Given that, I'm not sure that we don't want a handle instead of just a string to identify root paths that have subpaths but are not themselves present in the graph. I think the Right Way to structure things might be to make all actual paths into subpaths, and to separate the root path namespace from the subpath namespace. You can no longer have subpaths of subpaths; they just become possibly overlapping subpaths of the same ultimate root path. You can have a subpath called "chr1" with an offset of 0 that runs the length of chr1, but it is still a subpath and not the actual "chr1" root path, and the only thing that makes it better than some other subpath of the actual "chr1" root path is that it is probably longer. Subpaths would then be very firmly views and not actual paths, so updating one would update all other overlapping subpaths. A design like this would make it harder to accidentally ignore the existence of subpaths. |
Good points. I like Personally, I'm mostly interested in getting global coordinates out of subgraphs, for which the above two methods are sufficient. In addition to vg chunk, this allows for things like extracting an MHC graph from a genome graph, and focusing on it transparently using real chr6 coordinates. I think these methods would also be sufficient for a fairly simple 2-pass subgraph merging algorithm. The interface where we can iterate and query root paths transparently via a collection of subgraphs sounds cool, but seems like a far taller order implementation wise. What would the use case be for it? |
I guess on my MHC example, you may want to go from a root path position to a step in the subgraph which would indeed require some kind of root path handle to do consistently. |
Can we do things The Right Way while being backward compatible with the existing interface? Perhaps if we keep everything as is and assume the current path interface gives handles to root paths. Then behind the scenes the implementation can manage subpaths if they exist. An additional subpath interface could be used to map root-path steps to their actual subpaths and detect discontinuities during iteration, examine the subpaths themselves etc. |
I'm not quite sure what you mean here; I think you meant to say "subpaths" and not "subgraphs"? The use case for being able to query the root path name and coordinates from a handle to a subpath is for VCF output when calling on a subgraph. The use case for being able to see the root path through transparent subpath views is any time we want to modify a path, because we want the modification to affect all the subpaths to maintain consistency.
I don't think we can have the existing interface work for root paths, because the existing iteration interface has no way to signal a discontinuity. We don't want to feed a graph with only parts of a root path into some existing algorithm and have it ignore the discontinuities between the actually-provided subpath parts of the path, without intending to. I think we need a new interface for iterating along the represented parts of a root path, or through the subpaths of a root path, that doesn't let the iteratee ignore the discontinuities accidentally. |
One comment about the suggestions above: if we're going to do efficient range queries we'll also need to build a BBT index over subpath segments. I don't think that's horrible, tbh, but it is a design consideration. I think we also need to extend the
In the first option, there doesn't seem to be any simple mechanism for ensuring consistency between the subpath and the parent path. For example, what if the offset does not fall at a node boundary? Where do the steps come from, and how would we ensure that they match the parent? In the second option, the only way to calculate the offset is by traversing the entire prefix of the path. That's not difficult to implement, but it's not very efficient. Any suggestions for improving the mutable interface? |
What's a BBT index?
I think that we should add a source path position and source path name
accessor for paths. We should add functions get the position in a given
path relative to its source path. I don't think there is any need to make a
new concept (subpath) that's different than a regular path. This just lets
us say who the parent of any given path is.
In serialization, we'd just add some flags to the path line to hold onto
these values.
If we allow for multiple source paths and positions then this scheme can be
used to compress multiple paths into one. I'm not sure how useful this is
though.
…On Wed, Aug 21, 2019 at 8:00 PM Jordan Eizenga ***@***.***> wrote:
One comment about the suggestions above: if we're going to do efficient
range queries we'll also need to build a BBT index over subpath segments. I
don't think that's horrible, tbh, but it is a design consideration.
I think we also need to extend the MutablePathHandleGraph interface to
have some methods for creating subpaths. The implementation here actually
seems somewhat more challenging to me. Off the top of my head, I see two
options here: one offset-based, one step-based.
path_handle_t create_subpath(const string& parent_name, const size_t& offset, const size_t& length);
path_handle_t create_subpath(const PathHandleGraph& from, const step_handle_t& begin, const step_handle_t& end);
In the first option, there doesn't seem to be any simple mechanism for
ensuring consistency between the subpath and the parent path. For example,
what if the offset does not fall at a node boundary? Where do the steps
come from, and how would we ensure that they match the parent?
In the second option, the only way to calculate the offset is by
traversing the entire prefix of the path. That's not difficult to
implement, but it's not very efficient.
Any suggestions for improving the mutable interface?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#29?email_source=notifications&email_token=AABDQENJBJPJY6RCKEP5MF3QFV7ENA5CNFSM4INWVMVKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD42TGXI#issuecomment-523580253>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AABDQEJVDRJFCU26WTSCSQLQFV7ENANCNFSM4INWVMVA>
.
|
BBT = balanced binary tree |
Subpath information is useful when working with subgraphs. It will also help with rGFA compatibility (SN/SO fields).
The following ideas emerged from the last meeting:
The subpath interface would be added to
PathHandleGraph
. It could look something likeI'm not sure if, implementation-wise, it's better to combine the above? In theory, they could be a single method like
Similar story for querying. We can provide default implementations of these using the above methods:
I think implementing this should be pretty straightforward, but making sure that existing algorithms that modify paths use it properly may take some doing.
@ekg @jeizenga @adamnovak thoughts?
The text was updated successfully, but these errors were encountered: