-
Notifications
You must be signed in to change notification settings - Fork 820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TINKERPOP-1274: GraphSON 2.0. #351
Conversation
import java.io.IOException; | ||
import java.io.InputStream; | ||
import java.io.OutputStream; | ||
import java.io.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry @newkek but your IDE introduced wildcards to our imports which isn't our code style. could you please fix those?
Note that travis isn't happy - tests are failing:
Also - some documentation odds and ends:
|
Ah I changed something quickly this morning and did not run again those tests. Will correct that. |
You might need to rebase - looks like there are conflicts on the branch - probably on CHANGELOG or something like that. |
Yes the conflicts were introduced by ce19704 (reference docs and Changelog) |
|
Re-conflicts with Changelog against master.. |
yeah - we tend to bump heads on master a little bit on the CHANGELOG - not a big deal - i can resolve that conflict when the time comes. |
e15bafb
to
e1b6dae
Compare
Hey guys, I've been super busy lately but I definitely plan on diving deep into this PR over the weekend. One quick remark though.
I can illustrate both of these points with the following JSON : In conclusion:
PS: I would also like to point out that this format |
I thought that java integer would be implied. I believe that if we wrote a java short for example it would type to a "Short" then you would know what kind of number it is. We purposely went away from the fully qualified class name which really doesn't mean much to non-JVM languages in favor of the more brief and less scary simple name - I thought we had decided that a smaller byte size for the network outweighed the downside of being less specific about the type. no?
I seem to remember that we discussed that before but don't remember the outcome - @newkek do you remember what was said? |
Yes with the current PR serializing an Integer will result in no type added. Serializing a Long or a Short will result in an explicit type added. To be very explicit, in JSON everything is Doubles and Longs, it does that because it's the bigger container format that contains the others less precise ones. Jackson however, considers that when an Integer is to be serialized, there is no need for an explicit type because the precision will be kept since for JSON it's in a Long. When it comes to serializing a Long, still no precision loss but the format explicitly indicates to the Deserializer that what has been serialized initially was a Long. So everything is as defined as if the Integer was typed, because by default the reader assumes everything's an Integer, and if not there will be a type to specify what it is. However we have save a large payload size by not typing the integer. Same concept for Float VS Double, for JSON everything is a Double, if a value is a Float, it will be typed, if not, by default it's a Float. So as I said in the description I think the outcome of the mail conversation was to not type Simple values. Mostly because we're not sure if this is going to be useful or not. However, adding those types in the future, for Simple values, can be very easily done and I've left detailed comments in the
As I see it for how I implemented the TypeDeserializer, it acts as a meta deserializer that will read the raw text JSON sequentially, so there's no chance there can be a mixup in the order, it does not deserialize the whole structure and creates a |
According to RFC 7159, all numerics (i.e., both integral and decimal) in JSON are of a single type, Numeric, with implementation-specific ranges. In any case would "assuming Integer" result in payloads that don't round-trip to the same types? For example, if I serialize an Integer without a type tag, will subsequent deserializations (via, say, Jackson) return a Long? |
If i understand right: Serializing integer without a type tag would result in integer. You would have to specify "Long" as the type for it to be interpreted as a Long in java. This is generally just a problem for numerics so the conversion is:
@newkek is that an appropriate description of what's happening in conversion? |
@spmallette yes, maybe some code will help explain.
Almost, by default decimals are Double, not Floats, so Double are not typed, Floats get a type. |
imo, I think that if we document GraphSON this way, there should not be too much confusion as to how the type system works. |
I hope you guys figured it all out yesterday.I didn't follow all the comments, but started a VOTE: +1 |
Some suggestions:
|
Changing "@Class" to "@type" is ok by me if others like it. I'm not tied to one or the other - "@type" does seem a little better to me the more i think on it, but I'll let @newkek (or others) weigh in on it.
Not sure about BSON typing as a solution. Ultimately we want to know if something is a
imo, we're pretty deep into this approach having discussed it over multiple weeks in the community. making a big switch like that is probably something to reserve for the future especially since @newkek put a fair bit of effort into this work at this point and it delivers what took a while to get agreement on. If however BSON (or some other format) could be proven a more efficient network serialization format that is truly programming language agnostic, with wide support and consistently performant parsers in the major languages we support (which is what had doomed MsgPack some time back), then I think we could consider that as an additional IO format. @robertdale if you have ideas there, it would be nice to hear them. Please consider sending a message on the dev mailing list if you do.
That would be an interesting option, however would mostly be good for network serialization and not so much for file storage. So far we haven't written a network only IO package, though we have written a file storage only one with GraphML. I think that we could consider a network serialization one only since dependence on Gremlin Server for non-JVM languages is going to be something we need to support in the face of GLVs. Thanks for your thoughts @robertdale |
Yes the Concerning BSON (or MsgPack) it probably is indeed a more efficient serialization solution, however the goal of this PR is to optimize and improve JSON. Implementing Graph-BSON is probably another topic adjacent to this one, on which btw I'd be happy to collaborate as well. |
Just for reference: https://groups.google.com/d/msg/gremlin-users/R7qQbJXWcH8/wK9ZuWADAwAJ |
@newkek @spmallette Sorry, my context was only this thread. I agree with you on all accounts. |
Just pushed the change for |
Ok I thought The main point here was robustness in typing for non-java languages. Hence why I suggested even typing things like Int and using java classes. Honestly depending on how you compiled your PHP your basic JSON int will be converted to
You can't guaranty that the order would be maintained in some languages hence my previous comment. These languages will parse the JSON into |
Well, I agree we cannot prevent some parsers and drivers to read the whole JSON content and check the created object for the types only afterward. I would not recommend it though, because it is not great for memory consumption, but it is maybe something we have the opportunity to avoid only thanks to Jackson and some libraries may not offer that capability. So I guess I'll switch to the |
This change is largely to help the consumption of GraphSON by non-JVM languages. Since I'm largely familiar with the behaviors of java based parsers and such, i'm not a good judge of that so i have to rely on @PommeVerte and others in this area. If the switch to Map is helpful then I think we should go that route. @newkek I would not be hasty in the change though. Perhaps we give it a day or so to think about to be sure that everyone is happy with that approach before you make the change and commit to it. |
So I've caught up on the discussion and I'll offer some more food for thought since I haven't seen any other ideas. Embedding metadata is neither easy nor fun (not for me anyway). For any serious integration type work it's always best to have a well-defined schema up-front. On types:
Convenience is not the same as using Java types. By "not using java types", we mean:
Defining primitives, common types:
So if your Java implementation conveniently shares the same name as the type, then that's wonderful. But if you are to be truly language-agnostic, then at some point the types must be known ahead of time in order to be consumed. For instance, how can my X parser know how to handle a Titan GeoPoint if it's all dynamic? It can't. It must be able to handle this type ahead of time. And I can't imagine someone would want to manually read a graphson file to discover all the types that must be handled. Maybe I'm getting out of scope as this goes beyond language and steps into being database agnostic. @newkek, please correct me if I'm wrong, but it doesn't look like the code does any dynamic serializing. It looks like all types are registered anyway. So I'll argue again if you know your types ahead of time, then you may as well have a schema. But let's continue with embedded metadata... In JSON, the only unambiguous types are
To avoid confusion on all other types, including numbers, they should be typed. Thus they are objects (and not lists of things). The metadata can be at the same level as the object and alleviates these concerns: @newkek " a List in which the first element is a Map in which the first entry's key" and @PommeVerte "can be a pain in systems that do not necessarily order lists". Metadata can be differentiated from member fields by a prefix (e.g. '@'). Primitive types (or objects) having only a single value would have a "value" key which maps to the actual value. [
{
"@type":"Vertex",
"id":{
"@type":"int64",
"value":12345
},
"label":"person",
"properties":{
"@type":"VertexProperty",
"skill":{
"id":{ "@type":"int64",
"value":8723
},
"@type":"int32",
"value":5
},
"secrets":[
{ "id":{
"@type":"int64",
"value":8723
},
"@type":"uuid",
"value":"1de7bedf-f9ba-4e94-bde9-f1be28bef239"
},
{ "id":{
"@type":"int64",
"value":8724
},
"@type":"uuid",
"value":"34523adf-f9ba-4e94-bde9-f2345bcd3f45"
}
]
},
"inE":[
{ "@type":"Edge",
"label":"knows",
"id":{
"@type":"int64",
"value":987234
},
"properties":{ },
"outV":[ { } ]
}
]
}
] I wouldn't concern myself with the additional payload size for metadata. I wouldn't sacrifice conciseness for size. One could always compress the file if size is a concern. Also, the reader/writer could be easily enhanced to support zip. I would take the pragmatic approach and address it when it's no longer working for people. Anyway, maybe this is all GraphSON 3.0 stuffs. HTH. |
@robertdale the format you suggest would lead to the same inconsistencies as in GraphSON 1.0. Since the type is at the same level than the data itself, whether the container is an Array or an Object, the type format would not be the same. I just pushed a change in the format that is the one @PommeVerte suggested, which gives a consistent format, without the concern of unordered Lists (for reference the new format is
The Also, types for serialization are known ahead of time (which may contradict what I just said), it's the Serializers that call the Something I'd like to mention as well is that for now, values with no Serializers aren't typed. For unknown values, the GraphSON mapper calls the |
I think we should try to merge this PR at this point (though we need one more +1 - @PommeVerte do you have a moment to do a final review?). There are a few little tweaks here and there, but I can quickly do those after we merge. I configured Gremlin Server to use 2.0 and it works nicely with REST:
I expect to do more tests around the server this week. All tests pass with VOTE +1 |
Note that I've started a new thread on the dev mailing list related to a new IO format with GLVs and Gremlin Server i mind: If any of you have thoughts on the matter, please feel free to join the discussion there. |
ok I caught up and checked code. I think what @robertdale suggests on the typing end would be really nice ( I VOTE +1 on this. @spmallette if you could just add a mention in the documentation that the basic types are based off of the JVM types that would be a nice bonus to have. Nice work @newkek, some nice effort went into this. |
Adding my email on dev@ to here. I’m not following this PR too closely so what I might be saying is a already known/argued against/etc.
For me, my interests with this work is all about a language agnostic way of sending Gremlin traversal bytecode between different languages. This work is exactly what I am looking for. |
While we have 3 +1s, I don't want to make a GraphSON 2.0 that immediately needs to be revised to a 3.0. I don't see clear consensus on what we do with types/schema and I don't think we should hold up release any further, so I think GraphSON 2.0 should be delayed for a future version when we can get everything clear. I think the core of the work done here so far however is good. I've merged this PR to here: https://github.com/apache/tinkerpop/tree/graphson-2.0 for more collaboration and discussion. |
f7086f0
to
4537362
Compare
Closing this in favor of #386. |
https://issues.apache.org/jira/browse/TINKERPOP-1274
Summary of the changes :
Implementation of a format for value types serialization which is uniform and not Java centric. As a reminder the new format is as follows :
value
[{"@type":"typeName"}, value]
The content of
value
can be either a simple value, or a more complex structure. The type prefix allows to call the right Deserializer to deserialize thevalue
whatever its content is.The default's GraphSON's 2.0 format will include types for every
value
that is not a JSON native type (String/int/double/null/boolean/Map/Collection
) - calledPARTIAL_TYPES
. This allows to significantly reduce the size of the JSON payload where those types aren't needed by the Jackson library.GraphSON serialization without types is not affected. To enable it, need to use
NO_TYPES
.Quick walkthrough for the review
There are new components involved in the ser/de process that are extended from the Jackson library :
Class
->typeID
, and from atypeID
->Class
.Most of the serializers already existing for Graph had to be modified, because most of them were manually hardcoding types without calling the TypeSerializer and hence, those were not respecting the format. Now all Serializers respect the format because they call the TypeSerializer given in parameter. I followed the frame put in place by @spmallette to implement those new serializers (prefixed with
V2d0
) without breaking existing clients code.In TypeDeserializer,
baseType
represents the class given in parameter by the user for the ser/de. If the user callsmapper.readValueAsString(jsonString, UUID.class)
thebaseType
in TypeDeserializer will be a JavaType (which is the Jackson's custom class for Java classes) that represents a UUID class. The wildcard in our deserialization mechanism isObject.class
.The deserialization path is as follow :
String/int/double/null/boolean
).baseType
is the same than what was read in the payload (only ifbaseType
is not the wildcard), and call thedeserialize()
method of the JsonDeserializer registered for that type.Some results
GraphSON 2.0 shows a significant reduction of the payload's size for typed serialization. And the consequence in performance is that since there's less to process, the ser/de is faster. Results show a reduction of at least 50% in the payload's size and evolving linearly (the bigger the payload the bigger the difference) :
Tests
Tests cover the same functionalities covered by GraphSON 1.0 typed, plus additional features brought by GraphSON 2.0.
Tradeoffs
TypeIdResolver
has to have an index that it can refer to, and that has been correctly filled. The TinkerpopJackosnModule class will handle providing those new indexes to the TypeIdResolver when custom Deserializers are added, but without doing that we have no way to properly convert a type. We then require that a user instead of extending Jackson's SimpleModule when writing a module, extends TinkerpopJacksonModule which is an extension the Jackson SimpleModule. We can discuss whether this is a showstopper or if it is acceptable. Some solutions to prevent that, could be using a external library that allows searching a Class by its simple name, or switch to writing typeIDs with the full Class's canonical name. I don't have a strong opinion but it seems like the current solution is simple and good enough.value
that has the exact same format as the type format. I.e. a List in which the first element is a Map in which the first entry's key isGraphSONTokens.VALUETYPE
. We may want to highlight that somewhere.Extra facts
FULL_TYPES
has not been implemented. FULL_TYPES means writing types for JSON natively supported values. However all the mechanism is put in place to implement it easily if we deem it necessary. I don't see the necessity for this now.