Parser is sometimes wrong when using CANONICALIZE_FIELD_NAMES #213

Closed
ichernev opened this issue Aug 21, 2015 · 7 comments

@ichernev

If you have a big dictionary (150 000 keys), the parser will randomly swap one of the field names with another. We traced it down to CANONICALIZE_FIELD_NAMES (if disabled, it doesn't happen).

Out of 1000 parsings of a file with 150 000 keys, around 50 (5 %) will have a single key swapped. I guess if you try with more keys it will fail more often.

Our keys are randomly generated /[0-9A-Za-z]{17}/
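
For anyone trying to reproduce this, here is a minimal sketch of how such input could be generated with the plain JDK. The class and method names (`RandomKeyJson`, `randomKeys`, `toJsonObject`) and the fixed seed are assumptions made for illustration; they are not taken from the original report, and the values in the generated object are arbitrary.

```java
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper: builds a flat JSON object whose keys match /[0-9A-Za-z]{17}/.
class RandomKeyJson {
    private static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // One random 17-character alphanumeric key.
    static String randomKey(Random rnd) {
        StringBuilder sb = new StringBuilder(17);
        for (int i = 0; i < 17; i++) {
            sb.append(ALPHABET.charAt(rnd.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // `count` distinct keys; the seed makes a run repeatable.
    static Set<String> randomKeys(int count, long seed) {
        Random rnd = new Random(seed);
        Set<String> keys = new LinkedHashSet<>();
        while (keys.size() < count) {
            keys.add(randomKey(rnd));
        }
        return keys;
    }

    // Serializes the keys as {"key":0,"key":1,...}; alphanumeric keys need no escaping.
    static String toJsonObject(Set<String> keys) {
        StringBuilder sb = new StringBuilder("{");
        int i = 0;
        for (String k : keys) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"').append(k).append("\":").append(i++);
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Set<String> keys = randomKeys(150_000, 42L);
        String json = toJsonObject(keys);
        System.out.println(keys.size() + " keys, " + json.length() + " chars of JSON");
    }
}
```

Feeding the resulting document to the parser repeatedly and comparing each reported field name against the generated key set should surface a swapped key, if the bug is present.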

@cowtowncoder
Member

Ok. Which version is this with? Is the input in the form of bytes (InputStream, File, byte[]) or chars (Reader, char[])? Also, how does this show itself -- does the parser report an incorrect field name?

@ichernev
Author

Version 2.6.1 (latest). I was using the File version of the parser. The problem is in ByteQuadsCanonicalizer: one equality check is missing (a colleague of mine found it, I'll ask for details).

So the canonicalizer is given one name in bytes, but returns another, because of a hash collision and a missing safety check.
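
To illustrate the failure mode described here, below is a toy sketch, not the actual Jackson code. `ToyCanonicalizer`, its `verifyBytes` flag, and the one-entry-per-hash table are all hypothetical simplifications. The point is only that a lookup which trusts the hash alone hands back whichever name was stored first under that hash.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, deliberately simplistic name table keyed by hash only.
class ToyCanonicalizer {
    private static final class Entry {
        final byte[] bytes;
        final String name;
        Entry(byte[] bytes, String name) {
            this.bytes = bytes;
            this.name = name;
        }
    }

    private final Map<Integer, Entry> table = new HashMap<>();
    private final boolean verifyBytes; // false reproduces the "missing safety check"

    ToyCanonicalizer(boolean verifyBytes) {
        this.verifyBytes = verifyBytes;
    }

    String canonicalize(byte[] raw) {
        int hash = Arrays.hashCode(raw);
        Entry e = table.get(hash);
        // Without the equality check, any entry with the same hash is accepted.
        if (e != null && (!verifyBytes || Arrays.equals(e.bytes, raw))) {
            return e.name;
        }
        String name = new String(raw, StandardCharsets.UTF_8);
        if (e == null) {
            table.put(hash, new Entry(raw.clone(), name));
        }
        return name;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" have the same Arrays.hashCode (and String.hashCode).
        byte[] a = "Aa".getBytes(StandardCharsets.UTF_8);
        byte[] b = "BB".getBytes(StandardCharsets.UTF_8);

        ToyCanonicalizer unsafe = new ToyCanonicalizer(false);
        unsafe.canonicalize(a);
        System.out.println(unsafe.canonicalize(b)); // prints "Aa": wrong name returned

        ToyCanonicalizer safe = new ToyCanonicalizer(true);
        safe.canonicalize(a);
        System.out.println(safe.canonicalize(b));   // prints "BB"
    }
}
```

With 150 000 random keys, occasional collisions in a 32-bit hash are expected, which would fit a symptom that shows up only in a small fraction of runs.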

@ichernev
Author

@cowtowncoder
Member

Quick note: I am on vacation, and returning in one week. So while this is a critical issue, there will be no progress until then, but we'll get it fixed as soon as I get back next week.

I was also wondering if this might be related to

https://github.com/FasterXML/jackson-dataformat-smile/issues/26

given that both parsers (and CBOR as well) share the new symbol table implementation for 2.6.
So it would seem likely that problems could manifest themselves in multiple places as well.

@ichernev
Author

ichernev commented Sep 1, 2015

@cowtowncoder we worked around it by not using the Quad class, so I'm not in a hurry to get a fix. It is kind of critical (as you mentioned) though :)

About the other problem -- it might be related, but I haven't looked deeper. In our data set we only saw one key being replaced by another; the same bug may cause an array out-of-bounds error in another module.

@vzx

vzx commented Sep 7, 2015

@cowtowncoder I'm experiencing similar issues related to the CANONICALIZE_FIELD_NAMES feature as well.

I have a file with dictionaries of about 500-800 elements each. When parsing this file, a few elements are sometimes missing, but most of the time they are not. When I disable CANONICALIZE_FIELD_NAMES, it works fine.

@cowtowncoder
Member

I suspect this -- FasterXML/jackson-databind#916 -- is the same issue.
Since that one contains a test case, I'll start with it.
