Recommend an encoding for binary data #35
Comments
Definitely. I prefer to use Base64 to encode binary in JSON and JSON5. Other APIs, like Node.js have opted to use arrays of numbers to serialize binary data into JSON. I'm interested in more discussion on the pros and cons of each, and whether there are other viable options as well. For the purpose of interoperability, it may be useful to give a recommendation in the spec based on our collective findings. |
That sounds crazy. 😄 Where did you come across that? |
JSON serialization of JSON.stringify(Buffer.from('foobar'))
{
  "type": "Buffer",
  "data": [
    102,
    111,
    111,
    98,
    97,
    114
  ]
}
Even stranger is the JSON serialization of JSON.stringify(Int8Array.from(Buffer.from('foobar')))
{
  "0": 102,
  "1": 111,
  "2": 111,
  "3": 98,
  "4": 97,
  "5": 114
}
It's quite interesting that JavaScript has the
|
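For comparison, here is a small Node.js sketch that serializes the same six bytes in the two ways described above and as a Base64 string. It relies only on the global `Buffer`; the variable names are illustrative and not part of any proposal in this thread.

```javascript
// Same six bytes, three JSON representations (Node.js, global Buffer).
const bytes = Buffer.from('foobar');

// Buffer defines toJSON(), so JSON.stringify emits { type, data: [...] }.
const asBufferJson = JSON.stringify(bytes);

// Typed arrays have no toJSON(), so they serialize like plain objects
// keyed by index.
const asTypedArrayJson = JSON.stringify(Int8Array.from(bytes));

// Base64 keeps the payload in a single JSON string.
const asBase64Json = JSON.stringify(bytes.toString('base64'));

// Both Buffer-derived forms round-trip back to the original bytes.
const fromBufferJson = Buffer.from(JSON.parse(asBufferJson).data);
const fromBase64Json = Buffer.from(JSON.parse(asBase64Json), 'base64');

console.log(asBufferJson);
console.log(asTypedArrayJson);
console.log(asBase64Json);
```

Note that the typed-array form loses its "array-ness" entirely, so a consumer would have to rebuild the byte order from the object keys.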
Probably because Things like Also, if you had object properties containing buffers, these would get serialized as strings - which could be misleading. I mean, there's nothing about a string that safely tells you whether that string is base64 or just some other string. This makes me wonder if we should recommend something that could be identified as binary? I'm thinking Data URLs. These are pretty universal by now as well - and it's arguably both safer, more useful, and more human-readable for binary data to be represented as a data URI than a bare base64 string. Compare this: {"pic": "R0lGODdhMAAwAPAAAAAAAP///ywAAAAAMAAwAAAC8IyPqcvt3wCcDkiLc7C0qwyGHhSWpjQu5yqmCYsapyuvUUlvONmOZtfzgFzByTB10QgxOR0TqBQejhRNzOfkVJ+5YiUqrXF5Y5lKh/DeuNcP5yLWGsEbtLiOSpa/TPg7JpJHxyendzWTBfX0cxOnKPjgBzi4diinWGdkF8kjdfnycQZXZeYGejmJlZeGl9i2icVqaNVailT6F5iJ90m6mvuTS4OK05M0vDk0Q4XUtwvKOzrcd3iq9uisF81M1OIcR7lEewwcLp7tuNNkM3uNna3F2JQFo97Vriy/Xl4/f1cf5VWzXyym7PHhhx4dbgYKAAA7"} With this: {"pic": ""} The latter ticks a lot of boxes: ✅ It's clear (to machines + humans) that it's base64 encoded. (probably even safe enough to auto-decode in clients.) There is RFC 2397 providing a formal specification that we could refer to - although this looks a little outdated:
According to MDN docs, Opera 11 did have a 64KB limit - but all modern browsers support at least 256MB, so this doesn't seem relevant anymore. JSON itself has some practical size limitations either way, and you probably wouldn't/shouldn't embed hundreds of MB of data in JSON blobs, using base64, or anything else for that matter. Should the spec specify a size limit? Perhaps it could suggest external URLs as an alternative, pointing to larger resources for clients to download after parsing the JSON. The only downside I can see is that data URLs may be less well supported on the server side than plain base64. I'm sure every language has at least a package for this by now though. Under any circumstances, this would be a recommendation, right? Not a requirement. |
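On the server-side support question, a minimal Node.js sketch of what producing and consuming a Base64 data URL could look like without any package. The helper names `toDataUrl` and `parseDataUrl` are hypothetical, and the parser deliberately handles only the simple `data:<type>/<subtype>;base64,<payload>` shape; RFC 2397 also allows media type parameters and non-Base64 payloads, which this sketch rejects.

```javascript
// Hypothetical helper: wrap bytes in a Base64 data URL.
function toDataUrl(bytes, mediaType) {
  return `data:${mediaType};base64,${Buffer.from(bytes).toString('base64')}`;
}

// Hypothetical helper: returns { mediaType, bytes } or null when the value
// is not a Base64 data URL of the simple shape described above.
function parseDataUrl(value) {
  const match = /^data:([^;,]+);base64,([A-Za-z0-9+/]*={0,2})$/.exec(value);
  if (match === null) return null;
  return { mediaType: match[1], bytes: Buffer.from(match[2], 'base64') };
}

const url = toDataUrl(Buffer.from('foobar'), 'application/octet-stream');
const parsed = parseDataUrl(url);
const notADataUrl = parseDataUrl('Zm9vYmFy'); // bare Base64 is rejected

console.log(url);
console.log(parsed.mediaType, parsed.bytes.toString());
console.log(notADataUrl);
```

A real implementation would probably want a stricter media type check and a size limit before decoding.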
The
Using And, yes, additions to the spec would be interoperability recommendations, similar to the ones you find in RFC 8259. * A data URL with an unspecified MIME type implicitly has a MIME type of |
No, but that's already worth something in itself, I think. Having data with content-type is probably less common than having data with a known type. I don't think "some data has no type" is really an argument against having a type for everything else? And in that marginal case, it's a recommendation - you don't have to follow it if it doesn't make sense for your use-case.
No, but that same argument would work against a date format standard recommendation - if two applications have already negotiated that they're going to use the I think both of these recommendations would be useful - there are oodles of fun and interesting ways to encode both dates and binary data. Often, people will pick the one they know and happen to have close at hand - there's often no compelling reason to pick one format over another, so this would help with that choice. It would simplify things if projects aligned towards one way of encoding these types - opening up to MIME types via data URLs would provide a safe way to encode and embed a lot of data formats, both binary and text. That's just one guy's opinion of course. Would love to hear from other contributors. 🙂 |
I like the data URL as well... I would think that strings would be UTF8 encoded into UInt8Array first of type text/utf8 and buffer or uint8 array would be binary ... Binary going into uint8 array... Other typed arrays being javascript/TYPEarray |
Come to think of it...
Actually, I believe this does add something: an extra layer of validation and explicitness. Some base64 data is indistinguishable from text - that is, your app might expect base64, but somebody put a string in there that just happens to be valid base64, and decodes to some nonsense data, which triggers an obscure error further up the call stack, which could be very difficult to debug. So yeah, that little 13 character preamble does have the benefit of letting somebody explicitly indicate base64 data. Still, this would be a recommendation - you can deviate if it doesn't make sense for a given use case. |
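To make the ambiguity concrete, here is a hedged Node.js sketch. The word `test` is syntactically valid Base64 and decodes without error to three unrelated bytes, whereas requiring the 13-character `data:;base64,` preamble lets a consumer reject it up front. The names `looksLikeBase64` and `decodeTagged` are hypothetical helpers, not an existing API.

```javascript
const PREAMBLE = 'data:;base64,'; // the 13-character prefix discussed above

// Hypothetical check: is the string well-formed, padded Base64?
function looksLikeBase64(text) {
  return /^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$/.test(text);
}

// An ordinary word that happens to be valid Base64.
const plainWord = 'test';
const accidentalBytes = Buffer.from(plainWord, 'base64'); // no error raised

// Hypothetical decoder that insists on the explicit preamble.
function decodeTagged(value) {
  if (!value.startsWith(PREAMBLE)) {
    throw new TypeError('expected a Base64 data URL');
  }
  return Buffer.from(value.slice(PREAMBLE.length), 'base64');
}

console.log(looksLikeBase64(plainWord), accidentalBytes);
console.log(decodeTagged(PREAMBLE + 'Zm9vYmFy').toString());
```

The preamble does not prove the payload is meaningful, but it does turn "somebody put the wrong string here" into an immediate, local error.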
@tracker1 Strings don't necessarily need to be encoded as UTF-8 since JSON5 already has a string type, which is defined as a sequence of Unicode code points. JSON5 documents themselves are recommended to be encoded as UTF-8 however. If you want to store the original UTF-8 representation of text in a Base64 data URL, and let's say that text is the HTML string
Also, specifying |
@mindplay-dk So, there's a snag with using So, recommending |
Right, time has not been good to this ol' standard. Perhaps it would be helpful to also recommend not using an empty MIME type? Honestly, it's the first time I've ever seen a And now that I know what the default is, it makes sense why nobody uses this. If you were actually including ASCII data, you would probably be better off using a MIME type like (For regular UTF-8 content, of course we can just use plain JSON strings rather than It's sort of a marginal case, I think? Probably a more common use-case will be embedding an image. And if someone needs to embed an AES encryption key, a MIME type like |
|
The official IANA MIME type for arbitrary binary data is I'd also like to point out some prior art regarding interoperability of JSON and JSON5 documents, which is what these recommendations are about. JSON Schema is the de facto standard for data contracts, validation, linting, and code completion of JSON documents, and it works just as well for JSON5. It's interesting that JSON Schema defines a For example, a JSON5 document that represents an image file may look like this:
{
filename: 'image-01.png',
content: 'KBMPttgrVnXInj4j1ae+jw==',
}
It could have a JSON Schema (as a JSON5 document) like this:
{
$schema: 'https://json-schema.org/draft/2020-12/schema',
$id: 'https://example.com/image.schema.json5',
title: 'Image File',
description: 'An image with its filename',
type: 'object',
properties: {
filename: {
type: 'string',
},
content: {
type: 'string',
contentEncoding: 'base64',
contentMediaType: 'image/png',
},
},
}
Granted, this forces all images to be PNGs. However, if you were to use data URLs like this:
{
filename: 'image-01.png',
content: '',
}
then your schema could look like this:
{
$schema: 'https://json-schema.org/draft/2020-12/schema',
$id: 'https://example.com/image.schema.json5',
title: 'Image File',
description: 'An image with its filename',
type: 'object',
properties: {
filename: {
type: 'string',
},
content: {
type: 'string',
format: 'data-url',
},
},
}
but then you'd be using a non-standard Granted, you aren't forced to use the JSON Schema
{
filename: 'image-01.png',
content: 'KBMPttgrVnXInj4j1ae+jw==',
mediaType: 'image/png',
}
Anyway, the point I'm getting at is that JSON Schema doesn't have native support for data URLs, but it does have native support for Base64 strings and media types. |
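As an illustration of the last document shape (Base64 `content` plus a sibling `mediaType`), here is a hand-rolled Node.js sketch of how a consumer might check and decode it. It does not use a JSON Schema validator; `readAttachment`, the allow-list, and the Base64 pattern are assumptions made for this example only.

```javascript
// Document shape from the example above: Base64 payload plus media type.
const doc = {
  filename: 'image-01.png',
  content: 'KBMPttgrVnXInj4j1ae+jw==',
  mediaType: 'image/png',
};

// Assumed allow-list; a real application would choose its own.
const ALLOWED_MEDIA_TYPES = new Set(['image/png', 'image/jpeg', 'image/gif']);
const BASE64_PATTERN =
  /^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$/;

// Hypothetical helper: validate the fields, then decode the payload.
function readAttachment(value) {
  if (!ALLOWED_MEDIA_TYPES.has(value.mediaType)) {
    throw new TypeError(`unsupported media type: ${value.mediaType}`);
  }
  if (typeof value.content !== 'string' || !BASE64_PATTERN.test(value.content)) {
    throw new TypeError('content is not well-formed Base64');
  }
  return {
    filename: value.filename,
    mediaType: value.mediaType,
    bytes: Buffer.from(value.content, 'base64'),
  };
}

const attachment = readAttachment(doc);
console.log(attachment.mediaType, attachment.bytes.length);
```

Unlike the single `contentMediaType` in the first schema, this shape lets each document carry a different media type while still keeping the payload a plain Base64 string.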
I actually prefer the way of using an int array similar to Node.js, as this seems to be the closest to a byte field / binary buffer.
|
@ddomnik to your first point, image data is generally not human-readable - so I don't think an array of numbers is any more human-readable than a base64 string? If it's JPEG or any other compressed binary data, it's not human-readable in any format. To your second point, I don't understand, what field/data type information gets lost? (How is it preserved by an int array?)
@jordanbtucker it does have a If that's not enough, you could validate the data itself using something like:
It looks a bit wonky, but it is very flexible - the example here is actually safer than a built-in data URI type in JSON-schema would be, validating the allowed type/subtypes. (If you have a lot of images, you can use More to your point, yes, JSON-schema does have native support for base64 strings and media types, and no, you can't validate the actual base64 encoding using a I suppose we could recommend this type of pattern instead:
I guess in some ways this is more human-readable than a data URI? Data URIs feel "closer to web", but may be just the feels? 😌 As you point out, it is inflexible, allowing only one media type - though enforcing consistent image/file formats isn't necessarily a bad thing, it does preclude use-cases like arbitrary file attachments. In that case, however, you would probably just use If it feels like too deep of a rabbit hole, I'm not opposed to closing this as out-of-scope. 😅 |
While I agree that the binary content itself (e.g., image data) is not human-readable, a byte array/field offers advantages. Many file formats include readable headers, metadata, or even a readable payload (e.g., PNG, ZIP, Protobuf, serialized (Java) objects, and protocol snippets). Using a byte array instead of Base64 preserves this partial readability, which can aid debugging, inspection, and understanding the data's structure without full decoding. Base64 encoding converts binary data into strings, losing its original type. Consuming applications must "know" that a specific string field contains encoded binary data, adding complexity and ambiguity. With a dedicated bytes type, the format itself enforces type information, just like numbers in JSON can always be interpreted as doubles or integers across languages. This improves interoperability, especially as almost every language supports bytes in some way. Therefore, I suggest adding a dedicated bytes type, similar to Protocol Buffers' bytes. Off the top of my head, I have these two syntax proposals for JSON5, where feedback is welcome:
Angle Brackets with Hexadecimal
Python-Inspired Notation
Both approaches are URL-safe and more compact than Base64, making JSON5 more efficient and compatible with other formats like Protobuf. Looking at this Stack Overflow question with more than 350k views, I think support for a bytes type is generally desired. https://stackoverflow.com/questions/20706783/put-byte-array-to-json-and-vice-versa |
I mean, when you embed binary data into a human-readable file, its original type is lost no matter how it's embedded.
They must know that either way, since
So what would you expect a JSON5 library to do in JS then? Accept and output This likely creates a weird situation where every platform needs to figure out how to represent binary values. It also makes JSON5 fundamentally incompatible with JSON, since, unfortunately, JSON itself has no way to represent binary data. I think part of the expectation for JSON5 is it should be able to convert both to and from JSON in an intuitive way? Base64 is most likely what APIs etc. are using already, since it's really the only practical way to embed binary data in JSON - arrays would likely lead to unacceptable file sizes, and we would need to consider JSON parse time overhead as well. I don't know, it sounds nice in theory, but it doesn't really seem practical, and base64 functions are readily available everywhere, and it's almost certainly what people are using already - I'm not convinced this solves more problems than it creates. |
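To put rough numbers on the file size concern, here is a small Node.js sketch that measures how many characters the same bytes occupy as a JSON array of numbers, as a hex string, and as a Base64 string. The helper `encodedSizes` and the 300-byte test buffers are assumptions chosen for illustration; real payloads will fall somewhere between the two cases for the array form.

```javascript
// Hypothetical helper: serialized length of each representation,
// including the JSON quotes or brackets.
function encodedSizes(bytes) {
  return {
    raw: bytes.length,
    numberArray: JSON.stringify(Array.from(bytes)).length,
    hex: JSON.stringify(bytes.toString('hex')).length,
    base64: JSON.stringify(bytes.toString('base64')).length,
  };
}

// Worst case for the array form: every byte needs three digits.
const allHighBytes = encodedSizes(Buffer.alloc(300, 255));

// Best case for the array form: every byte is a single digit.
const allZeroBytes = encodedSizes(Buffer.alloc(300, 0));

console.log(allHighBytes);
console.log(allZeroBytes);
```

Base64 stays at four characters per three bytes regardless of content, hex is always two characters per byte, and the number array ranges from about two to about four characters per byte before any whitespace or indentation is added.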
Not if the human-readable format supports a type for binary data. What we currently do is basically casting binary data to strings, which requires it to be base64 encoded, otherwise it won't be a valid string (e.g. due to a wrongly interpreted null terminator).
This is true if we use the array approach, but not if it's a dedicated type. Then it will always be a Uint8Array.
Exactly and as mentioned, each language supports binary data, so no need to "figure" it out in my opinion.
I think the general concept for newer versions is that they are backward compatible, not the other way round. So everything that is currently supported by JSON will work if the consumer side uses JSON5. That's how it works for every lib I know. If the consumer side has a lower version, then the producer should be able to adjust: use the commonly accepted way and use a string with base64.
For me, this sounds more like a solution born out of necessity, because JSON did not support this type by default. And as we see, we need it a lot; just take your image data, for example. Maybe also because JSON was initially designed for Web/REST. Now with the shift to IoT and smart home devices, the Web will need to adapt to producers that are embedded and need to work with limited resources. Images converted to base64 are huge; some cite a 33%-37% size increase. I don't see any harm in adding an extra type that can, but doesn't need to, be used. If people see no need for it, they will stick with base64. For those who require byte arrays, however, this feature is a blessing. |
Comments aren't part of the data - so you can (currently) convert JSON5 to JSON (and back) e.g. for programs or APIs that do not accept JSON5, and you would get data that is identical when parsing the JSON5 or the converted JSON. As I recall, we had the same discussion about date/time - another data type that can't be converted without using a different, ambiguous representation. There was a proposal for a date/time literal, I think - but you can't convert that to JSON and then back to JSON5, you would just get a number or an ISO string, or whatever representation you chose. Similarly, if we add a native binary data type, you would simply have to decide how to represent this as JSON - and whether that's a base64 string or an array of numbers, those values are ambiguous with strings/arrays/numbers after conversion, and can't be converted back to JSON5 except as the unwrapped, ambiguous strings/arrays/numbers. I would like to hear from @jordanbtucker on this, but as I recall, he did not want to change the schema of the JSON data format itself, but: json5-spec/docs/1.0.0/index.html Lines 18 to 20 in d77331d
ECMAScript has no binary data literals, nor date/time literals. json5-spec/docs/1.0.0/index.html Lines 36 to 37 in d77331d
JSON only has the four primitive types. To my understanding, if the spec goes anywhere outside of that scope, it'll be recommendations only - and even that has taken years of discussions to settle on anything. (I actually thought the date/time recommendation had been settled, but apparently that's not even in the spec, so.) |
Have you thought about some sort of support for embedding binary data? (blobs)
Unicode strings are not generic - not all escape sequences are valid Unicode.
What people typically do, is they encode binary data in base64 format - it's not very efficient or elegant, but probably okay for smaller binary chunks.
I wonder if we can think of something better?
If not, perhaps we could make a recommendation about how binary data should be encoded? Base64 sometimes uses different characters - RFC 4648 defines two encodings, one being URL safe, and several encodings with smaller character sets.
Personally, I like the "URL and Filename safe" variant - in the context of JSON, which will likely be served from URLs a lot of the time, it would be nice if programs could use the same library functions (with the same settings) to reliably decode JSON blobs, query-strings, post-data, etc.
What do you think, is this worth touching on in the spec?
Anything that makes JSON and the ecosystem around it more coherent is helpful, in my opinion.
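As a small illustration of the two RFC 4648 alphabets mentioned above, here is a Node.js sketch that encodes the same two bytes with the standard and the "URL and Filename safe" variants. It assumes a Node.js version whose `Buffer` supports the `'base64url'` encoding (available in current releases); the variable names are illustrative.

```javascript
// Two bytes chosen so that the encodings differ in every position.
const bytes = Buffer.from([0xfb, 0xff]);

// Standard alphabet: uses '+' and '/', padded with '='.
const standard = bytes.toString('base64');

// URL and Filename safe alphabet: uses '-' and '_'; Node.js omits padding.
const urlSafe = bytes.toString('base64url');

// Only the standard form needs escaping when placed in a query string.
const standardInUrl = encodeURIComponent(standard);
const urlSafeInUrl = encodeURIComponent(urlSafe);

// The URL-safe form decodes back to the same bytes.
const decoded = Buffer.from(urlSafe, 'base64url');

console.log(standard, standardInUrl);
console.log(urlSafe, urlSafeInUrl);
```

If the same value has to appear in JSON bodies, query strings, and post data, the URL-safe variant avoids a second layer of percent-encoding.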