Recommend an encoding for binary data #35

Open
mindplay-dk opened this issue Apr 13, 2022 · 19 comments

@mindplay-dk

Have you thought about some sort of support for embedding binary data? (blobs)

Unicode strings are not generic - not all escape sequences are valid Unicode.

What people typically do, is they encode binary data in base64 format - it's not very efficient or elegant, but probably okay for smaller binary chunks.

I wonder if we can think of something better?

If not, perhaps we could make a recommendation about how binary data should be encoded? Base64 sometimes uses different characters - RFC 4648 defines two encodings, one being URL safe, and several encodings with smaller character sets.

Personally, I like the "URL and Filename safe" variant - in the context of JSON, which will likely be served from URLs a lot of the time, it would be nice if programs could use the same library functions (with the same settings) to reliably decode JSON blobs, query-strings, post-data, etc.

What do you think, is this worth touching on in the spec?

Anything that makes JSON and the ecosystem around it more coherent is helpful, in my opinion.
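
For illustration, here is a small Node.js sketch of the two RFC 4648 alphabets mentioned above. It assumes a Node.js version whose Buffer supports the 'base64url' encoding, and the sample bytes are made up so that the two outputs actually differ.

```javascript
// Sample bytes chosen so that the two alphabets produce different text.
const bytes = Buffer.from([0xfb, 0xff, 0xfe, 0xff]);

// Standard alphabet (RFC 4648, section 4): uses "+" and "/", padded with "=".
const standard = bytes.toString('base64');

// URL and filename safe alphabet (RFC 4648, section 5): uses "-" and "_".
// Node.js omits the "=" padding when encoding with 'base64url'.
const urlSafe = bytes.toString('base64url');

console.log(standard); // +//+/w==
console.log(urlSafe);  // -__-_w

// The URL-safe form decodes back to the same bytes.
const roundTrip = Buffer.from(urlSafe, 'base64url');
console.log(roundTrip.equals(bytes)); // true
```

The standard form contains "+", "/" and "=", all of which need escaping in a query string, which is the practical argument for the URL-safe variant.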

@jordanbtucker
Member

Definitely. I prefer to use Base64 to encode binary in JSON and JSON5. Other APIs, like Node.js, have opted to use arrays of numbers to serialize binary data into JSON. I'm interested in more discussion on the pros and cons of each, and whether there are other viable options as well. For the purpose of interoperability, it may be useful to give a recommendation in the spec based on our collective findings.

@jordanbtucker self-assigned this Apr 13, 2022
@jordanbtucker added this to the v1.1.0 milestone Apr 13, 2022
@jordanbtucker changed the title from "Binary data?" to "Recommend an encoding for binary data" Apr 13, 2022
@mindplay-dk
Author

> Other APIs, like Node.js, have opted to use arrays of numbers to serialize binary data into JSON.

That sounds crazy. 😄

Where did you come across that?

@jordanbtucker
Member

jordanbtucker commented Apr 17, 2022

JSON serialization of Buffer in Node.js:

JSON.stringify(Buffer.from('foobar'))
{
  "type": "Buffer",
  "data": [
    102,
    111,
    111,
    98,
    97,
    114
  ]
}

Even stranger is the JSON serialization of TypedArray in JavaScript:

JSON.stringify(Int8Array.from(Buffer.from('foobar')))
{
  "0": 102,
  "1": 111,
  "2": 111,
  "3": 98,
  "4": 97,
  "5": 114
}
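
The two serializations above, next to a Base64 string, can be reproduced with a short Node.js sketch. The reviver at the end is one possible way to turn the tagged object back into a Buffer; it is an illustration, not something the thread prescribes.

```javascript
const buf = Buffer.from('foobar');

// Buffer defines toJSON(), so JSON.stringify emits a tagged object.
const bufferJson = JSON.stringify(buf);
console.log(bufferJson); // {"type":"Buffer","data":[102,111,111,98,97,114]}

// Typed arrays have no toJSON(), so they serialize like plain objects
// keyed by index.
const typedArrayJson = JSON.stringify(Int8Array.from(buf));
console.log(typedArrayJson); // {"0":102,"1":111,"2":111,"3":98,"4":97,"5":114}

// The same six bytes as a Base64 string.
const base64Json = JSON.stringify(buf.toString('base64'));
console.log(base64Json); // "Zm9vYmFy"

// A reviver can turn the tagged object back into a Buffer when parsing.
const revived = JSON.parse(bufferJson, (key, value) =>
  value && value.type === 'Buffer' && Array.isArray(value.data)
    ? Buffer.from(value.data)
    : value
);
console.log(revived.toString()); // foobar
```

Without the second and third arguments, JSON.stringify produces the compact form shown in the comments; passing `null, 2` gives the indented layout used above.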

It's quite interesting that JavaScript has the btoa function, yet no built-in JSON serializations use Base64. This paragraph from Binary strings on MDN (cached since the page has been removed), particularly its reference to "multiple conversions", may shed some light on that.

In the past, [manipulating raw binary data] had to be simulated by treating the raw data as a string and using the charCodeAt() method to read the bytes from the data buffer (i.e., using binary strings). However, this is slow and error-prone, due to the need for multiple conversions (especially if the binary data is not actually byte-format data, but, for example, 32-bit integers or floats).

@mindplay-dk
Author

> It's quite interesting that JavaScript has the btoa function, yet no built-in JSON serializations use Base64

Probably because btoa has some ugly limitations

[image not preserved]

Things like toDataURL in Canvas presumably must use a different base64 encoder internally.

Also, if you had object properties containing buffers, these would get serialized as strings - which could be misleading. I mean, there's nothing about a string that safely tells you whether that string is base64 or just some other string.

This makes me wonder if we should recommend something that could be identified as binary?

I'm thinking Data URLs.

These are pretty universal by now as well - and it's arguably safer, more useful, and more human-readable for binary data to be represented as a data URI than as a bare base64 string.

Compare this:

{"pic": "R0lGODdhMAAwAPAAAAAAAP///ywAAAAAMAAwAAAC8IyPqcvt3wCcDkiLc7C0qwyGHhSWpjQu5yqmCYsapyuvUUlvONmOZtfzgFzByTB10QgxOR0TqBQejhRNzOfkVJ+5YiUqrXF5Y5lKh/DeuNcP5yLWGsEbtLiOSpa/TPg7JpJHxyendzWTBfX0cxOnKPjgBzi4diinWGdkF8kjdfnycQZXZeYGejmJlZeGl9i2icVqaNVailT6F5iJ90m6mvuTS4OK05M0vDk0Q4XUtwvKOzrcd3iq9uisF81M1OIcR7lEewwcLp7tuNNkM3uNna3F2JQFo97Vriy/Xl4/f1cf5VWzXyym7PHhhx4dbgYKAAA7"}

With this:

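{"pic": "data:image/gif;base64,R0lGODdhMAAwAPAAAAAAAP///ywAAAAAMAAwAAAC8IyPqcvt3wCcDkiLc7C0qwyGHhSWpjQu5yqmCYsapyuvUUlvONmOZtfzgFzByTB10QgxOR0TqBQejhRNzOfkVJ+5YiUqrXF5Y5lKh/DeuNcP5yLWGsEbtLiOSpa/TPg7JpJHxyendzWTBfX0cxOnKPjgBzi4diinWGdkF8kjdfnycQZXZeYGejmJlZeGl9i2icVqaNVailT6F5iJ90m6mvuTS4OK05M0vDk0Q4XUtwvKOzrcd3iq9uisF81M1OIcR7lEewwcLp7tuNNkM3uNna3F2JQFo97Vriy/Xl4/f1cf5VWzXyym7PHhhx4dbgYKAAA7"}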

The latter ticks a lot of boxes:

✅ It's clear (to machines + humans) that it's base64 encoded. (probably even safe enough to auto-decode in clients.)
✅ It specifies the content-type: you don't have to know or detect it. (probably safer in browser context, too.)
✅ It's web-friendly: you can inject this directly into an <img src="..."> etc.

There is RFC 2397 providing a formal specification that we could refer to - although this looks a little outdated:

> The "data:" URL scheme is only useful for short values

According to MDN docs, Opera 11 did have a 64KB limit - but all modern browsers support at least 256MB, so this doesn't seem relevant anymore.

JSON itself has some practical size limitations either way, and you probably wouldn't/shouldn't embed hundreds of MB of data in JSON blobs, using base64, or anything else for that matter. Should the spec specify a size limit? Perhaps suggesting external URLs as an alternative, pointing to larger resources for clients to download after parsing the JSON.

The only downside I can see is that data URLs may be less well-supported on the server-side than plain base64 is. I'm sure every language has at least a package for this by now though.

Under any circumstances, this would be a recommendation, right? Not a requirement.
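
As a minimal Node.js sketch of the data URL idea: the helper names `toDataUrl` and `looksLikeBase64DataUrl` are hypothetical, and the six-byte GIF signature stands in for a real image.

```javascript
// Hypothetical helper: wrap raw bytes in a Base64 data URL.
function toDataUrl(bytes, mediaType) {
  return `data:${mediaType};base64,${Buffer.from(bytes).toString('base64')}`;
}

// Hypothetical helper: a cheap check that a string claims to be a Base64
// data URL. It only inspects the prefix and does not validate the payload.
function looksLikeBase64DataUrl(value) {
  return typeof value === 'string' && /^data:[^,]*;base64,/i.test(value);
}

// The "GIF87a" signature, used here in place of a complete image file.
const gifSignature = Buffer.from('GIF87a', 'latin1');
const doc = { pic: toDataUrl(gifSignature, 'image/gif') };

console.log(JSON.stringify(doc)); // {"pic":"data:image/gif;base64,R0lGODdh"}
console.log(looksLikeBase64DataUrl(doc.pic));    // true
console.log(looksLikeBase64DataUrl('R0lGODdh')); // false
```

The prefix check is what lets a consumer tell an embedded blob apart from an ordinary string without knowing the schema in advance.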

@jordanbtucker
Member

jordanbtucker commented Apr 19, 2022

The btoa function is designed to work on binary strings, which are sequences of Unicode code points in the range of U+0000 through U+00FF (i.e. the Latin-1 character set), where each code point represents a single byte. So, I wouldn't call it an ugly limitation but a necessary one. What byte should ✓ (U+2713) represent?

toDataURL, on the other hand, converts the canvas to an image format, like PNG, and then encodes those raw bytes as Base64. So, btoa and toDataURL work on two different types of input.
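
The binary-string restriction can be seen with a short sketch. It assumes a runtime with global `btoa` and `TextEncoder` (browsers and recent Node.js versions); `utf8ToBase64` is a hypothetical helper name, and choosing UTF-8 is an assumption, since that choice is exactly what btoa cannot make on its own.

```javascript
// btoa only accepts "binary strings": every code unit must be <= 0xFF.
let btoaError = null;
try {
  btoa('\u2713'); // CHECK MARK lies outside the Latin-1 range
} catch (err) {
  btoaError = err;
}
console.log(btoaError ? btoaError.name : 'no error'); // InvalidCharacterError

// Workaround: pick an explicit byte encoding first (UTF-8 here), build a
// binary string from those bytes, and pass that to btoa.
function utf8ToBase64(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = '';
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

console.log(utf8ToBase64('\u2713')); // 4pyT
```

The check mark becomes the three UTF-8 bytes E2 9C 93, which is why the result is four Base64 characters.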

Using data: URLs is an interesting thought. It would allow the string to hold more information than just the binary data. But what about when the data does not have a MIME type, for example an AES encryption key? I know that technically it would be data:;base64,hJxB...aso=*, but does that data:;base64, intro, 13 additional characters, really add anything beyond indicating that the data is encoded as Base64? Is that really necessary if two JSON5 applications have already negotiated that the key property is encoded as Base64?

And, yes, additions to the spec would be interoperability recommendations, similar to the ones you find in RFC 8259.

* A data URL with an unspecified MIME type implicitly has a MIME type of text/plain;charset=US-ASCII according to RFC 2397.

@mindplay-dk
Author

> does that data:;base64, intro, 13 additional characters, really add anything beyond indicating that the data is encoded as Base64?

No, but that's already worth something in itself, I think.

Having data with no content-type is probably less common than having data with a known type. I don't think "some data has no type" is really an argument against having a type for everything else?

And in that marginal case, it's a recommendation - you don't have to follow it if it doesn't make sense for your use-case.

> Is that really necessary if two JSON5 applications have already negotiated that the key property is encoded as Base64?

No, but that same argument would work against a date format standard recommendation - if two applications have already negotiated that they're going to use the date property encoded in RFC 3339 format, the date format recommendation isn't useful either. (You seemed to support that idea?)

I think both of these recommendations would be useful - there are oodles of fun and interesting ways to encode both dates and binary data. Often, people will pick the one they know and happen to have close at hand - there's often no compelling reason to pick one format over another, so this would help with that choice.

It would simplify things if projects aligned towards one way of encoding these types - opening up to MIME types via data URLs would provide a safe way to encode and embed a lot of data formats, both binary and text.

That's just one guy's opinion of course. Would love to hear from other contributors. 🙂

@tracker1

I like the data URL as well... I would think that strings would be UTF8 encoded into UInt8Array first of type text/utf8 and buffer or uint8 array would be binary ... Binary going into uint8 array... Other typed arrays being javascript/TYPEarray

@mindplay-dk
Author

Come to think of it...

> ... what about when the data does not have a MIME type, for example an AES encryption key? I know that technically it would be data:;base64,hJxB...aso=, but does that data:;base64, intro, 13 additional characters, really add anything beyond indicating that the data is encoded as Base64?

Actually, I believe this does add something: an extra layer of validation and explicitness.

Some base64 data is indistinguishable from text - that is, your app might expect base64, but somebody put a string in there that just happens to be valid base64, and decodes to some nonsense data, which triggers an obscure error further up the call stack, which could be very difficult to debug.

So yeah, that little 13 character preamble does have the benefit of letting somebody explicitly indicate base64 data.

Still, this would be a recommendation - you can deviate if it doesn't make sense for a given use case.
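
A small Node.js sketch of the failure mode described above, assuming Node's lenient Base64 decoder; `decodeTaggedBase64` is a hypothetical helper that only decodes values carrying the explicit preamble.

```javascript
// An ordinary word that happens to use only Base64 characters.
const notReallyBase64 = 'deadbeef';

// Node's decoder does not complain; it simply returns six bytes.
const decoded = Buffer.from(notReallyBase64, 'base64');
console.log(decoded.toString('hex')); // 75e69d6de79f

// Hypothetical helper: refuse anything without a ";base64," data URL prefix.
function decodeTaggedBase64(value) {
  const match = typeof value === 'string'
    ? /^data:[^,]*;base64,(.*)$/s.exec(value)
    : null;
  if (!match) {
    throw new TypeError('expected a Base64 data URL');
  }
  return Buffer.from(match[1], 'base64');
}

let rejected = false;
try {
  decodeTaggedBase64(notReallyBase64);
} catch (err) {
  rejected = true; // the mistake surfaces here, not further up the call stack
}
console.log(rejected); // true
```

With the preamble check, a plain string placed in a binary field fails immediately at the decoding boundary.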

@jordanbtucker
Member

@tracker1 Strings don't necessarily need to be encoded as UTF-8 since JSON5 already has a string type, which is defined as a sequence of Unicode code points. JSON5 documents themselves are recommended to be encoded as UTF-8 however.

If you want to store the original UTF-8 representation of text in a Base64 data URL, and let's say that text is the HTML string <h1>Hello, World!</h1>, then the data URL would look like this:

data:text/html;charset=utf-8;base64,PGgxPkhlbGxvLCBXb3JsZCE8L2gxPg==

Also, specifying UInt8Array and TypedArray is a JavaScript-centric way of looking at things. JSON5 is meant to be used on all platforms, not just within the JavaScript ecosystem.

@jordanbtucker
Member

jordanbtucker commented Jul 16, 2022

@mindplay-dk So, there's a snag with using data:;base64, for generic binary data. According to RFC 2397, a data URL with an unspecified MIME type implicitly has a MIME type of text/plain;charset=US-ASCII.

So, recommending data:;base64, for binary data with an unspecified MIME type actually goes against the data URL spec, and these URLs should be interpreted as plain text.
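
The RFC 2397 default can be made concrete with a simplified parser. This is a sketch, not a complete implementation: `parseDataUrl` is a hypothetical helper, it does not validate media type parameters, and it assumes non-Base64 payloads are percent-encoded UTF-8.

```javascript
// Hypothetical helper: split a data URL into its media type and bytes.
function parseDataUrl(url) {
  const match = /^data:([^,]*),(.*)$/s.exec(url);
  if (!match) throw new TypeError('not a data URL');

  let header = match[1];
  const isBase64 = /;base64$/i.test(header);
  if (isBase64) header = header.slice(0, -';base64'.length);

  // RFC 2397: an omitted media type defaults to text/plain;charset=US-ASCII,
  // and "text/plain" may be omitted when only parameters are given.
  let mediaType = header;
  if (header === '') mediaType = 'text/plain;charset=US-ASCII';
  else if (header.startsWith(';')) mediaType = 'text/plain' + header;

  // Simplification: treat a non-Base64 payload as percent-encoded UTF-8.
  const bytes = isBase64
    ? Buffer.from(match[2], 'base64')
    : Buffer.from(decodeURIComponent(match[2]), 'utf8');

  return { mediaType, bytes };
}

const implicit = parseDataUrl('data:;base64,gICAgA==');
console.log(implicit.mediaType); // text/plain;charset=US-ASCII

const explicit = parseDataUrl('data:application/octet-stream;base64,gICAgA==');
console.log(explicit.mediaType);             // application/octet-stream
console.log(explicit.bytes.toString('hex')); // 80808080
```

A consumer that follows the RFC would therefore treat the untyped form as ASCII text, even though the bytes 80 80 80 80 are not valid ASCII.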

@mindplay-dk
Author

Right, time has not been good to this ol' standard.

Perhaps it would be helpful to also recommend not using an empty MIME type?

Honestly, it's the first time I've ever seen a data: URI with the MIME type omitted - I didn't even know that was possible.

And now that I know what the default is, it makes sense why nobody uses this. If you were actually including ASCII data, you would probably be better off using a MIME type like text/plain; charset=us-ascii to be explicit about it, since most likely a person would expect UTF-8 for text content today.

(For regular UTF-8 content, of course we can just use plain JSON strings rather than data: URIs anyhow.)

It's sort of a marginal case, I think? Probably a more common use-case will be embedding an image.

And if someone needs to embed an AES encryption key, a MIME type like application/x-aes likely still has value to a person, even if all it indicates to a program is "binary data": it makes the JSON easier to understand.

@tracker1

binary/octet-string would probably be an appropriate MIME type for general binary-encoded data.

@jordanbtucker
Member

jordanbtucker commented Aug 23, 2022

The official IANA MIME type for arbitrary binary data is application/octet-stream as defined in RFC 2046. So, if you want to encode the bytes 80 80 80 80 as arbitrary binary data in Base64 in a data URL, you should use data:application/octet-stream;base64,gICAgA==.

I'd also like to point out some prior art regarding interoperability of JSON and JSON5 documents, which is what these recommendations are about. JSON Schema is the de facto standard for data contracts, validation, linting, and code completion of JSON documents, and it works just as well for JSON5. It's interesting that JSON Schema defines a contentEncoding and a contentMediaType property, yet it doesn't define a format for data URLs like it does other string formats like dates, email addresses, etc.

For example, a JSON5 document that represents an image file may look like this:

{
  filename: 'image-01.png',
  content: 'KBMPttgrVnXInj4j1ae+jw==',
}

It could have a JSON Schema (as a JSON5 document) like this:

{
  $schema: 'https://json-schema.org/draft/2020-12/schema',
  $id: 'https://example.com/image.schema.json5',
  title: 'Image File',
  description: 'An image with its filename',
  type: 'object',
  properties: {
    filename: {
      type: 'string',
    },
    content: {
      type: 'string',
      contentEncoding: 'base64',
      contentMediaType: 'image/png',
    },
  },
}

Granted, this forces all images to be PNGs.

However, if you were to use data URLs like this:

{
  filename: 'image-01.png',
  content: 'data:image/png;base64,KBMPttgrVnXInj4j1ae+jw==',
}

then your schema could look like this:

{
  $schema: 'https://json-schema.org/draft/2020-12/schema',
  $id: 'https://example.com/image.schema.json5',
  title: 'Image File',
  description: 'An image with its filename',
  type: 'object',
  properties: {
    filename: {
      type: 'string',
    },
    content: {
      type: 'string',
      format: 'data-url',
    },
  },
}

but then you'd be using a non-standard data-url value for the format field. However, you gain the ability to represent more than just PNG files.

Granted, you aren't forced to use the JSON Schema contentMediaType property. You could just specify the media type in the JSON5 document directly, like this:

{
  filename: 'image-01.png',
  content: 'KBMPttgrVnXInj4j1ae+jw==',
  mediaType: 'image/png',
}

Anyway, the point I'm getting at is that JSON Schema doesn't have native support for data URLs, but it does have native support for Base64 strings and media types.
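
As far as I understand, recent JSON Schema drafts treat contentEncoding and contentMediaType as annotations, so a validator is not required to decode the string. A consuming application would then check the payload itself. The following Node.js sketch shows one way to do that for the document shape with a separate mediaType property; `decodeStrictBase64` and `readImage` are hypothetical helpers.

```javascript
// Hypothetical helper: decode Base64 and reject input that Node's lenient
// decoder would otherwise silently skip or truncate.
function decodeStrictBase64(text) {
  const bytes = Buffer.from(text, 'base64');
  // Re-encoding must reproduce the input exactly (standard alphabet, padded).
  if (bytes.toString('base64') !== text) {
    throw new TypeError('not canonical Base64');
  }
  return bytes;
}

// Hypothetical application-level check for the document shape shown above.
function readImage(doc) {
  for (const field of ['filename', 'content', 'mediaType']) {
    if (typeof doc[field] !== 'string') {
      throw new TypeError(`${field} must be a string`);
    }
  }
  return {
    filename: doc.filename,
    mediaType: doc.mediaType,
    bytes: decodeStrictBase64(doc.content),
  };
}

const image = readImage({
  filename: 'image-01.png',
  content: 'KBMPttgrVnXInj4j1ae+jw==',
  mediaType: 'image/png',
});
console.log(image.mediaType, image.bytes.length); // image/png 16
```

The round-trip comparison is a deliberately strict choice: it rejects URL-safe or unpadded input, which may or may not be what a given application wants.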

@ddomnik

ddomnik commented Nov 19, 2024

I actually prefer the way of using an int array similar to Node.js, as this seems to be the closest to a byte field / binary buffer.
I have two problems with the base64 encoded string approach:

  • The contained data is not human-readable.
  • The field/data type information gets lost, so you need to know what field has been encoded as base64 to revert it.

@mindplay-dk
Author

> I actually prefer the way of using an int array similar to Node.js, as this seems to be the closest to a byte field / binary buffer. I have two problems with the base64 encoded string approach:
>
> • The contained data is not human-readable.
> • The field/data type information gets lost, so you need to know what field has been encoded as base64 to revert it.

@ddomnik to your first point, image data is generally not human-readable - so I don't think an array of numbers is any more human-readable than a base64 string? If it's JPEG or any other compressed binary data, it's not human-readable in any format.

To your second point, I don't understand, what field/data type information gets lost? (How is it preserved by an int array?)

> Anyway, the point I'm getting at is that JSON Schema doesn't have native support for data URLs, but it does have native support for Base64 strings and media types.

@jordanbtucker it does have a uri format though - that would give you at least a basic validation.

If that's not enough, you could validate the data itself using something like:

{
  "type": "object",
  "properties": {
    "image": {
      "type": "string",
      "pattern": "^data:image/(png|jpeg|gif);base64,[A-Za-z0-9+/=]+$"
    }
  }
}

It looks a bit wonky, but it is very flexible - the example here is actually safer than a built-in data URI type in JSON-schema would be, validating the allowed type/subtypes. (If you have a lot of images, you can use $defs and $ref to make it reusable.)
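
To see what the pattern accepts, here is a quick check in JavaScript using the same pattern string as the schema above. The sample values are made up for illustration.

```javascript
// Same pattern string as in the schema above.
const imagePattern = new RegExp(
  '^data:image/(png|jpeg|gif);base64,[A-Za-z0-9+/=]+$'
);

const samples = [
  'data:image/png;base64,KBMPttgrVnXInj4j1ae+jw==',  // allowed subtype
  'data:image/webp;base64,KBMPttgrVnXInj4j1ae+jw==', // subtype not in the list
  'KBMPttgrVnXInj4j1ae+jw==',                        // bare Base64, no prefix
  'data:image/png;base64,====',                      // right alphabet, not valid Base64
];

const results = samples.map((sample) => imagePattern.test(sample));
console.log(results); // [ true, false, false, true ]
```

The last sample passes because the character class only restricts the alphabet; it does not check that the payload is well-formed Base64.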

More to your point, yes, JSON-schema does have native support for base64 strings and media types, and no, you can't validate the actual base64 encoding using a pattern as in my example.

I suppose we could recommend this type of pattern instead:

{
  "image": {
    "name": "avatar.png",
    "type": "image/png",
    "data": "KBMPttgrVnXInj4j1ae+jw=="
  }
}

I guess in some ways this is more human-readable than a data URI? Data URIs feel "closer to web", but maybe that's just the feels? 😌

As you point out, it is inflexible, allowing only one media type - though enforcing consistent image/file formats isn't necessarily a bad thing, it does preclude use-cases like arbitrary file attachments. In that case, however, you would probably just use contentEncoding without a contentMediaType, and then handle unsupported attachments at the application-level. I mean, there's an expected limit to how much detail you can validate with schema-only, right?

If it feels like too deep of a rabbit hole, I'm not opposed to closing this as out-of-scope. 😅

@ddomnik

ddomnik commented Nov 21, 2024

> I actually prefer the way of using an int array similar to Node.js, as this seems to be the closest to a byte field / binary buffer. I have two problems with the base64 encoded string approach:
>
> • The contained data is not human-readable.
> • The field/data type information gets lost, so you need to know what field has been encoded as base64 to revert it.
>
> @ddomnik to your first point, image data is generally not human-readable - so I don't think an array of numbers is any more human-readable than a base64 string? If it's JPEG or any other compressed binary data, it's not human-readable in any format.
>
> To your second point, I don't understand, what field/data type information gets lost? (How is it preserved by an int array?)
>
> Anyway, the point I'm getting at is that JSON Schema doesn't have native support for data URLs, but it does have native support for Base64 strings and media types.

While I agree that the binary content itself (e.g., image data) is not human-readable, a byte array/field offers advantages. Many file formats include readable headers/metadata or payload itself (e.g., PNG, ZIP, Protobuf, serialized (Java) objects, and protocol snippets). Using a byte array instead of Base64 preserves this partial readability, which can aid debugging, inspection, and understanding the data's structure without full decoding.

Base64 encoding converts binary data into strings, losing its original type. Consuming applications must "know" that a specific string field contains encoded binary data, adding complexity and ambiguity. With a dedicated bytes type, the format itself enforces type information, just like numbers in JSON can always be interpreted as doubles or integers across languages. This improves interoperability, especially as almost every language supports bytes in some way.

Therefore, I suggest adding a dedicated bytes type, similar to Protocol Buffers' bytes. Off the top of my head, I have these two syntax proposals for JSON5, where feedback is welcome:

Angle Brackets with Hexadecimal

{
    "image": <89504E470D0A1A0A>,  // PNG file header in hex
    "signature": <305C4D3F2A>
}

Python-Inspired Notation

{
    "image": b'\x89\x50\x4E\x47\x0D\x0A\x1A\x0A',  // PNG file header in hex
}

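Both approaches keep the bytes visible as hex, which makes them easier to inspect than Base64, and they map directly onto formats like Protobuf. In terms of size, hex uses two characters per byte, so the angle-bracket form is more compact than an array of numbers but somewhat larger than Base64 (four characters per three bytes); the Python-inspired form uses four characters per byte.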

Looking at this question from Stack Overflow with more than 350k views, I think support for a bytes type is generally desired. https://stackoverflow.com/questions/20706783/put-byte-array-to-json-and-vice-versa
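
A rough size comparison of the notations discussed in this thread, as a Node.js sketch. The two literal syntaxes are only proposals, so the helpers below are hypothetical renderers that produce text in those shapes; the input is the eight-byte PNG header from the examples above.

```javascript
// Hypothetical renderers for each notation.
const asNumberArray = (bytes) => JSON.stringify(Array.from(bytes));
const asBase64 = (bytes) =>
  JSON.stringify(Buffer.from(bytes).toString('base64'));
const asHexLiteral = (bytes) =>
  `<${Buffer.from(bytes).toString('hex').toUpperCase()}>`;
const asPythonLiteral = (bytes) =>
  "b'" +
  Array.from(bytes, (b) => '\\x' + b.toString(16).padStart(2, '0').toUpperCase()).join('') +
  "'";

const pngHeader = Uint8Array.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

const sizes = {
  numberArray: asNumberArray(pngHeader).length,     // [137,80,78,71,13,10,26,10]
  base64: asBase64(pngHeader).length,               // "iVBORw0KGgo="
  hexLiteral: asHexLiteral(pngHeader).length,       // <89504E470D0A1A0A>
  pythonLiteral: asPythonLiteral(pngHeader).length, // b'\x89\x50\x4E\x47\x0D\x0A\x1A\x0A'
};

console.log(sizes);
// { numberArray: 26, base64: 14, hexLiteral: 18, pythonLiteral: 35 }
```

For this input, Base64 is the shortest text form and hex sits between Base64 and the number array. The ratios for larger inputs follow from the per-byte costs: about 1.33 characters for Base64, 2 for hex, and 4 for the escaped form.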

@mindplay-dk
Author

> Base64 encoding converts binary data into strings, losing its original type.

I mean, when you embed binary data into a human-readable file, its original type is lost no matter how it's embedded.

> Consuming applications must "know" that a specific string field contains encoded binary data, adding complexity and ambiguity.

They must know that either way, since [1,2,3] could be a byte array or a signed integer array - in JS, you can't tell.

> With a dedicated bytes type, the format itself enforces type information, just like numbers in JSON can always be interpreted as doubles or integers across languages. This improves interoperability, especially as almost every language supports bytes in some way.

So what would you expect a JSON5 library to do in JS then?

Accept and output Uint8Array for this proposed dedicated binary type or what?

This likely creates a weird situation where every platform needs to figure out how to represent binary values.

It also makes JSON5 fundamentally incompatible with JSON, since, unfortunately, JSON itself has no way to represent binary data. I think part of the expectation for JSON5 is it should be able to convert both to and from JSON in an intuitive way?

Base64 is most likely what APIs etc. are using already, since it's really the only practical way to embed binary data in JSON - arrays would likely lead to unacceptable file sizes, and we would need to consider JSON parse time overhead as well.

I don't know, it sounds nice in theory, but it doesn't really seem practical, and base64 functions are readily available everywhere, and it's almost certainly what people are using already - I'm not convinced this solves more problems than it creates.

@ddomnik

ddomnik commented Nov 21, 2024

> I mean, when you embed binary data into a human-readable file, its original type is lost no matter how it's embedded.

Not if the human-readable format supports a type for binary data. What we currently do is basically cast binary data to strings, which requires it to be base64 encoded; otherwise it won't be a valid string (e.g. due to a wrongly interpreted null terminator).

> They must know that either way, since [1,2,3] could be a byte array or a signed integer array - in JS, you can't tell.

This is true if we use the array approach, but not if it's a dedicated type. Then it will always be a Uint8Array.

> So what would you expect a JSON5 library to do in JS then?
> Accept and output Uint8Array for this proposed dedicated binary type or what?
> This likely creates a weird situation where every platform needs to figure out how to represent binary values.

Exactly, and as mentioned, each language supports binary data, so there's no need to "figure it out", in my opinion.

// C
unsigned char bytearray[] = {0x01, 0x02, 0x03, 0x04};
// C++
std::vector<uint8_t> bytearray = {0x01, 0x02, 0x03, 0x04};
// C#
byte[] bytearray = new byte[] {0x01, 0x02, 0x03, 0x04};
// Java
byte[] bytearray = {0x01, 0x02, 0x03, 0x04};
// JS
let bytearray = new Uint8Array([0x01, 0x02, 0x03, 0x04]);
// Python
bytearray_obj = bytearray([0x01, 0x02, 0x03, 0x04])

> It also makes JSON5 fundamentally incompatible with JSON, since, unfortunately, JSON itself has no way to represent binary data. I think part of the expectation for JSON5 is it should be able to convert both to and from JSON in an intuitive way?

I think the general concept for newer versions is that they are backward compatible, not the other way round. So everything that is currently supported by JSON will work if the consumer side uses JSON5. That's how it works for every lib I know. If the consumer side has a lower version, then the producer should be able to adjust: use the commonly accepted way, a string with base64.
And wouldn't it already be incompatible, due to the comments?

> Base64 is most likely what APIs etc. are using already, since it's really the only practical way to embed binary data in JSON - arrays would likely lead to unacceptable file sizes, and we would need to consider JSON parse time overhead as well.
>
> I don't know, it sounds nice in theory, but it doesn't really seem practical, and base64 functions are readily available everywhere, and it's almost certainly what people are using already - I'm not convinced this solves more problems than it creates.

For me, this sounds more like a solution born out of necessity, due to the fact that JSON did not support this type by default. And as we see, we need it a lot; just take your image data, for example. Maybe also because JSON was initially designed for Web/REST. Now, with the shift to IoT and smart home devices, the Web will need to adapt to producers that are embedded and need to work with limited resources. Images converted to base64 are considerably larger; some cite an overhead of 33%-37%.

I don't see any harm in adding an extra type that can, but doesn't need to, be used. If people see no need for it, they will stick with base64. For those who require byte arrays, however, this feature is a blessing.
It might be helpful to gather more opinions to highlight the pros, cons, and overall necessity.

@mindplay-dk
Author

> And wouldn't it be already incompatible, due to the comments?

Comments aren't part of the data - so you can (currently) convert JSON5 to JSON (and back) e.g. for programs or APIs that do not accept JSON5, and you would get data that is identical when parsing the JSON5 or the converted JSON.

As I recall, we had the same discussion about date/time - another data type that can't be converted without using a different, ambiguous representation. There was a proposal for a date/time literal, I think - but you can't convert that to JSON and then back to JSON5, you would just get a number or an ISO string, or whatever representation you chose.

Similarly, if we add a native binary data type, you would simply have to decide how to represent this as JSON - and whether that's a base64 string or an array of numbers, those values are ambiguous with strings/arrays/numbers after conversion, and can't be converted back to JSON5 except as the unwrapped, ambiguous strings/arrays/numbers.
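
The round-trip problem can be sketched in Node.js. Here a Uint8Array stands in for a value of the proposed binary type, and the replacer `bytesToBase64` is a hypothetical stand-in for a JSON5-to-JSON conversion.

```javascript
// Hypothetical conversion step: write byte arrays as Base64 strings.
function bytesToBase64(key, value) {
  return value instanceof Uint8Array
    ? Buffer.from(value).toString('base64')
    : value;
}

const original = {
  signature: Uint8Array.from([0x30, 0x5c, 0x4d, 0x3f, 0x2a]), // binary value
  label: 'MFxNPyo=',                                          // ordinary string
};

const json = JSON.stringify(original, bytesToBase64);
console.log(json); // {"signature":"MFxNPyo=","label":"MFxNPyo="}

// After parsing, nothing distinguishes the former bytes from the string.
const parsed = JSON.parse(json);
console.log(typeof parsed.signature);           // string
console.log(parsed.signature === parsed.label); // true
```

Converting back would need out-of-band knowledge, such as a schema, to decide which of the two strings should become bytes again.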

I would like to hear from @jordanbtucker on this, but as I recall, he did not want to change the schema of the JSON data format itself, but:

> The JSON5 Data Interchange Format is a proposed extension to JSON that aims
> to make it easier for humans to write and maintain by hand. It does this by
> adding some minimal syntax features directly from ECMAScript 5.1.

ECMAScript has no binary data literals, nor date/time literals.

> Similar to JSON, JSON5 can represent four primitive types (strings, numbers,
> Booleans, and null) and two structured types (objects and arrays).

JSON likewise has only these four primitive types.

To my understanding, if the spec goes anywhere outside of that scope, it'll be recommendations only - and even that has taken years of discussions to settle on anything. (I actually thought the date/time recommendation had been settled, but apparently that's not even in the spec, so.)
