
Suggest DECIMAL cql type to be returned as string #1608

Open
hemidactylus opened this issue Oct 25, 2024 · 7 comments
Labels
Feature (Agreed feature yet to be implemented) · On Hold (Not ready for active development)

Comments

@hemidactylus
Contributor

hemidactylus commented Oct 25, 2024

The rounding issues that essentially led to the creation of the DECIMAL cql type would dictate that such columns be returned as strings, letting callers/users/clients do the right thing. Otherwise, unexpected results may occur.

Principle

In binary floating-point, 3.3 - 2.2 - 1.1 != 0 because of rounding issues during the calculation. Decimal uses a base-10 representation for fractional numbers that avoids this.
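As a minimal illustration in Python, using the standard-library decimal module (not part of the original report):

from decimal import Decimal

# Binary floating point leaves a tiny nonzero residue:
print(3.3 - 2.2 - 1.1)
# -4.440892098500626e-16

# Base-10 decimal arithmetic is exact:
print(Decimal("3.3") - Decimal("2.2") - Decimal("1.1"))
# 0.0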

In pure CQL

This table has two columns (decimal, float) populated with the same values, which demonstrates the difference:

CREATE TABLE default_keyspace.tmp_d (
    a int PRIMARY KEY,
    b decimal,
    c float
);
cassandra@cqlsh:default_keyspace> select * from tmp_d  ;

 a  | b    | c
----+------+------
 -1 | -1.1 | -1.1
 -2 | -2.2 | -2.2
  3 |  3.3 |  3.3

The DECIMAL column:

cassandra@cqlsh:default_keyspace> select sum(b) from tmp_d;

 system.sum(b)
---------------
           0.0

The FLOAT column:

cassandra@cqlsh:default_keyspace> select sum(c) from tmp_d;

 system.sum(c)
---------------
   -1.1921e-07

Data API

When reading this table with the Data API (dockerized 1.0.18 to be precise), this is what I see:

{
    "data": {
        "documents": [
            {
                "a": -1,
                "b": -1.1,
                "c": -1.1
            },
            {
                "a": -2,
                "b": -2.2,
                "c": -2.2
            },
            {
                "a": 3,
                "b": 3.3,
                "c": 3.3
            }
        ],
        "nextPageState": null
    },
    "status": {
        "warnings": [
            "Zero filters were provided in the ..."
        ]
    }
}

As a result, there is lossy math when the original intent of the column type was precisely to avoid that.
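For instance, here is a sketch (in Python, over a hypothetical excerpt of the response above) of what a client sees if it parses the returned JSON with a standard parser, which maps JSON numbers to binary doubles:

import json

# Hypothetical excerpt of the documents returned above
response = '[{"a": -1, "b": -1.1}, {"a": -2, "b": -2.2}, {"a": 3, "b": 3.3}]'

docs = json.loads(response)
# The decimal column has lost its exact base-10 semantics:
print(sum(doc["b"] for doc in docs))
# -4.440892098500626e-16, not 0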

@tatu-at-datastax
Contributor

I do not follow: since DECIMAL is specifically a 10-based value, it should have an exact textual representation.

But even with 2-based (binary) floating-point I am not sure I get the point: yes, a number's representation as 10-based text may need to be rounded -- but would use of a JSON String make any difference? How would that help?

@tatu-at-datastax
Contributor

Ok, based on discussions, this is a concern wrt JSON handling differing between JSON Numbers and JSON Strings -- so if the server side were to use 2-based FPs (float, double) for all numbers, then use of JSON Strings would avoid the issue.

As things are, the Data API uses Java BigDecimal for all JSON numbers with fractions, so we should be good -- but if testing shows problems, we definitely need to address them.

@vkarpov15
Collaborator

My 2 cents: I think returning decimals as numbers is the right way to go. There shouldn't be any loss of precision since JSON serialization isn't doing any arithmetic. And it seems semantically cleaner for numbers to end up as numbers in JSON.

However, serializing decimals as strings wouldn't be unprecedented. For example, MongoDB's JavaScript BSON serializer only supports Decimal128.fromString(); there's no way to convert a JavaScript number (binary floating point) to a Decimal128 without going through a string first.

@amorton
Contributor

amorton commented Nov 4, 2024

Reading through...

In pure CQL

Not sure how this exactly relates to the issue of encoding decimal; yes, there is a reason for having floats and decimals.

I will note that the Decimal and Float sums above are the same, depending on the number of decimal places, and this is why we have decimal:

In [5]: s = "-9.9362e-10"

In [6]: f = float(s)

In [8]: print(f"10 decimal places: {f:.10f}")
10 decimal places: -0.0000000010

In [12]: print(f"1 decimal places: {f:.1f}")
1 decimal places: -0.0

# using new rounding https://peps.python.org/pep-0682/
In [20]:  print(f"1 decimal places: {f:z.1f}")
1 decimal places: 0.0

In [11]: float("-0.0") == float("0")
Out[11]: True

In [14]: float("-0.0") == int("0")
Out[14]: True

# false if we use the full exponent 
In [15]: float(s) == int("0")
Out[15]: False

Data API

I am not following the example when it says "As a result, there is lossy math when the original intent of the column type was precisely to avoid that": the values returned by the Data API are exactly the same as those returned by CQL.

For example, MongoDB's JavaScript BSON serializer only supports Decimal128.fromString(); there's no way to convert a JavaScript number (binary floating point) to a Decimal128 without going through a string first.

These are the Mongo docs for this: https://github.com/mongodb/specifications/blob/master/source/bson-decimal128/decimal128.md

and https://www.mongodb.com/docs/manual/reference/mongodb-extended-json/#mongodb-extended-json--v2-

Thoughts...

not decisions, just things to be considered...

JSON view on numbers

As the JSON - Data Types page says....

Numbers in JSON are agnostic with regard to their representation within programming languages. While this allows for numbers of arbitrary precision to be serialized, it may lead to portability issues... There is no inherent precision loss in serializing a machine-level binary representation of a floating-point number (like binary64) into a human-readable decimal representation (like numbers in JSON) and back since there exist published algorithms to do this exactly and optimally.

My view on this is that the JSON protocol says it has a way to encode any number that is not infinite or NaN, and then says it is up to the parsing code to alert the user if there is an issue with the decoding, such as when the code cannot represent the number.
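As a quick check of the round-trip claim quoted above, a Python sketch (Python's serializer uses the shortest-exact decimal representation of a double):

import json

x = 0.1 + 0.2                # 0.30000000000000004, a binary64 value
s = json.dumps(x)            # serialize to decimal text: "0.30000000000000004"
assert json.loads(s) == x    # the round trip recovers the exact same double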

Mongo DB view on Decimal

If a decimal is declared, they are always using an EJSON value { "$numberDecimal": "<number>" }; this is the same for both their Canonical and Relaxed JSON formats. The BSON formatting is "IEEE 754-2008 128-bit decimal floating point"; this is what they are sending over the wire, and remember BSON has a little richer data types.

And if I read the spec correctly, they are telling clients not to round trip into the native decimal types.
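If the Data API adopted a similar wrapper, client-side decoding could look like this hypothetical Python sketch (the "$numberDecimal" key mirrors Mongo's EJSON; nothing here is an existing Data API format):

import json
from decimal import Decimal

def decode_decimals(obj):
    # Hypothetical: unwrap {"$numberDecimal": "<digits>"} into an exact Decimal
    if set(obj) == {"$numberDecimal"}:
        return Decimal(obj["$numberDecimal"])
    return obj

doc = json.loads('{"b": {"$numberDecimal": "3.3"}}', object_hook=decode_decimals)
print(doc["b"] - Decimal("2.2") - Decimal("1.1"))
# 0.0, exact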

Telephone number problem

What happens with the classic telephone number problem: the string "012345" is a string of digits, not a number.

Relaxed vs Strict

We are building the relaxed representation first, and the idea is to be as basic a JSON as we can. Any read will return the schema; if a client wants to go beyond handling a number as a JSON number, it can use the returned schema to know which fields are decimals. The parsers in Python and Node.js and other libraries allow for special handling of fields and floats to handle these cases, as sketched below.
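For example, a minimal Python sketch of such special handling, using the standard parser's parse_float hook (the document excerpt is hypothetical):

import json
from decimal import Decimal

# Parse every JSON number with a fraction as an exact Decimal
# instead of a binary double:
doc = json.loads('{"b": -1.1, "c": -1.1}', parse_float=Decimal)

print(doc["b"] + Decimal("-2.2") + Decimal("3.3"))
# 0.0, exact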

@tatu-at-datastax
Contributor

Related: filed #1654 wrt "other direction" for decimal type.

@amorton added the "On Hold (Not ready for active development)" label on Dec 11, 2024
@amorton
Contributor

amorton commented Dec 11, 2024

Moving to on hold; still think we should use plain JSON for the simple encoding, and can then do this for the strict encoding format.

@vkarpov15
Collaborator

I thought a little bit more about this in the context of datastax/astra-db-ts#90 and I think I see the reason for returning DECIMAL, as well as potentially VARINT and LONG, as strings. Built-in JSON parsers, like JavaScript's, are often limited by the language's built-in number limits, so values that cannot safely fit in numbers may lose precision.

For example, 9223372036854775807 is the max long value in Java. Consider what happens when JavaScript's JSON.parse() tries to handle that value as a number in the following example: JavaScript ends up parsing it into 9223372036854776000 as a JavaScript number, which is an incorrect result.

$ node
Welcome to Node.js v20.10.0.
Type ".help" for more information.
> const str = '9,223,372,036,854,775,807'.replace(/,/g, '')
undefined
> str
'9223372036854775807'
> JSON.parse(`{"value":${str}}`)
{ value: 9223372036854776000 }
> `{"value":${str}}`
'{"value":9223372036854775807}'
> 

This puts clients in a potentially tricky position of needing to implement customized JSON parsing.
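For comparison, a Python sketch of the same value: Python's parser keeps integers exact, coercing to a binary double reproduces the loss JavaScript sees, and a string representation sidesteps the parser entirely:

import json

s = '{"value": 9223372036854775807}'   # Java's Long.MAX_VALUE

v = json.loads(s)["value"]
print(v)          # 9223372036854775807 -- Python ints are arbitrary precision
print(float(v))   # 9.223372036854776e+18 -- the binary-double rounding JS hits

# Returning the value as a JSON string leaves conversion to the client:
w = json.loads('{"value": "9223372036854775807"}')["value"]
print(w)          # '9223372036854775807', exact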
