-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API (string dtype): comparisons between different string classes #60639
Comments
You mean the return types should be This is another motivating case for PDEP-13 #58455 I think without that, they should probably just return the bool extension type to at least preserve the possibility of null values |
Ah, indeed, thanks! I meant the Boolean dtype determined by the left side. So |
|
Just to be explicit, the two possible return values we currently have for the example above (in case of consistent dtypes for left and right operand): >>> lhs == rhs.astype(lhs.dtype)
<ArrowExtensionArray>
[True, <NA>, True]
Length: 3, dtype: bool[pyarrow]
>>> lhs.astype(rhs.dtype) == rhs
array([ True, False, True]) are While in general letting the left operand take priority sounds fine and something that typically happens in Python (with Python also automatically first calling For example also for the nullable numeric dtypes vs numpy dtypes, we always give preference to the nullable dtype in that case: >>> ser1 = pd.Series([1, 2, 3], dtype="int64") # numpy
>>> ser2 = pd.Series([1, 2, 3], dtype="Int64") # nullable
# numpy gives numpy bool
>>> (ser1 == ser1).dtype
dtype('bool')
# but once left OR right is nullable, result is nullable
>>> (ser1 == ser2).dtype
BooleanDtype
>>> (ser2 == ser1).dtype
BooleanDtype |
Making the overview of the result of
import numpy as np
import pandas as pd
import pyarrow as pa
dtypes = [
np.dtype(object),
pd.StringDtype("pyarrow", na_value=np.nan),
pd.StringDtype("python", na_value=np.nan),
pd.StringDtype("pyarrow", na_value=pd.NA),
pd.StringDtype("python", na_value=pd.NA),
pd.ArrowDtype(pa.string())
]
results = []
for dt1, dt2 in itertools.product(dtypes, dtypes):
ser1 = pd.Series(["a", None, "b"], dtype=dt1)
ser2 = pd.Series(["a", None, "b"], dtype=dt2)
try:
res = ser1 == ser2
results.append((dt1, dt2, res.dtype))
except:
results.append((dt1, dt2, "<error>"))
df_results = pd.DataFrame(results, columns=["left", "right", "result"])
print(df_results.pivot(index="left", columns="right", values="result").to_markdown()) Some quick observations:
|
That's quite the table there...thanks for putting that together. I still think that all of the extension types equality ops should return the boolean extension type. The "string(python)" type has done this for years, so I think the new iterations (backed by pyarrow and the "str" types) should continue along with that API. In general I find this a really non-value added activity for end users of the library, so long term still push for the logical type system. But in the interim I think we can not rock the boat to much and just stick with what was in place already for the string extension type |
For "string(pyarrow)", fully agreed (and as mentioned above, this dtype also did that in the past, this was only changed in pandas 2.0). (of course we could also have added a NaN variant of the boolean extension dtype, and then it could have used that. But we didn't do that, and I think that is now too late for 3.x?) And I also fully agree we should improve that messy table with logical dtypes. But that said, we still need to make a decision on the short term for 2.3 / 3.x about which resulting dtype to use in case of mixed dtype operands:
(and for the hierarchy option, also have "python < pyarrow" for the storage within the NaN or NA variant. This is less relevant for I personally lean to the second option, I think. (side note: also when having logical dtypes, we will have to answer this question if we have multiple variants of the same logical dtype) |
Out of those two options I definitely prefer the hierarchy; I think giving the lhs more importance than the rhs is really hard to keep track of, especially in a potentially long pipeline of operations. I think your proposed solution works well too, although something does feel weird about mixing between NA markers for the str / string types...but that may be the least of all evils.
This may be worth further discussing in the PDEP, but I think it is just an implementation detail (?) and not something that should worry the end user too much |
My main opposition to "always return Boolean extension dtype", in addition to what @jorisvandenbossche mentioned, is that if I have two pyarrow-backed Series with different NA-semantics, I would be very surprised to get back a NumPy-backed Series. Likewise, if I have two NaN Series ( I think this puts me in the hierarchy camp. My proposed hierarchy is:
My reasoning is that if you have NA-backed data, you've done something to opt into it as it isn't the default. Likewise, if you have PyArrow installed and you have python-backed strings, you've done something to get that as it isn't the default. So since you've opted into, we should give it preference. Of course, hopefully all of this is a pathological edge-case that users don't encounter. |
This makes sense but then shouldn't the pyarrow implementations get a higher priority than the python ones in your hierarchy?
Unfortunately I think this is going to be a very common occurrence. Especially when considering I/O formats that preserve metadata (namely parquet) it will very pretty easy to mix all these up
This is definitely valid with the current implementation of the extension types, although keep in mind with the proposed logical types that the storage becomes an implementation detail. We could set up whatever rules we wanted to manage this, although in the least pathological cases I would think pyarrow would be the main storage |
Thinking through this one some more, I'm also not sure that the python string implementations should ever take priority over pyarrow. That's a huge performance degradation |
If you have PyArrow installed, then
I expect the common occurrence will be object dtype against one of the other 4, but not e.g. NA-pyarrow against NaN-pyarrow. |
That's probably true for data that you create in your process, but when you bring I/O into the mix things can easily get mixed up.. For instance, if you load a parquet file that someone saved with I don't know if our parquet I/O keeps the storage backend as part of the metadata, but if it does that would also make it easy to mix up types (ex: a user without PyArrow installed saves a file that gets loaded by someone with PyArrow) |
Current parquet behavior is to always infer pd.set_option("infer_string", True)
df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="object")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str
df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="string")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str
df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="str")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str |
Ah that looks like a bug. |
I vaguely recall there being some work done upstream in Arrow to better differentiate these. Not sure if the version of that matters but Joris would know best |
For the hierarchy, my first reaction was also to find it odd to do "pyarrow < python" (since pyarrow is always preferred over python if installed). But your explanation certainly makes sense (you only get that situation when explicitly opting in to "python", so in that case preserve it). Now, the question if how often users will accidentally have data using the "python" storage? One exception I can think of is pickle. Using For Parquet:
That's indeed a known limitation right now. Normally pyarrow's Lines 79 to 80 in 1be2637
(and when specifying that keyword, that currently overrides any other mechanisms to infer the dtype from the metadata) I am fixing this on the pyarrow side in apache/arrow#44195 (essentially moving the logic of choosing the "str" dtype by default for pandas >= 3 to pyarrow, so that then we don't have to override that anymore). So starting with pyarrow 19.0, we should be able to restore the functionality of preserving For both "string" and "str" dtype, it will preserve that dtype (i.e. whether it uses NaN vs NA), but not the storage. That is because we don't include that information in the (we currently don't really control the pyarrow->pandas conversion, which makes changes to the default behaviour for this conversion tricky and difficult, see xref #59780 for moving that logic from pyarrow to pandas) |
Maybe we can just disallow a user from trying to force python storage when pyarrow is installed? That would help simplify the rules further |
What happens with pickles that contain Python storage? |
Haven’t read the whole thread, just OP. In general == should be commutative. We use pandas_priority to determine which object defers to which. |
Do you think this would happen from someone starting in an environment without pyarrow, exporting an object via pickle, then loading it into another environment with pyarrow? If so, I wonder if we shouldn't restore that state using the pyarrow backend. |
Talked about this in the dev meeting today, the main idea floated there was raising on certain cases:
I think the main thing captured is that we do not what to be changing what NA-semantics the user is dealing with. I think that when presented with both NA and NaN we should resist the temptation to guess. For storage, there was a desire to have However we choose to proceed, the consensus in the dev meeting that it should apply to all binary operations. |
I definitely don't think we should be raising on storage - one of the main values of PDEP-14 was to offer an abstraction over what storage is being used.
This is a good point when thinking about the output of an operation. The input is pretty clear though - we are comparing values that are missing (at least with strings, floats have some ambiguity). Given our long term strategy for types has been to move away from using np.nan as a missing value, I think we should prefer pd.NA in this case |
I would find it strange to start raising an error for those certain cases, while right now we almost never raise an error, but rather return False instead of an error if values are not really comparable. I know that in this case it is not really about the values being comparable or not (the string values itself are perfectly comparable), but about implicit conversion to different NA semantics. So that's a different situation .. But still, we are talking here in the context of string dtype. But what do we do for all the other dtypes where we also support those different NA semantics? Would you also want to start raising for that? (eg when comparing numpy floats with nullable ints or floats?) |
That's a good point. FWIW our status quo seems to prefer using pd.NA as the missing value sentinel when mixing NA-semantics today: >>> ser1 = pd.Series([1, np.nan, 2])
>>> ser2 = pd.Series([1, pd.NA, 2], dtype=pd.Float64Dtype())
>>> ser1 < ser2
0 False
1 <NA>
2 False
dtype: boolean Although that comparison is likely not very common, I think sticking with that status quo is the best path forward for now |
Some comparisons between different classes of string (e.g.
string[pyarrow]
andstr
) raise. Resolving this is straightforward except for what class should be returned. I would expect it should always be the left obj, e.g.string[pyarrow] == str
should returnstring[pyarrow]
whereasstr == string[pyarrow]
should returnstr
. Is this the concensus?We currently run into issues with how Python handles subclasses with comparison dunders.
The two results above differ because
ArrowStringArrayNumpySemantics
is a proper subclass ofArrowStringArray
and therefore Python first callsrhs.__eq__(lhs)
.We can avoid this by special casing this particular case in
ArrowStringArrayNumpySemantics
, but I wanted to open up an issue for discussion before proceeding.cc @WillAyd @jorisvandenbossche
The text was updated successfully, but these errors were encountered: