TST(string dtype): Resolve replace xfails #60659

Open
wants to merge 3 commits into
base: main
from


rhshadrach
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@rhshadrach rhshadrach added Testing pandas testing functions or related to the test suite Strings String extension data type and string data replace replace method labels Jan 4, 2025
Comment on lines -1481 to -1483
# TODO(infer_string): both string columns get cast to object,
# while only needed for column A
expected_df2 = DataFrame({"A": [1], "B": ["1"]}, dtype=object)
Member Author

I think this behavior was correct - we get object dtype here because we are trying to replace string values with integer values. If we were to make the result a string dtype, then that would be introducing value-specific behavior.
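
For illustration, the case under discussion can be reproduced with a short sketch. It assumes a working pandas installation. The dtypes it prints depend on the pandas version and on whether the `future.infer_string` option is enabled, so no expected output is shown.

```python
import pandas as pd

# Two string columns; only column A contains the value being replaced.
df2 = pd.DataFrame({"A": ["0"], "B": ["1"]})

# Replace the string "0" with the integer 1.
result_df2 = df2.replace(to_replace="0", value=1)

# Inspect the dtypes before and after the replacement.
print(df2.dtypes)
print(result_df2.dtypes)
```

Comparing the two printed dtype listings shows whether a column that contained no match (B) keeps its original dtype or is cast along with the column that did (A).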

else:
    expected_df2 = DataFrame({"A": Series([1], dtype=object), "B": ["1"]})
Member Author

This behavior looks incorrect to me; B here should also be object dtype. I think we can raise an issue (this is independent of infer_string) if others agree.

expected_df1 = DataFrame({"A": [1], "B": [1]}, dtype=object)
result_df1 = df1.replace(to_replace="0", value=1, regex=regex)
# When value is an integer, coerce result to object.
# When value is a string, infer the correct string dtype.
Member

Are we sure we want to coerce to string instead of raising? The object case makes sense; I'm just not as sure on the string side whether we should be implicitly casting like that.

Member Author

Not sure I understand. When infer_string=True, the input DataFrame is str dtype. Then when we go to replace "0" with value="1", certainly we want the result to still be str dtype, no?

Member

Yeah, I think that makes sense, but I'm not as sure when the target value is a non-string, i.e. replace(..., value=1).

Member Author

When the target value is a non-string, we coerce to object dtype in order to hold both integers and strings. What are you not sure about?
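
A minimal sketch of why object dtype is the fallback here: an object-dtype Series can hold an integer next to a string without converting either one. The variable name `mixed` is illustrative only and does not come from the test suite.

```python
import pandas as pd

# An object-dtype Series stores arbitrary Python objects, so an integer
# replacement value and an untouched string can coexist unchanged.
mixed = pd.Series([1, "1"], dtype=object)

print(mixed.dtype)
print([type(v).__name__ for v in mixed])
```

A dedicated string dtype cannot hold the integer `1` as an integer, which is why a result that mixes the two needs object dtype.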

Member

Sorry, I was just misreading the comment. I think this is good.

2 participants