Update modin dependency to 0.30.1 #1965

Merged on Oct 28, 2024 (27 commits)

Commits
31e5320  bump modin (sfc-gh-joshi, Jul 23, 2024)
845e7b4  changelog (sfc-gh-joshi, Jul 23, 2024)
8995a8f  Fix issue (sfc-gh-dpetersohn, Jul 31, 2024)
3bcad42  Fix issue (sfc-gh-dpetersohn, Jul 31, 2024)
bd49781  add pandas version test to daily job (sfc-gh-joshi, Aug 16, 2024)
900d4fd  try to add updated sproc dependencies (sfc-gh-joshi, Aug 19, 2024)
4ab25c9  add 2.2.3 to matrix (sfc-gh-joshi, Oct 1, 2024)
00de3af  revert series/test_loc and query compiler (sfc-gh-joshi, Oct 1, 2024)
d5cff74  fix drop_duplicates and unique (sfc-gh-joshi, Oct 2, 2024)
7cf2d34  pin pandas version (sfc-gh-joshi, Oct 3, 2024)
f64e5f9  override series.unique (sfc-gh-joshi, Oct 4, 2024)
b45d1f9  branch error messages on pandas version (sfc-gh-joshi, Oct 7, 2024)
2da165b  try zip import (sfc-gh-joshi, Oct 9, 2024)
fd71a74  temporarily xfail map/applymap (sfc-gh-joshi, Oct 9, 2024)
86a996d  pragma no cover (sfc-gh-joshi, Oct 10, 2024)
7ff0c6c  move changelog entry (sfc-gh-joshi, Oct 11, 2024)
df19140  update changelog wording (sfc-gh-joshi, Oct 11, 2024)
06c552a  xfail udf tests with 2.2.3 (sfc-gh-joshi, Oct 15, 2024)
f01a3db  fix missed skips (sfc-gh-joshi, Oct 15, 2024)
ad447e0  add more skips (sfc-gh-joshi, Oct 16, 2024)
c9c6e20  strengthen version checks (sfc-gh-joshi, Oct 22, 2024)
40275b5  remove 2.2.3 from daily matrix (sfc-gh-joshi, Oct 22, 2024)
8559b7b  add versoin check ticket (sfc-gh-joshi, Oct 22, 2024)
5bb1bc6  skip a test (sfc-gh-joshi, Oct 22, 2024)
9ee5cb8  remove melt override (sfc-gh-joshi, Oct 22, 2024)
5f59431  skip doctests (sfc-gh-joshi, Oct 22, 2024)
f1de876  Merge remote-tracking branch 'origin/main' into joshi-SNOW-1552497-mo… (sfc-gh-joshi, Oct 28, 2024)
62 changes: 62 additions & 0 deletions .github/workflows/daily_modin_precommit.yml
@@ -173,6 +173,68 @@ jobs:
.tox/.coverage
.tox/coverage.xml

test-pandas-patch-versions:
name: Test Snowpark pandas with pandas ${{ matrix.pandas-version }}
needs: build
runs-on: ${{ matrix.os.image_name }}
strategy:
fail-fast: false
matrix:
os:
- image_name: ubuntu-latest-64-cores
download_name: linux
pandas-version: ["2.2.1", "2.2.2"]
python-version: ["3.9"]
cloud-provider: [aws]
steps:
- name: Checkout Code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Display Python version
run: python -c "import sys; print(sys.version)"
- name: Decrypt parameters.py
shell: bash
run: .github/scripts/decrypt_parameters.sh
env:
PARAMETER_PASSWORD: ${{ secrets.PARAMETER_PASSWORD }}
CLOUD_PROVIDER: ${{ matrix.cloud-provider }}
- name: Download wheel(s)
uses: actions/download-artifact@v4
with:
name: wheel
path: dist
- name: Show wheels downloaded
run: ls -lh dist
shell: bash
- name: Upgrade setuptools, pip and wheel
run: python -m pip install -U setuptools pip wheel
- name: Install tox
run: python -m pip install tox
- if: ${{ contains('macos', matrix.os.download_name) }}
name: Run Snowpark pandas API doctests
run: python -m tox -e "modin_pandas_version-py${PYTHON_VERSION}-doctest-snowparkpandasdoctest-modin-ci"
env:
MODIN_PANDAS_PATCH_VERSION: ${{ matrix.pandas-version }}
PYTHON_VERSION: ${{ matrix.python-version }}
cloud_provider: ${{ matrix.cloud-provider }}
PYTEST_ADDOPTS: --color=yes --tb=short
TOX_PARALLEL_NO_SPINNER: 1
# Specify SNOWFLAKE_IS_PYTHON_RUNTIME_TEST: 1 when adding >= python3.11 with no server-side support
# For example, see https://github.com/snowflakedb/snowpark-python/pull/681
shell: bash
- name: Run Snowpark pandas API tests (excluding doctests)
run: python -m tox -e "modin_pandas_version-py${PYTHON_VERSION/\./}-snowparkpandasdailynotdoctest-modin-ci"
env:
MODIN_PANDAS_PATCH_VERSION: ${{ matrix.pandas-version }}
PYTHON_VERSION: ${{ matrix.python-version }}
cloud_provider: ${{ matrix.cloud-provider }}
PYTEST_ADDOPTS: --color=yes --tb=short
TOX_PARALLEL_NO_SPINNER: 1
shell: bash

test-disable-sql-simplifier: # Will be removed after sql simplifier is stable and no option to opt out.
name: Test Disable SQL Simplifier modin-${{ matrix.os.download_name }}-${{ matrix.python-version }}-${{ matrix.cloud-provider }}
needs: build
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -27,6 +27,11 @@

### Snowpark pandas API Updates

### Dependency Updates

- Updated `modin` from 0.28.1 to 0.30.1.
- Added support for all `pandas` 2.2.x versions.

#### New Features

- Added support for `np.subtract`, `np.multiply`, `np.divide`, and `np.true_divide`.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -35,7 +35,7 @@ cd snowpark-python

- Create a new Python virtual environment with any Python version that we support.
- The Snowpark Python API supports **Python 3.8, Python 3.9, Python 3.10, and Python 3.11**.
- - The Snowpark pandas API supports **Python 3.9, Python 3.10, and Python 3.11**. Additionally, Snowpark pandas requires **Modin 0.28.1** and **pandas 2.2.1**.
+ - The Snowpark pandas API supports **Python 3.9, Python 3.10, and Python 3.11**. Additionally, Snowpark pandas requires **Modin 0.30.1** and **pandas 2.2.x**.

```bash
conda create --name snowpark-dev python=3.9
4 changes: 1 addition & 3 deletions setup.py
@@ -10,9 +10,7 @@
THIS_DIR = os.path.dirname(os.path.realpath(__file__))
SRC_DIR = os.path.join(THIS_DIR, "src")
SNOWPARK_SRC_DIR = os.path.join(SRC_DIR, "snowflake", "snowpark")
- MODIN_DEPENDENCY_VERSION = (
-     "==0.28.1"  # Snowpark pandas requires modin 0.28.1, which depends on pandas 2.2.1
- )
+ MODIN_DEPENDENCY_VERSION = "==0.30.1"  # Snowpark pandas requires modin 0.30.1, which is compatible with pandas 2.2.x
CONNECTOR_DEPENDENCY_VERSION = ">=3.10.0, <4.0.0"
INSTALL_REQ_LIST = [
"setuptools>=40.6.0",
38 changes: 32 additions & 6 deletions src/snowflake/snowpark/modin/plugin/__init__.py
@@ -21,11 +21,17 @@
# since modin may raise its own warnings/errors on the wrong pandas version
import pandas # isort: skip # noqa: E402

- supported_pandas_version = "2.2.1"
- if pandas.__version__ != supported_pandas_version:
+ # TODO SNOW-1758773: perform version check in modin instead
+ supported_pandas_major_version = 2
+ supported_pandas_minor_version = 2
+ actual_pandas_version = version.parse(pandas.__version__)
+ if (
+     actual_pandas_version.major != supported_pandas_major_version
+     or actual_pandas_version.minor != supported_pandas_minor_version
+ ):
      raise RuntimeError(
          f"The pandas version installed ({pandas.__version__}) does not match the supported pandas version in"
-         + f" Snowpark pandas ({supported_pandas_version}). "
+         + f" Snowpark pandas ({supported_pandas_major_version}.{supported_pandas_minor_version}.x). "
          + install_msg
      ) # pragma: no cover
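As an aside, here is a minimal sketch of the acceptance logic this check implements: any pandas 2.2.x patch release passes, while other major or minor versions raise. It uses the same `packaging.version` API as the diff; `is_supported` and the `SUPPORTED_*` names are illustrative, not part of the PR.

```python
# Illustrative sketch only -- mirrors the version check above.
from packaging import version

SUPPORTED_MAJOR, SUPPORTED_MINOR = 2, 2

def is_supported(pandas_version: str) -> bool:
    parsed = version.parse(pandas_version)
    # Accept any 2.2.x patch release; reject everything else.
    return (parsed.major, parsed.minor) == (SUPPORTED_MAJOR, SUPPORTED_MINOR)

assert is_supported("2.2.1") and is_supported("2.2.2") and is_supported("2.2.3")
assert not is_supported("2.1.4") and not is_supported("3.0.0")
```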

@@ -36,7 +42,7 @@
"Modin is not installed. " + install_msg
) # pragma: no cover

- supported_modin_version = "0.28.1"
+ supported_modin_version = "0.30.1"
if version.parse(modin.__version__) != version.parse(supported_modin_version):
raise ImportError(
f"The Modin version installed ({modin.__version__}) does not match the supported Modin version in"
@@ -136,25 +142,45 @@
register_pd_accessor,
register_series_accessor,
)
from modin.pandas.accessor import ModinAPI # isort: skip # noqa: E402,F401

from snowflake.snowpark.modin.plugin._internal.telemetry import ( # isort: skip # noqa: E402,F401
TELEMETRY_PRIVATE_METHODS,
snowpark_pandas_telemetry_standalone_function_decorator,
try_add_telemetry_to_attribute,
)

# Add telemetry on the ModinAPI accessor object.
# modin 0.30.1 introduces the pd.DataFrame.modin accessor object for non-pandas methods,
# such as pd.DataFrame.modin.to_pandas and pd.DataFrame.modin.to_ray. We will automatically
# raise NotImplementedError for all methods on this accessor object except to_pandas, but
# we still want to record telemetry.
for attr_name in dir(ModinAPI):
    [Resolved review thread on the line above]
    Collaborator: Let's add a more detailed comment here, like what you mentioned in the previous comment about the ModinAPI accessor.
    Collaborator: And we have both pd.DataFrame.modin and pd.Series.modin, right? Otherwise, I don't think we would need to skip this for dir(Series).
    Author (sfc-gh-joshi): Yes, pd.DataFrame.modin and pd.Series.modin are both valid; the `for attr_name in dir(ModinAPI)` loop will add telemetry to chained method accesses like pd.DataFrame().modin.to_pandas(), but we skip it in dir(Series) because our telemetry code has trouble handling the object returned by pd.DataFrame().modin (without a method call at the end).
if not attr_name.startswith("_") or attr_name in TELEMETRY_PRIVATE_METHODS:
setattr(
ModinAPI,
attr_name,
try_add_telemetry_to_attribute(attr_name, getattr(ModinAPI, attr_name)),
)

for attr_name in dir(Series):
    # Since Series is defined in upstream Modin, all of its members were either defined upstream
    # or overridden by extension.
-   if not attr_name.startswith("_") or attr_name in TELEMETRY_PRIVATE_METHODS:
+   # Skip the `modin` accessor object, since we apply telemetry to all its fields.
+   if attr_name != "modin" and (
+       not attr_name.startswith("_") or attr_name in TELEMETRY_PRIVATE_METHODS
+   ):
        register_series_accessor(attr_name)(
            try_add_telemetry_to_attribute(attr_name, getattr(Series, attr_name))
        )

for attr_name in dir(DataFrame):
    # Since DataFrame is defined in upstream Modin, all of its members were either defined upstream
    # or overridden by extension.
-   if not attr_name.startswith("_") or attr_name in TELEMETRY_PRIVATE_METHODS:
+   # Skip the `modin` accessor object, since we apply telemetry to all its fields.
+   if attr_name != "modin" and (
+       not attr_name.startswith("_") or attr_name in TELEMETRY_PRIVATE_METHODS
+   ):
        register_dataframe_accessor(attr_name)(
            try_add_telemetry_to_attribute(attr_name, getattr(DataFrame, attr_name))
        )
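To make the accessor discussion in the thread above concrete, here is a hedged usage sketch. It assumes the documented plugin import and an active Snowflake session; the `to_ray` behavior follows from the comment above, not from anything shown in this diff.

```python
# Illustrative sketch of the pd.DataFrame.modin accessor under Snowpark pandas.
import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # noqa: F401  (registers overrides; needs a Snowflake session)

df = pd.DataFrame({"a": [1, 2, 2]})

native_df = df.modin.to_pandas()  # supported; telemetry recorded via the ModinAPI wrapping above
# df.modin.to_ray()               # expected to raise NotImplementedError in Snowpark pandas
```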
10 changes: 6 additions & 4 deletions src/snowflake/snowpark/modin/plugin/docstrings/dataframe.py
@@ -713,6 +713,7 @@ def aggregate():
agg = aggregate

def apply():
# TODO SNOW-1739034 unskip UDF tests when pandas 2.2.3 is available in anaconda
"""
Apply a function along an axis of the DataFrame.

@@ -821,15 +822,15 @@

Using a reducing function on ``axis=1``:

- >>> df.apply(np.sum, axis=1)
+ >>> df.apply(np.sum, axis=1) # doctest: +SKIP
0 2
1 10
2 13
dtype: int64

Returning a list-like object will result in a Series:

- >>> df.apply(lambda x: [1, 2], axis=1)
+ >>> df.apply(lambda x: [1, 2], axis=1) # doctest: +SKIP
0 [1, 2]
1 [1, 2]
2 [1, 2]
@@ -1022,6 +1023,7 @@ def keys():
"""

def transform():
# TODO SNOW-1739034 unskip UDF tests when pandas 2.2.3 is available in anaconda
"""
Call ``func`` on self producing a Snowpark pandas DataFrame with the same axis shape as self.

@@ -1055,15 +1057,15 @@ def transform():
0 1 3
1 2 4
2 3 5
- >>> df.transform(lambda x: x + 1, axis=1)
+ >>> df.transform(lambda x: x + 1, axis=1) # doctest: +SKIP
col1 col2
0 2 4
1 3 5
2 4 6

Apply a numpy ufunc to every value in the DataFrame.

- >>> df.transform(np.square, axis=1)
+ >>> df.transform(np.square, axis=1) # doctest: +SKIP
col1 col2
0 1 9
1 4 16
10 changes: 6 additions & 4 deletions src/snowflake/snowpark/modin/plugin/docstrings/groupby.py
@@ -989,6 +989,7 @@ def cummax():
"""

def apply():
# TODO SNOW-1739034 unskip UDF tests when pandas 2.2.3 is available in anaconda
"""
Apply function ``func`` group-wise and combine the results together.

@@ -1050,7 +1051,7 @@ def apply():
its argument and returns a DataFrame. `apply` combines the result for
each group together into a new DataFrame:

- >>> g1[['B', 'C']].apply(lambda x: x.select_dtypes('number') / x.select_dtypes('number').sum()) # doctest: +NORMALIZE_WHITESPACE
+ >>> g1[['B', 'C']].apply(lambda x: x.select_dtypes('number') / x.select_dtypes('number').sum()) # doctest: +SKIP
B C
0.0 0.333333 0.4
1.0 0.666667 0.6
@@ -1059,7 +1060,7 @@
In the above, the groups are not part of the index. We can have them included
by using ``g2`` where ``group_keys=True``:

- >>> g2[['B', 'C']].apply(lambda x: x.select_dtypes('number') / x.select_dtypes('number').sum()) # doctest: +NORMALIZE_WHITESPACE
+ >>> g2[['B', 'C']].apply(lambda x: x.select_dtypes('number') / x.select_dtypes('number').sum()) # doctest: +SKIP
B C
A
a 0.0 0.333333 0.4
@@ -1942,6 +1943,7 @@ def cov():
pass

def transform():
# TODO SNOW-1739034 unskip UDF tests when pandas 2.2.3 is available in anaconda
"""
Call function producing a same-indexed DataFrame on each group.

@@ -2011,7 +2013,7 @@ def transform():
i X 9 90 -9
j Y 10 10 -10

>>> df.groupby("col1", dropna=True).transform(lambda df, n: df.head(n), n=2)
>>> df.groupby("col1", dropna=True).transform(lambda df, n: df.head(n), n=2) # doctest: +SKIP
col2 col3 col4
a 1.0 40.0 -1.0
b NaN NaN NaN
@@ -2024,7 +2026,7 @@
i NaN NaN NaN
j 10.0 10.0 -10.0

>>> df.groupby("col1", dropna=False).transform("mean")
>>> df.groupby("col1", dropna=False).transform("mean") # doctest: +SKIP
col2 col3 col4
a 2.50 25.0 -2.50
b 5.00 65.0 -5.00
@@ -1552,6 +1552,37 @@ def __getitem__(self, key):
return self.loc[:, key]


# Modin uses the unique() query compiler method instead of aliasing the duplicated frontend method as of 0.30.1.
# TODO SNOW-1758721: use the more efficient implementation
@register_base_override("drop_duplicates")
def drop_duplicates(
self, keep="first", inplace=False, **kwargs
): # noqa: PR01, RT01, D200
"""
Return `BasePandasDataset` with duplicate rows removed.
"""
inplace = validate_bool_kwarg(inplace, "inplace")
ignore_index = kwargs.get("ignore_index", False)
subset = kwargs.get("subset", None)
if subset is not None:
if is_list_like(subset):
if not isinstance(subset, list):
subset = list(subset) # pragma: no cover
else:
subset = [subset]
df = self[subset]
else:
df = self
duplicated = df.duplicated(keep=keep)
result = self[~duplicated]
if ignore_index:
result.index = pandas.RangeIndex(stop=len(result))
if inplace:
self._update_inplace(result._query_compiler) # pragma: no cover
else:
return result
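For reference, a small plain-pandas sketch of the semantics this override reproduces: `subset` restricts which columns feed `duplicated()`, and `ignore_index` rebuilds a fresh RangeIndex. This is an illustration, not the Snowpark execution path.

```python
# Illustrative only: the same keep/subset/ignore_index flow in plain pandas.
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "y"]})

dup_mask = df[["a"]].duplicated(keep="first")   # [False, True, False]
result = df[~dup_mask]                          # keeps rows 0 and 2, index [0, 2]
result.index = pd.RangeIndex(stop=len(result))  # what ignore_index=True would do

assert result.equals(df.drop_duplicates(subset=["a"], ignore_index=True))
```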


# Snowpark pandas does extra argument validation, which may need to be upstreamed.
@register_base_override("sort_values")
def sort_values(
@@ -1481,46 +1481,6 @@ def mask(
)


# Snowpark pandas has a fix for a pandas behavior change. It is available in Modin 0.30.1 (SNOW-1552497).
@register_dataframe_accessor("melt")
def melt(
self,
id_vars=None,
value_vars=None,
var_name=None,
value_name="value",
col_level=None,
ignore_index=True,
): # noqa: PR01, RT01, D200
"""
Unpivot a ``DataFrame`` from wide to long format, optionally leaving identifiers set.
"""
# TODO: SNOW-1063346: Modin upgrade - modin.pandas.DataFrame functions
if id_vars is None:
id_vars = []
if not is_list_like(id_vars):
id_vars = [id_vars]
if value_vars is None:
# Behavior of Index.difference changed in 2.2.x
# https://github.com/pandas-dev/pandas/pull/55113
# This change needs upstream to Modin:
# https://github.com/modin-project/modin/issues/7206
value_vars = self.columns.drop(id_vars)
if var_name is None:
columns_name = self._query_compiler.get_index_name(axis=1)
var_name = columns_name if columns_name is not None else "variable"
return self.__constructor__(
query_compiler=self._query_compiler.melt(
id_vars=id_vars,
value_vars=value_vars,
var_name=var_name,
value_name=value_name,
col_level=col_level,
ignore_index=ignore_index,
)
)
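For context on the behavior change the removed override's comment cites: when `value_vars` is inferred, `Index.drop` preserves the original column order, while `Index.difference` may return a sorted result, changing the column order of the melt output. A plain-pandas illustration of that distinction (the specific pandas 2.2.x change is only referenced, not reproduced, here):

```python
# Illustrative only: order preservation in Index.drop vs. Index.difference,
# the distinction the removed workaround relied on.
import pandas as pd

columns = pd.Index(["c", "a", "b"])
id_vars = ["a"]

print(columns.drop(id_vars).tolist())        # ['c', 'b'] -- original order kept
print(columns.difference(id_vars).tolist())  # ['b', 'c'] -- sorted
```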


# Snowpark pandas does more thorough error checking.
@register_dataframe_accessor("merge")
def merge(
@@ -24,6 +24,7 @@
from pandas._typing import (
AggFuncType,
AnyArrayLike,
ArrayLike,
Axis,
FillnaOptions,
IgnoreRaise,
@@ -1596,6 +1597,19 @@ def rename(
return self_cp


# In some cases, modin after 0.30.1 returns a DatetimeArray instead of a numpy array. This
# still differs from the expected pandas behavior, which would return DatetimeIndex
# (see SNOW-1019312).
@register_series_accessor("unique")
def unique(self) -> ArrayLike: # noqa: RT01, D200
"""
Return unique values of Series object.
"""
# `values` can't be used here because it performs unnecessary conversion,
# after which the result type does not match the pandas behavior.
return self.__constructor__(query_compiler=self._query_compiler.unique()).to_numpy()
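A quick sketch of the return-type distinction behind this override, shown with plain pandas as a stand-in (tz-aware datetimes make the split visible; this is illustrative, not the Snowpark code path):

```python
# Illustrative only: unique() can return a pandas ExtensionArray, while the
# to_numpy() path used by the override always yields a plain numpy.ndarray.
import numpy as np
import pandas as pd

s = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-01"], utc=True))

print(type(s.unique()))                      # pandas DatetimeArray (ExtensionArray)
print(type(s.drop_duplicates().to_numpy()))  # numpy.ndarray
assert isinstance(s.drop_duplicates().to_numpy(), np.ndarray)
```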


# Modin defaults to pandas for some arguments for unstack
@register_series_accessor("unstack")
def unstack(
Expand Down