SNOW-1818205: Add support for pd.json_normalize #2657

sfc-gh-helmeleegy · 2024-11-20T21:09:10Z

Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

Fixes SNOW-1818205
Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
  - If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
- I am adding new logging messages
- I am adding a new telemetry message
- I am adding new credentials
- I am adding a new dependency
- If this is a new feature/behavior, I'm adding the Local Testing parity changes.
- I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
Please describe how your code solves the related issue.

Add support for pd.json_normalize.
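
For readers unfamiliar with the function, here is a minimal sketch of what `json_normalize` does in native pandas. The sample records are made up for illustration, and the PR does not state which arguments the Snowpark pandas version accepts, so treat this as a description of the native behavior only.

```python
import pandas as pd

# Made-up sample records: a list of nested dicts already held in memory.
records = [
    {"id": 1, "name": {"first": "Ada", "last": "Lovelace"}},
    {"id": 2, "name": {"first": "Alan"}},
]

# Nested keys become dot-separated column names; keys missing from a
# record are filled with NaN in the resulting DataFrame.
df = pd.json_normalize(records)
print(list(df.columns))
```

The second record has no `name.last` key, so that cell is NaN rather than raising an error.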

sfc-gh-nkrishna

Left one question, but approving

src/snowflake/snowpark/modin/plugin/extensions/io_overrides.py

sfc-gh-rdurrani

overall lgtm, but I left a couple of comments, and have a few questions:

- Since we're defaulting to native pandas, should we warn the user that the data will be processed serially in memory, and may be slow for large datasets?
- Do we want to consider adding a distributed/threaded approach later on? (e.g. process batches of data records in parallel by loading them into tables (using https://docs.snowflake.com/en/user-guide/tutorials/script-data-load-transform-json), and then joining those tables?)

src/snowflake/snowpark/modin/plugin/docstrings/io.py

src/snowflake/snowpark/modin/plugin/extensions/io_overrides.py

src/snowflake/snowpark/modin/plugin/docstrings/io.py

src/snowflake/snowpark/modin/plugin/extensions/io_overrides.py

src/snowflake/snowpark/modin/plugin/io/snow_io.py

sfc-gh-helmeleegy · 2024-11-22T02:27:45Z

> overall lgtm, but I left a couple of comments, and have a few questions:
>
> - Since we're defaulting to native pandas, should we warn the user that the data will be processed serially in memory, and may be slow for large datasets?
> - Do we want to consider adding a distributed/threaded approach later on? (e.g. process batches of data records in parallel by loading them into tables (using https://docs.snowflake.com/en/user-guide/tutorials/script-data-load-transform-json), and then joining those tables?)

As mentioned in another thread, the input data in this case is already in-memory.
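
To make the in-memory point concrete, here is a rough pure-Python sketch of the kind of flattening involved. `flatten_record` is a hypothetical helper written for illustration; it is not the pandas or Snowpark pandas implementation, and it handles only nested dicts (no `record_path`, `meta`, or list handling).

```python
from typing import Any


def flatten_record(
    record: dict[str, Any], sep: str = ".", prefix: str = ""
) -> dict[str, Any]:
    """Hypothetical helper: flatten nested dicts into one level of keys."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        # Build the dotted column name, e.g. "name" + "first" -> "name.first".
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the accumulated prefix.
            flat.update(flatten_record(value, sep=sep, prefix=name))
        else:
            flat[name] = value
    return flat


# The input is an ordinary Python object on the client, not a Snowflake table.
records = [
    {"id": 1, "name": {"first": "Ada", "last": "Lovelace"}},
    {"id": 2, "name": {"first": "Alan"}},
]
rows = [flatten_record(r) for r in records]
```

Because the caller passes a Python list or dict rather than a reference to remote data, the records are already on the client before any flattening happens.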

sfc-gh-helmeleegy requested a review from a team as a code owner November 20, 2024 21:09

sfc-gh-helmeleegy requested review from sfc-gh-rdurrani and sfc-gh-joshi November 20, 2024 21:09

github-actions bot added the snowpark-pandas label Nov 20, 2024

SNOW-1818205: Add support for pd.json_normalize

8b965ca

sfc-gh-helmeleegy force-pushed the helmeleegy-SNOW-1818205 branch from ab66505 to 8b965ca Compare November 20, 2024 21:14

sfc-gh-nkrishna approved these changes Nov 21, 2024

src/snowflake/snowpark/modin/plugin/extensions/io_overrides.py

sfc-gh-rdurrani reviewed Nov 21, 2024

src/snowflake/snowpark/modin/plugin/docstrings/io.py (outdated)

src/snowflake/snowpark/modin/plugin/docstrings/io.py

src/snowflake/snowpark/modin/plugin/extensions/io_overrides.py

sfc-gh-joshi approved these changes Nov 21, 2024

src/snowflake/snowpark/modin/plugin/docstrings/io.py

src/snowflake/snowpark/modin/plugin/extensions/io_overrides.py

src/snowflake/snowpark/modin/plugin/io/snow_io.py

sfc-gh-helmeleegy added 2 commits November 21, 2024 14:25

Merge branch 'main' into helmeleegy-SNOW-1818205

4771ae3

address comments

c3a28a6

sfc-gh-helmeleegy enabled auto-merge (squash) November 21, 2024 22:32

sfc-gh-helmeleegy added 4 commits November 21, 2024 14:54

fix errors

65c92d4

fix errors

a9951d3

fix errors

bd57a51

fix errors

e7d7ea5

sfc-gh-helmeleegy merged commit bbd7a62 into main Nov 22, 2024
36 of 37 checks passed

sfc-gh-helmeleegy deleted the helmeleegy-SNOW-1818205 branch November 22, 2024 02:51

github-actions bot locked and limited conversation to collaborators Nov 22, 2024
