SNOW-1818205: Add support for pd.json_normalize #2657

Merged: 7 commits, Nov 22, 2024

Conversation

sfc-gh-helmeleegy

@sfc-gh-helmeleegy sfc-gh-helmeleegy commented Nov 20, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1818205

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

    Add support for pd.json_normalize.
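
For readers unfamiliar with the function, here is a minimal sketch of what `json_normalize` does, written against plain pandas so it runs without a Snowflake session. The sample records are invented for illustration, and it is an assumption (not confirmed in this thread) that the Snowpark pandas version returns the same columns for the same input.

```python
import pandas as pd

# Hypothetical in-memory input: a list of nested dicts.
records = [
    {"id": 1, "name": {"first": "Ada", "last": "Lovelace"}},
    {"id": 2, "name": {"first": "Alan"}},
]

# Nested keys are flattened into dot-separated column names;
# keys missing from a record become NaN in that row.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

Note that the input is an ordinary Python object (a dict or list of dicts), not a table or DataFrame, so it is already held in client memory before the call.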


@sfc-gh-nkrishna sfc-gh-nkrishna left a comment

Left one question, but approving


@sfc-gh-rdurrani sfc-gh-rdurrani left a comment

overall lgtm, but I left a couple of comments, and have a few questions:

  1. Since we're defaulting to native pandas, should we provide some warning to the user or something, that the data will be processed serially in memory, and may be slow for large datasets?
  2. Do we want to consider adding a distributed/threaded approach later on? (e.g. process batches of data records in parallel by loading them into tables (using https://docs.snowflake.com/en/user-guide/tutorials/script-data-load-transform-json), and then joining those tables?)

@sfc-gh-helmeleegy sfc-gh-helmeleegy enabled auto-merge (squash) November 21, 2024 22:32
@sfc-gh-helmeleegy

overall lgtm, but I left a couple of comments, and have a few questions:

  1. Since we're defaulting to native pandas, should we provide some warning to the user or something, that the data will be processed serially in memory, and may be slow for large datasets?
  2. Do we want to consider adding a distributed/threaded approach later on? (e.g. process batches of data records in parallel by loading them into tables (using https://docs.snowflake.com/en/user-guide/tutorials/script-data-load-transform-json), and then joining those tables?)

As mentioned in another thread, the input data in this case is already in-memory.

@sfc-gh-helmeleegy sfc-gh-helmeleegy merged commit bbd7a62 into main Nov 22, 2024
36 of 37 checks passed
@sfc-gh-helmeleegy sfc-gh-helmeleegy deleted the helmeleegy-SNOW-1818205 branch November 22, 2024 02:51
@github-actions github-actions bot locked and limited conversation to collaborators Nov 22, 2024
4 participants