Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wire up the ETag from S3's upload response back to the BlobDTO's MD5 field, to handle multipart upload correctly #915

Merged
merged 3 commits into from
Nov 22, 2024

Conversation

sfc-gh-hmadan
Copy link
Collaborator

With the late breaking design change to use subscoped tokens instead of direct S3 PUTs, we ended up using the same file handling logic as BDECs. This meant going through JDBC and the S3 SDK.

As part of recent testing I've discovered that for files greater than 16MB, S3 splits the file into a multipart upload.
The ETag of such a file is NOT the MD5 hash, which is what's also documented.

For BDECs, we calculate the MD5 hash ourselves and send it to snowflake, where it's stored in the fileContentKey field.
For parquet files operating specifically in the iceberg table, there is a check in XP to ensure that the ETag of the blob being read is identical to the fileContentKey stored in snowflake metadata.

Connecting these dots - what's happening before this fix is that for iceberg ingestion of files greater than 16 MB, the SDK sends the MD5 hash into the fileContentKey property whereas XP expects it to be the ETag value (which is NOT the MD5 of the contents IF its a multipart upload).

The proper fix is to make JDBC return the ETag value after uploading the file, through all the layers of JDBC classes, to the API that ingest SDK uses (uploadWithoutConnection).

Since we need to fix this right away, this PR copies over those parts of JDBC that are used for iceberg ingestion. As soon as JDBC driver has the new fix we'll remove all these classes.

Note that this PR accidentally changes the timeout to 20 seconds, another PR tomorrow is going to make that change and i'll back it out of this branch before merging.

@sfc-gh-hmadan sfc-gh-hmadan marked this pull request as ready for review November 21, 2024 17:30
@sfc-gh-hmadan sfc-gh-hmadan requested review from sfc-gh-tzhang and a team as code owners November 21, 2024 17:30
Copy link
Contributor

@sfc-gh-alhuang sfc-gh-alhuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for the fix!

@sfc-gh-hmadan
Copy link
Collaborator Author

@sfc-gh-alhuang since the iceberg merge gate does not go against GCS, can you please:

  1. test against gcs
  2. add encrypted profile.json files for the preprods too (right now its just the prod profile.jsons)

@sfc-gh-hmadan sfc-gh-hmadan merged commit 84727c3 into master Nov 22, 2024
49 checks passed
@sfc-gh-hmadan sfc-gh-hmadan deleted the hmadan-etag-fix branch November 22, 2024 22:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants