Wire up the ETag from S3's upload response back to the BlobDTO's MD5 field, to handle multipart upload correctly #915

sfc-gh-hmadan · 2024-11-21T02:35:21Z

With the late-breaking design change to use subscoped tokens instead of direct S3 PUTs, we ended up using the same file handling logic as BDECs. This meant going through JDBC and the S3 SDK.

During recent testing I discovered that for files larger than 16 MB, the upload to S3 is split into a multipart upload.
The ETag of such an object is NOT the MD5 hash of its contents, which is consistent with the S3 documentation.
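To make the difference concrete, here is a minimal sketch of how the two ETag shapes can diverge for the same bytes. It assumes the commonly observed multipart form (MD5 over the concatenated per-part MD5 digests, followed by `-<partCount>`), which is not a documented contract, and it assumes a single-part upload of an unencrypted or SSE-S3 object reports the plain MD5. The class and method names are hypothetical and are not part of JDBC or the ingest SDK.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class MultipartETagSketch {

  // Raw MD5 digest of a byte array.
  static byte[] md5(byte[] data) {
    try {
      return MessageDigest.getInstance("MD5").digest(data);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 is not available", e);
    }
  }

  // Lowercase hex encoding of a digest.
  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  // Single-part case: the ETag is assumed to be the MD5 of the whole content.
  static String singlePartETag(byte[] content) {
    return hex(md5(content));
  }

  // Multipart case (assumed form): MD5 over the concatenated part digests,
  // suffixed with "-" and the number of parts.
  static String multipartETag(byte[] content, int partSize) {
    if (partSize <= 0) {
      throw new IllegalArgumentException("partSize must be positive");
    }
    MessageDigest outer;
    try {
      outer = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 is not available", e);
    }
    int parts = 0;
    for (int off = 0; off < content.length; off += partSize) {
      int end = Math.min(content.length, off + partSize);
      outer.update(md5(Arrays.copyOfRange(content, off, end)));
      parts++;
    }
    return hex(outer.digest()) + "-" + parts;
  }
}
```

For the same input, `singlePartETag` and `multipartETag` return different strings, so a value computed as a plain MD5 cannot match the ETag of an object that was uploaded in parts.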

For BDECs, we calculate the MD5 hash ourselves and send it to Snowflake, where it's stored in the fileContentKey field.
For parquet files in Iceberg tables specifically, there is a check in XP that the ETag of the blob being read is identical to the fileContentKey stored in Snowflake metadata.

Connecting these dots: before this fix, for Iceberg ingestion of files larger than 16 MB, the SDK sends the MD5 hash in the fileContentKey property, whereas XP expects it to be the ETag value (which is NOT the MD5 of the contents if it's a multipart upload).
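The mismatch can be sketched as a plain string comparison. This is only a stand-in for the reader-side check described above, not the actual XP code; the class name, method name, and the multipart-style ETag value are made up for illustration.

```java
// Hypothetical stand-in for the reader-side check; not the real XP logic.
final class ContentKeyCheckSketch {

  // The blob is accepted only when the stored key is identical to the object's ETag.
  static boolean readerAcceptsBlob(String fileContentKey, String objectETag) {
    return fileContentKey.equals(objectETag);
  }

  public static void main(String[] args) {
    String md5OfContents = "5d41402abc4b2a76b9719d911017c592"; // MD5 hex of "hello"

    // Single-part upload: the ETag is assumed to equal the MD5, so the check passes.
    System.out.println(readerAcceptsBlob(md5OfContents, md5OfContents));

    // Multipart upload: placeholder ETag with a part-count suffix, so the check fails.
    String multipartStyleETag = "0123456789abcdef0123456789abcdef-2";
    System.out.println(readerAcceptsBlob(md5OfContents, multipartStyleETag));
  }
}
```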

The proper fix is to make JDBC return the ETag value after uploading the file, through all the layers of JDBC classes, up to the API that the ingest SDK uses (uploadWithoutConnection).
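A rough sketch of what "returning the ETag through the layers" could look like is below. Every type and method name here is hypothetical and chosen for illustration; none of them are the real JDBC or ingest SDK classes, and the real uploadWithoutConnection signature is not reproduced.

```java
// Hypothetical layering for illustration; not the real JDBC or ingest SDK classes.
final class ETagPlumbingSketch {

  // What the innermost storage call reports back after an upload.
  static final class UploadResult {
    final String eTag;

    UploadResult(String eTag) {
      this.eTag = eTag;
    }
  }

  // Innermost layer: performs the upload and knows the ETag from the response.
  interface StorageClient {
    UploadResult upload(String path, byte[] content);
  }

  // Middle layer: instead of discarding the result, it passes the ETag upward.
  static final class FileTransferLayer {
    private final StorageClient client;

    FileTransferLayer(StorageClient client) {
      this.client = client;
    }

    String uploadAndReturnETag(String path, byte[] content) {
      return client.upload(path, content).eTag;
    }
  }

  // Caller side: a stand-in for the blob metadata whose MD5 field receives the ETag.
  static final class BlobRecord {
    final String path;
    final String md5;

    BlobRecord(String path, String md5) {
      this.path = path;
      this.md5 = md5;
    }
  }

  // Uploads the blob and records the ETag from the upload, not a locally computed MD5.
  static BlobRecord uploadBlob(FileTransferLayer transfer, String path, byte[] content) {
    String eTag = transfer.uploadAndReturnETag(path, content);
    return new BlobRecord(path, eTag);
  }
}
```

The point of the shape is that the value stored for the blob comes from the upload response itself, so it matches what a reader sees on the object whether or not the upload was multipart.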

Since we need to fix this right away, this PR copies over the parts of JDBC that are used for Iceberg ingestion. As soon as the JDBC driver has the fix, we'll remove all these classes.

Note that this PR accidentally changes the timeout to 20 seconds. Another PR tomorrow is going to make that change, and I'll back it out of this branch before merging.

sfc-gh-alhuang

Looks good, thanks for the fix!

sfc-gh-hmadan · 2024-11-21T19:36:52Z

@sfc-gh-alhuang since the Iceberg merge gate does not run against GCS, can you please:

- test against GCS
- add encrypted profile.json files for the preprods too (right now it's just the prod profile.json files)

sfc-gh-hmadan requested a review from sfc-gh-alhuang November 21, 2024 03:18

sfc-gh-hmadan and others added 2 commits November 21, 2024 06:55

ready

3aa72f6

done

0652b1f

sfc-gh-hmadan force-pushed the hmadan-etag-fix branch from 1326801 to 0652b1f Compare November 21, 2024 06:56

sfc-gh-hmadan marked this pull request as ready for review November 21, 2024 17:30

sfc-gh-hmadan requested review from sfc-gh-tzhang and a team as code owners November 21, 2024 17:30

sfc-gh-alhuang approved these changes Nov 21, 2024

fix gcs

237e56e

sfc-gh-xhuang approved these changes Nov 21, 2024

sfc-gh-hmadan merged commit 84727c3 into master Nov 22, 2024
49 checks passed

sfc-gh-hmadan deleted the hmadan-etag-fix branch November 22, 2024 22:25
