Wire up the ETag from S3's upload response back to the BlobDTO's MD5 field, to handle multipart upload correctly #915
With the late-breaking design change to use subscoped tokens instead of direct S3 PUTs, we ended up reusing the same file-handling logic as BDECs, which means going through JDBC and the S3 SDK.
During recent testing I discovered that for files larger than 16 MB, S3 splits the file into a multipart upload.
The ETag of a multipart-uploaded object is NOT the MD5 hash of its contents, which is also what S3's documentation states.
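For reference, a single-part upload's ETag is the object's MD5 as 32 hex characters, while a multipart upload's ETag has the form `<hex>-<partCount>`. Below is a minimal, hypothetical sketch (not code from this PR) of how the two can be told apart:

```java
final class ETagSketch {
  // Hypothetical helper, not from this PR: a multipart ETag has the form "<hex>-<partCount>",
  // while a single-part ETag is the object's MD5 as 32 hex characters, so '-' is the tell.
  static boolean looksLikeMultipartETag(String etag) {
    String normalized = etag.replace("\"", ""); // S3 may wrap the ETag in double quotes
    return normalized.contains("-");
  }
}
```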
For BDECs, we calculate the MD5 hash ourselves and send it to Snowflake, where it's stored in the fileContentKey field.
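A rough sketch of that BDEC-side hashing (illustrative only; the helper name and the choice of Base64 encoding are assumptions, not the SDK's actual code):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

final class Md5Sketch {
  // Illustrative only: compute the blob's MD5 locally, in the form that ends up as
  // fileContentKey (Base64 encoding is an assumption here, not confirmed from the SDK).
  static String computeBlobMd5(byte[] blobBytes) throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    return Base64.getEncoder().encodeToString(md5.digest(blobBytes));
  }
}
```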
For Parquet files written specifically to Iceberg tables, XP has a check that the ETag of the blob being read is identical to the fileContentKey stored in Snowflake metadata.
Connecting these dots: before this fix, for Iceberg ingestion of files larger than 16 MB, the SDK sends the MD5 hash as the fileContentKey property, whereas XP expects it to be the ETag value (which is NOT the MD5 of the contents if it was a multipart upload).
The proper fix is to make JDBC return the ETag value after uploading the file, through all the layers of JDBC classes, up to the API the ingest SDK uses (uploadWithoutConnection).
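Conceptually, the wiring looks like the sketch below. In the real code path the upload goes through JDBC's file-transfer layers rather than a direct `putObject` call, and `BlobDTO.setMd5` stands in for whatever setter the DTO actually exposes, so treat the names as assumptions:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.PutObjectResult;
import java.io.File;

final class ETagWiringSketch {
  // Hypothetical DTO surface; the real BlobDTO field is whatever Snowflake stores as fileContentKey.
  interface BlobDTO {
    void setMd5(String etagOrMd5);
  }

  // Sketch of the intended behavior: record S3's reported ETag instead of a locally
  // computed MD5, so multipart uploads (where ETag != MD5) still satisfy XP's check.
  static void uploadAndRecordETag(AmazonS3 s3, String bucket, String key,
                                  File blobFile, BlobDTO blobDto) {
    PutObjectResult result = s3.putObject(bucket, key, blobFile);
    blobDto.setMd5(result.getETag());
  }
}
```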
Since we need to fix this right away, this PR copies over the parts of JDBC that are used for Iceberg ingestion. As soon as the JDBC driver ships the fix, we'll remove all these copied classes.
Note that this PR accidentally changes the timeout to 20 seconds; another PR tomorrow is going to make that change, and I'll back it out of this branch before merging.