Introduce StorageConnector for GCS #14611

Closed
wants to merge 9 commits into from

LakshSingla
Contributor

Description

This PR adds the storage connector to interact with GCS using the API functions exposed in google-api-services-storage. It will allow Durable storage and MSQ's interactive APIs to work with GCS.

This also refactors the existing S3 connector so that the chunked downloading it currently performs can be extended to other connectors.
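As a rough illustration of what sharing the chunked-download logic across connectors could look like, here is a minimal sketch. The class and method names (`ChunkingConnectorSketch`, `objectSize`, `openRange`) are hypothetical and are not taken from this PR; the in-memory subclass stands in for a real backend such as S3 or GCS.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical base class: the chunk iteration lives here, and each backend
// only implements "how big is the object" and "open bytes [start, end)".
abstract class ChunkingConnectorSketch {
  private final long chunkSizeBytes;

  ChunkingConnectorSketch(long chunkSizeBytes) {
    if (chunkSizeBytes <= 0) {
      throw new IllegalArgumentException("chunkSizeBytes must be positive");
    }
    this.chunkSizeBytes = chunkSizeBytes;
  }

  // Backend-specific hooks (assumed names, not a real library API).
  abstract long objectSize(String path) throws IOException;

  abstract InputStream openRange(String path, long start, long endExclusive) throws IOException;

  // Returns one stream over the whole object; each chunk is opened lazily,
  // only when the previous chunk has been fully consumed.
  InputStream read(final String path) throws IOException {
    final long size = objectSize(path);
    Enumeration<InputStream> chunks = new Enumeration<InputStream>() {
      private long next = 0;

      @Override
      public boolean hasMoreElements() {
        return next < size;
      }

      @Override
      public InputStream nextElement() {
        if (next >= size) {
          throw new NoSuchElementException();
        }
        long start = next;
        long end = Math.min(start + chunkSizeBytes, size);
        next = end;
        try {
          return openRange(path, start, end);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      }
    };
    return new SequenceInputStream(chunks);
  }
}

// In-memory stand-in for a real backend, used only to exercise the sketch.
class InMemoryConnectorSketch extends ChunkingConnectorSketch {
  private final byte[] data;
  final List<String> requestedRanges = new ArrayList<>();

  InMemoryConnectorSketch(byte[] data, long chunkSizeBytes) {
    super(chunkSizeBytes);
    this.data = data;
  }

  @Override
  long objectSize(String path) {
    return data.length;
  }

  @Override
  InputStream openRange(String path, long start, long endExclusive) {
    requestedRanges.add(start + "-" + endExclusive);
    return new ByteArrayInputStream(data, (int) start, (int) (endExclusive - start));
  }
}
```

With this shape, a connector for another store would only need to supply the two hooks; the chunk bookkeeping is written once.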

Because of the library versions currently in use, the connector has three areas for improvement:

  1. Because of limitations in google-api-services-storage and the version in use, multipart and streaming uploads are not available. The GCS connector therefore writes the intermediate contents to a file and uploads that file in a single request. Composite objects exist as an alternative, but that functionality seems incorrect. This can be improved once we upgrade the libraries.

  2. For fetching a file, an isChunkedDownloads flag controls whether the download is done in chunks using the Range header (https://cloud.google.com/storage/docs/xml-api/reference-headers#range). Because the header can be ignored, the functionality is kept behind the flag for now. The library does not currently support fetching by range.

  3. All delete requests are done individually and not in a batched manner.
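The first item above (buffer to a file, then upload in one request) can be sketched roughly as follows. This is not the code from this PR: `SingleShotUploader` is a hypothetical callback standing in for whatever single-request upload call the storage client exposes, and `TempFileBufferedOutputStream` is an illustrative name.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a client's "upload this object in one request" call.
interface SingleShotUploader {
  void upload(String objectPath, InputStream content, long length) throws IOException;
}

// Writes go to a local temp file; the upload happens once, on close().
class TempFileBufferedOutputStream extends OutputStream {
  private final String objectPath;
  private final SingleShotUploader uploader;
  private final Path tempFile;
  private final OutputStream fileOut;
  private boolean closed = false;

  TempFileBufferedOutputStream(String objectPath, SingleShotUploader uploader) throws IOException {
    this.objectPath = objectPath;
    this.uploader = uploader;
    this.tempFile = Files.createTempFile("connector-upload-", ".tmp");
    this.fileOut = new BufferedOutputStream(Files.newOutputStream(tempFile));
  }

  @Override
  public void write(int b) throws IOException {
    fileOut.write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    fileOut.write(b, off, len);
  }

  @Override
  public void close() throws IOException {
    if (closed) {
      return; // close() is idempotent, so the upload runs at most once
    }
    closed = true;
    try {
      fileOut.close(); // flush everything to disk before reading it back
      try (InputStream in = Files.newInputStream(tempFile)) {
        uploader.upload(objectPath, in, Files.size(tempFile));
      }
    } finally {
      Files.deleteIfExists(tempFile); // always clean up the local buffer
    }
  }
}
```

The trade-off in this approach is local disk usage proportional to the object size, which is what a streaming or multipart upload would avoid.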

This implementation could be improved by using the google-cloud-storage library instead of google-api-services-storage, though that would require an overhaul of the existing Google functions.

Release note

To be added


Key changed/added classes in this PR
  • GoogleStorageConnector
  • OurBar
  • TheirBaz

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@cryptoe cryptoe added the Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 label Jul 19, 2023
@LakshSingla
Contributor Author

Parking this for now, since the current library doesn’t support chunked downloads or uploads, and Druid is bound to that library because Guava cannot be updated for a while.

Will update the PR with a list of requirements and the library versions required for enabling this connector. Working on the Azure connector in the meantime.

@LakshSingla LakshSingla deleted the gcs-storage-connector branch January 15, 2024 09:24
@LakshSingla
Contributor Author

Closed in favor of #15398
