Large Deposits
In 2023, H2 added Globus integration, which allows users to deposit large amounts of data, in terms of both the number of files and their sizes. For instructions on how this is done, please see this demo video.
Despite the ability to create deposits with Globus, there are (currently) practical constraints on moving data through the SDR, which were diagnosed during a production test of a 500 GB object with 19,000 files (hj302gv2126). These are notes on changes that were made in March of 2023 to improve SDR processes so that a large deposit could go through. They are meant to inform future efforts to scale SDR's ability to process large deposits.
- The Technical Metadata step in the Accessioning Workflow was not able to find files. This was because the Accessioning workflow was being kicked off twice, and the amount of data was causing a race condition which ordinarily didn't manifest for Globus deposits. https://github.com/sul-dlss/happy-heron/issues/3007
- H2 was unable to fetch a list of all the files from Globus in a deposit when the deposit had a large number of files (in this case 19,000). The `GlobusClient.list_files()` method in our globus_client library needs to issue an API call for every directory contained in the user's Globus upload directory in order to get a complete list of files. This seemed to encounter an intermittent connection termination in production (on sul-h2-prod) but not in our staging or development environments. Since we were unable to determine why the production network was behaving this way, our solution was to retry the HTTP call when it failed using faraday-retry. https://github.com/sul-dlss/happy-heron/issues/3008
- H2's call to update a deposit using the resource update endpoint in the SDR API was timing out. Increasing the timeout to 30 minutes didn't help. We discovered that this API request was taking a long time because it was generating missing digests for Globus deposits, since digests are not available from the Globus API itself. The solution was to move fixity generation into the SDR-API's background IngestJob and UpdateJob jobs instead of doing it as part of the HTTP response generation. See https://github.com/sul-dlss/happy-heron/issues/2995
- The H2 application encountered a socket timeout when trying to update the database after waiting a long time for the Globus list files operation to complete. Rails' ActiveRecord holds on to database connections, and the network connection between sul-h2-prod and sul-h2-db-prod was getting interrupted by a firewall rule. The solution was to re-open all the database connections after returning from the (potentially long) `GlobusClient.list_files()` operation. This is a pattern that we've had to use elsewhere in the SDR. https://github.com/sul-dlss/happy-heron/issues/3019
- Once deposited, it takes 40 seconds for Argo to render the item view for druid:hj302gv2126. But at least it renders, eventually.
- Once shelved it takes about a minute for the PURL to completely render the file listing: https://purl.stanford.edu/hj302gv2126
- The sdr-client encounters a timeout when retrieving metadata for this large object from outside the VPN (which some users of SDR-API might be doing). The request appears to get cut off after two minutes. For example: `sdr get druid:hj302gv2126`
- The `reset-workspace` step in the Accessioning Workflow encounters a network timeout. https://github.com/sul-dlss/common-accessioning/issues/1039
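The retry fix for the per-directory listing calls can be illustrated with a minimal, standard-library-only sketch. It is not the faraday-retry API or the actual globus_client code: `with_retries`, `list_all_files`, the `fetch_dir` callable, and the shape of the directory entries are all hypothetical names chosen for illustration. The sketch only shows the general pattern of retrying each directory request with a backoff, so that one dropped connection does not abort a listing spanning thousands of files.

```ruby
require 'timeout'

# Hypothetical helper (not part of faraday-retry or globus_client): re-runs the
# block when it raises one of the listed errors, backing off exponentially.
def with_retries(max_attempts: 4, base_delay: 0.5,
                 errors: [Errno::ECONNRESET, EOFError, Timeout::Error])
  attempt = 0
  begin
    attempt += 1
    yield
  rescue *errors
    raise if attempt >= max_attempts # give up and surface the last error
    sleep(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Walks a directory tree breadth-first, issuing one (retried) listing call per
# directory. `fetch_dir` is a stand-in for the real per-directory API request
# and is assumed to return hashes like { name: 'a.txt', type: 'file' }.
def list_all_files(root, fetch_dir:, **retry_options)
  files = []
  pending = [root]
  until pending.empty?
    dir = pending.shift
    entries = with_retries(**retry_options) { fetch_dir.call(dir) }
    entries.each do |entry|
      path = File.join(dir, entry[:name])
      if entry[:type] == 'dir'
        pending << path # visit this subdirectory later
      else
        files << path
      end
    end
  end
  files
end
```

Because the retry wraps each directory request rather than the whole walk, a transient failure costs one repeated call instead of restarting the entire listing.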
Since network timeouts (in HTTP API calls and database connections) were a common theme in these difficulties, we can expect to remediate these problems through a combination of:
- Moving expensive work into background jobs (e.g. Sidekiq) which then issue a callback of some kind to indicate completion.
- Replacing HTTP API calls with RabbitMQ messages which can be picked up and responded to asynchronously.
- Partitioning Cocina responses into multiple HTTP resources using HATEOAS and paging of resources. So instead of getting all the metadata and files for an object as part of a single request it should be possible for the API and the client to allow paging of resources.
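As a rough illustration of the paging idea, a client could follow "next" links instead of fetching every file in one response. This is a sketch under stated assumptions, not the SDR API: the page shape (`:files` plus a `:links` hash with a `:next` URL), the `paged_files` method, and the `fetch_page` callable are all hypothetical.

```ruby
# Returns a lazy Enumerator over all files in a paged collection. Pages are
# requested one at a time, and only when the consumer asks for more items.
# `fetch_page` is a hypothetical callable that takes a URL and returns a hash
# like { files: [...], links: { next: 'url-or-nil' } }.
def paged_files(first_url, fetch_page:)
  Enumerator.new do |yielder|
    url = first_url
    while url
      page = fetch_page.call(url)
      page.fetch(:files).each { |file| yielder << file }
      url = page.dig(:links, :next) # nil when there are no more pages
    end
  end
end
```

With this shape each HTTP request stays small and bounded, so no single call has to outlast a two-minute network cutoff, and a caller that only needs the first few files never requests the remaining pages.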