Allow running pooch on air-gapped systems #352
Comments
@dokempf allowing users to configure the cache location is already implemented through environment variables that package developers can enable. A "fetch all" function could also easily be written by package developers in 2 lines of code:

    for fname in POOCH_INSTANCE.registry:
        POOCH_INSTANCE.fetch(fname)

Having such a function in Pooch itself would probably not help much: package developers would have to implement it on their end as well, since users are not even meant to know about Pooch. So maybe this is already resolved?
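For package developers who want to expose this to their users, the loop above could be wrapped in a small helper. This is only a sketch: the name `fetch_all` is hypothetical, and the argument is assumed to behave like a Pooch instance, i.e. it has a `registry` mapping of file names and a `fetch(fname)` method that returns the local path.

```python
from typing import Any, List


def fetch_all(pup: Any) -> List[str]:
    """Fetch every file in ``pup.registry`` and return the local paths.

    ``pup`` is assumed to be Pooch-like: ``registry`` maps file names to
    hashes and ``fetch(fname)`` downloads the file if needed and returns
    its path in the local cache.
    """
    # Sort the names so the download order is deterministic.
    return [pup.fetch(fname) for fname in sorted(pup.registry)]
```

A package could call `fetch_all(POOCH_INSTANCE)` once while network access is available, so that later calls to `fetch` find every file already in the cache.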
You are right that the "fetch all" thing is minor and can easily be implemented by a loop. What does not work air-gapped, though, is DOI resolution. This requires caching the HTTP responses, e.g. with https://github.com/requests-cache/requests-cache. I do have a partial implementation of this that I could finish now that this upstream issue is fixed.
But if the data have already been fetched, we wouldn't need to resolve the DOI. Caching the response opens up a load of potential issues with cache invalidation, which I'd rather avoid. Pooch is meant to be relatively simple and straightforward. So maybe not worth the maintenance burden for a somewhat niche application?
Currently, that only works when the registry is given explicitly. If it is populated from the data repository, it does not work offline, because it needs to do DOI resolution to learn what files there are. Could there be an easy fix, where the registry file is stored in the cache for this scenario?
Again, this opens up a lot of issues around validating the cached registry. These things have a tendency to hide non-obvious bugs and we've been bitten by them before. In this case, users could implement the caching of the registry with a few lines of code:

    import json
    import os

    import pooch

    POOCH = pooch.Pooch(...)
    try:
        # Online: populate the registry from the DOI and save a local copy.
        POOCH.load_registry_from_doi()
        with open("registry.json", "w") as output:
            json.dump(POOCH.registry, output)
    except Exception:
        # Offline: fall back to the saved copy if there is one.
        if os.path.exists("registry.json"):
            with open("registry.json") as cached:
                POOCH.registry = json.load(cached)
        else:
            raise

It's an easy fix from the user side, but implementing this in Pooch would mean designing an API around all of this that wouldn't break any existing code out there. With Pooch being pulled in by scipy and scikit-image, if we push a release that breaks compatibility in any way, it tends to be messy.
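The same fallback can be factored into a helper that can be exercised without network access. This is a sketch under assumptions, not part of Pooch: the name `registry_with_fallback` and its parameters are hypothetical, and the online step is passed in as a callable so that the helper does not depend on any particular Pooch call.

```python
import json
import os
from typing import Callable, Dict


def registry_with_fallback(
    load_online: Callable[[], Dict[str, str]], cache_file: str
) -> Dict[str, str]:
    """Return the registry from ``load_online()``, else from ``cache_file``.

    On success the registry is written to ``cache_file`` as JSON. If
    ``load_online`` raises and a cached copy exists, the cached copy is
    returned. If there is no cached copy, the original error propagates.
    """
    try:
        registry = load_online()
    except Exception:
        if not os.path.exists(cache_file):
            raise  # nothing to fall back to
        with open(cache_file) as cached:
            return json.load(cached)
    # Online path succeeded: refresh the saved copy for later offline use.
    with open(cache_file, "w") as output:
        json.dump(registry, output)
    return registry
```

With Pooch, `load_online` could be a small function that calls `POOCH.load_registry_from_doi()` and returns `dict(POOCH.registry)`. Note that this does nothing to validate the saved registry, which is exactly the concern raised above.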
Description of the desired feature:
When building Python libraries (as opposed to Python applications) that use Pooch for data downloading, we currently have to accept the fact that our software will not be usable in scenarios without network access (either due to temporary unavailability or due to the running environment being air-gapped). I think Pooch itself could mitigate this risk by doing the following things:
Pooch's registry (e.g. by allowing fetch() without arguments)

A challenge that would require some discussion is to also make other API requests (e.g. in DOI resolution) obsolete by caching the responses in a relocatable location.
Are you willing to help implement and maintain this feature?
Yes, with no specific timeline for the implementation.