Why is data only harvested up to the last day? #12

Open
gunnihinn opened this issue Mar 25, 2019 · 3 comments

Comments

@gunnihinn

The readme says

Currently, there is a limitation which only allows to harvest data up to the last day. Example: If the current date would be Thu Apr 21 14:28:10 CEST 2016, the harvester would request all data since the repositories earliest date and 2016-04-20 23:59:59.

which is indeed the current behavior. Do you remember what the reason for this limitation is? Is it something inherent in the OAI protocol, or does it come from somewhere else?
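For illustration, the cutoff described in the readme can be sketched as follows. This is a minimal sketch, not metha's actual code: the function name `harvest_until` is hypothetical, and the example uses a naive local timestamp rather than handling time zones.

```python
from datetime import datetime, timedelta


def harvest_until(now: datetime) -> datetime:
    """Return the last second of the day before `now` (hypothetical helper)."""
    # Truncate to midnight of the current day, then step back one second.
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(seconds=1)


# The readme's example: Thu Apr 21 14:28:10 2016 (time zone ignored here).
now = datetime(2016, 4, 21, 14, 28, 10)
print(harvest_until(now))  # 2016-04-20 23:59:59
```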

I'm using metha to harvest the arXiv and am curious about this one-day delay.

@miku
Owner

miku commented Mar 25, 2019

It is not a limitation of the protocol, but an implementation tradeoff that I'd like to fix in some future version.

Basically: OAI allows two date granularities, day and second. In order to have a single filename type on disk (e.g. 2018-04-30-00000000.xml.gz), we used the coarser granularity. Also, we wanted to avoid having to check for duplicates (e.g. when requesting, every hour, an endpoint that only supports day granularity).
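A rough sketch of why this avoids duplicate checks: if every completed day maps to exactly one file name, a day that already has a file is never requested again. Everything below is an assumption for illustration, not metha's implementation. The names `filename` and `pending_days` are hypothetical, the naming pattern is only modelled on the example file name above, and one window per day is a simplification.

```python
from datetime import date, timedelta


def filename(day: date, seq: int = 0) -> str:
    """Hypothetical naming scheme: ISO date plus an eight-digit counter."""
    return f"{day.isoformat()}-{seq:08d}.xml.gz"


def pending_days(earliest: date, today: date, on_disk: set) -> list:
    """Return completed days (strictly before `today`) that have no file yet."""
    days = []
    day = earliest
    while day < today:  # the current, unfinished day is never included
        if filename(day) not in on_disk:
            days.append(day)
        day += timedelta(days=1)
    return days


existing = {"2018-04-29-00000000.xml.gz"}
for day in pending_days(date(2018, 4, 28), date(2018, 5, 1), existing):
    print(filename(day))
```

Because the unfinished current day is excluded, running this every hour would not fetch the same day-granularity window twice.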

It's not ideal, and I already have some prototypes for more seamless handling; I just need to weave them into metha.

@gunnihinn
Author

Thanks, that's fair enough. If you'd like some help with or review of any of those prototypes or their design or implementation, I'd be happy to be of assistance.

@miku
Owner

miku commented Mar 26, 2019

If you'd like some help with or review of any of those prototypes or their design or implementation, I'd be happy to be of assistance.

Thanks.
