The readme says:
Currently, there is a limitation that only allows harvesting data up to the last day. Example: if the current date is Thu Apr 21 14:28:10 CEST 2016, the harvester requests all data between the repository's earliest date and 2016-04-20 23:59:59.
which is indeed the current behavior. Do you remember what the reason for this limitation is? Is it something inherent in the OAI protocol, or does it come from somewhere else?
I'm using metha to harvest the arXiv and am curious about this one-day delay.
It is not a limitation of the protocol but an implementation tradeoff, one that I'd like to fix in some future version.
Basically: OAI allows two date granularities, day and second. In order to have a single filename type on disk (e.g. 2018-04-30-00000000.xml.gz), we used the coarser granularity. Also, we wanted to avoid having to check for duplicates (e.g. when querying, every hour, an endpoint that only supports day granularity).
It's not ideal, and I already have some prototypes for more seamless handling; I just need to weave them into metha.
Thanks, that's fair enough. If you'd like some help with or review of any of those prototypes or their design or implementation, I'd be happy to be of assistance.