Selective Harvesting and metha-cat #34
Comments
Sorry for my overly delayed reply.
Yes.
Yes.
Yes, I understand. So metha does not do much except caching responses so that subsequent invocations are faster (that's something I haven't seen a lot in other tools). So, to be on the safe side with respect to updates, one can always delete the cache for a particular endpoint and start anew.
That of course requires some tolerance of possibly stale records, depending on the requirements.
No problem, and thanks for your response. I'll have a closer look at the cache of an endpoint where I assume a lot of updated records flow in. Otherwise, metha works nicely and is stable :-) It has been part of our automated workflow for a couple of months.
Hi @miku, we have run into some redundancy trouble related to caching. Would there be a way to merge updated records into one with metha-cat? So if a record has been published and then updated, could metha-cat simply return the latest record? I do not know Golang, but maybe you could point me to the code where this could possibly happen.
EDIT: I now know a tiny bit of Go ... https://github.com/miku/metha/blob/master/render.go#L65-L84 It just iterates over the lists of records for each compressed .xml.gzip file and ignores records that do not match the datestamp if from and/or until are set. Once all the records have been collected, could they be matched by record identifier, taking the latest record if there are several for one identifier?
EDIT 2: I think this is difficult since there is no step of collecting all records in memory before writing them to stdout ...
This is a tradeoff: because we store multiple records per file, it's hard to overwrite a particular record. Originally, I opted for the time "windowed" approach, because requesting single records from an endpoint that emits e.g. a few million records would result in the same number of HTTP requests, and that is somewhat stressful for the server. One way to address this would be to request many records (in a time window) at once, but then store them individually on disk, so that a record could be overwritten if a new version is found. The next question would then be whether one file per record is the right approach. For the time being, rerunning from scratch is probably the simplest, albeit crude, approach.
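Until something like that exists in metha itself, the "match by record identifier, take the latest" idea from above could be applied as a post-processing step on the metha-cat output. The following is only a rough sketch, not part of metha: it assumes the output is a single XML document whose record elements each contain a header with identifier and datestamp children, and the function name latest_records and the sample data are made up for illustration. It also loads everything into memory, so it may not suit very large harvests.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Reduce a namespaced tag such as '{http://...}record' to 'record'."""
    return tag.rsplit("}", 1)[-1].lower()


def child_text(elem, name):
    """Return the text of the first direct child with this local name, or ''."""
    for child in elem:
        if local(child.tag) == name:
            return (child.text or "").strip()
    return ""


def latest_records(xml_text):
    """Map each identifier to (datestamp, element) of its newest record.

    ASSUMPTION: records look like <record><header><identifier/>
    <datestamp/></header>...</record>; check this against real output.
    """
    latest = {}
    for elem in ET.fromstring(xml_text).iter():
        if local(elem.tag) != "record":
            continue
        header = next((c for c in elem if local(c.tag) == "header"), None)
        if header is None:
            # e.g. a MARC <record> nested inside <metadata>, not an OAI record
            continue
        ident = child_text(header, "identifier")
        stamp = child_text(header, "datestamp")
        if not ident:
            continue
        # OAI-PMH datestamps are ISO 8601 UTC strings, so within one
        # repository (one granularity) they can be compared as strings.
        if ident not in latest or stamp >= latest[ident][0]:
            latest[ident] = (stamp, elem)
    return latest


# Hypothetical sample data: identifier 1 was published and later updated.
sample = """<Records xmlns="http://www.openarchives.org/OAI/2.0/">
  <record><header><identifier>oai:example.org:1</identifier><datestamp>2020-01-01</datestamp></header></record>
  <record><header><identifier>oai:example.org:2</identifier><datestamp>2020-02-01</datestamp></header></record>
  <record><header><identifier>oai:example.org:1</identifier><datestamp>2021-03-05</datestamp></header></record>
</Records>"""

for ident, (stamp, _elem) in latest_records(sample).items():
    print(ident, stamp)
```

On ties (same identifier, same datestamp) the record that appears later in the stream wins, which seemed like the least surprising choice here.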
Thanks for the explanation.
So, looking at the current behaviour, what else would be affected? I could offer my support in working on that. I do not know Golang, but at least I managed to run it from the CLI ... My motivation: I think incremental harvesting is the one thing OAI-PMH is great at, and it would be a pity to give that away.
Hi @miku,
We are adding more and more OAI-PMH endpoints and metha does a great job!
I have a question about selective harvesting and metha-cat. I have automated harvesting via crontab. After an initial harvest that gets all records from the earliest day on, we do one selective harvest a week:
metha-sync -T 5m -r 20 -base-dir /mydir -format marcxml https://zenodo.org/oai2d
Since all previous harvests are written to /mydir (local cache), metha-sync implicitly sets the -from param according to the last harvest, correct?
Now with metha-cat (without providing a timestamp), I have observed that more records are returned in the virtual XML than are actually in the repo, so I assume this also includes updates of a record (so the same record can occur multiple times in metha-cat's output). Is this interpretation correct?
EDIT: What I'd like to get is the latest version of each record via metha-cat.
Thanks and kind regards,
Tobias
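For readers less familiar with OAI-PMH, the selective harvesting mentioned above comes down to a ListRecords request restricted by a from (and optionally until) datestamp. The sketch below only builds such a request URL and is not how metha does it internally; the helper name list_records_url and the date are made up for illustration, and marcxml is used as an example metadata prefix. Because a record's datestamp changes when the record is modified, an updated record is returned again by a later window, which would explain the same identifier appearing more than once in a cache built from several windows.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None, until_date=None):
    """Build an OAI-PMH ListRecords URL, optionally limited to a datestamp window."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date    # lower bound of the window (inclusive)
    if until_date:
        params["until"] = until_date  # upper bound of the window (inclusive)
    return base_url + "?" + urlencode(params)


# Hypothetical date; a harvester would use the date of its last successful run.
url = list_records_url("https://zenodo.org/oai2d", "marcxml", from_date="2021-01-01")
print(url)
```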