
Selective Harvesting and metha-cat #34

Open
tobiasschweizer opened this issue Sep 15, 2023 · 5 comments


tobiasschweizer commented Sep 15, 2023

Hi @miku,

We are adding more and more OAI-PMH endpoints and metha does a great job!

I have a question about selective harvesting and metha-cat. I have automated harvesting via crontab.
After an initial harvest that gets all records from the earliest day on, we do one selective harvest a week:

metha-sync -T 5m -r 20 -base-dir /mydir -format marcxml https://zenodo.org/oai2d

Since all previous harvests are written to /mydir (local cache), metha-sync implicitly sets the -from param according to the last harvest, correct?

Now with metha-cat (without providing a timestamp), I have observed that more records are returned in the virtual XML than are actually in the repository, so I assume this also includes updates of records (i.e. the same record can occur multiple times in metha-cat's output). Is this interpretation correct?

EDIT: What I'd like to get is the latest version of each record via metha-cat.

Thanks and kind regards,

Tobias

Owner

miku commented Nov 29, 2023

Sorry for my overly delayed reply.

Since all previous harvests are written to /mydir (local cache), metha-sync implicitly sets the -from param according to the last harvest, correct?

Yes.

Now with metha-cat (without providing a timestamp), I have observed that more records are returned in the virtual XML than are actually in the repository, so I assume this also includes updates of records (i.e. the same record can occur multiple times in metha-cat's output). Is this interpretation correct?

Yes.

EDIT: What I'd like to get is the latest version of each record via metha-cat.

Yes, I understand. metha does not do much besides caching responses so that subsequent invocations are faster (something I have not seen a lot in other tools). To be on the safe side with respect to updates, one can always delete the cache for a particular endpoint and start anew:

$ rm -rf $(metha-sync -dir http://my.server.org)

That of course requires some tolerance for possibly stale records, depending on the requirements.

@tobiasschweizer
Author

No problem and thanks for your response. I'll have a closer look at an endpoint's cache where I assume that a lot of updated records flow in.

Otherwise, metha works nicely and is stable :-) It has been part of our automated workflow for a couple of months now.


tobiasschweizer commented Oct 31, 2024

Hi @miku,

We have run into some redundancy trouble related to caching.
Deleting the cache obviously removes the issue, but it is not a good option when dealing with large amounts of data.

Would there be a way to merge updated records into one with metha-cat? So if a record has been published and then updated, could metha-cat simply return the latest record?

I do not know Golang but maybe you could point me to the code where this could possibly happen.

EDIT I now know a little tiny bit of Go ...
I think this is where the magic happens:

https://github.com/miku/metha/blob/master/render.go#L65-L84

It just iterates over the list of records in each compressed .xml.gz file and skips records whose datestamp falls outside the window when from and/or until are set.

Once all the records have been collected, could they be matched by record identifier, taking the latest record if there are several for one identifier?

EDIT 2 I think this is difficult since there is no step that collects all records in memory before writing them to stdout ...
On the other hand, collecting everything is probably a bad idea as there could be several gigabytes of data. Not sure how best to approach this. Some kind of postprocessing?


miku commented Oct 31, 2024

This is a tradeoff: because we store multiple records per file, it is hard to overwrite a particular record. Originally, I opted for the time-"windowed" approach because requesting records one by one from an endpoint that emits, e.g., a few million records would result in the same number of HTTP requests, which is somewhat stressful for the server.

One way to address this would be to request many records (in a time window) at once but store them individually on disk, so that a record could be overwritten when a new version is found. The next question then would be whether one file per record is the right approach.

For the time being, rerunning from scratch is probably the simplest, albeit crude, approach.

@tobiasschweizer
Author

Thanks for the explanation.

One way it could be addressed would be to request many records (in a time window) at once, but then store them individually on disk, so that a record could be overwritten, if a new version is found. The next question then would be, if one file per record is the right approach.

So metha-sync would make sure that there is exactly one representation/file stored per record (also if this record has been updated). Then the virtual XML produced by metha-cat would already be free of duplicates.

Looking at the current behaviour, metha-sync creates gzip files with multiple records in them, organised by publication date. So working on that would affect both metha-sync and metha-cat (render.go).
Obviously, the new version would not be compatible with old caches unless there is some kind of migration assistant.

What else would be affected? I could offer my support in working on that. I do not know Golang but at least I managed to run it from the CLI ...

My motivation: I think incremental harvesting is the one thing OAI-PMH is great at, and it would be a pity to give that away.
