Request Entity Too Large #6

Open
boferri opened this issue Jul 30, 2018 · 5 comments

@boferri

boferri commented Jul 30, 2018

While trying to harvest authority data from the DNB OAI endpoint, I'm getting the following error:

INFO[0000] https://services.dnb.de/oai/repository?from=2008-04-01T00:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&until=2008-04-30T23:59:59Z&verb=ListRecords 
FATA[0001] failed with Request Entity Too Large on https://services.dnb.de/oai/repository?from=2008-04-01T00:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&until=2008-04-30T23:59:59Z&verb=ListRecords: <nil>

Any chance to fix this?

The metha-sync call is the following:

metha-sync -format MARC21-xml -set authorities:person https://services.dnb.de/oai/repository
@miku
Owner

miku commented Jul 30, 2018

@Zazi, thanks for the bug report. Could reproduce. The DNB endpoint is in general relatively broken. I believe I saw this error before:

Your request matches to many records (>100000). The result size is 353017. Please try to restrict the request-period.

$ curl -vL "https://services.dnb.de/oai/repository?from=2008-04-05T00:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&until=2008-04-05T23:59:59Z&verb=ListRecords"
<html><head><title>Error</title></head><body>Your request matches to many records (&gt;100000). The result size is 353017. Please try to restrict the request-period.</body></html>

It's really odd, because even a daily slice (using the -daily flag) is too much. If, in theory, all records had the same timestamp, there would be no way at all to retrieve them in a windowed fashion, which in turn means that the endpoint is not fully OAI compliant.
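To make the windowed approach concrete, here is a minimal sketch of how from/until slices and the corresponding ListRecords URLs could be generated. The query parameters are the ones visible in the logged URL above; the helper names `windows` and `list_records_url` are made up for illustration and are not metha's actual code.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode

FMT = "%Y-%m-%dT%H:%M:%SZ"  # second granularity, as in the logged URLs


def windows(start, end, step):
    """Yield (from, until) pairs covering [start, end) in slices of `step`.

    Each slice ends one second before the next one begins, mirroring
    the 00:00:00Z .. 23:59:59Z pattern seen in the log.
    """
    cur = start
    while cur < end:
        nxt = min(cur + step, end)
        yield cur, nxt - timedelta(seconds=1)
        cur = nxt


def list_records_url(base, frm, until, prefix, set_spec):
    """Build a ListRecords URL for one slice (hypothetical helper)."""
    params = {
        "from": frm.strftime(FMT),
        "metadataPrefix": prefix,
        "set": set_spec,
        "until": until.strftime(FMT),
        "verb": "ListRecords",
    }
    # Keep ":" unescaped so the URL looks like the ones in the log.
    return base + "?" + urlencode(params, safe=":")


if __name__ == "__main__":
    base = "https://services.dnb.de/oai/repository"
    day = datetime(2008, 4, 5)
    for frm, until in windows(day, day + timedelta(days=1), timedelta(hours=6)):
        print(list_records_url(base, frm, until, "MARC21-xml", "authorities:person"))
```

Shrinking `step` only helps as long as records are spread over distinct datestamps. If they all share one datestamp, every slice containing it is still too large.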

The next thing I would try is:

$ oaicrawl -verbose -f MARC21-xml https://services.dnb.de/oai/repository

We wrote oaicrawl for the zvdd.de OAI endpoint, because it calls itself OAI despite being broken. oaicrawl is a much blunter tool: it fetches all identifiers (ListIdentifiers) and requests the records one by one (GetRecord). Let's see what happens with DNB:

$ oaicrawl -verbose -f MARC21-xml https://services.dnb.de/oai/repository
FATA[2018-07-30T14:15:52+02:00] expected element type <OAI-PMH> but have <html> 

Digging into it a bit more:

<title>Error</title>Your request matches to many records (>100000). The result size is 13413063. Please try to restrict the request-period.
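Since the endpoint answers with an HTML page instead of an OAI-PMH document, a client could recognize this case and extract the reported result size. This is a rough sketch based only on the error text quoted above; the function name is hypothetical and the message format is assumed to stay as shown.

```python
import re

# Matches the size reported in the HTML error page quoted above.
_RESULT_SIZE_RE = re.compile(r"The result size is (\d+)")


def result_size_from_error(body):
    """Return the reported result size if `body` is the limit error page.

    Returns None for a regular OAI-PMH response or any other body.
    """
    if "<OAI-PMH" in body:
        return None
    match = _RESULT_SIZE_RE.search(body)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    page = (
        "<html><head><title>Error</title></head><body>Your request matches "
        "to many records (&gt;100000). The result size is 353017. Please try "
        "to restrict the request-period.</body></html>"
    )
    print(result_size_from_error(page))
```

The reported size could, for example, be used to estimate how much smaller the next request window has to be.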

Now, let me rant a bit. Why does OAI have so-called "resumption tokens" at all? DataCite, BASE (Bielefeld) and other huge repositories work just fine when paging through the data (tens of millions of records) for days. It's a DNB problem, and it would be best if they used their own resources to solve it.

@boferri
Author

boferri commented Jul 30, 2018

Thanks a lot, @miku, for your very fast reply. I had also considered oaicrawl for this, but then I thought that fetching this rather large authorities set one by one from DNB might be a bit too much, so I skipped this approach. Furthermore, as far as I understood the arguments of oaicrawl, I cannot define the concrete set over there, or?
Thanks a lot for your feedback, I'll forward it to DNB somehow.
For our concrete use case it might even be enough to get the data excerpt from "Sächsische Bibliographie" via SRU. Then I "only" need to be able to define the appropriate CQL query (which is a bit beyond my knowledge so far).

@boferri
Author

boferri commented Jul 30, 2018

While writing the draft of a message to DNB and reading their OAI docs again, I came up with a possible solution:
Since the request returns a 413, which is a standard HTTP status code from RFC 7231, one can make use of this information and reduce the interval from daily to e.g. hourly in such cases (which requires setting both parameters, from and until, in the request).

Does this sound like a solution for you @miku ?

PS: the DNB OAI docs also say "Depending on the OAI repository these can be either defined to the day (YYYY-MM-DD) or to the second (YYYY-MM-DDThh:mm:ssZ)", so working with hourly slices might be possible.

curl -vL "https://services.dnb.de/oai/repository?from=2008-04-05T13:00:00Z&until=2008-04-05T14:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&verb=ListRecords"

This delivers at least some results (incl. a resumption token).

@miku
Owner

miku commented Jul 30, 2018

> I cannot define the concrete set over there, or?

Yes, oaicrawl was more of a one-shot for a particular endpoint and has a minimal feature set.

> Thanks a lot for your feedback, I'll forward it to DNB somehow.

I can try to do the same.

> Does this sound like a solution for you @miku ?

Yes, sure, this is an option. This is also a limitation of metha that I would like to get rid of one day (it was not essential for the use cases so far, so it is not implemented): it has only monthly and daily slices, not arbitrary precision.

@miku miku self-assigned this Jul 30, 2018
@boferri
Author

boferri commented Jul 31, 2018

OK, we've sent a request to DNB asking whether they can increase the result size limit. On the other hand, we would appreciate it if you could implement the proposed fallback functionality for when a 413 is returned, i.e., decrease the interval temporarily to hourly (and then go back to daily).
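The proposed fallback could look roughly like the sketch below. It is not metha code: `harvest`, `TooLarge` and the `fetch` callback are hypothetical names, and `fetch(frm, until)` is assumed to perform the ListRecords request for one window (following resumption tokens) and to raise `TooLarge` when the server answers with a 413.

```python
from datetime import datetime, timedelta


class TooLarge(Exception):
    """Raised by `fetch` when the server responds with HTTP 413."""


def harvest(fetch, start, end, step=timedelta(days=1), fallback=timedelta(hours=1)):
    """Harvest [start, end) in `step` windows, retrying a failed window in
    `fallback` windows and then continuing with `step` again."""
    records = []
    cur = start
    while cur < end:
        nxt = min(cur + step, end)
        try:
            records.extend(fetch(cur, nxt - timedelta(seconds=1)))
        except TooLarge:
            if step <= fallback:
                raise  # already at the finest slice, give up on this window
            # Re-harvest only the failed window with the smaller interval.
            records.extend(harvest(fetch, cur, nxt, step=fallback, fallback=fallback))
        cur = nxt  # the next window uses the regular step again
    return records


if __name__ == "__main__":
    def fake_fetch(frm, until):
        # Stand-in for a real request: one "busy" day is too large for a daily window.
        if frm.date() == datetime(2008, 4, 5).date() and until - frm > timedelta(hours=1):
            raise TooLarge()
        return [(frm, until)]

    slices = harvest(fake_fetch, datetime(2008, 4, 4), datetime(2008, 4, 7))
    print(len(slices), "windows fetched")
```

If even an hourly window is rejected, the sketch re-raises the error. A real implementation might split further, down to the second granularity mentioned in the DNB docs.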
