[SPIKE] irIIIFService: Fix bug with encoding encoded strings #150

Open
devangm opened this issue Oct 2, 2024 · 7 comments · Fixed by #151

devangm commented Oct 2, 2024

irIIIFService percent-encodes a URI string even if it is already encoded. If the value already contains a percent-encoded character, the % is encoded again as %25 (for example, %20 becomes %2520), which results in a 404 during manifest generation.
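The double-encoding effect can be reproduced with the JDK alone. The sketch below is not the irIIIFService code; it only illustrates the failure mode, using the multi-argument java.net.URI constructor (which always quotes the % character) and a placeholder host, example.org. The class and method names are made up for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class DoubleEncodingDemo {

    // Percent-encodes a path with the multi-argument java.net.URI constructor.
    // That constructor always quotes '%', so it is only safe to call it with
    // a raw (not yet encoded) path.
    static String encodePath(String path) {
        try {
            // example.org is a placeholder host, not the real service.
            return new URI("https", "example.org", path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode path: " + path, e);
        }
    }

    public static void main(String[] args) {
        String raw = "/files/blumberg-holiday card_1.jpg";
        String once = encodePath(raw);   // space becomes %20
        String twice = encodePath(once); // the % of %20 becomes %25
        System.out.println(once);        // /files/blumberg-holiday%20card_1.jpg
        System.out.println(twice);       // /files/blumberg-holiday%2520card_1.jpg
    }
}
```

The second result names a resource that does not exist, which is consistent with the 404 described above.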

#149
#139

Acceptance Criteria

Manifests for files with spaces are generated as expected.

Contributor

kaladay commented Oct 2, 2024

There is explicit encoding happening here:

Introduced here:

And here:

To address some sort of dspace situation (based on the branch name containing "dspace" in it).

See also:

Member

markpbaggett commented Oct 2, 2024

@kaladay and @qtamu Adding this here just for science. One thing that I thought about last week and wanted to try but couldn't as I don't have access to FCREPO on dev and pre was to see if Fedora can even have a URI that is not URL encoded. My interpretation of RDF Concepts is that this shouldn't be allowed, but who knows if Fedora and DSPACE follow the specification closely enough to say for sure.

To test the idea, I was thinking we should directly target the problematic URI in question, delete it, and then reinsert an unescaped URI. The request will likely fail altogether as the SPARQL update wouldn't even be valid with the URI unescaped, but it would be interesting to see if it does in fact pass.

How to Test on Dev with an Existing Resource

Create a file called update.ru with the following SPARQL Update as the body of the file:

PREFIX iana: <http://www.iana.org/assignments/relation/>

DELETE {
  <> iana:describedby <https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files/blumberg-holiday%20card_1.jpg/fcr:metadata> .
}
INSERT {
  <> iana:describedby <https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files/blumberg-holiday card_1.jpg/fcr:metadata> .
}
WHERE { }

Do a curl request via shell on local host like so:

curl -X PATCH \
     -H "Content-Type: application/sparql-update" \
     --data-binary "@update.ru" \
     "https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files/blumberg-holiday%20card_1.jpg/fcr:metadata"

You'll also need to pass a username and auth; both can be found in the environment variables in Rancher dev.
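For anyone who would rather script the request above in Java than in curl, a rough equivalent using the JDK's java.net.http API (Java 11+) might look like the following. It assumes HTTP Basic authentication, which the thread does not confirm, and the class name SparqlPatchRequest is made up for this sketch. The request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class SparqlPatchRequest {

    // Builds a PATCH request carrying a SPARQL Update body.
    // Assumption: the server accepts HTTP Basic credentials.
    static HttpRequest build(String targetUrl, String sparqlUpdate,
                             String username, String password) {
        String credentials = username + ":" + password;
        String basicAuth = "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));

        // URI.create does not re-encode, so targetUrl must already be
        // percent-encoded (as in the curl example).
        return HttpRequest.newBuilder(URI.create(targetUrl))
                .header("Content-Type", "application/sparql-update")
                .header("Authorization", basicAuth)
                .method("PATCH", HttpRequest.BodyPublishers.ofString(sparqlUpdate))
                .build();
    }
}
```

The body would be the contents of update.ru, and the request could then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), checking statusCode() against the predictions below.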

What do I think will happen when we run this?

  1. My prediction: a 412 Precondition failed response. I think this is most likely as the SPARQL isn't valid (URIs can't be unescaped).
  2. Possible: a 204 response where only the delete is successful. Again, I just don't see the insert working, but maybe something bad here allows the initial part of the request to go through. No problem though, as we can fix it by running an INSERT request that adds the URI back as it originally appeared in the DELETE (also, this is dev).
  3. Unlikely: a 204 where the url gets deleted and reinserted but the object of the triple in question ends up being https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files/blumberg-holiday%20card_1.jpg/fcr:metadata. In other words, request successfully goes through, but Fedora escapes the URI even though we said for it not to.
  4. I'd be shocked: a 204 where the url gets deleted and the object of the triple becomes https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files/blumberg-holiday card_1.jpg/fcr:metadata. In this case, we've absolutely got to account for unescaped URIs and handle escaping in our various requests.

Contributor

kaladay commented Oct 9, 2024

I would note that a Cantaloupe issue mentions Percent encoding problems here:

Contributor

kaladay commented Oct 9, 2024

Other links regarding URIs and percent encoding in Cantaloupe:

Contributor

kaladay commented Oct 9, 2024

The example script as described above fails like this:

# bash curl-update_ru.curl 
Encountered " "<" "< "" at line 7, column 23.
Was expecting one of:
    <IRIref> ...
    <PNAME_NS> ...
    <PNAME_LN> ...
    <BLANK_NODE_LABEL> ...
    <VAR1> ...
    <VAR2> ...
    "true" ...
    "false" ...
    <INTEGER> ...
    <DECIMAL> ...
    <DOUBLE> ...
    <INTEGER_POSITIVE> ...
    <DECIMAL_POSITIVE> ...
    <DOUBLE_POSITIVE> ...
    <INTEGER_NEGATIVE> ...
    <DECIMAL_NEGATIVE> ...
    <DOUBLE_NEGATIVE> ...
    <STRING_LITERAL1> ...
    <STRING_LITERAL2> ...
    <STRING_LITERAL_LONG1> ...
    <STRING_LITERAL_LONG2> ...
    "(" ...
    <NIL> ...
    "[" ...
    <ANON> ...

Member

markpbaggett commented Oct 10, 2024

In my SPARQL update, we are attempting to insert a URL as a URI rather than a Literal. By enclosing the value in <> rather than "" we are saying that the object of the triple is a URI, and the parse error above occurs because URIs can't contain spaces in RDF. That holds for linked data platforms, graph databases, and triple stores alike. In other words, it's not exactly my initial prediction (a 412 Precondition Failed response, on the grounds that the SPARQL isn't valid because URIs can't be unescaped), but it's the same sentiment.

In RDF, a binary file, like any other resource, must always be identified by a URI (not a literal). This is because the URI identifies a specific resource (the binary file), and in RDF, URIs are used to uniquely identify resources.

What this shows is that, at least with update / patch, fcrepo is saying you can't have unescaped spaces in a file's URI.
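The same rule can be checked on the Java side before a request is ever made. The sketch below uses java.net.URI, which follows RFC 2396 with some amendments rather than the RDF/SPARQL IRI grammar, so treat it as an approximation of what the SPARQL parser enforces. The class and method names are invented for illustration, and example.org is a placeholder host.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriSyntaxCheck {

    // Returns true if the JDK can parse the string as a URI.
    // A literal space is an illegal character and makes parsing fail.
    static boolean isValidUri(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder URLs modeled on the file name from this issue.
        System.out.println(isValidUri(
                "https://example.org/files/blumberg-holiday%20card_1.jpg/fcr:metadata")); // true
        System.out.println(isValidUri(
                "https://example.org/files/blumberg-holiday card_1.jpg/fcr:metadata"));   // false
    }
}
```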

I think we should take this one step further and also show on dev that we can't do this with a post. If we can, whatever writes to fcrepo is potentially a problem that must be accounted for, but I'm confident that it will be (famous last words).

Contributor

kaladay commented Oct 11, 2024

This URL does work:
https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files/blumberg-holiday%20card_1.jpg/fcr:metadata

However, in the Java code, using:

String rdf = restTemplate.getForObject(url, String.class);

It instead fails with:

2024-10-11T18:27:20.702064802Z edu.tamu.iiif.exception.NotFoundException: RDF not found! https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files/blumberg-holiday%20card_1.jpg/fcr:metadata
2024-10-11T18:27:20.702070190Z 	at edu.tamu.iiif.service.AbstractManifestService.getRdf(AbstractManifestService.java:192) ~[classes!/:na]
2024-10-11T18:27:20.702074796Z 	at edu.tamu.iiif.service.AbstractManifestService.getRdfModel(AbstractManifestService.java:169) ~[classes!/:na

The code is throwing:

throw new NotFoundException("RDF not found! " + url);

Rather than providing the actual error message from the response.
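One possible way to keep the underlying error visible is to attach the original exception as the cause and include its message. The sketch below is JDK-only and does not use the project's real classes: RdfNotFoundException stands in for NotFoundException (whose constructors are not shown in this thread), and the fetcher function stands in for the restTemplate.getForObject call.

```java
import java.util.function.Function;

final class RdfFetchSketch {

    // Hypothetical stand-in for the project's NotFoundException.
    static final class RdfNotFoundException extends RuntimeException {
        RdfNotFoundException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // fetcher is a hypothetical stand-in for the HTTP call that returns
    // the RDF body or throws when the request fails.
    static String getRdf(String url, Function<String, String> fetcher) {
        try {
            return fetcher.apply(url);
        } catch (RuntimeException e) {
            // Keep the original failure: its message goes into the text,
            // and the exception itself is chained as the cause.
            throw new RdfNotFoundException(
                    "RDF not found! " + url + " (cause: " + e.getMessage() + ")", e);
        }
    }
}
```

With the cause chained, the stack trace would show what the HTTP client actually reported instead of only the generic message.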

A curl to that URL returns data:

@prefix premis:  <http://www.loc.gov/premis/rdf/v1#> .
@prefix ns022:  <http://avalonmediasystem.org/rdf/vocab/common#> .
@prefix ns021:  <http://avalonmediasystem.org/rdf/vocab/derivative#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ns020:  <http://avalonmediasystem.org/rdf/vocab/encoding#> .
@prefix ns004:  <http://purl.org/dc/elements/1.1/relation:> .
@prefix ns003:  <http://digital.library.tamu.edu/schemas/> .
@prefix local:  <http://digital.library.tamu.edu/schemas/local/> .
@prefix ns024:  <http://purl.org/dc/elements/1.1/title:> .
@prefix ns002:  <http://www.openarchives.org/ore/terms#> .
@prefix ns023:  <http://schema.org/> .
@prefix xsi:  <http://www.w3.org/2001/XMLSchema-instance> .
@prefix ns001:  <http://pcdm.org/models#> .
@prefix ns008:  <http://purl.org/dc/elements/1.1/subject:> .
@prefix ns007:  <http://purl.org/dc/elements/1.1/identifer:> .
@prefix xmlns:  <http://www.w3.org/2000/xmlns/> .
@prefix ns006:  <http://purl.org/dc/elements/1.1/identifier:> .
@prefix ns005:  <http://purl.org/dc/elements/1.1/rights:> .
@prefix xml:  <http://www.w3.org/XML/1998/namespace> .
@prefix ns009:  <info:fedora/fedora-system:def/model#> .
@prefix dcterms:  <http://purl.org/dc/terms/> .
@prefix fedoraconfig:  <http://fedora.info/definitions/v4/config#> .
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix authz:  <http://fedora.info/definitions/v4/authorization#> .
@prefix test:  <info:fedora/test/> .
@prefix ns011:  <http://avalonmediasystem.org/rdf/vocab/collection#> .
@prefix ns010:  <http://bibframe.org/vocab/> .
@prefix ns015:  <http://projecthydra.org/ns/relations#> .
@prefix ns014:  <info:fedora/fedora-system:def/relations-external#> .
@prefix ns013:  <http://projecthydra.org/ns/auth/acl#> .
@prefix ns012:  <http://www.w3.org/ns/auth/acl#> .
@prefix ns019:  <http://www.openarchives.org/ore/terms/> .
@prefix ns018:  <http://avalonmediasystem.org/rdf/vocab/master_file#> .
@prefix ns017:  <http://avalonmediasystem.org/rdf/vocab/transcoding#> .
@prefix ns016:  <http://avalonmediasystem.org/rdf/vocab/media_object#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fedora:  <http://fedora.info/definitions/v4/repository#> .
@prefix ebucore:  <http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#> .
@prefix ldp:  <http://www.w3.org/ns/ldp#> .
@prefix iana:  <http://www.iana.org/assignments/relation/> .
@prefix xs:  <http://www.w3.org/2001/XMLSchema> .
@prefix dc:  <http://purl.org/dc/elements/1.1/> .

<https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files/blumberg-holiday%20card_1.jpg>
        rdf:type                 ns001:File ;
        rdf:type                 fedora:Binary ;
        rdf:type                 fedora:Resource ;
        dc:filename              "blumberg-holiday card_1.jpg"^^<http://www.w3.org/2001/XMLSchema#string> ;
        fedora:lastModifiedBy    "fedoraAdmin"^^<http://www.w3.org/2001/XMLSchema#string> ;
        premis:hasSize           "537844"^^<http://www.w3.org/2001/XMLSchema#long> ;
        ebucore:hasMimeType      "image/jpeg"^^<http://www.w3.org/2001/XMLSchema#string> ;
        fedora:createdBy         "fedoraAdmin"^^<http://www.w3.org/2001/XMLSchema#string> ;
        fedora:created           "2024-10-02T17:01:15.365Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        premis:hasMessageDigest  <urn:sha1:5be16973c518fb173ee604d64616ac1f082dfb36> ;
        fedora:lastModified      "2024-10-02T17:01:15.365Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        ebucore:filename         "blumberg-holiday card_1.jpg"^^<http://www.w3.org/2001/XMLSchema#string> ;
        rdf:type                 ldp:NonRDFSource ;
        fedora:writable          "false"^^<http://www.w3.org/2001/XMLSchema#boolean> ;
        iana:describedby         <https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files/blumberg-holiday%20card_1.jpg/fcr:metadata> ;
        fedora:hasParent         <https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files> ;
        fedora:hasFixityService  <https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files/blumberg-holiday%20card_1.jpg/fcr:fixity> .

Note how this has a space:

dc:filename              "blumberg-holiday card_1.jpg"^^<http://www.w3.org/2001/XMLSchema#string> ;

This URL fails:

https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files/blumberg-holiday card_1.jpg/fcr:metadata

But the error message printed uses this URL:

Failed to get RDF for https://api-dev.library.tamu.edu/fcrepo/rest/3b/6f/c3/25/3b6fc325-f6ca-41d8-b91e-8c5db3be8c13/basbanes-exhibit-texts-todd-magpietest_objects/17/pages/page_0/files/blumberg-holiday%20card_1.jpg/fcr:metadata: 404 : ...

Which is misleading and a regression around the error message.
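If the request really is being made with a different form of the URL than the one that gets logged, one option is to normalize the URL once, up front, and use that single value for both the request and any error message. The following is a heuristic sketch, not the fix from #151: it assumes an absolute URL with no query or fragment, treats any string that already parses as a URI as "already encoded", and uses invented names (EncodeOnce, encodeOnce) and a placeholder host.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodeOnce {

    // scheme://authority/path, with no query or fragment (assumption).
    private static final Pattern URL_PARTS =
            Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*)://([^/]+)(/.*)?$");

    static String encodeOnce(String url) {
        try {
            // Parses only if every character is already legal, so an
            // already-encoded URL is returned without being touched again.
            return new URI(url).toASCIIString();
        } catch (URISyntaxException notYetEncoded) {
            Matcher m = URL_PARTS.matcher(url);
            if (!m.matches()) {
                throw new IllegalArgumentException("Unsupported URL: " + url, notYetEncoded);
            }
            try {
                // The multi-argument constructor quotes illegal characters
                // such as spaces in the path.
                return new URI(m.group(1), m.group(2), m.group(3), null, null)
                        .toASCIIString();
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException("Cannot encode URL: " + url, e);
            }
        }
    }
}
```

Applying encodeOnce to its own output returns the same string, so calling it in more than one layer would not stack %25 escapes. A file name that legitimately contains a literal % followed by two hex digits would be misread as already encoded, which is the price of the heuristic.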
