IRIIIF interaction with Cantaloupe is problematic, consider caching changes. #146

Open
kaladay opened this issue Sep 20, 2024 · 2 comments
Labels
story A user story or similar.

kaladay commented Sep 20, 2024

This is created following investigations of:

There are different, competing ideas and solutions for addressing the performance and configuration problems. This story is meant to list them so that they can be groomed and broken down further into quests, features, bugs, etc.

Long Term Caching vs Redis Caching or Other Caching

The caching system in place includes Redis.
The Redis data is derived on the fly, as needed.
The Redis cache has a relatively short expiration (for example, 30 days).

Rather than focusing on short-lived caching, we might want to focus on long-term-caching.
This is really just making a copy of a file and storing it on disk indefinitely, until the source file changes. Only when the source file changes should the cached file be deleted or re-created.

The long-term-caching would be ideal for well-known or common files.
Unusual requests, such as rotating an image by 2 degrees, would not fall under this use case.
Exactly when to use long-term-caching and when not to is still to be determined and is not described here.
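
As a rough illustration of the "keep the copy until the source changes" idea, here is a minimal Python sketch. It is not how IRIIIF works today; the function names, the cache directory layout, and the choice of a SHA-256 fingerprint to detect changes are all assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def source_fingerprint(source: Path) -> str:
    """Return a SHA-256 digest that changes whenever the file content changes."""
    digest = hashlib.sha256()
    with source.open("rb") as handle:
        # Read in 1 MiB chunks so large images are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def get_long_term_copy(source: Path, cache_dir: Path) -> Path:
    """Return a cached copy of `source`, re-creating it only if the source changed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / source.name                      # hypothetical layout
    stamp = cache_dir / (source.name + ".sha256")         # fingerprint of the source

    current = source_fingerprint(source)
    if cached.exists() and stamp.exists() and stamp.read_text() == current:
        return cached  # source unchanged: keep the cached file, no expiration

    # Missing or stale copy: (re)create it and record the new fingerprint.
    shutil.copyfile(source, cached)
    stamp.write_text(current)
    return cached
```

In a real setup the `shutil.copyfile` step is where a derived file would be produced instead of a plain copy. Hashing the whole source on every request is expensive for large images, so comparing modification time and size may be a more practical change check.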

Other caching, including Redis, may still have a place, but the caching situation as a whole still needs to be reconsidered.

Caching Special TIFF via IRIIIF

It seems like IRIIIF can potentially be utilized to provide long-term-caching of problematic TIFF files.

There are two major types of TIFF files that are relevant here.

  1. Strip Based
  2. Tile Based

(There are other states that may be significant but are not directly relevant here, such as seekable, sequential, and pyramidal TIFFs.)

Most TIFF encoders produce Strip Based TIFFs, but those become slower to read as the file gets larger.

Having IRIIIF convert Strip Based TIFFs into Tile Based TIFFs in a long-term-cache should result in faster loading and processing by Cantaloupe.
This should help address major performance problems.
This long-term-cache can then be used by the Cantaloupe server to create its short-term-cache of images.
This long-term-cache would also allow us to avoid modifying the source image in any way.
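
To decide which files would need converting, the layout can be read from the TIFF header without decoding any pixels. The following Python sketch is an illustration, not IRIIIF code: the function name is made up, it only inspects the first image directory, and it only handles classic TIFF (not BigTIFF). The tag numbers come from the TIFF 6.0 specification.

```python
import struct
from typing import BinaryIO

TILE_OFFSETS_TAG = 324   # TileOffsets, present in Tile Based TIFFs
STRIP_OFFSETS_TAG = 273  # StripOffsets, present in Strip Based TIFFs


def tiff_layout(stream: BinaryIO) -> str:
    """Classify the first image of a classic TIFF as 'tiled', 'stripped', or 'unknown'."""
    header = stream.read(8)
    if len(header) < 8 or header[:2] not in (b"II", b"MM"):
        raise ValueError("not a TIFF file")
    endian = "<" if header[:2] == b"II" else ">"  # II = little-endian, MM = big-endian

    magic, ifd_offset = struct.unpack(endian + "HI", header[2:8])
    if magic != 42:
        raise ValueError("only classic TIFF (magic 42) is handled in this sketch")

    # The first image file directory: a 2-byte entry count, then 12-byte entries.
    stream.seek(ifd_offset)
    (entry_count,) = struct.unpack(endian + "H", stream.read(2))
    entries = stream.read(12 * entry_count)

    # Each entry starts with its 2-byte tag number; that is all we need here.
    tags = {
        struct.unpack_from(endian + "H", entries, index * 12)[0]
        for index in range(entry_count)
    }
    if TILE_OFFSETS_TAG in tags:
        return "tiled"
    if STRIP_OFFSETS_TAG in tags:
        return "stripped"
    return "unknown"
```

Usage would look like `with open(path, "rb") as f: layout = tiff_layout(f)`, and only files reported as `"stripped"` would be candidates for conversion into the long-term-cache.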

We can also consider creating a Pyramidal TIFF in the long-term-cache.
These images would be created at different commonly used resolutions to allow for more efficient zooming.
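
For a sense of what a pyramid would contain, this small Python sketch lists the resolutions under one common convention: each level halves the previous one until the whole image fits in a single tile. The halving rule, the rounding up, and the 256 pixel tile size are assumptions for illustration; the actual levels depend on whichever tool writes the pyramid.

```python
from typing import List, Tuple


def pyramid_levels(width: int, height: int, tile_size: int = 256) -> List[Tuple[int, int]]:
    """List (width, height) per pyramid level, from full size down to one tile."""
    levels = [(width, height)]
    while width > tile_size or height > tile_size:
        # Halve each dimension, rounding up, and never go below 1 pixel.
        width = max(1, (width + 1) // 2)
        height = max(1, (height + 1) // 2)
        levels.append((width, height))
    return levels
```

A 1000 x 800 source, for example, gives three levels under these assumptions: the full image, 500 x 400, and 250 x 200.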

From the Cantaloupe Documentation:

For efficient deep zooming, TIFF images need to be pyramidal, and each level of the pyramid must be tiled.

Stop Managing DSpace

We are not using DSpace 7.
DSpace 7 now has its own IIIF server.

Why should IRIIIF manage DSpace now?
Is there any reason to do this?

It may or may not be a good idea to simplify IRIIIF by removing the DSpace functionality and letting DSpace provide its IIIF support directly.

This would save on network traffic and complexity.
I would imagine that this would improve performance and increase maintainability.

This would either require clients to interact directly with DSpace's IIIF functionality, or Ingress could be used to mask the server.

Pass DSpace Through

Rather than removing the DSpace functionality entirely, IRIIIF could just act as a pass-through where appropriate.

Anything that IRIIIF needs to handle can be handled.
Anything that can go straight through into DSpace can just be routed.
The remote servers therefore only need to talk to IRIIIF and do not have to choose between DSpace and IRIIIF.
No masking via Ingress is needed.

This allows for some simplification and some improvement in maintainability, but not as much as completely pulling DSpace out.
This still incurs unnecessary network traffic due to the routing of the data (passing through) and is sub-optimal compared to directly going to DSpace.

Metadata into XMP Rather than EXIF, et al.

The Cantaloupe documentation regarding the metadata shows that only XMP is well supported.

We could utilize long-term-caching to cache this metadata if we convert anything not using XMP into using XMP.
Anything already in XMP could either be passed through or cached along with the converted data.

Metadata into XMP Rather than EXIF, et al., but not in IRIIIF

Cantaloupe itself documents ways to modify XMP data.

It could be that Cantaloupe provides ways to do this conversion itself, without involving IRIIIF.

As noted in the documentation, Cantaloupe also supports caching of this XMP Metadata.

After metadata is initially read from a source image, it may be cached, in which case subsequent requests will read it from the cache rather than the source image. In this case, changes to the source image's metadata will not be reflected in the application until the cached metadata becomes invalid and is re-read. If you need to change a source image's metadata, you should manually purge any cached content relating to it afterwards.

Cantaloupe needs further investigation to determine whether it can convert metadata into XMP for us.
If so, having Cantaloupe do this would save a lot of programming and maintenance effort.

Use Delegates Instead

Cantaloupe supports the use of delegates.

These can be written in languages like Ruby or Java.
It may be that some or all of the functionality provided by IRIIIF can be removed and instead added as a Delegate directly on Cantaloupe.

kaladay added the "story" label (A user story or similar.) Sep 20, 2024
kaladay changed the title from "IRIIIF interaction with Cantaloupe is problematic." to "IRIIIF interaction with Cantaloupe is problematic, consider caching changes." Sep 20, 2024
markpbaggett (Member) commented

Caching Special TIFF via IRIIIF
It seems like IRIIIF can potentially be utilized to provide long-term-caching of problematic TIFF files.

There are two major types of TIFF files that are relevant here.

Strip Based
Tile Based
(There are other states that may be significant but are not directly relevant here, such as seekable, sequential, and pyramidal TIFFs.)

Most TIFF encoders produce Strip Based TIFFs but those become slower to read the larger the file is.

Having the IRIIIF convert Strip Based TIFFs into Tile Based TIFFs in a long-term-cache should result in a faster load and processing time for/by Cantaloupe.
This should help address major performance problems.
This long-term-cache can then be used to create short-term-cache of images by the Cantaloupe server.
This long-term-cache would also allow us to avoid modifying the source image in any way.

I just wanted to add that one of the things I'm wrestling with right now is figuring out what files we should be storing in Fedora and, ultimately, what files Cantaloupe and irIIIFService should be interacting with. There is a belief from DiSC that we are, and have been, taking their files and creating appropriate Intermediate Files from them. This is tricky, as it's hard to understand the thinking that went into creating the original preservation file. But if we are going to consider making irIIIFService convert TIFFs, maybe we should be doing this before they even come into the repository, through our processing and ingest practices.

kaladay commented Sep 24, 2024

This chart here is a good justification for using only Pyramidal Tiled TIFFs (when using TIFFs).

(It also shows that using JP2 for source images is even better than TIFF because then we can utilize potentially faster processors like Kakadu and OpenJPEG.)
