Solr collections used for indexing #742

Open
hectorcorrea opened this issue Jan 14, 2025 · 0 comments

hectorcorrea commented Jan 14, 2025

Our current approach to indexing data from PDC Discovery creates a new Solr collection, indexes into it, and then swaps Solr to point to the new collection. This was needed because of the way we were indexing data from two sources (PDC Describe and DataSpace), but once we stop indexing DataSpace we won't need this process anymore.

I suggest ditching this process once we stop harvesting from DataSpace, since collection creation has caused issues before: Solr stopped accepting requests to create and delete collections (although reading from collections still worked).

We should also figure out how to handle withdrawals from PDC (how to remove them from the index). One approach is to tag every record with a timestamp indicating when it was indexed and, at the end of the index process, delete all records whose timestamp is older than the time the rake task was run (meaning, delete the Solr documents that were not touched during the index).
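As a rough illustration of the timestamp idea, here is a minimal sketch in Python (not the project's actual rake task). The field name `indexed_at_dt` and the helper names are assumptions made up for this example. It only builds the stamped document and the JSON body that Solr's update handler accepts for a delete-by-query; sending the requests is left out.

```python
import json
from datetime import datetime, timezone

# Hypothetical field name; the real schema may use something different.
TIMESTAMP_FIELD = "indexed_at_dt"


def solr_timestamp(moment):
    """Format a timezone-aware datetime as a UTC Solr date string."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def stamp(doc, run_started_at):
    """Return a copy of the document tagged with the start time of this run."""
    return {**doc, TIMESTAMP_FIELD: solr_timestamp(run_started_at)}


def stale_delete_command(run_started_at):
    """Build a delete-by-query body matching documents stamped before this run.

    The closing '}' makes the upper bound exclusive, so documents stamped
    with exactly run_started_at (the ones just indexed) are kept.
    """
    query = f"{TIMESTAMP_FIELD}:[* TO {solr_timestamp(run_started_at)}}}"
    return json.dumps({"delete": {"query": query}})


if __name__ == "__main__":
    started = datetime.now(timezone.utc)
    print(stamp({"id": "example-1"}, started))
    print(stale_delete_command(started))
```

In this sketch every document in a run gets the same timestamp (the time the run started), so the final delete only needs one value to compare against. Documents that have no value in the timestamp field would not match the range query, so they would presumably need a one-time full reindex before this cleanup could remove them.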

Related to #684
