Create sitemap.xml for all the dataset pages #351
Wrote a simple Python `requests` script to create a usable sitemap.xml as a start.
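The script itself isn't attached here; a minimal sketch of what such a script might look like, assuming a hypothetical `/api/datasets` endpoint that returns a JSON list of objects with `id` fields:

```python
import requests

BASE = "https://example.org"  # placeholder host

# Hypothetical endpoint and response shape, for illustration only.
datasets = requests.get(f"{BASE}/api/datasets").json()

entries = "\n".join(
    f"  <url><loc>{BASE}/datasets/{d['id']}</loc></url>" for d in datasets
)
with open("sitemap.xml", "w") as f:
    f.write(
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )
```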
For very large spaces with very large sets of datasets that don't change all that much (e.g. time series), allow for a sitemap at the dataset or space level, set after/near where you set the space to be public. To have Google and others harvest it, we will have to call the space a schema:Dataset, but we can have another element showing that it is actually something more like a DataCatalog. So I will sketch up something like the other mapping here, so we can get further comment on any new elements. UX-wise, alternately we could always list the space and just have the radio button offer: private, public, public-w/sitemap.
Starting mapping; for spaces it looks like this: `spaceLD = Json.obj(`
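The body of that `Json.obj(...)` call isn't shown above. As a rough, hypothetical illustration of the shape the space mapping might take (the field names and values are assumptions, not the actual Scala implementation), in Python:

```python
# Hypothetical space-level schema.org JSON-LD; all field values are placeholders.
space_ld = {
    "@context": "https://schema.org/",
    # Crawlers such as Google Dataset Search index schema:Dataset pages,
    # so the space is typed as Dataset...
    "@type": "Dataset",
    # ...with an extra element hinting it is really closer to a DataCatalog.
    "additionalType": "https://schema.org/DataCatalog",
    "name": "Example space title",
    "description": "Example space description",
    "url": "https://example.org/spaces/space-id",
}
```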
I will consolidate all the linked notes into one summary that we can OK before I move forward on more.
Most of the changes already crept into the last PR (the other class-to-ld+json scripts), except for actually making the sitemap; that part has code but no comments on it, so I'm not sure I have the OK to finish this.
Have a sitemap.xml branch to try https://github.com/dfabulich/sitemapgen4j, but we could even start by just looping over the datasets and putting them within `<url>` tags, as in the sketch above.
Have a route to get the sitemap.xml and a way to check the cached version |
Decided to make an add-sitemap-route draft PR on this direct branch instead of on the sitemap.xml fork.
We want the new schema.org metadata from issue #335 to be findable by https://datasetsearch.research.google.com via a sitemap.xml listing those pages.
The URL for the sitemap could go to a route that generates it on the fly, or it could be generated by a cron job and cached (maybe as a cfg option).
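A minimal sketch of that cache-or-generate choice (the cache path, TTL, and helper are assumptions, not the actual route code):

```python
import time
from pathlib import Path

CACHE = Path("sitemap.xml")
TTL = 24 * 3600  # hypothetical refresh interval; could become a cfg option


def generate_sitemap() -> str:
    # Placeholder: build the full <urlset> as in the earlier sketch.
    return '<?xml version="1.0" encoding="UTF-8"?><urlset/>'


def get_sitemap() -> str:
    """Serve the cached file if it is fresh enough, else regenerate it."""
    if CACHE.exists() and time.time() - CACHE.stat().st_mtime < TTL:
        return CACHE.read_text()
    xml = generate_sitemap()
    CACHE.write_text(xml)
    return xml
```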
Some sites have so much to crawl that their main sitemap is an index linking to sub-sitemaps of, say, 1k links each (the sitemap protocol caps a single file at 50k URLs anyway). We will have to allow for this; a sketch follows.
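A sketch of that chunking, assuming a flat list of page URLs (the 1k chunk size and the file names are placeholders):

```python
CHUNK = 1000  # placeholder; the protocol allows up to 50k URLs per file


def write_sitemaps(urls: list[str], base: str) -> None:
    """Split urls into numbered sub-sitemaps plus a sitemap index."""
    names = []
    for i in range(0, len(urls), CHUNK):
        name = f"sitemap-{i // CHUNK}.xml"
        names.append(name)
        body = "\n".join(
            f"  <url><loc>{u}</loc></url>" for u in urls[i:i + CHUNK]
        )
        with open(name, "w") as f:
            f.write(
                '<?xml version="1.0" encoding="UTF-8"?>\n'
                '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
                f"{body}\n</urlset>\n"
            )
    # The main sitemap.xml becomes an index pointing at the sub-sitemaps.
    index = "\n".join(
        f"  <sitemap><loc>{base}/{n}</loc></sitemap>" for n in names
    )
    with open("sitemap.xml", "w") as f:
        f.write(
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{index}\n</sitemapindex>\n"
        )
```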
The main starting point is deciding whether the whole endpoint should be made findable for a crawl or only some subset (e.g. a space), so this could also end up as a cfg option at some point down the road.