Create sitemap.xml for all the dataset pages #351
Wrote a simple Python `requests` script to create a usable sitemap.xml as a start.
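The script itself isn't attached here; a minimal sketch of what such a script might look like, assuming a hypothetical `/api/datasets` endpoint that returns a JSON list of objects with `id` fields:

```python
import requests

BASE = "https://example.org"  # placeholder host

# Hypothetical endpoint and response shape, for illustration only.
datasets = requests.get(f"{BASE}/api/datasets").json()

entries = "\n".join(
    f"  <url><loc>{BASE}/datasets/{d['id']}</loc></url>" for d in datasets
)
with open("sitemap.xml", "w") as f:
    f.write(
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )
```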
For very large spaces with very large sets of datasets that don't change all that much (e.g. time series), allow for a sitemap at the dataset or space level, set after/near where you set the space to be public. To have Google and others harvest it, we will have to call the space a schema:Dataset, but we can have another element showing that it is actually something more like a DataCatalog. So I will sketch up something like the other mapping here, so we can get further comment on any new elements. UX-wise, alternately we could always list the space and just have the radio button offer: private, public, public-w/sitemap.
Starting mapping; for spaces it looks like this: `spaceLD = Json.obj(`
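The body of that `Json.obj(...)` call isn't shown above. As a rough, hypothetical illustration of the shape the space mapping might take (the field names and values are assumptions, not the actual Scala implementation), in Python:

```python
# Hypothetical space-level schema.org JSON-LD; all field values are placeholders.
space_ld = {
    "@context": "https://schema.org/",
    # Crawlers such as Google Dataset Search index schema:Dataset pages,
    # so the space is typed as Dataset...
    "@type": "Dataset",
    # ...with an extra element hinting it is really closer to a DataCatalog.
    "additionalType": "https://schema.org/DataCatalog",
    "name": "Example space title",
    "description": "Example space description",
    "url": "https://example.org/spaces/space-id",
}
```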
I will consolidate all the linked notes into one summary that we can OK before I move forward on more.
Most of the changes already crept into the last PR (the other class-to-ld+json scripts), except for actually making the sitemap; that part has code but no comments on it, so I'm not sure I have the OK to finish this.
Have a sitemap.xml branch to try https://github.com/dfabulich/sitemapgen4j, but we could even start by just looping over the datasets and putting them within `<url>` tags, as in the sketch above.
Have a route to get the sitemap.xml and a way to check the cached version |
Decided to make an add-sitemap-route draft PR on this direct branch instead of on the sitemap.xml fork.
We want the new schema.org metadata from issue #335 to be findable by https://datasetsearch.research.google.com via a sitemap.xml listing those pages.
The URL for the sitemap could go to a route that generates it on the fly, or it could be generated by a cron job and cached (maybe as a cfg option).
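A minimal sketch of that cache-or-generate choice (the cache path, TTL, and helper are assumptions, not the actual route code):

```python
import time
from pathlib import Path

CACHE = Path("sitemap.xml")
TTL = 24 * 3600  # hypothetical refresh interval; could become a cfg option


def generate_sitemap() -> str:
    # Placeholder: build the full <urlset> as in the earlier sketch.
    return '<?xml version="1.0" encoding="UTF-8"?><urlset/>'


def get_sitemap() -> str:
    """Serve the cached file if it is fresh enough, else regenerate it."""
    if CACHE.exists() and time.time() - CACHE.stat().st_mtime < TTL:
        return CACHE.read_text()
    xml = generate_sitemap()
    CACHE.write_text(xml)
    return xml
```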
Some sites have so much to crawl that their main sitemap is an index linking to sub-sitemaps of, say, 1k links each (the sitemap protocol caps a single file at 50k URLs anyway). We will have to allow for this; a sketch follows.
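A sketch of that chunking, assuming a flat list of page URLs (the 1k chunk size and the file names are placeholders):

```python
CHUNK = 1000  # placeholder; the protocol allows up to 50k URLs per file


def write_sitemaps(urls: list[str], base: str) -> None:
    """Split urls into numbered sub-sitemaps plus a sitemap index."""
    names = []
    for i in range(0, len(urls), CHUNK):
        name = f"sitemap-{i // CHUNK}.xml"
        names.append(name)
        body = "\n".join(
            f"  <url><loc>{u}</loc></url>" for u in urls[i:i + CHUNK]
        )
        with open(name, "w") as f:
            f.write(
                '<?xml version="1.0" encoding="UTF-8"?>\n'
                '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
                f"{body}\n</urlset>\n"
            )
    # The main sitemap.xml becomes an index pointing at the sub-sitemaps.
    index = "\n".join(
        f"  <sitemap><loc>{base}/{n}</loc></sitemap>" for n in names
    )
    with open("sitemap.xml", "w") as f:
        f.write(
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{index}\n</sitemapindex>\n"
        )
```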
The main starting point is deciding whether the whole endpoint should be made findable for a crawl or only some subset (e.g. a space), so this could also end up as a cfg option at some point down the road.