conventoarchiver

Repository for collecting scripts to help capture MyConvento newsroom press-releases from the MyConvento PR management suite.

More on MyConvento: https://www.myconvento.com/en/welcome/ (in German: https://www.myconvento.com/).

Introduction

MyConvento is a "newsroom" platform. It provides a means for organizations to write press releases and other similar assets and then publish them online.

Integration, at least in the cases observed so far, is via iframe and redirect. The host organization only ever points to a page on a MyConvento newsroom site.

Archiving MyConvento content is difficult because most agents cannot easily navigate to the content directly. The content lives in an iframe served from a site other than the one being requested, so a crawler tends to retrieve only the host website and not the press release.
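To see where a host page actually pulls its content from, one option is to list the `src` of every iframe on that page. The following is a minimal sketch using only the Python standard library. The sample markup and both `example.invalid` URLs are invented for illustration and are not real MyConvento addresses.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Collect the src attribute of every <iframe> in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative src values against the host page URL.
                self.sources.append(urljoin(self.base_url, src))


def find_iframe_sources(html, base_url):
    """Return the absolute URLs of all iframes found in html."""
    finder = IframeFinder(base_url)
    finder.feed(html)
    return finder.sources


# Hypothetical host page: markup and URLs are assumptions for this example.
SAMPLE_HOST_PAGE = """
<html><body>
  <h1>Media newsroom</h1>
  <iframe src="https://newsroom.example.invalid/story?id=1"></iframe>
</body></html>
"""

if __name__ == "__main__":
    for src in find_iframe_sources(
        SAMPLE_HOST_PAGE, "https://host.example.invalid/media-newsroom/"
    ):
        print(src)
```

Any URL this reports is a candidate for archiving in its own right, since the crawler would otherwise stop at the host page.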

Demo

Take a look at how the redirect works on what I think is a demo site:

Redirects to:

The second link here is the one you would normally see from a corporate website. The link to MyConvento is entirely opaque unless you inspect the source and identify where the content is being retrieved from.

If you try to save the second link in the Internet Archive (at the time of writing, December 2021) you end up with an IA slug that looks as follows:

  • /web/20211217121622/https://www.membratech-b2b-portal.com/media-newsroom/

Today's archive page:

NB. That IA slug is: https://www.membratech-b2b-portal.com/media-newsroom/

From there you cannot see the news article that we were originally trying to save; only the host page has been captured.
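For reference, an IA slug of this shape is a 14-digit capture timestamp followed by the original URL, and prefixing it with `https://web.archive.org` gives the full Wayback Machine address. A small sketch of taking the slug apart (the helper names `wayback_url` and `split_slug` are my own, not part of any library):

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org"


def wayback_url(slug):
    """Turn a '/web/<timestamp>/<url>' slug into a full Wayback Machine URL."""
    return WAYBACK_PREFIX + slug


def split_slug(slug):
    """Split a slug into (capture time, original URL)."""
    prefix = "/web/"
    if not slug.startswith(prefix):
        raise ValueError(f"not a Wayback slug: {slug!r}")
    # Only the first '/' after the timestamp separates it from the URL.
    timestamp, _, original = slug[len(prefix):].partition("/")
    captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return captured, original


if __name__ == "__main__":
    slug = "/web/20211217121622/https://www.membratech-b2b-portal.com/media-newsroom/"
    print(wayback_url(slug))
    print(split_slug(slug))
```

Splitting the slug makes it easy to confirm that the archived URL is the host page rather than the MyConvento article.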

Having worked through these permutations, it is still difficult to see exactly how to archive MyConvento sites from the host URL alone.

After truncating the MyConvento URL, reducing it to just the article ID, one can create a URL that links directly to a page hosted by MyConvento, without the redirect to the host organization and its associated iframe content.

That direct URL can be archived more conventionally.

Archiving the index

At the time of writing, I didn't have a general approach for accessing a company's newsroom index. I managed to find, via Google, a link that didn't rewrite the URL. The addition of &c=1 seems to make the page static, although there is other noise in the URL that might also have an impact (but doesn't seem to).

For the example above the static newsroom is at:

Change the integer after id= to access the permalink of your newsroom.
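Swapping the `id=` value and making sure `c=1` is present can be done with the standard library's URL tools. This is a sketch only: the function name `newsroom_permalink`, the `example.invalid` host, the path and the `lang` parameter are all assumptions for illustration, and only `id` and `c` come from the description above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def newsroom_permalink(url, newsroom_id):
    """Return url with its id parameter replaced and c=1 set.

    Other query parameters are kept in their original order. Repeated
    parameter names are collapsed to their last value.
    """
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params["id"] = str(newsroom_id)
    # c=1 is the flag that appears to make the newsroom page static.
    params["c"] = "1"
    return urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == "__main__":
    # Hypothetical URL: host, path and 'lang' are invented for this example.
    template = "https://newsroom.example.invalid/index.php?id=111&lang=en"
    print(newsroom_permalink(template, 222))
```

Because the effect of the other parameters is not known, the sketch leaves them untouched rather than stripping them.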

Process

To archive a MyConvento newsroom, therefore, you need to take the source URL of the newsroom, which acts as an index. Using the example above, the index would look as follows:

That redirects to:

From the newsroom page, save all the story items. There are approximately 40 on a full page, though that number may be configurable by the host (URL parameters do not appear to change it).

For each story, identify the PDF associated with the story, and optionally read the page title.

Save each out to a list to then be processed, i.e. submitted to an internet archive.
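The per-story steps above can be sketched as follows. This is not the repository's actual script: it assumes, without having confirmed it against a real newsroom page, that a story's PDF is an ordinary `<a href>` link whose path ends in `.pdf`. The sample markup and `example.invalid` URLs are invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class PageScanner(HTMLParser):
    """Collect the page title and every link target on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scan_story(html, story_url):
    """Return the story URL, its title, and any PDF links found."""
    scanner = PageScanner(story_url)
    scanner.feed(html)
    # Assumption: PDFs are recognisable by a path ending in '.pdf'.
    pdfs = [
        link for link in scanner.links
        if urlsplit(link).path.lower().endswith(".pdf")
    ]
    return {"url": story_url, "title": scanner.title.strip(), "pdfs": pdfs}


def urls_to_archive(stories):
    """Flatten story records into an ordered list of unique URLs."""
    queue = []
    for story in stories:
        for url in [story["url"], *story["pdfs"]]:
            if url not in queue:
                queue.append(url)
    return queue


# Hypothetical story page, invented for this example.
SAMPLE_STORY = """
<html><head><title> Example press release </title></head>
<body>
  <a href="/files/release-1.pdf">Download PDF</a>
  <a href="/contact">Contact</a>
</body></html>
"""

if __name__ == "__main__":
    story = scan_story(SAMPLE_STORY, "https://newsroom.example.invalid/story?id=1")
    for url in urls_to_archive([story]):
        print(url)
```

The resulting list can then be handed to whichever archiving tool is in use. For the Wayback Machine, for example, each entry could be submitted by requesting `https://web.archive.org/save/` followed by the URL.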

Other aspects of pages, including media files (<!-- Mediafiles -->), could be archived too, but this is best left to the archiving system itself, by enabling its option to save outlinks where one is available.

License

GNU GPLv3.