bcampbell edited this page Sep 13, 2010 · 4 revisions

Problem:
Articles often have multiple URLs.
Reasons:

  • user input error (missing the www. or adding a trailing slash)
  • tracking cruft (eg params to say that the URL came via an RSS feed)
  • session IDs
  • article appears in multiple sections of a newspaper (eg /global/ and /politics/)
  • redirects after migration to a new CMS

Current solution:
Most sites expose an internal article ID which is unique enough for us to use. We try to pick this ID (we call it the “srcid”) out of the URL and use it to determine whether we’ve already got the article. We only store the srcid and a permalink (usually the first URL we encounter for an article).

  • Good: we can tell whether we’ve already got an article just by looking at the URL, with no need to download it.
  • Bad: Needs to be done per-publication.
  • Bad: Not scalable.
  • Bad: doesn’t handle redirects (eg bulk-renaming when sites change their CMS and entire URL scheme)
  • Bad: doesn’t always pick the canonical URL for the permalink
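
A minimal sketch of the srcid idea, to make the trade-offs above concrete. The hostnames, URL patterns and the `extract_srcid` name are all hypothetical and are not taken from any real scraper; the point is only that each publication needs its own rule, and that a URL with no matching rule yields nothing.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-publication rules (assumption: one regex per site, with
# the site's internal article ID captured in group 1).
SRCID_PATTERNS = {
    "news.example.com": re.compile(r"(\d{6,})(?:\.html?)?/?$"),
    "paper.example.org": re.compile(r"/article/([a-z0-9]+)(?:/|$)"),
}


def extract_srcid(url):
    """Return a srcid such as 'news.example.com_1234567', or None."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # tolerate a missing or extra "www."
    pattern = SRCID_PATTERNS.get(host)
    if pattern is None:
        return None  # no rule written for this publication yet
    # Only the path is searched, so query-string cruft is ignored.
    match = pattern.search(parts.path)
    if match is None:
        return None
    return f"{host}_{match.group(1)}"
```

Under these made-up rules, the /global/ and /politics/ versions of a story map to the same srcid without any download. A bulk rename to a new URL scheme would still defeat it, because the old pattern stops matching.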

Ideas for new approach:

  • Keep track of known URLs for each article.
  • pick one as the canonical one (ie the permalink)
  • use rel=“canonical”
  • track all redirects (assume final destination URL is canonical)
  • Kind of Bad: if we get a new URL, we still have to download the article before we can tell (via redirects or rel=“canonical”) that we’ve already got it. But this only happens once per URL: once we know a URL belongs to an article we’ve already got, we won’t have to download it again when we encounter that URL in the future.
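
The ideas above could be sketched roughly as follows. This is an illustration under stated assumptions, not a finished design: `ArticleUrlIndex`, `find_canonical` and the injected `fetch(url) -> (final_url, html)` callable are hypothetical names, the storage is a pair of in-memory dicts, and preferring rel=“canonical” over the final redirect destination is one possible choice rather than a settled one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if "canonical" in rel and attrs.get("href"):
            self.canonical = attrs["href"]


def find_canonical(html, base_url):
    """Return the absolute rel="canonical" URL of a page, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical is None:
        return None
    return urljoin(base_url, parser.canonical)


class ArticleUrlIndex:
    """Tracks every known URL for each article, plus one permalink."""

    def __init__(self, fetch):
        # fetch(url) -> (final_url_after_redirects, html); supplied by the
        # caller (hypothetical interface, assumed for this sketch).
        self.fetch = fetch
        self.article_for_url = {}  # known URL -> article id
        self.permalink = {}        # article id -> canonical URL
        self.next_id = 1

    def lookup(self, url):
        """Return (article_id, is_new), downloading only for unknown URLs."""
        if url in self.article_for_url:
            return self.article_for_url[url], False  # no download needed
        final_url, html = self.fetch(url)
        # Assumption: rel="canonical" wins; otherwise the final redirect
        # destination is treated as canonical.
        canonical = find_canonical(html, final_url) or final_url
        seen = {url, final_url, canonical}
        article_id = next(
            (self.article_for_url[u] for u in seen if u in self.article_for_url),
            None,
        )
        is_new = article_id is None
        if is_new:
            article_id = self.next_id
            self.next_id += 1
            self.permalink[article_id] = canonical
        for u in seen:
            self.article_for_url[u] = article_id  # remember every alias
        return article_id, is_new
```

A second, previously unseen URL for the same article costs one download, after which it is recorded as an alias and later lookups are answered from the index alone.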

General musings:

  • Who uses rel=“canonical”? BBC, Guardian. Others?
  • Not such a big deal for blogs – they tend to keep their urls sane and unique using slugs or similar.
  • Probably mostly an issue for bigger organisations, who will likely be pretty good with rel=“canonical”.