From dcf1551379bf52c250912fa52254885b63258199 Mon Sep 17 00:00:00 2001
From: Ross Kennedy
Date: Thu, 25 Jan 2024 10:26:44 +0100
Subject: [PATCH] Add data linking (#394)

* add splink to tools
* add first draft data linking section
* add sections for data linking tables
* :recycle: Refactor file location / add space

---------

Co-authored-by: Gary H <26419401+Gary-H9@users.noreply.github.com>
---
 .../curated-databases/data-linking/index.html.md.erb | 10 ++++++++++
 .../data-docs/curated-databases-docs/data-linking.md | 12 ++++++++++++
 .../data-docs/curated-databases-docs/index.md        |  1 +
 source/documentation/tools/index.md                  |  5 ++++-
 4 files changed, 27 insertions(+), 1 deletion(-)
 create mode 100644 source/data/curated-databases/data-linking/index.html.md.erb
 create mode 100644 source/documentation/data-docs/curated-databases-docs/data-linking.md

diff --git a/source/data/curated-databases/data-linking/index.html.md.erb b/source/data/curated-databases/data-linking/index.html.md.erb
new file mode 100644
index 00000000..bbccfdce
--- /dev/null
+++ b/source/data/curated-databases/data-linking/index.html.md.erb
@@ -0,0 +1,10 @@
+---
+title: Data Linking
+weight: 100
+last_reviewed_on: 2024-01-24
+review_in: 1 year
+show_expiry: true
+owner_slack: '#ask-data-linking'
+---
+
+<%= partial 'documentation/data-docs/curated-databases-docs/data-linking' %>
diff --git a/source/documentation/data-docs/curated-databases-docs/data-linking.md b/source/documentation/data-docs/curated-databases-docs/data-linking.md
new file mode 100644
index 00000000..4cdab8a2
--- /dev/null
+++ b/source/documentation/data-docs/curated-databases-docs/data-linking.md
@@ -0,0 +1,12 @@
+# Data Linking
+
+As a department, we struggle with a lack of consistent, reliable, unique identifiers within and across our systems. Unique IDs are critical to getting a true picture of the justice system and as an analyst it is important to be able to link together different datasets across domains.
+
+The Internal Data Linking team have created Data Linking tables (using [Splink](https://moj-analytical-services.github.io/splink/index.html) under the hood) for use across Data & Analysis to allow analysts to:
+
+1. Deduplicate Individual Datasets
+2. Link between Datasets (i.e. across domains)
+
+The Data Linking tables contain estimated unique IDs attached to each ID within the linked datasets. They function as a lookup table that associates a raw system ID with the unique linked ID we have generated. This linked ID can then be used to deduplicate and/or link datasets.
+
+For more on the Data Linking tables, how they are made and how to use them, check out the [data discovery tool](https://data-discovery-tool.analytical-platform.service.justice.gov.uk/data_linking_anonymised/index.html).
diff --git a/source/documentation/data-docs/curated-databases-docs/index.md b/source/documentation/data-docs/curated-databases-docs/index.md
index 05d7a707..50c60c36 100644
--- a/source/documentation/data-docs/curated-databases-docs/index.md
+++ b/source/documentation/data-docs/curated-databases-docs/index.md
@@ -6,4 +6,5 @@ This is guidance contains information on using curated databases on the Analytic
 * [Amazon Athena](amazon-athena/)
 * [Querying Athena from the AP](dbtools/)
 * [Databases](databases/)
+* [Data Linking](data-linking/)
 * [Data Discovery Tool](data-documentation/)
diff --git a/source/documentation/tools/index.md b/source/documentation/tools/index.md
index 81354c9f..c1f827a2 100644
--- a/source/documentation/tools/index.md
+++ b/source/documentation/tools/index.md
@@ -61,6 +61,9 @@ Defined metadata that interacts with other packages (including arrow-pd-parser)
 ### [pydbtools](https://github.com/moj-analytical-services/pydbtools)
 Queries MoJAP athena databases with features such as temp table creation.
 
+### [splink](https://github.com/moj-analytical-services/splink)
+Provides the ability to link datasets at scale. 
Splink is the matching engine behind the linked data on the Analytical Platform. This package is maintained by the Internal Data Linking team, and support is offered via the **#ask-data-linking** Slack channel.
+
 ## R packages
 The following native R packages remove the need for using Python in R projects.
 
@@ -72,4 +75,4 @@ Allows you to access databases from the Analytical Platform. The Data Engineerin
 Allows you to access Athena databases from the Analytical Platform. The Analytical Platform community maintain this package.
 
 ### [Rs3tools](https://github.com/moj-analytical-services/Rs3tools)
-Allows you to access AWS S3 from the Analytical Platform, which is mainly compatible with the legacy package [s3tools](https://github.com/moj-analytical-services/s3tools). The Analytical Platform community maintain this package.
\ No newline at end of file
+Allows you to access AWS S3 from the Analytical Platform, which is mainly compatible with the legacy package [s3tools](https://github.com/moj-analytical-services/s3tools). The Analytical Platform community maintain this package.
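
Reviewer note: the lookup-table behaviour described in `data-linking.md` in this patch (a raw system ID mapped to an estimated linked ID, then used to deduplicate or link) can be sketched in plain Python. This is an illustration only, not the Data Linking team's implementation or the real table schema. The dataset names, ID values, and the `LOOKUP`, `deduplicate` and `link` names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lookup rows: (source_dataset, raw_system_id) -> estimated linked ID.
# Every name and value here is invented for illustration; the real tables differ.
LOOKUP = {
    ("courts", "C-001"): "person_1",
    ("courts", "C-002"): "person_1",  # same individual recorded twice in one system
    ("courts", "C-003"): "person_2",
    ("prisons", "P-100"): "person_1",
    ("prisons", "P-101"): "person_3",
}


def deduplicate(dataset, raw_ids):
    """Group one dataset's raw IDs by the linked ID they map to."""
    groups = defaultdict(list)
    for raw_id in raw_ids:
        groups[LOOKUP[(dataset, raw_id)]].append(raw_id)
    return dict(groups)


def link(left, left_ids, right, right_ids):
    """Pair raw IDs from two datasets that share a linked ID."""
    right_by_linked = deduplicate(right, right_ids)
    pairs = []
    for raw_id in left_ids:
        linked_id = LOOKUP[(left, raw_id)]
        for match in right_by_linked.get(linked_id, []):
            pairs.append((raw_id, match))
    return pairs


court_ids = ["C-001", "C-002", "C-003"]
prison_ids = ["P-100", "P-101"]

# Use case 1: deduplicate a single dataset.
courts_groups = deduplicate("courts", court_ids)

# Use case 2: link between datasets (across domains).
cross_links = link("courts", court_ids, "prisons", prison_ids)
```

Here `courts_groups` collects `C-001` and `C-002` under one linked ID, and `cross_links` pairs each of those two court records with prison record `P-100`. Because the linked IDs are estimates, any grouping or pairing produced this way inherits the uncertainty of the underlying match.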