Add data linking (#394)

* add splink to tools
* add first draft data linking section
* add sections for data linking tables
* ♻️ Refactor file location / add space

---------

Co-authored-by: Gary H <[email protected]>
moj-analytical-services · Jan 25, 2024 · dcf1551
1 parent 78c9edb

Showing 4 changed files with 27 additions and 1 deletion.
diff --git a/source/data/curated-databases/data-linking/index.html.md.erb b/source/data/curated-databases/data-linking/index.html.md.erb
@@ -0,0 +1,10 @@
+---
+title: Data Linking
+weight: 100
+last_reviewed_on: 2024-01-24
+review_in: 1 year
+show_expiry: true
+owner_slack: '#ask-data-linking'
+---
+
+<%= partial 'documentation/data-docs/curated-databases-docs/data-linking' %>
diff --git a/source/documentation/data-docs/curated-databases-docs/data-linking.md b/source/documentation/data-docs/curated-databases-docs/data-linking.md
@@ -0,0 +1,12 @@
+# Data Linking
+
+As a department, we struggle with a lack of consistent, reliable, unique identifiers within and across our systems. Unique IDs are critical to getting a true picture of the justice system, and as an analyst it is important to be able to link together different datasets across domains.
+
+The Internal Data Linking team have created Data Linking tables (using [Splink](https://moj-analytical-services.github.io/splink/index.html) under the hood) for use across Data & Analysis to allow analysts to:
+
+1. Deduplicate Individual Datasets
+2. Link between Datasets (i.e. across domains)
+
+The Data Linking tables contain estimated unique IDs attached to each ID within the linked datasets. They function as a lookup table that associates a raw system ID with the unique linked ID we have generated. This linked ID can then be used to deduplicate and/or link datasets.
+
+For more on the Data Linking tables, how they are made and how to use them, check out the [data discovery tool](https://data-discovery-tool.analytical-platform.service.justice.gov.uk/data_linking_anonymised/index.html).
diff --git a/source/documentation/data-docs/curated-databases-docs/index.md b/source/documentation/data-docs/curated-databases-docs/index.md
@@ -6,4 +6,5 @@ This is guidance contains information on using curated databases on the Analytic
 * [Amazon Athena](amazon-athena/)
 * [Querying Athena from the AP](dbtools/)
 * [Databases](databases/)
+* [Data Linking](data-linking/)
 * [Data Discovery Tool](data-documentation/)
diff --git a/source/documentation/tools/index.md b/source/documentation/tools/index.md
@@ -61,6 +61,9 @@ Defined metadata that interacts with other packages (including arrow-pd-parser)
 ### [pydbtools](https://github.com/moj-analytical-services/pydbtools)
 Queries MoJAP athena databases with features such as temp table creation.
 
+### [splink](https://github.com/moj-analytical-services/splink)
+Provides the ability to link datasets at scale. Splink is the matching engine behind the linked data on the Analytical Platform. This package is maintained by the Internal Data Linking team; support is offered via the **#ask-data-linking** Slack channel.
+
 ## R packages
 
 The following native R packages remove the need for using Python in R projects.
@@ -72,4 +75,4 @@ Allows you to access databases from the Analytical Platform. The Data Engineerin
 Allows you to access Athena databases from the Analytical Platform. The Analytical Platform community maintain this package.
 
 ### [Rs3tools](https://github.com/moj-analytical-services/Rs3tools)
-Allows you to access AWS S3 from the Analytical Platform, which is mainly compatible with the legacy package [s3tools](https://github.com/moj-analytical-services/s3tools). The Analytical Platform community maintain this package.
+Allows you to access AWS S3 from the Analytical Platform, which is mainly compatible with the legacy package [s3tools](https://github.com/moj-analytical-services/s3tools). The Analytical Platform community maintain this package.