Add data linking (#394)
* add splink to tools

* add first draft data linking section

* add sections for data linking tables

* ♻️ Refactor file location / add space

---------

Co-authored-by: Gary H <[email protected]>
RossKen and Gary-H9 authored Jan 25, 2024
1 parent 78c9edb commit dcf1551
Showing 4 changed files with 27 additions and 1 deletion.
10 changes: 10 additions & 0 deletions source/data/curated-databases/data-linking/index.html.md.erb
@@ -0,0 +1,10 @@
---
title: Data Linking
weight: 100
last_reviewed_on: 2024-01-24
review_in: 1 year
show_expiry: true
owner_slack: '#ask-data-linking'
---

<%= partial 'documentation/data-docs/curated-databases-docs/data-linking' %>
@@ -0,0 +1,12 @@
# Data Linking

As a department, we struggle with a lack of consistent, reliable, unique identifiers within and across our systems. Unique IDs are critical to getting a true picture of the justice system, and analysts need to be able to link different datasets across domains.

The Internal Data Linking team have created Data Linking tables (using [Splink](https://moj-analytical-services.github.io/splink/index.html) under the hood) for use across Data & Analysis to allow analysts to:

1. Deduplicate Individual Datasets
2. Link between Datasets (i.e. across domains)

The Data Linking tables contain estimated unique IDs attached to each ID within the linked datasets. They function as a lookup table that associates a raw system ID with the unique linked ID we have generated. This linked ID can then be used to deduplicate and/or link datasets.
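
As a rough illustration of the lookup idea described above, the sketch below uses plain Python with invented data. The dataset names, raw IDs, linked IDs and the helper functions `deduplicate` and `link` are all hypothetical and do not reflect the real Data Linking table schema; they only show how a shared linked ID could be used to group duplicates within one dataset and to pair records across two datasets.

```python
from collections import defaultdict

# Hypothetical lookup rows: (source_dataset, raw_system_id, linked_id).
# All names and values are invented for illustration only.
lookup = [
    ("courts", "C001", "P1"),
    ("courts", "C002", "P1"),  # estimated to be the same person as C001
    ("courts", "C003", "P2"),
    ("prisons", "N100", "P1"),
    ("prisons", "N200", "P3"),
]


def deduplicate(lookup, dataset):
    """Group one dataset's raw IDs by the linked ID they map to."""
    groups = defaultdict(list)
    for source, raw_id, linked_id in lookup:
        if source == dataset:
            groups[linked_id].append(raw_id)
    return dict(groups)


def link(lookup, left, right):
    """Pair raw IDs from two datasets that share a linked ID."""
    left_groups = deduplicate(lookup, left)
    right_groups = deduplicate(lookup, right)
    shared = sorted(left_groups.keys() & right_groups.keys())
    return [
        (left_id, right_id)
        for linked_id in shared
        for left_id in left_groups[linked_id]
        for right_id in right_groups[linked_id]
    ]


# Deduplication: raw IDs that share a linked ID are treated as one individual.
courts_people = deduplicate(lookup, "courts")

# Linking: raw IDs from different datasets are joined through the linked ID.
cross_links = link(lookup, "courts", "prisons")
```

In practice the same pattern would more likely be expressed as a SQL join on the lookup table; the in-memory version here is only meant to make the mechanics concrete.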

For more on the Data Linking tables, how they are made and how to use them, check out the [data discovery tool](https://data-discovery-tool.analytical-platform.service.justice.gov.uk/data_linking_anonymised/index.html).
@@ -6,4 +6,5 @@ This is guidance contains information on using curated databases on the Analytic
* [Amazon Athena](amazon-athena/)
* [Querying Athena from the AP](dbtools/)
* [Databases](databases/)
* [Data Linking](data-linking/)
* [Data Discovery Tool](data-documentation/)
5 changes: 4 additions & 1 deletion source/documentation/tools/index.md
@@ -61,6 +61,9 @@ Defined metadata that interacts with other packages (including arrow-pd-parser)
### [pydbtools](https://github.com/moj-analytical-services/pydbtools)
Queries MoJAP athena databases with features such as temp table creation.

### [splink](https://github.com/moj-analytical-services/splink)
Provides the ability to link datasets at scale. Splink is the matching engine behind the linked data on the Analytical Platform. This package is maintained by the Internal Data Linking team; support is offered via the **#ask-data-linking** Slack channel.

## R packages

The following native R packages remove the need for using Python in R projects.
@@ -72,4 +75,4 @@ Allows you to access databases from the Analytical Platform. The Data Engineerin
Allows you to access Athena databases from the Analytical Platform. The Analytical Platform community maintain this package.

### [Rs3tools](https://github.com/moj-analytical-services/Rs3tools)
Allows you to access AWS S3 from the Analytical Platform, which is mainly compatible with the legacy package [s3tools](https://github.com/moj-analytical-services/s3tools). The Analytical Platform community maintain this package.
Allows you to access AWS S3 from the Analytical Platform, which is mainly compatible with the legacy package [s3tools](https://github.com/moj-analytical-services/s3tools). The Analytical Platform community maintain this package.
