
Update paper with more examples
- added hospital data example
- added drug use data example
vankesteren committed Nov 7, 2024
1 parent be1aa22 commit 44b9857
Showing 7 changed files with 227 additions and 121 deletions.
3 changes: 3 additions & 0 deletions .gitignore

# Generated api docs stuff
docs/source/api/generated

# uv stuff
uv.lock
17 changes: 12 additions & 5 deletions docs/paper/paper.bib

@article{vankesteren2024democratize,
  title = {To democratize research with sensitive data, we should make synthetic data more accessible},
  author = {{van Kesteren}, Erik-Jan},
  journal = {Patterns},
  publisher = {Elsevier BV},
  volume = {5},
  number = {9},
  pages = {101049},
  year = {2024},
  month = sep,
  issn = {2666-3899},
  doi = {10.1016/j.patter.2024.101049},
  url = {http://dx.doi.org/10.1016/j.patter.2024.101049}
}
141 changes: 66 additions & 75 deletions docs/paper/paper.md

# Software features

At its core, `metasyn` has three main functions: __estimation__, to fit a generative model to a properly formatted tabular dataset; __generation__, to synthesize new datasets based on a fitted model; and __(de)serialization__, to create a file from the fitted model for auditing, editing, and saving.
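
As a brief sketch of how these functions fit together (assuming `df` is a pre-processed `polars` data frame and `"metadata.json"` is an arbitrary file name), a typical workflow looks as follows:

```python
from metasyn import MetaFrame

# Estimation: fit the generative metadata model to the real data
mf = MetaFrame.fit_dataframe(df)

# (De)serialization: store the model as an auditable .json file and load it back
mf.save("metadata.json")
mf_new = MetaFrame.load("metadata.json")

# Generation: create a synthetic dataset of 100 rows from the loaded model
df_syn = mf_new.synthesize(100)
```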

## Estimation
Model estimation starts with an appropriately pre-processed data frame, meaning it is tidy [@wickham2014tidy], each column has the correct data type, and missing data are represented by a missing value. Accordingly, `metasyn` is built on the `polars` data frame library [@vink2024polars]. As an example, the first records of the "hospital" dataset built into `metasyn` are printed below:

```
┌────────────┬───────────────┬───────────────┬──────┬──────┬───────────────┐
│ patient_id ┆ date_admitted ┆ time_admitted ┆ type ┆ age ┆ hours_in_room │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ time ┆ cat ┆ i64 ┆ f64 │
╞════════════╪═══════════════╪═══════════════╪══════╪══════╪═══════════════╡
│ A5909X0 ┆ 2024-01-01 ┆ 10:30:00 ┆ IVT ┆ null ┆ 3.633531 │
│ B4025X2 ┆ 2024-01-01 ┆ 11:23:00 ┆ IVT ┆ 59 ┆ 6.932891 │
│ B6999X2 ┆ 2024-01-01 ┆ 11:58:00 ┆ IVT ┆ 77 ┆ 1.970654 │
│ B9525X2 ┆ 2024-01-01 ┆ 16:56:00 ┆ MYE ┆ null ┆ 1.620047 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
└────────────┴───────────────┴───────────────┴──────┴──────┴───────────────┘
```

Note that categorical data are encoded as `cat` (not `str`) and missing data are represented by `null` values. Model estimation with `metasyn` is then performed as follows:

```python
from metasyn import MetaFrame
mf = MetaFrame.fit_dataframe(df_hospital)
```

The generative model in `metasyn` makes the simplifying assumption of _marginal independence_: each column is considered separately, similar to naïve Bayes classifiers [@hastie2009elements]. For each column, a set of candidate distributions is fitted (see \autoref{tbl:dist}), and `metasyn` selects the one that fits best, usually the one with the lowest BIC [@neath2012bayesian]; for distributions where the BIC cannot be computed (e.g., for the string data type), a pseudo-BIC that trades off fit and model complexity is used instead. Key advantages of this approach are transparency and explainability, flexibility in handling mixed data types, and computational scalability to high-dimensional datasets.
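
Formally, this assumption means that the model for a $K$-variate observation $\mathbf{x}$ factorizes over its columns, with each marginal $p(x_k)$ modelled by the best-fitting candidate distribution for column $k$:

\begin{equation}
p(\mathbf{x}) = \prod_{k = 1}^K p(x_k)
\end{equation}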

Table: \label{tbl:dist} Candidate distributions associated with data types in the core `metasyn` package.

| Data type | Candidate distributions |
| --------- | ------------------------ |
| String | Regex, Categorical, Faker, FreeText, Constant |
| Date/time | Uniform, Constant |

From this table, the string distributions deserve special attention as they are not common probability distributions. The regex (regular expression) distribution uses the package [`regexmodel`](https://pypi.org/project/regexmodel/) to automatically detect structure such as room numbers (A108, C122, B109), identifiers, e-mail addresses, or websites. The FreeText distribution detects the language (using [lingua](https://pypi.org/project/lingua-language-detector/)) and randomly picks words from that language. The [Faker](https://pypi.org/project/Faker/) distribution generates data types pre-specified by the user, such as localized names and addresses.

## Data generation

After creating a `MetaFrame`, `metasyn` can randomly sample synthetic datapoints from it. This is done using the `synthesize()` method:

```python
df_syn = mf.synthesize(3)
```

This may result in the following data frame. Note that missing values in the `age` column are appropriately reproduced as well.

```
┌────────────┬───────────────┬───────────────┬──────┬──────┬───────────────┐
│ patient_id ┆ date_admitted ┆ time_admitted ┆ type ┆ age ┆ hours_in_room │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ time ┆ cat ┆ i64 ┆ f64 │
╞════════════╪═══════════════╪═══════════════╪══════╪══════╪═══════════════╡
│ B7906X1 ┆ 2024-01-04 ┆ 13:32:00 ┆ IVT ┆ 37 ┆ 4.955418 │
│ B0553X2 ┆ 2024-01-02 ┆ 10:54:00 ┆ IVT ┆ 39 ┆ 3.872872 │
│ A5397X7 ┆ 2024-01-03 ┆ 18:16:00 ┆ CAT ┆ null ┆ 6.569082 │
└────────────┴───────────────┴───────────────┴──────┴──────┴───────────────┘
```

## Serialization and deserialization
`MetaFrame`s can also be transparently stored in a human- and machine-readable `.json` metadata file. This file contains dataset-level descriptive information as well as variable-level information (an example entry is shown below). The `.json` file can be manually audited and edited, and once it has been saved, an unlimited number of synthetic records can be created from it without incurring additional privacy risks. Serialization and deserialization with `metasyn` is done using the `save()` and `load()` methods:

```python
mf.save("hospital_admissions.json")
mf_new = MetaFrame.load("hospital_admissions.json")
```
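
To illustrate the variable-level information stored in such a file, the entry below shows how a simple categorical variable (here a toy `fruits` column) is represented, including the fitted distribution and its parameters:

```json
{
  "name": "fruits",
  "type": "categorical",
  "dtype": "Categorical(ordering='physical')",
  "prop_missing": 0.0,
  "distribution": {
    "implements": "core.multinoulli",
    "version": "1.0",
    "provenance": "builtin",
    "class_name": "MultinoulliDistribution",
    "unique": false,
    "parameters": {
      "labels": ["apple", "banana"],
      "probs": [0.4, 0.6]
    }
  },
  "creation_method": { "created_by": "metasyn" }
}
```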

# Privacy
As a general principle, `metasyn` errs on the side of privacy by default, aiming to recreate the structure but not all content and relations in the source data. For example, take the following sensitive dataset where study participants state how they use drugs in daily life:

```
┌────────────────┬─────────────────────────────────┐
│ participant_id ┆ drug_use │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════╪═════════════════════════════════╡
│ OOWJAHA4 ┆ I use marijuana in the evening… │
│ 8CA1RV4P ┆ I occasionally take CBD to hel… │
│ FMSVAKPM ┆ Prescription medication helps … │
│ … ┆ … │
└────────────────┴─────────────────────────────────┘
```


When creating synthetic data for this example, the information in the open answers is removed. Using the standard `FreeText` distribution, each answer is replaced by randomly chosen words from the detected language (English):

```
┌────────────────┬─────────────────────────────────┐
│ participant_id ┆ drug_use │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════╪═════════════════════════════════╡
│ ZQJZQAB7 ┆ Lawyer let sort her yet line e… │
│ 7KDLEL0S ┆ Particularly third myself edge… │
│ QBZKGXC7 ┆ Put color against call researc… │
└────────────────┴─────────────────────────────────┘
```

Additionally, the `metasyn` package supports [plug-ins](https://github.com/sodascience/metasyn-privacy-template) which alter the estimation behaviour. Through this system, privacy guarantees can be built into `metasyn` and additional distributions can be supported. For example, [`metasyn-disclosure-control`](https://github.com/sodascience/metasyn-disclosure-control) implements output guidelines from Eurostat [@bond2015guidelines] through _micro-aggregation_.
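
Fitting a `MetaFrame` with this plug-in only requires passing a privacy argument to the estimation step, as in the sketch below (where `df` stands for a sensitive `polars` data frame such as the drug use data above):

```python
from metasyn import MetaFrame
from metasyncontrib.disclosure import DisclosurePrivacy

# Estimate the generative metadata model under disclosure control rules
mf = MetaFrame.fit_dataframe(df, privacy=DisclosurePrivacy())
```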

# Acknowledgements

Binary file modified docs/paper/paper.pdf