
Update paper with more examples
- added hospital data example
- added drug use data example
vankesteren committed Nov 7, 2024
1 parent be1aa22 commit 44b9857
Showing 7 changed files with 227 additions and 121 deletions.
3 changes: 3 additions & 0 deletions .gitignore

# Generated api docs stuff
docs/source/api/generated

# uv stuff
uv.lock
17 changes: 12 additions & 5 deletions docs/paper/paper.bib

@article{vankesteren2024democratize,
  title = {To democratize research with sensitive data, we should make synthetic data more accessible},
  author = {{van Kesteren}, Erik-Jan},
  journal = {Patterns},
  publisher = {Elsevier BV},
  volume = {5},
  number = {9},
  pages = {101049},
  year = {2024},
  month = sep,
  issn = {2666-3899},
  doi = {10.1016/j.patter.2024.101049},
  url = {http://dx.doi.org/10.1016/j.patter.2024.101049}
}
141 changes: 66 additions & 75 deletions docs/paper/paper.md

# Software features

At its core, `metasyn` has three main functions: __estimation__, to fit a generative model to a properly formatted tabular dataset; __generation__, to synthesize new datasets based on a fitted model; and __(de)serialization__, to create a file from the fitted model for auditing, editing, and saving.
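
As a brief sketch of how these functions fit together (assuming `df` is a pre-processed `polars` data frame and `"metadata.json"` is an arbitrary file name), a typical workflow looks as follows:

```python
from metasyn import MetaFrame

# Estimation: fit the generative metadata model to the real data
mf = MetaFrame.fit_dataframe(df)

# (De)serialization: store the model as an auditable .json file and load it back
mf.save("metadata.json")
mf_new = MetaFrame.load("metadata.json")

# Generation: create a synthetic dataset of 100 rows from the loaded model
df_syn = mf_new.synthesize(100)
```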

## Estimation
Model estimation starts with an appropriately pre-processed data frame, meaning it is tidy [@wickham2014tidy], each column has the correct data type, and missing data are represented by a missing value. Accordingly, `metasyn` is built on the `polars` data frame library [@vink2024polars]. As an example, the first records of the "hospital" dataset built into `metasyn` are printed below:

```
┌────────────┬───────────────┬───────────────┬──────┬──────┬───────────────┐
│ patient_id ┆ date_admitted ┆ time_admitted ┆ type ┆ age ┆ hours_in_room │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ time ┆ cat ┆ i64 ┆ f64 │
╞════════════╪═══════════════╪═══════════════╪══════╪══════╪═══════════════╡
│ A5909X0 ┆ 2024-01-01 ┆ 10:30:00 ┆ IVT ┆ null ┆ 3.633531 │
│ B4025X2 ┆ 2024-01-01 ┆ 11:23:00 ┆ IVT ┆ 59 ┆ 6.932891 │
│ B6999X2 ┆ 2024-01-01 ┆ 11:58:00 ┆ IVT ┆ 77 ┆ 1.970654 │
│ B9525X2 ┆ 2024-01-01 ┆ 16:56:00 ┆ MYE ┆ null ┆ 1.620047 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
└────────────┴───────────────┴───────────────┴──────┴──────┴───────────────┘
```

Note that categorical data are encoded as `cat` (not `str`) and missing data are represented by `null` values. Model estimation with `metasyn` is then performed as follows:

```python
from metasyn import MetaFrame
mf = MetaFrame.fit_dataframe(df_hospital)
```

The generative model in `metasyn` makes the simplifying assumption of _marginal independence_: each column is considered separately, similar to naïve Bayes classifiers [@hastie2009elements]. For each column, a set of candidate distributions is fitted (see \autoref{tbl:dist}), and `metasyn` selects the one that fits best, usually the one with the lowest BIC [@neath2012bayesian]; for distributions where the BIC cannot be computed (e.g., for the string data type), a pseudo-BIC that trades off fit and model complexity is used instead. Key advantages of this approach are transparency and explainability, flexibility in handling mixed data types, and computational scalability to high-dimensional datasets.
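
Formally, this assumption means that the model for a $K$-variate observation $\mathbf{x}$ factorizes over its columns, with each marginal $p(x_k)$ modelled by the best-fitting candidate distribution for column $k$:

\begin{equation}
p(\mathbf{x}) = \prod_{k = 1}^K p(x_k)
\end{equation}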

Table: \label{tbl:dist} Candidate distributions associated with data types in the core `metasyn` package.

| Data type | Candidate distributions |
| --------- | ------------------------ |
| String | Regex, Categorical, Faker, FreeText, Constant |
| Date/time | Uniform, Constant |

From this table, the string distributions deserve special attention as they are not common probability distributions. The regex (regular expression) distribution uses the package [`regexmodel`](https://pypi.org/project/regexmodel/) to automatically detect structure such as room numbers (A108, C122, B109), identifiers, e-mail addresses, or websites. The FreeText distribution detects the language (using [lingua](https://pypi.org/project/lingua-language-detector/)) and randomly picks words from that language. The [Faker](https://pypi.org/project/Faker/) distribution generates data types pre-specified by the user, such as localized names and addresses.

## Data generation

After creating a `MetaFrame`, `metasyn` can randomly sample synthetic datapoints from it. This is done using the `synthesize()` method:

```python
df_syn = mf.synthesize(3)
```

This may result in the following data frame. Note that missing values in the `age` column are appropriately reproduced as well.

```
┌────────────┬───────────────┬───────────────┬──────┬──────┬───────────────┐
│ patient_id ┆ date_admitted ┆ time_admitted ┆ type ┆ age ┆ hours_in_room │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ time ┆ cat ┆ i64 ┆ f64 │
╞════════════╪═══════════════╪═══════════════╪══════╪══════╪═══════════════╡
│ B7906X1 ┆ 2024-01-04 ┆ 13:32:00 ┆ IVT ┆ 37 ┆ 4.955418 │
│ B0553X2 ┆ 2024-01-02 ┆ 10:54:00 ┆ IVT ┆ 39 ┆ 3.872872 │
│ A5397X7 ┆ 2024-01-03 ┆ 18:16:00 ┆ CAT ┆ null ┆ 6.569082 │
└────────────┴───────────────┴───────────────┴──────┴──────┴───────────────┘
```

## Serialization and deserialization
`MetaFrame`s can also be transparently stored in a human- and machine-readable `.json` metadata file. This file contains dataset-level descriptive information as well as variable-level information (an example entry is shown below). The `.json` file can be manually audited and edited, and once it has been saved, an unlimited number of synthetic records can be created from it without incurring additional privacy risks. Serialization and deserialization with `metasyn` is done using the `save()` and `load()` methods:

```python
mf.save("hospital_admissions.json")
mf_new = MetaFrame.load("hospital_admissions.json")
```
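
To illustrate the variable-level information stored in such a file, the entry below shows how a simple categorical variable (here a toy `fruits` column) is represented, including the fitted distribution and its parameters:

```json
{
  "name": "fruits",
  "type": "categorical",
  "dtype": "Categorical(ordering='physical')",
  "prop_missing": 0.0,
  "distribution": {
    "implements": "core.multinoulli",
    "version": "1.0",
    "provenance": "builtin",
    "class_name": "MultinoulliDistribution",
    "unique": false,
    "parameters": {
      "labels": ["apple", "banana"],
      "probs": [0.4, 0.6]
    }
  },
  "creation_method": { "created_by": "metasyn" }
}
```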

# Privacy
As a general principle, `metasyn` errs on the side of privacy by default, aiming to recreate the structure but not all content and relations in the source data. For example, take the following sensitive dataset where study participants state how they use drugs in daily life:

```
┌────────────────┬─────────────────────────────────┐
│ participant_id ┆ drug_use │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════╪═════════════════════════════════╡
│ OOWJAHA4 ┆ I use marijuana in the evening… │
│ 8CA1RV4P ┆ I occasionally take CBD to hel… │
│ FMSVAKPM ┆ Prescription medication helps … │
│ … ┆ … │
└────────────────┴─────────────────────────────────┘
```


When creating synthetic data for this example, the information in the open answers is removed. Using the standard `FreeText` distribution, each answer is replaced by randomly chosen words from the detected language (English):

```
┌────────────────┬─────────────────────────────────┐
│ participant_id ┆ drug_use │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════╪═════════════════════════════════╡
│ ZQJZQAB7 ┆ Lawyer let sort her yet line e… │
│ 7KDLEL0S ┆ Particularly third myself edge… │
│ QBZKGXC7 ┆ Put color against call researc… │
└────────────────┴─────────────────────────────────┘
```

Additionally, the `metasyn` package supports [plug-ins](https://github.com/sodascience/metasyn-privacy-template) which alter the estimation behaviour. Through this system, privacy guarantees can be built into `metasyn` and additional distributions can be supported. For example, [`metasyn-disclosure-control`](https://github.com/sodascience/metasyn-disclosure-control) implements output guidelines from Eurostat [@bond2015guidelines] through _micro-aggregation_.
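
Fitting a `MetaFrame` with this plug-in only requires passing a privacy argument to the estimation step, as in the sketch below (where `df` stands for a sensitive `polars` data frame such as the drug use data above):

```python
from metasyn import MetaFrame
from metasyncontrib.disclosure import DisclosurePrivacy

# Estimate the generative metadata model under disclosure control rules
mf = MetaFrame.fit_dataframe(df, privacy=DisclosurePrivacy())
```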

# Acknowledgements

Binary file modified docs/paper/paper.pdf