Improved module/package docstrings (#226)
* Update base.py

* Changes to main modules and classes docstrings

* More docstrings + line length fixes

* Final series of module docstrings

* Fix ruff issues

* Fix pydocstyle issue

* Apply suggestions from code review

Co-authored-by: qubixes <[email protected]>

* Changes based on PR comments

---------

Co-authored-by: Raoul Schram <[email protected]>
Co-authored-by: qubixes <[email protected]>
3 people authored Jan 8, 2024
1 parent 1f908f1 commit 7dca1cb
Showing 20 changed files with 112 additions and 34 deletions.
19 changes: 16 additions & 3 deletions metasyn/__init__.py
@@ -1,8 +1,21 @@
"""Metasyn: a package for creating synthetic datasets.
One part concerns the creation of the statistical metadata from the
original data, while the other part creates a synthetic dataset from the
metadata.
Metasyn has three main purposes:
1. Estimation: Metasyn can create a MetaFrame from a dataset.
A MetaFrame is metadata describing a table, augmented with statistical
information on the columns. It captures individual distributions and
features and enables generation of synthetic data based on it.
2. Serialization and deserialization: Metasyn can export a
MetaFrame into an easy-to-read GMF file. This allows users to audit,
understand, and modify their data generation model. These GMF files
can also be imported back into Metasyn to generate synthetic data.
3. Generation: Metasyn can generate synthetic data based on a MetaFrame.
The synthetic data produced solely depends on the MetaFrame, thereby
maintaining a critical separation between the original sensitive data and the
generated synthetic data.
"""

from importlib.metadata import version
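
The three purposes listed in the package docstring (estimation, serialization, generation) can be illustrated with a minimal, standard-library-only sketch. This is not metasyn's API: the functions `estimate` and `generate`, the dictionary layout, and the choice of a normal distribution per column are all assumptions made for illustration. The point it shows is that generation reads only the restored metadata, never the original rows.

```python
import json
import random
import statistics


def estimate(rows):
    """Hypothetical estimation step: summarise each numeric column."""
    columns = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        columns[name] = {
            "distribution": "normal",  # assumed model, chosen for simplicity
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values),
        }
    return {"n_rows": len(rows), "columns": columns}


def generate(metadata, n_rows, seed=None):
    """Hypothetical generation step: uses the metadata only."""
    rng = random.Random(seed)
    return [
        {
            name: rng.gauss(spec["mean"], spec["stdev"])
            for name, spec in metadata["columns"].items()
        }
        for _ in range(n_rows)
    ]


original = [
    {"age": 31.0, "height": 1.70},
    {"age": 45.0, "height": 1.82},
    {"age": 28.0, "height": 1.65},
]

metadata = estimate(original)                 # 1. estimation
serialized = json.dumps(metadata, indent=2)   # 2. serialization (auditable text)
restored = json.loads(serialized)             #    ... and deserialization
synthetic = generate(restored, n_rows=5, seed=0)  # 3. generation
```

Because the serialized text is plain JSON, it can be inspected or edited by hand before it is used for generation, which is the auditing step the docstring describes.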
6 changes: 5 additions & 1 deletion metasyn/__main__.py
@@ -1,4 +1,8 @@
"""CLI for generating synthetic data frames from a metasyn .json file."""
"""Module providing a Command Line Interface (CLI) for metasyn.
It provides functionality to generate GMF (.json) metadata files,
generate synthetic data from GMF files, and create JSON schemas for GMF files.
"""
import argparse
import json
import pathlib
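
A CLI with the three tasks named in the docstring could be laid out with `argparse` subcommands, roughly as sketched below. The program name, the subcommand names (`fit`, `generate`, `schema`), and every argument are hypothetical and are not taken from metasyn's actual command line.

```python
import argparse
import pathlib


def build_parser():
    """Build a hypothetical three-subcommand parser."""
    parser = argparse.ArgumentParser(prog="sketch-cli")
    sub = parser.add_subparsers(dest="command", required=True)

    # Hypothetical: create a metadata file from a dataset.
    fit = sub.add_parser("fit", help="create a metadata file from a dataset")
    fit.add_argument("input", type=pathlib.Path)
    fit.add_argument("output", type=pathlib.Path)

    # Hypothetical: create synthetic data from a metadata file.
    gen = sub.add_parser("generate", help="create synthetic data from metadata")
    gen.add_argument("metadata", type=pathlib.Path)
    gen.add_argument("-n", "--num-rows", type=int, default=None)
    gen.add_argument("-o", "--output", type=pathlib.Path)

    # Hypothetical: print a JSON schema for the metadata format.
    sub.add_parser("schema", help="print a JSON schema for metadata files")
    return parser


args = build_parser().parse_args(
    ["generate", "meta.json", "-n", "10", "-o", "out.csv"]
)
```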
2 changes: 1 addition & 1 deletion metasyn/demo/__init__.py
@@ -1,4 +1,4 @@
"""Package including demo datasets for tutorials."""
"""Package to create and retrieve demo datasets used in tutorials."""

from metasyn.demo.dataset import demo_file

2 changes: 1 addition & 1 deletion metasyn/demo/dataset.py
@@ -1,4 +1,4 @@
"""Load/create different demo datasets."""
"""Create and retrieve demo datasets."""

import random
from pathlib import Path
10 changes: 6 additions & 4 deletions metasyn/distribution/__init__.py
@@ -1,8 +1,10 @@
"""Distributions for variables.
"""Package providing different distributions used in metasyn.
These distributions can be fit to datasets/series so that the synthesis is
somewhat realistic. The concept of distributions here is not only for
numerical data, but also for generating strings for example.
Each distribution class provides methods for fitting the distribution to a
series of values, and for generating synthetic data based on the fitted
distribution. Each distribution class also provides a way to calculate the
information criterion, used for selecting the best distribution for
a given set of values.
""" # pylint: disable=invalid-name

from metasyn.distribution.categorical import MultinoulliDistribution
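
The fit / generate / information-criterion pattern described in the new docstring can be sketched as follows. The class names, the method names, and the use of AIC are assumptions for illustration; the docstring does not say which criterion metasyn uses, and the real distribution classes may differ in interface.

```python
import math
import random
import statistics


class NormalSketch:
    """Hypothetical normal distribution with two fitted parameters."""

    n_par = 2

    def __init__(self, mean, stdev):
        self.mean = mean
        self.stdev = stdev

    @classmethod
    def fit(cls, values):
        # Maximum likelihood estimates; assumes the values are not all equal.
        return cls(statistics.mean(values), statistics.pstdev(values))

    def log_likelihood(self, values):
        var = self.stdev ** 2
        return sum(
            -0.5 * math.log(2 * math.pi * var) - (v - self.mean) ** 2 / (2 * var)
            for v in values
        )

    def draw(self, rng):
        return rng.gauss(self.mean, self.stdev)


class UniformSketch:
    """Hypothetical uniform distribution on [lower, upper]."""

    n_par = 2

    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper

    @classmethod
    def fit(cls, values):
        # Assumes at least two distinct values, so upper > lower.
        return cls(min(values), max(values))

    def log_likelihood(self, values):
        if any(v < self.lower or v > self.upper for v in values):
            return -math.inf
        return -len(values) * math.log(self.upper - self.lower)

    def draw(self, rng):
        return rng.uniform(self.lower, self.upper)


def aic(dist, values):
    """Akaike information criterion (lower is better)."""
    return 2 * dist.n_par - 2 * dist.log_likelihood(values)


def select_best(values, candidates=(NormalSketch, UniformSketch)):
    """Fit every candidate and keep the one with the lowest criterion."""
    fitted = [candidate.fit(values) for candidate in candidates]
    return min(fitted, key=lambda dist: aic(dist, values))


values = list(range(10))  # evenly spread data
best = select_best(values)
sample = best.draw(random.Random(0))
```

For evenly spread values like these, the uniform candidate has the higher likelihood and is selected; strongly clustered data would tend to favour the normal candidate instead.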
16 changes: 15 additions & 1 deletion metasyn/distribution/base.py
@@ -1,4 +1,18 @@
"""Module for the base distribution and the scipy distribution."""
"""
Module serving as the basis for all metasyn distributions.
The base module contains the BaseDistribution class, which is the base class
for all distributions. It also contains the ScipyDistribution class,
which is a specialized base class for distributions that are built on top of
SciPy's statistical distributions.
Additionally it contains the UniqueDistributionMixin class,
which is a mixin class that can be used to make a distribution unique
(i.e., one that does not contain duplicate values).
Finally it contains the metadist() decorator, which is used to set the
class attributes of a distribution.
"""

from __future__ import annotations
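
The combination described in the docstring above (a base class, a mixin that makes draws unique, and a decorator that sets class attributes) could look roughly like the sketch below. All names ending in `Sketch`, the attribute names `implements` and `var_type`, and the rejection-sampling approach are assumptions; they are not copied from metasyn's `BaseDistribution`, `UniqueDistributionMixin`, or `metadist()`.

```python
import random


def metadist_sketch(implements=None, var_type=None):
    """Hypothetical decorator that sets class attributes on a distribution."""
    def decorator(cls):
        if implements is not None:
            cls.implements = implements
        if var_type is not None:
            cls.var_type = var_type
        return cls
    return decorator


class BaseSketch:
    """Hypothetical base class shared by all distributions."""

    implements = "unknown"
    var_type = "unknown"

    def draw(self, rng):
        raise NotImplementedError


class UniqueMixinSketch:
    """Hypothetical mixin: rejects values that were already drawn."""

    max_tries = 1000

    def draw_reset(self):
        # Forget all previously drawn values.
        self._seen = set()

    def draw(self, rng):
        seen = self.__dict__.setdefault("_seen", set())
        for _ in range(self.max_tries):
            value = super().draw(rng)  # delegate to the next class in the MRO
            if value not in seen:
                seen.add(value)
                return value
        raise ValueError("Could not draw a new unique value.")


@metadist_sketch(implements="sketch.uniform_int", var_type="discrete")
class UniformIntSketch(BaseSketch):
    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper

    def draw(self, rng):
        # Integer in the half-open interval [lower, upper).
        return rng.randrange(self.lower, self.upper)


@metadist_sketch(implements="sketch.unique_uniform_int")
class UniqueUniformIntSketch(UniqueMixinSketch, UniformIntSketch):
    pass


dist = UniqueUniformIntSketch(0, 10)
rng = random.Random(1)
values = [dist.draw(rng) for _ in range(10)]  # ten distinct integers
```

Placing the mixin first in the list of base classes lets its `draw` wrap the underlying distribution's `draw` through `super()`, so the same mixin can be reused for any distribution.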

2 changes: 1 addition & 1 deletion metasyn/distribution/categorical.py
@@ -1,4 +1,4 @@
"""Module containing categorical distributions."""
"""Module implementing categorical distributions."""

from __future__ import annotations

7 changes: 6 additions & 1 deletion metasyn/distribution/constant.py
@@ -1,4 +1,9 @@
"""Module containing the class with constant distributions."""
"""
Module implementing constant distributions.
The module contains a base class for constant distributions, and subclasses
that implement constant distributions for different variable types.
"""

import datetime as dt

2 changes: 1 addition & 1 deletion metasyn/distribution/continuous.py
@@ -1,4 +1,4 @@
"""Implemented floating point distributions."""
"""Module implementing continuous (floating point) distributions."""

import numpy as np
from scipy.optimize import minimize
2 changes: 1 addition & 1 deletion metasyn/distribution/datetime.py
@@ -1,4 +1,4 @@
"""Distributions for date and time types."""
"""Module implementing distributions for date and time types."""

import datetime as dt
from abc import abstractmethod
2 changes: 1 addition & 1 deletion metasyn/distribution/discrete.py
@@ -1,4 +1,4 @@
"""Module with discrete distributions."""
"""Module implementing discrete distributions."""

from typing import Set

9 changes: 8 additions & 1 deletion metasyn/distribution/faker.py
@@ -1,4 +1,11 @@
"""Module containing an interface to the faker package."""
"""
Module implementing Faker based distributions.
This module acts as an interface to the Faker package and can be used to
create distributions that can generate fake names, addresses, e-mails, etc.
Faker can be found here: https://github.com/joke2k/faker/tree/master
"""
from typing import Iterable, Optional

# from lingua._constant import LETTERS, PUNCTUATION
6 changes: 5 additions & 1 deletion metasyn/distribution/na.py
@@ -1,4 +1,8 @@
"""Module containing the class with NA distributions."""
"""Module implementing NA distributions.
This module contains a single class for creating distributions that only
return NA.
"""

import polars as pl

10 changes: 9 additions & 1 deletion metasyn/distribution/regex.py
@@ -1,4 +1,12 @@
"""Distribution for structured strings, using regexes."""
"""
Module implementing structured string distributions.
This module provides a RegexDistribution class that fits a regular expression
to structured strings such as email addresses, IDs, telephone numbers, and IP
addresses. It is based on the regexmodel package found here:
https://github.com/sodascience/regexmodel.
"""

from __future__ import annotations

from typing import Optional, Union
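
To give an intuition for what "fitting a regular expression to structured strings" can mean, here is a deliberately simplified sketch. It is not the regexmodel algorithm and not metasyn's `RegexDistribution`: it assumes all strings have the same length and infers one character class per position. The function names `fit_pattern`, `to_regex`, and `draw` are hypothetical.

```python
import random
import re
import string

# Candidate character classes, tried in order: (regex fragment, alphabet).
CLASSES = [
    (r"\d", string.digits),
    ("[A-Z]", string.ascii_uppercase),
    ("[a-z]", string.ascii_lowercase),
]


def fit_pattern(values):
    """Infer one (regex fragment, alphabet) pair per character position."""
    if len({len(value) for value in values}) != 1:
        raise ValueError("This sketch only handles equal-length strings.")
    pattern = []
    for chars in zip(*values):
        unique = set(chars)
        if len(unique) == 1:
            # The same character everywhere: keep it as a literal.
            pattern.append((re.escape(chars[0]), chars[0]))
            continue
        for regex, alphabet in CLASSES:
            if unique <= set(alphabet):
                pattern.append((regex, alphabet))
                break
        else:
            # Mixed position: fall back to any letter or digit.
            pattern.append(("[A-Za-z0-9]", string.ascii_letters + string.digits))
    return pattern


def to_regex(pattern):
    return "".join(regex for regex, _ in pattern)


def draw(pattern, rng):
    return "".join(rng.choice(alphabet) for _, alphabet in pattern)


ids = ["AB-1234", "CD-5678", "EF-9012"]
pattern = fit_pattern(ids)
regex = to_regex(pattern)
rng = random.Random(0)
synthetic_ids = [draw(pattern, rng) for _ in range(3)]
```

The synthetic strings follow the same structure as the inputs (two uppercase letters, a hyphen, four digits) without reusing any specific input value by construction.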
11 changes: 8 additions & 3 deletions metasyn/metaframe.py
@@ -1,4 +1,4 @@
"""Conversion of DataFrames to MetaFrames.""" # pylint: disable=invalid-name
"""Module defining the MetaFrame class, used for the conversion of DataFrames to MetaFrames."""

from __future__ import annotations

@@ -22,8 +22,13 @@
class MetaFrame():
"""Metasyn metaframe consisting of variables.
The metasyn metaframe structure that is most easily created from
a polars dataset with the from_dataframe class method.
A MetaFrame, short for metadata frame, is a structure that holds statistical metadata
about a dataset. The data contained in a MetaFrame is in line with the
Generative Metadata Format (GMF). It is essentially a collection of MetaVar objects,
each representing a column in a dataset.
The metaframe is most easily created from a polars dataset with the from_dataframe()
class method.
Parameters
----------
6 changes: 3 additions & 3 deletions metasyn/provider.py
@@ -1,7 +1,7 @@
"""Module for distribution providers.
"""Module implementing distribution providers.
These are used to find/fit distributions that are available. See pyproject.toml on how the
builtin distribution provider is registered.
Distribution providers are used to find/fit distributions that are available.
See pyproject.toml on how the builtin distribution provider is registered.
"""

from __future__ import annotations
2 changes: 1 addition & 1 deletion metasyn/schema/__init__.py
@@ -1 +1 @@
"""Package containing the JSON-schemas for validation."""
"""Package containing the JSON-schema that can be used for validating metadata."""
6 changes: 5 additions & 1 deletion metasyn/testutils.py
@@ -1,4 +1,8 @@
"""Testing utilities for plugins."""
"""Module for testing the functionality of distributions and providers.
The testutils module provides a set of utilities for testing the functionality
and internal consistency of individual distributions and providers.
"""


from __future__ import annotations
5 changes: 4 additions & 1 deletion metasyn/validation.py
@@ -1,4 +1,7 @@
"""Tools for validating distribution/GMF file output."""
"""The validation module contains functions to validate the serialized output of distributions.
This ensures that the Generative Metadata Format (GMF) files are interoperable and well formed.
"""

from __future__ import annotations

21 changes: 15 additions & 6 deletions metasyn/var.py
@@ -1,4 +1,4 @@
"""Variable module that creates metadata variables.""" # pylint: disable=invalid-name
"""Module defining the MetaVar class, which represents a metadata variable."""

from __future__ import annotations

@@ -14,15 +14,21 @@


class MetaVar():
"""Meta data variable.
"""Metadata variable.
Acts as a base class for specific types of variables, but also as a
launching pad for detecting its type.
MetaVar is a structure that holds all metadata needed to generate a
synthetic column for it. This is the variable level building block for the
MetaFrame. It contains the methods to convert a polars Series into a
variable with an appropriate distribution. The MetaVar class is to the
MetaFrame what a polars Series is to a DataFrame.
This class is considered a passthrough class used by the MetaFrame class,
and is not intended to be used directly by the user.
Parameters
----------
var_type:
Variable type as a string, e.g. continuous, string, etc.
String containing the variable type, e.g. continuous, string, etc.
series:
Series to create the variable from. Series is None by default and in
this case the value is ignored. If it is not supplied, then the
@@ -38,7 +44,7 @@ class MetaVar():
Type of the original values, e.g. int64, float, etc. Used for type-casting
back.
description:
User provided description of the variable.
User-provided description of the variable.
"""

dtype = "unknown"
@@ -85,6 +91,9 @@ def detect(cls,
prop_missing: Optional[float] = None):
"""Detect variable class(es) of series or dataframe.
This method does not fit any distribution, but it does infer the
correct types for the MetaVar and saves the Series for later fitting.
Parameters
----------
series_or_dataframe: pd.Series or pd.Dataframe