Variable Specifications #194

Closed
wants to merge 2 commits into from


Samuwhale
Collaborator

I had issues with rebasing, so I figured the easiest fix was to create a new branch and add var_spec there. This PR replaces the old one, #161.

It's still WIP, hence the draft.

Here's an example of how to use it:

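import polars as pl
# Assumption: MetaFrame and demo_file are exported at the package top level;
# the original snippet uses them without showing the import.
from metasynth import MetaFrame, demo_file
from metasynth.distribution import FakerDistribution, DiscreteUniformDistribution, RegexDistribution
from metasynth.spec import MetaFrameSpec, VariableSpec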

demo_file_path = demo_file()
demo_types={
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical
}
df = pl.read_csv(demo_file_path, try_parse_dates=True, dtypes=demo_types)

# Old approach

df_spec_0 = {
    "PassengerId": {"unique": True},
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "ExponentialDistribution"},
    "Age": {
        "distribution": DiscreteUniformDistribution(20, 40),
    },
    "Cabin": {"distribution": RegexDistribution(r"[ABCDEF]\d{2,3}")}
}

mf_0 = MetaFrame.fit_dataframe(df, spec=df_spec_0)

print(mf_0.synthesize(10))

# New approach (option 1: creating VariableSpec objects directly, and passing a dictionary of those to fit_dataframe())
passenger_spec = VariableSpec()
passenger_spec.unique = True

name_spec = VariableSpec()
name_spec.distribution = FakerDistribution("name")

fare_spec = VariableSpec()
fare_spec.distribution = "ExponentialDistribution"

age_spec = VariableSpec()
age_spec.distribution = DiscreteUniformDistribution(600, 1000)

cabin_spec = VariableSpec()
cabin_spec.distribution = RegexDistribution(r"[ABCDEF]\d{2,3}")

df_spec_1 = {
    "PassengerId": passenger_spec.to_dict(),
    "Name": name_spec.to_dict(),
    "Fare": fare_spec.to_dict(),
    "Age": age_spec.to_dict(),
    "Cabin": cabin_spec.to_dict()
}

mf_1 = MetaFrame.fit_dataframe(df, spec=df_spec_1)

print(mf_1.synthesize(10))

# New approach (option 2: creating a MetaFrameSpec object, and passing that to fit_dataframe())

df_spec_2 = MetaFrameSpec(df)
df_spec_2["PassengerId"].unique = True
df_spec_2["Name"].distribution = FakerDistribution("name")
df_spec_2["Fare"].distribution = "ExponentialDistribution"
df_spec_2["Age"].distribution = DiscreteUniformDistribution(600, 1000)
df_spec_2["Cabin"].distribution = RegexDistribution(r"[ABCDEF]\d{2,3}")

mf_2 = MetaFrame.fit_dataframe(df, spec=df_spec_2.to_dict())

print(mf_2.synthesize(10))
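
To make the relationship between the approaches concrete, here is a minimal, library-free sketch of the pattern: per-column spec objects that collapse back into the plain dictionary used by the old approach. `SketchVariableSpec` and `SketchFrameSpec` are hypothetical stand-ins written for illustration, not the classes from this PR, and skipping unset fields and untouched columns in `to_dict()` is an assumption rather than confirmed behaviour.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchVariableSpec:
    """Hypothetical stand-in for VariableSpec (not the metasynth class)."""

    distribution: Any = None
    unique: Optional[bool] = None

    def to_dict(self) -> dict:
        # Assumption: only fields that were explicitly set are emitted.
        return {key: value for key, value in vars(self).items() if value is not None}


class SketchFrameSpec:
    """Hypothetical stand-in for MetaFrameSpec, keyed by column name."""

    def __init__(self, columns=()):
        self._specs = {name: SketchVariableSpec() for name in columns}

    def __getitem__(self, name: str) -> SketchVariableSpec:
        # Create the spec on first access, so columns need not be known up front.
        return self._specs.setdefault(name, SketchVariableSpec())

    def to_dict(self) -> dict:
        # Assumption: columns without any settings are left out of the result.
        return {
            name: spec.to_dict()
            for name, spec in self._specs.items()
            if spec.to_dict()
        }


spec = SketchFrameSpec(["PassengerId", "Fare", "Age"])
spec["PassengerId"].unique = True
spec["Fare"].distribution = "ExponentialDistribution"
spec_dict = spec.to_dict()
print(spec_dict)
```

The resulting dictionary has the same shape as `df_spec_0`, which is why both new options can still be passed to `fit_dataframe()` after calling `to_dict()`.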

@Samuwhale Samuwhale mentioned this pull request Oct 12, 2023
MetaFrameSpec can now be initialized without passing in a df.
@Samuwhale
Collaborator Author

@qubixes and I discussed that it might be nice to set these specs through config files (TOML would probably be a good fit).

@qubixes
Member

qubixes commented Jan 19, 2024

I think given the new PR #227, we can close this PR for now. Thanks for the inspiration!

@qubixes qubixes closed this Jan 19, 2024
@vankesteren vankesteren deleted the var_spec-metasyn branch February 26, 2024 16:02