Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Var spec obj #161

Closed
wants to merge 4 commits into from
Closed

Var spec obj #161

wants to merge 4 commits into from

Conversation

Samuwhale
Copy link
Collaborator

@Samuwhale Samuwhale commented Sep 8, 2023

EDIT: Due to rebasing issues I have moved to a new branch, see #194 for the up to date PR.

Since last discussing this with @qubixes I have completely rewritten the code, as it was not behaving as it should. Now, however, it should work pretty well, it's also pretty easy to read/understand since I have added docstrings. It's still VERY basic, but it's a good first step.

Here's an example of how to use it:

from metasynth.distribution import FakerDistribution, DiscreteUniformDistribution, RegexDistribution
from metasynth.spec import MetaFrameSpec, VariableSpec

demo_file_path = demo_file()
demo_types={
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical
}
df = pl.read_csv(demo_file_path, try_parse_dates=True, dtypes=demo_types)

# Old approach

df_spec_0 = {
    "PassengerId": {"unique": True},
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "ExponentialDistribution"},
    "Age": {
        "distribution": DiscreteUniformDistribution(20, 40),
    },
    "Cabin": {"distribution": RegexDistribution(r"[ABCDEF]\d{2,3}")}
}

mf_0 = MetaFrame.fit_dataframe(df, spec=df_spec_0)

print(mf_0.synthesize(10))

# New approach (option 1: creating VariableSpec objects directly, and passing a dictionary of those to fit_dataframe())
passenger_spec = VariableSpec()
passenger_spec.unique = True

name_spec = VariableSpec()
name_spec.distribution = FakerDistribution("name")

fare_spec = VariableSpec()
fare_spec.distribution = "ExponentialDistribution"

age_spec = VariableSpec()
age_spec.distribution = DiscreteUniformDistribution(600, 1000)

cabin_spec = VariableSpec()
cabin_spec.distribution = RegexDistribution(r"[ABCDEF]\d{2,3}")

df_spec_1 = {
    "PassengerId": passenger_spec.to_dict(),
    "Name": name_spec.to_dict(),
    "Fare": fare_spec.to_dict(),
    "Age": age_spec.to_dict(),
    "Cabin": cabin_spec.to_dict()
}

mf_1 = MetaFrame.fit_dataframe(df, spec=df_spec_1)

print(mf_1.synthesize(10))

# New approach (option 2: creating a MetaFrameSpec object, and passing that to fit_dataframe())

df_spec_2 = MetaFrameSpec(df)
df_spec_2["PassengerId"].unique = True
df_spec_2["Name"].distribution = FakerDistribution("name")
df_spec_2["Fare"].distribution = "ExponentialDistribution"
df_spec_2["Age"].distribution = DiscreteUniformDistribution(600, 1000)
df_spec_2["Cabin"].distribution = RegexDistribution(r"[ABCDEF]\d{2,3}")

mf_2 = MetaFrame.fit_dataframe(df, spec=df_spec_2.to_dict())

print(mf_2.synthesize(10))

Currently a new object, MetaVarSpec has been added. It can be used as such:

Previously:

```python
from metasynth.distribution import FakerDistribution, DiscreteUniformDistribution, RegexDistribution
from metasynth import MetaVarSpec

demo_file_path = demo_file()
demo_types={
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical
}

df = pl.read_csv(demo_file_path, try_parse_dates=True, dtypes=demo_types)

var_spec = {
    "PassengerId": {"unique": True},
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "ExponentialDistribution"},
    "Age": {
        "distribution": DiscreteUniformDistribution(20, 40),
    },
    "Cabin": {"distribution": RegexDistribution(r"[ABCDEF]\d{2,3}")}
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec)

print(mf.synthesize(10))
```

New approach:

```python

new_spec = MetaVarSpec(df)

new_spec.PassengerId.unique = True
new_spec.Name.distribution = FakerDistribution("name")
new_spec.Fare.distribution = "ExponentialDistribution"
new_spec.Age.distribution = DiscreteUniformDistribution(20, 40)
new_spec.Cabin.distribution = RegexDistribution(r"[ABCDEF]\d{2,3}")

mf2 = MetaFrame.fit_dataframe(df, spec=new_spec.to_dict())

print(mf2.synthesize(10))
```

Far from perfect, but a good start.
Last approach was having a lot of bugs, this new approach should be simpler to use AND understand


def type_check(key, value, type_dict=AVAILABLE_TYPES):
if not any(isinstance(value, allowed_type) for allowed_type in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to use isinstance(value, type_dict[key]) to check for multiple types at once.

@qubixes
Copy link
Member

qubixes commented Sep 11, 2023

It looks really good! One thing that might be good to discuss is whether to do type checking here. I think there is a good reason to do it here, but generally it's not considered "pythonic" (whatever it means) as far as I understand. Let's discuss it later today.

@qubixes
Copy link
Member

qubixes commented Sep 13, 2023

@Samuwhale What are the plans with this PR? I would like to merge this PR so that we can start with renaming the package. Is it working to the point where it can be optimized/improved later, or does it still need a lot of work you think?

@vankesteren
Copy link
Member

let's maybe hold off on implementing this one until the rename is done. I'd like to have a bit more conversation around the interface before we finalize this one.

@Samuwhale
Copy link
Collaborator Author

I just pushed a commit that fixed all workflow warnings. The code is fully functional without changing any of the current code-base (it only adds to it), so in theory, it can be merged.

That said, I think it probably is indeed nicer to work on this a bit further before implementing it!

@Samuwhale Samuwhale closed this Oct 12, 2023
@Samuwhale Samuwhale deleted the var_spec-obj branch October 12, 2023 10:05
@qubixes qubixes mentioned this pull request Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants