Add rudimentary file handling support. #355

Draft: wants to merge 1 commit into base: develop

qubixes (Member) commented Jan 10, 2025

This PR is very much unfinished and is mostly directed towards initial discussion.

The problem I ran into when trying to use SPSS (.sav) files is that a lot of information is stored outside of what a dataframe can capture: for example column labels, formats (how many decimal places), file labels, tables for categories, etc. When you synthesize data from the resulting GMF file (which does not include this information), you get an SPSS file that is not quite comparable to the original, which partly defeats the purpose of metasyn.

Another example: with CSV files, the delimiter of the synthetic file could differ from the delimiter of the original.

So, to store this information, I see two ways:

  • Store the information about the file format in the GMF file.
  • Store the information in a separate file.

Both have advantages. The second option requires minimal or no modification to metasyn itself, which is a big plus. However, part of the dataset's metadata then lives outside the GMF file, so the GMF file contains only part of the metadata and the user has to keep track of two files. For now I have implemented the feature using the second option, for simplicity.
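To make the second option concrete, here is a minimal sketch of what a separate format file could look like. This is not the code in this PR: the function names, the `.format.json` naming scheme, and the keys in the dictionary are all assumptions made up for illustration.

```python
import json
from pathlib import Path


def sidecar_path(gmf_path):
    """Hypothetical naming scheme: data.json -> data.format.json."""
    gmf_path = Path(gmf_path)
    return gmf_path.with_name(gmf_path.stem + ".format.json")


def write_sidecar(gmf_path, file_format):
    """Store file-format details in a JSON file next to the GMF file."""
    path = sidecar_path(gmf_path)
    path.write_text(json.dumps(file_format, indent=2))
    return path


def read_sidecar(gmf_path):
    """Load the file-format details that belong to a GMF file."""
    return json.loads(sidecar_path(gmf_path).read_text())
```

The GMF file itself is untouched in this sketch, which is the main attraction of the second option; the cost is that the two files have to travel together.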

I have only integrated the file handlers with the CLI, where it makes the most sense. Two file handlers are implemented, .sav and .csv, so files of both types can now be read and written back as a synthetic file.
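As a rough illustration of the concept (again, not the API in this PR), a CSV file handler could look something like the sketch below. The class name `CsvFileHandler` and its methods are hypothetical, and it uses only the standard library `csv` module rather than metasyn or a dataframe library.

```python
import csv
from dataclasses import asdict, dataclass


@dataclass
class CsvFileHandler:
    """Hypothetical handler that remembers how the original CSV was formatted."""

    delimiter: str = ","
    quotechar: str = '"'

    def read(self, path):
        """Read the original file using the stored format settings."""
        with open(path, newline="") as handle:
            reader = csv.reader(
                handle, delimiter=self.delimiter, quotechar=self.quotechar
            )
            return list(reader)

    def write(self, path, rows):
        """Write (synthetic) rows with the same settings as the original."""
        with open(path, "w", newline="") as handle:
            writer = csv.writer(
                handle, delimiter=self.delimiter, quotechar=self.quotechar
            )
            writer.writerows(rows)

    def to_dict(self):
        """Format details that could be stored alongside the GMF file."""
        return {"file_type": "csv", **asdict(self)}
```

A .sav handler would follow the same pattern but would have to carry much more (column labels, formats, category tables), which is why the format information needs to be stored somewhere.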

As said, the new API is unfinished, so this is more about the concept than anything else.
