Add rudimentary file handling support. #355

qubixes · 2025-01-10T13:20:50Z

This PR is very much unfinished and is mostly directed towards initial discussion.

The problem I ran into when trying to use SPSS (.sav) files is that a lot of information is stored outside of what can be captured in the dataframe: for example, column labels, formats (such as the number of decimal places), file labels, and tables for categories. When you synthesize data from the resulting GMF file (which does not include this information), you get an SPSS file that is not quite comparable to the original, which partly defeats the purpose of metasyn.

A similar problem exists for CSV files: for example, the delimiter of the synthetic file could differ from that of the original.
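To make the CSV case concrete, here is a minimal sketch of how the delimiter of an original file could be detected with Python's standard library. The helper name `detect_delimiter` and the candidate delimiter set are assumptions for illustration; this is not the code in this PR.

```python
import csv


def detect_delimiter(sample: str) -> str:
    """Guess the delimiter of a CSV text sample (hypothetical helper)."""
    # Restrict the candidates so that characters such as '.' inside
    # numbers are never mistaken for a delimiter.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    return dialect.delimiter


if __name__ == "__main__":
    original = "id;name;score\n1;alice;3.5\n2;bob;4.0\n"
    print(detect_delimiter(original))
```

Detection like this is a heuristic and can fail on unusual files, which is one reason to record the format explicitly rather than guess it again when writing the synthetic file.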

So, to store this information, I see two ways:

1. Store the information about the file format in the GMF file.
2. Store the information in a separate file.

Both have advantages. The second option requires little or no modification to metasyn itself, which is a big plus. However, part of the dataset's metadata then lives outside the GMF file, so the GMF file contains only part of the metadata and the user has to keep track of two files. For now, I have implemented the feature using the second option for simplicity.

I have only integrated the file handlers with the CLI, where it makes the most sense. There are two file handlers implemented: .sav and .csv, so both of those can now be read and written back as a synthetic file.
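As a rough illustration of the concept (not the actual API in this PR), a file handler that keeps the format information in a separate JSON file might look like the sketch below. The class name `CsvFileHandler`, its methods, and the JSON layout are all hypothetical.

```python
import csv
import json
from pathlib import Path


class CsvFileHandler:
    """Hypothetical handler that remembers how a CSV file was formatted."""

    def __init__(self, delimiter: str = ",", quotechar: str = '"'):
        self.delimiter = delimiter
        self.quotechar = quotechar

    def save(self, path) -> None:
        """Store the format information in a separate JSON file."""
        info = {
            "file_format": "csv",
            "delimiter": self.delimiter,
            "quotechar": self.quotechar,
        }
        Path(path).write_text(json.dumps(info, indent=2))

    @classmethod
    def load(cls, path) -> "CsvFileHandler":
        """Recreate the handler from a previously saved JSON file."""
        info = json.loads(Path(path).read_text())
        return cls(delimiter=info["delimiter"], quotechar=info["quotechar"])

    def write_rows(self, header, rows, path) -> None:
        """Write (synthetic) rows using the stored format settings."""
        with open(path, "w", newline="") as handle:
            writer = csv.writer(
                handle, delimiter=self.delimiter, quotechar=self.quotechar
            )
            writer.writerow(header)
            writer.writerows(rows)
```

In this sketch, the handler would be saved next to the GMF file when the metadata is created and loaded again when the synthetic file is written, so the synthetic CSV uses the same delimiter as the original. The cost is the one described above: two files to keep track of.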

As said, the new API is unfinished, so this is more about the concept than anything else.

Add rudimentary file handling support.

1de230b
