Add rudimentary file handling support. #355
Draft
+223
−6
This PR is very much unfinished and is mainly intended to start a discussion.
The problem I ran into when trying to use SPSS (.sav) files is that a lot of information is stored outside of what a dataframe can capture: for example column labels, formats (the number of decimal places), file labels, tables for categories, etc. When you synthesize data from the resulting GMF file (which does not include this information), you get an SPSS file that is not quite comparable to the original, which partly defeats the purpose of metasyn.
CSV files have a similar issue: the delimiter of the synthetic file could differ from that of the original.
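To make the CSV case concrete, here is a minimal sketch using only the Python standard library. It is not the code in this PR: the helper names `detect_delimiter` and `write_rows` and the sample data are made up for illustration. It detects the delimiter of the original file and reuses it when writing the synthetic rows.

```python
import csv
import io


def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter of a CSV sample with the stdlib sniffer.

    Hypothetical helper: restricts the guess to a few common delimiters.
    """
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


def write_rows(rows, delimiter: str) -> str:
    """Write rows to a CSV string using the given delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up "original" file that uses semicolons instead of commas.
original = "id;age;city\n1;34;Utrecht\n2;51;Leiden\n"
delimiter = detect_delimiter(original)

# Made-up synthetic rows, written back with the original delimiter.
synthetic_rows = [["id", "age", "city"], [1, 40, "Leiden"], [2, 29, "Utrecht"]]
synthetic = write_rows(synthetic_rows, delimiter)
print(synthetic)
```

Without carrying the delimiter along, a writer with default settings would produce a comma-separated file, which is the kind of mismatch described above.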
So, to store this information, I see two ways:
Both have advantages. The second option requires minimal or no changes to metasyn itself, which is a big plus. However, part of the dataset's metadata then lives outside the GMF file, so the GMF file contains only part of the metadata and the user has to keep track of two files. For now, I have implemented the feature using the second method for simplicity.
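As a rough illustration of what keeping file-format metadata in a separate file could look like, here is a sketch with the standard library only. The class name `CsvFileHandler`, its methods, and the JSON layout are assumptions for discussion and do not reflect the actual API in this PR or in metasyn.

```python
import csv
import json
import tempfile
from pathlib import Path


class CsvFileHandler:
    """Hypothetical handler that stores CSV format details next to the GMF file."""

    def __init__(self, delimiter: str = ","):
        self.delimiter = delimiter

    def save(self, path: Path) -> None:
        """Store the format metadata in its own JSON file."""
        path.write_text(json.dumps({"format": "csv", "delimiter": self.delimiter}))

    @classmethod
    def load(cls, path: Path) -> "CsvFileHandler":
        """Restore a handler from a previously saved JSON file."""
        meta = json.loads(path.read_text())
        return cls(delimiter=meta["delimiter"])

    def write_synthetic(self, rows, path: Path) -> None:
        """Write synthetic rows using the stored delimiter."""
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter=self.delimiter, lineterminator="\n")
            writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    handler_path = Path(tmp) / "handler.json"   # second file the user has to keep
    synthetic_path = Path(tmp) / "synthetic.csv"

    CsvFileHandler(delimiter=";").save(handler_path)
    restored = CsvFileHandler.load(handler_path)
    restored.write_synthetic([["id", "age"], [1, 40]], synthetic_path)

    with open(synthetic_path, newline="") as handle:
        text = handle.read()

print(text)
```

The trade-off is visible in the usage: nothing about the GMF file changes, but the handler file has to travel with it.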
I have only integrated the file handlers with the CLI, where it makes the most sense. There are two file handlers implemented: .sav and .csv, so both of those can now be read and written back as a synthetic file.
As said, the new API is unfinished, so this is more about the concept than anything else.