-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove unnecessary fields in inputs
/dataset
section of config files
#485
Comments
See also: imperialCHEPI/healthgps-examples#12 |
Just to add, one alternative to removing these fields would be to make some/al of them constants for now. We could enforce this with a schema. If we add support for changing them to HGPS at some point in the future, we can update the schema accordingly. That said, I'd still be inclined to remove |
Things like format and encoding can go. If we remove `delimiter, I wonder if the CSV lib we use is smart enough to figure out European decimal points? Had all sorts of trouble when Ali was feeding me mixed period/comma config CSVs. I'm a bit hesitant about inferring column type from the files, as I've seen mixed I think as long as it doesn't cause problems, go for it. |
I think the answer there is to only allow commas as delimiters and dots as decimal points. If we open the door to users providing the data in weird formats, then we make everyone's life harder.
I was assuming that these CSV files always have the same columns -- is that not the case? If not, then it does make sense to leave things as is. I was hoping to be able to use table schemas to validate CSV files (which similarly indicate the data type of each column), but obviously that will only work if we're expecting CSV files to be in a specific format. |
Yeah, columns can be added, for instance. Come to think of it, I think these |
That's a bit strange... So, to be clear, are you saying that there are:
If so, we could define table schemas which require certain columns to be present, but allowing for arbitrary extra columns, if the user wants. |
I mean that the whole file is used only in displaying the initial stat table, I think. Never used elsewhere, 97% sure. |
This section currently looks something like this:
This seems a bit bloated and I'm wondering if we can cut it down (which would simplify #449).
Some thoughts after going through the examples repo:
format
: This is always"csv"
. Not used by HGPS anywhere. HGPS can only read CSV files anyway, so I suggest we just drop this.delimiter
: This is always","
, though it looks like the code should work with other delimiters. That said, I can't think why users would actually want to do this and I don't think we should encourage users to separate their fields with heart emojis or whatever, so let's just drop this too.encoding
: This is always"ASCII"
or"UTF8"
, though all of the examples are in plain ASCII. Not used by HGPS anywhere. I don't know, but I suspect if you tried throwing some unicode in there then HGPS would fall over (on Windows, at least -- C++ doesn't have builtin unicode support). (Maybe we should investigate this and open an issue if necessary?) I think we should only support ASCII or UTF8 and drop this field.columns
: This one specifies the names and types for the different columns. I'm not sure about this one, as these fields are actually used in the program. An alternative would be to infer the types of the columns based on the contents of the CSV file (or just assume everything's adouble
).Any thoughts @jamesturner246?
The text was updated successfully, but these errors were encountered: