Synthesize binary and categorical features as strings or seeds #21

donboyd5 · 2018-12-13T08:46:25Z

The synthesized MARS values are non-integer when they should be integer, and occasionally they fall far from the nearest integer. I round the values to the nearest integer.
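
A minimal sketch of the rounding step described above. It assumes MARS takes the integer codes 1 through 4 (this thread only gives a cardinality of 4), and `snap_to_valid` is a hypothetical helper, not a function from the project. Snapping to the nearest valid code, rather than calling `round` alone, also handles synthesized values that land outside the valid range.

```python
def snap_to_valid(values, valid_codes):
    """Map each synthesized value to the closest valid integer code."""
    return [min(valid_codes, key=lambda code: abs(code - v)) for v in values]


# Hypothetical synthesized MARS values, including two outside the 1-4 range.
synth_mars = [1.2, 2.7, 3.49, 0.4, 4.9]

# Assumption: valid MARS codes are 1, 2, 3 and 4.
snapped = snap_to_valid(synth_mars, valid_codes=[1, 2, 3, 4])
print(snapped)
```

Plain rounding would turn 0.4 into 0 and 4.9 into 5, neither of which is a valid code under this assumption.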

MaxGhenis · 2018-12-14T17:52:08Z

I expanded this issue to include other binary and categorical features, which should be synthesized either as seeds or as strings to avoid decimals. Here's my proposal for features with cardinality < 10, also captured in the pufvars Google sheet:

| vname | vdesc | Cardinality | Synthesis method | Description booklet entry (as needed) |
| --- | --- | --- | --- | --- |
| dsi | Dependent Status Indicator | 2 | Seed | Taxpayer not being claimed as a dependent on another tax return: 0; Taxpayer claimed as a dependent on another tax return: 1 |
| f6251 | Form 6251, Alternative Minimum Tax | 2 | Classification | |
| midr | Married Filing Separately Itemized Deductions Requirement Indicator | 2 | Classification | |
| fded | Form of Deduction Code | 3 | Classification | Aggregated Return: 0; Itemized deductions: 1; Standard deduction: 2; Taxpayer did not use itemized or standard deduction: 3 |
| eic | Earned Income Credit Code | 4 | Regression | No children claimed: 0; One child claimed: 1; Two children claimed: 2; Three children claimed: 3 |
| f2441 | Form 2441, Child Care Credit Qualified Individual | 4 | Regression | No Form 2441 attached to return: 0; Number of qualifying individuals: 1-3 |
| mars | Marital (Filing) Status | 4 | Seed | |
| n24 | Number of Children for Child Tax Credit | 4 | Regression | |
| xtot | Total Exemptions | 6 | Regression | |

We'll test out different specifications of seed vs. classification, so that choice is less important right now. Does this sound right, in that all regression features will be rounded? I think capturing the ordinal nature of low-cardinality features like n24 is more important than avoiding rounding. I'm not aware of an ordinal logistic regression equivalent for RF and trees, but ordinal logistic regression could be an option for linear models down the line.
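
To illustrate why classification on string labels avoids decimals while regression needs rounding, here is a toy sketch. It assumes the classic random forest formulation in which regression averages the per-tree predictions and classification takes a majority vote; the per-tree values are made up, and this is not the project's `rf_synth` code.

```python
from collections import Counter
from statistics import mean

# Hypothetical per-tree predictions for a binary feature such as f6251.
tree_predictions = [0, 0, 1, 1, 1]

# Regression-style aggregation: the average can be any value between 0 and 1.
regression_output = mean(tree_predictions)

# Rounding recovers a valid code and preserves the ordering of the values.
rounded_output = round(regression_output)

# Classification-style aggregation: treat codes as string labels and take
# the most common one, so the result is always one of the observed labels.
labels = [str(p) for p in tree_predictions]
classification_output = Counter(labels).most_common(1)[0][0]

print(regression_output, rounded_output, classification_output)
```

For ordinal features like n24, the rounded regression output keeps the information that 2 lies between 1 and 3, which string labels discard.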

MaxGhenis · 2018-12-15T00:21:03Z

synpuf5 and 6 use all the classification/seed variables in the above table as seed variables, as I need to revise the rf_synth function to support classification. Other variables are rounded.
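
A rough sketch of what treating variables as seeds could look like, assuming a seed is drawn directly from the observed records rather than predicted by a model. The `sample_seed_rows` helper and the example records are hypothetical and are not taken from the synpuf code. Sampling whole rows keeps the joint distribution of the seed columns and can only produce values that already appear in the data.

```python
import random


def sample_seed_rows(records, seed_cols, n, rng):
    """Draw n records with replacement, keeping only the seed columns."""
    picks = rng.choices(records, k=n)
    return [{col: row[col] for col in seed_cols} for row in picks]


# Hypothetical original records; e00200 stands in for a continuous variable.
original = [
    {"mars": 1, "dsi": 0, "e00200": 41250.0},
    {"mars": 2, "dsi": 0, "e00200": 88100.0},
    {"mars": 4, "dsi": 0, "e00200": 30500.0},
    {"mars": 1, "dsi": 1, "e00200": 5200.0},
]

rng = random.Random(0)  # fixed seed so the draw is reproducible
seeds = sample_seed_rows(original, ["mars", "dsi"], n=5, rng=rng)
print(seeds)
```

The remaining variables would then be synthesized conditional on these seed columns.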

These datasets also fix #17 and use 50 instead of 20 trees.

MaxGhenis changed the title ~~Minor issue - synthesized MARS is non-integer~~ Synthesize binary and categorical features as strings Dec 14, 2018

MaxGhenis changed the title ~~Synthesize binary and categorical features as strings~~ Synthesize binary and categorical features as strings or seeds Dec 14, 2018

MaxGhenis closed this as completed Dec 15, 2018

MaxGhenis self-assigned this Dec 15, 2018

MaxGhenis mentioned this issue Dec 15, 2018

Ensure e00600 >= e00650 and e01500 >= e01700 #17

Open
