Which variables should we seed on? #37
Oh, the seeds include the calculated variables - I wasn't aware of that.
Still if the synthesis process amounts to "find a record with the same
values as the seeds, and call that the synthetic record" then it isn't
synthesizing at all. Is there an explanation for why this is happening?
dan
…On Thu, 7 Mar 2019, Max Ghenis wrote:
We've seen that, in general, the more seeds in the synthesis production, the
higher-fidelity the synthesis is, at the expense of privacy. More precisely,
the relationship probably has to do with the unique identifiability of
records when limited to the seeds.
For example, the only difference between the green and red bars here is that
the green adds several more seeds:
image
Furthermore, even calculated seeds (which are dropped after the synthesis to
be recalculated with Tax-Calculator) produce this relationship. The green
bar above used calculated seeds.
Another data point supporting this is synthpop8, which used 9 calculated
seeds ('E00100', 'E04600', 'P04470', 'E04800', 'E62100', 'E05800', 'E08800',
'E59560', 'E26190') that together uniquely identified over 80% of records.
Each row in this synthesis exactly matched a training record, indicating we
need to use far fewer seeds.
While we shouldn't use too many, we may also care a special amount about
these calculated features, which could justify seeding on them rather than
seeding on some other raw feature. Whether this approach improves the
validity of calculated features like AGI is an empirical question we haven't
tested, but it seems like a reasonable hypothesis.
Selecting the seeds is therefore one of the most important decisions in the
synthesis process. I'd suggest a couple factors to consider in this
decision:
1. Prioritizing categorical features. This simplifies the synthesis process
to be only on continuous measures. So for example, we'd want to
prioritize MARS.
2. Prioritizing logically "initial" features. For example, XTOT, nu18, MARS
etc. are features of the household which logically precede income and
deduction measures. This feeds into the question of visit sequence.
3. Prioritizing the most important features. This could be critical
calculated features like AGI, or the most important features in
determining those critical calculated features.
Regarding (3): I ran a random forests model to determine the importance of
each "raw" feature in predicting the 9 calculated features in synthpop8.
Here are the top 5, according to the average rank in predicting those 9:
1. E00200 (salaries and wages): most important for predicting E26190
(non-passive income) and E59560 (earned income for EIC).
2. E18400 (SALT): most important for E05800 (income tax before credit),
E08800 (income tax after credits), and P04470 (total deductions).
3. S006 (weight): most important for E04800 (taxable income), E05800
(taxbc), and E08800 (taxac).
4. E02000 (Schedule E), most important for E26190 (non-passive income).
5. P23250 (Long-term gains less losses), most important for E00100 (AGI),
E04800 (taxable income), and E62100 (alternative minimum taxable
income).
image
Together these 5 features uniquely identify 61% of PUF records, so we'd
probably still want a subset, especially if we add something like MARS and
XTOT, but I suspect these will be valuable and avoid extra complexity of
seeding on calculated features (also makes a simpler story to SOI that we're
only using 65 features).
import pandas as pd
FEATURES = ['E00200', 'E18400', 'S006', 'E02000', 'P23250']
# Share of PUF records that are unique on these five features.
(~pd.read_csv('~/puf2011.csv', usecols=FEATURES).duplicated(keep=False)).mean()
# 0.6131326698821662
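For readers without PUF access, here is a minimal sketch of the same uniqueness measure on made-up data. The helper `share_unique` and the toy records are hypothetical and only meant to show what "uniquely identify" counts: a record is unique when no other record shares all of its values on the chosen features.

```python
from collections import Counter

def share_unique(rows):
    """Fraction of rows whose full tuple of values appears exactly once."""
    counts = Counter(rows)
    return sum(1 for row in rows if counts[row] == 1) / len(rows)

# Hypothetical toy records as (wages, SALT) pairs; these are not PUF values.
toy = [(0, 0), (0, 0), (50_000, 1_200), (50_000, 1_200), (82_000, 3_400)]
print(share_unique(toy))  # 0.2
```

Only the last toy record has no twin, so one record in five is unique.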
?
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the
thread.[AHvQVdlEflHBWi3-JwUWPjUESGb08bl1ks5vUFaYgaJpZM4biEaM.gif]
|
What does it mean for "5 features uniquely identify 61% of PUF records". Does it mean "an exact match on 5 continuous variables" or something less? |
This isn't how the synthesis works in general, but it is how it works when there's no conditional variance of the synthesized features. If you have a tree-based model based on data where all records where x=2 and y=3 also have z=1, and you pass it data where x=2 and y=3, that tree-based model may assign 100% probability to the z=1 scenario. Depending on how strong this is, models that do more to fight overfitting like random forests could still assign that 100% probability. That seems to be what's happening here, and indicates we need to increase the conditional variance by reducing the conditions (seeds).
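The x/y/z example can be sketched with a plain lookup table of empirical conditional distributions. This is an assumption-laden stand-in, not the CART or random forest code used for the synthesis, but a fully grown tree would behave the same way on this toy data. The function name and records are hypothetical.

```python
from collections import Counter, defaultdict

def conditional_distribution(records, condition_keys, target_key):
    """Empirical P(target | conditions) as {condition values: {target: prob}}."""
    groups = defaultdict(Counter)
    for rec in records:
        cond = tuple(rec[k] for k in condition_keys)
        groups[cond][rec[target_key]] += 1
    return {
        cond: {val: n / sum(cnt.values()) for val, n in cnt.items()}
        for cond, cnt in groups.items()
    }

# Hypothetical toy training data.
train = [
    {"x": 2, "y": 3, "z": 1},
    {"x": 2, "y": 3, "z": 1},
    {"x": 2, "y": 4, "z": 5},
    {"x": 2, "y": 5, "z": 9},
]

both = conditional_distribution(train, ["x", "y"], "z")
only_x = conditional_distribution(train, ["x"], "z")
print(both[(2, 3)])   # {1: 1.0}
print(only_x[(2,)])   # {1: 0.5, 5: 0.25, 9: 0.25}
```

Conditioning on both x and y leaves no variance in z, while dropping y as a condition restores it. That is the sense in which fewer seeds increase conditional variance.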
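Right, restricting the PUF to ['E00200', 'E18400', 'S006', 'E02000', 'P23250'] produces a dataset where 61% of records are unique (this doesn't concern synthetic data). |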
On Wed, 6 Mar 2019, Max Ghenis wrote:
if the synthesis process amounts to "find a record with the same values as
the seeds, and call that the synthetic record" then it isn't synthesizing
at all.
This isn't how the synthesis works in general, but it is how it works when there's no
conditional variance of the synthesized features. If you have a tree-based model based
on data where all records where x=2 and y=3 also have z=1, and you pass it data where
x=2 and y=3, that tree-based model may assign 100% probability to the z=1 scenario.
If there are 9 continuous variables, is it surprising there is only one
exact match? I thought the "match" was only to "above median" or "below
median", which should make a match fairly unlikely.
There is also the oddity that the revenue scores are so poor if all the
matches are exact. Is it only the weights that are off??
Dan
What does it mean for "5 features uniquely identify 61% of PUF records".
Does it mean "an exact match on 5 continuous variables" or something less?
Right, restricting the PUF to ['E00200', 'E18400', 'S006', 'E02000', 'P23250']
produces a dataset where 61% of records are unique (this doesn't concern synthetic
data).
|
I'm not really surprised, but it depends on the variable; some are finer-grained than others, and some correlate more with others.
It's probably mostly the weights. Are you using @donboyd5's revised linear-programmed weights? It could also be that it's not the same records in the same representation; all synthetic records are exactly present in the true PUF, but I haven't checked if the reverse is true. |
Isn't there a way to loosen the restriction for a match from exact match to
"in the same bin"? In the examples I recall reading about, the bins were
above or below median.
Isn't there a way to specify a minimum number of leaves before additional
subdivision takes place? I recall reading examples where a minimum of 5
leaves were required.
Am I mistaken in my belief that the synthesis process maintains
covariances only to the extent that they are mediated by seed variables?
For example, what about the synthesis process encourages property tax and
mortgage interest to be correlated?
Is there somewhere I can read up on this?
Dan
|
Thanks, Max, this is great, with a lot of great detective work. It gives us lots to talk about tomorrow. I created a Google doc named selected_MARS3group_puf_synthpop8_matches in our Google drive synpuf folder that explains some of my reasons for what I say below, and I also sent a link to each of you. To be on the safe side I am not putting the doc link here but if you have access to the folder you can get it. I have four main comments:
That doesn't mean we shouldn't be on the lookout for them, but it does mean we have to interpret them carefully and think carefully about what to address and how.
Where we do have to reduce seeds, and we may need to, Max's detective work will prove really valuable.
|
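@donboyd5 your doc says:
> The total number of puf records involved in exact matches (npufrecs=419) out of the 3,144 puf records with MARS=3, and the number of syn records involved in exact matches (nsynrecs=1,057) out of the 15,720 syn records in the group.
Could you share some records in synthpop8 that you found don't exactly match a training record? I just triple-checked that all records in synthpop8 exactly match training records on all features in this notebook <https://colab.research.google.com/drive/13qxcg_GEzUONqMyw_UaSMB2PN4k8kcDH>. Note I'm dropping S006 because that'll be reconstructed, and isn't relevant to privacy concerns.
We should decide whether we're treating PUF data as real data, as we've discussed in the past. We know that SOI blurs and rounds data, that lots of fields are zero, and that some records are duplicated when limiting to the 65 features we're synthesizing, but without the real data or details on how exactly they blur, how many real records each PUF record represents, etc., I think we need to just treat it as real data. That should mean avoiding synthesizing exact matches on records that appear only once in the PUF.
@feenberg asked:
> Isn't there a way to loosen the restriction for a match from exact match to "in the same bin"? In the examples I recall reading about, the bins were above or below median.
Right now we're seeing true exact matches, and we're also looking at distance measures. I think below/above median would be too blunt an instrument to evaluate privacy concerns.
> Isn't there a way to specify a minimum number of leaves before additional subdivision takes place? I recall reading examples where a minimum of 5 leaves were required.
Yes, I think synthpop CART does this, but I'm not sure this guarantees variance.
> Am I mistaken in my belief that the synthesis process maintains covariances only to the extent that they are mediated by seed variables? For example, what about the synthesis process encourages property tax and mortgage interest to be correlated?
No, the synthesis maintains covariances by including them in each prediction model. Suppose we only seed on MARS, and then the first two non-seed synthesized features are property tax and mortgage interest. Property tax will essentially be synthesized as the distribution of property tax, conditional on each MARS value. Mortgage interest will then be synthesized as the distribution of mortgage interest conditional on each record's MARS value, and its conditional property tax. Each covariance is maintained this way: one of each pair of features is synthesized as the distribution conditioned (at least in part) on the other. |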
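A rough sketch of that sequential idea, under loud assumptions: coarse property-tax bins stand in for the CART or random forest models discussed in this thread, and `BIN_WIDTH`, `tax_bin`, `synthesize`, and the `TRAIN` records are all hypothetical.

```python
import random

BIN_WIDTH = 2000  # hypothetical bin width standing in for a fitted model

def tax_bin(value):
    """Coarse property-tax bin used as the conditioning value."""
    return value // BIN_WIDTH

def synthesize(train, n, seed=0):
    """Draw n synthetic records one feature at a time."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        # Seed: take MARS from a randomly chosen training record.
        mars = rng.choice(train)["mars"]
        # Step 1: property tax, conditional on MARS.
        same_mars = [r for r in train if r["mars"] == mars]
        prop_tax = rng.choice(same_mars)["prop_tax"]
        # Step 2: mortgage interest, conditional on MARS and the
        # property tax just synthesized (via its bin).
        similar = [r for r in same_mars
                   if tax_bin(r["prop_tax"]) == tax_bin(prop_tax)]
        mort_int = rng.choice(similar)["mort_int"]
        synthetic.append(
            {"mars": mars, "prop_tax": prop_tax, "mort_int": mort_int})
    return synthetic

# Hypothetical toy training data, not PUF values.
TRAIN = [
    {"mars": 1, "prop_tax": 1000, "mort_int": 2000},
    {"mars": 1, "prop_tax": 1500, "mort_int": 2500},
    {"mars": 1, "prop_tax": 6000, "mort_int": 9000},
    {"mars": 1, "prop_tax": 6500, "mort_int": 9500},
    {"mars": 2, "prop_tax": 3000, "mort_int": 5000},
    {"mars": 2, "prop_tax": 3500, "mort_int": 5500},
]

syn = synthesize(TRAIN, 200)
```

Because mortgage interest is drawn only from records with a similar property tax, high property tax keeps going with high mortgage interest in the output, even though neither feature is a seed.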
Let me dig it out and send email back.
These are a lot of good things to talk about tomorrow. Does 2pm ET work for
you the two of you?
Don
…On Thu, Mar 7, 2019 at 1:58 PM Max Ghenis ***@***.***> wrote:
@donboyd5 <https://github.com/donboyd5> your doc
<https://docs.google.com/document/d/1c3Sz3MY1oXOugYX8h4EcGKFzm9AROmGsYWGOxKMcKQc>
says:
The total number of puf records involved in exact matches (npufrecs=419)
out of the 3,144 puf records with MARS=3, and the number of *syn records
involved in exact matches (nsynrecs=1,057) out of the 15,720 syn records in
the group.*
Could you share some records in synthpop8 that you found don't exactly
match a training record. I just triple-checked that all records in
synthpop8 exactly match training records on all features in this notebook
<https://colab.research.google.com/drive/13qxcg_GEzUONqMyw_UaSMB2PN4k8kcDH>.
Note I'm dropping S006 because that'll be reconstructed, and isn't
relevant to privacy concerns.
We should decide whether we're treating PUF data as real data, as we've
discussed in the past. We know that SOI blurs and rounds data, that lots of
fields are zero, and that some records are duplicated when limiting to the
65 features we're synthesizing, but in lieu of the real data or details on
how exactly they blur, how many real records each PUF record represents,
etc., I think we need to just treat it as real data. That should mean
avoiding synthesizing exact matches on records that appear only once in the
PUF.
@feenberg <https://github.com/feenberg> asked:
Isn't there a way to loosen the restriction for a match from exact match to
"in the same bin"? In the examples I recall reading about, the bins were
above or below median.
Right now we're seeing true exact matches, and we're also looking at
distance measures. I think below/above median would be too blunt an
instrument to evaluate privacy concerns.
Isn't there a way to specify a minimum number of leaves before additional
subdivision takes place? I recall reading examples where a minimum of 5
leaves were required.
Yes I think synthpop CART does this, but I'm not sure this guarantees
variance.
Am I mistaken in my belief that the synthesis process maintains
covariances only to the extent that they are mediated by seed variables?
For example, what about the synthesis process encourages property tax and
mortgage interest to be correlated?
No, the synthesis maintains covariances by including them in each
prediction model. Suppose we only seed on MARS, and then the first two
non-seed synthesized features are property tax and mortgage interest.
Property tax will essentially be synthesized as the distribution of
property tax, conditional on each MARS value. Mortgage interest will then
be synthesized as the distribution of mortgage interest conditional on each
record's MARS value, and its conditional property tax. Each covariance is
maintained this way: one of each pair of features is synthesized as the
distribution conditioned (at least in part) on the other.
|
I matched from puf to syn, rather than from syn to puf. Maybe that is the
difference. But let me check.
That's worth discussion, too.
Don
|
Would 1PM or 1:30PM be OK? I think syn->puf is more relevant to privacy concerns, since we want to avoid releasing synthetic records that look too much like real records. The reverse is useful for comprehensiveness--to the extent that real records add value, ensuring they're not totally ignored by the model will probably produce a better synthesis--but outside this particular scope IMO. |
On Thu, 7 Mar 2019, Max Ghenis wrote:
> No, the synthesis maintains covariances by including them in each prediction
> model. Suppose we only seed on MARS, and then the first two non-seed
> synthesized features are property tax and mortgage interest. Property tax
> will essentially be synthesized as the distribution of property tax,
> conditional on each MARS value. Mortgage interest will then be synthesized
I understand this. We sample from the property tax values divided into
MARS categories.
> as the distribution of mortgage interest conditional on each record's MARS
> value, and its conditional property tax. Each covariance is maintained this
How is this done? "Conditional on" can cover a multitude of possible
procedures when the variables are continuous. I am thinking that bins for
the cross of MARS and property tax ranges are created, and for each puf
record a value of mortgage interest is selected from a record whose MARS
and property tax fall into the same bin. But even if the bins start out
with large numbers of possible values, don't the bins get very small as
the number of variables synthesized increases? I am thinking that 2**65 is a
very large number.
I guess I am still confused.
dan
> way: one of each pair of features is synthesized as the distribution
> conditioned (at least in part) on the other.
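The 2**65 point can be checked with quick arithmetic. The record count below is a hypothetical round number chosen for illustration, not the actual PUF size.

```python
# Cells produced if each of 65 features is split once, at its median.
n_features = 65
cells = 2 ** n_features
print(cells)  # 36893488147419103232

# Hypothetical record count, for illustration only.
n_records = 150_000
records_per_cell = n_records / cells
print(records_per_cell < 1e-14)  # True
```

Under pure binning almost every cell would be empty, which is why the question of what structure the prediction model imposes matters.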
|
Either of those times is good for me.
Don
|
@feenberg We're basically using quantile regression, where the regression incorporates all the seeds and previously synthesized features. So we're predicting the 10th percentile, 20th, 30th, etc., and sampling a random quantile from there to capture the full conditional distribution. In reality both CART and RF do this nonparametrically, so it's something in between the binning approach you describe and parametric regression models. |
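A minimal sketch of "sampling a random quantile", assuming the conditional distribution is represented by the training values that land in one leaf. This is an illustration of the idea only, not the synthpop or random forest implementation; `sample_from_leaf` and the leaf contents are hypothetical.

```python
import random

def sample_from_leaf(leaf_values, rng):
    """Return the empirical quantile of leaf_values at a random level u."""
    ordered = sorted(leaf_values)
    u = rng.random()  # random quantile level in [0, 1)
    index = min(int(u * len(ordered)), len(ordered) - 1)
    return ordered[index]

rng = random.Random(0)
big_leaf = [100, 250, 400, 900, 1500]  # hypothetical donor values
tiny_leaf = [400]                      # leaf holding one training record

tiny_draws = {sample_from_leaf(tiny_leaf, rng) for _ in range(1000)}
big_draws = {sample_from_leaf(big_leaf, rng) for _ in range(1000)}
print(tiny_draws)  # {400}
```

When a leaf holds a single training value, every random quantile returns that value, so the draw reproduces the training record no matter how it is randomized.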
Hi Max,
I'm really sorry to say this, but somehow the Google Drive file
...synpuf\syntheses\synthpop8.csv is, despite its name, the puf, which you
can see from the variable ftype and from the # of records (=# in puf, not
5x that #). I don't know how I did it, but apparently I did and I'm sorry.
I know you invested a lot of work in it. As one possible small silver
lining, maybe we all benefited nonetheless if it drove you to think about
ways to select seed variables that are more methodical and smart than what
I've been doing. But I never want to waste anyone's time, so I'm sorry
about that.
Anyway, I went back to the file I've been using, which is
...synpuf\synthpop8_stack.csv, which has the synthetic file stacked with a
conforming puf (i.e., without aggregate records, and with only the same
variables). I have pulled 72 synthetic records that are in the synthetic
part of that file but not in the puf part of the file and written them to
...synpuf\synthpop8_selected_nonmatches.csv. These all have MARS=3.
In addition to the synthesized variables it has the following variables of
note:
- ftype -- identifies the portion of synthpop8_stack.csv that this record
comes from puf or syn -- it will be syn for every record because I found no
such records in the puf part
- rownum -- this is the number of the row in synthpop8_stack.csv in which
you can find this record in case you want to do that. IT IS NOT A COLUMN IN
synthpop8_stack.csv -- I CREATED IT AFTER THE FACT -- BUT IT WILL MATCH THE
SEQUENCE POSITION OF THE RECORD IN synthpop8_stack.csv.
- n -- the number of records in synthpop8_stack.csv that are identical to
this record, based on the variables to the right of wt (I did not include
wt in the exact match); you will note that every record either has n=10 or
n=62 -- there are two different sets of identical records
- npuf - the number of those records (the n records) that came from the puf
part of the file; this will be 0 for all
- nsyn - the number of those records that came from the syn part of the
file (this will be either 10 or 62 for all)
Again, I'm sorry about this. This highlights the importance of Dan's
comment about having file structure and names all in one place. Maybe I can
do that by saying more in the Google sheet, or else by writing a Google
doc. Anyway, let's include this in our discussion tomorrow.
Don
|
Dan do you have a preference for 1:00 pm or 1:30 pm (Eastern time) tomorrow (assuming you can make the call)? |
I should add that I checked for exact matches in both directions -- within MARS=3, all puf against all syn records and all syn records against all puf records. (This is easy for exact-match checks. Much more computing work for distance measures.) |
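A small sketch of exact-match checking in both directions, using sets of tuples. The records are hypothetical, with weights assumed to be dropped already, and `exact_matches` is an illustrative helper rather than the code used for the counts above.

```python
def exact_matches(source, reference):
    """Rows of source that appear exactly in reference."""
    ref = set(reference)
    return [row for row in source if row in ref]

# Hypothetical toy records as tuples; not PUF or synthpop8 values.
puf = [(1, 0, 500), (3, 2, 900), (3, 1, 700)]
syn = [(3, 2, 900), (3, 2, 900), (3, 1, 650), (1, 0, 500)]

syn_in_puf = exact_matches(syn, puf)  # syn -> puf direction
puf_in_syn = exact_matches(puf, syn)  # puf -> syn direction
print(len(syn_in_puf), len(puf_in_syn))  # 3 2
```

The two directions give different counts because one training record can be matched by several synthetic records, so it matters which file is treated as the source.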
On Thu, 7 Mar 2019, Max Ghenis wrote:
@feenberg We're basically using quantile regression, where the regression
incorporates all the seeds and previously synthesized features. So we're
predicting the 10th percentile, 20th, 30th, etc., and sampling a random
quantile from there to capture the full conditional distribution.
But unless the quantile regression imposes some structure on the shape of
the distribution, you end up in the end with 10**65 bins, so most bins
will have zero entries, but a few will have a single entry. I imagine the
quantile regression does impose structure - linear or log-linear of some
sort.
It seems to me that the mere fact that the synthesized records are no
different from the training records is positive evidence that something is
wrong with the methodology, and should not be ascribed to having too many
seed variables. You describe the result as "sampling a random quantile".
How can the random sample be the same as the training set unless chosen
from a universe of one? Isn't that the problem, not the large number of
seeds? But apparently increasing the number of seeds decreases the size of
leaves from which to sample. Can that be right?
dan
|
Please see my prior note. It's my fault. The synthpop8.csv file Max was
looking at was indeed the puf. The note above gives the proper file to use.
Don
|
So only 72 synthetic records are not identical to a PUF record? This is a
randomization process that is only epsilon away from a simple copy command.
This can only happen if the "sampling from a conditional distribution" is
sampling from a universe of one for each value in each output record.
Dan
|
On Thu, 7 Mar 2019, Don Boyd wrote:
> Dan do you have a preference for 1:00 pm or 1:30 pm tomorrow (assuming you
> can make the call)?
1:30 preferred.
dan
|
No, it was just a selection of nonmatching records. See Google doc
mentioned earlier.
On Thu, Mar 7, 2019 at 4:55 PM Daniel Feenberg wrote:
So only 72 synthetic records are not identical to a PUF record? This is a
randomization process that is only epsilon away from a simple copy command.
This can only happen if the "sampling from a conditional distribution" is
sampling from a universe of one for each value in each output record.
Dan
|
You should begin much earlier in the thread at
#37 (comment) (or
earlier) - and also read the google doc mentioned there.
|
@donboyd5 no problem, thanks for checking. I re-ran the distance metrics on 1% of
@feenberg Linear quantile regression would impose a linear structure on the relationships, but by using RF/CART, we don't impose a structure, nor do we have to define a huge number of bins. These tree methods split on each node (feature) based on semi-random thresholds, and then either recursively improve these splits (CART) or build many trees (RF) to produce the predictions, which are then sampled to generate the conditional quantiles. Here's an explanation of RF: everything is the same in RF regression as it is in RF quantile regression, except for the final stage, where we use the distribution of predictions instead of the mean. |
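To make that final stage concrete, here is a minimal sketch of the difference between ordinary RF regression and the sampling variant described above. It is a simplification and not the project's synthesis code: the per-tree predictions are made-up numbers for a single record, and the function names are hypothetical. A real forest would supply one prediction per tree for each record.

```python
import random
from statistics import mean


def predict_mean(tree_predictions):
    """Ordinary RF regression: average the per-tree predictions."""
    return mean(tree_predictions)


def sample_synthetic_value(tree_predictions, rng):
    """Sampling variant: draw one per-tree prediction at random."""
    return rng.choice(tree_predictions)


def conditional_quantile(tree_predictions, q):
    """Crude empirical quantile of the per-tree predictions (no interpolation)."""
    ordered = sorted(tree_predictions)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


# Made-up predictions from five trees for one record.
tree_predictions = [0.0, 0.0, 1200.0, 3500.0, 9000.0]

rng = random.Random(0)  # fixed seed so the draws are reproducible
draws = [sample_synthetic_value(tree_predictions, rng) for _ in range(1000)]

point_estimate = predict_mean(tree_predictions)
median_estimate = conditional_quantile(tree_predictions, 0.5)
```

The mean collapses the five predictions into one number that no tree actually predicted, while the sampled draws keep the spread, including the zeros. This also bears on Dan's point: if every tree returns the same value for a record, the "distribution" has one element and sampling from it is just a copy.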
Re @MaxGhenis's earlier comment, I agree, we need to treat puf data as if they are true tax returns, and hold ourselves to that standard. Whether we think that is best or not doesn't matter as it is how SOI wants to view it, so we need to view it that way, also. While I have made some comments about certain kinds of exact matches that shouldn't be worrisome, I think we need to worry about them nonetheless and find best possible ways of eradicating any exact matches that involve non-zero-valued continuous variables (in addition to categorical variables) and perhaps even exact matches that include categoricals and only zero-valued continuous variables - these are good topics for discussion. That said, in some senses it may be a harder test than comparison to true returns, and in others it might be easier. I think exact matches are likely to be less of a concern vs. true returns (as they will not have been blurred), but I am not sure whether distances will be a harder or easier test. I do believe that after we get fully comfortable with comparisons to puf, we should seek a way to get low-stakes comparisons done against true returns before we face a high-stakes do or die test (via SOI) by that approach. |
I kept promising to pull together some notes on distance measures. I have been failing, but I have made some progress. You can find what I've done here. I'll try to update it. |
We've seen that, in general, the more seeds in the synthesis production, the higher-fidelity the synthesis is, at the expense of privacy. More precisely, the relationship probably has to do with the unique identifiability of records when limited to the seeds.
For example, the only difference between the green and red bars here is that the green adds several more seeds:
Furthermore, even calculated seeds (which are dropped after the synthesis to be recalculated with Tax-Calculator) produce this relationship. The green bar above used calculated seeds.
Another data point supporting this is synthpop8, which used 9 calculated seeds ('E00100', 'E04600', 'P04470', 'E04800', 'E62100', 'E05800', 'E08800', 'E59560', 'E26190') that together uniquely identified over 80% of records. Each row in this synthesis exactly matched a training record, indicating we need to use far fewer seeds.
While we shouldn't use too many, we may also care a special amount about these calculated features, which could justify seeding on them rather than seeding on some other raw feature. Whether this approach improves the validity of calculated features like AGI is an empirical question we haven't tested, but it seems like a reasonable hypothesis.
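The "uniquely identified" share quoted above can be checked for any candidate seed set with a few lines. This is a minimal sketch under assumptions: the records and their values are made up, and `share_uniquely_identified` is a hypothetical helper, not a function from the project.

```python
from collections import Counter


def share_uniquely_identified(records, seeds):
    """Fraction of records whose combination of seed values appears exactly once."""
    keys = [tuple(record[s] for s in seeds) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Made-up training records; a real check would run over the PUF.
records = [
    {"MARS": 1, "XTOT": 1, "E00100": 40000},
    {"MARS": 1, "XTOT": 1, "E00100": 52000},
    {"MARS": 2, "XTOT": 4, "E00100": 90000},
    {"MARS": 2, "XTOT": 4, "E00100": 90000},
]

few_seeds = share_uniquely_identified(records, ["MARS", "XTOT"])
more_seeds = share_uniquely_identified(records, ["MARS", "XTOT", "E00100"])
```

Adding the continuous E00100 as a seed moves the share from 0.0 to 0.5 in this toy data, which is the same direction as the effect described for synthpop8.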
Selecting the seeds is therefore one of the most important decisions in the synthesis process. I'd suggest a couple factors to consider in this decision:
Regarding (3): I ran a random forests model to determine the importance of each "raw" feature in predicting the 9 calculated features in synthpop8. Here are the top 5, according to the average rank in predicting those 9:
- E00200 (salaries and wages): most important for predicting E26190 (non-passive income) and E59560 (earned income for EIC).
- E18400 (SALT): most important for E05800 (income tax before credit), E08800 (income tax after credits), and P04470 (total deductions).
- S006 (weight): most important for E04800 (taxable income), E05800 (taxbc), and E08800 (taxac).
- E02000 (Schedule E): most important for E26190 (non-passive income).
- P23250 (long-term gains less losses): most important for E00100 (AGI), E04800 (taxable income), and E62100 (alternative minimum taxable income).
Together these 5 features uniquely identify 61% of PUF records, so we'd probably still want a subset, especially if we add something like MARS and XTOT, but I suspect these will be valuable and avoid the extra complexity of seeding on calculated features (it also makes a simpler story to SOI that we're only using 65 features).
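For reference, the "average rank" ordering can be sketched as follows. This is only an illustration of the ranking step, assuming the importances have already been produced by one model per calculated target: the importance numbers are made up, only three features and three targets are shown, and `average_ranks` is a hypothetical helper. Ties are not handled specially.

```python
def average_ranks(importances):
    """Map each feature to its mean rank across targets (1 = most important)."""
    ranks = {}
    for by_feature in importances.values():
        ordered = sorted(by_feature, key=by_feature.get, reverse=True)
        for rank, feature in enumerate(ordered, start=1):
            ranks.setdefault(feature, []).append(rank)
    return {feature: sum(r) / len(r) for feature, r in ranks.items()}


# Made-up importances: {calculated target: {raw feature: importance}}.
importances = {
    "E00100": {"E00200": 0.5, "P23250": 0.3, "E18400": 0.2},
    "E04800": {"E00200": 0.2, "P23250": 0.5, "E18400": 0.3},
    "P04470": {"E00200": 0.1, "P23250": 0.2, "E18400": 0.7},
}

avg = average_ranks(importances)
top_features = sorted(avg, key=avg.get)  # best (lowest) average rank first
```

Averaging ranks rather than raw importances keeps one target with unusually large importance values from dominating the ordering.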