We investigate whether generating synthetic data can be a viable strategy for providing access to detailed geocoding information for external researchers, without compromising the confidentiality of the units included in the database. Our work was motivated by a recent project at the Institute for Employment Research in Germany that linked exact geocodes to the Integrated Employment Biographies, a large administrative database containing several million records. We evaluate the performance of three synthesizers regarding the trade-off between preserving analytical validity and limiting disclosure risks: one synthesizer employs Dirichlet Process mixtures of products of multinomials, while the other two use different versions of Classification and Regression Trees (CART). In terms of preserving analytical validity, our proposed synthesis strategy for geocodes based on categorical CART models outperforms the other two. If the risks of the synthetic data generated by the categorical CART...
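The CART-based synthesis strategy described above can be sketched in miniature: fit a classification tree to the confidential categorical variable, then draw each synthetic value from the empirical class distribution in the leaf that the record falls into, rather than taking the leaf's majority class. The toy data, variable names, and tree settings below are illustrative assumptions, not the IAB implementation.

```python
# Minimal sketch of CART-based categorical synthesis (illustrative toy data,
# not the IAB geocoding project).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy confidential data: two categorical predictors and a categorical target
# (a coarse region code standing in for a geocode).
n = 1000
X = rng.integers(0, 4, size=(n, 2))               # e.g. industry, firm-size class
y = (X[:, 0] + rng.integers(0, 2, size=n)) % 4    # region depends on column 0

# Fit a classification tree on the confidential data.
tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0).fit(X, y)

# Synthesize: sample each record's region from the class proportions in its
# leaf (predict_proba returns exactly those proportions for a decision tree).
leaf_probs = tree.predict_proba(X)
classes = tree.classes_
y_synth = np.array([rng.choice(classes, p=p) for p in leaf_probs])

print(y_synth[:10])
```

Sampling from the leaf distribution, instead of predicting the modal class, is what preserves the conditional variability of the confidential variable in the synthetic data.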
We present an approach for evaluating disclosure risks for fully synthetic categorical data. The basic idea is to compute probability distributions of unknown confidential data values given the synthetic data and assumptions about intruder knowledge. We use a "worst-case" scenario of an intruder knowing all but one of the records in the confidential data. To create the synthetic data, we use a Dirichlet process mixture of products of multinomial distributions, which is a Bayesian version of a latent class model. In addition to offering synthetic data with decent utility, the likelihood function admits simple and convenient approximations to the posterior distributions via importance sampling. We apply the model to synthesize a subset of data from the American Community Survey.
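The generative side of the latent class model described above can be sketched with a truncated stick-breaking approximation to the Dirichlet process: draw a latent class for each record, then draw each categorical variable independently from that class's multinomial. All parameter values below are illustrative assumptions, not fitted to any real data.

```python
# Minimal sketch of sampling synthetic records from a (truncated) Dirichlet
# process mixture of products of multinomials (illustrative parameters only).
import numpy as np

rng = np.random.default_rng(1)

K = 5                # truncation level: number of latent classes
J = 3                # number of categorical variables
levels = [4, 3, 5]   # number of categories per variable

# Stick-breaking weights for the class probabilities (concentration alpha=1).
v = rng.beta(1.0, 1.0, size=K)
v[-1] = 1.0          # close the stick so the weights sum to one
pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

# Class-specific multinomial probabilities for each variable, shape (K, levels[j]).
theta = [rng.dirichlet(np.ones(levels[j]), size=K) for j in range(J)]

def sample_record():
    z = rng.choice(K, p=pi)                       # latent class membership
    return [rng.choice(levels[j], p=theta[j][z])  # variables independent given z
            for j in range(J)]

synthetic = np.array([sample_record() for _ in range(100)])
print(synthetic[:5])
```

Conditional independence of the variables given the latent class is what makes the likelihood tractable enough for the importance-sampling approximations mentioned in the abstract.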
In typical implementations of multiple imputation for missing data, analysts create m completed data sets based on approximately independent draws of imputation model parameters. We use theoretical arguments and simulations to show that, provided m is large, the use of independent draws is not necessary. In fact, appropriate use of dependent draws can improve precision relative to the use of independent draws. It also eliminates the sometimes difficult task of obtaining independent draws; for example, in fully Bayesian imputation models based on MCMC, analysts can avoid the search for a subsampling interval that ensures approximately independent draws for all parameters. We illustrate the use of dependent draws in multiple imputation with a study of the effect of breast feeding on children's later cognitive abilities.
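The idea of imputing from dependent draws can be sketched with a toy normal model: run one Gibbs chain and take m consecutive post-burn-in states of the parameters, instead of thinning the chain to approximate independence, imputing the missing values once per state. The data and model below are illustrative assumptions, not the breast feeding study.

```python
# Minimal sketch of multiple imputation from dependent (consecutive) MCMC
# draws in a toy normal model with values missing at random.
import numpy as np

rng = np.random.default_rng(2)

y = rng.normal(10.0, 2.0, size=200)
miss = rng.random(200) < 0.2             # roughly 20% missing at random
y_obs = y[~miss]
n_obs, n_mis = y_obs.size, int(miss.sum())

def gibbs_draws(n_draws):
    """Consecutive posterior draws of (mu, sigma2) via a two-step Gibbs sampler."""
    mu, sigma2 = y_obs.mean(), y_obs.var()
    out = []
    for _ in range(n_draws):
        mu = rng.normal(y_obs.mean(), np.sqrt(sigma2 / n_obs))
        sigma2 = np.sum((y_obs - mu) ** 2) / rng.chisquare(n_obs - 1)
        out.append((mu, sigma2))
    return out

m = 50
# Dependent draws: keep m *consecutive* states after burn-in; no search for
# a thinning interval that makes the draws approximately independent.
chain = gibbs_draws(200 + m)[200:]

completed = []
for mu, sigma2 in chain:
    y_imp = y.copy()
    y_imp[miss] = rng.normal(mu, np.sqrt(sigma2), size=n_mis)
    completed.append(y_imp)

print(np.mean([d.mean() for d in completed]))  # pooled point estimate of the mean
```

With m this large, the autocorrelation between consecutive parameter draws averages out in the pooled estimate, which is the intuition behind the paper's argument that independence of the draws is unnecessary.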
Papers by Jingchen (Monika) Hu