We propose and apply a novel paradigm for characterization of genome data quality, which quantifies the effects of intentional degradation of quality. The rationale is that the higher the initial quality, the more fragile the genome and the greater the effects of degradation. We demonstrate that this phenomenon is ubiquitous, and that quantified measures of degradation can be used for multiple purposes. We focus on identifying outliers that may be problematic with respect to data quality, but might also be true anomalies or even attempts to subvert the database.
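As a toy illustration of the degrade-and-remeasure mechanics (the quality metric, degradation model, and rates below are invented for this sketch and are not those used in the paper):

```python
import random

BASES = "ACGT"

def degrade(seq, rate, rng):
    """Introduce random base substitutions at the given per-position rate."""
    return "".join(
        rng.choice([b for b in BASES if b != c]) if rng.random() < rate else c
        for c in seq
    )

def quality(seq, reference):
    """Toy quality metric: fraction of positions agreeing with a reference."""
    return sum(a == b for a, b in zip(seq, reference)) / len(reference)

rng = random.Random(0)
reference = "".join(rng.choice(BASES) for _ in range(10_000))
genomes = {"high quality": degrade(reference, 0.001, rng),
           "low quality": degrade(reference, 0.05, rng)}

for label, genome in genomes.items():
    q_before = quality(genome, reference)
    q_after = quality(degrade(genome, 0.02, rng), reference)
    # The drop in the metric quantifies the effect of the intentional degradation.
    print(f"{label}: quality {q_before:.4f} -> {q_after:.4f}, drop {q_before - q_after:.4f}")
```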
We articulate and investigate issues associated with performing statistical disclosure limitation (SDL) for data subject to edit rules. The central problem is that many SDL methods generate data records that violate the constraints. We propose and study two approaches. In the first, existing SDL methods are applied, and any constraint-violating values they produce are replaced by means of a constraint-preserving imputation procedure. In the second, the SDL methods are modified to prevent them from generating violations. We present a simulation study, based on data from the Colombian Annual Manufacturing Survey, that evaluates several SDL methods from the existing literature. The results suggest that (i) in practice, some SDL methods cannot be implemented with the second approach, and (ii) differences in risk-utility profiles across SDL methods dwarf differences across the two approaches. Among the SDL strategies, microaggregation followed by adding noise and partially synthetic ...
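A minimal sketch of the first approach (apply an off-the-shelf SDL method, then repair violations). The edit rule here, that two expenditure components must be nonnegative and sum to a reported total, and the additive-noise SDL method are illustrative assumptions; the repair step below is a simple rescaling, not the paper's model-based constraint-preserving imputation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy establishment records: two expenditure components and their total.
n = 500
x1 = rng.gamma(5.0, 100.0, n)
x2 = rng.gamma(3.0, 150.0, n)
total = x1 + x2                      # edit rule: x1 + x2 == total, x1 >= 0, x2 >= 0

# Step 1: apply an existing SDL method (additive noise) to the components.
noisy = np.column_stack([x1, x2]) + rng.normal(0.0, 50.0, (n, 2))

# Step 2: repair records that violate the edit rules: clip to nonnegativity
# and rescale the components so they again sum to the reported total.
repaired = np.clip(noisy, 0.0, None)
row_sums = repaired.sum(axis=1)
ok = row_sums > 0
repaired[ok] *= (total[ok] / row_sums[ok])[:, None]

violations = np.abs(repaired.sum(axis=1) - total) > 1e-6
print("remaining violations:", violations.sum())
```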
When releasing microdata to the public, data disseminators typically alter the original data to protect the confidentiality of database subjects' identities and sensitive attributes. However, such alteration negatively impacts the utility (quality) of the released data. In this paper, we present quantitative measures of data utility for masked microdata, with the aim of improving disseminators' evaluations of competing masking strategies. The measures, which are global in that they reflect similarities between the entire distributions of the original and released data, utilize empirical distribution estimation, cluster analysis, and propensity scores. We evaluate the measures using both simulated and genuine data. The results suggest that measures based on propensity score methods are the most promising for general use.
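A minimal sketch of a propensity-score utility measure of the kind studied here: stack the original and masked files, model the probability that a record belongs to the masked file, and summarize how far the estimated propensities are from 0.5 (identical distributions would yield propensities near 0.5). The logistic model, the noise masking, and the squared-deviation summary are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Original microdata and a masked version (here: simple additive noise).
original = rng.normal(0.0, 1.0, (1000, 3))
masked = original + rng.normal(0.0, 0.5, original.shape)

# Stack the two files and label which file each record came from.
X = np.vstack([original, masked])
t = np.repeat([0, 1], len(original))

# Estimate propensity scores: probability of belonging to the masked file.
model = LogisticRegression(max_iter=1000).fit(X, t)
p = model.predict_proba(X)[:, 1]

# Utility summary: mean squared deviation of propensities from 1/2.
# Values near 0 indicate the masked data are hard to distinguish from the original.
U_p = np.mean((p - 0.5) ** 2)
print(f"propensity-score utility measure: {U_p:.5f}")
```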
Data swapping is a statistical disclosure limitation method used to protect the confidentiality of data by interchanging variable values between records. We propose a risk-utility framework for selecting an optimal swapped data release when considering several swap variables and multiple swap rates. Risk and utility values associated with each such swapped data file are traded off along a frontier of undominated potential releases, which contains the optimal release(s). Current Population Survey data are used to illustrate the framework for categorical data swapping.
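A minimal sketch of the frontier computation: given candidate swapped releases with (risk, utility) values, keep only the undominated ones, i.e., those for which no other candidate has both lower risk and higher utility. The candidate labels and values are made up; the paper's risk and utility measures are not reproduced here.

```python
# Candidate releases: (label, disclosure risk, data utility); lower risk and
# higher utility are better.  Values are illustrative only.
candidates = [
    ("swap age @ 1%", 0.80, 0.95),
    ("swap age @ 5%", 0.55, 0.85),
    ("swap age+race @ 5%", 0.40, 0.70),
    ("swap age+race @ 10%", 0.35, 0.50),
    ("swap all @ 10%", 0.30, 0.65),
]

def undominated(cands):
    """Return candidates not dominated by another with lower risk AND higher utility."""
    frontier = []
    for name, r, u in cands:
        dominated = any(
            (r2 <= r and u2 >= u) and (r2 < r or u2 > u)
            for _, r2, u2 in cands
        )
        if not dominated:
            frontier.append((name, r, u))
    return frontier

for name, r, u in undominated(candidates):
    print(f"{name}: risk={r:.2f}, utility={u:.2f}")
```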
To protect confidentiality, statistical agencies typically alter data before releasing them to the public. Ideally, although rarely done, the agency releasing data also provides a way for secondary data analysts to assess the quality of inferences obtained with the released data. Quality measures can help secondary data analysts to disregard inaccurate conclusions resulting from the disclosure limitation procedures, as well as have confidence in accurate conclusions. We propose an interactive computer system that analysts can query for measures of data quality. We focus on potential disclosure risks of providing these quality measures.
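As one illustration of a quality measure such a system might return (a sketch only; the paper does not prescribe this particular measure), the overlap of confidence intervals for the same estimand computed from the original and the released data:

```python
import numpy as np

def ci(x, z=1.96):
    """Approximate 95% confidence interval for the mean of x."""
    half = z * x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - half, x.mean() + half

def ci_overlap(a, b):
    """Average fraction of each interval covered by their intersection."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return 0.5 * (inter / (a[1] - a[0]) + inter / (b[1] - b[0]))

rng = np.random.default_rng(3)
original = rng.normal(50.0, 10.0, 2000)
released = original + rng.normal(0.0, 5.0, 2000)   # stand-in for a masked release

# A value near 1 means inferences from the released data track the original data.
print(round(ci_overlap(ci(original), ci(released)), 3))
```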
Dissemination of information derived from large contingency tables formed from confidential data is a major responsibility of statistical agencies. In this paper we present solutions to several computational and algorithmic problems that arise in the dissemination of cross-tabulations (marginal sub-tables) from a single underlying table. These include data structures that exploit sparsity to support efficient computation of marginals and algorithms such as iterative proportional fitting, as well as a generalized form of the shuttle algorithm that computes sharp bounds on (small, confidentiality threatening) cells in the full table from arbitrary sets of released marginals. We give examples illustrating the techniques.
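A minimal sketch of iterative proportional fitting on a small dense two-way table; the paper's contributions are sparse data structures and a generalized shuttle algorithm for much larger tables, which this sketch does not attempt to reproduce.

```python
import numpy as np

def ipf_2d(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Scale a two-way table so its margins match the target row and column totals."""
    table = seed.astype(float).copy()
    for _ in range(max_iter):
        table *= (row_targets / table.sum(axis=1))[:, None]   # match row margins
        table *= (col_targets / table.sum(axis=0))[None, :]   # match column margins
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            break
    return table

seed = np.array([[40.0, 30.0, 20.0],
                 [35.0, 50.0, 25.0]])
row_targets = np.array([100.0, 100.0])
col_targets = np.array([80.0, 70.0, 50.0])

fitted = ipf_2d(seed, row_targets, col_targets)
print(fitted.round(2))
print(fitted.sum(axis=1), fitted.sum(axis=0))
```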
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002
We describe two classes of software systems that release tabular summaries of an underlying database. Table servers respond to user queries for (marginal) sub-tables of the "full" table summarizing the entire database, and are characterized by dynamic assessment of disclosure risk in light of previously answered queries. Optimal tabular releases are static releases of sets of sub-tables, characterized by maximizing the amount of information released, as given by a measure of data utility, subject to a constraint on disclosure risk. We discuss the underlying abstractions, which are associated primarily with the query space and with released and unreleasable sub-tables and frontiers; computational algorithms and issues, especially scalability; and prototype software implementations.
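A minimal sketch of the table-server idea: answer requests for marginal sub-tables only while a running risk measure stays within a budget. The risk function used here (fraction of small cells in the requested marginal) and the budget are placeholders, not the risk assessment developed in the paper.

```python
import numpy as np

class TableServer:
    """Toy dynamic table server over a three-way contingency table."""

    def __init__(self, full_table, risk_budget=0.3):
        self.full = full_table
        self.budget = risk_budget
        self.spent = 0.0
        self.released = []

    def _risk(self, marginal):
        # Placeholder risk: fraction of cells with counts 1 or 2.
        return np.mean((marginal > 0) & (marginal <= 2))

    def query(self, axes_to_keep):
        """Request the marginal over the given axes; returns None if refused."""
        drop = tuple(i for i in range(self.full.ndim) if i not in axes_to_keep)
        marginal = self.full.sum(axis=drop)
        risk = self._risk(marginal)
        if self.spent + risk > self.budget:
            return None                      # refuse: cumulative risk too high
        self.spent += risk
        self.released.append(axes_to_keep)
        return marginal

rng = np.random.default_rng(4)
server = TableServer(rng.poisson(3.0, size=(4, 3, 2)))
print(server.query((0, 1)))     # two-way marginal, likely answered
print(server.query((0, 1, 2)))  # full table, likely refused
```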
We construct a decision-theoretic formulation of data swapping in which quantitative measures of disclosure risk and data utility are employed to select one release from a possibly large set of candidates. The decision variables are the swap rate, swap ...
In this paper we study the impact of statistical disclosure limitation in the setting of parameter estimation for a finite population. Using a simulation experiment with microdata from the 2010 American Community Survey, we demonstrate a framework for applying risk-utility paradigms to microdata for a finite population, which incorporates a utility measure based on estimators with survey weights and risk measures based on record linkage techniques with composite variables. The simulation study shows that special caution is needed for variance estimation in finite populations when using released data that have been masked by statistical disclosure limitation. We also compare various disclosure limitation methods, including a modified version of microaggregation that accommodates survey weights. The results confirm previous findings that a two-stage procedure, microaggregation followed by adding noise, is effective in terms of data utility and disclosure risk.
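A minimal sketch of the two-stage procedure on a single variable: sort records, form groups of size k, replace values with the survey-weighted group mean, then add noise. The group size, noise scale, and use of a weighted mean are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def microaggregate_weighted(values, weights, k=5):
    """Replace each value with the survey-weighted mean of its size-k group."""
    order = np.argsort(values)
    out = np.empty_like(values, dtype=float)
    for start in range(0, len(values), k):
        idx = order[start:start + k]
        out[idx] = np.average(values[idx], weights=weights[idx])
    return out

rng = np.random.default_rng(5)
income = rng.lognormal(10.5, 0.8, 1000)       # toy survey variable
weights = rng.uniform(50.0, 500.0, 1000)      # toy survey weights

# Stage 1: microaggregation; Stage 2: additive noise proportional to the spread.
masked = microaggregate_weighted(income, weights, k=5)
masked += rng.normal(0.0, 0.05 * income.std(), size=masked.shape)

# Compare weighted population means before and after masking.
print(np.average(income, weights=weights), np.average(masked, weights=weights))
```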
Papers by Alan Karr