We propose and apply a novel paradigm for characterization of genome data quality, which quantifies the effects of intentional degradation of quality. The rationale is that the higher the initial quality, the more fragile the genome and the greater the effects of degradation. We demonstrate that this phenomenon is ubiquitous, and that quantified measures of degradation can be used for multiple purposes. We focus on identifying outliers that may be problematic with respect to data quality, but might also be true anomalies or even attempts to subvert the database.
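As a toy illustration of the degrade-and-remeasure mechanics (the quality metric, degradation model, and rates below are invented for this sketch and are not those used in the paper):

```python
import random

BASES = "ACGT"

def degrade(seq, rate, rng):
    """Introduce random base substitutions at the given per-position rate."""
    return "".join(
        rng.choice([b for b in BASES if b != c]) if rng.random() < rate else c
        for c in seq
    )

def quality(seq, reference):
    """Toy quality metric: fraction of positions agreeing with a reference."""
    return sum(a == b for a, b in zip(seq, reference)) / len(reference)

rng = random.Random(0)
reference = "".join(rng.choice(BASES) for _ in range(10_000))
genomes = {"high quality": degrade(reference, 0.001, rng),
           "low quality": degrade(reference, 0.05, rng)}

for label, genome in genomes.items():
    q_before = quality(genome, reference)
    q_after = quality(degrade(genome, 0.02, rng), reference)
    # The drop in the metric quantifies the effect of the intentional degradation.
    print(f"{label}: quality {q_before:.4f} -> {q_after:.4f}, drop {q_before - q_after:.4f}")
```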
We articulate and investigate issues associated with performing statistical disclosure limitation (SDL) for data subject to edit rules. The central problem is that many SDL methods generate data records that violate the constraints. We propose and study two approaches. In the first, existing SDL methods are applied, and any constraint-violating values they produce are replaced by means of a constraint-preserving imputation procedure. In the second, the SDL methods are modified to prevent them from generating violations. We present a simulation study, based on data from the Colombian Annual Manufacturing Survey, that evaluates several SDL methods from the existing literature. The results suggest that (i) in practice, some SDL methods cannot be implemented with the second approach, and (ii) differences in risk-utility profiles across SDL methods dwarf differences across the two approaches. Among the SDL strategies, microaggregation followed by adding noise and partially synthetic ...
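A minimal sketch of the first approach (apply an off-the-shelf SDL method, then repair violations). The edit rule here, that two expenditure components must be nonnegative and sum to a reported total, and the additive-noise SDL method are illustrative assumptions; the repair step below is a simple rescaling, not the paper's model-based constraint-preserving imputation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy establishment records: two expenditure components and their total.
n = 500
x1 = rng.gamma(5.0, 100.0, n)
x2 = rng.gamma(3.0, 150.0, n)
total = x1 + x2                      # edit rule: x1 + x2 == total, x1 >= 0, x2 >= 0

# Step 1: apply an existing SDL method (additive noise) to the components.
noisy = np.column_stack([x1, x2]) + rng.normal(0.0, 50.0, (n, 2))

# Step 2: repair records that violate the edit rules: clip to nonnegativity
# and rescale the components so they again sum to the reported total.
repaired = np.clip(noisy, 0.0, None)
row_sums = repaired.sum(axis=1)
ok = row_sums > 0
repaired[ok] *= (total[ok] / row_sums[ok])[:, None]

violations = np.abs(repaired.sum(axis=1) - total) > 1e-6
print("remaining violations:", violations.sum())
```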
When releasing microdata to the public, data disseminators typically alter the original data to protect the confidentiality of database subjects' identities and sensitive attributes. However, such alteration negatively impacts the utility (quality) of the released data. In this paper, we present quantitative measures of data utility for masked microdata, with the aim of improving disseminators' evaluations of competing masking strategies. The measures, which are global in that they reflect similarities between the entire distributions of the original and released data, utilize empirical distribution estimation, cluster analysis, and propensity scores. We evaluate the measures using both simulated and genuine data. The results suggest that measures based on propensity score methods are the most promising for general use.
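A minimal sketch of a propensity-score utility measure of the kind studied here: stack the original and masked files, model the probability that a record belongs to the masked file, and summarize how far the estimated propensities are from 0.5 (identical distributions would yield propensities near 0.5). The logistic model, the noise masking, and the squared-deviation summary are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Original microdata and a masked version (here: simple additive noise).
original = rng.normal(0.0, 1.0, (1000, 3))
masked = original + rng.normal(0.0, 0.5, original.shape)

# Stack the two files and label which file each record came from.
X = np.vstack([original, masked])
t = np.repeat([0, 1], len(original))

# Estimate propensity scores: probability of belonging to the masked file.
model = LogisticRegression(max_iter=1000).fit(X, t)
p = model.predict_proba(X)[:, 1]

# Utility summary: mean squared deviation of propensities from 1/2.
# Values near 0 indicate the masked data are hard to distinguish from the original.
U_p = np.mean((p - 0.5) ** 2)
print(f"propensity-score utility measure: {U_p:.5f}")
```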
Data swapping is a statistical disclosure limitation method used to protect the confidentiality of data by interchanging variable values between records. We propose a risk-utility framework for selecting an optimal swapped data release when considering several swap variables and multiple swap rates. Risk and utility values associated with each such swapped data file are traded off along a frontier of undominated potential releases, which contains the optimal release(s). Current Population Survey data are used to illustrate the framework for categorical data swapping.
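A minimal sketch of the frontier computation: given candidate swapped releases with (risk, utility) values, keep only the undominated ones, i.e., those for which no other candidate has both lower risk and higher utility. The candidate labels and values are made up; the paper's risk and utility measures are not reproduced here.

```python
# Candidate releases: (label, disclosure risk, data utility); lower risk and
# higher utility are better.  Values are illustrative only.
candidates = [
    ("swap age @ 1%", 0.80, 0.95),
    ("swap age @ 5%", 0.55, 0.85),
    ("swap age+race @ 5%", 0.40, 0.70),
    ("swap age+race @ 10%", 0.35, 0.50),
    ("swap all @ 10%", 0.30, 0.65),
]

def undominated(cands):
    """Return candidates not dominated by another with lower risk AND higher utility."""
    frontier = []
    for name, r, u in cands:
        dominated = any(
            (r2 <= r and u2 >= u) and (r2 < r or u2 > u)
            for _, r2, u2 in cands
        )
        if not dominated:
            frontier.append((name, r, u))
    return frontier

for name, r, u in undominated(candidates):
    print(f"{name}: risk={r:.2f}, utility={u:.2f}")
```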
To protect confidentiality, statistical agencies typically alter data before releasing them to the public. Ideally, although rarely done, the agency releasing data also provides a way for secondary data analysts to assess the quality of inferences obtained with the released data. Quality measures can help secondary data analysts to disregard inaccurate conclusions resulting from the disclosure limitation procedures, as well as have confidence in accurate conclusions. We propose an interactive computer system that analysts can query for measures of data quality. We focus on potential disclosure risks of providing these quality measures.
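As one illustration of a quality measure such a system might return (a sketch only; the paper does not prescribe this particular measure), the overlap of confidence intervals for the same estimand computed from the original and the released data:

```python
import numpy as np

def ci(x, z=1.96):
    """Approximate 95% confidence interval for the mean of x."""
    half = z * x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - half, x.mean() + half

def ci_overlap(a, b):
    """Average fraction of each interval covered by their intersection."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return 0.5 * (inter / (a[1] - a[0]) + inter / (b[1] - b[0]))

rng = np.random.default_rng(3)
original = rng.normal(50.0, 10.0, 2000)
released = original + rng.normal(0.0, 5.0, 2000)   # stand-in for a masked release

# A value near 1 means inferences from the released data track the original data.
print(round(ci_overlap(ci(original), ci(released)), 3))
```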
Dissemination of information derived from large contingency tables formed from confidential data is a major responsibility of statistical agencies. In this paper we present solutions to several computational and algorithmic problems that arise in the dissemination of cross-tabulations (marginal sub-tables) from a single underlying table. These include data structures that exploit sparsity to support efficient computation of marginals and algorithms such as iterative proportional fitting, as well as a generalized form of the shuttle algorithm that computes sharp bounds on (small, confidentiality threatening) cells in the full table from arbitrary sets of released marginals. We give examples illustrating the techniques.
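A minimal sketch of iterative proportional fitting on a small dense two-way table; the paper's contributions are sparse data structures and a generalized shuttle algorithm for much larger tables, which this sketch does not attempt to reproduce.

```python
import numpy as np

def ipf_2d(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Scale a two-way table so its margins match the target row and column totals."""
    table = seed.astype(float).copy()
    for _ in range(max_iter):
        table *= (row_targets / table.sum(axis=1))[:, None]   # match row margins
        table *= (col_targets / table.sum(axis=0))[None, :]   # match column margins
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            break
    return table

seed = np.array([[40.0, 30.0, 20.0],
                 [35.0, 50.0, 25.0]])
row_targets = np.array([100.0, 100.0])
col_targets = np.array([80.0, 70.0, 50.0])

fitted = ipf_2d(seed, row_targets, col_targets)
print(fitted.round(2))
print(fitted.sum(axis=1), fitted.sum(axis=0))
```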
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002
We describe two classes of software systems that release tabular summaries of an underlying database. Table servers respond to user queries for (marginal) sub-tables of the "full" table summarizing the entire database, and are characterized by dynamic assessment of disclosure risk in light of previously answered queries. Optimal tabular releases are static releases of sets of sub-tables, characterized by maximizing the amount of information released, as given by a measure of data utility, subject to a constraint on disclosure risk. We discuss the underlying abstractions, which are associated primarily with the query space and with released and unreleasable sub-tables and frontiers; computational algorithms and issues, especially scalability; and prototype software implementations.
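A minimal sketch of the table-server idea: answer requests for marginal sub-tables only while a running risk measure stays within a budget. The risk function used here (fraction of small cells in the requested marginal) and the budget are placeholders, not the risk assessment developed in the paper.

```python
import numpy as np

class TableServer:
    """Toy dynamic table server over a three-way contingency table."""

    def __init__(self, full_table, risk_budget=0.3):
        self.full = full_table
        self.budget = risk_budget
        self.spent = 0.0
        self.released = []

    def _risk(self, marginal):
        # Placeholder risk: fraction of cells with counts 1 or 2.
        return np.mean((marginal > 0) & (marginal <= 2))

    def query(self, axes_to_keep):
        """Request the marginal over the given axes; returns None if refused."""
        drop = tuple(i for i in range(self.full.ndim) if i not in axes_to_keep)
        marginal = self.full.sum(axis=drop)
        risk = self._risk(marginal)
        if self.spent + risk > self.budget:
            return None                      # refuse: cumulative risk too high
        self.spent += risk
        self.released.append(axes_to_keep)
        return marginal

rng = np.random.default_rng(4)
server = TableServer(rng.poisson(3.0, size=(4, 3, 2)))
print(server.query((0, 1)))     # two-way marginal, likely answered
print(server.query((0, 1, 2)))  # full table, likely refused
```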
We construct a decision-theoretic formulation of data swapping in which quantitative measures of disclosure risk and data utility are employed to select one release from a possibly large set of candidates. The decision variables are the swap rate, swap ...
In this paper we study the impact of statistical disclosure limitation in the setting of parameter estimation for a finite population. Using a simulation experiment with microdata from the 2010 American Community Survey, we demonstrate a framework for applying risk-utility paradigms to microdata for a finite population, which incorporates a utility measure based on estimators with survey weights and risk measures based on record linkage techniques with composite variables. The simulation study shows that special caution is needed for variance estimation in finite populations when using released data that have been masked by statistical disclosure limitation. We also compare various disclosure limitation methods, including a modified version of microaggregation that accommodates survey weights. The results confirm previous findings that a two-stage procedure, microaggregation followed by adding noise, is effective in terms of data utility and disclosure risk.
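A minimal sketch of the two-stage procedure on a single variable: sort records, form groups of size k, replace values with the survey-weighted group mean, then add noise. The group size, noise scale, and use of a weighted mean are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def microaggregate_weighted(values, weights, k=5):
    """Replace each value with the survey-weighted mean of its size-k group."""
    order = np.argsort(values)
    out = np.empty_like(values, dtype=float)
    for start in range(0, len(values), k):
        idx = order[start:start + k]
        out[idx] = np.average(values[idx], weights=weights[idx])
    return out

rng = np.random.default_rng(5)
income = rng.lognormal(10.5, 0.8, 1000)       # toy survey variable
weights = rng.uniform(50.0, 500.0, 1000)      # toy survey weights

# Stage 1: microaggregation; Stage 2: additive noise proportional to the spread.
masked = microaggregate_weighted(income, weights, k=5)
masked += rng.normal(0.0, 0.05 * income.std(), size=masked.shape)

# Compare weighted population means before and after masking.
print(np.average(income, weights=weights), np.average(masked, weights=weights))
```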
Papers by Alan Karr