research-article

An algorithm for the microaggregation problem using column generation

Published: 01 August 2022

Abstract

The field of statistical disclosure control aims to reduce the risk of re-identifying an individual from disseminated data, a major concern among national statistical agencies. Operations Research (OR) techniques have been widely used in the past for protecting tabular data, but not microdata (i.e., files of individuals and attributes). Few papers apply OR techniques to the microaggregation problem, which is considered one of the best methods for microdata protection and is known to be NP-hard.
This work presents a new heuristic approach based on a column generation scheme. Unlike previous (primal) heuristics for microaggregation, it also provides a lower bound on the optimal microaggregation value. Using real data sets commonly used in the literature, our computational results show, first, that solutions with small gaps are often achieved and, second, that dramatic improvements are obtained relative to the literature’s most popular heuristics.
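
As a rough illustration of the quantities the abstract refers to, the sketch below assumes the usual microaggregation setting: records are partitioned into groups of at least k members, each record is replaced by its group centroid, and the information loss is the within-group sum of squared errors (SSE). The function names `sse` and `relative_gap`, and the gap definition (upper - lower) / upper, are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence

Record = Sequence[float]


def centroid(members: List[Record]) -> List[float]:
    """Component-wise mean of a non-empty list of records."""
    n = len(members)
    return [sum(column) / n for column in zip(*members)]


def sse(records: List[Record], groups: List[List[int]], k: int) -> float:
    """Within-group sum of squared errors of a microaggregation.

    `groups` holds lists of record indices. It must partition all
    indices, and every group must contain at least k records.
    """
    covered = sorted(i for group in groups for i in group)
    if covered != list(range(len(records))):
        raise ValueError("groups must partition the record indices")
    total = 0.0
    for group in groups:
        if len(group) < k:
            raise ValueError("every group needs at least k records")
        members = [records[i] for i in group]
        center =entry = centroid(members)
        total += sum(
            (value - mean) ** 2
            for record in members
            for value, mean in zip(record, center)
        )
    return total


def relative_gap(upper: float, lower: float) -> float:
    """Assumed gap measure between a heuristic value and a lower bound."""
    if upper == 0:
        return 0.0
    return (upper - lower) / upper


if __name__ == "__main__":
    data = [(1.0, 1.0), (2.0, 2.0), (10.0, 10.0), (11.0, 11.0)]
    value = sse(data, [[0, 1], [2, 3]], k=2)
    # The lower bound 1.5 is a made-up number, used only to show the call.
    print(value, relative_gap(value, 1.5))
```

A heuristic such as MDAV would supply the partition passed as `groups`; a lower bound, when available, indicates how far that partition can be from optimal.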

Highlights

Microaggregation is a Statistical Disclosure Control technique to protect microdata.
An algorithm for computing a feasible solution for Microaggregation is proposed.
The proposed algorithm improves on the solution quality with respect to algorithms MDAV and V-MDAV.
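
For orientation, a generic set-partitioning model of the kind column generation is typically applied to can be written as follows. This is a sketch under assumptions, not necessarily the formulation used in the article: \mathcal{C} denotes the set of all feasible groups (each with at least k records), w_c is the within-group SSE of group c, and x_c indicates whether group c is selected.

```latex
\begin{align*}
\min_{x} \quad & \sum_{c \in \mathcal{C}} w_c \, x_c \\
\text{s.t.} \quad & \sum_{c \in \mathcal{C} \,:\, i \in c} x_c = 1, \qquad i = 1, \dots, n, \\
& x_c \in \{0, 1\}, \qquad c \in \mathcal{C}.
\end{align*}
```

Relaxing x_c \in \{0, 1\} to 0 \le x_c \le 1 gives a linear program whose optimal value is a lower bound on the optimal microaggregation value. Because \mathcal{C} is exponentially large, column generation solves this relaxation over a restricted subset of groups and adds new groups with negative reduced cost until none remain.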

Cited By

  • (2022) An Optimization-Based Decomposition Heuristic for the Microaggregation Problem, Privacy in Statistical Databases, 10.1007/978-3-031-13945-1_1, pp. 3–14. Online publication date: 21-Sep-2022

Published In

Computers and Operations Research, Volume 144, Issue C
Aug 2022
498 pages

Publisher

Elsevier Science Ltd.

United Kingdom

Author Tags

  1. Integer programming
  2. Column generation
  3. Data privacy
  4. Clustering
  5. Microaggregation

Qualifiers

  • Research-article
