research-article

Tailoring data source distributions for fairness-aware data integration

Published: 01 July 2021

Abstract

Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups, that is, that it meets desired distribution requirements. Whether data is collected through some experiment or obtained from some data provider, the data from any single source may not meet the desired distribution requirements. Therefore, a union of data from multiple sources is often required. In this paper, we study how to acquire such data in the most cost-effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of data sources are known and all data sources have equal costs. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation-based strategy with a reward function that captures the cost and approximations of group distributions in each data source. Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms.
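To make the unknown-distribution setting more concrete, the following Python sketch shows one way an exploration-exploitation loop over data sources could look. It is an illustration under stated assumptions, not the algorithm from the paper: the UCB1-style exploration bonus, the reward (estimated probability that a query returns a still-needed group, divided by the source's cost), the function name `collect_with_ucb`, and the simulated sources are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter


def collect_with_ucb(sources, costs, targets, rng, max_queries=100_000):
    """Query sources one sample at a time until all group counts in `targets` are met.

    sources: list of callables; each takes `rng` and returns a group label
    costs:   per-query cost of each source
    targets: dict mapping group label -> required number of samples
    (All names and the reward definition are assumptions for illustration.)
    """
    n = len(sources)
    pulls = [0] * n                       # queries issued to each source
    seen = [Counter() for _ in range(n)]  # group labels observed per source
    kept = Counter()                      # samples kept toward the targets
    total_cost = 0.0

    def done():
        return all(kept[g] >= k for g, k in targets.items())

    def query(i):
        nonlocal total_cost
        group = sources[i](rng)
        pulls[i] += 1
        seen[i][group] += 1
        total_cost += costs[i]
        if kept[group] < targets.get(group, 0):
            kept[group] += 1  # useful sample; anything else is discarded

    # Initial exploration: query every source once.
    for i in range(n):
        if done():
            break
        query(i)

    t = n
    while not done() and t < max_queries:
        needed = [g for g, k in targets.items() if kept[g] < k]

        def score(i):
            # Empirical chance of drawing a still-needed group from source i,
            # plus a UCB1-style bonus, normalized by the source's cost.
            useful = sum(seen[i][g] for g in needed) / pulls[i]
            bonus = math.sqrt(2.0 * math.log(t) / pulls[i])
            return (useful + bonus) / costs[i]

        query(max(range(n), key=score))
        t += 1

    return dict(kept), total_cost, pulls


def make_source(p_minority):
    """Simulated source that returns 'minority' with probability p_minority."""
    return lambda rng: "minority" if rng.random() < p_minority else "majority"


if __name__ == "__main__":
    rng = random.Random(0)
    sources = [make_source(0.1), make_source(0.5)]
    kept, cost, pulls = collect_with_ucb(
        sources, [1.0, 1.0], {"minority": 20, "majority": 20}, rng
    )
    print(kept, cost, pulls)
```

In this sketch a sample whose group has already reached its target is paid for but discarded, which is why the choice of source matters for total cost. The estimates in `seen` play the role of the approximated group distributions mentioned in the abstract.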

Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 11
July 2021
732 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
