research-article

Tailoring data source distributions for fairness-aware data integration

Authors:

Fatemeh Nargesian,

Abolfazl Asudeh,

H. V. Jagadish

Proceedings of the VLDB Endowment, Volume 14, Issue 11

Pages 2519 - 2532

https://doi.org/10.14778/3476249.3476299

Published: 01 July 2021

Abstract

Data scientists often develop data sets for analysis by drawing upon the sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups, that is, that it meets the desired distribution requirements. Whether data is collected through an experiment or obtained from a data provider, the data from any single source may not meet these requirements. Therefore, a union of data from multiple sources is often required. In this paper, we study how to acquire such data in the most cost-effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of the data sources are known and all data sources have equal costs. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation-based strategy with a reward function that captures the cost and the approximations of group distributions in each data source. Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms.
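For intuition only, the acquisition setting described above can be simulated in a few lines. This is a minimal sketch under assumptions that are not taken from the paper: each query to a source returns one tuple whose group is drawn from that source's known distribution, each query has a fixed cost, and tuples from groups that are already satisfied are discarded. The names `tailor`, `sources`, `costs`, and `required` are hypothetical, and the greedy selection rule is a simple heuristic chosen for illustration, not the authors' optimal or approximation algorithm.

```python
import random

def tailor(sources, costs, required, rng, max_draws=100_000):
    """Greedy sketch (illustrative heuristic, not the paper's algorithm).

    sources:  list of dicts mapping group -> probability (assumed known)
    costs:    list of per-query costs, one per source
    required: dict mapping group -> number of tuples wanted
    """
    remaining = dict(required)            # tuples still needed per group
    collected = {g: 0 for g in required}  # tuples kept so far per group
    total_cost = 0.0
    for _ in range(max_draws):
        needed = [g for g, n in remaining.items() if n > 0]
        if not needed:
            break
        # Score each source by P(draw lands in a still-needed group) / cost.
        best = max(
            range(len(sources)),
            key=lambda i: sum(sources[i].get(g, 0.0) for g in needed) / costs[i],
        )
        groups = list(sources[best])
        weights = [sources[best][g] for g in groups]
        drawn = rng.choices(groups, weights=weights, k=1)[0]
        total_cost += costs[best]         # every query is paid for
        if remaining.get(drawn, 0) > 0:   # keep only tuples that fill a gap
            remaining[drawn] -= 1
            collected[drawn] += 1
    return collected, total_cost

# Hypothetical example: two equal-cost sources, two groups, 50 tuples each.
sources = [{"A": 0.9, "B": 0.1}, {"A": 0.4, "B": 0.6}]
costs = [1.0, 1.0]
collected, cost = tailor(sources, costs, {"A": 50, "B": 50}, random.Random(0))
print(collected, cost)
```

The total cost is at least the number of required tuples, and it grows with the number of discarded draws, which is the quantity a source-selection strategy tries to keep small.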

Index Terms

Tailoring data source distributions for fairness-aware data integration
1. Information systems
  1. Information systems applications
2. Theory of computation
  1. Theory and algorithms for application domains

Index terms have been assigned to the content through auto-classification.

Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 11

July 2021

732 pages

ISSN: 2150-8097

Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)

Publisher

VLDB Endowment

Cited By

M. Erfanian, H. Jagadish, and A. Asudeh. 2024. Chameleon: Foundation Models for Fairness-Aware Multi-Modal Data Augmentation to Enhance Coverage of Minorities. Proceedings of the VLDB Endowment 17, 11 (2024), 3470--3483. Online publication date: 30-Aug-2024.
https://doi.org/10.14778/3681954.3682014
L. Behme, S. Galhotra, K. Beedkar, and V. Markl. 2024. Fainder: A Fast and Accurate Index for Distribution-Aware Dataset Search. Proceedings of the VLDB Endowment 17, 11 (2024), 3269--3282. Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.14778/3681954.3681999
N. Shahbazi, S. Sintos, and A. Asudeh. 2024. FairHash: A Fair and Memory/Time-efficient Hashmap. Proceedings of the ACM on Management of Data 2, 3 (2024), 1--29. Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3654939
J. Chang, B. Cui, F. Nargesian, A. Asudeh, and H. Jagadish. 2024. Data distribution tailoring revisited: cost-efficient integration of representative data. The VLDB Journal (The International Journal on Very Large Data Bases) 33, 5 (2024), 1283--1306. Online publication date: 1-Sep-2024.
https://dl.acm.org/doi/10.1007/s00778-024-00849-w
N. Shahbazi, N. Danevski, F. Nargesian, A. Asudeh, and D. Srivastava. 2023. Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching. Proceedings of the VLDB Endowment 16, 11 (2023), 3279--3292. Online publication date: 24-Aug-2023.
https://dl.acm.org/doi/10.14778/3611479.3611525
Z. Huang, P. Damalapati, and E. Wu. 2023. Aggregation Consistency Errors in Semantic Layers and How to Avoid Them. Proceedings of the Workshop on Human-In-the-Loop Data Analytics, 1--7. Online publication date: 18-Jun-2023.
https://dl.acm.org/doi/10.1145/3597465.3605224
N. Shahbazi, Y. Lin, A. Asudeh, and H. Jagadish. 2023. Representation Bias in Data: A Survey on Identification and Resolution Techniques. ACM Computing Surveys 55, 13s (2023), 1--39. Online publication date: 13-Jul-2023.
https://dl.acm.org/doi/10.1145/3588433
F. Nargesian, A. Asudeh, H. Jagadish, T. Chua, H. Lauw, L. Si, E. Terzi, and P. Tsaparas. 2023. Next-generation Challenges of Responsible Data Integration. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 1256--1259. Online publication date: 27-Feb-2023.
https://dl.acm.org/doi/10.1145/3539597.3572727
H. Jagadish, J. Stoyanovich, and B. Howe. 2023. The Many Facets of Data Equity. Journal of Data and Information Quality 14, 4 (2023), 1--21. Online publication date: 7-Feb-2023.
https://dl.acm.org/doi/10.1145/3533425
A. Asudeh and F. Nargesian. 2022. Towards distribution-aware query answering in data markets. Proceedings of the VLDB Endowment 15, 11 (2022), 3137--3144. Online publication date: 29-Sep-2022.
https://dl.acm.org/doi/10.14778/3551793.3551858
