research-article

Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching

Authors:

Nima Shahbazi,

Nikola Danevski,

Fatemeh Nargesian,

Abolfazl Asudeh,

Divesh Srivastava

Proceedings of the VLDB Endowment, Volume 16, Issue 11

Pages 3279 - 3292

https://doi.org/10.14778/3611479.3611525

Published: 01 July 2023

Abstract

Entity matching (EM) is a challenging problem studied by different communities for over half a century. Algorithmic fairness has also become a timely topic to address machine bias and its societal impacts. Despite extensive research on these two topics, little attention has been paid to the fairness of entity matching.

To address this gap, we perform an extensive experimental evaluation of a variety of EM techniques. We generated two social datasets from publicly available datasets to audit EM through the lens of fairness. Our findings underscore potential unfairness under two conditions common in real-world societies: (i) when some demographic groups are over-represented, and (ii) when names are more similar in some groups than in others. Among our many findings, we note that although various fairness definitions are valuable in different settings, because EM is inherently class-imbalanced, measures such as positive predictive value parity and true positive rate parity are, in general, better able to reveal EM unfairness.
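
As a rough illustration of the last point (a minimal sketch, not the paper's evaluation code), the snippet below computes per-group true positive rate (TPR), positive predictive value (PPV), and accuracy for a matcher's pairwise decisions. The function names, the toy counts, and the assumption that each candidate pair has already been assigned to one demographic group are all hypothetical. Because non-matches dominate the pairs, the accuracy gap between groups stays small while the TPR and PPV gaps are large.

```python
from collections import defaultdict


def group_rates(records):
    """Per-group TPR, PPV, and accuracy.

    records: iterable of (group, y_true, y_pred), where 1 = match, 0 = non-match.
    Group assignment per pair is assumed to be given (hypothetical setup).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for group, y_true, y_pred in records:
        if y_true == 1:
            key = "tp" if y_pred == 1 else "fn"
        else:
            key = "fp" if y_pred == 1 else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        actual_pos = c["tp"] + c["fn"]
        predicted_pos = c["tp"] + c["fp"]
        total = sum(c.values())
        rates[group] = {
            # None marks a rate that is undefined for this group.
            "tpr": c["tp"] / actual_pos if actual_pos else None,
            "ppv": c["tp"] / predicted_pos if predicted_pos else None,
            "accuracy": (c["tp"] + c["tn"]) / total,
        }
    return rates


def disparity(rates, metric, group_a, group_b):
    """Absolute difference of one metric between two groups (0 means parity)."""
    return abs(rates[group_a][metric] - rates[group_b][metric])


# Toy, made-up data: 100 candidate pairs per group, only 2 true matches each.
records = (
    [("A", 1, 1)] * 2 + [("A", 0, 0)] * 98
    + [("B", 1, 1)] * 1 + [("B", 1, 0)] * 1
    + [("B", 0, 1)] * 1 + [("B", 0, 0)] * 97
)

rates = group_rates(records)
print(round(disparity(rates, "accuracy", "A", "B"), 2))  # 0.02
print(disparity(rates, "tpr", "A", "B"))                 # 0.5
print(disparity(rates, "ppv", "A", "B"))                 # 0.5
```

In this toy setting the matcher finds only half of group B's true matches and half of its predicted matches for B are wrong, yet accuracy differs by just two percentage points, since accuracy is dominated by the many true non-matches.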

Published In

Proceedings of the VLDB Endowment Volume 16, Issue 11

July 2023

789 pages

ISSN:2150-8097

Editors: Georgia Koutrika (Athena Research Center), Jun Yang (Duke University)

Publisher

VLDB Endowment

Badges

Artifacts Available / v1.1

Qualifiers

Research-article

Cited By

N. Shahbazi, M. Erfanian, A. Asudeh, F. Nargesian, and D. Srivastava. 2024. FairEM360: A Suite for Responsible Entity Matching. Proceedings of the VLDB Endowment 17, 12 (2024), 4417--4420. Online publication date: 1-Aug-2024. https://dl.acm.org/doi/10.14778/3685800.3685889

M. Erfanian, H. Jagadish, and A. Asudeh. 2024. Chameleon: Foundation Models for Fairness-Aware Multi-Modal Data Augmentation to Enhance Coverage of Minorities. Proceedings of the VLDB Endowment 17, 11 (2024), 3470--3483. Online publication date: 30-Aug-2024. https://doi.org/10.14778/3681954.3682014

N. Shahbazi, J. Wang, Z. Miao, and N. Bhutani. 2024. Fairness-Aware Data Preparation for Entity Matching. In 2024 IEEE 40th International Conference on Data Engineering (ICDE). 3476--3489. Online publication date: 13-May-2024. https://doi.org/10.1109/ICDE60146.2024.00268

S. Ghassabi, B. Behkamal, and M. Milani. 2023. Leveraging Knowledge Graphs for Matching Heterogeneous Entities and Explanation. In 2023 IEEE International Conference on Big Data (BigData). 2910--2919. Online publication date: 15-Dec-2023. https://doi.org/10.1109/BigData59044.2023.10386157
