
PoWareMatch: A Quality-aware Deep Learning Approach to Improve Human Schema Matching

Published: 23 May 2022

Abstract

Schema matching is a core task of any data integration process. Although it has been investigated for many years in the fields of databases, AI, the Semantic Web, and data mining, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure) with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept, describes the common assumption that human decisions represent reliable assessments of schemata correspondences; it is, however, not an inherent property of human matchers. We therefore design PoWareMatch, which uses a deep learning mechanism to calibrate and filter human matching decisions according to the quality of a match; the filtered decisions are then combined with algorithmic matching to generate better match results. We provide empirical evidence, based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch predicts well the benefit of extending the match with an additional correspondence and generates high-quality matches. In addition, PoWareMatch outperforms state-of-the-art matching algorithms.
Appendix

A Monotonic Evaluation and Theorem Proofs

The appendix is devoted to the proofs of Theorems 1 and 2.
Theorem 1.
Recall (R) is a MIEM over \(\Sigma ^{\subseteq }\), Precision (P) is a MIEM over \(\Sigma ^P\), and f-measure (F) is a MIEM over \(\Sigma ^F\).
To prove Theorem 1, we use two lemmas. The first states that recall is a MIEM over all match pairs in \(\Sigma ^{\subseteq }\). Precision and f-measure are not monotonic over the full set of pairs in \(\Sigma ^{\subseteq }\), so the second lemma gives the conditions under which monotonicity can be guaranteed for each of these two measures. Due to space considerations, we refer the interested reader to a technical report [54] for the proofs of both lemmas.
Lemma 2.
Recall (R) is a MIEM over \(\Sigma ^{\subseteq }\).
Lemma 3.
For \((\sigma , \sigma ^{\prime })\in \Sigma ^{\subseteq }\):
\(P(\sigma) \le P(\sigma ^{\prime })\) iff \(P(\sigma)\le P(\Delta);\)
\(F(\sigma) \le F(\sigma ^{\prime })\) iff \(0.5\cdot F(\sigma)\le P(\Delta).\)
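The full proofs are in the technical report [54]. As a brief sketch of why conditions of this shape arise, assume the usual set-based definitions, which are not restated in this appendix: \(P(\sigma)=|\sigma \cap \sigma ^*|/|\sigma |\) and \(F(\sigma)=2|\sigma \cap \sigma ^*|/(|\sigma |+|\sigma ^*|)\) for a reference match \(\sigma ^*\), with \(\sigma\) non-empty and \(\Delta =\sigma ^{\prime }\setminus \sigma\) non-empty. The shorthand \(a=|\sigma \cap \sigma ^*|\), \(n=|\sigma |\), \(d=|\Delta \cap \sigma ^*|\), \(m=|\Delta |\), and \(r=|\sigma ^*|\) is introduced here for illustration only.
\[
P(\sigma ^{\prime })=\frac{a+d}{n+m}\ge \frac{a}{n}
\iff n(a+d)\ge a(n+m)
\iff nd\ge am
\iff \frac{d}{m}\ge \frac{a}{n},
\]
that is, \(P(\Delta)\ge P(\sigma)\). Under the same assumptions,
\[
F(\sigma ^{\prime })=\frac{2(a+d)}{n+m+r}\ge \frac{2a}{n+r}
\iff (a+d)(n+r)\ge a(n+m+r)
\iff d(n+r)\ge am
\iff \frac{d}{m}\ge \frac{a}{n+r},
\]
that is, \(P(\Delta)\ge 0.5\cdot F(\sigma)\).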
Proof of Theorem 1
The first part of the theorem follows directly from Lemma 2. For the remainder of the proof we rely on Lemma 3. Let \((\sigma , \sigma ^{\prime })\in \Sigma ^{P}\). By the definition of \(\Sigma ^{P}\), \(P(\sigma)\le P(\Delta)\), and by Lemma 3 we conclude that \(P(\sigma) \le P(\sigma ^{\prime })\). Similarly, let \((\sigma , \sigma ^{\prime })\in \Sigma ^{F}\). By the definition of \(\Sigma ^{F}\), \(0.5\cdot F(\sigma)\le P(\Delta)\), and by Lemma 3 we infer that \(F(\sigma) \le F(\sigma ^{\prime })\), which concludes the proof. □
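The statements used in this proof can be sanity-checked by brute force on a small example. The following is a minimal sketch, not the authors' code: it assumes the standard set-based definitions of recall, precision, and f-measure against a reference match, models correspondences as integers, and uses illustrative function names. It enumerates every pair of a non-empty match and a disjoint non-empty extension and checks that recall never decreases and that both equivalences of Lemma 3 hold.

```python
from fractions import Fraction
from itertools import combinations


def recall(match, reference):
    # Fraction of reference correspondences that the match contains.
    return Fraction(len(match & reference), len(reference))


def precision(match, reference):
    # Fraction of the match that is correct (match must be non-empty).
    return Fraction(len(match & reference), len(match))


def f_measure(match, reference):
    # Harmonic mean of precision and recall, written as 2a / (n + r).
    return Fraction(2 * len(match & reference), len(match) + len(reference))


def nonempty_subsets(items):
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def check_monotonicity(universe, reference):
    """Check Lemma 2 and Lemma 3 on every (sigma, sigma') pair; return the pair count."""
    checked = 0
    for sigma in nonempty_subsets(universe):
        for delta in nonempty_subsets(universe - sigma):
            sigma_prime = sigma | delta
            p_delta = precision(delta, reference)
            # Recall is monotonic for every extension.
            assert recall(sigma, reference) <= recall(sigma_prime, reference)
            # Precision improves (weakly) iff P(sigma) <= P(delta).
            p_improves = precision(sigma, reference) <= precision(sigma_prime, reference)
            assert p_improves == (precision(sigma, reference) <= p_delta)
            # F-measure improves (weakly) iff 0.5 * F(sigma) <= P(delta).
            f_improves = f_measure(sigma, reference) <= f_measure(sigma_prime, reference)
            assert f_improves == (Fraction(1, 2) * f_measure(sigma, reference) <= p_delta)
            checked += 1
    return checked


# Hypothetical toy instance: six candidate correspondences, three of them correct.
universe = frozenset(range(6))
reference = frozenset({0, 1, 2})
pairs_checked = check_monotonicity(universe, reference)
print(pairs_checked)
```

Exact rational arithmetic (`Fraction`) is used so that the boundary cases of the two equivalences, where both sides are equal, are not disturbed by floating-point rounding.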
Theorem 2.
Let \(R/P/F\) be a random variable, whose values are taken from the domain of \([0,1]\), and \(\Delta\) be a singleton correspondence set (\(|\Delta |=1\)). \(\Delta\) is a probabilistic local annealer with respect to \(R/P/F\) over \(\Sigma ^{\subseteq _1}/\Sigma ^{E(P)}/\Sigma ^{E(F)}\).
Proof of Theorem 2
Let \(\Delta\) be a singleton correspondence set (\(|\Delta |=1\)). Let R be a random variable and let \(\Delta =\Delta _{(\sigma ,\sigma ^{\prime })}\) such that \((\sigma ,\sigma ^{\prime })\in \Sigma ^{\subseteq _1}\).
According to Corollary 1, \(\Delta\) is a local annealer with respect to R over \(\Sigma ^{\subseteq _1}\) and therefore \(R(\sigma)\le R(\sigma ^{\prime })\), regardless of \(Pr\lbrace \Delta \in \sigma ^*\rbrace\). Therefore, for any \(p=Pr\lbrace \Delta \in \sigma ^*\rbrace , p\cdot R(\sigma)\le p\cdot R(\sigma ^{\prime })\) and by definition of expectation, \(E(R(\sigma))\le E(R(\sigma ^{\prime }))\).
Let P be a random variable and let \(\Delta =\Delta _{(\sigma ,\sigma ^{\prime })}\) such that \((\sigma ,\sigma ^{\prime })\in \Sigma ^{E(P)}\). By definition of \(\Sigma ^{E(P)}\), \(E(P(\sigma))\le Pr\lbrace \Delta \in \sigma ^*\rbrace\) and using Lemma 1 we obtain \(E(P(\sigma)) \le E(P(\sigma ^{\prime }))\).
Let F be a random variable and let \(\Delta =\Delta _{(\sigma ,\sigma ^{\prime })}\) such that \((\sigma ,\sigma ^{\prime })\in \Sigma ^{E(F)}\). By definition of \(\Sigma ^{E(F)}\), \(0.5\cdot E(F(\sigma))\le Pr\lbrace \Delta \in \sigma ^*\rbrace\) and using Lemma 1 we obtain \(E(F(\sigma)) \le E(F(\sigma ^{\prime }))\).
We can therefore conclude, by Definition 5, that \(\Delta\) is a probabilistic local annealer with respect to \(R/P/F\) over \(\Sigma ^{\subseteq _1}/\Sigma ^{E(P)}/\Sigma ^{E(F)}\). □
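The expected-value argument can be illustrated numerically. The sketch below is not taken from the paper: it assumes that the sizes of the match and of the reference match are fixed, so that by linearity of expectation only the expected number of correct correspondences matters, and that adding a singleton \(\Delta\) raises that expectation by \(p=Pr\lbrace \Delta \in \sigma ^*\rbrace\). All function and variable names are hypothetical. Under these assumptions it checks, over a small grid, that expected recall never decreases and that expected precision and f-measure do not decrease exactly when the expected-value counterparts of the Lemma 3 conditions hold.

```python
from fractions import Fraction


def expected_measures(expected_correct, size, reference_size):
    # E(R), E(P), E(F) for fixed sizes; expectation is taken over correctness only.
    e_recall = Fraction(expected_correct) / reference_size
    e_precision = Fraction(expected_correct) / size
    e_f = 2 * Fraction(expected_correct) / (size + reference_size)
    return e_recall, e_precision, e_f


def extend_by_singleton(expected_correct, size, p_correct):
    # Adding one correspondence that is correct with probability p_correct.
    return expected_correct + p_correct, size + 1


def check_singleton_extension(max_size=4, max_reference=4, steps=4):
    """Check the three expected-value statements on a grid; return the case count."""
    checked = 0
    for size in range(1, max_size + 1):
        for ref in range(1, max_reference + 1):
            for k in range(0, min(size, ref) * steps + 1):
                correct = Fraction(k, steps)  # expected number of correct correspondences
                for j in range(0, steps + 1):
                    p = Fraction(j, steps)  # Pr{Delta in reference match}
                    r0, p0, f0 = expected_measures(correct, size, ref)
                    new_correct, new_size = extend_by_singleton(correct, size, p)
                    r1, p1, f1 = expected_measures(new_correct, new_size, ref)
                    assert r0 <= r1
                    assert (p0 <= p1) == (p0 <= p)
                    assert (f0 <= f1) == (Fraction(1, 2) * f0 <= p)
                    checked += 1
    return checked


cases_checked = check_singleton_extension()
print(cases_checked)
```

The fixed-size assumption is what makes the expectation linear in the number of correct correspondences; if the match size itself were random, this shortcut would no longer apply.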

References

[1]
2021. Data. Retrieved on April 19, 2022 from https://github.com/shraga89/PoWareMatch/tree/master/DataFiles. (2021).
[2]
2021. Graphs. Retrieved on April 19, 2022 from https://github.com/shraga89/PoWareMatch/tree/master/Eval_graphs. (2021).
[3]
2021. OAEI benchmark. Retrieved on April 19, 2022 from http://oaei.ontologymatching.org/2011/benchmarks. (2021).
[4]
2021. Ontobuilder research environment. Retrieved on April 19, 2022 from https://github.com/shraga89/Ontobuilder-Research-Environment. (2021).
[5]
2021. PoWareMatch Configuration. Retrieved on April 19, 2022 from https://github.com/shraga89/PoWareMatch/blob/master/RunFiles/config.py. (2021).
[6]
2021. PoWareMatch repository. Retrieved on April 19, 2022 from https://github.com/shraga89/PoWareMatch. (2021).
[7]
2021. PyTorch. Retrieved on April 19, 2022 from https://pytorch.org/. (2021).
[8]
Rakefet Ackerman, Avigdor Gal, Tomer Sagi, and Roee Shraga. 2019. A cognitive model of human bias in matching. In Proceedings of the Pacific Rim International Conference on Artificial Intelligence. Springer, 632–646.
[9]
Rakefet Ackerman and Valerie Thompson. 2017. Meta-reasoning: Monitoring and control of thinking and reasoning. Trends in Cognitive Sciences 21, 8 (2017), 607–617.
[10]
Lawrence W. Barsalou. 2014. Cognitive Psychology: An Overview for Cognitive Scientists. Psychology Press.
[11]
Zohra Bellahsene, Angela Bonifati, Fabien Duchateau, and Yannis Velegrakis. 2011. On evaluating schema matching and mapping. In Proceedings of the Schema Matching and Mapping. Springer Berlin Heidelberg, 253–291.
[12]
Zohra Bellahsene, Angela Bonifati, and Erhard Rahm (Eds.). 2011. Schema Matching and Mapping. Springer.
[13]
Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. 2009. Pearson correlation coefficient. In Proceedings of the Noise Reduction in Speech Processing. Springer, 1–4.
[14]
Philip A. Bernstein, Jayant Madhavan, and Erhard Rahm. 2011. Generic schema matching, ten years later. PVLDB 4, 11 (2011), 695–701.
[15]
Robert A. Bjork, John Dunlosky, and Nate Kornell. 2013. Self-regulated learning: Beliefs, techniques, and illusions. Annual Review of Psychology 64, 1 (2013), 417–444.
[16]
Nikolaos Bozovic and Vasilis Vassalos. 2015. Two phase user driven schema matching. In Proceedings of the Advances in Databases and Information Systems. 49–62.
[17]
Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating embeddings of heterogeneous relational datasets for data integration tasks. In Proceedings of the SIGMOD. 1335–1349.
[18]
Chen Chen, Behzad Golshan, Alon Y. Halevy, Wang-Chiew Tan, and AnHai Doan. 2018. BigGorilla: An open-source ecosystem for data preparation and integration. IEEE Data Eng. Bull. 41, 2 (2018), 10–22.
[19]
Hong-Hai Do and Erhard Rahm. 2002. COMA-a system for flexible combination of schema matching approaches. In Proceedings of the 28th International Conference on Very Large Databases. 610–621.
[20]
Xin Dong, Alon Halevy, and Cong Yu. 2009. Data integration with uncertainty. The VLDB Journal 18, 2 (2009), 469–500.
[21]
Zlatan Dragisic, Valentina Ivanova, Patrick Lambrix, Daniel Faria, Ernesto Jiménez-Ruiz, and Catia Pesquita. 2016. User validation in ontology alignment. In Proceedings of the International Semantic Web Conference. Springer, 200–217.
[22]
Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed representations of tuples for entity resolution. PVLDB 11, 11 (2018), 1454–1467.
[23]
Jérôme Euzenat and Pavel Shvaiko. 2007. Ontology Matching, Vol. 18. Springer.
[24]
Sean M. Falconer and Margaret-Anne D. Storey. 2007. A cognitive support framework for ontology mapping. In Proceedings of the International Semantic Web Conference. Lecture Notes in Computer Science, Vol. 4825. Springer, Berlin, 114–127.
[25]
Ju Fan, Meiyu Lu, Beng Chin Ooi, Wang-Chiew Tan, and Meihui Zhang. 2014. A hybrid machine-crowdsourcing system for matching web tables. In Proceedings of the 2014 IEEE 30th International Conference on Data Engineering. IEEE, 976–987.
[26]
Raul Castro Fernandez, Essam Mansour, Abdulhakim A. Qahtan, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2018. Seeping semantics: Linking datasets using word embeddings for data discovery. In Proceedings of the 2018 IEEE 34th International Conference on Data Engineering. IEEE, 989–1000.
[27]
Avigdor Gal. 2011. Uncertain Schema Matching. Morgan & Claypool Publishers.
[28]
Avigdor Gal, Haggai Roitman, and Roee Shraga. 2019. Learning to rerank schema matches. IEEE Transactions on Knowledge and Data Engineering 33, 8 (2019), 3104–3116. https://ieeexplore.ieee.org/document/8944172.
[29]
Serena Sorrentino, Sonia Bergamaschi, Maciej Gawinecki, and Laura Po. 2009. Schema normalization for improving schema matching. In Proceedings of the International Conference on Conceptual Modeling. Springer, 280–293.
[30]
Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation 12, 10 (2000), 2451–2471.
[31]
Alon Y. Halevy and Jayant Madhavan. 2003. Corpus-based knowledge representation. In Proceedings of the IJCAI, Vol. 3. 1567–1572.
[32]
Joachim Hammer, Michael Stonebraker, and Oguzhan Topsakal. 2005. THALIA: Test harness for the assessment of legacy information integration approaches. In Proceedings of the ICDE. 485–486.
[33]
Bin He and Kevin Chen-Chuan Chang. 2005. Making holistic schema matching robust: An ensemble approach. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. 429–438.
[34]
Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika 30, 1/2 (1938), 81–93.
[35]
Prodromos Kolyvakis, Alexandros Kalousis, and Dimitris Kiritsis. 2018. Deepalignment: Unsupervised ontology matching with refined word vectors. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 787–798.
[36]
Patrick Lambrix and Anna Edberg. 2003. Evaluation of ontology merging tools in bioinformatics. In Proceedings of the 8th Pacific Symposium on Biocomputing. 589–600.
[37]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436.
[38]
Guoliang Li. 2017. Human-in-the-loop data integration. Proceedings of the VLDB Endowment 10, 12 (2017), 2006–2017.
[39]
Huanyu Li, Zlatan Dragisic, Daniel Faria, Valentina Ivanova, Ernesto Jiménez-Ruiz, Patrick Lambrix, and Catia Pesquita. 2019. User validation in ontology alignment: Functional assessment and impact. The Knowledge Engineering Review 34 (2019), e15.
[40]
Yuliang Li, Jinfeng Li, Yoshihiko Suhara, Jin Wang, Wataru Hirota, and Wang-Chiew Tan. 2021. Deep entity matching: Challenges and opportunities. Journal of Data and Information Quality 13, 1 (2021), 1–17.
[41]
Robert McCann, Warren Shen, and AnHai Doan. 2008. Matching schemas in online communities: A web 2.0 approach. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering. IEEE, 110–119.
[42]
Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. 2002. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In Proceedings of the ICDE. IEEE, 117–128.
[43]
Janet Metcalfe and Bridgid Finn. 2008. Evidence that judgments of learning are causally related to study choice. Psychonomic Bulletin & Review 15, 1 (2008), 174–179.
[44]
Giovanni Modica, Avigdor Gal, and Hasan M. Jamil. 2001. The use of machine-generated ontologies in dynamic information seeking. In Proceedings of the International Conference on Cooperative Information Systems. Springer, 433–447.
[45]
Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 International Conference on Management of Data. ACM, 19–34.
[46]
Quoc Viet Hung Nguyen, Thanh Tam Nguyen, Zoltán Miklós, Karl Aberer, Avigdor Gal, and Matthias Weidlich. 2014. Pay-as-you-go reconciliation in schema matching networks. In Proceedings of the ICDE. IEEE, 220–231.
[47]
Natalya Fridman Noy, Jonathan Mortensen, Mark A. Musen, and Paul R. Alexander. 2013. Mechanical turk as an ontology engineer?: Using microtasks as a component of an ontology-engineering workflow. In Proceedings of the Web Science 2013. 262–271.
[48]
Natalya F. Noy and Mark A. Musen. 2002. Evaluating ontology-mapping tools: Requirements and experience. In Proceedings of the Workshop on Evaluation of Ontology Tools at EKAW. 1–14.
[49]
Eric Peukert, Julian Eberius, and Erhard Rahm. 2011. AMC-A framework for modelling and comparing matching systems as matching processes. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering. IEEE, 1304–1307.
[50]
Christoph Pinkel, Carsten Binnig, Evgeny Kharlamov, and Peter Haase. 2013. IncMap: Pay as you go matching of relational schemata to OWL ontologies. In Proceedings of the OM. Citeseer, 37–48.
[51]
Erhard Rahm and Philip A. Bernstein. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (2001), 334–350.
[52]
Joel Ross, Lilly Irani, M. Silberman, Andrew Zaldivar, and Bill Tomlinson. 2010. Who are the crowdworkers?: Shifting demographics in mechanical turk. In Proceedings of the CHI’10 Extended Abstracts on Human Factors in Computing Systems. ACM, 2863–2872.
[53]
C. Sarasua, E. Simperl, and N. F. Noy. 2012. Crowdmap: Crowdsourcing ontology alignment with microtasks. In Proceedings of the ISWC.
[54]
Roee Shraga and Avigdor Gal. 2021. PoWareMatch: A quality-aware deep learning approach to improve human schema matching - technical report. arXiv:2109.07321. Retrieved on April 19, 2022 from https://arxiv.org/abs/2109.07321.
[55]
Roee Shraga, Avigdor Gal, and Haggai Roitman. 2018. What type of a matcher are you?: Coordination of human and algorithmic matchers. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA@SIGMOD. 12:1–12:7.
[56]
Roee Shraga, Avigdor Gal, and Haggai Roitman. 2020. ADnEV: Cross-domain schema matching using deep similarity matrix adjustment and evaluation. Proceedings of the VLDB Endowment 13, 9 (2020), 1401–1415.
[57]
Rohit Singh, Venkata Vamsikrishna Meduri, Ahmed Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Armando Solar-Lezama, and Nan Tang. 2017. Synthesizing entity matching rules by examples. Proceedings of the VLDB Endowment 11, 2 (2017), 189–202.
[58]
Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, and AnHai Doan. 2020. Data curation with deep learning. In Proceedings of the EDBT. 277–286.
[59]
Pei Wang, Ryan Shea, Jiannan Wang, and Eugene Wu. 2019. Progressive deep web crawling through keyword queries for data enrichment. In Proceedings of the 2019 International Conference on Management of Data. 229–246.
[60]
Cort J. Willmott and Kenji Matsuura. 2005. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research 30, 1 (2005), 79–82.
[61]
Chen Zhang, Lei Chen, H. V. Jagadish, Mengchen Zhang, and Yongxin Tong. 2018. Reducing uncertainty of schema matching via crowdsourcing with accuracy rates. IEEE Transactions on Knowledge and Data Engineering 32, 1 (2018), 135–151. https://ieeexplore.ieee.org/abstract/document/8533346.
[62]
Chen Jason Zhang, Lei Chen, H. V. Jagadish, and Caleb Chen Cao. 2013. Reducing uncertainty of schema matching via crowdsourcing. PVLDB 6, 9 (2013), 757–768.
[63]
Yi Zhang and Zachary G. Ives. 2020. Finding related tables in data lakes for interactive data science. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1951–1966.

Published In

Journal of Data and Information Quality  Volume 14, Issue 3
September 2022
155 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3533272

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 May 2022
Online AM: 28 March 2022
Accepted: 01 August 2021
Revised: 01 July 2021
Received: 01 March 2021
Published in JDIQ Volume 14, Issue 3

Author Tags

  1. Human-in-the-loop
  2. data quality
  3. deep learning

Qualifiers

  • Research-article
  • Refereed

Cited By

  • (2024) Automatic conceptual database design based on heterogeneous source artifacts. Computer Science and Information Systems 21:4 (1913-1961). DOI: 10.2298/CSIS240229065B. Online publication date: 2024
  • (2024) Data Pattern Matching Method Based on BERT and Attention Mechanism. 2024 Second International Conference on Networks, Multimedia and Information Technology (NMITCON) (1-8). DOI: 10.1109/NMITCON62075.2024.10699209. Online publication date: 9-Aug-2024
  • (2023) One Algorithm to Rule Them All: On the Changing Roles of Humans in Data Integration. Computer 56:4 (102-109). DOI: 10.1109/MC.2023.3240449. Online publication date: 1-Apr-2023
  • (2022) HumanAL. Proceedings of the Workshop on Human-In-the-Loop Data Analytics (1-8). DOI: 10.1145/3546930.3547496. Online publication date: 12-Jun-2022
