PoWareMatch: A Quality-aware Deep Learning Approach to Improve Human Schema Matching

Authors:

Roee Shraga,

Avigdor Gal

ACM Journal of Data and Information Quality (JDIQ), Volume 14, Issue 3

Article No.: 16, Pages 1 - 27

https://doi.org/10.1145/3483423

Published: 23 May 2022

Abstract

Schema matching is a core task of any data integration process. Although it has been investigated in the fields of databases, AI, Semantic Web, and data mining for many years, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure) with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept that describes the common assumption that human decisions represent reliable assessments of schemata correspondences, is, however, not an inherent property of human matchers. We therefore design PoWareMatch, which uses a deep learning mechanism to calibrate and filter human matching decisions according to the quality of a match; these decisions are then combined with algorithmic matching to generate better match results. We provide empirical evidence, based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch predicts well the benefit of extending the match with an additional correspondence and generates high-quality matches. In addition, PoWareMatch outperforms state-of-the-art matching algorithms.

Appendix

A Monotonic Evaluation and Theorem Proofs

The appendix is devoted to the proofs of Theorems 1 and 2.

Theorem 1.

Recall (R) is a MIEM over \(\Sigma ^{\subseteq }\), Precision (P) is a MIEM over \(\Sigma ^P\), and f-measure (F) is a MIEM over \(\Sigma ^F\).

To prove Theorem 1, we use two lemmas. The first states that recall is a MIEM over all match pairs in \(\Sigma ^{\subseteq }\). The second, since precision and f-measure are not monotonic over the full set of pairs in \(\Sigma ^{\subseteq }\), gives the conditions under which monotonicity can be guaranteed for both measures. Due to space considerations, we refer the interested reader to a technical report [54] for proofs of both lemmas.

Lemma 2.

Recall (R) is a MIEM over \(\Sigma ^{\subseteq }\).

Lemma 3.

For \((\sigma , \sigma ^{\prime })\in \Sigma ^{\subseteq }\):

– \(P(\sigma) \le P(\sigma ^{\prime })\) iff \(P(\sigma)\le P(\Delta);\)

– \(F(\sigma) \le F(\sigma ^{\prime })\) iff \(0.5\cdot F(\sigma)\le P(\Delta).\)
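
A brief sketch of why both conditions hold may help here; it is not the proof given in [54]. It assumes the standard definitions \(P(\sigma)=|\sigma \cap \sigma ^*|/|\sigma |\) and \(F(\sigma)=2|\sigma \cap \sigma ^*|/(|\sigma |+|\sigma ^*|)\) with respect to a reference match \(\sigma ^*\), and that \(\sigma ^{\prime }=\sigma \cup \Delta\) with \(\Delta\) disjoint from \(\sigma\) and both non-empty. The shorthand \(a=|\sigma \cap \sigma ^*|\), \(b=|\sigma |\), \(c=|\Delta \cap \sigma ^*|\), \(d=|\Delta |\), and \(n=|\sigma ^*|\) is introduced only for this sketch.

\[
P(\sigma)\le P(\sigma ^{\prime })
\iff \frac{a}{b}\le \frac{a+c}{b+d}
\iff a(b+d)\le b(a+c)
\iff ad\le bc
\iff \frac{a}{b}\le \frac{c}{d}=P(\Delta)
\]

\[
F(\sigma)\le F(\sigma ^{\prime })
\iff \frac{2a}{b+n}\le \frac{2(a+c)}{b+d+n}
\iff a(b+d+n)\le (a+c)(b+n)
\iff ad\le c(b+n)
\iff \frac{a}{b+n}\le \frac{c}{d}
\iff 0.5\cdot F(\sigma)\le P(\Delta)
\]

All denominators are positive under the stated assumptions, so each cross-multiplication preserves the direction of the inequality.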

Proof of Theorem 1

The first part of the theorem follows directly from Lemma 2. For the remainder of the proof we rely on Lemma 3. Let \((\sigma , \sigma ^{\prime })\in \Sigma ^{P}\) be a match pair. By definition of \(\Sigma ^{P}\), \(P(\sigma)\le P(\Delta)\), and by Lemma 3 we can conclude that \(P(\sigma) \le P(\sigma ^{\prime })\). Similarly, let \((\sigma , \sigma ^{\prime })\in \Sigma ^{F}\) be a match pair. By definition of \(\Sigma ^{F}\), \(0.5\cdot F(\sigma)\le P(\Delta)\), and by Lemma 3 we can infer that \(F(\sigma) \le F(\sigma ^{\prime })\), which concludes the proof. □
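
As an informal sanity check of Theorem 1 (not part of the original proof), the following Python sketch tests the recall claim of Lemma 2 and the two equivalences of Lemma 3 on randomly generated matches. It assumes the standard set-based definitions of recall, precision, and f-measure against a reference match, and that \(\sigma ^{\prime }=\sigma \cup \Delta\) with \(\Delta\) disjoint from \(\sigma\). The integer-coded correspondences and all function names are hypothetical.

```python
from fractions import Fraction
import random

# Assumption: matches are sets of correspondences, coded here as integers,
# and the measures follow the standard set-based definitions.
def recall(match, reference):
    return Fraction(len(match & reference), len(reference))

def precision(match, reference):
    return Fraction(len(match & reference), len(match))

def f_measure(match, reference):
    return Fraction(2 * len(match & reference), len(match) + len(reference))

def check_pair(sigma, delta, reference):
    """Check Lemma 2 and both equivalences of Lemma 3 for (sigma, sigma | delta)."""
    sigma_prime = sigma | delta
    p_delta = precision(delta, reference)
    recall_ok = recall(sigma, reference) <= recall(sigma_prime, reference)
    precision_ok = (precision(sigma, reference) <= p_delta) == (
        precision(sigma, reference) <= precision(sigma_prime, reference)
    )
    f_ok = (f_measure(sigma, reference) / 2 <= p_delta) == (
        f_measure(sigma, reference) <= f_measure(sigma_prime, reference)
    )
    return recall_ok and precision_ok and f_ok

def random_trial(rng, universe_size=12):
    """Draw a non-empty reference, a non-empty sigma, and a non-empty disjoint delta."""
    universe = list(range(universe_size))
    reference = set(rng.sample(universe, rng.randint(1, universe_size)))
    sigma = set(rng.sample(universe, rng.randint(1, universe_size - 1)))
    rest = [c for c in universe if c not in sigma]
    delta = set(rng.sample(rest, rng.randint(1, len(rest))))
    return sigma, delta, reference

rng = random.Random(0)
all_hold = all(check_pair(*random_trial(rng)) for _ in range(2000))
print(all_hold)
```

Exact rational arithmetic (`Fraction`) is used so that the boundary cases, where the two sides are equal, are not disturbed by floating-point rounding.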

Theorem 2.

Let \(R/P/F\) be a random variable, whose values are taken from the domain of \([0,1]\), and \(\Delta\) be a singleton correspondence set (\(|\Delta |=1\)). \(\Delta\) is a probabilistic local annealer with respect to \(R/P/F\) over \(\Sigma ^{\subseteq _1}/\Sigma ^{E(P)}/\Sigma ^{E(F)}\).

Proof of Theorem 2

Let \(\Delta\) be a singleton correspondence set (\(|\Delta |=1\)). Let R be a random variable and let \(\Delta =\Delta _{(\sigma ,\sigma ^{\prime })}\) such that \((\sigma ,\sigma ^{\prime })\in \Sigma ^{\subseteq _1}\).

According to Corollary 1, \(\Delta\) is a local annealer with respect to R over \(\Sigma ^{\subseteq _1}\) and therefore \(R(\sigma)\le R(\sigma ^{\prime })\), regardless of \(Pr\lbrace \Delta \in \sigma ^*\rbrace\). Therefore, for any \(p=Pr\lbrace \Delta \in \sigma ^*\rbrace , p\cdot R(\sigma)\le p\cdot R(\sigma ^{\prime })\) and by definition of expectation, \(E(R(\sigma))\le E(R(\sigma ^{\prime }))\).

Let P be a random variable and let \(\Delta =\Delta _{(\sigma ,\sigma ^{\prime })}\) such that \((\sigma ,\sigma ^{\prime })\in \Sigma ^{E(P)}\). By definition of \(\Sigma ^{E(P)}\), \(E(P(\sigma))\le Pr\lbrace \Delta \in \sigma ^*\rbrace\) and using Lemma 1 we obtain \(E(P(\sigma)) \le E(P(\sigma ^{\prime }))\).

Let F be a random variable and let \(\Delta =\Delta _{(\sigma ,\sigma ^{\prime })}\) such that \((\sigma ,\sigma ^{\prime })\in \Sigma ^{E(F)}\). By definition of \(\Sigma ^{E(F)}\), \(E(F(\sigma))\le 0.5\cdot Pr\lbrace \Delta \in \sigma ^*\rbrace\) and using Lemma 1 we obtain \(E(F(\sigma)) \le E(F(\sigma ^{\prime }))\).

We can therefore conclude, by Definition 5, that \(\Delta\) is a probabilistic local annealer with respect to \(R/P/F\) over \(\Sigma ^{\subseteq _1}/\Sigma ^{E(P)}/\Sigma ^{E(F)}\).□
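
The precision case can be illustrated numerically. The sketch below is a toy model under stated assumptions, not the authors' implementation: each correspondence is taken to be correct independently with a known probability, and expected precision is computed exactly by enumerating all correctness outcomes. The function names and the probabilities are hypothetical.

```python
from fractions import Fraction
from itertools import product

def expected_precision(probs):
    """Exact E[P] for a match whose i-th correspondence is correct
    with probability probs[i] (independence is an assumption of this sketch)."""
    total = Fraction(0)
    for outcome in product([0, 1], repeat=len(probs)):
        weight = Fraction(1)
        for correct, p in zip(outcome, probs):
            weight *= p if correct else 1 - p
        # Precision of this outcome: correct correspondences over match size.
        total += weight * Fraction(sum(outcome), len(outcome))
    return total

def expected_change(sigma_probs, delta_prob):
    """Expected precision before and after adding a singleton Delta."""
    before = expected_precision(sigma_probs)
    after = expected_precision(sigma_probs + [delta_prob])
    return before, after

# Hypothetical current match with three correspondences.
sigma_probs = [Fraction(9, 10), Fraction(1, 2), Fraction(3, 10)]

# Candidate whose probability is at least E[P(sigma)]: expected precision does not drop.
before_hi, after_hi = expected_change(sigma_probs, Fraction(7, 10))
# Candidate whose probability is below E[P(sigma)]: expected precision drops.
before_lo, after_lo = expected_change(sigma_probs, Fraction(2, 5))

print(before_hi, after_hi, after_lo)
```

With these numbers, \(E(P(\sigma))=17/30\). Adding a correspondence with probability \(7/10\) raises the expected precision to \(3/5\), whereas one with probability \(2/5\) lowers it to \(21/40\), in line with the condition \(E(P(\sigma))\le Pr\lbrace \Delta \in \sigma ^*\rbrace\).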

References

[1] 2021. Data. Retrieved April 19, 2022 from https://github.com/shraga89/PoWareMatch/tree/master/DataFiles.
