Discovering Conditional Matching Rules

Published: 29 June 2017

Abstract

Matching dependencies (MDs) have recently been proposed to make data dependencies tolerant of varying information representations, and have been found useful in data quality applications such as record matching. Instead of the strict equality function used in traditional dependency syntax (e.g., functional dependencies), MDs specify constraints based on similarity and identification. In practice, however, MDs may still be too strict and hold only in a subset of the tuples in a relation. We therefore study conditional matching dependencies (CMDs), which bind matching dependencies to a certain part of a table, i.e., MDs that apply conditionally to a subset of tuples. Compared to MDs, CMDs have more expressive power, which enables them to satisfy wider application needs. In this article, we study several important theoretical and practical issues of CMDs: irreducible CMDs with respect to implication, discovery of CMDs from data, reliable CMDs that are agreed on most by a relation, approximate CMDs that are almost satisfied in a relation, and applications of CMDs in record matching and missing value repairing. Through an extensive experimental evaluation on real data sets, we demonstrate the efficiency of the proposed CMD discovery algorithms and the effectiveness of CMDs in real applications.
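
The difference between an MD and a CMD can be illustrated with a minimal sketch. This is not the article's discovery algorithm; it only checks a given rule against a toy relation. The attribute names (`region`, `name`, `phone`), the similarity function (the standard library's `difflib.SequenceMatcher`), the threshold of 0.8, and the simplification of "identification" to plain equality are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a: str, b: str, threshold: float) -> bool:
    """Assumed similarity test: case-insensitive SequenceMatcher ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def violations(rows, lhs, rhs, threshold, condition=None):
    """Return id pairs that are similar on `lhs` but differ on `rhs`.

    With condition=None the rule is checked as an MD over all tuples.
    With a condition it is checked as a CMD, i.e. only over the tuples
    that satisfy the condition.
    """
    scope = [r for r in rows if condition is None or condition(r)]
    return [
        (s["id"], t["id"])
        for s, t in combinations(scope, 2)
        if similar(s[lhs], t[lhs], threshold) and s[rhs] != t[rhs]
    ]


# Hypothetical toy relation: similar names imply the same phone
# among "US" tuples, but not among "EU" tuples.
rows = [
    {"id": 1, "region": "US", "name": "Jon Smith", "phone": "555-0100"},
    {"id": 2, "region": "US", "name": "John Smith", "phone": "555-0100"},
    {"id": 3, "region": "EU", "name": "Anna Meier", "phone": "555-0200"},
    {"id": 4, "region": "EU", "name": "Anna Meyer", "phone": "555-0300"},
]

# MD: checked over the whole relation.
md_violations = violations(rows, "name", "phone", 0.8)

# CMD: the same rule, bound only to tuples with region == "US".
cmd_violations = violations(
    rows, "name", "phone", 0.8, condition=lambda r: r["region"] == "US"
)
```

On this toy data the unconditional rule is violated by the two EU tuples, while the same rule restricted to the US tuples has no violations, which is the situation the abstract describes as an MD being applicable only in a subset of tuples.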

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 11, Issue 4
Special Issue on KDD 2016 and Regular Papers
November 2017
419 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3119906
Editor: Jie Tang

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 June 2017
Accepted: 01 March 2017
Revised: 01 December 2016
Received: 01 August 2015
Published in TKDD Volume 11, Issue 4

Author Tags

  1. Conditional matching dependency
  2. data repair
  3. record matching

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NSFC
  • Tsinghua University Initiative Scientific Research Program
  • National Key Research Program of China

Article Metrics

  • Downloads (last 12 months): 15
  • Downloads (last 6 weeks): 0
Reflects downloads up to 30 Aug 2024.

Cited By

  • (2024) Measuring Approximate Functional Dependencies: A Comparative Study. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 3505-3518. DOI: 10.1109/ICDE60146.2024.00270. Online publication date: 13-May-2024.
  • (2024) Efficient Set-Based Order Dependency Discovery with a Level-Wise Hybrid Strategy. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 692-704. DOI: 10.1109/ICDE60146.2024.00059. Online publication date: 13-May-2024.
  • (2023) Smart Work Injury Management (SWIM) System: A Machine Learning Approach for the Prediction of Sick Leave and Rehabilitation Plan. Bioengineering 10:2 (172). DOI: 10.3390/bioengineering10020172. Online publication date: 28-Jan-2023.
  • (2022) Discovering Key Sub-Trajectories to Explain Traffic Prediction. Sensors 23:1 (130). DOI: 10.3390/s23010130. Online publication date: 23-Dec-2022.
  • (2022) Data Dependencies Extended for Variety and Veracity: A Family Tree. IEEE Transactions on Knowledge and Data Engineering 34:10 (4717-4736). DOI: 10.1109/TKDE.2020.3046443. Online publication date: 1-Oct-2022.
  • (2022) Efficient Processing of Group Planning Queries Over Spatial-Social Networks. IEEE Transactions on Knowledge and Data Engineering 34:5 (2135-2147). DOI: 10.1109/TKDE.2020.3004153. Online publication date: 1-May-2022.
  • (2022) Assessing the Existence of a Function in a Dataset with the g3 Indicator. 2022 IEEE 38th International Conference on Data Engineering (ICDE), pp. 607-620. DOI: 10.1109/ICDE53745.2022.00050. Online publication date: May-2022.
  • (2021) ADESIT. Proceedings of the VLDB Endowment 14:12 (2679-2682). DOI: 10.14778/3476311.3476318. Online publication date: 28-Oct-2021.
  • (2021) Fast incremental discovery of pointwise order dependencies. Proceedings of the VLDB Endowment 13:10 (1669-1681). DOI: 10.14778/3401960.3401965. Online publication date: 10-Mar-2021.
  • (2021) Discovering Relaxed Functional Dependencies Based on Multi-Attribute Dominance. IEEE Transactions on Knowledge and Data Engineering 33:9 (3212-3228). DOI: 10.1109/TKDE.2020.2967722. Online publication date: 1-Sep-2021.
