Detecting Inclusion Dependencies on Very Many Tables

Published: 31 July 2017

Abstract

Detecting inclusion dependencies, the prerequisite of foreign keys, in relational data is a challenging task. Detecting them among the hundreds of thousands or even millions of tables on the Web is daunting. Still, such inclusion dependencies can help connect disparate pieces of information on the Web and reveal unknown relationships among tables.
With the algorithm Many, we present a novel inclusion dependency detection algorithm, specialized for the very many—but typically small—tables found on the Web. We make use of Bloom filters and indexed bit-vectors to show the feasibility of our approach. Our evaluation on two corpora of Web tables shows a superior runtime over known approaches and its usefulness to reveal hidden structures on the Web.
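To make the two ideas named in the abstract concrete, the following is a minimal sketch, not the Many algorithm itself. It assumes a unary inclusion dependency A ⊆ B means every distinct value of column A also occurs in column B, and it uses a small Bloom-filter-style bit signature per column to discard candidate pairs before an exact check. The signature width, the number of hash functions, and all function and column names are illustrative assumptions and do not come from the paper.

```python
import hashlib
from itertools import permutations

NUM_BITS = 64    # assumed signature width, chosen only for illustration
NUM_HASHES = 2   # assumed number of hash functions, chosen only for illustration


def bloom_signature(values):
    """Return an integer bit mask with up to NUM_HASHES bits set per distinct value."""
    sig = 0
    for v in set(values):
        digest = hashlib.sha256(str(v).encode("utf-8")).digest()
        for i in range(NUM_HASHES):
            # Use disjoint 4-byte slices of the digest as independent hash values.
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            sig |= 1 << (chunk % NUM_BITS)
    return sig


def find_unary_inds(columns):
    """Return sorted (dependent, referenced) pairs whose values satisfy dependent ⊆ referenced.

    `columns` maps a hypothetical column name to its list of values.
    """
    sets = {name: set(vals) for name, vals in columns.items()}
    sigs = {name: bloom_signature(vals) for name, vals in sets.items()}
    result = []
    for dep, ref in permutations(columns, 2):
        # If dep sets a bit that ref does not, some value of dep is missing
        # from ref, so the pair can be discarded without comparing values.
        if sigs[dep] & ~sigs[ref]:
            continue
        # The signature test can yield false positives, so verify exactly.
        if sets[dep] <= sets[ref]:
            result.append((dep, ref))
    return sorted(result)


columns = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "products.id": [10, 11],
}
inds = find_unary_inds(columns)
```

The signature comparison never rejects a valid dependency, because a subset of values can only set a subset of bits. It may let invalid candidates through, which is why the exact set comparison remains as the final step.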

Published In

ACM Transactions on Database Systems  Volume 42, Issue 3
Invited Paper from SIGMOD 2015, Invited Paper from PODS 2015, Regular Papers and Technical Correspondence
September 2017
220 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3129336

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 July 2017
Accepted: 01 May 2017
Revised: 01 April 2017
Received: 01 February 2016
Published in TODS Volume 42, Issue 3

Author Tags

  1. Inclusion dependency discovery
  2. data profiling
  3. foreign key discovery
  4. web data management

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2024) Minimal coverage of generalized typed inclusion dependencies in databases. Modeling and Analysis of Information Systems 31:1 (78-89). DOI: 10.18255/1818-1015-2024-1-78-89. Online publication date: 28-Mar-2024
  • (2024) Efficient Differential Dependency Discovery. Proceedings of the VLDB Endowment 17:7 (1552-1564). DOI: 10.14778/3654621.3654624. Online publication date: 1-Mar-2024
  • (2024) Determining the Largest Overlap between Tables. Proceedings of the ACM on Management of Data 2:1 (1-26). DOI: 10.1145/3639303. Online publication date: 26-Mar-2024
  • (2024) Entity/Relationship Profiling. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (5393-5396). DOI: 10.1109/ICDE60146.2024.00411. Online publication date: 13-May-2024
  • (2024) Efficient Set-Based Order Dependency Discovery with a Level-Wise Hybrid Strategy. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (692-704). DOI: 10.1109/ICDE60146.2024.00059. Online publication date: 13-May-2024
  • (2023) Fast Discovery of Inclusion Dependencies with Desbordante. 2023 33rd Conference of Open Innovations Association (FRUCT) (264-275). DOI: 10.23919/FRUCT58615.2023.10143047. Online publication date: 24-May-2023
  • (2023) Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph. Proceedings of the VLDB Endowment 16:10 (2578-2590). DOI: 10.14778/3603581.3603596. Online publication date: 1-Jun-2023
  • (2023) Discovering Similarity Inclusion Dependencies. Proceedings of the ACM on Management of Data 1:1 (1-24). DOI: 10.1145/3588929. Online publication date: 30-May-2023
  • (2023) Discovery of Cross Joins. IEEE Transactions on Knowledge and Data Engineering 35:7 (6839-6851). DOI: 10.1109/TKDE.2022.3192842. Online publication date: 1-Jul-2023
  • (2023) Towards the efficient discovery of meaningful functional dependencies. Information Systems 116:C. DOI: 10.1016/j.is.2023.102224. Online publication date: 1-Jun-2023
