research-article

Extracting Databases from Dark Data with DeepDive

Authors: Christopher Ré, Michael Cafarella, Feng Niu

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 847 - 859

https://doi.org/10.1145/2882903.2904442

Published: 14 June 2016

Abstract

DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but that standard relational tools cannot exploit. If the information in dark data (scientific papers, Web classified ads, customer service notes, and so on) were instead in a relational database, it would give analysts access to a massive and highly valuable new set of "big data" to exploit.

DeepDive differs from previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases whose accuracy matches that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and other domains. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers.

DeepDive rests on an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. That design, in turn, depends on several core innovations in probabilistic training and inference.

Index Terms

Extracting Databases from Dark Data with DeepDive
1. Information systems
  1. Data management systems
    1. Information integration
      1. Extraction, transformation and loading
      2. Wrappers (data mining)
  2. Information systems applications
    1. Data mining
      1. Data cleaning
    2. Decision support systems
      1. Data analytics
2. Mathematics of computing
  1. Probability and statistics
    1. Probabilistic reasoning algorithms
      1. Markov-chain Monte Carlo methods
        1. Gibbs sampling
    2. Probabilistic representations
      1. Factor graphs

Information

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016

2300 pages

ISBN:9781450335317

DOI:10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

ONR
DARPA
NSF

Conference

SIGMOD/PODS'16: International Conference on Management of Data

Sponsor: SIGMOD

June 26 - July 1, 2016

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 30
Total Downloads: 1,304
Downloads (Last 12 months): 162
Downloads (Last 6 weeks): 28

Reflects downloads up to 30 Aug 2024

Citations

Cited By

Kayali M, Lykov A, Fountalis I, Vasiloglou N, Olteanu D, Suciu D (2024). Chorus: Foundation Models for Unified Data Discovery and Exploration. Proceedings of the VLDB Endowment 17:8 (2104-2114). Online publication date: 1-Apr-2024.
https://dl.acm.org/doi/10.14778/3659437.3659461

Ben Amara O, Hadouaj S, Meneghetti N (2024). StarfishDB: A Query Execution Engine for Relational Probabilistic Programming. Proceedings of the ACM on Management of Data 2:3 (1-31). Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3654988

M S, G M, V P, S P, K S (2024). Tor-Quest (The Onion Router Crawler). 2024 Ninth International Conference on Science Technology Engineering and Mathematics (ICONSTEM) (1-5). Online publication date: 4-Apr-2024.
https://doi.org/10.1109/ICONSTEM60960.2024.10568905

Steegh E, Sileno G, Andrade F, Grabmair M (2023). No Labels? Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law (277-286). Online publication date: 19-Jun-2023.
https://dl.acm.org/doi/10.1145/3594536.3595171

Corallo A, Crespino A, Vecchio V, Lazoi M, Marra M (2023). Understanding and Defining Dark Data for the Manufacturing Industry. IEEE Transactions on Engineering Management 70:2 (700-712). Online publication date: Feb-2023.
https://doi.org/10.1109/TEM.2021.3051981

Sal B, de la Vega A, López-Martínez P, García-Saiz D, Grande A, López D, Sánchez P (2023). Towards the Specification and Generation of Time Series Datasets from Data Lakes. 2023 IEEE 31st International Requirements Engineering Conference Workshops (REW) (302-306). Online publication date: Sep-2023.
https://doi.org/10.1109/REW57809.2023.00057

Wang C, Li Y, Chen J, Ma X (2023). Named entity annotation schema for geological literature mining in the domain of porphyry copper deposits. Ore Geology Reviews 152 (105243). Online publication date: Jan-2023.
https://doi.org/10.1016/j.oregeorev.2022.105243

Deeb-Swihart J, Endert A, Bruckman A (2022). Ethical Tensions in Applications of AI for Addressing Human Trafficking: A Human Rights Perspective. Proceedings of the ACM on Human-Computer Interaction 6:CSCW2 (1-29). Online publication date: 11-Nov-2022.
https://dl.acm.org/doi/10.1145/3555186

Qi P, Sun Y, Luo H, Guizani M (2022). Scratch-DKG: A Framework for Constructing Scratch Domain Knowledge Graph. IEEE Transactions on Emerging Topics in Computing 10:1 (170-185). Online publication date: 1-Jan-2022.
https://doi.org/10.1109/TETC.2020.2996710

Liu Y, Wang Y, Gao L, Guo C, Xie Y, Xiao Z (2021). Deep Hash-based Relevance-aware Data Quality Assessment for Image Dark Data. ACM/IMS Transactions on Data Science 2:2 (1-26). Online publication date: 8-Apr-2021.
https://dl.acm.org/doi/10.1145/3420038
