DOI: 10.1145/2882903.2904442

Extracting Databases from Dark Data with DeepDive

Published: 14 June 2016


DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but cannot be exploited by standard relational tools. If the information in dark data (scientific papers, Web classified ads, customer service notes, and so on) were instead in a relational database, it would give analysts access to a massive and highly valuable new set of "big data" to exploit.
DeepDive is distinctive among information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases whose accuracy matches that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers.
DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design in turn rests on several core innovations in probabilistic training and inference.



Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dark data
  2. data integration
  3. information extraction
  4. knowledge base construction


  • Research-article

Funding Sources

  • ONR
  • NSF


SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

