research-article

Building Structured Databases of Factual Knowledge from Massive Text Corpora

Authors:

Jiawei Han

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

Pages 1741 - 1745

https://doi.org/10.1145/3035918.3054781

Published: 09 May 2017

Abstract

In today's computerized and information-based society, people are inundated with vast amounts of text data, ranging from news articles, social media posts, and scientific publications to a wide range of textual information from various domains (corporate reports, advertisements, legal acts, medical reports). To turn such massive unstructured text data into structured, actionable knowledge, one of the grand challenges is to gain an understanding of the factual information (e.g., entities, attributes, relations) in the text.

In this tutorial, we introduce data-driven methods for mining structured facts (i.e., entities and their relations/attributes for types of interest) from massive text corpora to construct structured databases of factual knowledge (called StructDBs). State-of-the-art information extraction systems rely heavily on large amounts of task/corpus-specific labeled data (usually created by domain experts). In practice, the scale and efficiency of such a manual annotation process are rather limited, especially when dealing with text corpora of various kinds (domains, languages, genres). We focus on methods that are minimally supervised, domain-independent, and language-independent for timely StructDB construction across various application domains (news, social media, biomedical, business), and demonstrate on real datasets how these StructDBs aid in data exploration and knowledge discovery.
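As a rough illustration of what a structured fact and minimal supervision can look like in practice, the sketch below uses distant supervision: sentences are labeled automatically whenever both arguments of a known seed fact co-occur in them. This is a generic technique chosen for illustration, not necessarily the method presented in the tutorial; the `Fact` class, the `distant_label` function, the seed facts, and the sentences are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One structured fact: (head entity, relation type, tail entity)."""
    head: str
    relation: str
    tail: str


# Hypothetical seed knowledge base (illustrative only).
SEED_FACTS = [
    Fact("Barack Obama", "born_in", "Honolulu"),
    Fact("Marie Curie", "born_in", "Warsaw"),
]

# Hypothetical corpus (illustrative only).
SENTENCES = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Marie Curie moved from Warsaw to Paris in 1891.",
    "Honolulu is the capital of Hawaii.",
]


def distant_label(sentences, seed_facts):
    """Pair each sentence with every seed fact whose two arguments both appear in it.

    The labels are heuristic: co-occurrence does not guarantee that the
    sentence actually expresses the relation.
    """
    labeled = []
    for sentence in sentences:
        for fact in seed_facts:
            if fact.head in sentence and fact.tail in sentence:
                labeled.append((sentence, fact))
    return labeled


LABELED = distant_label(SENTENCES, SEED_FACTS)
for sentence, fact in LABELED:
    print(f"{fact.relation}({fact.head}, {fact.tail}) <- {sentence}")
```

The second sentence is matched even though it does not state where Marie Curie was born. Label noise of this kind is a typical cost of replacing manual annotation with automatically generated labels.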

Cited By

Kang E, Ko D (2021). Comparison of Active Learning Performance for Automatic Classification of Learning Contents. The Journal of Korean Institute of Information Technology, 19(7):1-7. Online publication date: 31-Jul-2021. https://doi.org/10.14801/jkiit.2021.19.7.1

Yang C, Zhang C, Chen X, Ye J, Han J (2018). Did You Enjoy the Ride? Understanding Passenger Experience via Heterogeneous Network Embedding. 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1392-1403. Online publication date: Apr-2018. https://doi.org/10.1109/ICDE.2018.00158

Index Terms

Building Structured Databases of Factual Knowledge from Massive Text Corpora
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Information extraction
  2. Information systems applications
    1. Data mining

Published In

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

May 2017

1810 pages

ISBN:9781450341974

DOI:10.1145/3035918

General Chairs: Rada Chirkova (North Carolina State University, USA), Jun Yang (Duke University, USA)

Program Chair: Dan Suciu (University of Washington, USA)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

U.S. Army Research Lab
NIGMS
National Science Foundation

Conference

SIGMOD/PODS'17: International Conference on Management of Data

Sponsor: SIGMOD

May 14 - 19, 2017

Chicago, Illinois, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

