DOI: 10.1145/3035918.3054781
research-article
Public Access

Building Structured Databases of Factual Knowledge from Massive Text Corpora

Published: 09 May 2017 Publication History

Abstract

In today's computerized and information-based society, people are inundated with vast amounts of text data, ranging from news articles, social media posts, and scientific publications to a wide range of textual information from various domains (corporate reports, advertisements, legal acts, medical reports). To turn such massive unstructured text data into structured, actionable knowledge, one of the grand challenges is to gain an understanding of the factual information (e.g., entities, attributes, relations) in the text.
In this tutorial, we introduce data-driven methods for mining structured facts (i.e., entities and their relations/attributes for types of interest) from massive text corpora, to construct structured databases of factual knowledge (called StructDBs). State-of-the-art information extraction systems rely heavily on large amounts of task/corpus-specific labeled data (usually created by domain experts). In practice, the scale and efficiency of such a manual annotation process are rather limited, especially when dealing with text corpora of various kinds (domains, languages, genres). We focus on methods that are minimally supervised, domain-independent, and language-independent for timely StructDB construction across various application domains (news, social media, biomedical, business), and demonstrate on real datasets how these StructDBs aid in data exploration and knowledge discovery.

Cited By

  • (2021) Comparison of Active Learning Performance for Automatic Classification of Learning Contents. The Journal of Korean Institute of Information Technology 19:7 (1-7). DOI: 10.14801/jkiit.2021.19.7.1. Online publication date: 31-Jul-2021
  • (2018) Did You Enjoy the Ride? Understanding Passenger Experience via Heterogeneous Network Embedding. 2018 IEEE 34th International Conference on Data Engineering (ICDE) (1392-1403). DOI: 10.1109/ICDE.2018.00158. Online publication date: Apr-2018

      Published In

      SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
      May 2017
      1810 pages
      ISBN:9781450341974
      DOI:10.1145/3035918

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. attribute discovery
      2. entity recognition and typing
      3. massive text corpora
      4. quality phrase mining
      5. relation extraction

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS'17

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%
