Unsupervised information extraction from unstructured, ungrammatical data sources on the World Wide Web

Authors:

Matthew Michelson,

Craig A. Knoblock

International Journal on Document Analysis and Recognition, Volume 10, Issue 3-4

Pages 211 - 226

Published: 01 December 2007

Abstract

Information extraction from unstructured, ungrammatical data such as classified listings is difficult because traditional structural and grammatical extraction methods do not apply. Previous work has exploited reference sets to aid such extraction, but it did so using supervised machine learning. In this paper, we present an unsupervised approach that both selects the relevant reference set(s) automatically and then uses it for unsupervised extraction. We validate our approach with experimental results that show our unsupervised extraction is competitive with supervised machine learning approaches, including the previous supervised approach that exploits reference sets.
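
To make the idea of extraction with a reference set concrete, here is a minimal sketch under stated assumptions; it is not the authors' algorithm. It assumes a reference set is a list of attribute-value records, and it uses plain token-overlap (Jaccard) similarity as a stand-in for whatever selection and matching measures the paper actually uses. All names (`select_reference_set`, `best_match`, `extract`), the toy car and hotel records, and the sample post are hypothetical.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(post, reference_set):
    """Return (score, record) for the reference record most similar to the post."""
    post_tokens = tokens(post)
    scored = [
        (jaccard(post_tokens, tokens(" ".join(record.values()))), record)
        for record in reference_set
    ]
    return max(scored, key=lambda pair: pair[0])


def select_reference_set(post, candidates):
    """Pick the candidate reference set whose best record scores highest."""
    return max(candidates, key=lambda name: best_match(post, candidates[name])[0])


def extract(post, record):
    """Label post tokens that also occur in the matched record's attribute values."""
    post_tokens = tokens(post)
    return {
        attr: sorted(post_tokens & tokens(value))
        for attr, value in record.items()
        if post_tokens & tokens(value)
    }


# Hypothetical toy data: two candidate reference sets and one classified post.
cars = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Camry"},
]
hotels = [
    {"name": "Holiday Inn", "area": "Downtown"},
    {"name": "Hyatt Regency", "area": "Airport"},
]
candidates = {"cars": cars, "hotels": hotels}
post = "93 honda civic 4dr runs great $1200 obo"

chosen = select_reference_set(post, candidates)
score, record = best_match(post, candidates[chosen])
print(chosen, extract(post, record))
```

No labeled training data appears anywhere in the sketch: the reference records alone decide both which set is relevant and which tokens get which attribute label, which is the sense in which such an approach can be unsupervised.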

Index Terms

1. Applied computing
  1. Document management and text processing
2. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning

Published In

International Journal on Document Analysis and Recognition Volume 10, Issue 3-4

December 2007

109 pages

ISSN:1433-2833

EISSN:1433-2825

Copyright © 2007 Springer-Verlag.

Publisher

Springer-Verlag

Berlin, Heidelberg

Cited By

Omari A, Shoham S, Yahav E, Dillon L, Visser W, Williams L (2016) Cross-supervised synthesis of web-crawlers. Proceedings of the 38th International Conference on Software Engineering, pp. 368-379. Online publication date: 14-May-2016
https://dl.acm.org/doi/10.1145/2884781.2884842
Dhillon P, Sellamanickam S, Selvaraj S (2011) Semi-supervised multi-task learning of structured prediction models for web information extraction. Proceedings of the 20th ACM international conference on Information and knowledge management, pp. 957-966. Online publication date: 24-Oct-2011
https://dl.acm.org/doi/10.1145/2063576.2063713
Tuchinda R, Knoblock C, Szekely P (2011) Building Mashups by Demonstration. ACM Transactions on the Web 5:3, pp. 1-45. Online publication date: 1-Jul-2011
https://dl.acm.org/doi/10.1145/1993053.1993058
Tang J, Yao L, Zhang D, Zhang J (2010) A Combination Approach to Web User Profiling. ACM Transactions on Knowledge Discovery from Data 5:1, pp. 1-44. Online publication date: 1-Dec-2010
https://dl.acm.org/doi/10.1145/1870096.1870098
Brauer F, Huber M, Hackenbroich G, Leser U, Naumann F, Barczynski W, Rappa M, Jones P, Freire J, Chakrabarti S (2010) Graph-based concept identification and disambiguation for enterprise search. Proceedings of the 19th international conference on World wide web, pp. 171-180. Online publication date: 26-Apr-2010
https://dl.acm.org/doi/10.1145/1772690.1772709
Michelson M, Knoblock C (2009) Exploiting background knowledge to build reference sets for information extraction. Proceedings of the 21st International Joint Conference on Artificial Intelligence, pp. 2076-2082. Online publication date: 11-Jul-2009
https://dl.acm.org/doi/10.5555/1661445.1661777
Elmeleegy H, Madhavan J, Halevy A (2009) Harvesting relational tables from lists on the web. Proceedings of the VLDB Endowment 2:1, pp. 1078-1089. Online publication date: 1-Aug-2009
https://dl.acm.org/doi/10.14778/1687627.1687749
Jimenez S, Becerra C, Gelbukh A, Gonzalez F (2009) Generalized Mongue-Elkan Method for Approximate Text String Comparison. Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing, pp. 559-570. Online publication date: 17-Feb-2009
https://dl.acm.org/doi/10.1007/978-3-642-00382-0_45
