Unsupervised information extraction from unstructured, ungrammatical data sources on the World Wide Web

Published: 01 December 2007

Abstract

Information extraction from unstructured, ungrammatical data such as classified listings is difficult because traditional structural and grammatical extraction methods do not apply. Previous work has exploited reference sets to aid such extraction, but it did so using supervised machine learning. In this paper, we present an unsupervised approach that first selects the relevant reference set(s) automatically and then uses them for unsupervised extraction. We validate our approach with experimental results showing that our unsupervised extraction is competitive with supervised machine learning approaches, including the previous supervised approach that exploits reference sets.
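
To make the two steps named in the abstract concrete (choosing a reference set, then using it to label a listing), here is a minimal sketch. It is not the paper's algorithm: the token-coverage score, the Jaccard overlap, the function names `select_reference_set` and `extract`, and the toy data are all assumptions chosen for illustration.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())

def select_reference_set(posts, reference_sets):
    """Pick the reference set whose vocabulary covers the most post tokens.

    Assumption: simple token coverage stands in for whatever similarity
    measure the paper actually uses.
    """
    post_tokens = [t for p in posts for t in tokens(p)]

    def coverage(records):
        vocab = {t for r in records for v in r.values() for t in tokens(v)}
        return sum(t in vocab for t in post_tokens) / max(len(post_tokens), 1)

    return max(reference_sets, key=lambda name: coverage(reference_sets[name]))

def extract(post, records):
    """Label a post with attributes of its best-overlapping record.

    Assumption: Jaccard overlap on tokens picks the record; only attributes
    whose value tokens all occur in the post are returned.
    """
    post_set = set(tokens(post))

    def overlap(record):
        rec = {t for v in record.values() for t in tokens(v)}
        return len(rec & post_set) / len(rec | post_set)

    best = max(records, key=overlap)
    return {a: v for a, v in best.items() if set(tokens(v)) <= post_set}

# Hypothetical toy data: two candidate reference sets and two listings.
reference_sets = {
    "cars": [
        {"make": "Honda", "model": "Civic"},
        {"make": "Ford", "model": "Focus"},
    ],
    "hotels": [
        {"name": "Hyatt Regency", "area": "Downtown"},
        {"name": "Holiday Inn", "area": "Airport"},
    ],
}
posts = ["93 civic 5speed runs great obo", "2001 ford focus low miles"]

chosen = select_reference_set(posts, reference_sets)
labels = [extract(p, reference_sets[chosen]) for p in posts]
```

In this toy run the listings share vocabulary only with the car records, so `chosen` is `"cars"`, and each listing is labeled with the attribute values it actually mentions. No training labels are used at either step, which is the sense in which the sketch is unsupervised.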

Published In

International Journal on Document Analysis and Recognition, Volume 10, Issue 3-4
December 2007
109 pages
ISSN: 1433-2833
EISSN: 1433-2825

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

1. Information extraction
2. Information integration
3. Semantic annotation
4. Unstructured data sources
5. Unsupervised

Qualifiers

• Article