DOI: 10.1145/336597.336644
Snowball: extracting relations from large plain-text collections

Published: 01 June 2000

    Abstract

    Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns, which in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention, and keeps only the most reliable ones for the next iteration. In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
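
The loop the abstract describes (seed tuples produce patterns, patterns produce new tuples, and only reliable patterns survive to the next round) can be illustrated with a toy sketch. This is not the Snowball implementation. The corpus, the single-capitalized-word entity matcher, the use of the text between the two entities as a "pattern", the positive/negative confidence rule, and the 0.8 threshold are all assumptions made here for illustration. The sketch also assumes each organization has exactly one location, which is what lets it detect conflicting extractions.

```python
import re

# Hypothetical toy corpus; every sentence is invented for this sketch.
SENTENCES = [
    "Microsoft, based in Redmond, announced a new product.",
    "Exxon, based in Irving, reported earnings.",
    "Microsoft and Redmond officials met on Monday.",
    "Microsoft and Intel signed a deal.",
    "Exxon's headquarters in Irving will expand.",
    "Boeing's headquarters in Seattle will expand.",
]

# Crude stand-in for a named-entity tagger: one capitalized word.
ENTITY = r"([A-Z][a-z]+)"


def generate_patterns(known, sentences):
    """Collect the text found between the two entities of each known tuple."""
    patterns = set()
    for org, loc in known.items():
        for s in sentences:
            m = re.search(re.escape(org) + r"(.{1,30}?)" + re.escape(loc), s)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_pattern(middle, sentences):
    """Return every (org, loc) pair that the pattern matches in the corpus."""
    regex = re.compile(ENTITY + re.escape(middle) + ENTITY)
    return [m.groups() for s in sentences for m in regex.finditer(s)]


def confidence(matches, known):
    """Fraction of matches on known organizations that agree with known tuples."""
    pos = sum(1 for org, loc in matches if known.get(org) == loc)
    neg = sum(1 for org, loc in matches if org in known and known[org] != loc)
    return pos / (pos + neg) if pos + neg else 0.0


def bootstrap(seeds, sentences, min_conf=0.8, max_iters=5):
    """Grow the tuple table until an iteration adds nothing new."""
    known = dict(seeds)
    for _ in range(max_iters):
        snapshot = dict(known)  # score patterns against the previous round only
        for middle in sorted(generate_patterns(snapshot, sentences)):
            matches = apply_pattern(middle, sentences)
            if confidence(matches, snapshot) >= min_conf:
                for org, loc in matches:
                    known.setdefault(org, loc)  # never overwrite a known tuple
        if len(known) == len(snapshot):
            break
    return known


print(bootstrap({"Microsoft": "Redmond"}, SENTENCES))
# {'Microsoft': 'Redmond', 'Exxon': 'Irving', 'Boeing': 'Seattle'}
```

In this toy run, the generic pattern " and " is generated from the seed but scores 0.5, because it also pairs Microsoft with Intel, so it is discarded. The pattern "'s headquarters in " only becomes available in the second iteration, after (Exxon, Irving) has been extracted, which is the snowballing effect.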

    Cited By

    • (2024) Information extraction from automotive reports for ontology population. Applied Ontology 19:2 (113-142). DOI: 10.3233/AO-230002. Online publication date: 13-Jun-2024
    • (2024) Retrieval Augmented Modeling. Advances in Multimodal Information Retrieval and Generation (135-157). DOI: 10.1007/978-3-031-57816-8_5. Online publication date: 26-Jun-2024
    • (2023) Toolformer. Proceedings of the 37th International Conference on Neural Information Processing Systems (68539-68551). DOI: 10.5555/3666122.3669119. Online publication date: 10-Dec-2023

    Published In

    DL '00: Proceedings of the fifth ACM conference on Digital libraries
    June 2000
    294 pages
    ISBN: 158113231X
    DOI: 10.1145/336597
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Conference

    DL00: Fifth ACM Conference on Digital Libraries
    June 2 - 7, 2000
    San Antonio, Texas, USA

    Acceptance Rates

    DL '00 Paper Acceptance Rate 44 of 132 submissions, 33%;
    Overall Acceptance Rate 95 of 346 submissions, 27%
