DOI:10.1145/2911451.2914719
short-paper

Enhancing First Story Detection using Word Embeddings

Published: 07 July 2016

    Abstract

    In this paper, we show how word embeddings can be used to increase the effectiveness of a state-of-the-art Locality Sensitive Hashing (LSH) based first story detection (FSD) system over a standard tweet corpus. Vocabulary mismatch, in which related tweets use different words, is a serious hindrance to the effectiveness of a modern FSD system: a tweet can be flagged as a first story even though a related tweet, which uses different but synonymous words, has already been returned as a first story. We propose a novel approach to mitigating this problem of lexical variation, based on tweet expansion. In particular, we expand tweets with semantically related paraphrases identified via word embeddings automatically mined from a background tweet corpus. Through experimentation on a large data stream comprising 50 million tweets, we show that FSD effectiveness can be improved by 9.5% over a state-of-the-art FSD system.
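The vocabulary-mismatch problem and the effect of tweet expansion can be illustrated with a minimal sketch. This is not the authors' system: the embedding table is a hand-made toy (a real system would mine embeddings from a background tweet corpus), the names `expand`, `term_overlap`, and `detect` and the parameters `k`, `min_sim`, and `threshold` are hypothetical, the similarity is a plain binary-term cosine, and an exhaustive scan over earlier tweets stands in for the LSH-based nearest-neighbour search.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
EMBEDDINGS = {
    "quake":      (0.90, 0.10, 0.00),
    "earthquake": (0.88, 0.12, 0.02),
    "tremor":     (0.80, 0.20, 0.05),
    "hits":       (0.10, 0.90, 0.00),
    "strikes":    (0.12, 0.88, 0.03),
    "tokyo":      (0.00, 0.10, 0.90),
    "concert":    (0.05, 0.45, 0.50),
}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(tokens, k=2, min_sim=0.95):
    """Add up to k embedding neighbours (similarity >= min_sim) per known token."""
    expanded = set(tokens)
    for tok in tokens:
        if tok not in EMBEDDINGS:
            continue  # out-of-vocabulary tokens are kept but not expanded
        scored = sorted(
            ((cosine(EMBEDDINGS[tok], vec), word)
             for word, vec in EMBEDDINGS.items() if word != tok),
            reverse=True,
        )
        expanded.update(word for sim, word in scored[:k] if sim >= min_sim)
    return expanded


def term_overlap(a, b):
    """Cosine similarity between two term sets treated as binary vectors."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0


def detect(stream, use_expansion, threshold=0.5):
    """Flag a tweet as a first story when its novelty exceeds the threshold.

    Novelty is 1 minus the similarity to the nearest earlier tweet. The
    exhaustive scan below is a stand-in for approximate search with LSH.
    """
    seen, flags = [], []
    for tweet in stream:
        tokens = tweet.lower().split()
        terms = expand(tokens) if use_expansion else set(tokens)
        nearest = max((term_overlap(terms, old) for old in seen), default=0.0)
        flags.append(1.0 - nearest > threshold)
        seen.append(terms)
    return flags


STREAM = ["quake hits tokyo", "earthquake strikes tokyo", "concert tonight tokyo"]

if __name__ == "__main__":
    print(detect(STREAM, use_expansion=False))
    print(detect(STREAM, use_expansion=True))
```

Without expansion, the second tweet shares only "tokyo" with the first and is wrongly flagged as a new story. With expansion, both tweets map to the same term set, so the second is recognised as a follow-up while the unrelated third tweet is still flagged.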




    Published In

    SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
    July 2016
    1296 pages
    ISBN:9781450340694
    DOI:10.1145/2911451
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. document expansion
    2. locality sensitive hashing
    3. nearest neighbour search
    4. paraphrase
    5. streaming data
    6. twitter

    Qualifiers

    • Short-paper


    Conference

    SIGIR '16

    Acceptance Rates

    SIGIR '16 Paper Acceptance Rate 62 of 341 submissions, 18%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%


    Cited By

    • (2022) Real-Time Detection of First Stories in Twitter Using a FastText Model. Artificial Intelligence for Data Science in Theory and Practice, pp. 179-218. DOI:10.1007/978-3-030-92245-0_9. Online publication date: 2022
    • (2021) News Monitor: A Framework for Exploring News in Real-Time. Data 7:1 (3). DOI:10.3390/data7010003. Online publication date: 27-Dec-2021
    • (2021) Event identification by deep transfer learning with dual transformations. 2021 30th Wireless and Optical Communications Conference (WOCC), pp. 178-182. DOI:10.1109/WOCC53213.2021.9602921. Online publication date: 7-Oct-2021
    • (2021) A General Framework for First Story Detection Utilizing Entities and Their Relations. IEEE Transactions on Knowledge and Data Engineering 33:11, pp. 3482-3493. DOI:10.1109/TKDE.2020.2970051. Online publication date: 1-Nov-2021
    • (2021) News Monitor: A Framework for Querying News in Real Time. Advances in Information Retrieval, pp. 543-548. DOI:10.1007/978-3-030-72240-1_62. Online publication date: 28-Mar-2021
    • (2020) Update Frequency and Background Corpus Selection in Dynamic TF-IDF Models for First Story Detection. Computational Linguistics, pp. 206-217. DOI:10.1007/978-981-15-6168-9_18. Online publication date: 2-Jul-2020
    • (2018) A Distance-Dependent Chinese Restaurant Process Based Method for Event Detection on Social Media. Inventions 3:4 (80). DOI:10.3390/inventions3040080. Online publication date: 7-Dec-2018
    • (2018) Parameterizing Kterm Hashing. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 945-948. DOI:10.1145/3209978.3210101. Online publication date: 27-Jun-2018
    • (2018) Exploring Entity-centric Networks in Entangled News Streams. Companion Proceedings of the The Web Conference 2018, pp. 555-563. DOI:10.1145/3184558.3188726. Online publication date: 23-Apr-2018
    • (2018) On the Reproducibility and Generalisation of the Linear Transformation of Word Embeddings. Advances in Information Retrieval, pp. 263-275. DOI:10.1007/978-3-319-76941-7_20. Online publication date: 1-Mar-2018
