DOI: 10.1145/2009916.2010043
research-article

An event-centric model for multilingual document similarity

Published: 24 July 2011

    Abstract

    Document similarity measures play an important role in many document retrieval and exploration tasks. Over the past decades, several models and techniques have been developed to determine a ranked list of documents similar to a given query document. Interestingly, the proposed approaches typically rely on extensions to the vector space model and are rarely suited for multilingual corpora.
    In this paper, we present a novel document similarity measure that is based on events extracted from documents. An event is described solely by nearby occurrences of temporal and geographic expressions in a document's text. Thus, a document is modeled as a set of events that can be compared and ranked using temporal and geographic hierarchies. A key feature of our model is that it is term- and language-independent, as temporal and geographic expressions mentioned in texts are normalized to a standard format. This also makes it possible to determine similar documents across languages, an important feature in the context of document exploration. Our experiments on different (multilingual) corpora show that the approach is quite effective and also uncovers new similarities.
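
    The idea in the abstract can be illustrated with a minimal Python sketch. It models an event as a pair of normalized values, an ISO-style date string and a path in a geographic hierarchy, and compares events by how many hierarchy levels they share from the top. The scoring used here (shared-prefix fraction per hierarchy, multiplied, then a best-match average over events) is an assumption made for illustration; the abstract does not state the paper's actual formula. The names `Event`, `prefix_score`, `event_similarity`, and `document_similarity` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical event: one normalized date plus one normalized place."""
    date: str               # ISO-style: "2011-07-24", "2011-07", or "2011"
    place: Tuple[str, ...]  # assumed geographic path, coarse to fine


def prefix_score(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of hierarchy levels shared from the top, in [0.0, 1.0]."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    longest = max(len(a), len(b))
    return shared / longest if longest else 0.0


def event_similarity(e1: Event, e2: Event) -> float:
    # Temporal hierarchy (year > month > day) read off the normalized date.
    temporal = prefix_score(e1.date.split("-"), e2.date.split("-"))
    geographic = prefix_score(e1.place, e2.place)
    # Assumed combination: both dimensions must agree for a high score.
    return temporal * geographic


def document_similarity(doc1: Iterable[Event], doc2: Iterable[Event]) -> float:
    """Average, over events in doc1, of the best-matching event in doc2."""
    events1, events2 = list(doc1), list(doc2)
    if not events1 or not events2:
        return 0.0
    best = [max(event_similarity(e, f) for f in events2) for e in events1]
    return sum(best) / len(best)
```

    Because the events hold only normalized dates and places, two documents written in different languages can be compared without looking at their terms. Note that this sketch is asymmetric: it scores how well the events of the first document are covered by the second.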

    Published In

    SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
    July 2011
    1374 pages
    ISBN: 9781450307574
    DOI: 10.1145/2009916

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. document similarity
    2. event extraction
    3. geographic information
    4. temporal information

    Qualifiers

    • Research-article

    Conference

    SIGIR '11

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%
