DOI: 10.1145/2348283.2348495
poster

On building a reusable Twitter corpus

Published: 12 August 2012

Abstract

The Twitter real-time information network is the subject of research for information retrieval tasks such as real-time search. However, so far, reproducible experimentation on Twitter data has been impeded by restrictions imposed by the Twitter terms of service. In this paper, we detail a new methodology for legally building and distributing Twitter corpora, developed through collaboration between the Text REtrieval Conference (TREC) and Twitter. In particular, we detail how the first publicly available Twitter corpus - referred to as Tweets2011 - was distributed via lists of tweet identifiers and specialist tweet crawling software. Furthermore, we analyse whether this distribution approach remains robust over time, as tweets in the corpus are removed either by users or Twitter itself. Tweets2011 was successfully used by 58 participating groups for the TREC 2011 Microblog track, while our results attest to the robustness of the crawling methodology over time.
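
The identifier-based distribution described above can be illustrated with a minimal sketch. It is not the TREC crawling software: the `fetch` callable is a hypothetical stand-in for whatever tool retrieves a tweet by its identifier, and `rebuild_corpus` is an illustrative name. The sketch assumes a removed tweet is signalled by `None`, and it reports the fraction of identifiers that can still be retrieved, which is the kind of quantity a robustness-over-time analysis would track.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical fetcher type: returns the tweet text, or None if the tweet
# has been deleted or is otherwise unavailable.
Fetcher = Callable[[str], Optional[str]]


def rebuild_corpus(
    tweet_ids: Iterable[str], fetch: Fetcher
) -> Tuple[Dict[str, str], float]:
    """Re-crawl a corpus from its identifier list.

    Returns the recovered tweets and the fraction of identifiers that
    could still be retrieved.
    """
    corpus: Dict[str, str] = {}
    total = 0
    for tweet_id in tweet_ids:
        total += 1
        text = fetch(tweet_id)
        if text is not None:
            corpus[tweet_id] = text
    retained = len(corpus) / total if total else 0.0
    return corpus, retained


if __name__ == "__main__":
    # Toy stand-in for a live service: one of four tweets has been removed.
    live = {"1": "first", "2": "second", "4": "fourth"}
    corpus, retained = rebuild_corpus(["1", "2", "3", "4"], live.get)
    print(f"recovered {len(corpus)} tweets, retention {retained:.2f}")
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular retrieval service, so the same identifier list can be re-crawled at different dates and the retention figures compared.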




    Published In

    SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
    August 2012
    1236 pages
    ISBN: 9781450314725
    DOI: 10.1145/2348283

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Twitter
    2. corpus creation
    3. reproducibility

    Qualifiers

    • Poster

    Conference

    SIGIR '12

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%


    Cited By

    • (2024) Open Corpus Linguistics – or How to overcome common problems in dealing with corpus data by adopting open research practices. Challenges in Corpus Linguistics, pp. 89-105. DOI: 10.1075/scl.118.06har. Online publication date: 15-Sep-2024
    • (2023) AutoInfer: Self-Driving Management for Resource-Efficient, SLO-Aware Machine-Learning Inference in GPU Clusters. IEEE Internet of Things Journal 10:7, pp. 6271-6285. DOI: 10.1109/JIOT.2022.3223381. Online publication date: 1-Apr-2023
    • (2023) Fake User Account Detection in Online Social Media Networks Using Machine Learning and Neural Network Techniques. Data Analytics for Smart Grids Applications—A Key to Smart City Development, pp. 199-215. DOI: 10.1007/978-3-031-46092-0_12. Online publication date: 30-Nov-2023
    • (2022) The Ultraviolet Bleach corpus. Linguistics Vanguard. DOI: 10.1515/lingvan-2020-0145. Online publication date: 12-Apr-2022
    • (2022) Less Provisioning: A Hybrid Resource Scaling Engine for Long-Running Services With Tail Latency Guarantees. IEEE Transactions on Cloud Computing 10:3, pp. 1941-1957. DOI: 10.1109/TCC.2020.3016345. Online publication date: 1-Jul-2022
    • (2022) Harnessing Indigenous Tweets: The Reo Māori Twitter corpus. Language Resources and Evaluation 56:4, pp. 1229-1268. DOI: 10.1007/s10579-022-09580-w. Online publication date: 14-Feb-2022
    • (2022) Reproducing Personalised Session Search Over the AOL Query Log. Advances in Information Retrieval, pp. 627-640. DOI: 10.1007/978-3-030-99736-6_42. Online publication date: 5-Apr-2022
    • (2022) Immediate Text Search on Streams Using Apoptosic Indexes. Advances in Information Retrieval, pp. 157-169. DOI: 10.1007/978-3-030-99736-6_11. Online publication date: 5-Apr-2022
    • (2021) This Account Doesn’t Exist: Tweet Decay and the Politics of Deletion in the Brexit Debate. American Behavioral Scientist 65:5, pp. 757-773. DOI: 10.1177/0002764221989772. Online publication date: 25-Jan-2021
    • (2021) TwiScraper: A Collaborative Project to Enhance Twitter Data Collection. Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 886-889. DOI: 10.1145/3437963.3441716. Online publication date: 8-Mar-2021
