DOI: 10.1145/2660517.2660527
research-article

Quick-and-clean extraction of linked data entities from microblogs

Published: 04 September 2014

Abstract

In this paper, we address the problem of finding Named Entities in very large micropost datasets. We propose methods to generate a sample of representative microposts by discovering tweets that are likely to refer to new entities. Our approach significantly speeds up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, as well as similar and redundant tweets, while retaining information content.
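The three discard rules named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the retweet test, the capitalised-word/hashtag/mention heuristic standing in for "pre-identifiable entities", and the exact-match-after-normalisation test standing in for similarity detection are all assumptions made for illustration, as are the names `downscale`, `normalize`, `ENTITY_HINT`, and `RETWEET`.

```python
import re

# Assumption: a hashtag, a mention, or a capitalised word hints at an entity.
ENTITY_HINT = re.compile(r"[#@]\w+|\b[A-Z][a-z]+\b")
# Assumption: retweets are marked by a leading "RT".
RETWEET = re.compile(r"^\s*RT\b", re.IGNORECASE)


def normalize(text):
    """Lowercase, drop URLs and punctuation so near-identical tweets collide."""
    text = re.sub(r"https?://\S+", "", text.lower())
    return " ".join(re.findall(r"[#@]?\w+", text))


def downscale(tweets):
    """Return the tweets that survive the three (hypothetical) filters."""
    seen = set()
    kept = []
    for tweet in tweets:
        if RETWEET.match(tweet):
            continue  # rule 1: discard retweets
        if not ENTITY_HINT.search(tweet):
            continue  # rule 2: discard tweets without an entity hint
        key = normalize(tweet)
        if key in seen:
            continue  # rule 3: discard redundant tweets
        seen.add(key)
        kept.append(tweet)
    return kept


sample = downscale([
    "RT @bob: Obama visits Berlin",
    "Obama visits Berlin http://t.co/x",
    "Obama visits Berlin!!",
    "lol so tired today",
    "Lunch in Vienna",
])
```

Only the surviving tweets would then be sent to the IE services, which is where the runtime saving would come from in a pipeline of this shape.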
We apply the approach to a corpus of 1.4 billion microposts, using the IE services of AlchemyAPI, Calais, and Zemanta to identify more than 700,000 unique entities. For the evaluation, we compare the runtime and the number of entities extracted from the full and the downscaled versions of a micropost set. We demonstrate that for datasets of more than 10 million tweets we can achieve a reduction in size of more than 80% while maintaining up to 60% coverage of the unique entities cumulatively discovered by the three IE tools.
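The two evaluation quantities mentioned here, size reduction and coverage of unique entities, can be written down as a short Python sketch. The function names and the toy inputs are assumptions for illustration; "cumulatively discovered" is read here as the union of the entity sets returned by the individual tools, which is an interpretation of the abstract rather than a definition taken from the paper.

```python
def size_reduction(full_count, sample_count):
    """Fraction of microposts removed by downscaling."""
    return 1.0 - sample_count / full_count


def cumulative_entities(per_tool_entities):
    """Union of the unique entities found by each IE tool."""
    merged = set()
    for entities in per_tool_entities:
        merged |= set(entities)
    return merged


def entity_coverage(full_entities, sample_entities):
    """Share of the full set's unique entities also found in the sample."""
    full = set(full_entities)
    if not full:
        return 0.0
    return len(full & set(sample_entities)) / len(full)


# Toy data only; these are not figures from the paper.
full = cumulative_entities([{"Obama", "Berlin"}, {"Berlin", "Vienna"}, {"ACM", "Twitter"}])
sample = cumulative_entities([{"Obama"}, {"Berlin"}, {"ACM"}])
reduction = size_reduction(full_count=10, sample_count=2)
coverage = entity_coverage(full, sample)
```

With these toy inputs the sample keeps 2 of 10 tweets and recovers 3 of 5 unique entities.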
We publish the resulting Twitter metadata as Linked Data using SIOC and an extension of the NERD core ontology.
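As a rough illustration of what publishing tweet metadata with SIOC might look like, the Python sketch below serialises one tweet as N-Triples. It uses the real SIOC terms `sioc:Post`, `sioc:content`, and `sioc:topic`, but the choice of `sioc:topic` to link a tweet to an entity is an assumption, the tweet and entity URIs are hypothetical, and the authors' extension of the NERD core ontology is not reproduced here.

```python
SIOC = "http://rdfs.org/sioc/ns#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(text):
    """Escape backslashes, quotes, and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def tweet_to_ntriples(tweet_uri, text, entity_uris):
    """Describe a tweet as a sioc:Post with one sioc:topic per entity."""
    lines = [
        f"<{tweet_uri}> <{RDF_TYPE}> <{SIOC}Post> .",
        f'<{tweet_uri}> <{SIOC}content> "{escape_literal(text)}" .',
    ]
    for uri in entity_uris:
        # Assumption: sioc:topic stands in for the paper's entity link.
        lines.append(f"<{tweet_uri}> <{SIOC}topic> <{uri}> .")
    return "\n".join(lines)


triples = tweet_to_ntriples(
    "http://example.org/tweet/1",              # hypothetical tweet URI
    'Obama says "hello" in Berlin',
    ["http://dbpedia.org/resource/Berlin"],
)
```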


Cited By

  • (2019) Beyond Monetary Incentives. ACM Transactions on Social Computing 2:2 (1-31). DOI: 10.1145/3321700. Online publication date: 13-Jun-2019
  • (2018) An extended study of content and crowdsourcing-related performance factors in named entity annotation. Semantic Web 9:3 (355-379). DOI: 10.3233/SW-170282. Online publication date: 1-Jan-2018
  • (2015) Towards Hybrid NER. Proceedings of the 12th European Semantic Web Conference on The Semantic Web. Latest Advances and New Domains - Volume 9088 (525-540). DOI: 10.1007/978-3-319-18818-8_32. Online publication date: 31-May-2015

Published In

SEM '14: Proceedings of the 10th International Conference on Semantic Systems
September 2014
161 pages
ISBN:9781450329279
DOI:10.1145/2660517

Sponsors

  • St. Pölten University: St. Pölten University of Applied Sciences, Austria
  • University of Potsdam: University of Potsdam
  • PoolParty: PoolParty (Semantic Web Company GmbH)
  • University of Vienna: University of Vienna
  • Wolters Kluwer: Wolters Kluwer, Germany
  • Semantic Web Company: Semantic Web Company
  • STII: STI International
  • DBpedia Association: DBpedia Association

Publisher

Association for Computing Machinery

New York, NY, United States


Acceptance Rates

SEM '14 Paper Acceptance Rate 22 of 59 submissions, 37%;
Overall Acceptance Rate 22 of 59 submissions, 37%
