research-article

Quick-and-clean extraction of linked data entities from microblogs

Authors:

Oluwaseyi Feyisetan,

Markus Luczak-Roesch,

Nigel Shadbolt

SEM '14: Proceedings of the 10th International Conference on Semantic Systems

Pages 5 - 12

https://doi.org/10.1145/2660517.2660527

Published: 04 September 2014

Abstract

In this paper, we address the problem of finding named entities in very large micropost datasets. We propose methods to generate a sample of representative microposts by discovering tweets that are likely to refer to new entities. Our approach significantly speeds up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, as well as similar and redundant tweets, while retaining information content.
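The three discarding steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' actual rules, so everything here is an assumption: the `RT` prefix test, the `CANDIDATE` pattern used as a stand-in for "pre-identifiable entities", the word-trigram Jaccard similarity, and the `threshold` value are all hypothetical choices made for illustration only.

```python
import re

# Assumption: a retweet is recognised by a leading "RT" token.
RETWEET = re.compile(r"^\s*RT\b", re.IGNORECASE)
# Hypothetical heuristic for a "pre-identifiable entity": a hashtag,
# an @mention, or a capitalised word.
CANDIDATE = re.compile(r"(#\w+|@\w+|\b[A-Z][a-z]+\b)")


def shingles(text, n=3):
    """Return the set of word n-grams of a lower-cased tweet."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def downscale(tweets, threshold=0.6):
    """Keep tweets that survive the three illustrative filters."""
    kept, kept_shingles = [], []
    for tweet in tweets:
        if RETWEET.match(tweet):
            continue  # discard retweets
        if not CANDIDATE.search(tweet):
            continue  # discard tweets without an entity hint
        current = shingles(tweet)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue  # discard near-duplicates of an already kept tweet
        kept.append(tweet)
        kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    sample = [
        "RT @bob: Obama visits Berlin today",
        "Obama visits Berlin today for talks",
        "Obama visits Berlin today for talks!",
        "so tired lol",
        "Apple unveils new iPhone in Cupertino",
    ]
    for tweet in downscale(sample):
        print(tweet)
```

The pairwise comparison against every kept tweet is quadratic and would not scale to billions of microposts; a production pipeline would more plausibly use hashing or an index, which this sketch leaves out for brevity.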

We apply the approach to a corpus of 1.4 billion microposts, using the IE services of AlchemyAPI, Calais, and Zemanta to identify more than 700,000 unique entities. For the evaluation, we compare runtime and number of entities extracted based on the full and the downscaled version of a micropost set. We are able to demonstrate that for datasets of more than 10 million tweets we can achieve a reduction in size of more than 80% while maintaining up to 60% coverage on unique entities cumulatively discovered by the three IE tools.
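The two evaluation figures quoted above (reduction in size and coverage of unique entities) can be computed as in the following sketch. The function names and the toy numbers are assumptions for illustration; they are not taken from the paper's data.

```python
def size_reduction(full_count, sample_count):
    """Fraction of tweets removed when going from the full set to the sample."""
    return 1.0 - sample_count / full_count


def entity_coverage(full_entities, sample_entities):
    """Fraction of unique entities in the full set that also appear in the sample."""
    full = set(full_entities)
    if not full:
        return 0.0
    return len(full & set(sample_entities)) / len(full)


if __name__ == "__main__":
    # Hypothetical numbers: 10 million tweets reduced to 2 million.
    print(size_reduction(10_000_000, 2_000_000))
    # Hypothetical entity sets: 3 of 5 unique entities are retained.
    print(entity_coverage({"a", "b", "c", "d", "e"}, {"a", "b", "c"}))
```

In the setting described in the abstract, the entity sets would be the union of unique entities returned by the three IE tools on the full and on the downscaled corpus respectively.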

We publish the resulting Twitter metadata as Linked Data using SIOC and an extension of the NERD core ontology.
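As a rough illustration of what tweet metadata expressed with SIOC might look like, the sketch below serialises one tweet as N-Triples using only the standard library. The properties `sioc:Post`, `sioc:has_creator`, `sioc:content`, and `sioc:topic` belong to the SIOC core vocabulary; the authors' extension of the NERD core ontology is not described in the abstract, so it is not modelled here, and `sioc:topic` is used as an assumed stand-in for linking a tweet to an extracted entity. All URIs in the usage example are hypothetical.

```python
SIOC = "http://rdfs.org/sioc/ns#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(text):
    """Escape characters that are not allowed raw inside an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def tweet_to_ntriples(tweet_uri, author_uri, content, entity_uris):
    """Describe one tweet as a sioc:Post with its creator, text, and topics."""
    lines = [
        f"<{tweet_uri}> <{RDF_TYPE}> <{SIOC}Post> .",
        f"<{tweet_uri}> <{SIOC}has_creator> <{author_uri}> .",
        f'<{tweet_uri}> <{SIOC}content> "{escape_literal(content)}" .',
    ]
    for entity_uri in entity_uris:
        # Assumption: extracted entities are attached via sioc:topic.
        lines.append(f"<{tweet_uri}> <{SIOC}topic> <{entity_uri}> .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        tweet_to_ntriples(
            "http://example.org/tweet/1",
            "http://example.org/user/alice",
            'Visiting "Berlin" today',
            ["http://dbpedia.org/resource/Berlin"],
        )
    )
```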

Cited By

Feyisetan O, Simperl E (2019). Beyond Monetary Incentives. ACM Transactions on Social Computing 2:2 (1-31). DOI: 10.1145/3321700. Online publication date: 13-Jun-2019
https://dl.acm.org/doi/10.1145/3321700
Feyisetan O, Simperl E, Luczak-Roesch M, Tinati R, Shadbolt N (2018). An extended study of content and crowdsourcing-related performance factors in named entity annotation. Semantic Web 9:3 (355-379). DOI: 10.3233/SW-170282. Online publication date: 1-Jan-2018
https://dl.acm.org/doi/10.3233/SW-170282
Feyisetan O, Luczak-Roesch M, Simperl E, Tinati R, Shadbolt N (2015). Towards Hybrid NER. Proceedings of the 12th European Semantic Web Conference on The Semantic Web. Latest Advances and New Domains - Volume 9088 (525-540). DOI: 10.1007/978-3-319-18818-8_32. Online publication date: 31-May-2015
https://dl.acm.org/doi/10.1007/978-3-319-18818-8_32


Published In

SEM '14: Proceedings of the 10th International Conference on Semantic Systems

September 2014

161 pages

ISBN:9781450329279

DOI:10.1145/2660517

Editors:
Harald Sack (Hasso-Plattner-Institute for IT Systems Engineering, Germany)
Agata Filipowska (Poznan University of Economics, Poland)
Jens Lehmann (University of Leipzig, Germany)
Sebastian Hellmann (Institute for Applied Informatics (InfAI), Leipzig, Germany)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

St. Pölten University: St. Pölten University of Applied Sciences, Austria
University of Potsdam: University of Potsdam
PoolParty: PoolParty (Semantic Web Company GmbH)
University of Vienna: University of Vienna
Wolters Kluwer: Wolters Kluwer, Germany
Semantic Web Company: Semantic Web Company
STII: STI International
DBpedia Association: DBpedia Association

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SEM '14

SEM '14: SEMANTiCS 2014 - 10th International Conference on Semantic Systems

September 4 - 5, 2014

Leipzig, Germany

Acceptance Rates

SEM '14 Paper Acceptance Rate 22 of 59 submissions, 37%;

Overall Acceptance Rate 22 of 59 submissions, 37%

Bibliometrics

Total Citations: 3
Total Downloads: 171
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Reflects downloads up to 17 Oct 2024
