short-paper

Lexical Normalization of Japanese Tweets Using Related Images

Authors:

Atsushi Matsumura,

Tetsuji Satoh

iiWAS2021: The 23rd International Conference on Information Integration and Web Intelligence

Pages 157 - 161

https://doi.org/10.1145/3487664.3487686

Published: 30 December 2021

Abstract

Twitter text is noisy and contains many nonstandard words. In Japanese tweets, moreover, many words have multiple variant notations. Such noisy data can interfere with tasks such as identifying potential communities. In this paper, based on the assumption that words with the same meaning have similar related images, we propose a method for normalizing nonstandard words and variant notations in Japanese tweets using related images. First, we collect images related to a word from Bing and use OpponentSIFT features to represent the content of those images. Next, we use clustering to narrow the collected images down to a set of related images. Finally, we determine the similarity between words from the similarity of their sets of related images.
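
The three steps in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each collected image has already been converted to a fixed-length feature vector (for example, a histogram built from OpponentSIFT descriptors), it substitutes a small k-means with farthest-first initialization for whichever clustering the paper uses, and it scores two image sets with a symmetric best-match cosine similarity. All function names and parameters are hypothetical.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def largest_cluster(vectors, k=2, iters=20):
    """Keep only the vectors in the largest k-means cluster.

    Assumption: the largest cluster holds the images that are truly
    related to the word; smaller clusters are treated as noise.
    """
    # Deterministic farthest-first initialization of the k centers.
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers)))
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(v, centers[c])) for v in vectors]
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    biggest = max(range(k), key=assign.count)
    return [v for v, a in zip(vectors, assign) if a == biggest]


def set_similarity(images_a, images_b):
    """Symmetric mean of best-match cosine similarities (non-empty sets assumed)."""
    def directed(src, dst):
        return sum(max(cosine(s, d) for d in dst) for s in src) / len(src)
    return 0.5 * (directed(images_a, images_b) + directed(images_b, images_a))


def word_similarity(images_a, images_b, k=2):
    """Similarity of two words from the filtered image sets of each word."""
    return set_similarity(largest_cluster(images_a, k), largest_cluster(images_b, k))
```

Calling `word_similarity` on the image vectors of two candidate notations gives a score that is high when their filtered image sets look alike; a threshold on that score (not specified here) would then decide whether the two notations are treated as the same word.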

Index Terms

Lexical Normalization of Japanese Tweets Using Related Images
1. Computing methodologies
  1. Artificial intelligence
2. Information systems
  1. Information retrieval
  2. Information systems applications

Index terms have been assigned to the content through auto-classification.

Published In

iiWAS2021: The 23rd International Conference on Information Integration and Web Intelligence

November 2021

658 pages

ISBN:9781450395564

DOI:10.1145/3487664

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

iiWAS2021

iiWAS2021: The 23rd International Conference on Information Integration and Web Intelligence

November 29 - December 1, 2021

Linz, Austria
