DOI: 10.1145/3487664.3487686
Short paper, iiWAS Conference Proceedings

Lexical Normalization of Japanese Tweets Using Related Images

Published: 30 December 2021

Abstract

Text on Twitter is noisy and contains many nonstandard words. In Japanese tweets, moreover, many words have multiple variant notations. Such noisy data may interfere with tasks such as identifying potential communities. In this paper, based on the assumption that words with the same meaning have similar related images, we propose a method that uses related images to normalize nonstandard words and variant notations in Japanese tweets. First, we collect images related to a word from Bing and represent their content with OpponentSIFT features. Next, we cluster the collected images to narrow them down to the set of related images. Finally, we determine the similarity between words from the similarity of their sets of related images.
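
The last two steps of the abstract (narrowing a word's images by clustering, then comparing words through their image sets) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the abstract names neither the clustering algorithm nor the similarity measure, so plain k-means and mean pairwise cosine similarity stand in for them, the rule "keep the largest cluster" is hypothetical, and the toy 3-dimensional vectors stand in for OpponentSIFT-based image representations. Image collection from Bing is omitted.

```python
import numpy as np


def kmeans(features, k, iterations=20):
    """Plain Lloyd's k-means; the first k rows seed the centroids (deterministic)."""
    centroids = features[:k].astype(float).copy()
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iterations):
        # Distance from every image vector to every centroid, shape (n, k).
        distances = np.linalg.norm(
            features[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels


def related_images(features, k=2):
    """Keep only the images in the largest cluster (hypothetical filtering rule)."""
    labels = kmeans(features, k)
    largest = np.bincount(labels, minlength=k).argmax()
    return features[labels == largest]


def set_similarity(set_a, set_b):
    """Mean pairwise cosine similarity between two sets of image vectors."""
    a = set_a / np.linalg.norm(set_a, axis=1, keepdims=True)
    b = set_b / np.linalg.norm(set_b, axis=1, keepdims=True)
    return float((a @ b.T).mean())


# Toy image vectors, one row per collected image (made-up values).
# word_a and word_b play two notations of the same word; word_c is unrelated.
word_a = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0]])
word_b = np.array([[0.95, 0.05, 0.0], [0.0, 0.1, 0.9], [0.85, 0.15, 0.0]])
word_c = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.1, 0.9, 0.0], [0.0, 0.8, 0.2]])

rel_a, rel_b, rel_c = (related_images(w) for w in (word_a, word_b, word_c))
sim_ab = set_similarity(rel_a, rel_b)
sim_ac = set_similarity(rel_a, rel_c)
print(f"same-meaning pair: {sim_ab:.3f}, unrelated pair: {sim_ac:.3f}")
```

In this toy setting, each word has one off-topic image that clustering discards, and the two notations of the same word end up with a much higher set similarity than the unrelated pair. A normalization step could then map a nonstandard word to the candidate with the highest similarity, subject to a threshold that would have to be chosen empirically.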

        Published In

        iiWAS2021: The 23rd International Conference on Information Integration and Web Intelligence
        November 2021
        658 pages
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Japanese tweets
        2. content-based image retrieval
        3. lexical normalization
        4. related image

        Qualifiers

        • Short-paper
        • Research
        • Refereed limited

        Conference

        iiWAS2021
