research-article

An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection

Published: 27 April 2021

Abstract

A large fraction of the textual data available today contains various types of “noise,” such as OCR noise in digitized documents and noise due to the informal writing style of users on microblogging sites. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many existing text normalization methods are supervised and require language-dependent resources or large amounts of training data that are difficult to obtain. We propose an unsupervised algorithm for text normalization that needs no training data or human intervention. The proposed algorithm is applicable to text in different languages and can handle both machine-generated and human-generated noise. Experiments over several standard datasets show that text normalization with the proposed algorithm enables better retrieval and stance detection than several baseline text normalization methods.

      Published In

      Journal of Data and Information Quality  Volume 13, Issue 3
      September 2021
      117 pages
      ISSN:1936-1955
      EISSN:1936-1963
      DOI:10.1145/3460503
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 27 April 2021
      Accepted: 01 July 2020
      Revised: 01 July 2020
      Received: 01 February 2020
      Published in JDIQ Volume 13, Issue 3

      Author Tags

      1. Data cleaning
      2. unsupervised text normalization
      3. morphological variants
      4. retrieval
      5. stance detection
      6. microblogs
      7. OCR noise

      Qualifiers

      • Research-article
      • Refereed

      Funding Sources

      • Building Healthcare Informatics Systems Utilising Web Data
      • Department of Science & Technology, Government of India
      • NVIDIA Corporation
      • Titan Xp GPU
