An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection

Authors:

Shalmoli Ghosh,

Kripabandhu Ghosh,

Saptarshi Ghosh

Journal of Data and Information Quality (JDIQ), Volume 13, Issue 3

Article No.: 17, Pages 1 - 25

https://doi.org/10.1145/3418036

Published: 27 April 2021

Abstract

A large fraction of textual data available today contains various types of “noise,” such as OCR noise in digitized documents, noise due to the informal writing style of users on microblogging sites, and so on. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many of the existing text normalization methods are supervised and require language-dependent resources or large amounts of training data that are difficult to obtain. We propose an unsupervised algorithm for text normalization that does not need any training data or human intervention. The proposed algorithm is applicable to text in different languages and can handle both machine-generated and human-generated noise. Experiments over several standard datasets show that text normalization through the proposed algorithm enables better retrieval and stance detection than several baseline text normalization methods.

Index Terms

1. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining
      1. Data cleaning

Published In

Journal of Data and Information Quality Volume 13, Issue 3

September 2021

117 pages

ISSN:1936-1955

EISSN:1936-1963

DOI:10.1145/3460503

Editor:
Tiziana Catarci
Sapienza University of Rome, Rome, Italy

Copyright © 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 April 2021

Accepted: 01 July 2020

Revised: 01 July 2020

Received: 01 February 2020

Published in JDIQ Volume 13, Issue 3

Qualifiers

Research-article
Refereed

Funding Sources

Building Healthcare Informatics Systems Utilising Web Data
Department of Science & Technology, Government of India
NVIDIA Corporation
Titan Xp GPU
