DOI: 10.1145/3485447.3512015

Massive Text Normalization via an Efficient Randomized Algorithm

Published: 25 April 2022

Abstract

Current popular machine learning techniques in natural language processing and data mining rely heavily on high-quality text sources. However, real-world text datasets contain a significant number of spelling errors and improperly punctuated variants, on which the performance of these models quickly deteriorates. Moreover, existing text normalization methods are prohibitively expensive to execute over web-scale datasets, struggle to process noisy text from social networks, or require annotations to learn corrections in a supervised manner. In this paper, we present Flan (Fast LSH Algorithm for Text Normalization), a scalable randomized algorithm to clean and canonicalize massive text data. Our approach suggests corrections based on the morphology of words: lexically similar words are considered the same with high probability. We handle the pairwise word-to-word comparisons efficiently via locality sensitive hashing (LSH). We also propose a novel stabilization process to address hash collisions between dissimilar words, a consequence of the randomized nature of LSH that is exacerbated by the massive scale of real-world datasets. Compared with existing approaches, our method is more efficient, both asymptotically and in empirical evaluations, does not rely on feature engineering, and does not require any annotation. Our experimental results on real-world datasets demonstrate the efficiency and efficacy of Flan. Built on recent advances in densified MinHash, our approach requires far less computation time than baseline text normalization techniques on large-scale Twitter and Reddit datasets. In a human evaluation of normalization quality, Flan achieves 5% and 14% improvements over the baselines on the Reddit and Twitter datasets, respectively. Our method also improves performance on a Twitter sentiment classification task and on perturbed GLUE benchmark datasets, where we introduce random errors into the text.
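
To make the mechanism concrete, the following is a minimal Python sketch of the core idea the abstract describes: hashing words into LSH buckets via MinHash over character n-grams, so that lexically similar variants collide with high probability while most dissimilar words do not. This is an illustration only, not the authors' Flan implementation; Flan builds on densified MinHash and adds a stabilization step, and the n-gram size, hash family (CRC32 with varying seeds, as a stand-in for a proper minwise-independent family), and banding scheme below are arbitrary choices for the example.

```python
import random
import zlib
from collections import defaultdict

def char_ngrams(word, n=3):
    """Character n-grams of a word, padded with boundary markers."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(max(1, len(padded) - n + 1))}

def minhash_signature(shingles, seeds):
    """One MinHash value per seeded hash function (plain MinHash here;
    the paper builds on the densified variant for speed)."""
    return tuple(
        min(zlib.crc32(s.encode(), seed) for s in shingles) for seed in seeds
    )

def lsh_buckets(words, num_hashes=32, num_bands=16, seed=42):
    """Band the signatures: words sharing any full band land in the same
    bucket, so candidate pairs are found without all-pairs comparison."""
    rng = random.Random(seed)
    seeds = [rng.getrandbits(32) for _ in range(num_hashes)]
    rows = num_hashes // num_bands
    buckets = defaultdict(set)
    for w in words:
        sig = minhash_signature(char_ngrams(w), seeds)
        for b in range(num_bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(w)
    return buckets

if __name__ == "__main__":
    vocab = ["tomorrow", "tommorow", "tomorow", "hello", "helo", "banana"]
    seen = set()
    for group in lsh_buckets(vocab).values():
        key = frozenset(group)
        if len(group) > 1 and key not in seen:
            seen.add(key)
            print(sorted(group))  # lexically similar words tend to co-occur
```

With 16 bands of 2 rows each, two words whose n-gram Jaccard similarity is around 0.5 (e.g. "hello" and "helo") share a bucket with high probability, while unrelated words such as "banana" and "hello" almost never do; the paper's stabilization step is then what handles the rare spurious collisions that this randomized scheme still produces.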


Cited By

(2024) Language Identification and Normalization Techniques for Code-Mixed Text. 2024 Sixth International Conference on Computational Intelligence and Communication Technologies (CCICT), 435-441. DOI: 10.1109/CCICT62777.2024.00077. Online publication date: 19 April 2024.


Published In

WWW '22: Proceedings of the ACM Web Conference 2022
April 2022, 3764 pages
ISBN: 9781450390965
DOI: 10.1145/3485447

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. Lexical Normalization
2. Locality-Sensitive Hashing
3. Natural Language Processing

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

WWW '22: The ACM Web Conference 2022
April 25-29, 2022
Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
