research-article

Unsupervised statistical text simplification using pre-trained language modeling for initialization

Authors:

Xindong Wu

Frontiers of Computer Science, Volume 17, Issue 1

https://doi.org/10.1007/s11704-022-1244-0

Published: 08 August 2022

Abstract

Unsupervised text simplification has attracted much attention due to the scarcity of high-quality parallel text simplification corpora. Recently, an unsupervised statistical text simplification method based on a phrase-based machine translation system (UnsupPBMT) achieved good performance; it initializes the phrase tables using similar words obtained by word embedding modeling. Because word embedding modeling considers only the relevance between words, the phrase table in UnsupPBMT contains many dissimilar words. In this paper, we propose an unsupervised statistical text simplification method that uses the pre-trained language model BERT for initialization. Specifically, we use BERT as a general linguistic knowledge base for predicting similar words. Experimental results show that our method outperforms state-of-the-art unsupervised text simplification methods on three benchmarks and even outperforms some supervised baselines.
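The idea of seeding a phrase table with words predicted for a masked slot can be sketched roughly as follows. This is only an illustrative sketch, not the procedure from the paper: `toy_masked_lm`, `TOY_PREDICTIONS`, `init_phrase_entries`, and all scores are hypothetical, and a real system would query a pre-trained masked language model such as BERT instead of the lookup table used here.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a masked language model such as BERT.
# The sentence and scores below are invented for illustration only.
TOY_PREDICTIONS: Dict[str, Dict[str, float]] = {
    "the cat [MASK] on the mat": {"sat": 6.0, "rested": 3.0, "perched": 1.0},
}


def toy_masked_lm(masked_sentence: str) -> Dict[str, float]:
    """Return made-up scores for words that could fill the [MASK] slot."""
    return TOY_PREDICTIONS.get(masked_sentence, {})


def init_phrase_entries(
    sentence: str,
    target: str,
    predict: Callable[[str], Dict[str, float]],
    top_k: int = 2,
) -> Dict[str, float]:
    """Build normalized substitution scores for one target word."""
    # Mask every occurrence of the target word (whitespace tokenization).
    masked = " ".join("[MASK]" if tok == target else tok for tok in sentence.split())
    scores = predict(masked)
    # Drop the target itself, then keep the top_k highest-scoring candidates.
    ranked = sorted(
        ((word, score) for word, score in scores.items() if word != target),
        key=lambda pair: pair[1],
        reverse=True,
    )[:top_k]
    total = sum(score for _, score in ranked)
    if total <= 0:
        return {}
    # Normalize so the kept candidates form a probability-like distribution.
    return {word: score / total for word, score in ranked}


table = init_phrase_entries("the cat perched on the mat", "perched", toy_masked_lm)
```

Because the predictor sees the whole sentence around the masked slot, candidates are ranked in context rather than by embedding proximity alone, which is the motivation the abstract gives for replacing word embeddings in the initialization step.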



Published In

Frontiers of Computer Science: Selected Publications from Chinese Universities Volume 17, Issue 1

Feb 2023

231 pages

ISSN:2095-2228

EISSN:2095-2236

© Higher Education Press 2023.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 08 August 2022

Accepted: 06 September 2021

Received: 10 May 2021
