
Text-to-text generative approach for enhanced complex word identification

Published: 07 January 2025

Abstract

This paper presents a novel approach to the Complex Word Identification (CWI) task using a text-to-text generative model. The CWI task involves identifying complex words in text, a challenging Natural Language Processing problem. To our knowledge, this is the first attempt to address the CWI problem in a text-to-text setting. We propose a new methodology that leverages the power of the Transformer model to evaluate the complexity of words in both binary and probabilistic settings. We also introduce a novel CWI dataset consisting of 62,200 phrases, both complex and simple. We train and fine-tune the proposed model on this dataset and evaluate its performance on separate test sets across three different domains. Our experimental results demonstrate the effectiveness of the proposed approach compared to state-of-the-art methods.

Highlights

The paper proposes a Transformer-based generative approach to complex word identification (CWI).
Our technique uses text generation and addresses the CWI task in a text-to-text context.
The paper also introduces a new CWI dataset for solving the CWI task with the proposed method.
We fine-tune our model for the CWI task in the binary setting, where it performs on par with state-of-the-art methods.
We also fine-tune the model in the probabilistic setting, where it achieves state-of-the-art results.
Our dataset and code are publicly available to the research community.
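The text-to-text framing described above can be illustrated with a minimal sketch: a CWI instance (sentence plus target word) is serialized into a single input string, and the model's generated text is mapped back to a binary label. The prompt template and label strings below are assumptions for illustration only; the paper's exact input/output format may differ.

```python
# Hypothetical illustration of framing binary CWI as a text-to-text task.
# The prompt template and the "complex"/"simple" label strings are assumptions,
# not the paper's exact serialization.

def build_cwi_prompt(sentence: str, target: str) -> str:
    """Serialize a CWI instance as a single input string for a seq2seq model."""
    return f"classify complexity | sentence: {sentence} | word: {target}"

def parse_cwi_output(generated: str) -> int:
    """Map the model's generated text back to a binary label (1 = complex)."""
    return 1 if generated.strip().lower() == "complex" else 0

prompt = build_cwi_prompt("The sedimentary strata were clearly visible.", "strata")
# A fine-tuned text-to-text model would generate "complex" or "simple" for
# this prompt; here we only show the round-trip of the label mapping.
assert parse_cwi_output("Complex") == 1
assert parse_cwi_output(" simple ") == 0
```

In the probabilistic setting, the generated text would instead encode a complexity score, with an analogous parsing step mapping it back to a value in [0, 1].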



Published In

Neurocomputing  Volume 610, Issue C
Dec 2024
1073 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands


Author Tags

  1. Generative AI
  2. Transformers
  3. Complex word identification

Qualifiers

  • Research-article
