research-article

A Machine Learning–Based Readability Model for Gujarati Texts

Author:

Chandrakant K. Bhogayata

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 2

Article No.: 30, Pages 1 - 32

https://doi.org/10.1145/3637826

Published: 08 February 2024

Abstract

This study develops a machine learning–based model to predict the readability of Gujarati texts. The dataset comprised 50 prose passages from Gujarati literature. Fourteen lexical and syntactic text features were extracted from the dataset using a unigram part-of-speech tagger, itself a machine learning algorithm, and three Python scripts. Two samples of native Gujarati-speaking students in secondary and higher education rated the readability of the texts on a 10-point scale from “easy” to “difficult,” with interrater agreement. After dimensionality reduction, seven text features served as the independent variables and the mean readability rating as the dependent variable for training the readability model. Because the students' level of education and gender were related to their readability ratings, four readability models, one each for school students, university students, male students, and female students, were trained with backward stepwise multiple linear regression, a supervised machine learning algorithm. The trained models are comparable across the rater groups. The best is the university students’ readability rating model, which was cross-validated. It explains 91% of the variance in readability ratings at training and 88% at cross-validation, with a large effect size and high statistical power.
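The modelling step described in the abstract can be illustrated with a minimal sketch. It is not the author's code. It assumes synthetic data standing in for the 50 passages and seven features, and it uses AIC as the rule for dropping a feature, since the abstract does not state the elimination criterion. All function names (`fit_ols`, `aic`, `backward_stepwise`, `cv_r2`) are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficients and residual sum of squares."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return beta, float(resid @ resid)

def aic(X, y):
    """Akaike information criterion for a Gaussian linear model (up to a constant)."""
    n = len(y)
    _, rss = fit_ols(X, y)
    k = X.shape[1] + 1  # slopes plus intercept
    return n * np.log(rss / n) + 2 * k

def backward_stepwise(X, y):
    """Start from all features and drop one at a time while doing so lowers AIC."""
    keep = list(range(X.shape[1]))
    best = aic(X[:, keep], y)
    while len(keep) > 1:
        scores = [(aic(X[:, [j for j in keep if j != d]], y), d) for d in keep]
        candidate, drop = min(scores)
        if candidate >= best:
            break  # no single removal improves the criterion
        best = candidate
        keep.remove(drop)
    return keep

def cv_r2(X, y, k=5, seed=0):
    """k-fold cross-validated R^2 from pooled out-of-fold predictions."""
    idx = np.random.default_rng(seed).permutation(len(y))
    pred = np.empty(len(y))
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        beta, _ = fit_ols(X[train], y[train])
        pred[fold] = np.column_stack([np.ones(len(fold)), X[fold]]) @ beta
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Synthetic stand-in data (assumption): 50 "passages", 7 "features",
# with only features 0 and 3 actually driving the "mean rating".
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 7))
y = 5 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=50)

selected = backward_stepwise(X, y)
print("kept features:", selected)
print("cross-validated R^2:", round(cv_r2(X[:, selected], y), 3))
```

In this sketch the features are selected on the full sample before cross-validation, so the cross-validated R² is somewhat optimistic. A stricter variant would repeat the selection inside each training fold.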

Index Terms

A Machine Learning–Based Readability Model for Gujarati Texts
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by regression

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 23, Issue 2

February 2024

340 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3613556

Editor:
Imed Zitouni
Google, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 February 2024

Online AM: 21 December 2023

Accepted: 08 November 2023

Revised: 06 June 2023

Received: 01 May 2021
