research-article

A Machine Learning–Based Readability Model for Gujarati Texts

Published: 08 February 2024
  • Abstract

    This study develops a machine learning–based model to predict the readability of Gujarati texts. The dataset comprised 50 prose passages from Gujarati literature. Fourteen lexical and syntactic text features were extracted from the dataset using a unigram part-of-speech tagger (a machine learning algorithm) and three Python scripts. Two samples of native Gujarati-speaking secondary and higher education students rated the readability of the texts on a 10-point scale from “easy” to “difficult,” and their interrater agreement was assessed. After dimensionality reduction, seven text features served as the independent variables and the mean readability rating as the dependent variable for training the readability model. Because the students' level of education and gender were related to their readability ratings, four readability models (for school students, university students, male students, and female students) were trained with backward stepwise multiple linear regression, a supervised machine learning algorithm. The trained models are comparable across the rater groups. The best model is the one trained on university students’ readability ratings. This model was cross-validated: it explains 91% of the variance in readability ratings at training and 88% at cross-validation, and its effect size is large and its statistical power is high.
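
    The modeling step described in the abstract (backward stepwise multiple linear regression, followed by cross-validation of the variance explained) can be illustrated with a minimal sketch. It is not the authors' implementation. The data below are synthetic and only mirror the shape reported in the abstract (50 passages, seven features). The elimination criterion (adjusted R²), the five-fold split, and every function name are assumptions made for illustration; the abstract does not state which criterion or how many folds were used.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict(beta, X):
    return beta[0] + X @ beta[1:]


def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(r2, n, p):
    """Penalize R^2 for the number of predictors p, given n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def backward_stepwise(X, y):
    """Start with all features; repeatedly drop the one whose removal gives
    the highest adjusted R^2, stopping once any removal would lower it.
    (Assumed criterion; p-value thresholds are another common choice.)"""
    n = len(y)
    keep = list(range(X.shape[1]))

    def score(cols):
        beta = fit_ols(X[:, cols], y)
        return adjusted_r2(r_squared(y, predict(beta, X[:, cols])), n, len(cols))

    best = score(keep)
    while len(keep) > 1:
        cand_score, drop = max((score([c for c in keep if c != d]), d) for d in keep)
        if cand_score < best:
            break
        best = cand_score
        keep.remove(drop)
    return keep


def cross_validated_r2(X, y, k=5, seed=0):
    """k-fold cross-validation: each passage is predicted by a model
    that was fitted without it, then R^2 is computed on those predictions."""
    idx = np.random.default_rng(seed).permutation(len(y))
    y_hat = np.empty(len(y))
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        y_hat[fold] = predict(fit_ols(X[train], y[train]), X[fold])
    return r_squared(y, y_hat)


# Synthetic stand-in data: 50 "passages", 7 "features", and a "mean rating"
# that truly depends only on features 0 and 3. Not the study's data.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 7))
y = 5.0 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.3, size=50)

selected = backward_stepwise(X, y)
X_sel = X[:, selected]
train_r2 = r_squared(y, predict(fit_ols(X_sel, y), X_sel))
cv_r2 = cross_validated_r2(X_sel, y)
print("selected features:", selected)
print(f"training R^2: {train_r2:.3f}, cross-validated R^2: {cv_r2:.3f}")
```

    In this sketch the features are selected on the full dataset before cross-validation, so the cross-validated R² is somewhat optimistic; nesting the selection inside each fold would give a stricter estimate. The gap between the training and cross-validated R² corresponds to the kind of shrinkage the abstract reports (91% versus 88%).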


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 23, Issue 2
    February 2024
    340 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3613556

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 February 2024
    Online AM: 21 December 2023
    Accepted: 08 November 2023
    Revised: 06 June 2023
    Received: 01 May 2021
    Published in TALLIP Volume 23, Issue 2

    Author Tags

    1. Readability model
    2. readability rating and level of education
    3. interrater agreement
    4. model comparison
    5. Gujarati texts

    Qualifiers

    • Research-article
