
Building Arabic Paraphrasing Benchmark based on Transformation Rules

Published: 09 June 2021

Abstract

Measuring semantic similarity between short texts is an important task in many natural language processing applications, such as paraphrase identification. Evaluating such methods requires a benchmark of sentence pairs that are labeled by Arab linguists and that researchers can use as a standard when evaluating their results. This research describes an Arabic paraphrasing benchmark intended as a standard for evaluating algorithms that measure semantic similarity between Arabic sentences in order to detect paraphrasing within the same language. The transformed sentences follow a set of rules for Arabic paraphrasing. They are constructed from the words in the Arabic word semantic similarity dataset and from various Arabic books, educational texts, and lexicons. The proposed benchmark consists of 1,010 sentence pairs, each tagged with scores for semantic similarity and paraphrasing. The quality of the data is assessed through statistical analysis of the distribution of the sentences over the Arabic transformation rules and through exploration with hierarchical clustering (HCL). The HCL exploration shows that the sentences in the proposed benchmark fall into 27 clusters representing different subjects. The inter-annotator agreement measures show moderate agreement for the annotations of the graduate students and poor reliability for the annotations of the undergraduate students.
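
The abstract does not say which inter-annotator agreement measures were used. As an illustration only, the sketch below assumes Fleiss' kappa, a common choice when several annotators assign each item to one of a fixed set of categories. The function name `fleiss_kappa` and the toy counts are hypothetical and are not taken from the benchmark.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of category counts.

    counts[i][j] is the number of annotators who assigned item i
    (here: a sentence pair) to category j. Every item must be rated
    by the same number of annotators.
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("at least one item is required")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two annotators per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of annotator pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)  # agreement expected by chance

    if p_expected == 1.0:
        # All ratings fell into a single category; kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (made up): 4 sentence pairs, 3 annotators,
# categories = [paraphrase, not paraphrase].
toy_counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
print(round(fleiss_kappa(toy_counts), 3))  # 0.333
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance. The cut-offs behind labels such as "moderate" or "poor" depend on the interpretation scale adopted, which the abstract does not specify.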


    Information

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 4
    July 2021
    419 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3465463
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 June 2021
    Accepted: 01 January 2021
    Revised: 01 October 2020
    Received: 01 July 2020
    Published in TALLIP Volume 20, Issue 4

    Author Tags

    1. Paraphrasing
    2. Arabic benchmark
    3. transformation rules
    4. Arabic paraphrasing benchmark
    5. semantic similarity
    6. inter-annotator agreement
    7. K-means
    8. HCL Clustering

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • Scientific Research and Innovation Support Fund
    • Ministry of Higher Education, Jordan (research project ICT/2/5/2016)

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 13
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 25 Jan 2025

    Citations

    Cited By

    • (2024) RViT: Robust Fusion Vision Transformer with Variational Hierarchical Denoising Process for Image Classification. Guidance, Navigation and Control 04:03. DOI: 10.1142/S2737480724410073. Online publication date: 26-Oct-2024
    • (2024) Arabic Paraphrase Generation Using Transformer-Based Approaches. IEEE Access 12 (121896-121914). DOI: 10.1109/ACCESS.2024.3450931. Online publication date: 2024
    • (2024) Paraphrasing identification Using ACV-tree kernel. Procedia Computer Science 244 (151-157). DOI: 10.1016/j.procs.2024.10.188. Online publication date: 2024
    • (2024) Arabic paraphrased parallel synthetic dataset. Data in Brief 57 (111004). DOI: 10.1016/j.dib.2024.111004. Online publication date: Dec-2024
    • (2024) A Language Framework for Measuring Semantic and Syntactic Similarity for Arabic Texts. SN Computer Science 5:4. DOI: 10.1007/s42979-024-02691-x. Online publication date: 27-Mar-2024
    • (2024) Evaluating the adversarial robustness of Arabic spam classifiers. Neural Computing and Applications. DOI: 10.1007/s00521-024-10778-y. Online publication date: 20-Dec-2024
    • (2023) Arabic Paraphrasing Detection Using Multiple Extracted Features. 2023 14th International Conference on Information and Communication Systems (ICICS) (01-06). DOI: 10.1109/ICICS60529.2023.10330486. Online publication date: 21-Nov-2023
    • (2023) Research on the Energy-Saving Transformation Strategy of Building Ecology Based on Artificial Intelligence and Interactive Virtual Simulation. 2023 International Conference on Ambient Intelligence, Knowledge Informatics and Industrial Electronics (AIKIIE) (1-4). DOI: 10.1109/AIKIIE60097.2023.10390167. Online publication date: 2-Nov-2023
    • (2023) Syntactic-Semantic Similarity Based on Dependency Tree Kernel. Arabian Journal for Science and Engineering 48:8 (10937-10948). DOI: 10.1007/s13369-023-07694-z. Online publication date: 11-Apr-2023
    • (2023) From extended chunking to dependency parsing using traditional Arabic grammar. Language Resources and Evaluation 57:3 (1011-1043). DOI: 10.1007/s10579-022-09629-w. Online publication date: 1-Feb-2023