research-article

A New English/Arabic Parallel Corpus for Phishing Emails

Published: 25 July 2023

Abstract

Phishing is a malicious activity in which phishers, disguised as legitimate entities, gain illegitimate access to victims’ personal and private information, usually through emails. Currently, phishing attacks and threats are handled effectively by the latest phishing email detection solutions. Most current phishing detection systems assume that phishing attacks are written in English, although attacks in other languages are growing. In particular, Arabic is a widely used language and therefore represents a vulnerable target. However, there is a significant shortage of corpora that can be used to develop Arabic phishing detection systems. This article presents a new English-Arabic parallel phishing email corpus developed from the anti-phishing shared task text (IWSPA-AP 2018). The email content was translated by 10 volunteers who had a university background and were English and Arabic language experts. To evaluate the effectiveness of the new corpus, we develop phishing email detection models using Term Frequency–Inverse Document Frequency and a Multilayer Perceptron on 1,258 emails in Arabic and English, with equal ratios of legitimate and phishing emails. The experimental findings show that accuracy reaches 96.82% for the Arabic dataset and 94.63% for the English emails, providing some assurance of the potential value of the parallel corpus developed.
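
As a rough illustration of the Term Frequency–Inverse Document Frequency weighting named in the abstract, the following Python sketch turns a few toy messages into weighted term vectors. The abstract does not specify the exact TF-IDF variant, the tokenization, or the Multilayer Perceptron configuration used, so the weighting formula, the helper names `tokenize` and `tfidf_vectors`, and the sample messages are all assumptions made for illustration. The classifier itself is not shown.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization. Arabic text would
    # normally need language-aware normalization, which is omitted here.
    return text.lower().split()


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) weighted as tf * (log(N / df) + 1)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(df)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append([
            (counts[term] / total) * (math.log(n_docs / df[term]) + 1.0)
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical toy messages, not taken from the corpus.
emails = [
    "verify your account now",
    "meeting agenda for monday",
    "your account is suspended verify now",
]
vocab, vecs = tfidf_vectors(emails)
```

In this sketch, terms that appear in fewer messages receive larger weights. Vectors of this kind could then serve as input features for a classifier such as a Multilayer Perceptron.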

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 7
    July 2023
    422 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3610376

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 July 2023
    Online AM: 28 June 2023
    Accepted: 12 June 2023
    Revised: 04 March 2023
    Received: 07 January 2023
    Published in TALLIP Volume 22, Issue 7

    Author Tags

    1. English-Arabic Parallel Corpus
    2. phishing emails
    3. Multilayer Perceptron
    4. Term frequency-inverse document frequency

    Qualifiers

    • Research-article
