research-article

A New English/Arabic Parallel Corpus for Phishing Emails

Authors:

Khaled Shaalan

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 7

Article No.: 201, Pages 1 - 17

https://doi.org/10.1145/3606031

Published: 25 July 2023

Abstract

Phishing is a malicious activity in which phishers, disguised as legitimate entities, gain illegitimate access to victims’ personal and private information, usually through emails. Currently, phishing attacks and threats are handled effectively by the latest phishing email detection solutions. Most current phishing detection systems assume that phishing attacks are in English, although attacks in other languages are growing. In particular, Arabic is a widely used language and therefore represents a vulnerable target. However, there is a significant shortage of corpora that can be used to develop Arabic phishing detection systems. This article presents a new English-Arabic parallel phishing email corpus developed from the text of the anti-phishing shared task (IWSPA-AP 2018). The email content was translated by 10 volunteers who had a university background and were English and Arabic language experts. To evaluate the effectiveness of the new corpus, we develop phishing email detection models using Term Frequency–Inverse Document Frequency and a Multilayer Perceptron on 1,258 emails in Arabic and English with equal ratios of legitimate and phishing emails. The experimental findings show that accuracy reaches 96.82% for the Arabic dataset and 94.63% for the English emails, providing some assurance of the potential value of the parallel corpus developed.
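To make the feature-extraction step concrete, the following is a minimal sketch of Term Frequency–Inverse Document Frequency weighting in plain Python. It is not the authors' implementation: the abstract does not state the exact weighting variant, so the sketch assumes one common form (tf = term count / document length, idf = log(N / df)), whitespace tokenization, and four hypothetical toy emails that are not drawn from the corpus. The names `tokenize` and `tfidf_vectors` are illustrative only.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split. A real pipeline would likely
    # use a language-aware tokenizer, particularly for Arabic text.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with tf = count/len(doc), idf = log(N/df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Hypothetical toy emails for illustration (not taken from the corpus).
emails = [
    "verify your account password now",
    "meeting agenda for monday",
    "your account is suspended verify now",
    "lunch on monday",
]
vocab, vectors = tfidf_vectors(emails)
```

Each email becomes a fixed-length vector over the vocabulary; terms that occur in few emails (such as "password" here) receive higher weights than terms shared across several emails. Vectors of this kind would then be passed to a classifier such as a multilayer perceptron.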

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 7

July 2023

422 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3610376

Editor:
Imed Zitouni
Google, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 July 2023

Online AM: 28 June 2023

Accepted: 12 June 2023

Revised: 04 March 2023

Received: 07 January 2023

Published in TALLIP Volume 22, Issue 7
