Research Report
on
Submitted by
2023-2024
Certificate of Approval
Head, Department of Information Technology, Kalyani Government Engineering College
Supervisor, Department of Information Technology, Kalyani Government Engineering College
Project Coordinator, Department of Information Technology, Kalyani Government Engineering College
Examiner
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude to Dr. Malabika Sengupta for her invaluable
guidance and unwavering support throughout the course of this project. Her insightful
feedback, constructive criticism, and dedication to fostering a spirit of inquiry have been
pivotal in shaping and refining our research on "Detecting Phishing Websites using
Machine Learning." Her mentorship has not only enriched my understanding of the subject
matter but has also inspired a deeper passion for research and innovation.
I would also like to extend my appreciation to the authors of the research papers that served
as significant references for our project. The contributions of these researchers have laid a
solid foundation for our work, and their insights have been instrumental in shaping our
methodology. The following research papers have been particularly influential:
- Görkem Giray, Bedir Tekinerdogan, Sandeep Kumar & Suyash Shukla, “Applications of
deep learning for phishing detection: a systematic literature review” in Knowledge and
Information Systems [23rd May 2022]
- Ashit Kumar Dutta, “Detecting phishing websites using machine learning technique” in
PLOS ONE (editor: Zhihan Lv, Qingdao University, China) [11th October 2021]
Other valuable papers and articles are listed in the References section of the report.
I am thankful for the support and encouragement received from Dr. Sengupta and the
authors of the referenced research papers, as their work has played a crucial role in the
development and success of our project.
[Date]
ABSTRACT
Among the many parameters that could be considered, a few, namely English proficiency,
source year, DNS filter, and reviews, are used in this work to develop a phishing website
detector.
CONTENTS
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Objective
1.3 Organization of the Project
CHAPTER 4 CONCLUSIONS
CHAPTER 5 REFERENCES
Chapter 1
INTRODUCTION
The primary objective is to design a robust machine learning model capable of discerning
patterns in the structure, content, and behavior of phishing sites. Unlike static rule sets,
machine learning provides adaptability to the dynamic nature of modern phishing attacks.
The model is trained on diverse datasets, allowing it to generalize what it learns and
identify phishing attempts in real time.
Grounded in the understanding that phishing websites exhibit discernible patterns, our
approach draws insights from contemporary research papers, positioning our project at the
intersection of cutting-edge research and practical implementation in the field of phishing
detection using machine learning. This report comprehensively outlines our research
methodology, dataset selection, feature extraction techniques, and evaluation metrics,
contributing to the cybersecurity knowledge base and offering a practical solution to the
evolving threat of phishing attacks in the digital age.
1.1 MOTIVATION
Inspired by research insights, we aim to bridge the gap between academic knowledge and
practical implementation. Our goal is not just to build a database but to empower users,
organizations, and cybersecurity professionals with the tools and intelligence needed to
navigate the digital landscape securely. Through this project, we strive to make meaningful
contributions to the resilience of the online world, ensuring a safer internet for all.
1.2 OBJECTIVE
The upcoming phase of our project focuses on algorithm implementation using Python,
integrating the Random Forest and RNN models into our website through APIs. This step
translates our research findings into a practical, real-time phishing detection system.
Simultaneously, we'll establish a database to systematically store information on identified
phishing websites. Future plans involve linking this database to organizations like the Anti-
Phishing Working Group (APWG) for wider cybersecurity collaboration. User feedback
mechanisms and continuous algorithm refinement will ensure the project remains adaptive
and effective in countering evolving phishing threats. The proposed work aims to create a
comprehensive and impactful tool that not only detects phishing websites but actively
contributes to the global cybersecurity community.
Chapter 2
BACKGROUND STUDY
The literature survey conducted for this project provides a comprehensive understanding of
the current state-of-the-art techniques, methodologies, and advancements in the domain of
phishing detection using machine learning.
The survey encompasses a wide range of research papers, articles, and journals that
contribute valuable insights into the intricacies of phishing attacks and the application of
machine learning algorithms for their detection. The references to the papers and articles
are attached in the References section.
The survey delves into feature extraction methodologies relevant to phishing detection. It
investigates how researchers identify discriminative features from web content, structure,
and behavior. Commonly used features include URL structures, HTML content analysis,
and behavioral patterns, which contribute to the effectiveness of machine learning models.
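As a minimal sketch of the URL-structure features mentioned above, the snippet below derives a handful of lexical features from a URL string using only the Python standard library. The function name `extract_url_features` and the particular features chosen are our own illustrative assumptions, not the final feature set of this project.

```python
import re
from urllib.parse import urlparse

# Matches a dotted-quad IPv4 host such as 192.168.10.5 (illustrative check only).
IPV4_PATTERN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def extract_url_features(url: str) -> dict:
    """Return a few simple lexical features computed from the URL string alone.

    The feature names are hypothetical and chosen for illustration.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    host_is_ip = bool(IPV4_PATTERN.match(host))
    # Naive subdomain count: labels beyond "name.tld". It ignores multi-part
    # suffixes such as "co.uk", and is set to 0 when the host is an IP address.
    num_subdomains = 0 if host_is_ip else max(host.count(".") - 1, 0)
    return {
        "url_length": len(url),
        "uses_https": parsed.scheme == "https",
        "host_is_ip": host_is_ip,
        "num_subdomains": num_subdomains,
        "has_at_symbol": "@" in url,
        "num_hyphens_in_host": host.count("-"),
    }


if __name__ == "__main__":
    print(extract_url_features("http://192.168.10.5/secure-login/update.php"))
```

Features of this kind are cheap to compute because they need no network access, which is why they are a common first input to a classifier before page content is examined.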
The survey critically examines the challenges and limitations faced by existing phishing
detection systems. It sheds light on areas such as false positives, adversarial attacks, and
the need for continuous adaptation to new phishing tactics.
After reviewing the published journals and articles, we found that most of the past work
we surveyed in this field focused on classical machine learning algorithms such as Random
Forest and Support Vector Machine, and mostly did not apply deep learning to detect
phishing websites. We will explore this aspect and include deep learning in our model
to increase its efficiency and accuracy.
2.1 PROBLEM STATEMENT
Problem Statement:
Rising Cyber Threats: The surge in phishing attacks poses a severe threat to online security,
demanding a sophisticated solution to accurately detect and prevent malicious URLs.
Objectives:
1. High Accuracy: Develop a machine learning model with precise phishing URL
classification.
2. Real-time Detection: Implement a system for instant analysis, preventing access to
harmful websites.
3. Feature Extraction: Identify key features for robust analysis of URLs and webpage
content.
4. Behavioral Analysis: Integrate dynamic behavioral analysis to adapt to evolving
phishing tactics.
5. User-Friendly Interface: Design an intuitive interface for seamless user
interaction.
6. Scalability: Ensure efficient handling of a large volume of URL requests without
compromising performance.
Outcome:
A proactive defense against phishing attacks, enhancing online security for end-users and
professionals.
After conducting the literature survey, we identified the following phishing techniques by
which users are scammed:
Email Phishing: Deceptive emails, often impersonating trusted entities, aim to trick
recipients into revealing sensitive information through fraudulent links or attachments.
Characteristics include spoofed sender addresses and urgent requests.
SMS Phishing (Smishing): Text messages, claiming legitimacy from sources like banks,
prompt individuals to click malicious links or disclose sensitive information.
Characteristics include urgent messages and requests for personal information.
Call Phishing (Vishing): Voice phishing involves phone calls where attackers, posing as
trusted entities, manipulate individuals into divulging sensitive information. Characteristics
include manipulative tactics, urgent claims, and requests for information over the phone.
Website Phishing: Fraudulent websites imitate legitimate ones to deceive users into
entering sensitive information. Often spread through email or social engineering, these sites
aim to capture login credentials, financial details, or personal data.
Social Media Phishing: Cybercriminals exploit social media platforms to deceive users
into clicking on malicious links or sharing personal information. Impersonation of trusted
contacts and the spread of fake content are common tactics.
Credential Phishing: Attackers use various modes, including emails and fake websites, to
trick individuals into divulging login credentials. This information is then exploited for
unauthorized access to accounts.
In this project, we have primarily focused on website, social media, and email phishing.
We also look forward to addressing the other modes.
Minimal Resource Requirements:
Leveraging open-source technologies and libraries like scikit-learn, TensorFlow, and
PostgreSQL can contribute to cost savings by minimizing licensing expenses. Additionally,
the collaborative nature of open source often results in efficient problem-solving without
the need for extensive financial resources.
Chapter 3
Our project envisions a powerful phishing detection system that combines the versatility of
Random Forest and the sequential analysis capabilities of Recurrent Neural Networks
(RNN). Users will interact with a user-friendly website, submitting URLs for analysis. The
model, implemented in Python, will seamlessly integrate with the website through APIs,
ensuring real-time and accurate phishing detection. The frontend, developed using HTML,
CSS, JavaScript, and JQuery, offers an intuitive user experience.
3.1 IMPLEMENTATION
The steps below outline how we plan to approach the problem statement:
2. Data Collection: We collected our training and test data from the publicly available
UCI phishing dataset.
3. Feature Extraction: Identify relevant features from the URLs that can help
distinguish between phishing and legitimate websites. Features might include URL
length, presence of HTTPS, domain age, and other relevant characteristics.
4. Data Preprocessing: Clean and preprocess the dataset. This involves handling
missing values, encoding categorical variables, and scaling numerical features.
5. Split Data: Divide the dataset into training and testing sets. This allows you to train
the model on one subset and evaluate its performance on unseen data.
6. Model Selection: Choose the Random Forest classifier as your machine learning
algorithm. Random Forest is effective for classification tasks and can handle a
diverse set of features.
7. Feature Selection: The difficulty lies in determining which features in a set are
the most relevant and which combination of features gives near-perfect
classification accuracy. From the 30 features, we identified five subsets. These
were grouped as shown below.
8. Training: Train the Random Forest classifier using the training dataset. The model
will learn to distinguish between phishing and legitimate websites based on the
provided features.
9. Testing: Evaluate the model's performance on the testing dataset. Use metrics such
as accuracy, precision, recall, and F-score to assess its effectiveness.
10. Hyperparameter Tuning: Optimize the performance of the Random Forest by
tuning its hyperparameters. This involves adjusting parameters such as the number
of trees and tree depth.
12. Deployment: Once satisfied with the model's performance, deploy it for real-time
detection. This could involve integrating it into a web application, browser
extension, or network security system.
13. Monitoring and Updates: Regularly monitor the model's performance in real-
world scenarios and update it as needed to adapt to evolving phishing techniques.
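The evaluation metrics named in the Testing step can be made concrete with a small worked example. The sketch below computes accuracy, precision, recall, and F-score from true and predicted labels in plain Python; the function name `classification_metrics` and the sample labels are illustrative assumptions, with 1 standing for "phishing" and 0 for "legitimate". In practice, scikit-learn provides equivalent functions (`accuracy_score`, `precision_score`, `recall_score`, `f1_score` in `sklearn.metrics`).

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier.

    `positive` is the label treated as the phishing class (assumed to be 1 here).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up labels purely for illustration; not results from this project.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
    print(classification_metrics(y_true, y_pred))
```

For phishing detection, recall measures how many phishing sites are caught, while precision measures how many flagged sites are truly phishing, so the two should be read together rather than relying on accuracy alone.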
3.2 FEATURES
Language Correctness
Measure the English proficiency level of the website content, as phishing sites often contain
grammatical errors and awkward phrasing.
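One crude way to approximate this feature is the share of words on a page that do not appear in a reference vocabulary. The sketch below is only an illustration under that assumption: `KNOWN_WORDS` is a tiny placeholder vocabulary and `unknown_word_ratio` is a hypothetical helper; a real system would use a full dictionary or a grammar checker.

```python
import re

# Tiny placeholder vocabulary for illustration; not a real English dictionary.
KNOWN_WORDS = frozenset(
    {"please", "verify", "your", "account", "to", "continue",
     "using", "our", "service", "the", "is", "now"}
)


def unknown_word_ratio(text: str, vocabulary=KNOWN_WORDS) -> float:
    """Return the fraction of alphabetic words in `text` missing from `vocabulary`.

    A higher ratio suggests more misspellings, under the simplifying
    assumption that the vocabulary covers the legitimate words on the page.
    """
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word not in vocabulary)
    return unknown / len(words)


if __name__ == "__main__":
    print(unknown_word_ratio("Please verify your account"))
    print(unknown_word_ratio("Plese verfy your acount"))
```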
Source Year
Evaluate the age of the website, as recently registered domains are more likely to be
associated with phishing attempts.
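A minimal sketch of this feature follows, assuming the domain's registration date has already been obtained elsewhere (for example from a WHOIS record). The helper names and the 180-day threshold are illustrative assumptions, not values fixed by this project.

```python
from datetime import date


def domain_age_days(registered_on: date, today: date) -> int:
    """Return the number of days between registration and `today`."""
    return (today - registered_on).days


def is_recently_registered(
    registered_on: date, today: date, threshold_days: int = 180
) -> bool:
    """Flag domains younger than `threshold_days` (hypothetical cut-off)."""
    return domain_age_days(registered_on, today) < threshold_days


if __name__ == "__main__":
    # Example dates are made up for illustration.
    print(is_recently_registered(date(2023, 12, 1), date(2024, 1, 15)))
    print(is_recently_registered(date(2015, 6, 1), date(2024, 1, 15)))
```

Passing `today` explicitly, rather than reading the system clock inside the function, keeps the feature reproducible when the model is retrained on older data.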
DNS Filter
A DNS filter is a vital component in our phishing website detection strategy. It analyses
and categorizes domain names, blocking access to known phishing sites by
cross-referencing them with a comprehensive database of malicious entities. This proactive
defence mechanism enhances real-time threat mitigation, preventing users from accessing
fraudulent websites based on historical associations with phishing activities. Integrating
DNS filtering into our system adds a crucial layer of defence, bolstering the effectiveness
of our cybersecurity measures.
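The cross-referencing step can be sketched as a lookup of the requested host, and each of its parent domains, in a set of known malicious domains. The snippet below is a simplified illustration: `BLOCKLIST` holds placeholder entries on reserved test domains and `is_blocked` is a hypothetical helper, whereas a real DNS filter would draw on a maintained threat feed.

```python
from urllib.parse import urlparse

# Placeholder entries on reserved TLDs; a real list would come from a threat feed.
BLOCKLIST = frozenset({"phish.example", "bad-login.test"})


def is_blocked(url: str, blocklist=BLOCKLIST) -> bool:
    """Return True if the URL's host, or any parent domain of it, is blocklisted.

    The URL must include a scheme (e.g. "https://"); otherwise urlparse
    cannot identify the host and the function returns False.
    """
    host = (urlparse(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    # Check "a.b.c", then "b.c", then "c" against the blocklist.
    for start in range(len(labels)):
        if ".".join(labels[start:]) in blocklist:
            return True
    return False


if __name__ == "__main__":
    print(is_blocked("http://secure.phish.example/login"))
    print(is_blocked("https://example.org/"))
```

Matching on whole labels rather than substrings avoids flagging an unrelated domain that merely contains a blocklisted name as part of a longer label.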
Reviews
Consider the reputation and feedback from users and security experts to determine the
legitimacy of the website.
Frontend Development:
Backend Development:
NumPy: A fundamental package for scientific computing with Python, essential for
numerical operations.
Pandas: A data manipulation and analysis library, beneficial for handling structured data.
scikit-learn: A machine learning library that includes tools for classification, regression,
clustering, and more.
TensorFlow or PyTorch: Deep learning frameworks for building and training neural
networks.
Database Management:
Git: A distributed version control system for tracking changes in the codebase and
facilitating collaborative development.
GitHub: A platform for hosting and managing Git repositories, enabling version control and
collaboration.
Text Editor/IDE: Visual Studio Code, Atom, or Sublime Text: Feature-rich text editors
suitable for coding, providing a smooth development experience.
API Development:
RESTful API: Building APIs to facilitate communication between the frontend and
backend components.
Swagger/OpenAPI: For documenting and testing APIs, ensuring clarity and consistency.
By integrating these technologies, the project can leverage a powerful and efficient stack
for developing a robust, user-friendly, and effective phishing detection system.
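To illustrate how the frontend could talk to the backend, the sketch below defines a small JSON endpoint using only the Python standard library. Everything named here is an assumption for illustration: the `/check` path, the `url` and `verdict` field names, and in particular `classify_url`, which is a placeholder rule standing in for the trained model and is not a real classifier.

```python
import json
from http.server import BaseHTTPRequestHandler


def classify_url(url: str) -> str:
    """Placeholder for the trained model (hypothetical stand-in rule)."""
    return "suspicious" if url.startswith("http://") else "unknown"


def build_response(raw_body: bytes):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    url = payload.get("url") if isinstance(payload, dict) else None
    if not isinstance(url, str) or not url:
        return 400, {"error": "missing 'url' field"}
    return 200, {"url": url, "verdict": classify_url(url)}


class CheckHandler(BaseHTTPRequestHandler):
    """Handles POST /check with a JSON body such as {"url": "..."}."""

    def do_POST(self):
        if self.path != "/check":
            status, body = 404, {"error": "not found"}
        else:
            length = int(self.headers.get("Content-Length", 0))
            status, body = build_response(self.rfile.read(length))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```

To try it locally, one could run `HTTPServer(("127.0.0.1", 8000), CheckHandler).serve_forever()` after importing `HTTPServer` from `http.server`. Keeping the validation in `build_response`, separate from the handler class, lets the request logic be tested without starting a server.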
The initial phase of our project has been dedicated to laying a robust foundation,
encompassing both the development of the user interface and a comprehensive exploration
of existing research.
The frontend of our website is now functional, featuring an intuitive input field where users
can submit links of websites they suspect to be fraudulent.
Our research phase has been extensive, drawing insights from reputable journals and
articles in the field of phishing detection. This literature review has been instrumental in
shaping our approach and finalizing the algorithms we plan to implement. We have
carefully selected and defined the machine learning algorithms that will power our system,
ensuring a potent combination of accuracy and adaptability.
Looking ahead, our immediate future work involves the implementation of these algorithms
using Python. This development phase will bring our envisioned machine learning models
to life, incorporating the intricacies identified during our research phase. We plan to
seamlessly integrate these algorithms into our website using APIs, enabling users to
experience real-time phishing detection capabilities.
internal use but will also facilitate the submission of reports to organizations dedicated to
combating phishing, such as the Anti-Phishing Working Group (APWG). The integration
of a database ensures data persistence and allows for future analyses and refinements.
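A minimal sketch of such a store is given below. It uses SQLite from the Python standard library purely as a lightweight stand-in for the PostgreSQL database mentioned earlier in this report; the table name `reported_urls`, its columns, and the helper functions are illustrative assumptions rather than a finalized schema.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS reported_urls (
               url TEXT PRIMARY KEY,
               verdict TEXT NOT NULL,
               first_seen TEXT NOT NULL
           )"""
    )
    return conn


def record_detection(conn: sqlite3.Connection, url: str, verdict: str) -> None:
    """Store a detected URL once; repeated reports of the same URL are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO reported_urls (url, verdict, first_seen) "
        "VALUES (?, ?, ?)",
        (url, verdict, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def list_detections(conn: sqlite3.Connection) -> list:
    """Return all stored (url, verdict) pairs, ordered by URL."""
    return conn.execute(
        "SELECT url, verdict FROM reported_urls ORDER BY url"
    ).fetchall()


if __name__ == "__main__":
    store = open_store()
    record_detection(store, "http://phish.example/login", "phishing")
    print(list_detections(store))
```

Using parameterized queries (the `?` placeholders) matters here because the stored URLs come from untrusted user submissions.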
API Integration: Seamlessly link the algorithms to the website through APIs, ensuring a
user-friendly and responsive interface.
Through these planned future works, we aim to transform our research and planning into
a functional and impactful tool. By marrying technological innovation with community
engagement, we strive to create a comprehensive solution that not only detects phishing
websites effectively but actively contributes to the global fight against cyber threats.
Chapter 4
CONCLUSION
In the culmination of our project's initial phases and the roadmap for its future, a cohesive
and robust framework for detecting phishing websites using machine learning has taken
shape. The completion of the frontend development, coupled with an extensive literature
survey, has laid the groundwork for a user-friendly platform that fosters community-
driven threat intelligence. The integration of machine learning algorithms, specifically the
Random Forest and RNN models, promises to enhance the accuracy and adaptability of
our phishing detection system.
The research journey, guided by insights from reputable journals, has deepened our
understanding of phishing attacks and their evolving tactics. The literature survey has
informed our approach, ensuring that the project aligns with current best practices and
stays at the forefront of advancements in cybersecurity.
Looking forward, the implementation phase beckons, where the chosen algorithms will be
brought to life using Python, seamlessly integrated into our website through APIs. This
critical stage represents the bridge between theory and practical application, translating
our research findings into a functional tool capable of real-time phishing detection.
In conclusion, our project strives not only to create a sophisticated machine learning-
based solution for phishing detection but also to cultivate a community-driven, resilient
defense against cyber threats. As we progress into the implementation and refinement
phases, we remain dedicated to the overarching goal of creating a safer digital
environment, where users can navigate with confidence, shielded from the insidious threat
of phishing attacks. The fusion of technological innovation, community collaboration, and
a commitment to ongoing improvement defines the essence of our project, embodying a
proactive and collective approach to cybersecurity in the digital age.
Chapter 5
REFERENCES
The published journals and articles that helped us gather information about past work
on this topic are listed below:
https://link.springer.com/article/10.1007/s10115-022-01672-x
https://link.springer.com/chapter/10.1007/978-981-13-9155-2_5
https://www.sciencedirect.com/science/article/abs/pii/S1574013717302010?via%3Dihub
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0258361