
A Project Report
On
ADVANCEMENTS IN CYBERSECURITY: A
DATA ANALYTICS APPROACH FOR
PROACTIVE DETECTION AND MITIGATION OF
MALICIOUS URLS
Submitted in partial fulfillment of the
requirement for the award of the degree of
BACHELOR OF TECHNOLOGY
in
ARTIFICIAL INTELLIGENCE & DATA SCIENCE
By
S. SALMA FAYAZ 20701A3039
S. NOUSHIN FATHIMA 20701A3033
V. HANISHA 20701A3010
Under the esteemed guidance of
Dr. P. Phanindra Kumar Reddy M. Tech, Ph.D.

Head of the Department
Department of AI&DS, AITS.
Submitted to
Department of Artificial Intelligence & Data Science

Annamacharya Institute of Technology and Sciences
(Affiliated to J.N.T. University, Anantapur)
New Boyanapalli, Rajampet-516126 Annamayya (Dt), A.P
2023-2024
CERTIFICATE
This is to certify that the project report entitled “Advancements in
Cybersecurity: A Data Analytics Approach for Proactive Detection and Mitigation
of Malicious URLs” done by S. SALMA FAYAZ, 20701A3039, in partial
fulfillment of the requirements for the award of the Degree of Bachelor of Technology in
“Artificial Intelligence and Data Science”, is a record of bonafide work carried out
by her during the academic year 2023-2024.
Signature of Guide:
Dr. P. Phanindra Kumar Reddy M. Tech, Ph.D.
Associate Professor
Department of AI&DS
AITS
Rajampet

Signature of HOD:
Dr. P. Phanindra Kumar Reddy M. Tech, Ph.D.
Associate Professor
Head of the Department
Artificial Intelligence & Data Science
AITS
Rajampet
CERTIFICATE
This is to certify that the project report entitled “Advancements in
Cybersecurity: A Data Analytics Approach for Proactive Detection and Mitigation
of Malicious URLs” done by S. SALMA FAYAZ, 20701A3039, in partial
fulfillment of the requirements for the award of the Degree of Bachelor of Technology in
“Artificial Intelligence and Data Science”, is a record of bonafide work carried out
by her during the academic year 2023-2024.
Project viva-voce held on :
Internal Examiner External Examiner

Declaration by the Candidate
I, Miss S. SALMA FAYAZ, bearing Roll Number 20701A3039, hereby declare that the
project report entitled ADVANCEMENTS IN CYBERSECURITY: A DATA ANALYTICS
APPROACH FOR PROACTIVE DETECTION AND MITIGATION OF MALICIOUS URLS,
carried out under the guidance of Dr. P. Phanindra Kumar Reddy, M. Tech, Ph.D., Department of
Artificial Intelligence and Data Science, is submitted in partial fulfilment of the requirements
for the award of the degree of Bachelor of Technology in Artificial Intelligence and Data
Science.
This is a record of bonafide work carried out by me, and the results embodied in this project
report have not been reproduced or copied from any source. The results embodied in this project
report have not been submitted to any other University or Institute for the award of any other
Degree or Diploma.
Name: S. SALMA FAYAZ

Roll Number: 20701A3039
Department of Artificial Intelligence and Data Science
New Boyanapalli-516126
Rajampet, Annamayya, A.P.
ACKNOWLEDGMENT
An endeavor of a long period can be successful only with the advice of many
well-wishers. We take this opportunity to express our deep gratitude and appreciation
to all those who encouraged us for the successful completion of the project work.
Our heartfelt thanks to our Guide, Dr. P. Phanindra Kumar Reddy M. Tech,
Ph.D., Head of the Department of Artificial Intelligence and Data
Science, Annamacharya Institute of Technology and Sciences, Rajampet, for his
valuable guidance and suggestions in analysis and testing throughout the period, until
the completion of the project work.
We wish to express sincere thanks and gratitude to Dr. P. Phanindra Kumar
Reddy, Head of the Department of Artificial Intelligence and Data Science, for his
encouragement and facilities that were offered to us for carrying out this project.
We take this opportunity to express our gratitude to our Principal, Dr. S.M.V.
Narayana, for providing all sorts of help during the project work.
We are very much thankful to Dr. C. Gangi Reddy, Honorary Secretary of the
Annamacharya Educational Trust, for his help in providing good facilities in our
college.
We would like to express our sincere thanks to all faculty members of the Artificial
Intelligence and Data Science Department, batch-mates, friends, and lab
technicians, who have helped us to complete the project work successfully.
Finally, we express our sincere thanks to our parents, who have provided their
heartfelt support and encouragement in completing this project
successfully.
PROJECT ASSOCIATES
S. SALMA FAYAZ
S. NOUSHIN FATHIMA
V. HANISHA
TABLE OF CONTENTS
TITLE PAGE NO
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
1.1. Emergence Of Malicious URLs 1
1.2. Rationale For Research 1-2
1.3. Challenges In Combatting Malicious URLs 2
2. LITERATURE SURVEY 3-6
3. SYSTEM ANALYSIS
3.1. Existing System 7
3.1.1. Disadvantages 8
3.2. Proposed System 8
3.2.1. Advantages 8
3.3. Modules used in proposed system
3.3.1. User 9
3.3.2. System 9
3.3.3. Algorithms Used 9
3.3.3.1. ANN-LSTM 9
3.3.3.2. CNN-LSTM 9-10
4. SYSTEM REQUIREMENTS SPECIFICATIONS
4.1. Software requirements 11
4.2. Hardware requirements 11
4.3. Feasibility study 11
4.3.1. Economic feasibility 12
4.3.2. Technical feasibility 12
4.3.3. Behavioral feasibility 12
4.3.4. Benefits of doing feasibility study 12
4.4. Functional and non-functional requirements 13
4.4.1. Functional Requirements 13
4.4.2. Non-Functional Requirements 13-14
5. SYSTEM DESIGN
5.1. Architecture Design 15
5.2. UML diagrams
5.2.1. Use case diagram 16
5.2.2. Class diagram 17
5.2.3. Sequence diagram 17-18
5.2.4. Collaboration diagram 18
5.2.5. Deployment diagram 19
5.2.6. Component diagram 19
5.2.7. State Chart diagram 20
5.3. ER Diagram 20-21
6. SYSTEM CODING AND IMPLEMENTATION
6.1. Programming Language and Libraries Selection 22
6.1.1. Libraries used in Python 22-23
6.2. Development Environment Setup and Configuration 23
6.3. Code 24-25
7. SYSTEM TESTING
7.1. Software testing techniques
7.1.1. Goals 26
7.1.2. Test Case Framework 26
7.1.3. White box testing 26-27
7.1.4. Black box testing 27
7.2. Strategies for software testing 28
7.2.1. Unit testing 28
7.2.2. Integration testing 28
7.2.3. Validation testing 29
7.2.4. System testing 29
7.2.5. Security testing 29
7.2.6. Performance evaluation 30
7.3. Test Cases 30-31
8. RESULTS 32-33
9. CONCLUSION AND FUTURE ENHANCEMENTS 34
REFERENCES 35-36
PLAGIARISM REPORT
LIST OF FIGURES & TABLES
Fig. No. Figures Page No.

5.1. Architecture Diagram 15
5.2. Use Case Diagram 16
5.3. Class Diagram 17
5.4. Sequence Diagram 18
5.5. Collaboration Diagram 18
5.6. Deployment Diagram 19
5.7. Component Diagram 19
5.8. State Chart Diagram 20
5.10. ER Diagram 21
6.1. Tool Stack for Project Implementation 23
8.1. Home Page 32
8.2. Enter URL Page 32
8.3. Prediction Page 33
Table. No. Name Page No.

7.1. Test Cases 30
7.2. Test Cases of Model Building 31
ABSTRACT
The demand for proactive detection and mitigation solutions that protect
against harmful actions on the internet is growing with the sophistication of
cyber threats. This study's objective is to identify malicious URLs using a
new data analytics strategy that makes use of sophisticated machine learning
algorithms. With our method, URLs are analyzed and classified as benign or
malicious by combining the capabilities of Long Short-Term Memory (LSTM)
networks, Convolutional Neural Networks (CNN), and Artificial Neural
Networks (ANN). The dataset offers an extensive training and testing
environment since it consists of a wide variety of URLs that have been
labeled as benign or malicious. In a comparative study, we assess the
combined CNN-LSTM and ANN-LSTM models in terms of the Receiver Operating
Characteristic (ROC) curve, F1 score, accuracy, and precision. The outcomes
demonstrate how well our method differentiates between legitimate and
malicious URLs, allowing for proactive threat identification. Based on these
evaluation criteria, we then select the best-performing model for real-time
classification of new URLs. By integrating this model into a web-based
application, users can check the authenticity of URLs and reduce security
risks. Our method provides a scalable and efficient way to counteract
constantly evolving cyber threats by utilizing deep learning and data
analytics.
Keywords: Phishing detection, proactive detection, web security,
URL classification, malware, neural networks, machine
learning, data analytics, CNN-LSTM, ANN-LSTM,
cybersecurity.
CHAPTER-1
INTRODUCTION
1. INTRODUCTION
1.1. EMERGENCE OF MALICIOUS URLS

In the field of cybersecurity, the appearance of malicious URLs represents a
major turning point. When the internet first came into existence, it was a worldwide
hub for communication and information sharing that promoted innovation and
connectedness. As cyber threats emerged, though, the environment quickly changed,
and new types of malicious activity that target weaknesses in web infrastructure began to
appear. Malicious URLs (malicious uniform resource locators) are an advanced tool
in the attacker's toolbox. These misleading URLs may appear to lead to authentic
websites, but they are designed to spread malware, steal data, and support phishing
schemes. The growth of malicious URLs has kept pace with technological advances,
allowing attackers to conduct more complex, effective, and stealthy attacks.
Malicious URLs pose a serious threat to people, companies, and governments
throughout the world, and they have become more widespread in recent years than
ever before. Malicious links are a favorite weapon of cybercriminals looking to take
advantage of unsuspecting people because they are simple to create and spread, and they can
be distributed anonymously over the internet. Developing successful cybersecurity strategies
to lessen the impact of malicious URLs and protect digital assets requires an understanding
of how they arise. By tracking the origins and evolution of these threats, researchers
gain important insight into the tactics, techniques, and procedures used by threat
actors, and can develop countermeasures and strengthen defenses against future
attacks.
1.2. RATIONALE FOR RESEARCH

The need to confront the rising threat posed by cyberattacks in today's
connected world is the driving force behind the investigation of malicious URLs. As
technology progresses and dependence on digital platforms grows, the likelihood of
falling victim to harmful activity increases, and proactive cybersecurity measures
become increasingly important. The commitment to safeguarding people, companies, and
critical infrastructure against the destructive effects of cybercrime is at the core of our study.
Researchers aim to create solutions that can identify, mitigate, and
eliminate these dangers by examining the characteristics of malicious URLs and the
strategies used by cybercriminals to take advantage of vulnerabilities.
Furthermore, the spread of malicious URLs highlights the need for knowledge
exchange and multidisciplinary cooperation between the government, business, and
academic sectors. By developing a collaborative research ecosystem, stakeholders can
address complex cybersecurity problems more effectively through collective
knowledge, shared insights, and pooled resources. Ultimately, the study of malicious
URLs has societal implications that go beyond the technical. In the digital era,
researchers help to preserve freedom, privacy, and trust by
defending the integrity and security of digital ecosystems, establishing the foundation
for a more secure and resilient cyberspace.
1.3. CHALLENGES IN COMBATTING MALICIOUS URLS

The dynamic and adaptable nature of cyberattacks presents a multitude of
obstacles for cybersecurity experts when it comes to combatting malicious URLs.
Distinguishing between malicious and legitimate links on the internet is difficult mostly
because of the large number and variety of malicious URLs in circulation. In addition,
attackers use advanced methods to avoid detection and bypass conventional security
controls. These methods include obfuscating URLs, using polymorphic malware,
and exploiting vulnerabilities that have not yet been discovered by defenders.
Traditional signature-based detection systems, which identify threats using
predetermined patterns or signatures, struggle against these evasion techniques.
The rapid evolution of attack vectors and threat actor methods presents another
difficulty in the fight against malicious URLs. Security experts must constantly
innovate to identify and efficiently neutralize emerging
threats as attackers modify their tactics to get around current safeguards. For the law
enforcement agencies and regulatory authorities responsible for fighting cybercrime,
the worldwide reach of the internet also poses jurisdictional issues. Cybercriminals
operate across borders and use anonymity and encryption to avoid identification and
punishment, which makes it more difficult to hold them accountable and dismantle
criminal networks.
Addressing these difficulties requires a multidimensional approach that combines
technological innovation, policy development, and international collaboration.

CHAPTER-2
LITERATURE SURVEY
2. LITERATURE SURVEY
TITLE: Cantina: A Content-Based Approach to Detecting Phishing Websites

AUTHOR: Yue Zhang, Jason Hong, Lorrie Cranor
ABSTRACT: Phishing, which uses fraudulent emails and websites to trick users into
disclosing personal information, is a widespread problem that deserves high priority.
In this research, the novel content-based technique CANTINA, which makes use of the
TF-IDF information retrieval algorithm, is presented as a means to detect phishing
websites. We describe CANTINA's architecture, deployment, and evaluation and discuss
the creation and analysis of several heuristics designed to reduce false positives.
Our experiments show that CANTINA is effective at detecting phishing sites,
correctly identifying about 95% of them.

TITLE: Phishing Detection: A Literature Survey

AUTHOR: Mahmoud Khonji, Youssef Iraqi, Andrew Jones
ABSTRACT: This work offers a thorough analysis of the literature on the
identification of phishing attacks, which take advantage of weaknesses in systems caused
by human factors. Because end users are the weakest link in the security chain,
cyberattacks frequently take advantage of their innate vulnerabilities. The scope of the
phishing issue makes it impossible for a single solution to adequately address every
vulnerability. As a result, different strategies are usually used to target different attack
vectors. This paper's main goal is to review a variety of newly suggested phishing attack
mitigation techniques. In order to clarify how phishing detection techniques fit into the
larger mitigation framework, it also provides a high-level overview of many kinds of
phishing mitigation strategies, such as detection, offensive defense, rectification, and
prevention.

TITLE: Malicious URL Detection using Machine Learning: A Survey

AUTHOR: Sahoo D, Liu C, Hoi S C H
ABSTRACT: In the field of cybersecurity, rogue URLs—also referred to as malicious
websites—pose a persistent and important problem. These URLs offer a range of
unwelcome material, such as spam, phishing attempts, and drive-by exploits. As a result,
unknowing consumers become victims of frauds including identity theft, money loss, and
malware infections, which cause significant yearly financial losses. It is essential to
recognize these hazards and act quickly to counter them. While blacklists play a major
role in traditional techniques, their usefulness is naturally limited, and they cannot keep
up with the rate at which harmful URLs are being created. Machine learning approaches
are becoming more and more popular as a means of improving the effectiveness and reach
of malicious URL identification. This article offers a thorough analysis and organized
synopsis of machine learning-based malicious URL detection techniques. In addition to
classifying and reviewing literature contributions addressing different facets of the issue,
such as feature representation and algorithm design, it gives a formal formulation of
malicious URL detection as a machine learning task. Furthermore, by clarifying
the state of the art and promoting additional study and useful applications, it acts as a
useful resource for a variety of audiences, including machine learning researchers,
cybersecurity experts, and industry practitioners. The paper also addresses essential
directions for further study, emphasizes open research issues, and talks about practical
implications in system design.

TITLE: Max-pooling Loss Training of Long Short-Term Memory Networks for Small-
footprint Keyword Spotting
AUTHOR: Ming Sun, Anirudh Raju, George Tucker, Sankaran Panchapagesan,
Gengshen Fu
ABSTRACT: This paper proposes a max-pooling based loss function for training Long
Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS) with
low CPU, memory, and latency requirements. In the suggested approach, a network
trained with cross-entropy loss can be used to initialize the training process.
Max-pooling loss training is then used to further optimize the system. A
technique based on posterior smoothing is used to evaluate the performance of keyword
spotting. Empirical results show that LSTM models trained with max-pooling loss or
cross-entropy loss perform better than a baseline feed-forward Deep Neural Network
(DNN) trained with cross-entropy loss. LSTM models trained with max-pooling loss
show the largest gain over the baseline DNN, with a relative decrease of 67.6% in
the Area Under the Curve (AUC) metric. Max-pooling loss training helps whether it
starts from a randomly initialized network or from a cross-entropy pre-trained
network.

CHAPTER-3
SYSTEM ANALYSIS
3. SYSTEM ANALYSIS
System analysis is a comprehensive process that examines the computer data, project
data, algorithm data, and other internal and external data relevant to the proposed
research, and it spans a range of phases, methodologies, functions, and entities.
It comprises a set of systematic techniques used to determine the
requirements for designing the project. In this project, system analysis examined the
functional and non-functional requirements for the design of the proposed system. To
develop a logical model of the system, the analysis planned the design using a number
of tools, including class diagrams, sequence diagrams, data flow diagrams, and data
dictionaries. It also reviewed publications relevant to the project's work.
3.1 EXISTING SYSTEM:
For the purpose of detecting and mitigating malicious URLs in cybersecurity
applications, the existing system mostly depends on hybrid machine learning techniques,
namely CNN-LSTM models. These models classify URLs as benign or malicious by using
convolutional neural networks (CNNs) to extract spatial information from URL data and
long short-term memory (LSTM) networks to capture temporal relationships. Although
CNN-LSTM models have proven effective at detecting malicious URLs by learning spatial
and temporal correlations, the integration of artificial neural networks (ANN) with
LSTM networks, or ANN-LSTM models, has received little attention in previous research.
Within the existing framework, the majority of research efforts have gone into
improving CNN-LSTM model performance using methods such as
hyperparameter tuning, data augmentation, and attention mechanisms. Although
these efforts have significantly increased the accuracy, precision, and recall of
malicious URL identification, little is known about the possible advantages of
ANN-LSTM structures. This research gap shows that a thorough comparison of CNN-LSTM
and ANN-LSTM models is needed to identify the best strategy for proactive threat
detection and mitigation in cybersecurity applications.

3.1.1 DISADVANTAGES:
The existing approach has some drawbacks related to its sole dependence on
CNN-LSTM models. First, CNN-LSTM models may have trouble identifying
intricate temporal correlations in URL sequences, which could hinder their
ability to correctly identify some malicious URL patterns. Second, the absence of
ANN-LSTM models means that the benefits of artificial neural networks for
discerning complex patterns and correlations in URL data go unused, which might
compromise the overall efficacy of malicious URL detection systems. Finally, the lack
of thorough comparison studies between CNN-LSTM and ANN-LSTM models makes it hard to
assess the advantages and disadvantages of each technique, which impedes
advancements in the cybersecurity space.
3.2 PROPOSED SYSTEM:
To detect and mitigate malicious URLs, we propose a method that involves building
and evaluating both CNN-LSTM and ANN-LSTM models. We aim to determine the best
strategy for proactive threat identification in cybersecurity applications by
performing an in-depth comparative evaluation of these two designs. The proposed
system consists of several phases, namely data preparation, model training,
evaluation, and deployment, so that the CNN-LSTM and ANN-LSTM models integrate
smoothly and perform well.
3.2.1 ADVANTAGES:
Compared to the existing method, the proposed methodology has a number of
benefits. First, integrating ANN and LSTM networks improves the model's ability to
recognize complicated patterns in URL data and to capture complex temporal
relationships, which improves its ability to identify malicious URLs. Second, the
comparative study of the CNN-LSTM and ANN-LSTM models offers insight into the
advantages and disadvantages of each approach, which supports informed decisions
for cybersecurity applications. Finally, the proposed approach makes it possible to
classify URLs in real time and mitigate risks, giving businesses and individuals the
ability to proactively defend against new online threats.

3.3 MODULES USED IN PROPOSED SYSTEM:

3.3.1 USER:
By enabling users to enter URLs for threat detection and categorization, the user
module makes it easier for users to engage with the system. In order to help users make
educated judgments about the safety of accessible URLs, real-time feedback on the
categorization findings is provided.
3.3.2 SYSTEM:
The system module comprises the core functions of the proposed system: data
preparation, model training, evaluation, and deployment. It coordinates the data flow
and processes so that the CNN-LSTM and ANN-LSTM models operate smoothly and perform
well.
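As an illustration of the data preparation step, the sketch below turns a raw URL into a fixed-length sequence of integer character codes, which is a common input form for sequence models. The vocabulary, the `MAX_LEN` value, and the `encode_url` helper are assumptions made for this example and are not taken from the project code.

```python
import string

# Hypothetical vocabulary: every printable ASCII character gets an index >= 2.
# Index 0 is reserved for padding and index 1 for unknown characters.
VOCAB = {ch: idx + 2 for idx, ch in enumerate(string.printable)}
PAD, UNK = 0, 1
MAX_LEN = 32  # assumed sequence length; a real system would tune this


def encode_url(url: str, max_len: int = MAX_LEN) -> list[int]:
    """Map a URL to exactly max_len integers (truncated or right-padded)."""
    codes = [VOCAB.get(ch, UNK) for ch in url.strip()[:max_len]]
    return codes + [PAD] * (max_len - len(codes))


if __name__ == "__main__":
    print(encode_url("http://example.com/login"))
```

Fixing the length lets URLs be batched together; truncation and padding are a simplification that a real pipeline might replace with a different strategy.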
3.3.3 ALGORITHMS USED:

3.3.3.1. ANN-LSTM:
An important component of our proposed solution for malicious URL identification is
the Artificial Neural Network-Long Short-Term Memory (ANN-LSTM) model. It was
selected for its ability to capture the sequential patterns and long-term
relationships present in URL data. LSTM networks are particularly suited to modeling
data sequences because, in contrast to standard feedforward neural networks, they are
built to retain information over time. By using ANN-LSTM structures, our system aims
to improve its detection of the temporal characteristics that malicious URLs display.
Because they process a URL as an ordered sequence, ANN-LSTM models are able to
recognize patterns that suggest harmful intent. By employing ANN-LSTM models, our
system can identify complex malicious URLs with greater accuracy and reliability.
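To make concrete how an LSTM stores information over time, the following minimal sketch implements one time step of a scalar LSTM cell in plain Python. It uses the standard LSTM gate equations; the function names and the parameter values are illustrative assumptions, and the sketch is not the model used in this project, which would rely on a deep learning library with vector-valued states.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell.

    p maps each gate name ("f", "i", "o", "g") to a tuple
    (input weight, recurrent weight, bias).
    """
    def pre(gate):
        w, u, b = p[gate]
        return w * x + u * h_prev + b

    f = sigmoid(pre("f"))    # forget gate: how much old memory to keep
    i = sigmoid(pre("i"))    # input gate: how much new information to write
    o = sigmoid(pre("o"))    # output gate: how much memory to expose
    g = math.tanh(pre("g"))  # candidate value to write
    c = f * c_prev + i * g   # updated cell state (the long-term memory)
    h = o * math.tanh(c)     # updated hidden state (the output)
    return h, c


if __name__ == "__main__":
    # Illustrative parameters only: the forget gate is held open and the
    # input gate is held shut, so the cell state barely changes.
    params = {"f": (0.0, 0.0, 20.0), "i": (0.0, 0.0, -20.0),
              "o": (0.0, 0.0, 0.0), "g": (1.0, 0.0, 0.0)}
    h, c = 0.0, 0.8
    for x in [0.5, -1.0, 2.0]:
        h, c = lstm_step(x, h, c, params)
    print(c)
```

With the forget gate near one and the input gate near zero, the cell state is carried across steps almost unchanged, which is the mechanism that lets the network remember earlier parts of a sequence.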
3.3.3.2. CNN-LSTM:
To capture structural properties and spatial patterns within URL data, our proposed
system also includes a Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM)
architecture in addition to ANN-LSTM. CNN-LSTM models combine the sequential
analytical powers of LSTM networks with the spatial feature extraction of
convolutional neural networks (CNNs). CNNs are proficient at recognizing local spatial
patterns, which makes them well suited to analyzing the structural composition of
URLs. By capturing both local and global spatial relationships, the CNN-LSTM
architecture helps our system detect fraudulent URLs more effectively. In this way,
typical indicators of fraudulent URLs, such as irregular domain structures or abnormal
character distributions, can be reliably identified. Utilizing CNN-LSTM, our system
aims for thorough coverage in recognizing malicious URLs.
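The local pattern extraction performed by the convolutional stage can be sketched in plain Python as a 1D convolution followed by max pooling. The digit-based encoding and the hand-written filter below are assumptions chosen for readability; a trained CNN learns its filter weights from data, and its pooled output would then be passed on to the LSTM layers.

```python
def conv1d(seq, kernel):
    """Valid 1D cross-correlation, as used by convolutional layers."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]


def max_pool1d(seq, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(seq[i:i + size])
            for i in range(0, len(seq) - size + 1, size)]


if __name__ == "__main__":
    url = "http://192.168.0.1/login"
    # Toy encoding (assumption): 1.0 where the character is a digit, else 0.0.
    digits = [1.0 if ch.isdigit() else 0.0 for ch in url]
    # Hand-written filter that responds most strongly to three digits in a row.
    feature_map = conv1d(digits, [1.0, 1.0, 1.0])
    print(max_pool1d(feature_map, 4))
```

Pooling shortens the sequence while keeping the strongest local responses, so the recurrent layers that follow see a compact summary of where digit runs occur in the URL.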
Our proposed approach includes a thorough comparison of the effectiveness and
efficiency of the CNN-LSTM and ANN-LSTM models in malicious URL identification. The
comparison considers performance indicators including accuracy, recall, precision, F1
score, and receiver operating characteristic (ROC) curves. Its aim is to determine
which design best detects malicious URLs across various contexts and conditions. The
efficiency of each model is also evaluated in terms of resource consumption,
memory footprint, and computational complexity. By weighing the trade-offs between
accuracy and efficiency, we aim to determine which design is best suited to
deployment in practical settings. To support this decision, we also take into account
factors such as training duration, inference speed, and scalability.
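The performance indicators named above can be computed from model outputs as in the sketch below, which uses only the Python standard library. The function names and the sample labels are illustrative assumptions; in practice a library such as scikit-learn would normally be used for the same calculations.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = malicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    malicious URL is scored above a random benign one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((1.0 if p > n else (0.5 if p == n else 0.0))
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up labels and scores, for illustration only.
    y_true = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print(classification_metrics(y_true, y_pred))
    print(roc_auc(y_true, scores))
```

Accuracy, precision, recall, and F1 depend on the chosen decision threshold (0.5 here), whereas the ROC AUC summarizes ranking quality across all thresholds, which is why the two kinds of measure are reported together when comparing models.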

CHAPTER-4
SYSTEM REQUIREMENTS SPECIFICATION
4. SYSTEM REQUIREMENTS SPECIFICATION

A software requirements specification (SRS), sometimes referred to as a software
system requirements specification, fully describes the tasks that a system must
perform. The use cases in this part explain how the program interacts with its users.
In addition to the use cases, the SRS includes non-functional requirements.
Non-functional requirements impose constraints on design or implementation, such as
performance engineering requirements, quality standards, or design restrictions.
4.1. SOFTWARE REQUIREMENTS

• Operating System : Windows 11
• Server side Script : Python
• IDE : PyCharm
• Framework : Streamlit
• Dataset : Malicious URLs Dataset
4.2. HARDWARE REQUIREMENTS

• Processor : Intel i3 processor
• RAM : 4 GB (min)
• Hard Disk : 160GB
• Keyboard : Standard Windows Keyboard
• Monitor : SVGA
4.3. FEASIBILITY STUDY

The main goal of a feasibility study is to identify the best way to satisfy
performance criteria. This involves identifying and evaluating candidate systems in
order to choose the best one.
• Economic Feasibility
• Technical Feasibility
• Behavioral Feasibility
4.3.1. Economic Feasibility:

Economic feasibility is established through cost/benefit analysis, which evaluates
whether the prospective savings and benefits outweigh the costs. If the benefits
outweigh the costs, the decision is taken to develop and implement the system.
Otherwise, further justification or changes to the proposal are needed before it can
proceed.
4.3.2. Technical Feasibility:

Determining whether the current computer system can support the proposed
expansion, taking into account the hardware and software requirements, is the main
objective of technical feasibility. Technological developments must take into account
financial factors as well, as inadequate finance might make the project unworkable.
4.3.3. Behavioral Feasibility:

Estimating user staff resistance to the adoption of a computerized system is a
crucial component of behavioral feasibility. It takes work to enlighten, convince, and
teach people to accept new business practices when a new system is introduced.
Behavioral feasibility critically depends on overcoming resistance and encouraging
understanding among users.
4.3.4. Benefits of Doing a Feasibility Study:

There are several advantages to carrying out a feasibility study:
1. As the first stage of the software development life cycle, the study makes it
easier to examine system requirements in detail during the analysis phase.
2. It helps assess and evaluate the risk factors related to the design and
implementation of the system.
3. It supports risk planning by identifying possible obstacles and how to mitigate
them.
4. It enables cost-benefit evaluations, which support effective resource allocation
and organizational functioning.
5. It supports planning for the training of developers who will implement the
system.

4.4. FUNCTIONAL AND NON-FUNCTIONAL REQUIREMENTS:
The success of a software or system project depends largely on the requirements
analysis. There are two primary types of requirements: functional
requirements and non-functional requirements.
4.4.1. Functional Requirements:

Functional requirements are the fundamental features that a system must provide to
satisfy the specific needs of the end user. Each of these features has to be stated
explicitly in the system development contract. They are commonly described in terms of
the inputs to be supplied, the actions to be carried out, and the expected results. In
contrast to non-functional requirements, functional requirements are directly visible
in the completed product, and they directly represent user-defined demands.
Here are some examples of functional requirements:
1) A user logs in after authenticating.
2) The system shuts down in response to a cyberattack.
3) User registration in a software system triggers the automatic delivery of a
verification email.
4.4.2. Non-functional requirements

Non-functional requirements, or quality requirements, specify the quality attributes
and standards that the system must meet under the project contract. Depending on the
demands of the project, these requirements may be given different priorities. They
cover many different aspects, including performance, scalability, flexibility,
maintainability, security, and reusability.
The following are some examples of non-functional requirements:
1) Emails should be sent no later than 12 hours after the activity that triggers
them.
2) Security: Data is protected from unauthorized access and kept confidential.
3) Maintainability: The system is easy to maintain and update.
4) Reliability: The system operates stably with little downtime.
5) Scalability: The system can handle growing workloads or user demand.
6) Performance: The system uses resources efficiently and responds quickly.
7) Reusability: Components can be used again in other projects.
8) Flexibility: The system adapts to evolving user needs and environments.

CHAPTER-5
SYSTEM DESIGN
5. SYSTEM DESIGN
5.1. ARCHITECTURE DESIGN
Figure 5.1: Architecture diagram
The architecture diagram in Figure 5.1 delineates the systematic approach to gathering,
refining, and categorizing URLs using advanced deep learning classifiers. Beginning
with data collection from diverse sources, including search engines, the process
encompasses meticulous data preparation, involving cleaning and feature extraction
techniques. Three specialized neural network models are then employed to discern
phishing from legitimate URLs based on intricate patterns. Classification results from
these models facilitate rapid identification of malicious and benign URLs, empowering
stakeholders to bolster cyber defenses. Ultimately, this architecture aims to automate
URL security, enhancing resilience against evolving cyber threats through efficient and
accurate detection mechanisms.
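
The feature extraction step of this pipeline can be illustrated with a minimal sketch. It
is not the project's actual preprocessing code: the function name `extract_features`, the
chosen features, and the token list are assumptions made for illustration, and only the
Python standard library is used.

```python
from urllib.parse import urlparse

# Illustrative token list only; a real system would derive such tokens from data.
SUSPICIOUS_TOKENS = ("login", "verify", "update", "secure", "account")


def extract_features(url: str) -> dict:
    """Return a few simple lexical features of a URL (hypothetical feature set)."""
    # urlparse only recognizes the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        # Crude check: the host consists only of digits and dots.
        "host_is_ip": host.replace(".", "").isdigit(),
        "suspicious_tokens": sum(tok in url.lower() for tok in SUSPICIOUS_TOKENS),
    }


if __name__ == "__main__":
    print(extract_features("http://192.168.0.1/login/verify?id=42"))
```

Features of this kind could be fed to a classifier as a fixed-length vector; the neural
models described in this report may instead learn patterns directly from the characters of
the URL.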

5.2. UML DIAGRAMS

Let's delve into the specifics of various UML diagrams:
5.2.1 USE CASE DIAGRAM

A use case diagram is a behavioral diagram in the Unified Modeling Language (UML)
that is derived from use-case analysis. It depicts the actors of a system, their goals
(represented as use cases), and any dependencies between those use cases. Its main
purposes are to show which system functions are performed for each actor and to clarify
the roles of the actors. Use cases are used during the requirements elicitation and
analysis phase to capture the capabilities of the system, while scenarios describe how
the system behaves in a concrete situation. Use cases describe what the system does,
whereas actors, who represent people or external systems, are located outside the
system. The set of use cases inside the system is enclosed by a system boundary.
Figure 5.2: Use Case Diagram

5.2.2. CLASS DIAGRAM

A class diagram is a static structure diagram in the Unified Modeling Language (UML)
that depicts the classes of a system, their attributes, and the relationships among the
classes. In architectural analysis it is used to identify classes that carry too many
responsibilities and to find possible ways of partitioning them. In addition to
helping developers create classes, the diagram establishes the links between classes. A
class in a class diagram represents a set of objects that share similar attributes,
operations, relationships, and constraints on their behavior. These diagrams play a
crucial role in understanding the architecture of a system, which makes object-oriented
modeling more efficient.
Figure 5.3: Class Diagram
5.2.3. SEQUENCE DIAGRAM

The sequence diagram is a kind of interaction diagram in the Unified Modeling Language
(UML) that shows how the components of a system interact with one another and in what
order the messages between them are exchanged. It is sometimes called an event diagram
or event scenario and is closely related to the message sequence chart. Software
engineers and business professionals can use sequence diagrams to better understand and
communicate the requirements of both new and existing systems.

Figure 5.4: Sequence Diagram
5.2.4. COLLABORATION DIAGRAM

The collaboration diagram, also called a communication diagram, shows the order
of method calls by numbering them. It depicts object interactions in a manner similar
to a sequence diagram but emphasizes the structure of the objects rather than the
timeline. Collaboration diagrams aid in understanding system behavior by showing the
sequence of method calls between objects.
Figure 5.5: Collaboration Diagram

5.2.5. DEPLOYMENT DIAGRAM

Deployment diagrams show how software artifacts are deployed onto hardware nodes
by describing the hardware and software elements that make up a deployment. In contrast
to most UML diagrams, which focus on logical components, deployment diagrams clarify the
hardware topology of a system and help system engineers organize the hardware and depict
the runtime processing nodes.
Figure 5.6: Deployment Diagram
5.2.6. COMPONENT DIAGRAM

A component diagram is a type of UML diagram that depicts the physical components
of a system, such as files and libraries. It shows the organization and relationships of
the components at a given point in time and thus serves as a static implementation view
of the system. Although a single diagram usually does not capture the whole system,
component diagrams support both forward and reverse engineering by providing a visual
representation of the system's components and their interactions.
Figure 5.7: Component Diagram

5.2.7. STATE CHART DIAGRAM

A state chart diagram, also called a state machine diagram, describes the states
that a system component passes through during its lifetime and the transitions between
those states. Each transition is triggered by an event and may be restricted by a
condition or accompanied by an action. Within the Unified Modeling Language (UML), state
chart diagrams use states, transitions, events, and related elements to illustrate how a
component reacts to the sequence of events that occur in a system, which aids in the
comprehension of system behavior.
Figure 5.8: State Chart Diagram
5.3. ER DIAGRAM:
The Entity-Relationship (ER) model offers a structured representation of a database
through an Entity Relationship Diagram (ER Diagram). The model defines the entities and
their relationships inside the system and acts as a blueprint for the database design.
Its fundamental components are entity sets and relationship sets.
An entity relationship diagram shows the relationships between entity sets graphically.
An entity set is a group of similar entities that share the same attributes. In the
context of database management systems (DBMS), an entity set corresponds to a table and
its attributes correspond to the columns of that table. The ER diagram therefore gives a
comprehensive overview of the logical structure of the database by showing the tables,
their attributes, and the relationships between them. This visual representation makes
the database structure easier to understand and is an efficient basis for building and
maintaining databases.
Figure 5.9: ER Diagram

CHAPTER-6
SYSTEM CODING AND IMPLEMENTATION

6. SYSTEM CODING AND IMPLEMENTATION
6.1 PROGRAMMING LANGUAGE AND LIBRARIES SELECTION:

Choosing the right programming language and libraries was essential for the
effective development and execution of our solution for identifying malicious URLs.
Python was selected as the main programming language after careful deliberation because
of its many libraries for data analytics and machine learning, its readability, and its
ease of maintenance. Python's broad ecosystem provides robust libraries such as
TensorFlow, Keras, and Scikit-learn, which are crucial for creating and refining machine
learning models for URL classification. Furthermore, Python's syntax facilitates rapid
prototyping and experimentation, so we can easily iterate on and improve our models.
6.1.1. LIBRARIES USED IN PYTHON

Pandas: Pandas is a popular open-source Python package for data science, data analysis,
and machine learning. It provides convenient data structures and functions for working
with structured data, which makes it an invaluable tool for preparing and analyzing
data. Pandas is built on top of NumPy, another essential Python library for array
manipulation and numerical computing.
NumPy: NumPy is an essential Python library that makes working with arrays and matrices
easier and offers mathematical functions for linear algebra and Fourier analysis.
Scientific computing and data analysis applications rely on it heavily as the foundation
for numerical computation.
Matplotlib: Matplotlib is a powerful Python plotting library that integrates with NumPy
arrays to produce high-quality visualizations for data exploration and presentation.
Its versatile object-oriented API can be used to create a wide variety of plots and
charts.
Scikit-learn: Scikit-learn, often referred to as sklearn, is a popular Python machine
learning library that offers efficient implementations of a wide range of supervised and
unsupervised learning algorithms. Its extensive collection of classification,
regression, and clustering techniques makes it suitable for a variety of machine
learning problems.

TensorFlow: TensorFlow is a Python library for fast numerical computation and deep
learning model construction. It provides the framework for building neural networks and
other ML models, either directly or through higher-level wrapper libraries built on top
of it.
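
Before a URL can be passed to a neural network, it has to be turned into a fixed-length
sequence of numbers. The following is a minimal sketch of one common way to do this,
character-level integer encoding with padding. It is written in plain Python so that it
runs without any of the libraries above; the vocabulary, the reserved indices, and the
names `VOCAB`, `MAX_LEN`, and `encode_url` are assumptions for illustration and are not
taken from the project code.

```python
import string
from typing import List

# Hypothetical vocabulary: printable ASCII characters.
# Index 0 is reserved for padding, index 1 for characters outside the vocabulary.
VOCAB = {ch: i + 2 for i, ch in enumerate(string.printable)}
MAX_LEN = 20  # illustrative value; a real model would likely use a longer sequence


def encode_url(url: str, max_len: int = MAX_LEN) -> List[int]:
    """Map each character to an integer, then truncate or zero-pad to max_len."""
    ids = [VOCAB.get(ch, 1) for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    print(encode_url("http://a.b"))
```

A sequence of this form is the kind of input that an embedding layer followed by
convolutional or recurrent layers typically expects.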
6.2 DEVELOPMENT ENVIRONMENT SETUP & CONFIGURATION

The development tools for this project included PyCharm, our integrated
development environment (IDE), and Streamlit, which let us create interactive web apps.
PyCharm offered a stable and feature-rich environment for organizing our project files,
debugging, and coding, which improved productivity and simplified our development
workflow.
Figure 6.1: Tool Stack for Project Implementation
We also used Jupyter Notebooks for coding and exploration, especially in the early
phases of data exploration and model building. Jupyter Notebooks provided an interactive
platform to execute Python code snippets, visualize data, and iterate on machine
learning methods. This flexibility allowed us to develop ideas quickly and to improve
our models in response to immediate feedback. Including Jupyter Notebooks in our
development process ensured a smooth transition from data exploration to model
deployment. Together, PyCharm, Streamlit, and Jupyter Notebooks gave us a comprehensive
set of tools for addressing the challenges of the project.

6.3. CODE
app.py
import logging

from flask import Flask, jsonify, request

from Malicious_url_functions import predict_url_maliciousness

app = Flask(__name__)

# Enable logging
logging.basicConfig(level=logging.DEBUG)


@app.route('/predict', methods=['POST'])
def predict():
    try:
        # The client sends a JSON body of the form {"url": "..."}
        url = request.json['url']
        logging.debug(f"Received URL: {url}")
        prediction = predict_url_maliciousness(url)
        logging.debug(f"Prediction: {prediction}")
        return jsonify({'prediction': prediction})
    except Exception as e:
        logging.error(f"Error occurred: {str(e)}")
        # Return an error status so that clients can tell failures from predictions
        return jsonify({'error': str(e)}), 500


if __name__ == '__main__':
    app.run(port=8888, debug=True)

Ui.py
import html

import requests
import streamlit as st

API_URL = 'http://localhost:8888/predict'


# Prediction function
def predict_url_maliciousness(url):
    """Send the URL to the Flask backend and return its prediction."""
    response = requests.post(API_URL, json={'url': url}, timeout=10)
    response.raise_for_status()
    return response.json()['prediction']


# Define the Streamlit app
def main():
    st.title('Malicious URL Detection')
    st.subheader('URL Maliciousness Prediction')
    user_input = st.text_input('Enter URL to check:')
    if st.button('Predict'):
        try:
            prediction_result = predict_url_maliciousness(user_input)
        except (requests.RequestException, KeyError, ValueError) as e:
            st.error(f'Prediction failed: {e}')
            return
        # st.write has no style argument, so the yellow text is rendered as HTML;
        # user input is escaped before it is embedded in the markup
        st.markdown(
            f"<span style='color: yellow'>The URL '{html.escape(user_input)}' "
            f"is predicted to be: {html.escape(str(prediction_result))}</span>",
            unsafe_allow_html=True,
        )


if __name__ == '__main__':
    main()

CHAPTER-7
SYSTEM TESTING
7. SYSTEM TESTING
7.1. SOFTWARE TESTING TECHNIQUES

Software testing is a critical step in assessing the quality of software products
and finding defects that need to be fixed. Although testing strives to meet its
objectives, it has limitations: it can show that defects are present but cannot prove
that none remain. Effective testing therefore requires adherence to predetermined
objectives.
7.1.1. GOALS
1. Evaluate work products such as user stories, designs, specifications, and code.
2. Verify that every requirement is satisfied.
3. Validate that the test object is complete and works as users and stakeholders
expect.
7.1.2. TEST CASE FRAMEWORK

The following tests can be applied to any technical product:
7.1.3. WHITE BOX TESTING

White box testing is a software testing approach that assesses the internal
structure and code of a software system. It is also referred to as structural testing
or clear box testing. It involves exercising the program with different inputs and
checking its internal behavior against the intended results. Because it evaluates the
internal workings of the product, creating test cases requires programming skills.
The main goal of white box testing is to verify the flow of inputs and outputs
through the program and to strengthen its security. The terms "white box," "transparent
box," and "clear box" all allude to the ability to see through the software's outer
shell into its inner workings. Developers usually carry out white box testing before
handing the program to the testing team, examining every line of code to find and fix
bugs.

Prior to release, developers conduct white box testing to ensure compliance with
requirements and address any identified issues. Test engineers do not participate in
fixing problems during this phase to prevent potential conflicts with other features.
Instead, they focus on continually identifying new flaws in the program.
White box testing encompasses various tests such as:
• Path testing
• Loop testing
• Condition testing
• Testing from the memory perspective
• Testing the performance of the program
7.1.4. BLACK BOX TESTING

Black box testing is a technique that evaluates the functionality of a software
application without knowledge of its internal code or implementation details. It is
concerned only with the inputs and outputs of the software and with its specifications
and requirements. Any software package can be subjected to black box testing, including
operating systems like Windows, databases like Oracle, websites like Google, and bespoke
apps. Testers focus only on inputs and outputs and do not consider how the underlying
code is implemented.
Black box testing looks for errors in a number of areas, such as:
1. Incorrect or missing functions.
2. Interface errors.
3. Errors in data structures.
4. Behavior or performance errors.
5. Initialization and termination errors.

7.2. STRATEGIES FOR SOFTWARE TESTING

• Unit Testing
• Integration Testing
• Validation Testing
• System Testing
• Security Testing
• Performance Testing
7.2.1. Unit Testing

Unit testing evaluates the smallest module of the software design. Using the
procedural design specification as a guide, it exercises the important control paths
within the boundaries of the module. Each unit is checked independently to ensure that
it operates correctly. Unit tests are written by software developers and sometimes by QA
personnel to verify the behavior of the code.
Effective unit testing can find coding flaws that might otherwise go undetected.
Test-Driven Development (TDD) makes unit testing an essential part of the software
development process. Unit testing is the first level of testing and comes before
integration testing and the later test levels. Although unit tests are usually
automated, manual testing remains an option for verifying a unit in isolation from
other code or functions.
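
As a minimal illustration of an automated unit test, the sketch below tests a small
pre-processing helper in isolation using Python's built-in unittest module. The helper
`normalize_url` is hypothetical and is not part of the project code; it only stands in for
the kind of small unit that can be tested independently.

```python
import unittest
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Hypothetical helper: trim whitespace, add a default scheme, lower-case the host."""
    url = url.strip()
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    parts = urlsplit(url)
    # The path, query, and fragment are case-sensitive and are left unchanged.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    )


class NormalizeUrlTest(unittest.TestCase):
    def test_adds_default_scheme(self):
        self.assertEqual(normalize_url("example.com"), "http://example.com")

    def test_lowercases_scheme_and_host_only(self):
        self.assertEqual(
            normalize_url("HTTPS://Example.COM/Path"), "https://example.com/Path"
        )

    def test_strips_surrounding_whitespace(self):
        self.assertEqual(normalize_url("  http://a.b/c  "), "http://a.b/c")
```

Saved to a file, the tests can be run with `python -m unittest <file name>`. Each test
checks one behavior of the unit, so a failure points directly at the faulty logic.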
7.2.2. Integration Testing

Integration testing builds up the program structure while checking for interface
problems. Unit-tested modules are combined into the program structure dictated by the
design. The individual software components are then logically joined and tested as a
group in order to find issues with the way they interact after integration.
Top-Down Integration:
Top-down integration incrementally constructs and tests the structure of a
program by integrating modules while moving downward through the control hierarchy,
starting from the main control program.

Bottom-up Integration:
Bottom-up integration starts with the construction and testing of atomic
modules, the lowest-level components of the product. Because modules are integrated from
the bottom up, the lower-level modules that a component needs are always available, and
stubs are not required.
7.2.3. Validation testing

Validation testing ensures that the software complies with user requirements and
business logic. Important application components are tested rigorously, with an
emphasis on verifying the specified logic or business scenarios.
By testing each crucial application component comprehensively, validation
testing ensures that the software is developed and validated to satisfy user or
customer requirements. It relies on business logic or scenarios that can be
independently verified and are provided to the testers.
7.2.4. System Testing

Computer-based systems are put through rigorous testing during system testing
to make sure that all system components are integrated correctly and meet goals. It
looks at a fully integrated software system to make sure that, from the user's point of
view, everything works properly and flows end to end.
Using integrated software and suitable hardware, system testing confirms the
entire operation of the system. To make sure the finished features and functionality
work as planned, it entails end-to-end testing and a thorough examination of each
module.
7.2.5. Security testing

Security testing finds vulnerabilities and threats in software applications in
order to protect them from malicious attacks. Its main goal is to locate any potential
loopholes or weaknesses in the software so that data remains secure and the application
remains usable. Security testing identifies possible security threats and helps
developers fix the underlying problems.
7.2.6. Performance Testing

Performance testing assesses a system's responsiveness and stability under
various workloads, with reliability, scalability, and resource efficiency as its main
concerns.
Performance Testing Methods:

Load testing is the simplest technique for evaluating a system's performance
under a specific expected load. Stress testing determines the maximum capacity of the
system and how it behaves when the load exceeds the anticipated maximum. Soak testing,
also referred to as endurance testing, assesses a system's performance under a
continuous load over an extended period while monitoring memory utilization to identify
problems such as memory leaks. Spike testing suddenly increases the number of users in
order to observe how the system copes with abrupt changes in workload.
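
The idea behind a simple load test can be sketched as follows: a fixed set of requests is
sent repeatedly and the response times are summarized. This is only an illustration under
stated assumptions. The function `classify_stub` is a stand-in for the real prediction
call, the harness is sequential and single-threaded, and all names are hypothetical.

```python
import statistics
import time


def classify_stub(url: str) -> str:
    """Stand-in for the real model call; it simply flags URLs containing '@'."""
    return "Malicious" if "@" in url else "Benign"


def measure_latency(classify, urls, repeats=100):
    """Call classify on every URL repeats times and summarize the latencies in seconds."""
    samples = []
    for _ in range(repeats):
        for url in urls:
            start = time.perf_counter()
            classify(url)
            samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "requests": len(samples),
        "mean_s": statistics.mean(samples),
        # Approximate 95th percentile taken from the sorted samples
        "p95_s": samples[int(0.95 * (len(samples) - 1))],
        "max_s": samples[-1],
    }


if __name__ == "__main__":
    print(measure_latency(classify_stub, ["http://a.com", "http://user@b.com"]))
```

Replacing the stub with a call to the deployed prediction service, and increasing the
request volume or issuing requests concurrently, would move this sketch toward the stress
and spike tests described above.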
7.3 TEST CASES:

Test Case  Input                         Output                 Expected Output  Actual Output  Pass/Fail
1          Malicious URL                 Classification Result  Malicious        Malicious      Pass
2          Legitimate URL                Classification Result  Benign           Benign         Pass
3          URL with Suspicious Patterns  Classification Result  Malicious        Malicious      Pass

Test Cases of Model building:

S. No  Test Cases             Input                          Expected Output        Actual Output          Pass/Fail
1      Data Collection        Malicious and Legitimate URLs  Dataset Created        Dataset Created        Pass
2      Data Preparation       Feature Extraction             Cleaned Data           Cleaned Data           Pass
3      Model Building         CNN-LSTM Algorithm             Trained Model          Trained Model          Pass
4      Model Evaluation       Test Dataset                   Evaluation Metrics     Evaluation Metrics     Pass
5      Hyperparameter Tuning  Grid Search                    Optimized Model        Optimized Model        Pass
6      Prediction             New URL                        Classification Result  Classification Result  Pass
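
The evaluation metrics referred to in the model evaluation test case can be computed from
the true and predicted labels. The sketch below uses the standard definitions of accuracy,
precision, recall, and F1 score for a binary problem in plain Python. The function name
and the label strings are assumptions for illustration, and no project results are
reproduced here.

```python
def classification_metrics(y_true, y_pred, positive="Malicious"):
    """Compute accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up labels, used only to show how the function is called
    truth = ["Malicious", "Malicious", "Benign", "Benign", "Malicious"]
    preds = ["Malicious", "Benign", "Benign", "Malicious", "Malicious"]
    print(classification_metrics(truth, preds))
```

Precision and recall are reported alongside accuracy because a dataset of URLs is often
imbalanced, in which case accuracy alone can be misleading.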

CHAPTER-8
RESULTS
8. RESULTS
OUTPUT SCREENSHOTS WITH DESCRIPTION:
HOME PAGE:
The home page of the web application, as first seen by the user.
Figure 8.1: Home Page
ENTER URL PAGE:
The user enters the URL to be checked.
Figure 8.2: Enter URL Page

PREDICTION PAGE:
The model predicts whether the entered URL is malicious or benign.
Figure 8.3: Prediction Page

CHAPTER 9
CONCLUSION
AND
FUTURE ENHANCEMENTS
9.0. CONCLUSION AND FUTURE ENHANCEMENT

9.1. CONCLUSION
In conclusion, this project offers a solid data analytics approach that uses cutting edge
machine learning methods like ANN, CNN, and LSTM networks to detect and mitigate
harmful URLs early in the cybersecurity process. In terms of identifying benign and
malicious URLs, the suggested CNN-LSTM model performs better than conventional
approaches, displaying improved accuracy, precision, and F1 score. Organizations and
people may proactively protect against new cyberthreats like malware dissemination and
phishing attempts by incorporating this data-driven technique into web page
classification systems. This allows for real-time URL categorization and threat
mitigation. These results highlight how data analytics-driven tactics may strengthen
cybersecurity defenses and shield digital assets from changing threats in an era of
growing digital interconnection.
9.2. FUTURE ENHANCEMENT
Future work will focus on improving the performance metrics by refining the feature
extraction approach and fine-tuning the hyperparameters of the CNN-LSTM model. Adding
more benign and malicious URLs to the dataset will improve the model's ability to
generalize. For real-time URL categorization and threat mitigation in large-scale online
environments, it will be important to explore multimodal analysis methods, to optimize
the scalability and efficiency of the model, and to provide mechanisms for continual
learning and adaptation to emerging cyber threats. In addition, fostering collaborative
defense mechanisms, strengthening adversarial robustness, and improving the
interpretability and explainability of the model's decisions will further advance the
proactive detection and mitigation of malicious URLs and lead to more robust and
effective cybersecurity defenses.

REFERENCES
REFERENCES
[1] Yue Zhang, Jason Hong, Lorrie Cranor, “Cantina: A Content-Based Approach to
Detecting Phishing WebSites,” in Proc. of International Conference on World Wide Web,
WWW 2007, Banff, Alberta, Canada, May. DBLP, 639-648, 2007.
[2] Mahmoud Khonji, Youssef Iraqi, Andrew Jones, “Phishing Detection: A Literature
Survey,” IEEE Communications Surveys & Tutorials, 15(4), 2091-2121, 2013.
[3] Lance Spitzner, Honeypots: tracking hackers, Hacker, Boston, MA, USA, 2003.
[4] Jiuxin Cao, Bo Mao, Junzhou Luo, Bo Liu, “A Phishing web Pages Detection
Algorithm Based on Nested Structure of Earth Mover’s Distance,” Chinese Journal of
Computers, 32(5), 922-929, 2009.
[5] Shouxu Jiang, Jianzhong Li, “A Reputation-based Trust Mechanism for P2P
E-commerce Systems,” Journal of Software, 18(10), 2551-2563, 2007.
[6] Hongzhou Sha, Qingyun Liu, Tingwen Liu, Zhou Zhou, Li Guo, Binxing Fang,
“Survey on Malicious Webpage Detection Research,” Chinese Journal of Computers,
39(3), 529-542, 2016.
[7] Sahoo D, Liu C, Hoi S C H, “Malicious URL Detection using Machine Learning: A
Survey,” 2017.
[8] Pawan Prakash, Manish Kumar, Ramana Kompella, Minaxi Gupta, “Phishnet:
predictive blacklisting to detect phishing attacks,” in Proc. of 2010 Proceedings IEEE
INFOCOM, 1-5, 2010.
[9] Dharmaraj R Patil, Jayantrao Patil, “Survey on Malicious Web Pages Detection
Techniques,” International Journal of u- and e- Service, Science and Technology, vol. 8,
no. 5, pp. 195–206, 2015.
[10] Sujata Garera, Niels Provos, Monica Chew, Aviel D. Rubin, “A framework for
detection and measurement of phishing attacks,” in Proc. of the 2007 ACM workshop on
Recurring malcode. ACM, pp. 1–8, 2007.
[11] Mahmoud Khonji, Youssef Iraqi, Andy Jones, “Phishing Detection: A Literature
Survey,” IEEE Communications Surveys & Tutorials, vol. 15, no. 4, pp. 2091–2121,
2013.
[12] Raj Nepali, Yong Wang, “You Look Suspicious!!: Leveraging Visible Attributes to
Classify Malicious Short URLs on Twitter,” in Proc. of 2016 49th Hawaii International
Conference on System Sciences (HICSS). IEEE, pp. 2648–2655, 2016.
[13] Masahiro Kuyama, Yoshio Kakizaki, Ryoichi Sasaki, “Method for Detecting a
Malicious Domain by Using WHOIS and DNS Features,” in Proc. of The Third
International Conference on Digital Security and Forensics (Digital Sec2016), pp. 74-80,
[14] Liu G, Qiu B, Liu W, “Automatic Detection of Phishing Target from Phishing
Webpage,” in Proc. of International Conference on Pattern Recognition. IEEE Computer
Society, 4153-4156, 2010.
[15] Ma J, Saul L K, Savage S, GM Voelker, “Beyond blacklists: learning to detect
malicious web sites from suspicious URLs,” in Proc. of ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, Paris, France, June 28 - July.
DBLP, 1245-1254, 2009.
[16] Ma J, Saul L K, Savage S, GM Voelker, “Identifying suspicious URLs: an
application of large-scale online learning,” in Proc. of International Conference on
Machine Learning. ACM, 681-688, 2009.
[17] Ma J, Saul L K, Savage S, GM Voelker, “Learning to detect malicious URLs,” ACM
Transactions on Intelligent Systems & Technology, 2(3), 1-24, 2011.
[18] Xuejian Wang, Lantao Yu, Kan Ren, Guanyu Tao, Weinan Zhang, Yong Yu, Jun
Wang, “Dynamic Attention Deep Model for Article Recommendation by Learning
Human Editors' Demonstration,” in Proc. of ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. ACM, 2051-2059, 2017.
[19] Mnih V, Heess N, Graves A, K Kavukcuoglu, “Recurrent models of visual attention,”
in Proc. of NIPS'14 Proceedings of the 27th International Conference on Neural
Information Processing Systems, 2204-2212, 2014.
[20] Ming Sun, Anirudh Raju, George Tucker, Sankaran Panchapagesan, Gengshen Fu,
“Max-pooling loss training of long short-term memory networks for small-footprint
keyword spotting,” in Proc. of Spoken Language Technology Workshop. IEEE, 474-480,
