B2 Salma Fayaz
On
ADVANCEMENTS IN CYBERSECURITY: A
DATA ANALYTICS APPROACH FOR
PROACTIVE DETECTION AND MITIGATION OF
MALICIOUS URLS
Submitted in partial fulfillment of the
requirement for the award of the degree of
BACHELOR OF TECHNOLOGY
in
ARTIFICIAL INTELLIGENCE & DATA SCIENCE
By
S. SALMA FAYAZ 20701A3039
S. NOUSHIN FATHIMA 20701A3033
V. HANISHA 20701A3010
Under the esteemed guidance of
Submitted to
2023-2024
Department of Artificial Intelligence & Data Science
Annamacharya Institute of Technology and Sciences
(Affiliated to J.N.T. University, Anantapur)
New Boyanapalli, Rajampet-516126 Annamayya (Dt), A.P
CERTIFICATE
This is to certify that the project report entitled “Advancements in
Cybersecurity: A Data Analytics Approach for Proactive Detection and Mitigation
of Malicious URLs” done by S. SALMA FAYAZ, 20701A3039 in partial
fulfillment of the requirements for the award of the Degree of Bachelor of Technology in
“Artificial Intelligence and Data Science”, is a record of bonafide work carried out
by her during the academic year 2023-2024.
I, Miss S. SALMA FAYAZ, bearing Roll Number 20701A3039, hereby declare that the
project report entitled ADVANCEMENTS IN CYBERSECURITY: A DATA ANALYTICS
APPROACH FOR PROACTIVE DETECTION AND MITIGATION OF MALICIOUS URLS,
carried out under the guidance of Dr. P. Phanindra Kumar Reddy, M. Tech, Ph.D., Department of
Artificial Intelligence and Data Science, is submitted in partial fulfilment of the requirements
for the award of the degree of Bachelor of Technology in Artificial Intelligence and Data
Science.
This is a record of bonafide work carried out by me, and the results embodied in this project
report have not been reproduced or copied from any source. The results embodied in this project
report have not been submitted to any other University or Institute for the award of any other
Degree or Diploma.
An endeavor over a long period can be successful only with the advice of many
well-wishers. We take this opportunity to express our deep gratitude and appreciation
to all those who encouraged us towards the successful completion of the project work.
Our heartfelt thanks to our Guide, Dr. P. Phanindra Kumar Reddy, M. Tech,
Ph.D., Head of the Department of Artificial Intelligence and Data
Science, Annamacharya Institute of Technology and Sciences, Rajampet, for his
valuable guidance and suggestions on analysis and testing throughout the period of
the project work.
We wish to express sincere thanks and gratitude to Dr. P. Phanindra Kumar
Reddy, Head of the Department of Artificial Intelligence and Data Science, for his
encouragement and for the facilities that were offered to us for carrying out this project.
We take this opportunity to offer our gratitude to our Principal, Dr. S.M.V.
Narayana, for providing all sorts of help during the project work.
We are very thankful to Dr. C. Gangi Reddy, Honorary Secretary of the
Annamacharya Educational Trust, for his help in providing good facilities in our
college.
We express our sincere thanks to all faculty members of the Artificial
Intelligence and Data Science Department, batch-mates, friends, and lab
technicians, who have helped us to complete the project work successfully.
Finally, we express our sincere thanks to our parents, who have provided their
heartfelt support and encouragement towards the successful completion of this
project.
PROJECT ASSOCIATES
S. SALMA FAYAZ
S. NOUSHIN FATHIMA
V. HANISHA
TABLE OF CONTENTS
TITLE
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
1.1. Emergence of Malicious URLs
1.2. Rationale for Research
1.3. Challenges in Combatting Malicious URLs
2. LITERATURE SURVEY
3. SYSTEM ANALYSIS
3.1. Existing System
3.1.1. Disadvantages
3.2. Proposed System
3.2.1. Advantages
3.3. Modules used in proposed system
3.3.1. User
3.3.2. System
3.3.3. Algorithms Used
3.3.3.1. ANN-LSTM
3.3.3.2. CNN-LSTM
4. SYSTEM REQUIREMENTS SPECIFICATIONS
4.1. Software requirements
4.2. Hardware requirements
4.3. Feasibility study
4.3.1. Economic feasibility
4.3.2. Technical feasibility
4.3.3. Behavioral feasibility
4.3.4. Benefits of doing feasibility study
4.4. Functional and non-functional requirements
4.4.1. Functional Requirements
4.4.2. Non-Functional Requirements
5. SYSTEM DESIGN
5.1. Architecture Design
5.2. UML diagrams
5.2.1. Use case diagram
5.2.2. Class diagram
5.2.3. Sequence diagram
5.2.4. Collaboration diagram
5.2.5. Deployment diagram
5.2.6. Component diagram
5.2.7. State Chart diagram
5.3. ER Diagram
6. SYSTEM CODING AND IMPLEMENTATION
6.1. Programming Language and Libraries Selection
6.1.1. Libraries used in Python
6.2. Development Environment Setup and Configuration
6.3. Code
7. SYSTEM TESTING
7.1. Software testing techniques
7.1.1. Goals
7.1.2. Test Case Framework
7.1.3. White box testing
7.1.4. Black box testing
7.2. Strategies for software testing
7.2.1. Unit testing
7.2.2. Integration testing
7.2.3. Validation testing
7.2.4. System testing
7.2.5. Security testing
7.2.6. Performance evaluation
7.3. Test Cases
8. RESULTS
9. CONCLUSION AND FUTURE ENHANCEMENTS
REFERENCES
PLAGIARISM REPORT
LIST OF FIGURES & TABLES
Keywords: Phishing detection, proactive detection, web security,
URL classification, malware, neural networks, machine
learning, data analytics, CNN-LSTM, ANN-LSTM,
cybersecurity.
CHAPTER-1
INTRODUCTION
1. INTRODUCTION
2. LITERATURE SURVEY
TITLE: Max-pooling Loss Training of Long Short-Term Memory Networks for Small-
footprint Keyword Spotting
AUTHORS: Ming Sun, Anirudh Raju, George Tucker, Sankaran Panchapagesan,
Gengshen Fu
ABSTRACT: This paper presents a max-pooling based loss function for training Long
Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS) with
low CPU, memory, and latency requirements. In the suggested approach, training can
start from a network trained with cross-entropy loss, and max-pooling loss training is
then used to further optimize the system. A technique based on posterior smoothing is
used to evaluate keyword spotting performance. Empirical results show that LSTM
models trained with either max-pooling loss or cross-entropy loss perform better than a
baseline feed-forward Deep Neural Network (DNN) trained with cross-entropy loss.
LSTM models trained with max-pooling loss outperform the baseline DNN whether they
are initialized randomly or from a cross-entropy pre-trained network, with a relative
reduction of up to 67.6% in the Area Under the Curve (AUC) measure, for which the
paper treats a lower value as better.
3. SYSTEM ANALYSIS
System analysis is a comprehensive process that examines the computer data, project
data, algorithm data, and other internal and external data relevant to the proposed
research. It involves a range of phases, methodologies, functions, and entities, and it
relies on a group of systematic techniques to determine the requirements for project
design. In this project, system analysis covered the functional and non-functional
requirements for the design of the proposed system. To develop a logical model of the
system, the analysis used a number of tools, including class diagrams, sequence
diagrams, data flow diagrams, and data dictionaries. It also reviewed various
publications relevant to the project's work.
In the existing system, the majority of research efforts have gone into
improving CNN-LSTM model performance using methods such as
hyperparameter tuning, data augmentation, and attention mechanisms. Although
these efforts have significantly increased the accuracy, precision, and recall of
malicious URL identification, little is known about the possible advantages of
ANN-LSTM architectures. This research gap shows that a thorough comparison of
CNN-LSTM and ANN-LSTM models is needed to identify the best strategy for
proactive threat detection and mitigation in cybersecurity applications.
3.1.1 DISADVANTAGES:
The existing approach has several drawbacks that stem from its sole dependence on
CNN-LSTM models. First, CNN-LSTM models may have trouble identifying the
intricate temporal correlations in URL sequences, which could hinder their
capacity to correctly identify some malicious URL patterns. Furthermore, the absence of
ANN-LSTM models forgoes the benefits of using artificial neural networks to
discern complex patterns and correlations in URL data, which may compromise the
overall efficacy of malicious URL detection systems. Finally, the dearth of thorough
comparison studies between CNN-LSTM and ANN-LSTM models hampers research on
the advantages and disadvantages of each technique and impedes progress in
cybersecurity.
3.2.1 ADVANTAGES:
3.3.2 SYSTEM:
3.3.3.2.CNN-LSTM:
An important component of our suggested solution for malicious URL identification is
the Artificial Neural Network-Long Short-Term Memory (ANN-LSTM) algorithm. The
algorithm in question was carefully selected due to its ability to identify the sequential
patterns and long-term relationships included in URL data. LSTM networks are
Our proposed approach includes a thorough comparison of the efficacy and efficiency
of the CNN-LSTM and ANN-LSTM models in malicious URL identification. The
comparison reviews several performance indicators, including accuracy, recall,
precision, F1 score, and receiver operating characteristic (ROC) curves, in order to
determine the best design for detecting malicious URLs in various contexts and
conditions. The efficiency of each model will also be evaluated in terms of resource
consumption, memory footprint, and computational complexity. By weighing the
trade-offs between accuracy and efficiency, we want to determine which design is
most suited for deployment in practical settings. To that end, we will also take into
account variables such as training duration, inference speed, and scalability.
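The threshold-based indicators named above (accuracy, precision, recall, and F1 score) follow directly from the counts of true and false positives and negatives. The following is a minimal sketch in plain Python, not the project's evaluation code; the function name `classification_metrics` and the label convention (1 = malicious, 0 = benign) are assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels.

    Assumed convention (not taken from the report): 1 = malicious, 0 = benign.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malicious, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign, passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malicious, missed

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative labels only; these are not results from the project.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print(classification_metrics(y_true, y_pred))
```

Applying the same function to the predictions of both models on one held-out test set would give a like-for-like comparison. ROC curves additionally need the models' predicted scores rather than hard labels, so they are not covered by this sketch.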
• IDE : PyCharm
• Framework : Streamlit
• Monitor : SVGA
• Economic Feasibility
• Technical Feasibility
• Behavioral Feasibility
1. Examining system requirements in detail is made easier by the analysis step of the
research, which is the first stage of the software development life cycle.
2. Assessing and evaluating risk concerns related to the design and execution of systems.
3. Giving advice on possible obstacles and how to mitigate them to facilitate risk
planning.
5. Feasibility studies support the planning process for training developers to implement
the system.
1) With respect to such an activity, emails should be sent no more than 12 hours
afterwards.
2) Protection of data from unwanted access and confidentiality are two aspects of
security.
3) Easy system upkeep and future updates are key components of maintainability.
5. SYSTEM DESIGN
5.1. ARCHITECTURE DESIGN
The architecture diagram in Figure 5.1 delineates the systematic approach to gathering,
refining, and categorizing URLs using advanced deep learning classifiers. Beginning
with data collection from diverse sources, including search engines, the process
encompasses meticulous data preparation, involving cleaning and feature extraction
techniques. Three specialized neural network models are then employed to discern
phishing from legitimate URLs based on intricate patterns. Classification results from
these models facilitate rapid identification of malicious and benign URLs, empowering
stakeholders to bolster cyber defenses. Ultimately, this architecture aims to automate
URL security, enhancing resilience against evolving cyber threats through efficient and
accurate detection mechanisms.
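One common way to prepare URLs for sequence models is to clean each string and map its characters to fixed-length integer sequences. The sketch below illustrates that idea under stated assumptions; it is not the project's preprocessing code. The names `VOCAB`, `MAX_LEN`, `clean_url`, and `encode_url`, the vocabulary of printable ASCII characters, and the length of 200 are all hypothetical choices.

```python
import string

# Hypothetical vocabulary: printable ASCII characters, indexed from 1.
# Index 0 is reserved for padding and for characters outside the vocabulary.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable)}
MAX_LEN = 200  # assumed fixed input length, not a value from the report


def clean_url(url):
    """Trim surrounding whitespace and lowercase the URL.

    Lowercasing is a simplifying assumption; URL paths can be case-sensitive.
    """
    return url.strip().lower()


def encode_url(url, max_len=MAX_LEN):
    """Map each character to an integer ID, then truncate or right-pad with 0."""
    ids = [VOCAB.get(ch, 0) for ch in clean_url(url)][:max_len]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    encoded = encode_url("http://example.com/login")
    print(len(encoded), encoded[:10])
```

Sequences of this shape can be fed to an embedding layer in front of a CNN-LSTM or ANN-LSTM classifier; the fixed length lets URLs be batched together.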
5.3. ER DIAGRAM:
Through the use of an Entity Relationship Diagram (ER Diagram), the Entity-
Relationship (ER) paradigm offers a structured representation of a database. This model
defines the entities and their connections inside the system, acting as a template for the
database design. Entity sets and relationship sets, which define the entities and their
relationships in the database, are the fundamental components of the Entity Relationship
(ER) paradigm.
An entity relationship diagram shows the relationships between entity sets graphically.
An entity set is a group of similar entities that share the same attributes.
In the context of database management systems (DBMS), an entity corresponds to a
table or to an attribute of a table. By showing the relationships between tables and their
attributes, the ER diagram provides a complete view of the database's logical structure.
This visual representation makes it easier to comprehend the database structure and the
relationships that support it, which in turn helps in building and maintaining databases
efficiently.
NumPy: An essential Python library that makes working with arrays and matrices easier
and offers mathematical functions for Fourier analysis and linear algebra.
Scientific computing and data analysis applications rely on it heavily as the foundation
for numerical computing operations.
Matplotlib: A powerful Python plotting library that integrates with NumPy
arrays to produce high-quality visuals for data exploration and display.
Its versatile object-oriented API can be used to create a wide range of plots and
charts.
TensorFlow: A Python package for building deep learning models and for fast numerical
computation. It provides the framework for building neural networks and
other ML models, whether used directly or through higher-level wrapper libraries
developed on top of TensorFlow.
We also used Jupyter Notebooks for coding and exploration, especially in the early
phases of data exploration and model building. Jupyter Notebooks provided
an interactive platform to execute Python code snippets, visualize data, and iterate on
machine learning methods. This flexibility allowed us to develop ideas quickly and
improve our models in response to immediate feedback. Including Jupyter Notebooks
in our development process kept the flow from data exploration to model deployment
uninterrupted. Together, PyCharm, Streamlit, and Jupyter Notebooks gave us a
complete set of tools for addressing the challenges involved in our project.
6.3. CODE
app.py
import logging

from flask import Flask, request, jsonify
from Malicious_url_functions import predict_url_maliciousness

app = Flask(__name__)

# Enable logging
logging.basicConfig(level=logging.DEBUG)


@app.route('/predict', methods=['POST'])
def predict():
    """Return the model's prediction for the URL given in the JSON request body."""
    try:
        url = request.json['url']
        logging.debug(f"Received URL: {url}")
        prediction = predict_url_maliciousness(url)
        logging.debug(f"Prediction: {prediction}")
        return jsonify({'prediction': prediction})
    except Exception as e:
        logging.error(f"Error occurred: {str(e)}")
        # Return an error status so clients can tell failures from predictions
        return jsonify({'error': str(e)}), 500


if __name__ == '__main__':
    app.run(port=8888, debug=True)
Ui.py
import html

import requests
import streamlit as st


# Prediction function
def predict_url_maliciousness(url):
    """Send the URL to the Flask prediction service and return its verdict."""
    response = requests.post('http://localhost:8888/predict', json={'url': url})
    result = response.json()
    # On failure the service returns an 'error' key instead of 'prediction'
    return result.get('prediction', f"Error: {result.get('error', 'unknown')}")


def main():
    # Prediction
    st.subheader('URL Maliciousness Prediction')
    user_input = st.text_input('Enter URL to check:')
    if st.button('Predict'):
        prediction_result = predict_url_maliciousness(user_input)
        # st.write has no style argument, so the yellow font colour is set
        # with inline HTML; the user-supplied text is escaped first
        st.markdown(
            f"<span style='color: yellow'>The URL '{html.escape(user_input)}' "
            f"is predicted to be: {html.escape(str(prediction_result))}</span>",
            unsafe_allow_html=True,
        )


if __name__ == '__main__':
    main()
7. SYSTEM TESTING
7.1.1. GOALS
1. Compatibility of work products with user stories, designs, specifications, and
code.
3. The test object fulfills and meets user and stakeholder requirements in terms of
completion and expectations.
The main goal of white box testing is to verify the flow of inputs and outputs
through the program's internal structure while improving its security. The phrases
"white box," "transparent box," and "clear box" all allude to the ability to see through
the software's outer shell into its inner workings. Developers usually carry out white
box testing before delivering the program to the testing team. It entails examining
every line of code to find and fix bugs and to ensure compliance with requirements
prior to release. Test engineers do not participate in
fixing problems during this phase, to prevent potential conflicts with other features.
Instead, they focus on continually identifying new flaws in the program.
• Path testing
• Loop testing
• Condition evaluation
Black box testing looks for possible errors in a number of areas, such as:
• Integrity Checks
• Validation Examination
• System Evaluation
• Security Checks
• Performance Evaluation
Effective unit testing can find coding flaws that might otherwise go undetected.
Test-Driven Development (TDD) includes unit testing as an essential part of the
software development process. Unit testing is the first stage of testing and comes
before integration testing and further tests. A unit test verifies that a unit works
correctly in isolation from other code or operations. Unit tests are usually automated,
although manual testing remains an alternative.
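As an illustration of testing one unit in isolation, the sketch below uses Python's standard `unittest` module. The function under test, `has_ip_host`, is a hypothetical URL feature helper introduced only for this example; it is not part of the project code shown in Section 6.3.

```python
import ipaddress
import unittest
from urllib.parse import urlparse


def has_ip_host(url):
    """Hypothetical unit under test: True if the URL's host is an IP address."""
    host = urlparse(url).hostname
    if host is None:
        # No scheme or no network location, so no host could be parsed
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


class HasIpHostTest(unittest.TestCase):
    """Each test checks one behaviour of the unit, independent of other code."""

    def test_ip_host_is_detected(self):
        self.assertTrue(has_ip_host("http://192.168.0.1/login"))

    def test_domain_host_is_not_flagged(self):
        self.assertFalse(has_ip_host("https://example.com/index.html"))

    def test_missing_scheme_is_not_flagged(self):
        self.assertFalse(has_ip_host("example.com/path"))
```

Saved in a file such as `test_url_features.py` (an assumed name), the tests can be run with `python -m unittest test_url_features`.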
Top-Down Integration:
Top-down integration combines modules incrementally to construct
and test the program's structure, moving down the control hierarchy from the
main control (or index) program.
Bottom-up Integration:
Bottom-up integration starts with the construction and testing of atomic
modules, the lowest-level components of the product, and integrates modules
upward. Because the lower-level modules are always available, there is no need for
stubs.
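In top-down integration, a lower-level module that is not yet integrated is typically replaced by a stand-in, called a stub, so that the upper-level control logic can be tested first. The sketch below illustrates the idea; `check_url` and `stub_predictor` are hypothetical names for this example and do not come from the project code.

```python
def check_url(url, predictor):
    """Upper-level control routine: validate the input, then delegate.

    `predictor` is any callable that takes a URL string and returns a label.
    """
    url = url.strip()
    if not url:
        return "invalid input"
    return predictor(url)


def stub_predictor(url):
    """Stub standing in for a lower-level module that is not integrated yet."""
    return "benign"


if __name__ == "__main__":
    # The control routine is exercised before the real model is plugged in.
    print(check_url("http://example.com", stub_predictor))
    print(check_url("   ", stub_predictor))
```

Once the real prediction module is ready, it is passed in place of the stub and the same checks are repeated.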
Using integrated software and suitable hardware, system testing confirms the
entire operation of the system. To make sure the finished features and functionality
work as planned, it entails end-to-end testing and a thorough examination of each
module.
security testing.
8. RESULTS
OUTPUT SCREEN SHOTS WITH DESCRIPTION:
HOME PAGE:
PREDICTION PAGE:
9. CONCLUSION AND FUTURE ENHANCEMENTS
In conclusion, this project offers a data analytics approach that uses
machine learning methods such as ANN, CNN, and LSTM networks to detect and mitigate
malicious URLs proactively. In distinguishing benign from
malicious URLs, the suggested CNN-LSTM model performs better than conventional
approaches, displaying improved accuracy, precision, and F1 score. By incorporating
this data-driven technique into web page classification systems, organizations and
individuals can proactively protect against new cyberthreats such as malware
dissemination and phishing attempts, because it allows for real-time URL categorization
and threat mitigation. These results highlight how data analytics-driven tactics may
strengthen cybersecurity defenses and shield digital assets from changing threats in an
era of growing digital interconnection.
REFERENCES
[1] Yue Zhang, Jason Hong, Lorrie Cranor, “Cantina: A Content-Based Approach to
Detecting Phishing Web Sites,” in Proc. of International Conference on World Wide Web,
WWW 2007, Banff, Alberta, Canada, May, 639-648, 2007.
[2] Mahmoud Khonji, Youssef Iraqi, Andrew Jones, “Phishing Detection: A Literature
Survey,” IEEE Communications Surveys & Tutorials, 15(4), 2091-2121, 2013.
[3] Lance Spitzner, Honeypots: tracking hackers, Hacker, Boston, MA, USA, 2003.
[4] Jiuxin Cao, Bo Mao, Junzhou Luo, Bo Liu, “A Phishing web Pages Detection
Algorithm Based on Nested Structure of Earth Mover’s Distance,” Chinese Journal of
Computers, 32(5), 922-929, 2009.
[5] Shouxu Jiang, Jianzhong Li, “A Reputation-based Trust Mechanism for P2P E-
commerce Systems,” Journal of Software, 18(10), 2551-2563, 2007.
[6] Hongzhou Sha, Qingyun Liu, Tingwen Liu, Zhou Zhou, Li Guo, Binxing Fang,
“Survey on Malicious Webpage Detection Research,” Chinese Journal of Computers,
39(3), 529-542, 2016.
[7] Sahoo D, Liu C, Hoi S C H, “Malicious URL Detection using Machine Learning: A
Survey,” 2017.
[8] Pawan Prakash, Manish Kumar, Ramana Kompella, Minaxi Gupta, “Phishnet:
predictive blacklisting to detect phishing attacks,” in Proc. of 2010 Proceedings IEEE
INFOCOM, 1-5, 2010.
[9] Dharmaraj R Patil, Jayantrao Patil, “Survey on Malicious Web Pages Detection
Techniques,” International Journal of u- and e- Service, Science and Technology, vol. 8,
no. 5, pp. 195-206, 2015.
[10] Sujata Garera, Niels Provos, Monica Chew, Aviel D. Rubin, “A framework for
detection and measurement of phishing attacks,” in Proc. of the 2007 ACM workshop on
Recurring malcode. ACM, pp. 1-8, 2007.
[11] Mahmoud Khonji, Youssef Iraqi, Andy Jones, “Phishing Detection: A Literature
Survey,” IEEE Communications Surveys & Tutorials, vol. 15, no. 4, pp. 2091-2121,
2013.
[12] Raj Nepali, Yong Wang, “You Look Suspicious!!: Leveraging Visible Attributes to
Classify Malicious Short URLs on Twitter,” in Proc. of 2016 49th Hawaii International
Conference on System Sciences (HICSS). IEEE, pp. 2648-2655, 2016.
[13] Masahiro Kuyama, Yoshio Kakizaki, Ryoichi Sasaki, “Method for Detecting a
Malicious Domain by Using WHOIS and DNS Features,” in Proc. of The Third
International Conference on Digital Security and Forensics (Digital Sec2016), pp. 74-80,
2016.
[14] Liu G, Qiu B, Liu W, “Automatic Detection of Phishing Target from Phishing
Webpage,” in Proc. of International Conference on Pattern Recognition. IEEE Computer
Society, 4153-4156, 2010.
[18] Xuejian Wang, Lantao Yu, Kan Ren, Guanyu Tao, Weinan Zhang, Yong Yu, Jun
Wang, “Dynamic Attention Deep Model for Article Recommendation by Learning
Human Editors' Demonstration,” in Proc. of ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. ACM, 2051-2059, 2017.
[20] Ming Sun, Anirudh Raju, George Tucker, Sankaran Panchapagesan, Gengshen Fu,
“Max-pooling loss training of long short-term memory networks for small-footprint
keyword spotting,” in Proc. of Spoken Language Technology Workshop. IEEE, 474-480,
2017.