
Department of Computer Science

Lok Nayak Jai Prakash Institute of Technology Chapra-84130


(2019-2023)

Project Report
On
“Hate Speech Detection”

under the guidance of


Dr. Chanchal Suman

Submitted by :-

------------------------------

Team Members
S.no Name Registration no. College Roll no.
01 Shashi Shekhar Sharma 19105117002 2k19-CSE-53
02 Vishal Kumar 19105117006 2k19-CSE-45
03 Rohit Gupta 19105117030 2k19-CSE-20
04 Ankit Raj 19105117055 2k19-CSE-58
ACKNOWLEDGEMENT

We would like to express our sincere gratitude to our supervisor, Dr. Chanchal Suman, for providing invaluable guidance, comments, and suggestions throughout the course of the project.

We deeply express our sincere thanks to our Department of CSE for encouraging and allowing us to present the project on the topic “Hate Speech Detection” in partial fulfillment of the requirements leading to the award of the B.Tech degree.

We take this opportunity to thank all our lecturers who have directly or indirectly helped with our project. We pay our respects and love to our parents and all other family members and friends for their love and encouragement throughout our project. Last but not least, we express our thanks to our friends for their cooperation and support.
INDEX

1. Introduction
2. Hardware & Software required
3. Block Diagram
4. Aims & Objective
5. Data Collection
6. Data Cleaning
7. Data Analysis and Exploration
8. Data Modelling
9. Optimization and Deployment
10. Conclusion
11. Future scope
Introduction

Hate speech is one of the serious issues we see on social media platforms like Twitter and Facebook daily. There is no legal definition of hate speech, because people’s opinions cannot easily be classified as hateful or offensive. Nevertheless, the United Nations defines hate speech as any type of verbal, written or behavioural communication that attacks or uses discriminatory language regarding a person or a group of people on the basis of their identity, such as religion, ethnicity, nationality, race, color, ancestry, gender or any other identity factor.

On social media platforms, an uncontrollable number of comments and posts are issued every second, which makes it impossible to trace or control the content of such platforms. Therefore, social platforms face the problem of limiting these posts while balancing freedom of speech.

The problem of hate speech in social networks is technically considered an unstructured-text problem. Therefore, extracting insights and patterns from such text can be a bit challenging, owing to the context-dependent interpretation of natural language. Text mining technologies have the capability to handle the ambiguity and variability of unstructured data.

Social media platforms need to detect hate speech and prevent it from going viral, or ban it at the right time. So in this project, we walk through the task of hate speech detection with machine learning using the Python programming language.
Hardware & Software Requirements

Hardware:

Processor: i5 or higher
Hard Disk: 10 GB or higher
RAM: 4 GB or higher

Software:

Anaconda Navigator
Jupyter Notebook or Google Colab
Python 3.0 or higher
Block Diagram
Project Objectives

This project aims to classify textual content as non-offensive, less offensive, or more offensive.

The proposed solution employs different feature engineering techniques and ML algorithms to classify content as hate speech.

• The main objective of this work is to develop an automated, machine learning based approach for detecting hate speech and offensive language.
• Automated detection corresponds to automated learning, such as machine learning with supervised and unsupervised learning. We use a supervised learning method to detect hate and offensive language.
• Classify texts into three categories based on text sentiment and other features that a text demonstrates.
Data Collection

In order to build intelligent applications capable of understanding, machine learning models need to digest large amounts of structured training data. Gathering sufficient training data is the first step in solving any AI-based machine learning problem.

Data collection means pooling data by scraping, capturing, and loading it from
multiple sources, including offline and online sources. High volumes of data collection
or data creation can be the hardest part of a machine learning project, especially at
scale.

Furthermore, all datasets have flaws. This is why data preparation is so crucial in the
machine learning process. In a word, data preparation is a series of processes for
making your dataset more machine learning-friendly. In a broader sense, data
preparation also entails determining the best data collection mechanism. And these
techniques take up the majority of machine learning time. It can take months for the
first algorithm to be constructed.

In this project, the dataset is collected in two different ways:-

• The Twitter dataset is downloaded from Kaggle.
• A dataset is prepared by video annotation.

Twitter Dataset
Annotated Dataset
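As a minimal sketch of the first collection path, the Kaggle Twitter data can be loaded with pandas. The file name twitter.csv, the tweet text column, and the numeric class column are assumptions for illustration and may differ from the copy used in this project:

```python
# Illustrative loading of the Kaggle Twitter dataset (file and column names
# are assumptions; adjust them to the actual CSV used in the project).
import pandas as pd

df = pd.read_csv("twitter.csv")

# Map the numeric labels to the three categories used in this report.
# 0 = hate speech, 1 = offensive language, 2 = neither is the usual
# convention for this dataset; verify against your copy before relying on it.
label_map = {0: "more-offensive", 1: "less-offensive", 2: "non-offensive"}
df["label"] = df["class"].map(label_map)

print(df.shape)
print(df["label"].value_counts())
```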
Data Cleaning

Data cleaning is one of the important parts of machine learning. It plays a significant part in building a model. It surely isn’t the fanciest part of machine learning, and at the same time, there aren’t any hidden tricks or secrets to uncover. However, the success or failure of a project relies on proper data cleaning. Professional data scientists usually invest a very large portion of their time in this step because of the belief that “Better data beats fancier algorithms”.
If we have a well-cleaned dataset, there is a chance that we can achieve good results with simple algorithms as well, which can prove very beneficial at times, especially in terms of computation when the dataset size is large. Obviously, different types of data will require different types of cleaning. However, the following systematic approach can always serve as a good starting point.

The following are the most common steps involved in data cleaning:

1. Data inspection and exploration: This step involves understanding the data by
inspecting its structure, identifying missing values, outliers, and inconsistencies.
2. Handling missing data: Missing data is a common issue in real-world datasets, and
it can occur due to various reasons such as human errors, system failures, or data
collection issues. Various techniques can be used to handle missing data, such as
imputation, deletion, or substitution.
3. Handling outliers: Outliers are extreme values that deviate significantly from the
majority of the data. They can negatively impact the analysis and model
performance. Techniques such as clustering, interpolation, or transformation can be
used to handle outliers.
4. Data transformation: Data transformation involves converting the data from one
form to another to make it more suitable for analysis. Techniques such as
normalization, scaling, or encoding can be used to transform the data.
5. Data integration: Data integration involves combining data from multiple sources
into a single dataset to facilitate analysis. It involves handling inconsistencies,
duplicates, and conflicts between the datasets.
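For tweet text, these steps largely reduce to normalizing the raw strings. The routine below is an illustrative sketch that assumes the df and tweet column names from the data-collection sketch; the exact cleaning steps used in the project may differ:

```python
# Sketch of tweet cleaning: lower-casing, stripping URLs/mentions/punctuation,
# removing stop words, and stemming. Column names follow the earlier sketch.
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def clean_text(text: str) -> str:
    text = str(text).lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)      # drop URLs
    text = re.sub(r"@\w+|#", "", text)                      # drop mentions and '#'
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\w*\d\w*", "", text)                    # drop tokens containing digits
    tokens = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return " ".join(tokens)

df["clean_tweet"] = df["tweet"].apply(clean_text)
```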
Advantages of Data Cleaning in Machine Learning:

1. Improved model performance: Data cleaning helps improve the performance of the
ML model by removing errors, inconsistencies, and irrelevant data, which can help
the model to better learn from the data.
2. Increased accuracy: Data cleaning helps ensure that the data is accurate,
consistent, and free of errors, which can help improve the accuracy of the ML
model.
3. Better representation of the data: Data cleaning allows the data to be transformed
into a format that better represents the underlying relationships and patterns in the
data, making it easier for the ML model to learn from the data.
4. Improved data quality: Data cleaning helps to improve the quality of the data,
making it more reliable and accurate. This ensures that the machine learning
models are trained on high-quality data, which can lead to better predictions and
outcomes.
5. Improved data security: Data cleaning can help to identify and remove sensitive or
confidential information that could compromise data security. By eliminating this
information, data cleaning can help to ensure that only the necessary and relevant
data is used for machine learning.
Data Analysis and Exploration

Exploratory Data Analysis (EDA) is an approach used to analyze the data and discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations.
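A few simple EDA steps on this dataset could look like the sketch below, reusing the df, label, and clean_tweet names introduced in the earlier sketches:

```python
# Class balance, tweet-length statistics per class, and missing values.
print(df["label"].value_counts(normalize=True))

df["tweet_len"] = df["clean_tweet"].str.split().str.len()
print(df.groupby("label")["tweet_len"].describe())

print(df.isnull().sum())
```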
Data Modeling

Data modeling in machine learning refers to the process of creating a mathematical representation or model that captures the underlying patterns, relationships, and characteristics of a dataset. This model is then used to make predictions, classify new data points, or gain insights from the data.

1. Decision Tree Modeling

A decision tree is one of the most powerful tools of supervised learning algorithms
used for both classification and regression tasks. It builds a flowchart-like tree
structure where each internal node denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf node (terminal node) holds a class
label. It is constructed by recursively splitting the training data into subsets based
on the values of the attributes until a stopping criterion is met, such as the
maximum depth of the tree or the minimum number of samples required to split a
node.

During training, the Decision Tree algorithm selects the best attribute to split the
data based on a metric such as entropy or Gini impurity, which measures the level
of impurity or randomness in the subsets. The goal is to find the attribute that
maximizes the information gain or the reduction in impurity after the split.
A tree can be “learned” by splitting the source set into subsets based on Attribute
Selection Measures. Attribute selection measure (ASM) is a criterion used in
decision tree algorithms to evaluate the usefulness of different attributes for splitting
a dataset. The goal of ASM is to identify the attribute that will create the most
homogeneous subsets of data after the split, thereby maximizing the information
gain. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when all records in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions. The construction of a decision tree classifier does not require any
domain knowledge or parameter setting and therefore is appropriate for exploratory
knowledge discovery. Decision trees can handle high-dimensional data.
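As a hedged sketch of this model on the cleaned tweets, TF-IDF features can be fed to scikit-learn's DecisionTreeClassifier. The feature size, split ratio, and tree settings below are illustrative rather than the report's exact configuration; later sketches reuse this train/test split:

```python
# TF-IDF features plus a decision tree classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(df["clean_tweet"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

dt = DecisionTreeClassifier(criterion="gini", random_state=42)  # Gini impurity as the ASM
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)

print("Accuracy :", accuracy_score(y_test, dt_pred))
print("Precision:", precision_score(y_test, dt_pred, average="weighted"))
```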

Decision Tree Model: accuracy and precision of the model


2. Naive Bayes Classifiers

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.

Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation:

P(A|B) = P(B|A) · P(A) / P(B)

where A and B are events and P(B) ≠ 0.

• Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.
• P(A) is the prior probability of A (i.e. the probability of the event before the evidence is seen). The evidence is an attribute value of an unknown instance (here, it is event B).
• P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.
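A corresponding sketch on the TF-IDF split from the decision tree example. Bernoulli Naïve Bayes is shown because it is the variant singled out later in the optimisation section; other variants such as MultinomialNB follow the same pattern:

```python
# Bernoulli Naive Bayes on the same train/test split as the decision tree sketch.
from sklearn.metrics import accuracy_score, precision_score
from sklearn.naive_bayes import BernoulliNB

nb = BernoulliNB()          # binarizes the TF-IDF features at 0.0 by default
nb.fit(X_train, y_train)
nb_pred = nb.predict(X_test)

print("Accuracy :", accuracy_score(y_test, nb_pred))
print("Precision:", precision_score(y_test, nb_pred, average="weighted"))
```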

Naïve Bayes Classification Model: accuracy and precision of the model


3. Support Vector Machine
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which
is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
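A sketch of this model on the same features; LinearSVC is used here as a common choice for high-dimensional sparse text data, since the report does not specify the kernel:

```python
# Linear SVM on the shared TF-IDF split.
from sklearn.metrics import accuracy_score, precision_score
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1.0)
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)

print("Accuracy :", accuracy_score(y_test, svm_pred))
print("Precision:", precision_score(y_test, svm_pred, average="weighted"))
```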

Support Vector Machine Model: accuracy and precision of the model


4. Neural Network Model: Multi-Layer Perceptron (MLP)

A multi-layer perceptron has one input layer with one neuron (or node) for each input, one output layer with a single node for each output, and any number of hidden layers, each of which can have any number of nodes. A schematic diagram of a Multi-Layer Perceptron (MLP) is depicted below.

In the multi-layer perceptron diagram above, we can see that there are three inputs and thus three input nodes, and the hidden layer has three nodes. The output layer gives two outputs, therefore there are two output nodes. The nodes in the input layer take the input and forward it for further processing; in the diagram above, the nodes in the input layer forward their output to each of the three nodes in the hidden layer, and in the same way, the hidden layer processes the information and passes it to the output layer.
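A sketch with scikit-learn's MLPClassifier on the same split; the single hidden layer of 100 nodes and the iteration limit are placeholders, since the report does not state the exact architecture:

```python
# Multi-layer perceptron on the shared TF-IDF split.
from sklearn.metrics import accuracy_score, precision_score
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)
mlp.fit(X_train, y_train)
mlp_pred = mlp.predict(X_test)

print("Accuracy :", accuracy_score(y_test, mlp_pred))
print("Precision:", precision_score(y_test, mlp_pred, average="weighted"))
```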
Neural Network Model (Multi-Layer Perceptron): accuracy and precision of the model
Optimisation and Deployment

After comparing the accuracy of the different machine learning algorithms, we find that the Support Vector Machine gives the best result on accuracy, while on precision Bernoulli Naïve Bayes is best.
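A sketch of how such a side-by-side comparison can be produced from the four models trained in the earlier sketches:

```python
# Compare accuracy and weighted precision of the trained models on the test set.
from sklearn.metrics import accuracy_score, precision_score

models = {"Decision Tree": dt, "Bernoulli NB": nb, "SVM": svm, "MLP": mlp}
for name, model in models.items():
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    prec = precision_score(y_test, pred, average="weighted")
    print(f"{name:15s} accuracy={acc:.3f}  precision={prec:.3f}")
```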
Deployment

The Python pickle module is used for serializing and de-serializing a Python object structure. Any object in Python can be pickled so that it can be saved on disk. What pickle does is “serialize” the object first before writing it to a file. Pickling is a way to convert a Python object (list, dict, etc.) into a character stream. The idea is that this character stream contains all the information necessary to reconstruct the object in another Python script.
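For example, the chosen model and the fitted vectorizer from the earlier sketches can be pickled to disk and reloaded in a separate script; the file names below are illustrative:

```python
# Serialize the trained model and vectorizer so another script can reuse them.
import pickle

with open("hate_speech_model.pkl", "wb") as f:
    pickle.dump(svm, f)
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# In another Python script, reconstruct the objects and classify new text.
with open("hate_speech_model.pkl", "rb") as f:
    model = pickle.load(f)
with open("tfidf_vectorizer.pkl", "rb") as f:
    vec = pickle.load(f)

sample = vec.transform(["example tweet text to classify"])
print(model.predict(sample))
```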
Conclusion

In conclusion, hate speech detection is a crucial task in the field of machine learning.
The detection of hate speech aims to identify and classify offensive, discriminatory, or
harmful language in various forms of text, such as social media posts, online
comments, or public forums. The goal is to develop models and algorithms that can
automatically detect and flag such content, contributing to safer online environments
and promoting inclusive and respectful communication.

Advancements in machine learning techniques have facilitated the development of hate speech detection models. These models often leverage approaches such as text classification, sentiment analysis, and natural language processing techniques. They are trained on labelled datasets that consist of examples of hate speech and non-hate speech instances, allowing them to learn patterns and characteristics of offensive language.

However, hate speech detection remains a challenging task due to the complexity
and evolving nature of language. Context, sarcasm, cultural references, and linguistic
nuances can make it difficult to accurately classify certain instances of hate speech.
Additionally, biases and subjectivity in labelling data can impact the performance and fairness of the models. Our work has made several contributions to this problem: we introduced a method for automatically classifying hate speech.
Future Scope of the Model

The future scope of hate speech detection is promising, with ongoing advancements
in technology and research. Here are a few key areas of focus:

1. Improved Accuracy: Researchers are working to enhance the accuracy of hate speech detection models by developing more sophisticated algorithms and incorporating advanced natural language processing techniques. This includes considering contextual information, sarcasm, and cultural references to better understand the nuances of hate speech.

2. Multilingual and Multimodal Detection: Hate speech exists in various languages and
can manifest through different forms of media, such as images and videos. Future
developments aim to expand hate speech detection to multiple languages and
incorporate multimodal analysis to identify offensive content across different
mediums.

3. Real-Time Detection: Detecting hate speech in real time is essential for timely interventions and moderating online platforms effectively. Future advancements may focus on developing models and systems that can process and analyze streaming text and user-generated content in real time.
