
Connecting Social Media To E-Commerce: Cold-Start Product
Recommendation using Microblogging Information
A project report submitted in partial fulfillment
of the requirements for the award of the degree of

MASTER OF TECHNOLOGY
IN
COMPUTER SCIENCE & ENGINEERING
Submitted by
JALA SINDHUJA
14BJ1D5802
Under the esteemed guidance of
Mr. RAVEENDRA REDDY ENUMULA (Ph.D.)
Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


ST. MARY'S GROUP OF INSTITUTIONS, GUNTUR (Affiliated
to JNTU Kakinada, Approved by AICTE, Accredited by NBA)
CHEBROLU-522 212, A.P., INDIA
2014-16
ST. MARY’S GROUP OF INSTITUTIONS, CHEBROLU, GUNTUR
(Affiliated to JNTU Kakinada)

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE

This is to certify that the project report entitled "Connecting Social Media To E-Commerce:
Cold-Start Product Recommendation using Microblogging Information" is the bona fide
record of project work carried out by JALA SINDHUJA, a student of this college, during the
academic year 2014-2016, in partial fulfillment of the requirements for the award of the degree
of Master of Technology in Computer Science & Engineering from St. Mary's Group of
Institutions, Guntur, of Jawaharlal Nehru Technological University, Kakinada.

Mr. RAVEENDRA REDDY ENUMULA
Asst. Professor
(Project Guide)

Associate Professor
(Head of Department, CSE)
DECLARATION
We hereby declare that the project report entitled "Connecting Social Media To E-Commerce:
Cold-Start Product Recommendation using Microblogging Information" is an original work done at
St. Mary's Group of Institutions Guntur, Chebrolu, Guntur, and submitted in fulfillment of the
requirements for the award of Master of Technology in Computer Science & Engineering to
St. Mary's Group of Institutions Guntur, Chebrolu, Guntur.

JALA SINDHUJA
(14BJ1D5802)
ACKNOWLEDGEMENT

We consider it a privilege to thank all those people who helped us greatly in the successful
completion of the project "Connecting Social Media To E-Commerce: Cold-Start Product
Recommendation using Microblogging Information". We extend special gratitude to our guide,
Mr. E. Raveendra Reddy, Asst. Professor, whose stimulating suggestions and encouragement
helped us coordinate our project, especially in writing this report, and whose valuable
guidance and comprehensive assistance helped us greatly in presenting the project.
We would also like to acknowledge with much appreciation the crucial role of our coordinator,
Mr. E. Raveendra Reddy, Asst. Professor, in helping us complete our project. We thank him
for being such a wonderful educator as well as a person.
We express our heartfelt thanks to Mr. Subhani Shaik, Head of the Department, CSE, for his
spontaneous sharing of knowledge, which helped us bring this project through the
academic year.

JALA SINDHUJA
(14BJ1D5802)
ABSTRACT

Social networking is now the main medium of data transmission as well as
large-scale data creation, and it is a well-known reason why big data exists.
Social networking platforms create enormous volumes of data, and these volumes
are projected to grow to a scale that strains even the largest data centers.
The increase in fake accounts drives further exponential growth of this volume,
and in this paper we propose an architecture for identifying fake accounts in
social networks, especially Facebook. In this research, a deep learning
algorithm is used to better predict fake accounts based on the posts and
status updates on users' social networking walls. Both Facebook and Twitter
are candidates for this research; for security reasons and because of the
availability of its data, Twitter is used. Twitter tweets, tagging, and the
types of posts that accounts create are analyzed; some accounts create trouble
in society with their posts, and identifying such fake accounts and the
unwanted information circulating over the network serves peace and security.

TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 HISTORY
1.2 SOCIAL IMPACT
1.3 ISSUES
1.4 OBJECTIVE
1.5 DEEP LEARNING
1.6 WORKING OF DEEP LEARNING
1.7 DEEP LEARNING vs. MACHINE LEARNING
1.8 NETWORK ARCHITECTURES OF DEEP LEARNING
1.9 ORGANIZATION OF THE THESIS
2 SURVEY ON FAKE PROFILE IDENTIFICATION
3 SYSTEM REQUIREMENTS
3.1 HARDWARE REQUIREMENTS
3.2 SOFTWARE REQUIREMENTS
3.3 FEATURES OF KERAS
3.4 PRINCIPLES OF KERAS
3.5 FEATURES OF TENSORFLOW
4 FAKE PROFILE IDENTIFICATION ON SOCIAL NETWORKS USING DEEP LEARNING
4.1 CONVOLUTIONAL NEURAL NETWORK
4.2 WORKING OF CNN LAYERS
4.2.1 CONVOLUTIONAL LAYER
4.2.2 POOLING
4.2.3 FULLY CONNECTED LAYER
4.3 SPLITTING OF TRAINING AND TESTING DATA
4.3.1 PREREQUISITES FOR TRAIN AND TEST DATA
5 FAKE PROFILE IDENTIFICATION ON SOCIAL NETWORKS USING DEEP LEARNING
5.1 PREPROCESSING
5.2 STEPS IN PREPROCESSING
5.2.1 DATA CLEANING
5.2.2 DATA INTEGRATION AND TRANSFORMATION
5.2.3 DATA REDUCTION
5.2.4 DISCRETIZATION AND CONCEPT HIERARCHY GENERATION
6 IMPLEMENTATION AND RESULT
6.1 INTRODUCTION
6.2 CODING
6.3 EXPERIMENT RESULT
7 CONCLUSION
REFERENCES
LIST OF FIGURES

1.1 LAYERS OF DEEP LEARNING
3.1 KERAS ENVIRONMENT
3.2 FEATURES OF TENSORFLOW
4.1 CONVOLUTIONAL NEURAL NETWORK
4.2 FEATURE MAP
4.3 EXAMPLE FOR POOLING LAYER
4.4 FULLY CONNECTED LAYER
6.1 SPLITTING OF TEST AND TRAIN DATA
6.2 PREPROCESSING
6.3 FEATURE EXTRACTION
6.4 PREDICTION
6.5 EPOCH
6.6 CLASSIFICATION
LIST OF ABBREVIATIONS

AI Artificial Intelligence
API Application Programming Interface
CNGF Link prediction algorithm based on local node information (Dong et al.)
CNN Convolutional Neural Network
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
DBLP Digital Bibliography and Library Project
EMD Entropy Minimization Discretization
FNR False Negative Rate
FPR False Positive Rate
GPU Graphics Processing Unit
iOS iPhone Operating System (Apple)
ML Machine Learning
OSN Online Social Network
PCA Principal Component Analysis
ReLU Rectified Linear Unit
RF Random Forest
SMFSR Supervised Matrix Factorization method with Social Regularization
SMOTE Synthetic Minority Oversampling Technique
SVM Support Vector Machine
TPR True Positive Rate
TPU Tensor Processing Unit
UPN Unsupervised Pre-trained Network
WWW World Wide Web
CHAPTER 1

INTRODUCTION

A social networking site is a website where each user has a profile
and can keep in contact with friends, share updates, and meet new people
who have the same interests. These Online Social Networks (OSNs) use
Web 2.0 technology, which allows users to interact with each other. Social
networking sites are growing rapidly and changing the way people keep in
contact with each other. Online communities bring people with the same
interests together, making it easier for users to make new friends.

1.1 HISTORY
Early social networking on the World Wide Web (WWW) began in the
form of generalized online communities such as Theglobe.com (1995),
GeoCities (1994), and Tripod.com (1995). In the late 1990s, user profiles
became a central feature of social networking sites, allowing users to compile
lists of "friends" and search for other users with similar interests. Facebook,
launched in 2004, became the largest social networking site in the world in
early 2009. Facebook was first introduced as a Harvard social networking
site, expanding to other universities and, eventually, to anyone. The term
social media was introduced and soon became widespread.

1.2 SOCIAL IMPACT

In the present generation, everyone's social life has become
associated with online social networks. Adding new friends and keeping in
contact with them and their updates has become easier. Online social
networks have an impact on science, education, grassroots organizing,
employment, business, and more. Researchers have been studying these online
social networks to see the impact they make on people. Teachers can reach
students easily through them, creating a friendly environment for the
students to study; teachers nowadays familiarize themselves with these sites,
bringing in online classroom pages, giving homework, and holding discussions,
which improves education considerably. Employers can use these social
networking sites to employ people who are talented and interested in the
work, and background checks can be done easily.

1.3 ISSUES

Social networking sites are making our social lives better, but there are
nevertheless many issues with using them. The issues include privacy, online
bullying, potential for misuse, trolling, etc. These are mostly carried out
using fake profiles.

1.4 OBJECTIVE

This project provides a framework through which we can detect a fake profile
using machine learning algorithms, so that people's social lives become more
secure.

1.5 DEEP LEARNING

Deep learning is a branch of machine learning based entirely on
artificial neural networks; because a neural network mimics the human brain,
deep learning is also a kind of mimicry of the human brain. In deep learning,
we do not need to explicitly program everything. The concept of deep learning
is not new; it has been around for years, but it has gained attention recently
because we previously lacked the processing power and the large amounts of
data it requires. Deep learning is a particular kind of machine learning that
achieves great power and flexibility by learning to represent the world as a
nested hierarchy of concepts, with each concept defined in relation to simpler
concepts, and more abstract representations computed in terms of less abstract
ones.

The human brain contains approximately 100 billion neurons, each
connected to thousands of its neighbours. The question is how to recreate
these neurons in a computer. We create an artificial structure called an
artificial neural net with nodes, or neurons: some neurons hold input values,
some hold output values, and in between there may be many interconnected
neurons in hidden layers. Deep learning is an AI function that mimics the
workings of the human brain in processing data for use in decision making.
Deep learning AI is able to learn from data that is both unstructured and
unlabeled. Deep learning, a machine learning subset, can be used to help
detect fraud or money laundering.

1.6 WORKING OF DEEP LEARNING


Deep learning has evolved hand-in-hand with the digital era, which has
brought about an explosion of data in all forms and from every region of the
world. This data, known simply as big data, is drawn from sources like social
media, internet search engines, e-commerce platforms, and online cinemas,
among others. This enormous amount of data is readily accessible and can be
shared through fintech applications like cloud computing.

However, the data, which is normally unstructured, is so vast that it
could take decades for humans to comprehend it and extract relevant
information. Companies realize the incredible potential that can result from
unraveling this wealth of information and are increasingly adopting AI
systems for automated support.
1.7 DEEP LEARNING vs. MACHINE LEARNING

One of the most common AI techniques used for processing big data is
machine learning, a self-adaptive approach whose analysis and pattern
detection improve with experience or with newly added data.
If a digital payments company wanted to detect the occurrence of, or
potential for, fraud in its system, it could employ machine learning tools for
this purpose. The computational algorithm built into a computer model
processes all transactions happening on the digital platform, finds patterns
in the data set, and points out any anomaly detected by the pattern.

Deep learning, a subset of machine learning, utilizes a hierarchical


level of artificial neural networks to carry out the process of machine
learning. The artificial neural networks are built like the human brain, with
neuron nodes connected together like a web. While traditional programs build
analysis with data in a linear way, the hierarchical function of deep learning
systems enables machines to process data with a nonlinear approach.

1.8 NETWORK ARCHITECTURES OF DEEP LEARNING

Deep learning can be defined as neural networks with a large number of
parameters and layers, in one of four fundamental network architectures:

1. Unsupervised Pre-Trained Networks (UPNs)
2. Convolutional Neural Networks (CNNs)
3. Recurrent Neural Networks
4. Recursive Neural Networks
Figure 1.1 Layers of Deep Learning

Deep learning deals with many layers, in some networks more than 150.
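For illustration, the following is a minimal sketch of a small layered network in Keras, the tool used later in this report; it is not taken from the project code, and the layer sizes and eight-feature input are arbitrary assumptions:

from keras.models import Sequential
from keras.layers import Dense

# Input layer feeds two hidden layers, which feed the output layer.
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(8,)))  # hidden layer 1
model.add(Dense(32, activation='relu'))                    # hidden layer 2
model.add(Dense(2, activation='softmax'))                  # output: 2 classes
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()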

1.9 ORGANIZATION OF THE THESIS

Chapter 1 gives a brief description of deep learning and its
classification. Chapter 2 describes related methods for identifying fake
profiles. Chapter 3 covers the system requirements. Chapters 4 and 5 describe
the modules. Chapter 6 describes the implementation and results. Chapter 7
presents the conclusion of the thesis.
CHAPTER 2

SURVEY ON FAKE PROFILE IDENTIFICATION

Mauro Conti, et al. (2012), Detecting fake profiles in online social
networks: the main idea behind this approach is to study the temporal
evolution of OSNs and characterize real user profiles. If real-world OSN
data can be collected and used to identify a set of features that are
statistically consistent, then these features can be used to study the time
evolution of a given test profile and identify/detect any major deviations
from the expected behavior of a profile. The first stage is the detection of
a potential fake profile, alerting OSN managers to conduct further monitoring
of the test profile. The second stage is identifying the fake profile by
using the OSN graph. The graph includes link strength, number of tags, number
of pictures, rate of accepted friendships, and so on.

Cao Xiao, et al. (2015), Detecting clusters of fake accounts in online
social networks: this work describes a scalable approach to finding groups of
fake accounts registered by the same actor. The main technique is a
supervised machine learning pipeline for classifying an entire cluster of
accounts as malicious or legitimate. The method considered the following
three classification methods: logistic regression with L1 regularization, a
support vector machine with a radial basis function kernel, and random
forest, a nonlinear tree-based ensemble learning method.

Yin Zhu, et al. (2012), Discovering Spammers in Social Networks,
proposed a Supervised Matrix Factorization method with Social Regularization
(SMFSR) for spammer detection in social networks that exploits both social
activities and users' social relations in an innovative and highly scalable
manner. In particular, recent advances in tensor factorization could be used
to factorize the performer-activity-receiver interaction tensor, which
contains richer information than the user-activity matrix used in this work.

Liyan Dong, et al. (2013), The Algorithm of Link Prediction on Social
Network, used the CNGF algorithm, based on local information, and the KatzGF
algorithm, based on global information of the network, for predicting links
in a social network. The link prediction task is divided into two categories:
the first is to predict that a new link will appear at a future time; the
second is to forecast hidden, unknown links in the space. DBLP (Digital
Bibliography and Library Project) experimental data sets were used to verify
the performance of the link prediction algorithm. This data set is real and
covers the records of 28 international conferences related to data mining,
machine learning, and so on.

Yazan Boshmaf, et al. (2016) used ground truth provided by RenRen to
train an SVM classifier to detect fake accounts. Using simple features, such
as frequency of friend requests, fraction of accepted requests, and
per-account clustering coefficient, the authors were able to train a
classifier with a 99% true-positive rate (TPR) and a 0.7% false-positive rate
(FPR). Stringhini utilized honeypot accounts to collect data describing
various user activities in OSNs [28]. By analyzing the collected data, they
were able to build a ground truth for real and fake accounts, with features
similar to those outlined above. The authors trained two random forest (RF)
classifiers to detect fakes in Facebook and Twitter, ending up with a 2% FPR
and 1% false-negative rate (FNR) for Facebook, and a 2.5% FPR and 3% FNR for
Twitter.

Mohammad Ebrahim Shiri, et al. (2018), Identifying fake accounts on
social networks based on graph analysis and classification algorithms: in
this paper, similarity matrices between accounts were calculated; then the
PCA algorithm was used for feature extraction and SMOTE was used for data
balancing. The linear SVM, medium Gaussian SVM, and logistic regression
algorithms were then used to classify the nodes. Finally, the performance of
this method was evaluated using various classifier algorithms. SVM was used
with a linear and a Gaussian kernel in training. The Gaussian kernel places
normal curves around the data points and sums them so that the decision
boundary can be defined by a topological condition, such as curves where the
sum is above 0.5.

Buket Erşahin, et al. (2017), Twitter Fake Account Detection, presented
a classification method for detecting fake accounts on Twitter. They
preprocessed the dataset using a supervised discretization technique named
Entropy Minimization Discretization (EMD) on the numerical features and
analyzed the results of the Naive Bayes algorithm. The Naive Bayesian
classifier provides a simple approach, with clear semantics, for
representing, using, and learning probabilistic knowledge, and it is used in
supervised learning tasks. In particular, it assumes that the predictive
attributes are conditionally independent. In this paper, Entropy Minimization
Discretization with Minimum Description Length is applied to the numeric
features; as a result of the second experiment, 909 of the 1000 instances are
classified correctly, for 90.9% accuracy. 60 of 501 fake accounts are
classified as real and 31 of 499 real accounts are classified as fake. The
weighted average of the F-measure increased to 0.909.
CHAPTER 3

SYSTEM REQUIREMENTS

3.1 HARDWARE REQUIREMENTS

1. RAM - 4 GB

2. Hard Disk - 320 GB

3. Keyboard - Standard 101 Keys

4. Mouse - Optical Mouse

5. OS - Windows 7 or 8 or 10

3.2 SOFTWARE REQUIREMENTS

1. Tool - Keras with TensorFlow backend

2. Language - Python

3.3 FEATURES OF KERAS

Keras contains numerous implementations of commonly used neural
network building blocks such as layers, objectives, activation functions, and
optimizers, plus a host of tools that make working with image and text data
easier and simplify the coding necessary for writing deep neural network
code. The code is hosted on GitHub, and community support forums include the
GitHub issues page and a Slack channel. In addition to standard neural
networks, Keras has support for convolutional and recurrent neural networks.
It supports other common utility layers like dropout, batch normalization,
and pooling. Keras allows users to productize deep models on smartphones (iOS
and Android), on the web, or on the Java Virtual Machine. It also allows
distributed training of deep-learning models on clusters of Graphics
Processing Units (GPUs) and Tensor Processing Units (TPUs), principally in
conjunction with CUDA.

3.4 PRINCIPLES OF KERAS

User friendliness

Keras is an API designed for human beings, not machines. It puts user
experience front and center. Keras follows best practices for reducing
cognitive load: it offers consistent and simple APIs, it minimizes the number
of user actions required for common use cases, and it provides clear and
actionable feedback upon user error.

Modularity

A model is understood as a sequence or a graph of standalone, fully
configurable modules that can be plugged together with as few restrictions as
possible. In particular, neural layers, cost functions, optimizers,
initialization schemes, activation functions, and regularization schemes are
all standalone modules that you can combine to create new models.
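As a hedged sketch of this modularity (assuming the standard Keras API; the layer sizes are arbitrary), layers, a regularization module, an optimizer instance, and a named loss function are standalone pieces plugged together:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(8,)))
model.add(Dropout(0.5))                 # standalone regularization module
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer=Adam(),         # swappable optimizer module
              loss='binary_crossentropy',
              metrics=['accuracy'])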

Minimalism

Each module should be kept short and simple. Every piece of code
should be transparent upon first reading. No black magic: it hurts iteration
speed and ability to innovate.

Easy extensibility

New modules are simple to add (as new classes and functions), and
existing modules provide ample examples. Being able to easily create new
modules allows for total expressiveness, making Keras suitable for advanced
research.

Work with Python

No separate model configuration files in a declarative format: models
are described in Python code, which is compact, easier to debug, and easy to
extend.

Figure 3.1 Keras Environment

WINDOWS

Windows is a series of operating systems developed by Microsoft. Each
version of Windows includes a graphical user interface, with a desktop that
allows users to view files and folders in windows. For the past two decades,
Windows has been the most widely used operating system for personal
computers (PCs).

3.5 FEATURES OF TENSORFLOW

TensorFlow is an end-to-end open-source platform for machine
learning. It has a comprehensive, flexible ecosystem of tools, libraries, and
community resources that lets researchers push the state of the art in ML and
lets developers easily build and deploy ML-powered applications.

Figure 3.2 Features of TensorFlow

Responsive Construct

With TensorFlow we can easily visualize each and every part of the
graph, which is not an option when using NumPy or scikit-learn.
Flexible

One important TensorFlow feature is its flexible operability: it is
modular, and it offers the option of making the parts you want standalone.

Easily Trainable

It is easily trainable on CPU as well as GPU for distributed computing.


Parallel Neural Network Training

TensorFlow offers pipelining in the sense that you can train multiple
neural networks on multiple GPUs, which makes the models very efficient on
large-scale systems.

Large Community

Needless to say, since it was developed by Google, a large team of
software engineers already works continuously on stability improvements.

Open Source

The best thing about this machine learning library is that it is open
source, so anyone can use it as long as they have internet connectivity.
People manipulate the library in previously unimagined ways and come up with
an amazing variety of useful products; it has become a DIY community with a
huge forum for people getting started and for those who need help with their
work.
Feature Columns

TensorFlow has feature columns, which can be thought of as
intermediaries between raw data and estimators, bridging input data with your
model.
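As a hedged sketch (the tf.feature_column API shown here exists in TensorFlow 1.x and 2.x but is deprecated in recent releases; the column names are hypothetical):

import tensorflow as tf

# Feature columns describe how raw input fields map to model inputs.
followers = tf.feature_column.numeric_column("followers_count")
friends = tf.feature_column.numeric_column("friends_count")
feature_layer = tf.keras.layers.DenseFeatures([followers, friends])

raw = {"followers_count": [[120.0], [3.0]],
       "friends_count": [[80.0], [950.0]]}
print(feature_layer(raw))  # dense tensor ready to feed a Keras model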

Availability of Statistical Distributions

The library provides distribution functions, including Bernoulli,
Beta, Chi2, Uniform, and Gamma, which are especially important for
probabilistic approaches such as Bayesian models.
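A brief sketch, assuming the separate TensorFlow Probability add-on package (pip install tensorflow-probability) rather than core TensorFlow:

import tensorflow_probability as tfp

tfd = tfp.distributions
coin = tfd.Bernoulli(probs=0.5)                   # Bernoulli distribution
waiting = tfd.Gamma(concentration=2.0, rate=0.5)  # Gamma distribution

print(coin.sample(5))         # five random 0/1 draws
print(waiting.log_prob(3.0))  # log-density of the Gamma at 3.0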
CHAPTER 4

FAKE PROFILE IDENTIFICATION ON SOCIAL

NETWORKS USING DEEP LEARNING

4.1 Convolutional Neural Network (CNN)

Convolutional Neural Network (CNN) models were developed for
image classification, in which the model accepts a two-dimensional input
representing an image's pixels and color channels, in a process called
feature learning.

The same process can be applied to one-dimensional sequences of data.
The model extracts features from sequence data and maps the internal features
of the sequence. A 1D CNN is very effective for deriving features from a
fixed-length segment of the overall dataset, where it is not so important
where in the segment the feature is located.

A Convolutional Neural Network (CNN) is a deep learning algorithm
which can take input data, assign importance (learnable weights and biases)
to various aspects or objects in the data, and differentiate one from the
other. CNNs are a category of neural networks that have proven very effective
in areas such as classification and image recognition. A CNN is a
feed-forward neural network used for achieving high accuracy.
Figure 4.1 Convolutional Neural Network

4.2 WORKING OF CNN LAYERS

4.2.1 Convolutional layer:

Convolutional layers are the major building blocks used in convolutional


neural networks.

A convolution is the simple application of a filter to an input that results in


an activation. Repeated application of the same filter to an input results in a map
of activations called a feature map, indicating the locations and strength of a
detected feature in an input, such as an image or text.

The innovation of convolutional neural networks is the ability to
automatically learn a large number of filters in parallel, specific to a
training dataset, under the constraints of a specific predictive modeling
problem such as image or text classification. The result is highly specific
features that can be detected anywhere on the input data. For the
one-dimensional case used here, the multiplication is performed between an
array of input data and a one-dimensional array of weights, called a filter
or a kernel.

The filter is smaller than the input data and the type of multiplication
applied between a filter-sized patch of the input and the filter is a dot product. A
dot product is the element-wise multiplication between the filter-sized patch of
the input and filter, which is then summed, always resulting in a single value.
Because it results in a single value, the operation is often referred to as the scalar
product. Using a filter smaller than the input is intentional as it allows the same
filter (set of weights) to be multiplied by the input array multiple times at
different points on the input. Specifically, the filter is applied systematically to
each overlapping part or filter-sized patch of the input data, left to right, top to
bottom.
Figure 4.2 Example of a filter applied to a one-dimensional input to create a
feature map
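
To make the dot-product mechanics concrete, the following small sketch (not part of the original implementation; the input values and the change-detecting filter are illustrative assumptions) applies a one-dimensional filter to an input array in NumPy:

import numpy as np

x = np.array([0, 0, 1, 1, 1, 0, 0])  # one-dimensional input
w = np.array([-1, 1])                # filter (kernel) of size 2

# Slide the filter over every overlapping patch and take the dot product.
feature_map = np.array([np.dot(x[i:i + len(w)], w)
                        for i in range(len(x) - len(w) + 1)])
print(feature_map)  # [ 0  1  0  0 -1  0]: nonzero where the input changes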
4.2.2 Pooling:

The function of pooling is to progressively reduce the spatial size of the
representation, reducing the number of parameters and the computation in the
network. The pooling layer operates on each feature map independently. The
most common approach used in pooling is max pooling. Pooling layers provide
an approach to downsampling feature maps by summarizing the presence of
features in patches of the feature map. Two common pooling methods are
average pooling and max pooling, which summarize the average presence of a
feature and the most activated presence of a feature, respectively.
Convolutional layers in a convolutional neural network systematically apply
learned filters to input images to create feature maps that summarize the
presence of those features in the input.

Convolutional layers prove very effective, and stacking convolutional


layers in deep models allows layers close to the input to learn low-level features
(e.g. lines) and layers deeper in the model to learn high-order or more abstract
features, like shapes or specific objects.

A limitation of the feature maps output by convolutional layers is that
they record the precise position of features in the input. This means that
small movements in the position of the feature in the input data will result
in a different feature map. This can happen with re-cropping, rotation,
shifting, and other minor changes to the input data.

A common approach to addressing this problem, borrowed from signal
processing, is called downsampling. A lower-resolution version of an input
signal is created that still contains the large or important structural
elements, without the fine detail that may not be as useful to the task.
Downsampling can be achieved with convolutional layers by changing the stride
of the convolution across the data. A more robust and common approach is to
use a pooling layer.

A pooling layer is a new layer added after the convolutional layer.


Specifically, after a nonlinearity (e.g. ReLU) has been applied to the feature
maps output by a convolutional layer. The addition of a pooling layer after the
convolutional layer is a common pattern used for ordering layers within a
convolutional neural network that may be repeated one or more times in a given
model.

The pooling layer operates upon each feature map separately to create a
new set of the same number of pooled feature maps. Pooling involves selecting a
pooling operation, much like a filter to be applied to feature maps. The size of
the pooling operation or filter is smaller than the size of the feature map;
specifically, it is almost always 2×2 pixels applied with a stride of 2 pixels.

This means that the pooling layer will always reduce the size of each
feature map by a factor of 2, e.g. each dimension is halved, reducing the number
of pixels or values in each feature map to one quarter the size. For example, a
pooling layer applied to a feature map of 6×6 (36 pixels) will result in an output
pooled feature map of 3×3 (9 pixels).

The pooling operation is specified, rather than learned. Two common


functions used in the pooling operation are:

• Average Pooling: Calculate the average value for each patch on


the feature map.
• Maximum Pooling (or Max Pooling): Calculate the
maximum value for each patch of the feature map.

The result of using a pooling layer and creating down sampled or pooled
feature maps is a summarized version of the features detected in the input. They
are useful as small changes in the location of the feature in the input detected by
the convolutional layer will result in a pooled feature map with the feature in the
same location. This capability added by pooling is called the model’s invariance
to local translation.

In all cases, pooling helps to make the representation become


approximately invariant to small translations of the input. Invariance to
translation means that if we translate the input by a small amount, the values of
most of the pooled outputs do not change.

Figure 4.3 Example for pooling layer
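
As an illustrative numeric sketch (the feature-map values are made up), the following applies 2×2 max and average pooling with a stride of 2 to a 4×4 feature map in NumPy, halving each dimension:

import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 2],
                 [2, 2, 3, 4]])

# Group the 4x4 map into four non-overlapping 2x2 patches (stride 2).
patches = fmap.reshape(2, 2, 2, 2).swapaxes(1, 2)
print(patches.max(axis=(2, 3)))   # max pooling:     [[4 2] [2 5]]
print(patches.mean(axis=(2, 3)))  # average pooling: [[2.5 1.] [1.25 3.5]]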


4.2.3 Fully connected layer:

A fully connected layer is simply a feed-forward neural network. Fully
connected layers form the last few layers in the network. The input to the
fully connected layer is the output from the final pooling or convolutional
layer, which is flattened and then fed into the fully connected layer. Fully
connected layers are an essential component of Convolutional Neural Networks
(CNNs), which have proven very successful in recognizing and classifying
images for computer vision. The CNN process begins with convolution and
pooling, breaking down the image into features and analyzing them
independently. The result of this process feeds into a fully connected neural
network structure that drives the final classification decision.

Figure 4.4 Fully connected layer

4.3 SPLITTING OF TRAINING AND TESTING DATA

Separating data into training and testing sets is an important part of


evaluating data mining models. Typically, when you separate a data set into a
training set and testing set, most of the data is used for training, and a smaller
portion of the data is used for testing. Analysis Services randomly samples the
data to help ensure that the testing and training sets are similar. By using similar
data for training and testing, you can minimize the effects of data discrepancies
and better understand the characteristics of the model.

After a model has been processed by using the training set, you test the
model by making predictions against the test set. Because the data in the testing
set already contains known values for the attribute that you want to predict, it is
easy to determine whether the model's guesses are correct.

When working with datasets, a deep learning algorithm works in two
stages. The data is usually split around 80%-20% between the training and
testing stages. Under supervised learning, a dataset is split into training
data and test data in Python.

[Diagram: Dataset divided into a training set and a testing set]

4.3.1 PREREQUISITES FOR TRAIN AND TEST DATA

STEP 1: We need Python libraries such as pandas and scikit-learn. Commands for
installing the Python libraries:

pip install pandas

pip install scikit-learn

STEP 2: We use pandas to import the dataset and scikit-learn to perform the splitting.

STEP 3: To split the dataset into training and testing sets in Python, the process includes

• Loading the dataset

• Splitting

STEP 4: Plotting the train and test sets in Python

We fit our model on the train data to make predictions on it. We import
the linear model from sklearn, apply linear regression to the dataset, and
plot the results, as sketched below.
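
A minimal sketch of these steps, assuming a CSV file data/users.csv with numeric columns friends_count and followers_count (illustrative names matching the dataset used in Chapter 6):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/users.csv")                # load the dataset
X = df[["friends_count"]]
y = df["followers_count"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)          # 80/20 split

model = LinearRegression().fit(X_train, y_train)  # fit on the train set
plt.scatter(X_test, y_test, label="test data")    # plot the test points
plt.plot(X_test, model.predict(X_test), color="red", label="linear fit")
plt.legend()
plt.show()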
CHAPTER 5

FAKE PROFILE IDENTIFICATION ON SOCIAL

NETWORKS USING DEEP LEARNING

5.1 PREPROCESSING

Pre-processing refers to the transformations applied to our data before
feeding it to the algorithm. Data preprocessing is a technique used to
convert raw data into a clean data set.

In other words, whenever data is gathered from different sources, it
is collected in a raw format which is not feasible for analysis. Data
preprocessing is required because real-world data are generally:

1. Incomplete: Missing attribute values, missing certain attributes of
importance, or having only aggregate data

2. Noisy: Containing errors or outliers

3. Inconsistent: Containing discrepancies in codes or names

5.2 STEPS IN PREPROCESSING

5.2.1 Data cleaning:

• Data cleaning is also called data cleansing or scrubbing.

• It fills in missing values, smooths noisy data, identifies or removes
outliers, and resolves inconsistencies.

• Data cleaning is required because source systems contain "dirty data"
that must be cleaned.

Steps in Data cleaning:

STEP 1: Parsing

• Parsing locates and identifies individual data elements in the source
files and then isolates these data elements in the target files.
• Examples include parsing the first, middle, and last names.

STEP 2: Correcting

• Correcting fixes parsed individual data components using sophisticated
data algorithms and secondary data sources.
• Examples include replacing a vanity address and adding a zip code.

STEP 3: Standardizing

• Standardizing applies conversion routines to transform data into its
preferred and consistent format using both standard and custom
business rules.
• Examples include adding a prename and replacing a nickname.

STEP 4: Matching

• Matching searches for and matches records within and across the parsed,
corrected, and standardized data, based on predefined business rules, to
eliminate duplicates.
• Examples include identifying similar names and addresses.
STEP5: Consolidating

• Analyzing and identifying relationships between matched records and


consolidating/merging them into one representation.

STEP6: Data cleansing must deal with many types of possible errors

• These include missing data and incorrect data at one source.

STEP7: Data Staging

• Accumulates data from asynchronous sources.


• At a predefined cutoff time, data in the staging file is transformed
and loaded to the warehouse.
• There is usually no end user access to the staging file.
• An operational data store may be used for data staging.
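
The cleaning ideas above can be sketched briefly in pandas (the column names and values are hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({"full_name": ["Ann B. Smith", "Bob Jones", "Bob Jones"],
                   "friends_count": [120, np.nan, 42]})

# Parsing: isolate individual elements of a combined field.
df[["first_name", "rest"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Filling in missing values: replace NaN with the column median.
df["friends_count"] = df["friends_count"].fillna(df["friends_count"].median())

# Matching/consolidating: drop duplicate records.
df = df.drop_duplicates(subset=["full_name"])
print(df)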

5.2.2 Data integration and Transformation:

• Data integration: Combines data from multiple sources into a coherent


data store e.g. data warehouse.

• Sources may include multiple databases, data cubes or data files.

• Data Transformation: The transformation process deals with rectifying
any inconsistency.
• One of the most common transformation issues is 'Attribute Naming
Inconsistency': it is common for a given data element to be referred
to by different data names in different databases.
• E.g., Employee Name may be EMP_NAME in one database and ENAME in
another.
• Thus one set of data names is picked and used consistently in the
data warehouse, as sketched below.
• Once all the data elements have the right names, they must be converted
to common formats.
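
For example, a brief pandas sketch of resolving attribute-naming inconsistency (the database contents and column names are hypothetical):

import pandas as pd

db1 = pd.DataFrame({"EMP_NAME": ["Ann"]})
db2 = pd.DataFrame({"ENAME": ["Bob"]})

# Pick one canonical data name and use it consistently.
db1 = db1.rename(columns={"EMP_NAME": "employee_name"})
db2 = db2.rename(columns={"ENAME": "employee_name"})

# Integration: combine the renamed sources into one coherent store.
combined = pd.concat([db1, db2], ignore_index=True)
print(combined)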

5.2.3 Data Reduction:

• Data reduction obtains a reduced representation of the data volume
but produces the same or similar analytical results.
• The need for data reduction includes:

• Reducing the number of attributes
• Reducing the number of attribute values
• Reducing the number of tuples

5.2.4 Discretization and Concept Hierarchy Generation (or summarization):

• Discretization: Reduce the number of values for a given continuous
attribute by dividing the range of the attribute into intervals.
• Interval labels can then be used to replace actual data values.
• Concept Hierarchies: Reduce the data by collecting and replacing
low-level concepts (such as numeric values for the attribute age) with
higher-level concepts (such as young, middle-aged, or senior).
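
Both ideas can be illustrated with pandas (the age values, bins, and labels are assumptions):

import pandas as pd

ages = pd.Series([23, 35, 47, 61, 19])

# Discretization: divide the continuous range into intervals.
intervals = pd.cut(ages, bins=[0, 30, 50, 100])

# Concept hierarchy: replace numeric values with higher-level concepts.
concepts = pd.cut(ages, bins=[0, 30, 50, 100],
                  labels=["young", "middle-aged", "senior"])
print(intervals)
print(concepts)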
CHAPTER 6
IMPLEMENTATION AND RESULT

6.1 INTRODUCTION
The pseudo-Facebook dataset includes hypothetical samples
corresponding to two types of accounts on Facebook. Attributes included in
the dataset are the number of friends, followers, status count, etc. The
dataset is divided into training and testing data: 80% of both profile types
(genuine and fake) is used to prepare the training dataset and 20% of both
types to prepare the testing dataset. The CNN algorithm is used to identify
the fake profiles. The algorithm is implemented in Python.

6.2 CODING

TRAIN AND TEST CLASSIFICATION

Split.py
import pandas as pd
from sklearn.model_selection import train_test_split
### Read user profiles from csv files
genuine_users = pd.read_csv("data/users.csv")
fake_users = pd.read_csv("data/fusers.csv")
x = pd.concat([genuine_users, fake_users], ignore_index=True)
# Labels must follow the concatenation order: genuine users first (1),
# then fake users (0).
y = len(genuine_users) * [1] + len(fake_users) * [0]
print(x)
print(y)
### Split the dataset for training and testing
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)
print(x_train)
print(x_test)
print(y_train)
print(y_test)

preprocessing.py
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import names
from sklearn.model_selection import train_test_split

""" Read user profiles from csv files """
genuine_users = pd.read_csv("data/users.csv")
fake_users = pd.read_csv("data/fusers.csv")
# ignore_index avoids duplicate row indices from the two files.
X = pd.concat([genuine_users, fake_users], ignore_index=True)
# Labels must follow the concatenation order: genuine users first,
# then fake users.
y = pd.Series(len(genuine_users) * ['Genuine'] + len(fake_users) * ['Fake'])

# Infer gender from the last letter of a name with a Naive Bayes
# classifier trained on the NLTK names corpus.
def gender_features(word):
    return {'last_letter': word[-1]}

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

a = []
for i in X['name']:
    a.append(classifier.classify(gender_features(i)))
X['gender'] = a
print(X['gender'])

lang_list = list(enumerate(np.unique(X['lang'])))
lang_dict = {name: i for i, name in lang_list}

def strToBinary(s):
    # Convert each character to the binary form of its ASCII value.
    bin_conv = []
    for c in s:
        bin_conv.append(bin(ord(c))[2:])
    return ' '.join(bin_conv)

def binaryToDecimal(n):
    # Read the digits of n as if they were a binary number.
    num = int(n)
    dec_value = 0
    base = 1  # base value 2^0
    temp = num
    while temp:
        last_digit = temp % 10
        temp = int(temp / 10)
        dec_value += last_digit * base
        base = base * 2
    return dec_value

# Encode each name as a number derived from its characters' binary codes.
l = []
for i in X['name']:
    k = 0
    for j in i:
        k = k + int(strToBinary(j))
    l.append(binaryToDecimal(k))
X['name'] = l

gender = {'male': 0, 'female': 1}
X.gender = [gender[item] for item in X.gender]
X.loc[:, 'lang_code'] = X['lang'].map(lambda v: lang_dict[v]).astype(int)
feature_columns_to_use = ['name', 'gender', 'statuses_count',
                          'followers_count', 'friends_count',
                          'favourites_count', 'listed_count', 'lang_code']
ty = X.loc[:, feature_columns_to_use]
x_train, x_test, y_train, y_test = train_test_split(
    ty, y, shuffle=True, test_size=0.3)
print(x_train)
print(x_test)
print(y_train)

Prediction.py
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import names
from sklearn.model_selection import train_test_split
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv1D

""" Read user profiles from csv files """
genuine_users = pd.read_csv("data/users.csv")
fake_users = pd.read_csv("data/fusers.csv")
# ignore_index avoids duplicate row indices from the two files.
X = pd.concat([genuine_users, fake_users], ignore_index=True)

# Labels must follow the concatenation order: genuine users first,
# then fake users.
X['label'] = len(genuine_users) * ['Genuine'] + len(fake_users) * ['Fake']
label = {'Genuine': 0, 'Fake': 1}
X.label = [label[item] for item in X.label]
y = X.loc[:, 'label'].values

# Infer gender from the last letter of a name with a Naive Bayes
# classifier trained on the NLTK names corpus.
def gender_features(word):
    return {'last_letter': word[-1]}

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

a = []
for i in X['name']:
    a.append(classifier.classify(gender_features(i)))
X['gender'] = a

lang_list = list(enumerate(np.unique(X['lang'])))
lang_dict = {name: i for i, name in lang_list}

def strToBinary(s):
    # Convert each character to the binary form of its ASCII value.
    bin_conv = []
    for c in s:
        bin_conv.append(bin(ord(c))[2:])
    return ' '.join(bin_conv)

def binaryToDecimal(n):
    # Read the digits of n as if they were a binary number.
    num = int(n)
    dec_value = 0
    base = 1  # base value 2^0
    temp = num
    while temp:
        last_digit = temp % 10
        temp = int(temp / 10)
        dec_value += last_digit * base
        base = base * 2
    return dec_value

# Encode each name as a number derived from its characters' binary codes.
l = []
for i in X['name']:
    k = 0
    for j in i:
        k = k + int(strToBinary(j))
    l.append(binaryToDecimal(k))
X['name'] = l

gender = {'male': 0, 'female': 1}
X.gender = [gender[item] for item in X.gender]
X.loc[:, 'lang_code'] = X['lang'].map(lambda v: lang_dict[v]).astype(int)
feature_columns_to_use = ['name', 'gender', 'statuses_count',
                          'followers_count', 'friends_count',
                          'favourites_count', 'listed_count', 'lang_code']
ty = X.loc[:, feature_columns_to_use].values

x_train, x_test, y_train, y_test = train_test_split(
    ty, y, shuffle=True, test_size=0.3)
epochs = 50
num_classes = 2
batch_size = 256

###### Convert class vectors to binary class matrices (one-hot encoding)
y_train_binary = keras.utils.to_categorical(y_train)
y_test_binary = keras.utils.to_categorical(y_test)

# Reshape to (samples, features, 1) for the 1D convolution.
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], 1)
x_test = x_test.reshape(x_test.shape[0], x_train.shape[1], 1)

##################### Build the CNN network ############################
model_cnn = Sequential()
model_cnn.add(Conv1D(32, 3, input_shape=(x_train.shape[1], 1),
                     activation='relu'))
model_cnn.add(Flatten())
model_cnn.add(Dense(64, activation='relu'))  # fully connected hidden layer
model_cnn.add(Dense(num_classes, activation='softmax'))
model_cnn.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
model_cnn.summary()
history = model_cnn.fit(x_train, y_train_binary,
                        batch_size=batch_size,
                        epochs=epochs,
                        verbose=1,
                        validation_data=(x_test, y_test_binary))

y_pred_cnn = model_cnn.predict(x_test)
print(y_pred_cnn)
print('Genuine', y_pred_cnn[0, 0])
print('Fake', y_pred_cnn[0, 1])

# Classify a new, manually entered profile.
a = input("Enter the name:")
usname = classifier.classify(gender_features(a))
print(usname)

ln = []
for i in usname:
    k = int(strToBinary(i))
    ln.append(binaryToDecimal(k))
name = sum(ln)
gender = 0 if usname == "male" else 1
statuses_count = int(input("statuses_count:"))
followers_count = int(input("followers_count:"))
friends_count = int(input("friends_count:"))
favourites_count = int(input("favourites_count:"))
listed_count = int(input("listed_count:"))
lang_code = int(input("lang_code:"))
new = {"name": name, "gender": gender, "statuses_count": statuses_count,
       "followers_count": followers_count, "friends_count": friends_count,
       "favourites_count": favourites_count, "listed_count": listed_count,
       "lang_code": lang_code}
de = pd.DataFrame(new, index=[0])
de.to_csv("new.csv")
re = pd.read_csv("new.csv")
rs = re.iloc[:, 1:9].values
print(rs)
print(rs.shape)
rs = rs.reshape(1, 8, 1)

####################### Prediction for the new input ###################
y_pred_cnn_input = model_cnn.predict(rs)
print(y_pred_cnn_input)
6.3 EXPERIMENT RESULT

SPLITTING OF TEST AND TRAIN DATA

Figure 6.1 below shows the result of splitting the test and train data
for the convolutional neural network.

Figure 6.1 CNN with splitting of test and train data.

Figure 6.2 below shows the result of preprocessing for the convolutional
neural network.

Figure 6.2 Preprocessing in the convolutional neural network.

Figure 6.3 below shows the result of preprocessing with feature extraction
in the convolutional neural network.

Figure 6.3 Feature extraction in the convolutional neural network.

Figure 6.4 below shows the result of prediction in the convolutional neural
network.

Figure 6.4 Prediction in the convolutional neural network.

Figure 6.5 below shows the result of the training epochs in the convolutional
neural network.

Figure 6.5 Epochs in the convolutional neural network.

Figure 6.6 below shows the classification result of the convolutional neural
network.

Figure 6.6 Classification result of the convolutional neural network.


CHAPTER 7

CONCLUSION AND FUTURE ENHANCEMENT

Fake profiles are identified from the dataset using deep learning networks
built in Keras with the TensorFlow backend, and the results are reported.
Fake profiles are created in social networks for various reasons by
individuals or groups. The results concern detecting whether an account is
fake or genuine using engineered features and deep learning models such as
convolutional neural networks. The predictions indicate that the
convolutional neural network achieved 97.5% accuracy.

REFERENCES

1. Aman Shenoy, Teertharaj Chatterjee, and Prajjwal Mahajan, "Community
Detection - Detection of Fake Profiles on Facebook", 2019.

2. B. Erşahin, Ö. Aktaş, D. Kılınç, and C. Akyol, "Twitter fake account
detection", in Proc. Int. Conf. Computer Science and Engineering (UBMK),
2017.

3. Cao Xiao, David Mandell Freeman, and Theodore Hwa, "Detecting clusters
of fake accounts in online social networks", 2015.

4. Vijay Tiwari, "Analysis and Detection of Fake Profile over Social
Network", International Conference on Computing, Communication & Automation
(ICCCA), 2017.

5. Haoran Xu and Yuqing Sun, "Identify User Variants Based on User Behavior
on Social Media", School of Computer Science and Technology, Engineering
Research Center of Digital Media Tech., IEEE, 2015.

6. K. Siva Nandini, P. Bhavya Anjali, and K. Devi Manaswi, "Fake Profile
Identification in Online Social Networks", IJRTE, ISSN: 2277-3878, vol. 8,
no. 4, Nov. 2019.

7. Liyan Dong, Yongli Li, Han Yin, Huang Le, and Mao Rui, "The Algorithm of
Link Prediction on Social Network", Mathematical Problems in Engineering,
vol. 2013, pp. 1-7, 2013.

8. M. Conti, R. Poovendran, and M. Secchiero, "FakeBook: Detecting fake
profiles in on-line social networks", in Proceedings of the 2012 IEEE/ACM
International Conference on Advances in Social Networks Analysis and Mining,
pp. 1071-1078, Turkey, August 2012.

9. S. Revathi and M. Suriakala, "Profile Similarity Communication Matching
Approaches for Detection of Duplicate Profiles in Online Social Network",
3rd IEEE International Conference on Comp. Sys. & Info. Tech., 2018.

10. Salon.com, "The fake Facebook profile I could not get removed",
http://www.salon.com/2011/02/02/my_fake_facebook_profile/, 2012.

11. Shivangi Gheewala and Rakesh Patel, "Machine Learning Based Twitter
SPAM: Review", Proceedings of the Second International Conference on
Computing Methodologies and Communication (ICCMC), 2018.

12. W. Cukierski, B. Hamner, and B. Yang, "Graph-based features for
supervised link prediction", in Proceedings of the International Joint
Conference on Neural Networks (IJCNN '11), IEEE, San Jose, Calif., USA, 2011.

13. Yazan Boshmaf, Georgos Siganos, and Jorge Leria, "Integro: Leveraging
victim prediction for robust fake account detection in large scale OSNs",
2016.

14. Y. Boshmaf, D. Logothetis, G. Siganos, et al., "Integro: Leveraging
victim prediction for robust fake account detection in large scale OSNs",
Computers & Security, vol. 61, pp. 142-168, 2016.

15. Yin Zhu, Xiao Wang, Erheng Zhong, Nathan N. Liu, He Li, and Qiang Yang,
"Discovering Spammers in Social Networks", 2012.
