
Detecting Fake Accounts in Media Application Using Machine Learning


Special Issue Published in Int. Jnl. of Advanced Networking & Applications (IJANA)

Detecting Fake Accounts in Media Application Using Machine Learning
Gayathri A
Dept. of Computer Science, Velammal Engg. College, Chennai, India
Radhika S
Dept. of Computer Science, Velammal Engg. College, Chennai, India
Mrs. Jayalakshmi S.L.
Assistant Professor, Dept. of Computer Science, Velammal Engg. College, Chennai, India
-------------------------------------------------------------------ABSTRACT--------------------------------------------------------------
Social networks, a crucial part of our lives, are plagued by online impersonation and fake accounts. Fake profiles
are mostly used by intruders to carry out malicious activities such as harming a person, identity theft and
privacy intrusion in Online Social Networks (OSN). Hence, identifying whether an account is genuine or fake is one
of the critical problems in OSN. In this paper we propose several classification algorithms, such as the Support
Vector Machine algorithm and deep neural networks. The paper also compares these classification methods on the
SpamUser dataset in order to select the best one.

Keywords - fake accounts, fake identities, social media, data science, friends, followers, fake profiles
------------------------------------------------------------------------------------------------------------------------------------------------
I. INTRODUCTION
In the present generation, everyone in society has become associated with Online Social Networks (OSN). These OSNs have drastically changed the way we pursue our social life. Making new friends, keeping in contact with them and following their updates has become easier. But with the rapid growth of social media, problems such as fake profiles and online impersonation have also grown. No feasible solution exists to control these problems. Fake accounts can be human-generated, computer-generated (also referred to as "bots"), or cyborgs [1]. A cyborg is a half-human, half-bot account [1]. Such an account is manually opened by a human, but from then onwards its actions are automated by a bot.

To become a member of an OSN, the user has to create a profile by entering information such as name, photo, date of birth, email ID, graduation details, place of work, home town, interests and so on [2][3]. Some of the fields are mandatory and some are optional, and this varies from one OSN to another. These websites are popular because of people's interest in finding friends, sharing pictures, tagging people in group photos, sharing their ideas and views on common topics, and maintaining good business relationships and general interests with others.

In this paper we present a framework that makes automatic detection of fake profiles possible and efficient. This framework uses classification techniques such as Support Vector Machine, Random Forest and Deep Neural Networks to classify profiles into fake or genuine classes. As it is an automatic detection method, it can be applied easily by an OSN that has millions of profiles, which cannot be examined manually. We evaluate whether readily available and engineered features can be used for successful detection with machine learning models.

II. RELATED WORK
This paper presents some filtering algorithms that rely on classification to decide whether a profile is genuine or fake.

III. SUPPORT VECTOR MACHINE
Support Vector Machine (SVM) is a binary classification algorithm that finds the maximum-margin separating hyperplane between two classes. It is a supervised learning algorithm that, given enough training examples, divides the two classes fairly well and classifies new examples. It offers a principled approach to machine learning problems because of its mathematical foundation in statistical learning theory [10]. An SVM constructs its solution as a weighted sum of support vectors (SVs), which are only a subset of the training inputs. It is effective in cases where the number of dimensions is greater than the number of samples.

IV. RANDOM FOREST
Random Forest is a versatile method that performs both classification and regression tasks [8]. It has nearly the same hyperparameters as a decision tree or a bagging classifier. It creates many variations of trees. The best outcome, typically the majority vote across the trees, is used to predict identity deception. Each outcome from the classifier represents a different section of a tree.

V. SPAM FILTERING
The research study by Simranjit Kaur et al [4] is based on implementing a k-means clustering algorithm on a vector set to increase efficiency. To detect spam emails using neural networks, two phases, namely training and testing, need to be carried out. The study describes the process of detecting spam and phishing emails using a feed-forward neural network. In that paper, 11 features are implemented as binary values 0 or 1, with value 1 indicating that the feature appeared in the tested email and value 0 indicating that it did not.
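As a minimal sketch of the binary encoding described above, the snippet below maps an email to a vector of 0/1 values. The three indicator checks and the function name `encode_email` are hypothetical placeholders chosen for illustration; they are not the 11 features of the cited study.

```python
# Hypothetical indicator features for illustration only;
# the 11 features of the cited study are not reproduced here.
FEATURE_CHECKS = [
    ("contains_link", lambda text: "http://" in text or "https://" in text),
    ("mentions_password", lambda text: "password" in text.lower()),
    ("has_exclamation", lambda text: "!" in text),
]

def encode_email(text):
    """Return a list of 0/1 values: 1 if the feature appears in the email, else 0."""
    return [1 if check(text) else 0 for _, check in FEATURE_CHECKS]

vector = encode_email("Reset your PASSWORD now at http://example.test")
print(vector)
```

A vector of this kind is what a feed-forward network would receive as input during both the training and the testing phase.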
Table 1: Attributes used in previous study.
VI. ACTIVITY PATTERN
The research study by Jiang et al used CatchSync to detect suspicious behavior on Twitter based on synchronized and abnormal user activity. They showed that their approach detected fake accounts on Twitter with high efficiency.

VII. SUPERVISED MACHINE LEARNING
Garadi et al [7] evaluate whether the readily available and engineered features that machine learning algorithms use to successfully detect fake identities created by bots or computers can also be used to detect fake identities created by humans. The work assumes that similar features can serve as a catalyst for uncovering identity deception by humans on online social networks.

VIII. REINFORCEMENT MACHINE LEARNING
Venakatesan et al [5] presented a reinforcement proof-of-concept model that rewards itself for detecting bots successfully. Reinforcement machine learning models require feedback from the environment to adjust and improve. Such feedback is not readily available in social media networks.

IX. UNSUPERVISED MACHINE LEARNING
Miller et al [6] noted that supervised machine learning models require a label included in the corpus to predict the expected outcome. With unsupervised machine learning the data is unlabeled and is grouped based on similarity. It is not practical to search for the class consisting of fake accounts. The norm is to train a one-class support vector machine on the minority class.

X. FILTERING
When a new threat is identified and verified, the sender is added to a blacklist [9]. Similar methods of dealing with spam have been proposed on Twitter to blacklist known malicious URL content and quarantine known bots.

XI. MOTIVATION WORK
In today's online social networks there are many problems such as fake profiles and online impersonation. To date, no one has come up with a feasible solution to these problems. In this project we intend to provide a framework for the automatic detection of fake profiles, so that people's social life becomes secure. An automatic detection technique also makes it easier for sites to manage the huge number of profiles, which cannot be done manually.

XII. PROPOSED WORK
In the proposed approach, the detection process starts with the selection of the profile that needs to be tested. After the profile is selected, the suitable attributes (i.e., features) on which the classification algorithm operates are selected, and the extracted attributes are passed to the trained classifier.

The classifier is retrained regularly as new training data is fed into it. The classifier determines whether the profile is fake or real. The classifier may not be 100% accurate in classifying the profile, so the feedback obtained from the result is given back to the classifier. For example, if a profile is identified as fake, the social networking site can send a notification to the profile asking it to submit details.

Classification is the process of learning a target function f that maps each record, consisting of a set of attributes, to one of the predefined class labels. A classification technique is an approach to building classification models from an input data set. This technique uses a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label of the training set.
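To make the idea of learning a target function f concrete, the following sketch trains a linear Support Vector Machine with plain subgradient descent on the regularized hinge loss. It is a simplified stand-in and not the implementation used in this paper; the toy records, the learning rate, the regularization strength and the epoch count are all assumptions made for illustration.

```python
def train_linear_svm(records, labels, lr=0.05, reg=0.01, epochs=500):
    """Fit weights w and bias b by subgradient descent on the
    L2-regularized hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(records[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(records, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Misclassified or inside the margin: the hinge term is active.
                w = [wi - lr * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with enough margin: only the regularizer shrinks w.
                w = [wi - lr * reg * wi for wi in w]
    return w, b

def f(record, w, b):
    """Learned target function: maps an attribute vector to a class label."""
    score = sum(wi * xi for wi, xi in zip(w, record)) + b
    return 1 if score >= 0 else -1

# Made-up records [friends, followers], scaled to [0, 1]; +1 = genuine, -1 = fake.
records = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.1, 0.2], [0.2, 0.1], [0.15, 0.15]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(records, labels)
predictions = [f(r, w, b) for r in records]
```

Here the learning algorithm is the training loop, the model is the pair (w, b), and f is the resulting mapping from attribute set to class label.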

The model generated by the learning algorithm should both fit the input data correctly and correctly predict the class labels of records it has not seen before. The goal of the learning algorithm is therefore to build a model with good generalization capability. Different steps are executed to classify an account as a fake or genuine profile. They are:

A data set of both fake and genuine profiles with various attributes such as number of friends, followers and status count is needed. The dataset is divided into training and testing data. The classification algorithms are trained using the training dataset, and the testing dataset is used to determine the efficiency of the algorithms. From the dataset used, 80% of both kinds of profiles (real and fake) are used to prepare the training dataset and the remaining 20% of both are used to prepare the testing dataset.

Features are selected to apply the classification algorithms. The classification algorithms are discussed further below. Attributes are selected as features if they are not dependent on other attributes and they increase the efficiency of the classification.

Fig 1: RESEARCH STEPS
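The 80/20 division of both real and fake profiles described in the steps above can be sketched as a per-class (stratified) split. The helper name `stratified_split`, the fixed seed and the integer placeholder ids are assumptions for illustration; the class sizes mirror the 1337 fake and 1481 genuine users of the dataset used in this paper.

```python
import random

def stratified_split(profiles, labels, train_fraction=0.8, seed=0):
    """Split each class separately so both classes keep the same train/test ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [p for p, y in zip(profiles, labels) if y == cls]
        rng.shuffle(members)
        cut = int(len(members) * train_fraction)
        train += [(p, cls) for p in members[:cut]]
        test += [(p, cls) for p in members[cut:]]
    return train, test

# Placeholder profile ids; real attribute vectors would replace them.
profiles = list(range(1337 + 1481))
labels = ["fake"] * 1337 + ["genuine"] * 1481
train, test = stratified_split(profiles, labels)
print(len(train), len(test))
```

Splitting per class rather than over the whole dataset keeps the proportion of fake and genuine profiles the same in the training and testing sets.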

Table 2: FEATURES EXTRACTED

S.NO  FEATURES
1     Number of friends
2     Number of followers
3     Favorite Count
4     Languages Known
5     Sex code
6     Listed Count
7     Status Count

After selection of attributes, a dataset of profiles that are already classified as fake or genuine is needed for training the classification algorithm. We have used a publicly available dataset of 1337 fake users and 1481 genuine users consisting of various attributes including listed count, status count, number of friends, followers count, favourites, languages known and sex code.

Classification is the process of categorizing a data object into categories called classes based upon features/attributes associated with that data object. Classification uses a classifier, an algorithm that processes the attributes of each data object and outputs a class based upon this information. In this project, we use Support Vector Machine as a classifier. Support Vector Machine is an elegant and robust technique for classification on a large data set, not unlike the data sets of social networks with several millions of profiles. The algorithms used for classification are Support Vector Machine, Random Forest and Deep Neural Networks.

XIII. EXPERIMENTAL RESULTS
We used Keras with the TensorFlow backend to implement the Multi-Layer Perceptron model. We used a neural network with one hidden layer of 500 hidden units. The output obtained from the neural network is a single value, which we pass through the sigmoid non-linearity to squash it into the range [0, 1]. The sigmoid function is defined by

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The output from the neural network gives the probability of a positive tweet, i.e. the probability of the tweet's sentiment being positive. At the prediction step, we round off the probability values to class labels 0 (negative) and 1 (positive). Red hidden layers represent layers with sigmoid non-linearity. We also conducted experiments using SGD + Momentum weight updates and found that it takes too long to cover the entire data set. We ran our model for up to 20 epochs, after which it began to overfit. In this way the profile is identified as real or fake.

We used a sparse vector representation of tweets for training the classifier. We found that the presence of bigram features significantly improved the accuracy. The overall accuracy across all machine learning models was very high, with the highest being 94.43% using Deep Neural Networks, followed by 94% using the Random Forest method and 90.01% using the Support Vector Machine algorithm. These accuracies are well above the roughly 53% that would be obtained by always predicting the majority class (1481 genuine users out of 2818 profiles).
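The forward pass of a network with one hidden layer and a sigmoid output, followed by rounding to a class label, can be sketched in plain Python. This is not the Keras model used in the experiments: the weights are random and untrained, 4 hidden units stand in for the 500 reported, the sigmoid hidden activation and the input values are assumptions, and 7 inputs are used to match the seven features of Table 2.

```python
import math
import random

def sigmoid(x):
    """Logistic function 1 / (1 + e^-x), mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(features, hidden_weights, hidden_biases, out_weights, out_bias):
    """One hidden layer followed by a single sigmoid output unit."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)

rng = random.Random(0)
n_inputs, n_hidden = 7, 4  # 4 hidden units instead of 500, for brevity
hidden_weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_biases = [0.0] * n_hidden
out_weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]

# Made-up, already scaled feature values for a single profile.
probability = forward([0.2, 0.5, 0.1, 0.0, 1.0, 0.3, 0.7],
                      hidden_weights, hidden_biases, out_weights, 0.0)
# Threshold at 0.5 to turn the probability into a class label (0 or 1).
label = 1 if probability >= 0.5 else 0
```

An explicit 0.5 threshold is used instead of Python's built-in `round`, which rounds exact halves to the nearest even integer.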

A confusion matrix is a technique for describing the performance of a classification algorithm. It gives a better idea of what the classification model is getting right and what types of errors it is making. The results of all the algorithms are plotted in confusion matrices to show where the errors have occurred.
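A confusion matrix for the two-class case can be computed with a few lines of code. The sketch below treats "fake" as the positive class, which is an assumption, and the label lists are made up for illustration; they are not results from this paper.

```python
def confusion_matrix(actual, predicted, positive="fake"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Made-up labels for illustration only.
actual = ["fake", "fake", "genuine", "genuine", "fake", "genuine"]
predicted = ["fake", "genuine", "genuine", "fake", "fake", "genuine"]
cm = confusion_matrix(actual, predicted)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
print(cm, accuracy)
```

The FP and FN cells separate genuine profiles wrongly flagged as fake from fake profiles that were missed, which a single accuracy figure does not show.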


Fig 2: RESULTS

XIV. CONCLUSION
In this project we have presented a machine learning pipeline for detecting fake accounts in online social networks. Rather than making a prediction for each individual account, our system classifies clusters of fake accounts to determine whether they have been created by the same actor. Our evaluation on both in-sample and out-of-sample data showed strong performance, and we have used the system in production to find and restrict more than 250,000 accounts. In this work we evaluated our framework on clusters created by simple grouping based on registration date and registration IP address. In future work we expect to run our model on clusters that are created by grouping on other features, such as ISP, and other time periods, such as week or month.

Another promising line of research is to use more sophisticated clustering algorithms such as k-means or hierarchical clustering. While these approaches may be fruitful, they present obstacles to operating at scale: k-means may require too many clusters (i.e., too large a value of k) to produce useful results, and clustering of data may be too computationally intensive for classifying millions of accounts in an Online Social Network.

From a modeling perspective, one important direction for future work is to apply feature sets used in other spam detection models, and hence to realize multi-model ensemble prediction. Another direction is to make the system robust against adversarial attacks, such as a botnet that diversifies all features, or an attacker that learns from failures.

REFERENCES
[1] Estee Van Der Walt and Jan Eloff, "Using Machine Learning to Detect Fake Identities: Bots vs Humans," IEEE Trans. Emerg. Topics Comput. Intell., vol. 1, no. 1, pp. 61–71, March 2018.
[2] Loredana Caruccio, Domenico Desiato and Giuseppe Polese, "Fake Account Identification in Social Networks," IEEE International Conference on Big Data, vol. 9, no. 6, pp. 811–824, 2018.
[3] Sarah Khaled, Neamat El-Tazi and Hoda M. O. Mokhtar, "Detecting Fake Accounts on Social Media," IEEE International Conference on Big Data, vol. 6, pp. 101–110, 2018.
[4] Suyash Somani and Somya Jain, "Resolving Identities on FaceBook and Twitter," Tenth International Conference on Contemporary Computing (IC3), 10-12 August 2017.
[5] Francesco Buccafurri, Gianluca Lax, Denis Migdal, Serena Nicolazzo, Antonino Nocera and Christophe Rosenberger, "Contrasting False Identities in Social Networks by Trust Chains and Biometric Reinforcement," International Conference on Cyberworlds, vol. 5, 2017.
[6] Mohamed Torky, Ali Meligy and Hani Ibrahim, "Recognizing Fake Identities In Online Social Networks Based on a Finite Automaton Approach," International Journal of Computer Applications, 2016.
[7] Supraja Gurajala, Joshua S White, Brian Hudson, Brian R Voter and Jeanna N Matthews, "Profile characteristics of fake Twitter accounts," Big Data & Society, July–December 2016: 1–13.
[8] Simranjit Kaur Tuteja, "A survey on classification algorithms for email spam filtering," International Journal Eng. Sci., vol. 6, no. 5, pp. 5937–5940, 2016.
[9] M. A. Devmane and N. K. Rana, "Detection and Prevention of Profile Cloning in Online Social Networks," IEEE International Conference on Recent Advances and Innovations in Engineering, May 09-11, 2014.
[10] Sara Keretna, Ahmad Hossny and Doug Creighton, "Recognising User Identity in Twitter Social Networks via Text Mining," IEEE International Conference on Systems, Man, and Cybernetics, 2013.
