1 - Fake Profile Identification in Social Network Using Machine Learning and NLP
MASTER OF TECHNOLOGY
IN
COMPUTER SCIENCE & ENGINEERING
Submitted by
JALA SINDHUJA
14BJ1D5802
Under the esteemed guidance of
Mr. RAVEENDRA REDDY ENUMULA (Ph.D)
Assistant Professor
CERTIFICATE
This is to certify that the project report entitled “Connecting Social Media To E-Commerce:
Cold-Start Product Recommendation using Microblogging Information” is the bona fide
record of project work carried out by JALA SINDHUJA, a student of this college, during the
academic year 2014 - 2016, in partial fulfillment of the requirements for the award of the degree
of Master of Technology in Computer Science & Engineering from St. Mary's Group of
Institutions Guntur of Jawaharlal Nehru Technological University, Kakinada.
JALA SINDHUJA
(14BJ1D5802)
ACKNOWLEDGEMENT
We consider it a privilege to thank all those who helped us in the successful
completion of the project “Connecting Social Media To E-Commerce: Cold-Start Product
Recommendation using Microblogging Information”. We extend special gratitude to our guide,
Mr. E. Raveendra Reddy, Asst. Professor, whose stimulating suggestions and
encouragement helped us to coordinate our project, especially in writing this report, and whose
valuable suggestions, guidance and comprehensive assistance helped us greatly in presenting the
project.
We would also like to acknowledge with much appreciation the crucial role of our Co-Ordinator,
Mr. E. Raveendra Reddy, Asst. Professor, in helping us complete our project. We
thank him for being such a wonderful educator as well as a person.
We express our heartfelt thanks to Mr. Subhani Shaik, Head of the Department, CSE, for
generously sharing his knowledge, which helped us in carrying out this project through the
academic year.
JALA SINDHUJA
(14BJ1D5802)
ABSTRACT
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 HISTORY
1.2 SOCIAL IMPACT
1.3 ISSUES
1.4 OBJECTIVE
1.5 DEEP LEARNING
1.6 WORKING OF DEEP LEARNING
1.7 DEEP LEARNING vs. MACHINE LEARNING
1.8 NETWORK ARCHITECTURES OF DEEP LEARNING
1.9 ORGANIZATION OF THE THESIS
2 SURVEY ON FAKE PROFILE IDENTIFICATION
3 SYSTEM REQUIREMENTS
3.1 HARDWARE REQUIREMENTS
3.2 SOFTWARE REQUIREMENTS
3.3 FEATURES OF KERAS
3.4 PRINCIPLES OF KERAS
3.5 FEATURES OF TENSOR FLOW
4 FAKE PROFILE IDENTIFICATION ON SOCIAL NETWORKS USING DEEP LEARNING
4.1 CONVOLUTIONAL NEURAL NETWORK
4.2 WORKING OF CNN LAYERS
4.2.1 CONVOLUTIONAL LAYER
4.2.2 POOLING
4.2.3 FULLY CONNECTED LAYER
4.3 SPLITTING OF TRAINING AND TESTING DATA
4.3.1 PREREQUISITES FOR TRAIN AND TEST DATA
5 FAKE PROFILE IDENTIFICATION ON SOCIAL NETWORKS USING DEEP LEARNING
5.1 PREPROCESSING
5.2 STEPS IN PREPROCESSING
5.2.1 DATA CLEANING
5.2.2 DATA INTEGRATION AND TRANSFORMATION
5.2.3 DATA REDUCTION
5.2.4 DISCRETIZATION AND CONCEPT HIERARCHY GENERATION
6 IMPLEMENTATION AND RESULT
6.1 INTRODUCTION
6.2 CODING
6.3 EXPERIMENT RESULT
7 CONCLUSION
REFERENCES
LIST OF FIGURES
LIST OF ABBREVIATIONS
AI Artificial Intelligence
API Application Programming Interface
CNGF Common New Generation Frigate
CNN Convolutional Neural Network
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
DBLP Digital Bibliography and Library Project
EMD Entropy Minimization Discretization
FNR False Negative Rate
FPR False Positive Rate
GPU Graphics Processing Unit
IOS Internetwork Operating System
ML Machine Learning
OSN Online Social Network
PCA Principal Component Analysis
ReLU Rectified Linear Unit
RF Random Forest
SMFSR Supervised Matrix Factorization method with Regularization
SMOTE Synthetic Minority Oversampling Technique
SVM Support Vector Machine
TPR True Positive Rate
TPU Tensor Processing Unit
UPN User Principal Name
WWW World Wide Web
CHAPTER 1
INTRODUCTION
1.1 HISTORY
Early social networking on the World Wide Web (WWW) began in the
form of generalized online communities such as Theglobe.com (1995),
Geocities (1994) and Tripod.com (1995). In the late 1990s, user profiles
became a central feature of social networking sites, allowing users to compile
lists of "friends" and search for other users with similar interests. Facebook,
launched in 2004, became the largest social networking site in the world in
early 2009. Facebook was first introduced as a Harvard social networking
site, expanding to other universities and eventually, anyone. The term social
media was introduced and soon became widespread.
employment, business, etc. Researchers have been studying these online
social networks to see the impact they have on people. Teachers can
reach students easily through these sites, creating a friendly environment in
which to study. Teachers are now familiarizing themselves with
these sites, setting up online classroom pages, assigning homework and holding
discussions, which can improve education considerably. Employers can use
social networking sites to recruit people who are talented and interested
in the work, and background checks can be done easily.
1.3 ISSUES
Social networking sites make our social lives better, but
there are still many issues with using them.
These include privacy violations, online bullying, potential for misuse and trolling,
most of which are carried out through fake profiles.
1.4 OBJECTIVE
The objective is a framework that detects fake profiles using
machine learning algorithms, so that people's social lives become more secure.
One of the most common AI techniques used for processing big data is
machine learning, a self-adaptive approach whose analysis and pattern detection
improve with experience or with newly added data.
If a digital payments company wanted to detect the occurrence of, or
potential for, fraud in its system, it could employ machine learning tools for
this purpose. The computational algorithm built into a computer model
processes all transactions happening on the digital platform, finds patterns in the
data set and points out any anomaly that deviates from those patterns.
Deep learning works with networks of many layers, in some cases more than 150.
Chapter 1 gives a brief description of deep learning and
its classification. Chapter 2 describes related methods for identifying
fake profiles. Chapter 3 covers the system requirements. Chapters 4
and 5 describe the modules. Chapter 6 describes the
implementation and results. Chapter 7 presents the conclusion of the
thesis.
CHAPTER 3
SYSTEM REQUIREMENTS
1. RAM - 4 GB
5. OS - Windows 7 or 8 or 10
User friendliness
Keras is an API designed for human beings, not machines. It puts user
experience front and center. Keras follows best practices for reducing
cognitive load: it offers consistent and simple APIs, it minimizes the number
of user actions required for common use cases, and it provides clear and
actionable feedback upon user error.
Modularity
Minimalism
Each module should be kept short and simple. Every piece of code
should be transparent upon first reading. No black magic: it hurts iteration
speed and ability to innovate.
Easy extensibility
New modules are dead simple to add (as new classes and functions),
and existing modules provide ample examples.
To be able to easily create new modules allows for total expressiveness, making
Keras suitable for advanced research.
WINDOWS
Responsive Construct
With TensorFlow we can easily visualize each and every part of the
graph, which is not an option when using NumPy or scikit-learn.
Flexible
Easily Trainable
TensorFlow offers pipelining in the sense that you can train
multiple neural networks on multiple GPUs, which makes the models very
efficient on large-scale systems.
Large Community
Open Source
The best thing about this machine learning library is that it is open
source, so anyone can use it as long as they have internet connectivity.
People therefore adapt the library in many ways and come up
with an amazing variety of useful products. It has become another DIY
community, with a huge forum for people getting started with it and for
those who find it hard to use or need help with their work.
Feature columns
The filter is smaller than the input data and the type of multiplication
applied between a filter-sized patch of the input and the filter is a dot product. A
dot product is the element-wise multiplication between the filter-sized patch of
the input and filter, which is then summed, always resulting in a single value.
Because it results in a single value, the operation is often referred to as the scalar
product. Using a filter smaller than the input is intentional as it allows the same
filter (set of weights) to be multiplied by the input array multiple times at
different points on the input. Specifically, the filter is applied systematically to
each overlapping part or filter-sized patch of the input data, left to right, top to
bottom.
Figure 4.2 Example of a filter applied to a one-dimensional input to create a
feature map
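The sliding dot product described above can be sketched in a few lines of plain Python. This is only an illustration: the function name conv1d, the input values and the filter weights are assumptions made for the example, and a stride of 1 with no padding is assumed.

```python
def conv1d(signal, kernel):
    """Slide `kernel` over `signal` and take a dot product at each position."""
    k = len(kernel)
    feature_map = []
    # Move left to right over every overlapping filter-sized patch.
    for start in range(len(signal) - k + 1):
        patch = signal[start:start + k]
        # Element-wise multiplication followed by a sum gives one value.
        feature_map.append(sum(p * w for p, w in zip(patch, kernel)))
    return feature_map


# Hypothetical one-dimensional input and filter (illustration only).
signal = [0, 0, 0, 1, 1, 0, 0, 0]
kernel = [0, 1, 0]
feature_map = conv1d(signal, kernel)
print(feature_map)  # [0, 0, 1, 1, 0, 0]
```

The same three weights are reused at every position, which is why an input of 8 values and a filter of 3 values give a feature map of 8 - 3 + 1 = 6 values.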
4.2.2 Pooling:
The pooling layer operates upon each feature map separately to create a
new set of the same number of pooled feature maps. Pooling involves selecting a
pooling operation, much like a filter to be applied to feature maps. The size of
the pooling operation or filter is smaller than the size of the feature map;
specifically, it is almost always 2×2 pixels applied with a stride of 2 pixels.
This means that the pooling layer will always reduce the size of each
feature map by a factor of 2, e.g. each dimension is halved, reducing the number
of pixels or values in each feature map to one quarter the size. For example, a
pooling layer applied to a feature map of 6×6 (36 pixels) will result in an output
pooled feature map of 3×3 (9 pixels).
The result of using a pooling layer and creating down sampled or pooled
feature maps is a summarized version of the features detected in the input. They
are useful as small changes in the location of the feature in the input detected by
the convolutional layer will result in a pooled feature map with the feature in the
same location. This capability added by pooling is called the model’s invariance
to local translation.
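The 6×6 to 3×3 example above can be checked with a short sketch. The text does not name the pooling operation, so max pooling is assumed here; the function name max_pool_2x2 and the input values are likewise made up for the illustration.

```python
def max_pool_2x2(feature_map):
    """Apply 2x2 max pooling with a stride of 2 to a 2-D list of numbers."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):          # step 2 rows at a time
        pooled_row = []
        for c in range(0, cols - 1, 2):      # step 2 columns at a time
            window = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            pooled_row.append(max(window))   # keep the largest of the 4 values
        pooled.append(pooled_row)
    return pooled


# Hypothetical 6x6 feature map holding the values 0..35 (illustration only).
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[7, 9, 11], [19, 21, 23], [31, 33, 35]]
```

Each dimension is halved, so the 36 input values are summarized by 9 output values.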
After a model has been processed by using the training set, you test the
model by making predictions against the test set. Because the data in the testing
set already contains known values for the attribute that you want to predict, it is
easy to determine whether the model's guesses are correct.
[Figure: dataset divided into a training set and a testing set]
STEP 1: We will need Python libraries such as pandas and sklearn. Command for
installing Python libraries:
STEP 2: We use pandas to import the dataset and sklearn to perform the splitting.
STEP 3: Splitting the dataset into training and testing sets in Python involves the following process:
• Splitting
We fit our model on the training data and then use it to make predictions. Let’s import
the linear model from sklearn, apply linear regression to the dataset, and plot the
results.
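A minimal sketch of these steps is given below, assuming pandas and scikit-learn are installed (commonly done with pip install pandas scikit-learn; the original command is not shown in the text). The toy dataset, the 80/20 ratio and the fixed random_state are assumptions for the illustration, and the plotting step is left out to keep the example short.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset in which y = 2x + 1 exactly (illustration only).
df = pd.DataFrame({"x": range(20)})
df["y"] = 2 * df["x"] + 1

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    df[["x"]], df["y"], test_size=0.2, random_state=0
)

# Fit on the training rows only, then predict the held-out rows.
model = LinearRegression().fit(x_train, y_train)
predictions = model.predict(x_test)

print(len(x_train), len(x_test))          # 16 4
print(model.coef_[0], model.intercept_)   # close to 2 and 1
```

Because the test rows are never seen during fitting, comparing predictions with y_test gives an honest check of the model.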
CHAPTER 5
5.1 PREPROCESSING
• Data cleaning is required because source systems contain “dirty data”
that must be cleaned.
STEP1: Parsing
STEP2: Correcting
STEP3: Standardizing
STEP4: Matching
• Searching and matching records within and across the parsed, corrected
and standardized data based on predefined business rules to eliminate
duplications.
• Examples include identifying similar names and addresses.
STEP5: Consolidating
STEP6: Data cleansing must deal with many types of possible errors
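The standardizing, matching and consolidating steps can be sketched with pandas as follows. The records and column names are hypothetical and do not come from the project dataset, and matching is simplified here to exact equality after standardization; real matching rules are usually fuzzier.

```python
import pandas as pd

# Hypothetical raw records (illustration only, not the project data).
raw = pd.DataFrame({
    "name": ["  alice SMITH", "Alice Smith ", "bob  jones"],
    "city": ["guntur", "Guntur", "KAKINADA"],
})

# Correcting and standardizing: trim, collapse repeated spaces, unify case.
clean = raw.copy()
for column in ("name", "city"):
    clean[column] = (
        clean[column]
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.title()
    )

# Matching and consolidating: rows that are now identical count as duplicates.
consolidated = clean.drop_duplicates().reset_index(drop=True)
print(consolidated)
```

After standardization the first two records become identical, so three raw rows consolidate into two.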
6.1 INTRODUCTION
The pseudo Facebook dataset includes hypothetical samples
corresponding to two types of Facebook accounts, genuine and fake. Attributes
in the dataset include the number of friends, followers, status count, etc. The dataset
is divided into training and testing data: 80% of the profiles of each type
(genuine and fake) are used to prepare the training dataset and the remaining 20%
are used to prepare the testing dataset. A CNN is used to
identify the fake profiles. The algorithm is implemented in Python.
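Taking 80% of each profile type separately is a stratified split. The sketch below shows the idea in plain Python; the function name stratified_split, the class sizes and the seed are assumptions for the illustration and are not taken from the project code. With scikit-learn, a similar effect is available through the stratify argument of train_test_split.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random order within a class
        cut = int(round(train_fraction * len(indices)))
        train_idx.extend(indices[:cut])            # first 80% go to training
        test_idx.extend(indices[cut:])             # remaining 20% go to testing
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 10 genuine and 5 fake profiles (illustration only).
labels = ["Genuine"] * 10 + ["Fake"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 12 3
```

Both sets keep the same genuine-to-fake ratio as the full dataset, which matters when one class is much smaller than the other.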
6.2 CODING
preprocessing.py
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import names
from sklearn.model_selection import train_test_split

# Read the genuine and fake user profiles from CSV files.
genuine_users = pd.read_csv("data/users.csv")
fake_users = pd.read_csv("data/fusers.csv")

# Stack both sets into one frame; ignore_index gives every row a unique index.
x = pd.concat([genuine_users, fake_users], ignore_index=True)
X = pd.DataFrame(x)

# Labels follow the concatenation order: genuine rows first, then fake rows.
t = len(genuine_users)*['Genuine'] + len(fake_users)*['Fake']
y = pd.Series(t)
def gender_features(word):
    # The last letter of a name is used as the only feature.
    return {'last_letter': word[-1]}

# Labelled first names from the NLTK names corpus
# (requires nltk.download('names') once).
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
featuresets = [(gender_features(n), gender)
               for (n, gender) in labeled_names]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

# Predict a gender for every profile name.
a = []
for i in X['name']:
    vf = classifier.classify(gender_features(i))
    a.append(vf)
X['gender'] = a
print(X['gender'])
# Map each distinct language string to an integer code.
lang_list = list(enumerate(np.unique(X['lang'])))
lang_dict = {name: i for i, name in lang_list}

def strToBinary(s):
    # Convert every character to the binary digits of its code point.
    bin_conv = []
    for c in s:
        ascii_val = ord(c)
        binary_val = bin(ascii_val)
        bin_conv.append(binary_val[2:])  # drop the '0b' prefix
    num = ' '.join(bin_conv)
    return num

# This code is contributed
# by Vikas Chitturi
def binaryToDecimal(n):
    # Read the decimal digits of n as binary digits, lowest digit first.
    num = int(n)
    dec_value = 0
    base = 1  # 2 ^ 0
    temp = num
    while temp:
        last_digit = temp % 10
        temp = temp // 10
        dec_value += last_digit * base
        base = base * 2
    return dec_value
# Encode every name as a single number.
s = X['name']
l = []
for i in s:
    k = 0
    for j in i:
        a = strToBinary(j)
        k = k + int(a)
    c = binaryToDecimal(k)
    l.append(c)
X['name'] = l

# Encode gender and language as integers.
gender = {'male': 0, 'female': 1}
X['gender'] = X['gender'].map(gender)
X['lang_code'] = X['lang'].map(lang_dict).astype(int)

feature_columns_to_use = ['name', 'gender', 'statuses_count', 'followers_count',
                          'friends_count', 'favourites_count', 'listed_count',
                          'lang_code']
ty = X.loc[:, feature_columns_to_use]

# Hold out 30% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(ty, y,
                                                    shuffle=True, test_size=0.3)
print(x_train)
print(x_test)
print(y_train)
Prediction.py
import pandas as pd
import numpy as np
import random
from nltk.corpus import names
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from keras.layers import Dense, Flatten, Conv1D,LSTM
from sklearn.feature_selection import f_classif
from keras.callbacks import ModelCheckpoint
from sklearn.metrics import accuracy_score
from keras.models import Sequential
from keras import backend as K
from functools import partial
import warnings
import keras
import matplotlib.pyplot as plt
import scikitplot.plotters as skplt
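# Read the profiles, as in preprocessing.py.
genuine_users = pd.read_csv("data/users.csv")
fake_users = pd.read_csv("data/fusers.csv")
X = pd.concat([genuine_users, fake_users], ignore_index=True)

# Labels follow the concatenation order: genuine rows first, then fake rows.
t = len(genuine_users)*['Genuine'] + len(fake_users)*['Fake']
X['label'] = t
label = {'Genuine': 0, 'Fake': 1}
X['label'] = X['label'].map(label)
y = X.loc[:, 'label'].values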
def gender_features(word):
    # The last letter of a name is used as the only feature.
    return {'last_letter': word[-1]}

# Labelled first names from the NLTK names corpus
# (requires nltk.download('names') once).
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
featuresets = [(gender_features(n), gender)
               for (n, gender) in labeled_names]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

# Predict a gender for every profile name.
a = []
for i in X['name']:
    vf = classifier.classify(gender_features(i))
    a.append(vf)
X['gender'] = a
# Map each distinct language string to an integer code.
lang_list = list(enumerate(np.unique(X['lang'])))
lang_dict = {name: i for i, name in lang_list}

def strToBinary(s):
    # Convert every character to the binary digits of its code point.
    bin_conv = []
    for c in s:
        ascii_val = ord(c)
        binary_val = bin(ascii_val)
        bin_conv.append(binary_val[2:])  # drop the '0b' prefix
    num = ' '.join(bin_conv)
    return num

def binaryToDecimal(n):
    # Read the decimal digits of n as binary digits, lowest digit first.
    num = int(n)
    dec_value = 0
    base = 1  # 2 ^ 0
    temp = num
    while temp:
        last_digit = temp % 10
        temp = temp // 10
        dec_value += last_digit * base
        base = base * 2
    return dec_value
# Encode every name as a single number.
s = X['name']
l = []
for i in s:
    k = 0
    for j in i:
        a = strToBinary(j)
        k = k + int(a)
    c = binaryToDecimal(k)
    l.append(c)
X['name'] = l

# Encode gender and language as integers.
gender = {'male': 0, 'female': 1}
X['gender'] = X['gender'].map(gender)
X['lang_code'] = X['lang'].map(lang_dict).astype(int)

feature_columns_to_use = ['name', 'gender', 'statuses_count', 'followers_count',
                          'friends_count', 'favourites_count', 'listed_count',
                          'lang_code']
print(feature_columns_to_use)
ty = X.loc[:, feature_columns_to_use].values
print(ty)

# Hold out 30% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(ty, y,
                                                    shuffle=True, test_size=0.3)
print(x_train)
print(x_test)
print(y_train)
print(y_test)
epochs = 50
num_classes = 2
batch_size = 256
input_shape = (1972, 8)

# Convert class vectors to binary class matrices (one-hot encoding).
y_train_binary = keras.utils.to_categorical(y_train)
y_test_binary = keras.utils.to_categorical(y_test)

# Add a channel axis: Conv1D expects input of shape (samples, steps, channels).
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], 1)
x_test = x_test.reshape(x_test.shape[0], x_test.shape[1], 1)

# Build the CNN.
model_cnn = Sequential()
model_cnn.add(Conv1D(32, 3, input_shape=(x_train.shape[1], 1),
                     activation='relu'))
model_cnn.add(Flatten())
model_cnn.add(Dense(64, activation='softmax'))
model_cnn.add(Dense(num_classes, activation='softmax'))
model_cnn.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
model_cnn.summary()
history = model_cnn.fit(x_train, y_train_binary,
                        batch_size=batch_size,
                        epochs=epochs,
                        verbose=1,
                        validation_data=(x_test, y_test_binary))

# Predicted class probabilities for the test set.
y_pred_cnn = model_cnn.predict(x_test)
print(y_pred_cnn)
thm = y_pred_cnn[0, 0]  # probability of class 0 (Genuine) for the first test row
tym = y_pred_cnn[0, 1]  # probability of class 1 (Fake) for the first test row
print('Original', thm)
print('Fake', tym)
# Classify a new profile entered by the user.
a = input("Enter the name:")
usname = classifier.classify(gender_features(a))
print(usname)

# Encode the entered name the same way as the training names.
k = 0
for i in a:
    c = strToBinary(i)
    k = k + int(c)
name = binaryToDecimal(k)

if usname == "male":
    gender = 0
else:
    gender = 1
statuses_count = int(input("statuses_count:"))
followers_count = int(input("followers_count:"))
friends_count = int(input("friends_count:"))
favourites_count = int(input("favourites_count:"))
listed_count = int(input("listed_count:"))
lang_code = int(input("lang_code:"))
new = {"name": name, "gender": gender, "statuses_count": statuses_count,
       "followers_count": followers_count, "friends_count": friends_count,
       "favourites_count": favourites_count, "listed_count": listed_count,
       "lang_code": lang_code}

# Save the new profile, read it back and shape it like one training sample.
de = pd.DataFrame(new, index=[0])
de.to_csv("new.csv")
re = pd.read_csv("new.csv")
rs = re.iloc[:, 1:9].values  # skip the index column written by to_csv
print(rs)
print(rs.shape)
rs = rs.reshape(1, 8, 1)

# Predict class probabilities for the new input.
y_pred_cnn_input = model_cnn.predict(rs)
print(y_pred_cnn_input)
6.3 EXPERIMENT RESULT
Figure 6.1 below shows the result of splitting the data into test and train sets for the
convolutional neural network.
Figure 6.4 below shows the result of prediction with the convolutional neural
network.
Figure 6.6 below shows the classification result of the convolutional neural
network.
Fake profiles are identified from the dataset using deep learning networks
in Keras with the TensorFlow backend, and the results are reported. Fake profiles are
created in social networks for various reasons by individuals or groups. The
results concern detecting whether an account is fake or genuine by using engineered
features and deep learning models such as convolutional neural networks. The
predictions indicate that the convolutional neural network achieved
97.5% accuracy.