
SPECIAL SECTION ON EMERGING APPROACHES TO CYBER SECURITY

Received August 28, 2019, accepted September 21, 2019, date of publication September 25, 2019, date of current version October 10, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2943639

Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model

FARHAN ULLAH 1, JUNFENG WANG 2, SOHAIL JABBAR 3, FADI AL-TURJMAN 4, AND MAMOUN ALAZAB 5

1 College of Computer Science, Sichuan University, Chengdu 610065, China
2 School of Aeronautics and Astronautics, College of Computer Science, Sichuan University, Chengdu 610065, China
3 Department of Computing and Mathematics, Manchester Metropolitan University, Manchester M15 6BH, U.K.
4 Artificial Intelligence Department, Near East University, 99138 Nicosia, Turkey
5 College of Engineering, IT & Environment, Charles Darwin University, Casuarina, NT 0810, Australia

Corresponding author: Junfeng Wang (wangjf@scu.edu.cn)

This work was supported by the National Key Research and Development Program under Grant 2019QY1400 and Grant
2018YFB0804503, the National Natural Science Foundation of China under Grant U1836103, and the Technology Research and
Development Program of Sichuan, China, under Grant 18ZDYF3867 and Grant 2017GZDZX0002.

ABSTRACT Source Code Authorship Attribution (SCAA) aims to find the real author of source code in a corpus. Although it is a privacy threat to open-source programmers, it may be significantly helpful for developing forensic applications such as ghostwriting detection, copyright dispute settlement, and other code analysis applications. Efficient feature extraction is the key challenge in classifying the real authors of specific source codes. In this paper, the Program Dependence Graph with Deep Learning (PDGDL) methodology is proposed to identify authors from source codes in different programming languages. First, the PDG is implemented to extract control and data dependencies from source codes. Second, a preprocessing technique is applied to convert PDG features into small instances with frequency details. Third, the Term Frequency Inverse Document Frequency (TFIDF) technique is used to zoom the importance of each PDG feature in source code. Fourth, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to tackle the class imbalance problem. Finally, a deep learning algorithm is applied to extract coding style features for each programmer and to attribute the real authors. The deep learning algorithm is further fine-tuned with a dropout layer, learning error rate, loss and activation functions, and dense layers for better accuracy. The proposed work is analyzed on data from 1000 programmers, collected from Google Code Jam (GCJ). The dataset contains three different programming languages, i.e., C++, Java, C#. The results outperform existing techniques in terms of classification accuracy, precision, recall, and f-measure.

INDEX TERMS Code authorship attribution, program dependence graph, deep learning, software forensics
and security, software plagiarism.

The associate editor coordinating the review of this manuscript and approving it for publication was Luis Javier Garcia Villalba.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
VOLUME 7, 2019 141987
F. Ullah et al.: SCAA Using Hybrid Approach of Program Dependence Graph and Deep Learning Model

I. INTRODUCTION
Programming code authorship attribution is the de-anonymization of programmers from source code fragments using the coding style features of known authors. It means that a programmer's coding style, or stylistic fingerprint, is preserved in the software compilation process. These features can be retrieved from source code to de-anonymize the specific author using a set of known potential programmers with their coding samples [1], [2].
The de-anonymization of programmers has implications for software privacy and security. The Research and Development (R&D) department of the White House stated that "intelligent prevention might increase the cost of malicious cyber activities, lower their gains, and influence opponents" [3]. The main tenet of software author identification is software forensics, which helps in adjudicating cases that dispute the real authorship and copyright. It is also

applied to malicious code to analyze the remnants left by an author or group and to uncover the source of the attack [4]. Source code authorship attribution mainly depends on the features that an author produces in coding structure and variable naming. It is used to assign programmers to their source codes based on these structures. Further, it is a severe privacy risk for programmers who want to remain anonymous, e.g., contributors to open-source projects or programmers from higher authorities. On the other hand, code authorship attribution plays a significant role in software forensics and security analysis, especially for targeting malware authors. Malware authors write malicious software which can compromise the compilation process in a computer system [5]. Code authorship attribution has many applications, such as stylometry analysis [6], software copyright investigation [7], software plagiarism detection [8] and software forensics [9]. Coding style patterns can be used to identify the specific author of source code. They can answer the question, "Which unknown source code document has a style exactly or approximately similar to a known author's style?" In academia, the vital application is software plagiarism detection in students' programming assignments [10]. Software ownership is essential in terms of trade-secret safety, copyright breach and patent rights [11].
It is a big challenge to identify the real author using coding style features, and several barriers can prevent the identification of malicious authors. First, it is challenging to attribute an author because of the continuous growth of his or her education, programming expertise, and use of specific software engineering paradigms [12]. Second, the author may use a different coding style for different programming languages due to constraints imposed by a manager or by tools. Third, automated tools are generally used to obfuscate software, which prevents the recognition of source code style. Recently, several code authorship attribution techniques have been proposed, with key limitations [9], [13]. (i) Mostly, the software features retrieved for author identification are not valid for another language. For example, a technique used to extract features from C++ may not be applicable to Java or C#. (ii) Prior work on extracting authorship features is not useful for a large number of programmers; the prediction accuracy decreases for a large set of programmers. (iii) Generally, the large set of features extracted from source codes is not exactly relevant for authorship identification, so an extra method is required for mining and selecting relevant features [14].
The main aim of the proposed approach is to identify the real authors of different types of source code. The feature extraction is designed so that it can be used for any programming language and does not depend on any particular programming structure. The PDG is used to extract control flow and data variation features from source codes. Then, preprocessing techniques are used to break the PDG structural data into small instances and remove noisy words [15]. Then, the TFIDF technique is used to weight each PDG feature for code authorship [16]. As different programmers may code in different numbers of lines according to programming language syntax, this may result in a class imbalance problem. We used the SMOTE method to obtain balanced classes from the TFIDF corpus. Further, these features are used as input to the designed deep learning model [17]. The proposed research tries to answer the following questions:
1) How to learn different types of source codes for authorship attribution, and how to identify authors for different types of source codes?
2) How to use an algorithm efficiently for source code authorship attribution that is not specific to one programming language?
The proposed approach develops a learning procedure to efficiently generate PDG features from different programming codes. The main contributions of the proposed approach are:
• We extract PDG (data & control dependencies) features to analyze the control flow and data variation in source codes for each author, i.e., C++, Java, C#
• The TFIDF weighting technique is configured to zoom the importance of each PDG feature
• Source code authorship attribution across programming languages using PDG analysis and a deep learning model
The remainder of the paper is organized as follows: Section 2 contains the related work with state-of-the-art discussion, Section 3 contains the proposed methodology, the experimental details are given in Section 4, and Section 5 includes the conclusion with future directions.

II. RELATED WORK
SCAA depends heavily on features extracted from source codes. Every author has a unique coding style, and some proposed techniques can identify authors based on their styles. This domain is called stylometry [18]. Linguistic stylometry is widely used for security and privacy problems. It has been applied to categorize unknown bloggers on large-scale datasets to expose privacy concerns [19]. Stylometry is also used in forensic departments to expose cyber-forums. It is more challenging to identify authors from a mixture of languages with personal writing styles. Not only are specific authors identified, but their links with others or in the forums are also exposed in stylometry analysis [20]. Source code stylometry analysis can be used in source code authorship attribution and plagiarism detection. Simple byte-level [21] and n-gram [22] features are used by machine learning to predict the real author of source code. Structural features can be obtained from the Abstract Syntax Tree (AST) of source codes. Lexical information is merged with n-gram features to build developer profiles. These profiles are then used to identify 12 writers with an accuracy of 76% [23]. A genetic algorithm is combined with lexical features to classify 20 authors with an accuracy of 75% [24]. Similarly, the AST features are extracted from programming


structures which are based on coding styles. These features are extracted to attribute authors with an accuracy of 94% for 1600 programmers using the GCJ dataset [9]. An information retrieval technique for programming code authorship attribution is investigated for C source code. The C source codes, which include 1,597 programming assignments, are converted into a proper retrieval system. The author classification accuracy is 76.78% [25]. In [26], the Source Code Author Profiles (SCAP) approach is used to extract coding style features from byte-level n-grams. It is shown that n-gram features also capture writing styles, and they are already used in natural language text analysis for identifying the author. Further, the idea is used for different programming languages such as Java or C++ and achieved better accuracy. In [28], two machine learning techniques are presented to de-anonymize source code authors. The first algorithm works on supervised learning combined with a Support Vector Machine (SVM), and the second method is based on clustering to merge authors with the same programming styles. Further, they used a distance similarity metric to classify features relevant to each programmer's style based on the GCJ dataset. Recently, hackers leave malware on some websites. In [29], structural features are extracted using the AST to predict the programming style of authors. Further, n-gram features are extracted to identify JavaScript programmers over the web. Deep learning with a Recurrent Neural Network (RNN) is combined with TFIDF features to solve the multi-class problem of authors. The Random Forest (RF) is merged with the proposed approach to de-anonymize authors on a large-scale dataset [30]. The proposed research gave 93.42% accuracy for 120 authors collected from the GCJ dataset. A hybrid approach of a Back Propagation (BP) neural network with Particle Swarm Optimization (PSO) is used to identify the specific author of source code. The lexical information of source codes is used as input to the proposed hybrid approach. The method is investigated on 3,022 Java files from 40 authors and achieved an accuracy of 91.060% [31]. Many extracted features are not related to the authors' coding style, which affects the accuracy. The Onion approach for Binary Authorship (OBA) attribution works on different layers, i.e., preprocessing, syntax and semantic-based attribution analysis [32].
Stylometry plays an important role in malware attribution. It captures coding style patterns from source codes. It is a high-demand area that will need automatic packer and encryption detection with binary segment analysis. It is a vital issue to uncover the hidden descriptions of malware samples. Currently, researchers mainly focus on dynamic analysis rather than static analysis due to the widespread use of obfuscation techniques. The research in [33] described different static features of malware that can offer reliable relationships between executable malware binaries programmed by the same writer. Some of these features are relevant to the malware, i.e., control and command arrangement and data filtration techniques. In [34], both static and dynamic analysis of malware samples is performed for authorship features. The genetic property is analyzed during the reverse engineering process to get essential information about the executable program. Then, the ancestry information of malware is evaluated in the transformation phase. The lineage of malware describes the derivations of malware samples from each other.

III. PROPOSED METHODOLOGY: PROGRAM DEPENDENCY GRAPH WITH DEEP LEARNING
We designed a hybrid approach based on PDG and a deep learning model for the identification of source code authors, as shown in Figure 1.

A. PROGRAM DEPENDENCE GRAPH (PDG)
The PDG is a graphical representation of source code. Programming expressions, variables, conditions and method calls can be represented as vertices. The edges show data and control dependencies among vertices in the graph. The PDG G is generated using four elements for a procedure P, i.e., G = (V, E, µ, δ)
• V is the set of vertices contained in P
• E ⊆ V×V is the set of edges, which carry data or control dependencies among V
• µ: V → S is a function that assigns types to program vertices, i.e., variables, statements, conditions, method calls
• δ: E → T is a function that assigns a dependency type to edges, i.e., data or control dependency
A data dependency edge can be generated between v1 and v2 if there is a variable var which affects the execution of the program.
• v1 may be passed to var directly or indirectly using pointers
• v2 may execute the given value of var using pointers
So, changing the value of var may affect the execution of the source code and produce a different output. The PDG data dependency feature may be used to catch this type of source code plagiarism. Further, control dependency shows the internal logic flow of source code.
• A control dependency edge may be generated from v1 to v2 if there is a truth condition which controls the execution of v1 from v2.
First, we extract PDG (data & control) features from different programming codes, i.e., C++, Java, C#, as shown in Figure 4. These are high-quality features which capture data variation and control flow. They represent how data flows between different statements and how control transfers among different statements. These are significant features which may show the hidden patterns of different programming codes.

B. PREPROCESSING
To identify which programmer has written which code, we used preprocessing techniques to transform PDG features into small instances without noisy data. It breaks the PDG into tokens and then calculates the


frequency of each token. The preprocessing steps include cleaning, instance selection, and transformation. The data cleaning method is used to remove unwanted data such as special symbols, numbers, stop words, and punctuation. We do not need such noisy information in source code feature classification. Then, the transformation procedure is used to decompose source codes into useful features. The stemming, stop word, and frequency parameters are used to extract valuable features in the transformation phase. Stemming is used to reduce a group of words to its root form. Frequency information indicates the number of occurrences of each element in different source codes [35].

FIGURE 1. Code authorship attribution using program dependence graph and deep learning.

C. TFIDF FEATURES' WEIGHTING
The preprocessed PDG features contain cleaned information with frequency details. To get better classification accuracy, we need to convert PDG instances into weighted features. The weights are used to zoom the importance of each feature in a single document as well as across multiple documents. Local and global weighting techniques are used to retrieve the weights of each feature. For example, suppose we are comparing three source code documents in C++, Java, and C#. Here, the local weight extracts the significance of each feature contained in one single document, i.e., C++ or Java or C#, whereas the global weight captures the significance of each feature across all documents. These weights are useful to score and rank each feature for each programming style. This process indicates to a classifier which features are more valuable for accurate predictions. These stylistic features are further used to predict a unique author. We used the TFIDF feature for the global weight and the Logarithm of Term Frequency (LogTF) for the local weight [38], [39]. The term frequency is the number of occurrences of each token, as shown in (1).

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \tag{1}$$

where tf denotes the term frequency and d denotes a single document. The inverse document frequency is given in (2).

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \tag{2}$$

where t represents a term, d represents a single document, D represents all documents and N represents the total number of documents. Mathematically, TFIDF is defined as in (3).

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) \tag{3}$$

where t represents a term, f represents frequency, d represents each single document and D represents the whole collection of documents contained in the corpus.

D. SYNTHETIC MINORITY OVER-SAMPLING TECHNIQUE (SMOTE)
Oversampling can give better accuracy than undersampling for the class imbalance problem. The SMOTE technique can provide better results in dealing with class


minority problem. It synthesizes new minority class samples based on their similarity with the original minority class samples [37], [40]. SMOTE works in the following steps:
1) Calculate the k-nearest neighbors for each minority class sample $x_i \in S_{min}$ using Euclidean distance.
2) Choose a random neighbor $x_j$ from the k-nearest neighbors of $x_i$.
3) Produce a new sample on the basis of (4).

$$x_{new} = x_i + (x_j - x_i) \times \delta \tag{4}$$

where $\delta \in [0,1]$ is a random factor which controls the placement of the newly generated samples. We face the class imbalance problem while training features. First, the dataset contains a different number of C++, Java, and C# classes. Secondly, each programmer may type a different number of lines of code. A deep learning algorithm has trouble learning when one class dominates the others, which greatly affects the classification accuracy. We use the SMOTE method to oversample minority classes and overcome the class imbalance problem.

E. DEEP LEARNING MODEL
TensorFlow is an open-source library which is significant for diverse deep learning programming tasks. There is a strong need in the research industry to experiment with machine and deep learning algorithms online. The user can configure different hidden layers using the TensorFlow library to program high-level computations, train on the dataset, and track and share the state of each operation with mutation details. The queue feature is used to compute the corresponding tensor asynchronously. The queue feature functions like a multi-threading process: it can run the operations in parallel to speed them up [17], [36], [40]. We designed the deep PDGDL methodology to identify the corresponding authors for each type of source code. The deep learning model can be trained using the high-level Keras API, which is easy to configure, allows different modules to be extended, and supports fast prototyping [41]. The normalized dataset with local and global weighting values is the input to the deep learning algorithm. TensorFlow preprocesses the normalized data and queues it for the training phase, as shown in Figure 2.

FIGURE 2. TensorFlow data flow graph for training features by queuing and preprocessing.

The queue manages the multiple threads, and the forward-back process is used to organize and manage the training. It is a loop-like procedure to get the best-trained data using the fine-tuned configuration. It has a queue process which runs the features for the next phase in a parallel-like procedure.

1) MODEL DESIGN
We have configured seven layers to train the features, with 100, 80, 80, 60, 60, 40 neurons, respectively. The 7th layer is configured for the output variable, i.e., programmers. The first is the input layer, then five are hidden layers, and the last is the output layer. The ReLU activation function is used in the input and hidden layers. The softmax function is used for the target variable. The dropout layer is used to fine-tune the deep learning algorithm and reduce overfitting. There are 750 parameters trained on layer 1, 15100 parameters on layer 2, 5050 on layer 3 and 5049 on layer 4. In total, 25,949 parameters are trained for the designed experiment. For better accuracy, the deep learning algorithm is optimized with a fine-tuned configuration of the dropout layer, activation and loss functions, optimizer method, and learning error rate. The softmax activation function, also called softargmax or the normalized exponential function, is used in the output layer to handle multi-class problems [42]. It takes a vector of K real numbers and transforms it into a normalized probability distribution of K probabilities. The output is proportional to the exponentials of the K input numbers. Some inputs may not form a proper distribution. Softmax is applied to convert the K numbers into the range [0,1]. It is often used in multi-class neural networks to convert the non-normalized output to a probability distribution over predicted classes. The standard softmax function $\sigma : \mathbb{R}^K \to \mathbb{R}^K$ can be defined using (5).

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \ldots, K \text{ and } z = (z_1, \ldots, z_K) \in \mathbb{R}^K \tag{5}$$

We apply the standard exponential function to each element $z_i$ of the input vector z and normalize these values by dividing by the sum of all the exponentials. The main goal of the training process is to learn enough about the dataset structure to make accurate predictions on unseen data. The rectifier (ReLU) activation function is used in the input and hidden layers of the deep neural network [43]. It is also called the Rectified Linear Unit (ReLU). Mathematically, it is defined as the positive part of its argument, as shown in (6).

$$f(x) = x^{+} = \max(0, x) \tag{6}$$

where x represents the input to the corresponding neuron. This is also known as the ramp function, a unary real function whose graph is shaped like a ramp. Now, it is the most


popular activation function for deep neural networks. The entropy function is applied to compute the loss of each instance within the deep learning model. It takes a tensor as input and produces a tensor with a similar profile as output.

2) MODEL TRAINING
Training is the next stage of a deep learning algorithm, in which the model is gradually optimized and learns the given dataset. The main goal of training is to learn enough about the structure of the dataset to enable the model to make accurate predictions on an unseen corpus. The optimization and loss functions contribute to training the designed model. The Adam optimizer, a variant of stochastic gradient descent, is applied to compile and optimize the deep learning model. It uses an iterative technique to update the network weights. It computes individual adaptive learning rates for each parameter in the deep learning network [44], [45]. The decaying averages of past gradients and past squared gradients are shown in (7) and (8).

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{7}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{8}$$

where $m_t$ and $v_t$ are estimates of the first and second moments of the gradients, respectively, and $g_t$ denotes the gradient at step t. The bias of these estimates is counteracted by calculating bias-corrected first and second moment estimates using (9) and (10).

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \tag{9}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{10}$$

The loss and accuracy functions are used to evaluate the predicted class probabilities. The target values are one-hot encoded, so the loss is lowest when the model output is very close to 1 for the right category and very close to 0 for the other categories. We used the categorical loss function, which is mathematically defined in (11).

$$\mathrm{Loss} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \tag{11}$$

where M is the number of predicted classes, $y_{o,c}$ indicates whether c is the correct classification for observation o, and $p_{o,c}$ is the predicted probability of observation o in class c. Algorithm 1 shows the implementation of the proposed methodology.

Algorithm 1 PDG Based Deep Learning (PDGDL) for Different Programming Code Authorship
Input: list<author> folder[] = directory    // Codes
dataset[] = null
foreach (author in folder)
{
    foreach (question in author)
    {
        PDG[] = getPDG(V, E, µ, δ)    // PDG generation
    }
}
foreach (PDG)    // Preprocessing
{
    terms[] = getFreq(ques, args{rmNoise = T})
    termLocalWeight = getLocalWeight(terms)    // Equ. 1
    termGlobalWeight = getGlobalWeight(terms)    // Equ. 3
    for (int i = 0; i < terms.size(); i++)
    {
        List<term, weight> termWeight = null
        weight = termLocalWeight[i] + termGlobalWeight[i]
        termWeight.add(term[i], weight)
    }
    question.add(termWeight)
    dataset(question[], author)
}
train, test = randomSplitRatio(dataset, 80, 20)
train = SMOTE(train)    // class balancing
one-hotEncoding(train)
one-hotEncoding(test)
Tensorflow model = (model_type = 'Sequential', layer_type = 'Dense, Dropout', activation = 'relu', epochs = '20000')    // Model design
model.compile(loss = categorical_crossEntropy, optimizer = adam, learning_error_rate)    // Model compile
authorship_attribution(train, test)    // Model evaluation
Output: Code_authorship_attribution

IV. EXPERIMENTS
Programming code authorship attribution is a critical task which is used to uncover the hidden patterns in source codes and predict features from coding styles.

FIGURE 3. Amount of training data required for identification of 1000 programmers.

A. DATASET
The GCJ is a programming competition hosted and administered by Google every year. Thousands of programmers around the globe participate in this activity. The competition consists of a set of algorithmic problems which must be solved in a fixed amount of time. Each programmer may use any type of programming language to solve the given


problems. The dataset is collected from GCJ and contains the source codes of 1000 programmers in three different programming languages, i.e., C++, Java, C#. We took source code data from the 2017 corpus. The amount of data required for training the source code features is shown in Figure 3. The source code types are given on the x-axis, and their percentage distribution is given on the y-axis. A total of 1000 programmers are analyzed in the experiment, of which 53.96435% is contributed by C++, 32.36987% by Java and 13.66578% by C#. On the right y-axis, the cumulative frequency for each source code type is given.

FIGURE 4. Program dependence graph (data & control dependencies) for C++, Java and C#.

B. EVALUATION METRICS
The proposed methodology is evaluated on commonly used metrics, i.e., precision, recall, f-measure, and accuracy. The numbers of True


TABLE 1. Normalized input and output variables with minimum, maximum, mean and median values.

Positives (TPs) and False Positives (FPs) represent the numbers of source code samples that are correctly and incorrectly attributed to an author, respectively. Similarly, the numbers of True Negatives (TNs) and False Negatives (FNs) represent the numbers of source code samples that are correctly and incorrectly rejected for an author, respectively. The overall classification performance is evaluated using the accuracy metric, which is equal to the number of correctly classified instances divided by the total number of instances. The evaluation metrics are defined in equations (12), (13), (14), and (15).
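As a minimal sketch of how the four metrics follow from the confusion counts, the Python snippet below computes precision, recall, accuracy and f-measure for a single author in a one-vs-rest view. The function name `classification_metrics` and the sample counts are illustrative assumptions and do not come from the experiments; recall is taken in its standard form TP / (TP + FN).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, accuracy and f-measure from confusion counts.

    Hypothetical helper for illustration only; zero denominators yield 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Assumed counts for one author (not taken from the paper's results).
scores = classification_metrics(tp=80, fp=20, tn=880, fn=20)
print(scores)
```

For a multi-class problem such as authorship attribution, these per-author scores would typically be averaged over all authors; which averaging scheme applies here is not stated in the text.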
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{12} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{13} \]
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{14} \]
\[ \text{F-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{15} \]

FIGURE 5. TFIDF features before SMOTE (green: C++, black: Java, red: C#).

C. RESULTS ANALYSIS
Initially, we have raw source codes that contain unimportant, noisy data. We extracted PDG (data and control) features from the C++, Java, and C# source codes, as shown in Figure 4. A PDG example is shown for a program that finds the maximum of two numbers. The data and control dependencies for C++, Java, and C# are presented in (a), (b), (c), (d), (e), and (f), respectively. We are only interested in capturing the control flow and data variation in source codes. Source code may be changed to a different programming structure, e.g., translated into another programming language, with a method or variable renamed, or with a conditional statement changed. PDG may retrieve high-quality features from different programming codes to catch such types of plagiarism.

Next, preprocessing techniques are used to get numerical values from PDG features for classification. The data is normalized before further experiments, as shown in Table 1. The dataset contains a total of five variables, of which four are inputs and the fifth is the target variable. The minimum, maximum, mean, and median of each variable after normalization are calculated. V1, V2, V3, and V4 represent programming questions 1, 2, 3, and 4, respectively. Each programmer has attempted four different programming questions. The Qu (quartile) values represent the data distribution for each variable. V1 has larger minimum and maximum values than the other programming variables.

The TFIDF weighting feature is used to present the importance of each PDG feature, as shown in Figure 5. The green, black, and red dots show the C++, Java, and C# data variables, respectively. There are more green dots than black dots, and more black dots than red dots. So, C++ is the dominant class over the other two classes, i.e., Java and C#. Similarly, the Java class is dominant over the C# class. For this reason, we face a class imbalance problem when training on the features of each class. The SMOTE technique is applied to solve the class imbalance problem, as shown in Figure 6. The same colors are used for each class. It converts the weighting values into different features with balanced synthetic observations of each class. The data of each class is clearly visible after applying SMOTE. The black dots approach the green dots in number, which means that the Java class is balanced against the C++ class. Similarly, the red dots are also more visible than before.

The SMOTE features are used as input to the deep learning model for better accuracy. The dynamic graph for accuracy without SMOTE and fine-tune configuration is shown in Figure 7. Classification accuracy and the epoch are given on the y-axis and x-axis, respectively. The blue curve shows train data, and the orange curve shows test data over 1000 epochs. Both curves start from 0.68 and then grow to 0.78. After that, accuracy stays more or less constant across epochs. Because of misclassification caused by the class imbalance problem, the overall accuracy is 78%. Similarly, the dynamic loss without fine-tune configuration and SMOTE is shown in Figure 8. The orange curve shows test data, and the blue curve shows train data over 1000 epochs. Both curves start at 0.65 and

141994 VOLUME 7, 2019

F. Ullah et al.: SCAA Using Hybrid Approach of Program Dependence Graph and Deep Learning Model

FIGURE 6. TFIDF features after SMOTE (green: C++, black: Java, red: C#).

FIGURE 7. Dynamic graph for Accuracy without fine-tune and SMOTE.

FIGURE 8. Dynamic graph for Loss without fine-tune and SMOTE.

FIGURE 9. Dynamic graph for Accuracy with fine-tune and SMOTE.

then decrease to 0.40. After that, the loss stays more or less constant along the x-axis. The overall loss is 40%. We got poor results due to class imbalance, overfitting, and misclassification problems. Further, we resolve these problems with SMOTE and fine-tune configuration. Fine-tune configuration includes the dropout layer, the number of neurons, and the learning error rate. The dropout layer is configured with each dense layer using the ReLU activation function. It ignores some units in the training phase that would otherwise contribute to overfitting. It improves generalization and forces the layer to learn the same concept with different neurons. After applying these settings, the dynamic accuracy is shown in Figure 9. Both curves start from 0.70 and then jump to 0.97 at epoch 100. After that, accuracy grows further to 0.99 and then stays more or less constant. We got better results after applying these methods. The overall accuracy of the fine-tuned model is 99%, which is considerably higher than before. The dynamic graph of loss values for training and testing data after SMOTE and fine-tune configuration is shown in Figure 10. The overall loss value is 0.028, which is also much lower than before.

FIGURE 10. Dynamic graph for Loss with fine-tune and SMOTE.

The precision, recall, and F-measure curves for source code authorship attribution are shown in Figure 11. The performance metrics are given vertically while classes of source codes are shown horizontally. Moreover, the proposed approach is evaluated based on the confusion matrix before SMOTE, as shown in Figure 12. The classification and misclassification rates for C++, Java, and C# are shown in percentages. C# has a 64% classification rate, which is the lowest compared to Java and C++. Similarly, Java has a 72% classification rate, which is lower than that of C++.
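To make the per-class classification rate concrete, the sketch below derives it from a confusion matrix as the diagonal count divided by the row total. The matrix values and the helper name are invented for illustration and are not the counts behind Figure 12; rows are assumed to be true classes and columns predicted classes.

```python
def per_class_rates(matrix, labels):
    """Return the fraction of each true class that was predicted correctly."""
    rates = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        rates[label] = matrix[i][i] / row_total if row_total else 0.0
    return rates


labels = ["C++", "Java", "C#"]
# Invented, imbalanced counts: rows are true classes, columns are predictions.
confusion = [
    [50, 3, 2],   # 55 C++ samples
    [8, 30, 2],   # 40 Java samples
    [6, 4, 15],   # 25 C# samples
]

rates = per_class_rates(confusion, labels)
correct = sum(confusion[i][i] for i in range(len(labels)))
overall_accuracy = correct / sum(sum(row) for row in confusion)

for label in labels:
    print(f"{label}: {rates[label]:.1%}")
print(f"overall accuracy: {overall_accuracy:.1%}")
```

In this made-up example the smaller classes lose more samples to the largest class, which is the pattern the text attributes to class imbalance.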


FIGURE 11. Performance comparison of Precision, Recall and F measure.

These lower rates are due to the class imbalance among the three classes. For example, C++ has the highest classification rate because it has the largest number of samples. The classifier learns the features of the largest class more than those of the others during training. As a result, it affects the overall classification accuracy.

FIGURE 12. Confusion Matrix before SMOTE and Fine-Tune.

The confusion matrix with SMOTE and fine-tune configuration is shown in Figure 13. The proposed model learned the largest class more in training than the smaller classes. We used SMOTE and fine-tune configuration to solve the class imbalance and misclassification problems, as shown in Figure 13. These methods boost classification rates and accuracy. C++, Java, and C# have 100%, 98%, and 98% classification rates, respectively.
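The core idea of SMOTE, generating a synthetic minority sample on the line segment between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration, not the configuration used in the experiments: the function name, the value of `k`, the seed, and the two-dimensional feature vectors are all assumptions.

```python
import random


def smote_oversample(minority, n_synthetic, k=3, seed=0):
    """Create synthetic samples by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical normalized feature vectors for an under-represented class.
minority = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.25], [0.30, 0.30]]
new_points = smote_oversample(minority, n_synthetic=6)
print(len(minority) + len(new_points), "samples after oversampling")
```

Because every synthetic point lies between two existing minority samples, the new observations stay inside the region already occupied by that class instead of duplicating existing points.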

FIGURE 13. Confusion Matrix after SMOTE and Fine-Tune.

D. DISCUSSIONS
The proposed work is compared with the existing related research, as shown in Table 2. In [27], the authors used the SVM technique to predict programmers from C++ source codes. First, the dataset contained 20 programmers with an accuracy of 77%, and then 100 programmers with an accuracy of 61%. This suggests that SVM accuracy decreases as the number of programmers increases. Similarly, in [9], the authors used two techniques (SVM and Random Forest) for C++ source codes. The SVM technique gave an accuracy of 90% for 20 programmers. Further, the Random Forest predicted 100 programmers with an accuracy of 96%. All these state-of-the-art techniques were tested on the same type of source code, i.e., C++, with a different number of programmers. We have collected a dataset from GCJ for 1000 programmers with C++, Java, and C# source codes. We have evaluated our dataset with the state-of-the-art techniques and also with our proposed deep learning approach. SVM achieved 64%, Random Forest 68%, and J48 73%, while our proposed approach achieved 99% accuracy. Our dataset contains three different types of source codes, but the proposed approach still outperforms the others. Further, the proposed approach is compared with other works based on precision, recall, and f-measure metrics, as shown in Table 3. We used the same dataset with C++, Java, and C# classes for

TABLE 2. Comparison of proposed work with other methods based on classification accuracy.


TABLE 3. Comparison of proposed work with other methods based on Precision, Recall and F-measure metrics.

previous algorithms to extensively investigate these metrics. RF, SVM, KNN, J48, CNN, and MLP are used in the comparisons. Multilayer Perceptron (MLP) provides good results for precision and f-measure but slightly lower results for recall. CNN performs well, and SVM provides the lowest precision, recall, and f-measure values compared to the others. Our proposed approach outperforms all the algorithms used in terms of these metrics.

V. CONCLUSION
Every programmer may write the same source code in a different coding style, e.g., with different control logic flow or different names for variables or methods. We need an intelligent way to identify such fingerprints. The PDG features may be used to extract hidden patterns regarding control flow logic and data variations in different programming codes. These PDG features are further used as input to the deep learning model to capture coding styles for identification of programmers. We designed the model in the TensorFlow framework using the Keras API to predict authors of different types of source codes. The source codes contain a massive amount of raw data that is not important for high-quality features. We extract PDG (control and data) dependencies for C++, Java, and C# source codes. These high-quality features are further preprocessed using stemming, stop words, and minimum and maximum global frequency parameters to transform them into a useful feature matrix. It is a decomposed matrix that contains features from each type of source code with frequency details. Further, local and global term weighting techniques are used to show the importance of each PDG feature. The logarithm of the term frequency is used for local weighting and TFIDF for global weighting values. This computes the weighting values for all PDG features, which are further fed to the deep learning model. First, the experiment is run without fine-tuning configuration and SMOTE for 1000 epochs. Then, fine-tune setup and SMOTE are applied to solve the class imbalance and overfitting issues and to get better accuracy. The number of neurons in each hidden layer, the dropout layer, the learning error rate, and the loss function parameters are configured to fine-tune the deep learning model. The proposed research is compared with other state-of-the-art methods in terms of classification accuracy, precision, recall, and f-measure values. The experimental results show that the proposed approach outperforms them in identifying the real author of source code. The findings of our detailed analysis could:
• Help to improve algorithms such as automatic authorship attribution as well as plagiarism detection.
• Assist forensic experts or linguists to create profiles of writers.
• Support intelligence applications to analyze aggressive and threatening messages.
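The local and global weighting step summarized in this section can be illustrated with a small sketch. It assumes a local weight of log(1 + tf) and a global weight of log(N / df), which are common textbook forms and may differ from the exact definitions used in the experiments; the function name and the feature tokens are hypothetical.

```python
import math
from collections import Counter


def log_tf_idf(documents):
    """Weight each term by log(1 + tf) locally and log(N / df) globally."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: math.log(1 + tf) * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        })
    return weights


# Hypothetical PDG feature tokens extracted from three source files.
documents = [
    ["ctrl:if", "data:max", "data:max"],
    ["ctrl:if", "ctrl:loop"],
    ["ctrl:if", "data:sum"],
]
weights = log_tf_idf(documents)
for i, doc_weights in enumerate(weights):
    print(i, {term: round(w, 3) for term, w in doc_weights.items()})
```

A feature that occurs in every file, such as `ctrl:if` here, receives a weight of zero, while a feature repeated within a single file is weighted most heavily.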


The proposed research gives better prediction accuracy for different types of source code authorship attribution; however, it still has some issues. The growing size of source code may increase the complexity cost. If the size of a method increases, the PDG may take more time in the extraction process, as it works level by level in a graph structure. Also, it may be useful for accuracy and other classification metrics.

REFERENCES
[1] M. Abuhamad, J.-s. Rhim, T. AbuHmed, S. Ullah, S. Kang, and D. Nyang, "Code authorship identification using convolutional neural networks," Future Gener. Comput. Syst., vol. 95, pp. 104–115, Jun. 2019.
[2] C. Zhang, S. Wang, J. Wu, and Z. Niu, "Authorship identification of source codes," in Proc. Asia–Pacific Web (APWeb) Web-Age Inf. Manage. (WAIM) Joint Conf. Web Big Data. Beijing, China: Springer, 2017, pp. 282–296.
[3] D. Bodeau and R. Graubart. (2016). Cyber Resilience Metrics: Key Observations. The MITRE Corporation. [Online]. Available: https://www.mitre.org/sites/default/files
[4] X. Meng, "Fine-grained binary code authorship identification," in Proc. 24th SIGSOFT Int. Symp. Found. Softw. Eng., 2016, pp. 1097–1099.
[5] E. Stamatatos, "A survey of modern authorship attribution methods," J. Amer. Soc. Inf. Sci. Technol., vol. 60, no. 3, pp. 538–556, 2009.
[6] G. Frantzeskou, S. G. MacDonell, and E. Stamatatos, "Source code authorship analysis for supporting the Cybercrime investigation process," in Handbook of Research on Computational Forensics, Digital Crime, and Investigation Methods and Solutions. Hershey, PA, USA: IGI Global, 2010, pp. 470–495.
[7] M. F. Tennyson, "A replicated comparative study of source code authorship attribution," in Proc. 3rd Int. Workshop Replication Empirical Softw. Eng. Res., Oct. 2013, pp. 76–83.
[8] O. Mirza and M. Joy, "Style analysis for source code plagiarism detection," in Proc. Plagiarism Across Eur. Beyond Conf., 2015, pp. 53–61.
[9] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, and R. Greenstadt, "De-anonymizing programmers via code stylometry," in Proc. 24th USENIX Secur. Symp. USENIX Secur., 2015, pp. 255–270.
[10] F. Ullah, J. Wang, M. Farhan, S. Jabbar, Z. Wu, and S. Khalid, "Plagiarism detection in students' programming assignments based on semantics: Multimedia e-learning based smart assessment methodology," Multimedia Tools Appl., pp. 1–18, 2018. doi: 10.1007/s11042-018-5827-6.
[11] S. G. Macdonell, A. R. Gray, G. MacLennan, and P. J. Sallis, "Software forensics for discriminating between program authors using case-based reasoning, feedforward neural networks and multiple discriminant analysis," in Proc. 6th Int. Conf. Neural Inf. Process., Nov. 1999, pp. 66–71.
[12] S. Burrows, A. L. Uitdenbogerd, and A. Turpin, "Temporally robust software features for authorship attribution," in Proc. 33rd Annu. IEEE Int. Comput. Softw. Appl. Conf., 2009, pp. 599–606.
[13] S. Burrows, A. L. Uitdenbogerd, and A. Turpin, "Comparing techniques for authorship attribution of source code," Softw. Pract. Exper., vol. 44, no. 1, pp. 1–32, 2014.
[14] H. Ding and M. H. Samadzadeh, "Extraction of Java program fingerprints for software authorship identification," J. Syst. Softw., vol. 72, no. 1, pp. 49–57, 2004.
[15] H. A. Basit and S. Jarzabek, "Efficient token based clone detection with flexible tokenization," in Proc. 6th Joint Meeting Eur. Softw. Eng. Conf. SIGSOFT Symp. Found. Softw. Eng., 2007, pp. 513–516.
[16] S. Haiduc, J. Aponte, L. Moreno, and A. Marcus, "On the use of automated text summarization techniques for summarizing source code," in Proc. 17th Work. Conf. Reverse Eng., Oct. 2010, pp. 35–44.
[17] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Operating Syst. Design Implement. (OSDI), 2016, pp. 265–283.
[18] T. Neal, K. Sundararajan, A. Fatima, Y. Yan, Y. Xiang, and D. Woodard, "Surveying stylometry techniques and applications," Comput. Surv. (CSUR), vol. 50, no. 6, p. 86, 2018.
[19] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, E. C. R. Shin, and D. Song, "On the feasibility of Internet-scale author identification," in Proc. IEEE Symp. Secur. Privacy, May 2012, pp. 300–314.
[20] S. Afroz, A. C. Islam, A. Stolerman, R. Greenstadt, and D. McCoy, "Doppelgänger Finder: Taking stylometry to the underground," in Proc. IEEE Symp. Secur. Privacy, May 2014, pp. 212–226.
[21] G. Frantzeskou, E. Stamatatos, S. Gritzalis, and S. Katsikas, "Effective identification of source code authors using byte-level information," in Proc. 28th Int. Conf. Softw. Eng., 2006, pp. 893–896.
[22] S. Burrows and S. M. Tahaghoghi, "Source code authorship attribution using n-grams," in Proc. 12th Australas. Document Comput. Symp., Melbourne, VIC, Australia: RMIT University, 2007, pp. 32–39.
[23] J. Kothari, M. Shevertalov, E. Stehle, and S. Mancoridis, "A probabilistic approach to source code authorship identification," in Proc. 4th Int. Conf. Inf. Technol. (ITNG), Apr. 2007, pp. 243–248.
[24] R. C. Lange and S. Mancoridis, "Using code metric histograms and genetic algorithms to perform author identification for software forensics," in Proc. 9th Annu. Conf. Genetic Evol. Comput., 2007, pp. 2082–2089.
[25] S. Burrows, A. L. Uitdenbogerd, and A. Turpin, "Application of information retrieval techniques for source code authorship attribution," in Proc. Int. Conf. Database Syst. Adv. Appl., Springer, 2009, pp. 699–713.
[26] G. Frantzeskou, E. Stamatatos, S. Gritzalis, C. E. Chaski, and B. S. Howald, "Identifying authorship by byte-level N-Grams: The source code author profile (SCAP) method," Int. J. Digit. Evidence, vol. 6, no. 1, pp. 1–18, 2007.
[27] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," J. Mach. Learn. Res., vol. 9, pp. 1871–1874, Aug. 2008.
[28] W. Wisse and C. Veenman, "Scripting DNA: Identifying the JavaScript programmer," Digit. Invest., vol. 15, pp. 61–71, Dec. 2015.
[29] M. Abuhamad, T. AbuHmed, A. Mohaisen, and D. Nyang, "Large-scale and language-oblivious code authorship identification," in Proc. SIGSAC Conf. Comput. Commun. Secur., 2018, pp. 101–114.
[30] X. Yang, G. Xu, Q. Li, Y. Guo, and M. Zhang, "Authorship attribution of source code by using back propagation neural network based on particle swarm optimization," PLoS ONE, vol. 12, no. 11, 2017, Art. no. e0187204.
[31] S. Alrabaee, N. Saleem, S. Preda, L. Wang, and M. Debbabi, "OBA2: An onion approach to binary code authorship attribution," Digit. Invest., vol. 11, pp. S94–S103, May 2014.
[32] M. Marquis-Boire and M. C. M. Guarnieri, Big Game Hunting: The Peculiarities Nation-State Malware Research. Las Vegas, NV, USA, Black Hat, 2015.
[33] A. Pfeffer, C. Call, J. Chamberlain, L. Kellogg, J. Ouellette, T. Patten, G. Zacharias, A. Lakhotia, S. Golconda, J. Bay, R. Hall, and D. Scofield, "Malware analysis and attribution using genetic information," in Proc. 7th Int. Conf. Malicious Unwanted Softw., Oct. 2012, pp. 39–45.
[34] J. H. Paik, "A novel TF-IDF weighting scheme for effective ranking," in Proc. 36th Int. SIGIR Conf. Res. Develop. Inf. Retr., 2013, pp. 343–352.
[35] E. Haddi, X. Liu, and Y. Shi, "The role of text pre-processing in sentiment analysis," Proc. Comput. Sci., vol. 17, pp. 26–32, May 2013.
[36] D. Baylor et al., "Tfx: A tensorflow-based production-scale machine learning platform," in Proc. 23rd SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2017, pp. 1387–1395.
[37] Y. Yan, R. Liu, Z. Ding, X. Du, J. Chen, and Y. Zhang, "A parameter-free cleaning method for SMOTE in imbalanced classification," IEEE Access, vol. 7, pp. 23537–23548, 2019.
[38] A. Gulli and S. Pal, Deep Learning With Keras. Birmingham, U.K.: Packt Publishing, 2017.
[39] S. Elfwing, E. Uchibe, and K. Doya, "Sigmoid-weighted linear units for neural network function approximation in reinforcement learning," Neural Netw., vol. 107, pp. 3–11, Nov. 2018.
[40] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, "Learning activation functions to improve deep neural networks," 2014, arXiv:1412.6830.
[41] A. Tato and R. Nkambou, "Improving adam optimizer," in Proc. ICLR Workshop, Vancouver, BC, Canada, May 2018.
[42] Z. Zhang, "Improved Adam optimizer for deep neural networks," in Proc. IEEE/ACM 26th Int. Symp. Qual. Service (IWQoS), 2018, pp. 1–2.
[43] G. Domeniconi, G. Moro, R. Pasolini, and C. Sartori, "A study on term weighting for text categorization: A novel supervised variant of tf.idf," in Proc. 4th Int. Conf. Data Manage. Technol. Appl., 2015, pp. 26–37.
[44] S. K. Abd-El-Hafiz, "A metrics-based data mining approach for software clone detection," in Proc. IEEE 36th Annu. Comput. Softw. Appl. Conf., Jul. 2012, pp. 35–41.
[45] D. M. Powers, "Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation," Flinders Univ., Bedford Park, SA, Australia, Tech. Rep. 27165, 2011.


FARHAN ULLAH received the B.S. degree from the University of Peshawar, Pakistan, in 2008, and the M.S. degree from CECOS University Peshawar, Pakistan, in 2012, both in computer science. He is currently pursuing the Ph.D. degree in computer science with the School of Computer Science, Sichuan University, Chengdu, China. His research work has been published in various renowned journals of the IEEE, Springer, Elsevier, Wiley, MDPI, and Hindawi. His research interests include software similarity, information security, and data science. He received the Research Productivity Award from the COMSATS Institute of Information Technology (CIIT), Sahiwal, Pakistan, in 2016.

JUNFENG WANG received the M.S. degree in computer application technology from the Chongqing University of Posts and Telecommunications, Chongqing, in 2001, and the Ph.D. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, in 2004. From July 2004 to August 2006, he held a postdoctoral position with the Institute of Software, Chinese Academy of Sciences. He has been a Professor with the School of Aeronautics and Astronautics, College of Computer Science, Sichuan University, since August 2006. His current research interests include network and information security, spatial information networks, and data mining. He has been invited to serve as an Associate Editor for IEEE ACCESS, the IEEE INTERNET OF THINGS JOURNAL, the Security and Communication Networks, and so on.

SOHAIL JABBAR was an Assistant Professor with the Department of Computer Science, and the Director of Graduate Programs at the Faculty of Sciences, National Textile University, Faisalabad, Pakistan. He was also a Postdoctoral Researcher with Kyungpook National University, Daegu, South Korea. He is currently a Postdoctoral Researcher with the Department of Computing and Mathematics, Manchester Metropolitan University, Manchester, U.K. He is also the Head of the Network Communication and Media Analytics Research Group, National Textile University. He also served as an Assistant Professor with the Department of Computer Science, COMSATS Institute of Information Technology (CIIT), Sahiwal, and also headed the Networks and Communication Research Group at CIIT. He has authored one book, two book chapters, and more than 70 research papers. His research work is published in various renowned journals and magazines of the IEEE, Springer, Elsevier, MDPI, Old City Publication, and Hindawi, and conference proceedings of the IEEE, ACM, and IAENG. He is on collaborative research with renowned research centers and institutes around the globe on various issues in the domains of the Internet of Things, wireless sensor networks, and big data. He has been a Reviewer for leading journals, including the ACM TOSN, JoS, MTAP, AHSWN, and ATECS, and conferences, including the C-CODE 2017, ACM SAC 2016, and ICACT 2016. He is also a TPC member/chair for many conferences. He received many awards and honors from the Higher Education Commission of Pakistan, Bahria University, CIIT, and the Korean Government. Among those awards, the Best Student Research Awards of the Year, the Research Productivity Award, and the BK-21 Plus Post Doctoral Fellowship are few. He received the Research Productivity Award from CIIT, in 2014 and 2015, respectively. He has been engaged in many National and International Level Projects. He is also the Guest Editor of the SIs in Concurrency and Computation Practice and Experience (Wiley), the Future Generation Computer Systems (Elsevier), the Peer-to-Peer Networking and Applications (Springer), the Journal of Information and Processing System (KIPS), the Cyber Physical System (Taylor & Francis), and the IEEE WIRELESS COMMUNICATIONS (IEEE Communication Society).

FADI AL-TURJMAN received the Ph.D. degree in computer science from Queen's University, Kingston, ON, Canada, in 2011. He is currently a Professor with the Artificial Intelligence Department, Near East University, Nicosia, Turkey. He is a Leading Authority in the areas of smart/cognitive, wireless and mobile networks' architectures, protocols, deployments, and performance evaluation. His publication history spans over 200 publications in journals, conferences, patents, books, and book chapters, in addition to numerous keynotes and plenary talks at flagship venues. He has authored/edited more than 20 books about cognition, security, and wireless sensor networks' deployments in smart environments, published by Taylor & Francis and Springer. He has received several recognitions and best papers' awards at top international conferences. He also received the prestigious Best Research Paper Award from the Computer Communications (COMCOM) journal (Elsevier) for the period 2015–2018 and the Top Researcher Award for 2018 at Antalya Bilim University, Turkey. He has led a number of international symposia and workshops in flagship communication society conferences. He serves as the Lead Guest Editor for several well reputed journals, including COMCOM (Elsevier), Sustainable Cities and Society (SCS), IET Wireless Sensor Systems, and the Springer, EURASIP, and MONET journals.

MAMOUN ALAZAB received the Ph.D. degree in computer science from the School of Science, Information Technology and Engineering, Federation University of Australia. He is currently an Associate Professor with the College of Engineering, IT and Environment, Charles Darwin University, Australia. He is also a Cyber Security Researcher and a Practitioner with industry and academic experience. His research is multidisciplinary that focuses on cyber security and digital forensics of computer systems with a focus on cybercrime detection and prevention, including cyber terrorism and cyber warfare. He has more than 100 research articles. He delivered many invited and keynote speeches, 22 events in 2018 alone. He convened and chaired more than 50 conferences and workshops. He works closely with government and industry on many projects, including the Northern Territory (NT) Department of Information and Corporate Services, IBM, Trend Micro, the Australian Federal Police (AFP), the Australian Communications and Media Authority (ACMA), Westpac, the United Nations Office on Drugs and Crime (UNODC), and the Attorney General's Department. He is the Founder and the Chair of the IEEE Northern Territory (NT) Subsection Detection and Prevention.

No ratings yet
A Deep and Scalable Unsupervised Machine Learning System for Cyber-Attack Detection in Large-Scale Smart Grids
11 pages
Transformer-Based Deep Learning Models For The Sentiment Analysis of Social Media Data
No ratings yet
Transformer-Based Deep Learning Models For The Sentiment Analysis of Social Media Data
12 pages
Machine Learning Marking Criteria Portfolio Part 3
No ratings yet
Machine Learning Marking Criteria Portfolio Part 3
1 page
Bcs Higher Education Qualifications BCS Level 5 Diploma in IT
No ratings yet
Bcs Higher Education Qualifications BCS Level 5 Diploma in IT
2 pages
Full Download Machine Learning For Factor Investing R Version Chapman and Hall CRC Financial Mathematics Series 1st Edition Guillaume Coqueret PDF
100% (8)
Full Download Machine Learning For Factor Investing R Version Chapman and Hall CRC Financial Mathematics Series 1st Edition Guillaume Coqueret PDF
62 pages
Unit IV Emerging Trends
No ratings yet
Unit IV Emerging Trends
7 pages
Data Science Masters 2.0: Impact Batch 2.0
No ratings yet
Data Science Masters 2.0: Impact Batch 2.0
11 pages
Plant Leaf Disease Recognition Using Random Forest KNN SVM and CNN
No ratings yet
Plant Leaf Disease Recognition Using Random Forest KNN SVM and CNN
7 pages
Syntax DMC
No ratings yet
Syntax DMC
49 pages
[2020TACL]Eﬃcient Content-Based Sparse Attention With Routing Transformers
No ratings yet
[2020TACL]Eﬃcient Content-Based Sparse Attention With Routing Transformers
24 pages
A Machine Learning Model For Average FuelConsumption in Heavy Vehicles
No ratings yet
A Machine Learning Model For Average FuelConsumption in Heavy Vehicles
9 pages
Final - Roll Call List of BTECH Div A, B, C - 2024-25 Term I - 5 July 24
No ratings yet
Final - Roll Call List of BTECH Div A, B, C - 2024-25 Term I - 5 July 24
85 pages
Lec4 Tree v2.4 1
No ratings yet
Lec4 Tree v2.4 1
54 pages
Mid2 Date Sheet Spring2024 Stu0.3
No ratings yet
Mid2 Date Sheet Spring2024 Stu0.3
12 pages
Cluster Analysis in Python Chapter2 PDF
No ratings yet
Cluster Analysis in Python Chapter2 PDF
30 pages
IDDD Details
No ratings yet
IDDD Details
26 pages
KNN
No ratings yet
KNN
3 pages
9 Hours
100% (1)
9 Hours
2 pages