Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model
Received August 28, 2019, accepted September 21, 2019, date of publication September 25, 2019, date of current version October 10, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2943639
ABSTRACT Source Code Authorship Attribution (SCAA) is the task of finding the real author of source code in a corpus. Although it is a privacy threat to open-source programmers, it can be significantly helpful in developing forensic applications such as ghostwriting detection, copyright dispute settlement, and other code analysis applications. Efficient feature extraction is the key challenge in classifying the real authors of specific source codes. In this paper, the Program Dependence Graph with Deep Learning (PDGDL) methodology is proposed to identify authors from source codes written in different programming languages. First, the PDG is implemented to extract control and data dependencies from source codes. Second, a preprocessing technique is applied to convert PDG features into small instances with frequency details. Third, the Term Frequency Inverse Document Frequency (TFIDF) technique is used to emphasize the importance of each PDG feature in source code. Fourth, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to tackle the class imbalance problem. Finally, a deep learning algorithm is applied to extract coding-style features for each programmer and to attribute the real authors. The deep learning algorithm is further fine-tuned with a dropout layer, learning error rate, loss and activation functions, and dense layers for better accuracy. The proposed work is analyzed on the data of 1000 programmers, collected from Google Code Jam (GCJ). The dataset contains three different programming languages, i.e., C++, Java, and C#. The results outperform the existing techniques in terms of classification accuracy, precision, recall, and f-measure.
INDEX TERMS Code authorship attribution, program dependence graph, deep learning, software forensics
and security, software plagiarism.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
VOLUME 7, 2019 141987
F. Ullah et al.: SCAA Using Hybrid Approach of Program Dependence Graph and Deep Learning Model
applied on malicious code to analyze the remnants left by an author/group and to uncover the source of the attack [4]. Source code authorship attribution mainly depends on the extracted features that an author generates in coding structure and variable naming. It is used to assign programmers to their source codes based on these structures. Further, it is a severe privacy risk for those programmers who want to remain anonymous, e.g., contributors to open-source projects or programmers from higher authorities. On the other side, code authorship attribution plays a significant role in software forensics activities and security analysis, especially for targeting malware authors. Malware authors write malicious software which can compromise the compilation process in a computer system [5]. Code authorship attribution has many applications, such as stylometry analysis [6], software copyright investigation [7], software plagiarism detection [8], and software forensics [9]. Coding style patterns can be used to identify the specific author of source code. They can answer the question, "Which unknown source code document has a style exactly or approximately similar to a known author's style?" In academia, the vital application is software plagiarism detection in students' programming assignments [10]. Software ownership is essential in terms of trade secret safety, copyright breach, and patent rights [11].
Identifying the real author from coding-style features is a big challenge and faces several barriers that can prevent the identification of malicious authors. First, it is challenging to attribute an author because of the continuous growth of his education and programming expertise and his use of specific software engineering paradigms [12]. Second, the author may use a different coding style for different programming languages due to constraints applied by the manager or tools. Third, automated tools are generally used to obfuscate the software, which prevents the recognition of source code style. Recently, several code authorship attribution techniques have been proposed, with key limitations [9], [13]. (i) Mostly, the software features retrieved for author identification are not valid for another type of language. For example, a technique used to extract features from C++ may not be applicable to Java or C#. (ii) The previously proposed work for extracting authorship features is not useful for a large number of programmers. The prediction accuracy decreases for a large set of programmers. (iii) Generally, the large set of features extracted from source codes is not exactly relevant for authorship identification. Further, it requires an extra method for mining and selecting relevant features [14].
The main aim of the proposed approach is to identify the real authors of different types of source code. The feature extraction is designed in such a way that it can be used for any programming language and does not follow any particular programming structure. The PDG is used to extract control flow and data variation features from source codes. Then, preprocessing techniques are used to break the PDG structural data into small instances and remove the noisy words [15]. Then, the TFIDF technique is used to weight each PDG feature for code authorship [16]. As different programmers may code in a different number of lines according to the programming language syntax, this may result in the class imbalance problem. We used the SMOTE method to get balanced classes from the TFIDF corpus. Further, these features are used as input to the designed deep learning model [17]. The proposed research tries to respond to the following queries:
1) How to learn different types of source codes for authorship attribution and how to identify authors for different types of source codes?
2) How to use an algorithm in an efficient way for source code authorship attribution that goes beyond a specific programming language?
The proposed approach develops a learning procedure to efficiently generate PDG features from different programming codes. The main contributions of the proposed approach are:
• We extract PDG (data & control dependencies) features to analyze the control flow and data variation in the source codes of each author, i.e., C++, Java, C#
• The TFIDF weighting technique is configured to emphasize the importance of each PDG feature
• Source code authorship attribution across programming languages using PDG analysis and a deep learning model
The remainder of the paper is organized as follows: Section 2 contains the related work with state-of-the-art discussions, Section 3 contains the proposed methodology, the experimental details are given in Section 4, and Section 5 includes the conclusion with future directions.

II. RELATED WORK
SCAA depends heavily on the features extracted from source codes. Every author has a unique coding style for programming, and there are some proposed techniques that can identify authors based on their styles. This domain is called stylometry [18]. Linguistic stylometry is widely used for security and privacy problems. It is applied to categorize unknown bloggers on large-scale datasets to expose privacy concerns [19]. Stylometry is also used in forensic departments to expose cyber-forums. It is more challenging to identify authors from a mixture of languages with personal writing styles. In stylometry analysis, specific authors are not only identified, but their links with others in the forums are also exposed [20]. Source code stylometry analysis can be used in source code authorship attribution and plagiarism detection. Simple byte-level [21] and n-gram [22] features are used by machine learning to predict the real author of source code. Structural-level features can be obtained from the Abstract Syntax Tree (AST) of source codes. Lexical information is merged with n-gram features to build developer profiles. Then, these profiles are used to identify 12 writers with an accuracy of 76% [23]. A genetic algorithm is combined with lexical features to classify 20 authors with an accuracy of 75% [24]. Similarly, the AST features are extracted from programming
structures which are based on coding styles. These features are extracted to attribute authors with an accuracy of 94% for 1600 programmers using the GCJ dataset [9]. An information retrieval technique for programming code authorship attribution is investigated for C source code. The C source codes, which include 1,597 programming assignments, are converted into a proper retrieval system. The author classification accuracy is 76.78% [25]. In [26], the Source Code Author Profiles (SCAP) approach is used to extract coding-style features from byte-level n-grams. It is shown that n-gram features also capture writing styles, and they are already used in natural language text analysis for identifying the author. Further, the idea is applied to different programming languages such as Java or C++ and achieved better accuracy. In [28], two machine learning techniques are presented to de-anonymize source code authors. The first algorithm is based on supervised learning combined with a Support Vector Machine (SVM), and the second method is based on clustering to merge the authors with the same programming styles. Further, they used a distance similarity metric to classify features relevant to each programmer's style based on the GCJ dataset. Recently, hackers leave malware on some websites. In [29], structural features are extracted using the AST to predict the programming style of authors. Further, n-gram features are extracted to identify JavaScript programmers over the web. Deep learning with a Recurrent Neural Network (RNN) is combined with TFIDF features to solve the multi-class problem of authors. The Random Forest (RF) is merged with the proposed approach to de-anonymize the author on a large-scale dataset [30]. The proposed research gave 93.42% accuracy for 120 authors collected from the GCJ dataset. A hybrid approach of a Back Propagation (BP) neural network with Particle Swarm Optimization (PSO) is used to identify the specific author of source code. The lexical information of source codes is used as input to the proposed hybrid approach. The method is investigated on 3,022 Java files from 40 authors and achieved an accuracy of 91.060% [31]. Many extracted features are not related to the author's coding style, which affects the accuracy. The Onion approach for Binary Authorship (OBA) attribution works on different layers, i.e., preprocessing, syntax-based, and semantic-based attribution analysis [32].
Stylometry plays an important role in malware attribution. It captures coding style patterns from source codes. It is a highly demanded area that will need automatic packer and encryption detection with binary segment analysis. It is a vital issue to uncover the hidden descriptions of malware samples. Currently, researchers mainly focus on dynamic analysis rather than static analysis due to the widespread use of obfuscation techniques. The research in [33] described different static features of malware that can offer reliable relationships between executable malware binaries programmed by the same writer. Some of these features are relevant to the malware, i.e., control and command arrangement and data filtration techniques. In [34], both static and dynamic analyses of malware samples are used to obtain authorship features. The genetic property is analyzed during the reverse engineering process to get essential information about the executable program. Then, the ancestry information of the malware is evaluated in the transformation phase. The lineages of malware describe how the malware samples are derived from each other.

III. PROPOSED METHODOLOGY: PROGRAM DEPENDENCY GRAPH WITH DEEP LEARNING
We designed a hybrid approach based on the PDG and a deep learning model for the identification of source code authors, as shown in Figure 1.
FIGURE 1. Code authorship attribution using program dependence graph and deep learning.

A. PROGRAM DEPENDENCE GRAPH (PDG)
The PDG is a graphical representation of source code. Programming expressions, variables, conditions, and method calls are represented as vertices. The edges show data and control dependencies among the vertices of the graph. The PDG graph G is generated using four elements for a procedure P, i.e., G = (V, E, µ, δ)
• V is the set of vertices contained in P
• E ⊆ V × V is the set of edges that carry data or control dependencies among V
• µ: V → S is a function that assigns types to program vertices, i.e., variables, statements, conditions, method calls
• δ: E → T is a function that assigns a dependency type to edges, i.e., data or control dependency
A data dependency edge in E can be generated between v1 and v2 if there is a variable var which affects the execution of the program.
• v1 may be passed to var directly or indirectly using pointers
• v2 may execute the given value of var using pointers
So, changing the value of var may affect the execution of the source code and produce a different output. The PDG data dependency feature may be used to catch such types of source code plagiarism. Further, control dependency shows the internal logic flow of source code.
• A control dependency edge may be generated from v1 to v2 if there is a truth condition which controls the execution of v1 from v2.
First, we extract PDG (data & control) features from different programming codes, i.e., C++, Java, C#, as shown in Figure 4. These are high-quality features which capture data variations and control flow. They represent how data flows between different statements and how control transfers among different statements. These are significant features which may show the hidden patterns of different programming codes.

B. PREPROCESSING
To identify which programmer has written which code, we used preprocessing techniques to transform PDG features into small instances without noisy data. It breaks the PDG into tokens and then calculates the
frequency of each token. The preprocessing steps include cleaning, instances, selection, and transformation. The data cleaning method is used to remove unwanted data such as special symbols, numbers, stop words, and punctuation. We do not need such noisy information in the classification of source code features. Then, the transformation procedure is used to decompose source codes into useful features. Stemming, stop words, and frequency parameters are used to extract valuable features in the transformation phase. Stemming is used to reduce a group of words to its root form. Frequency information indicates the number of occurrences of each element in different source codes [35].

C. TFIDF FEATURES' WEIGHTING
The preprocessed PDG features contain cleaned information with frequency details. To get better classification accuracy, we need to convert PDG instances into weighted features. These are used to emphasize the importance of each feature in a single document as well as across multiple documents. Local and global weighting techniques are used to retrieve the weights of each feature. For example, suppose we are comparing three documents of source code, C++, Java, and C#. Here, the local weight extracts the significance of each feature contained in one single document, i.e., C++ or Java or C#, whereas the global weight captures the significance of each feature across all documents. These features are useful to score and rank each feature for each programming style. This process indicates to a classifier which features are more valuable for accurate predictions. These stylistic features are further used to predict a unique author. We used the TFIDF feature for the global weight and the Logarithm of Term Frequency (LogTF) for the local weight [38], [39]. The term frequency is the number of occurrences of each token, as shown in (1).
$$ \mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \qquad (1) $$
where tf denotes term frequency and d denotes each single document. The inverse document frequency is given in (2).
$$ \mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \qquad (2) $$
where t represents a term, d represents a single document, D represents all documents, and N represents the total number of documents. Mathematically, TFIDF is defined as in (3).
$$ \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) \qquad (3) $$
where t represents a term, f represents frequency, d represents each single document, and D represents the whole collection of documents contained in the corpus.

D. SYNTHETIC MINORITY OVER-SAMPLING TECHNIQUE (SMOTE)
Oversampling can give better accuracy than undersampling under the class imbalance problem. The SMOTE technique can provide better results in dealing with class
FIGURE 2. TensorFlow data flow graph for training features by queuing and preprocessing.
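As an illustration of the oversampling idea behind SMOTE, the following minimal Python sketch creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. It is a sketch of the general technique only, not the configuration used in this work; the function name `smote`, the parameters `k` and `seed`, and the toy feature vectors are assumptions made for the example.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples from a list of minority feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [s for s in minority if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position on the segment between base and neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class (hypothetical TFIDF-like feature vectors).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_rows = smote(minority, n_new=6)
```

Each synthetic row lies on a line segment between two existing minority samples, so the minority class grows without exact duplication of its members.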
popular activation function for deep neural networks. The entropy function is applied to identify the loss of each instance to accumulate the deep learning functionalities. It takes a tensor as input and returns a tensor with a similar profile as output.

2) MODEL TRAINING
Training is the next stage of a deep learning algorithm, in which the model is gradually optimized and learns the given dataset. The main goal of training is to learn enough about the structure of the dataset. It enables the model to make accurate predictions for an unseen corpus. The optimization and loss functions contribute to training the designed model. The Adam optimizer, which is a stochastic gradient descent method, is applied to compile and optimize the deep learning model. It uses an iterative technique to update the network weights. It computes distinct adaptive learning rates for each parameter in the deep learning network [44], [45]. The decaying averages of past gradients and past squared gradients are shown in (7) and (8).
$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \qquad (7) $$
$$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \qquad (8) $$
where m_t and v_t are the estimates of the first and second moments of the gradients, respectively, and g_t is the gradient at time step t. Adam counteracts the bias of these estimates by calculating bias-corrected first and second moment estimates using (9) and (10).
$$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad (9) $$
$$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \qquad (10) $$
The loss and accuracy functions are used to predict class probabilities. The target values are one-hot encoded, so the loss is best when the model output is very close to 1 for the right category and very close to 0 for the other categories.

Algorithm 1 PDG Based Deep Learning (PDGDL) for Different Programming Code Authorship
Input = list<author> folder[] = directory\\Codes
dataset[] = null
foreach (author in folder)
{
    foreach (question in author)
    {
        PDG[] = getPDG(V, E, µ, δ) \\ PDG generation
    }
}
foreach (PDG) \\ Preprocessing
{
    terms[] = getFreq(ques, args{rmNoise = T})
    termLocalWeight = getLocalWeight(terms) \\ Equ. 1
    termGlobalWeight = getGlobalWeight(terms) \\ Equ. 3
    for (int i = 0; i < terms.size(); i++)
    {
        List<term, weight> termWeight = null
        weight = termLocalWeight[i] + termGlobalWeight[i]
        termWeight.add(term[i], weight)
    }
    question.add(termWeight)
    dataset(question[], author)
}
train, test = randomSplitRatio(dataset, 80, 20)
train = SMOTE(train) \\ class balancing
one-hotEncoding(train)
one-hotEncoding(test)
Tensorflow model = (model_type = 'Sequential', layer_type = 'Dense, Dropout', activation = 'relu', epochs = '20000') // Model design
model.compile(loss = categorical_crossEntropy, optimizer = adam, learning_error_rate) // Model compile
authorship_attribution(train, test) // Model evaluation
Output = Code_authorship_attribution
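The weighting step of Algorithm 1 can be sketched in Python as follows. The functions `tf`, `idf`, and `tfidf` follow (1), (2), and (3); the local weight `log_tf` is assumed here to be log(1 + raw count), since the exact LogTF formula is not given in the text, and the final weight is the sum of the local and global weights as written in Algorithm 1. The token lists are hypothetical PDG tokens used only for illustration.

```python
import math
from collections import Counter

def tf(term, doc):
    # Equation (1): occurrences of the term divided by all tokens in the document.
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Equation (2): log of N over the number of documents containing the term.
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    # Equation (3): global weight.
    return tf(term, doc) * idf(term, docs)

def log_tf(term, doc):
    # Assumed local weight: log(1 + raw count); the paper does not state the formula.
    return math.log(1 + Counter(doc)[term])

def weights(doc, docs):
    # Algorithm 1: weight = local weight + global weight, for every term in doc.
    # doc must be one of docs, so every term occurs in at least one document.
    return {t: log_tf(t, doc) + tfidf(t, doc, docs) for t in set(doc)}

# Hypothetical token lists, one per source file.
docs = [
    ["data", "control", "data", "call"],
    ["control", "call"],
    ["data", "assign"],
]
w = weights(docs[0], docs)
```

In this toy corpus, the token that occurs twice in the first document receives a larger combined weight than a token of equal document frequency that occurs once.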
We used the categorical cross-entropy loss function, which is mathematically defined in (11).
$$ \mathrm{Loss} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \qquad (11) $$
where M is the number of classes, y_{o,c} indicates whether class c is the correct classification for observation o, and p_{o,c} is the predicted probability that observation o belongs to class c. Algorithm 1 shows the implementation of the proposed methodology.
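To make the training equations concrete, the sketch below applies one Adam step following (7) to (10) and evaluates the categorical cross-entropy of (11), with the conventional leading minus sign, for a single one-hot target. The parameter update line is the standard Adam rule, which is not written out in the text, and the hyperparameter values (lr, beta1, beta2, eps) are commonly used defaults assumed for illustration, not the settings of this work.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta at time step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad        # (7) decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # (8) decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # (9) bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # (10) bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # standard Adam update (assumed)
    return theta, m, v

def categorical_cross_entropy(y_true, p_pred):
    # (11) for one observation; y_true is one-hot, p_pred holds class probabilities.
    return -sum(y * math.log(p) for y, p in zip(y_true, p_pred) if y > 0)

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1)

loss_good = categorical_cross_entropy([1, 0, 0], [0.98, 0.01, 0.01])
loss_bad = categorical_cross_entropy([1, 0, 0], [0.10, 0.60, 0.30])
```

The loss is small when the probability assigned to the correct class is close to 1 and grows as that probability falls, which matches the description of the one-hot targets above.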
IV. EXPERIMENTS
Programming code authorship attribution is a critical task which is used to uncover the hidden patterns in source codes and predict features from coding styles.
FIGURE 3. Amount of training data required for identification of 1000 programmers.

A. DATASET
The GCJ is a programming competition hosted and administered by Google every year. Thousands of programmers around the globe participate in this activity. The competition consists of a set of algorithmic problems which must be solved in a fixed amount of time. Each programmer may use any programming language to solve the given problems. The dataset is collected from GCJ and contains the source codes of 1000 programmers in three different programming languages, i.e., C++, Java, C#. We took the source code data from the 2017 corpus. The amount of data required for training the source code features is shown in Figure 3. The source code types are given on the x-axis, and their percentile distribution is given on the y-axis. A total of 1000 programmers are analyzed in the experiment, in which 53.96435% is contributed by C++, 32.36987% by Java, and 13.66578% by C#. On the right y-axis, the cumulative frequency for each source code type is given.
FIGURE 4. Program dependence graph (data & control dependencies) for C++, Java and C#.

B. EVALUATION METRICS
The proposed methodology is evaluated on the most commonly used metrics, i.e., precision, recall, f-measure, and accuracy. The numbers of True
TABLE 1. Normalized input and output variables with minimum, maximum, mean and median values.
FIGURE 6. TFIDF features after SMOTE (green: C++, black: Java, red: C#).
FIGURE 8. Dynamic graph for Loss without fine-tune and SMOTE.
FIGURE 12. Confusion Matrix before SMOTE and Fine-Tune.
It is due to the class imbalance ratio among the three classes used. For example, C++ has the highest classification rate because it has the largest number of samples. The classifier learns the features of the largest class more than the others during training. As a result, it affects the overall classification accuracy. The confusion matrix with the SMOTE and fine-tune configuration is shown in Figure 13. Without them, the proposed model learned the largest class more in training as compared to the smallest classes. We used SMOTE and the fine-tune configuration to solve the class imbalance and misclassification problems, as shown in Figure 12(b). These methods boost classification rates and accuracy. C++, Java, and C# have 100%, 98%, and 98% classification rates, respectively.
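The per-class classification rates and the evaluation metrics can be derived from a confusion matrix as sketched below. The sketch assumes that rows are true classes and columns are predicted classes; the counts in `cm` are made up for illustration and are not the results reported in this paper, and `per_class_metrics` is a hypothetical helper name.

```python
def per_class_metrics(cm, labels):
    """Return overall accuracy and (precision, recall, f-measure) per class."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(labels)))
    report = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                  # true class i, predicted as another class
        fp = sum(row[i] for row in cm) - tp   # another class, predicted as class i
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        report[label] = (precision, recall, f_measure)
    return correct / total, report

labels = ["C++", "Java", "C#"]
# Made-up counts: rows are true classes, columns are predicted classes.
cm = [
    [50, 0, 0],
    [1, 48, 1],
    [0, 1, 49],
]
accuracy, report = per_class_metrics(cm, labels)
```

Here the recall of each class corresponds to its classification rate, i.e., the fraction of that class's samples that are assigned to the correct class.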
FIGURE 13. Confusion Matrix after SMOTE and Fine-Tune.

D. DISCUSSIONS
The proposed work is compared with the existing related research, as shown in Table 2. In [27], the authors used the SVM technique to predict programmers of C++ source codes. First, the dataset contained 20 programmers, with an accuracy of 77%, and then 100 programmers, with an accuracy of 61%. This suggests that the accuracy of SVM decreases as the number of programmers increases. Similarly, in [9], the authors used two techniques (SVM, Random Forest) for C++ source codes. The SVM technique gave an accuracy of 90% for 20 programmers for C++ source code. Further, the Random Forest predicted 100 programmers with an accuracy of 96%. All these state-of-the-art techniques were tested on the same type of source code, i.e., C++, with a different number of programmers. We have collected a dataset from GCJ for 1000 programmers with C++, Java, and C# source codes. We have evaluated the state-of-the-art techniques as well as our proposed deep learning approach on our dataset. SVM achieved 64%, Random Forest 68%, and J48 73%, while our proposed approach achieved 99% accuracy. Our dataset contains three different types of source codes, but the proposed approach still outperforms the others. Further, the proposed approach is compared with other works based on the precision, recall, and f-measure metrics, as shown in Table 3. We used the same dataset with C++, Java, and C# classes for
TABLE 2. Comparison of proposed work with other methods based on classification accuracy.
TABLE 3. Comparison of proposed work with other methods based on Precision, Recall and F-measure metrics.
previous algorithms to extensively investigate these metrics. RF, SVM, KNN, J48, CNN, and MLP are used in the comparisons. The Multilayer Perceptron (MLP) provides good results for precision and f-measure, but slightly lower results for recall. CNN gives good values, and SVM provides the lowest precision, recall, and f-measure values as compared to the others. Our proposed approach outperforms all the algorithms used in terms of these metrics.

V. CONCLUSION
Every programmer may write the same source code in a different coding style, i.e., control logic flow, different names for variables or methods. We need an intelligent way to filter such fingerprints. The PDG features may be used to extract hidden patterns regarding control flow logic and data variations in different programming codes. These PDG features are further used as input to the deep learning model to capture coding styles for the identification of programmers. We used the TensorFlow framework with the Keras API to predict authors of different types of source codes. The source codes contain a massive amount of raw data that is not important for high-quality features. We extract PDG (control & data) dependencies for C++, Java, and C# source codes. These high-quality features are further preprocessed using stemming, stop words, and minimum and maximum global frequency parameters to transform them into a useful feature matrix. It is a decomposed matrix which contains features from each type of source code with frequency details. Further, the term local and global weighting techniques are used to show the importance of each PDG feature. The logarithm of term frequency is used for local weighting and TFIDF for global weighting values. This computes the weighting values for all PDG features, which are further fed to the deep learning model. First, the experiment is run without the fine-tuning configuration and SMOTE with 1000 epochs. Then, the fine-tune setup and SMOTE methods are applied to solve the class imbalance and overfitting issues and to get better accuracy. The number of neurons in each hidden layer, the dropout layer, the learning error rate, and the loss function parameters are set to fine-tune the deep learning model. The proposed research is compared with other state-of-the-art methods in terms of classification accuracy, precision, recall, and f-measure values. The experimental results show that the proposed approach outperforms them in identifying the real author of source code. The findings of our detailed analysis could:
• Help to improve algorithms such as automatic authorship attribution as well as plagiarism detection.
• Assist forensic experts or linguists to create profiles of writers.
• Support intelligence applications to analyze aggressive and threatening messages.
The proposed research gives better prediction accuracy for different types of source code authorship attribution; however, it still has some issues. The growing size of source code may increase the complexity cost. So, if the size of a method is increased, then the PDG may take more time in the extraction process, as it works level by level in a graph structure. Also, it may be useful for accuracy and other classification metrics.

REFERENCES
[1] M. Abuhamad, J.-s. Rhim, T. AbuHmed, S. Ullah, S. Kang, and D. Nyang, “Code authorship identification using convolutional neural networks,” Future Gener. Comput. Syst., vol. 95, pp. 104–115, Jun. 2019.
[2] C. Zhang, S. Wang, J. Wu, and Z. Niu, “Authorship identification of source codes,” in Proc. Asia–Pacific Web (APWeb) Web-Age Inf. Manage. (WAIM) Joint Conf. Web Big Data. Beijing, China: Springer, 2017, pp. 282–296.
[3] D. Bodeau and R. Graubart. (2016). Cyber Resilience Metrics: Key Observations. The MITRE Corporation. [Online]. Available: https://www.mitre.org/sites/default/files
[4] X. Meng, “Fine-grained binary code authorship identification,” in Proc. 24th SIGSOFT Int. Symp. Found. Softw. Eng., 2016, pp. 1097–1099.
[5] E. Stamatatos, “A survey of modern authorship attribution methods,” J. Amer. Soc. Inf. Sci. Technol., vol. 60, no. 3, pp. 538–556, 2009.
[6] G. Frantzeskou, S. G. MacDonell, and E. Stamatatos, “Source code authorship analysis for supporting the Cybercrime investigation process,” in Handbook of Research on Computational Forensics, Digital Crime, and Investigation Methods and Solutions. Hershey, PA, USA: IGI Global, 2010, pp. 470–495.
[7] M. F. Tennyson, “A replicated comparative study of source code authorship attribution,” in Proc. 3rd Int. Workshop Replication Empirical Softw. Eng. Res., Oct. 2013, pp. 76–83.
[8] O. Mirza and M. Joy, “Style analysis for source code plagiarism detection,” in Proc. Plagiarism Across Eur. Beyond Conf., 2015, pp. 53–61.
[9] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, and R. Greenstadt, “De-anonymizing programmers via code stylometry,” in Proc. 24th USENIX Secur. Symp. USENIX Secur., 2015, pp. 255–270.
[10] F. Ullah, J. Wang, M. Farhan, S. Jabbar, Z. Wu, and S. Khalid, “Plagiarism detection in students' programming assignments based on semantics: Multimedia e-learning based smart assessment methodology,” Multimedia Tools Appl., pp. 1–18, 2018. doi: 10.1007/s11042-018-5827-6.
[11] S. G. Macdonell, A. R. Gray, G. MacLennan, and P. J. Sallis, “Software forensics for discriminating between program authors using case-based reasoning, feedforward neural networks and multiple discriminant analysis,” in Proc. 6th Int. Conf. Neural Inf. Process., Nov. 1999, pp. 66–71.
[12] S. Burrows, A. L. Uitdenbogerd, and A. Turpin, “Temporally robust software features for authorship attribution,” in Proc. 33rd Annu. IEEE Int. Comput. Softw. Appl. Conf., 2009, pp. 599–606.
[13] S. Burrows, A. L. Uitdenbogerd, and A. Turpin, “Comparing techniques for authorship attribution of source code,” Softw. Pract. Exper., vol. 44, no. 1, pp. 1–32, 2014.
[20] S. Afroz, A. C. Islam, A. Stolerman, R. Greenstadt, and D. McCoy, “Doppelgänger Finder: Taking stylometry to the underground,” in Proc. IEEE Symp. Secur. Privacy, May 2014, pp. 212–226.
[21] G. Frantzeskou, E. Stamatatos, S. Gritzalis, and S. Katsikas, “Effective identification of source code authors using byte-level information,” in Proc. 28th Int. Conf. Softw. Eng., 2006, pp. 893–896.
[22] S. Burrows and S. M. Tahaghoghi, “Source code authorship attribution using n-grams,” in Proc. 12th Australas. Document Comput. Symp., Melbourne, VIC, Australia, RMIT University, 2007, pp. 32–39.
[23] J. Kothari, M. Shevertalov, E. Stehle, and S. Mancoridis, “A probabilistic approach to source code authorship identification,” in Proc. 4th Int. Conf. Inf. Technol. (ITNG), Apr. 2007, pp. 243–248.
[24] R. C. Lange and S. Mancoridis, “Using code metric histograms and genetic algorithms to perform author identification for software forensics,” in Proc. 9th Annu. Conf. Genetic Evol. Comput., 2007, pp. 2082–2089.
[25] S. Burrows, A. L. Uitdenbogerd, and A. Turpin, “Application of information retrieval techniques for source code authorship attribution,” in Proc. Int. Conf. Database Syst. Adv. Appl., Springer, 2009, pp. 699–713.
[26] G. Frantzeskou, E. Stamatatos, S. Gritzalis, C. E. Chaski, and B. S. Howald, “Identifying authorship by byte-level N-Grams: The source code author profile (SCAP) method,” Int. J. Digit. Evidence, vol. 6, no. 1, pp. 1–18, 2007.
[27] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, Aug. 2008.
[28] W. Wisse and C. Veenman, “Scripting DNA: Identifying the JavaScript programmer,” Digit. Invest., vol. 15, pp. 61–71, Dec. 2015.
[29] M. Abuhamad, T. AbuHmed, A. Mohaisen, and D. Nyang, “Large-scale and language-oblivious code authorship identification,” in Proc. SIGSAC Conf. Comput. Commun. Secur., 2018, pp. 101–114.
[30] X. Yang, G. Xu, Q. Li, Y. Guo, and M. Zhang, “Authorship attribution of source code by using back propagation neural network based on particle swarm optimization,” PLoS ONE, vol. 12, no. 11, 2017, Art. no. e0187204.
[31] S. Alrabaee, N. Saleem, S. Preda, L. Wang, and M. Debbabi, “OBA2: An onion approach to binary code authorship attribution,” Digit. Invest., vol. 11, pp. S94–S103, May 2014.
[32] M. Marquis-Boire and M. C. M. Guarnieri, Big Game Hunting: The Peculiarities Nation-State Malware Research. Las Vegas, NV, USA, Black Hat, 2015.
[33] A. Pfeffer, C. Call, J. Chamberlain, L. Kellogg, J. Ouellette, T. Patten, G. Zacharias, A. Lakhotia, S. Golconda, J. Bay, R. Hall, and D. Scofield, “Malware analysis and attribution using genetic information,” in Proc. 7th Int. Conf. Malicious Unwanted Softw., Oct. 2012, pp. 39–45.
[34] J. H. Paik, “A novel TF-IDF weighting scheme for effective ranking,” in Proc. 36th Int. SIGIR Conf. Res. Develop. Inf. Retr., 2013, pp. 343–352.
[35] E. Haddi, X. Liu, and Y. Shi, “The role of text pre-processing in sentiment analysis,” Proc. Comput. Sci., vol. 17, pp. 26–32, May 2013.
[36] D. Baylor et al., “Tfx: A tensorflow-based production-scale machine learning platform,” in Proc. 23rd SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2017, pp. 1387–1395.
[37] Y. Yan, R. Liu, Z. Ding, X. Du, J. Chen, and Y. Zhang, “A parameter-free cleaning method for SMOTE in imbalanced classification,” IEEE Access, vol. 7, pp. 23537–23548, 2019.
[38] A. Gulli and S. Pal, Deep Learning With Keras. Birmingham, U.K.: Packt
Publishing, 2017.
[14] H. Ding and M. H. Samadzadeh, ‘‘Extraction of Java program fingerprints
[39] S. Elfwing, E. Uchibe, and K. Doya, ‘‘Sigmoid-weighted linear units for
for software authorship identification,’’ J. Syst. Softw., vol. 72, no. 1,
neural network function approximation in reinforcement learning,’’ Neural
pp. 49–57, 2004.
Netw., vol. 107, pp. 3–11, Nov. 2018.
[15] H. A. Basit and S. Jarzabek, ‘‘Efficient token based clone detection with [40] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, ‘‘Learning activa-
flexible tokenization,’’ in Proc. 6th Joint Meeting Eur. Softw. Eng. Conf. tion functions to improve deep neural networks,’’ 2014, arXiv:1412.6830.
SIGSOFT Symp. Found. Softw. Eng., 2007, pp. 513–516. [41] A. Tato and R. Nkambou, ‘‘Improving adam optimizer,’’ in Proc. ICLR
[16] S. Haiduc, J. Aponte, L. Moreno, and A. Marcus, ‘‘On the use of automated Workshop, Vancouver, BC, Canada, May 2018.
text summarization techniques for summarizing source code,’’ in Proc. [42] Z. Zhang, ‘‘Improved Adam optimizer for deep neural networks,’’ in Proc.
17th Work. Conf. Reverse Eng., Oct. 2010, pp. 35–44. IEEE/ACM 26th Int. Symp. Qual. Service (IWQoS), 2018, pp. 1–2.
[17] M. Abadi et al., ‘‘TensorFlow: A system for large-scale machine learning,’’ [43] G. Domeniconi, G. Moro, R. Pasolini, and C. Sartori, ‘‘A study on
in Proc. 12th USENIX Symp. Operating Syst. Design Implement. (OSDI), term weighting for text categorization: A novel supervised variant of
2016, pp. 265–283. tf.idf,’’ in Proc. 4th Int. Conf. Data Manage. Technol. Appl., 2015,
[18] T. Neal, K. Sundararajan, A. Fatima, Y. Yan, Y. Xiang, and D. Woodard, pp. 26–37.
‘‘Surveying stylometry techniques and applications,’’ Comput. Surv. [44] S. K. Abd-El-Hafiz, ‘‘A metrics-based data mining approach for software
(CSUR), vol. 50, no. 6, p. 86, 2018. clone detection,’’ in Proc. IEEE 36th Annu. Comput. Softw. Appl. Conf.,
[19] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, Jul. 2012, pp. 35–41.
E. C. R. Shin, and D. Song, ‘‘On the feasibility of Internet-scale [45] D. M. Powers, ‘‘Evaluation: From precision, recall and F-measure to ROC,
author identification,’’ in Proc. IEEE Symp. Secur. Privacy, May 2012, informedness, markedness and correlation,’’ Flinders Univ., Bedford Park,
pp. 300–314. SA, Australia, Tech. Rep. 27165, 2011.

FARHAN ULLAH received the B.S. degree from the University of Peshawar, Pakistan, in 2008, and the M.S. degree from CECOS University Peshawar, Pakistan, in 2012, both in computer science. He is currently pursuing the Ph.D. degree in computer science with the School of Computer Science, Sichuan University, Chengdu, China. His research work has been published in various renowned journals of the IEEE, Springer, Elsevier, Wiley, MDPI, and Hindawi. His research interests include software similarity, information security, and data science. He received the Research Productivity Award from the COMSATS Institute of Information Technology (CIIT), Sahiwal, Pakistan, in 2016.

JUNFENG WANG received the M.S. degree in computer application technology from the Chongqing University of Posts and Telecommunications, Chongqing, in 2001, and the Ph.D. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, in 2004. From July 2004 to August 2006, he held a postdoctoral position with the Institute of Software, Chinese Academy of Sciences. He has been a Professor with the School of Aeronautics and Astronautics, College of Computer Science, Sichuan University, since August 2006. His current research interests include network and information security, spatial information networks, and data mining. He has been invited to serve as an Associate Editor for IEEE ACCESS, the IEEE INTERNET OF THINGS JOURNAL, the Security and Communication Networks, and so on.

FADI AL-TURJMAN received the Ph.D. degree in computer science from Queen’s University, Kingston, ON, Canada, in 2011. He is currently a Professor with the Artificial Intelligence Department, Near East University, Nicosia, Turkey. He is a leading authority in the areas of smart/cognitive, wireless and mobile networks’ architectures, protocols, deployments, and performance evaluation. His publication history spans over 200 publications in journals, conferences, patents, books, and book chapters, in addition to numerous keynotes and plenary talks at flagship venues. He has authored/edited more than 20 books about cognition, security, and wireless sensor networks’ deployments in smart environments, published by Taylor & Francis and Springer. He has received several recognitions and best paper awards at top international conferences. He also received the prestigious Best Research Paper Award from the Computer Communications (COMCOM) journal (Elsevier) for the period 2015–2018 and the Top Researcher Award for 2018 at Antalya Bilim University, Turkey. He has led a number of international symposia and workshops in flagship communication society conferences. He serves as the Lead Guest Editor for several well-reputed journals, including COMCOM (Elsevier), Sustainable Cities and Society (SCS), IET Wireless Sensor Systems, and the Springer, EURASIP, and MONET journals.