Int J Syst Assur Eng Manag (October 2018) 9(5):1080–1091
https://doi.org/10.1007/s13198-018-0740-y
ORIGINAL ARTICLE
Information entropy applied to software based control flow graphs
Aditya Akundi1 · Eric Smith1 · Tzu-Liang Tseng1
Received: 27 April 2017 / Revised: 27 February 2018 / Published online: 24 July 2018
© The Society for Reliability Engineering, Quality and Operations Management (SREQOM), India and The Division of Operation and Maintenance, Lulea University of Technology, Sweden 2018
Abstract Information theory, introduced by Shannon in the context of information transfer in communication channels, is used as a foundation for research in many diverse fields. In information theory, entropy is the average amount of information, or the rate at which information is produced, when a message is formed element by element. Entropy has found broad application in many research fields and can also be applied in software engineering to quantify the uncertainty associated with a software code. In this paper, information entropy and its application towards measuring software complexity are explored, along with the formulation of an information entropy based complexity measure that considers logical decision-making processes and software statement interaction patterns in control flow graphs mapped from actual software code. To broaden the application of the proposed metric, the execution times of nodes in the control flow graphs are also incorporated. Further, the metric is evaluated against eight axioms that a software complexity measure should satisfy.
Keywords Complexity · Control flow graph · Entropy · Information theory · Software complexity
1 Introduction
Software code complexity measures are mainly used or
adapted in the design and implementation phase of a code.
They are used to measure the individual inherent complexity of code modules, individual components (a software component is an element composed according to a set of pre-defined standards that conform to a specific behavior), and procedures. Modules, procedures, and components of a
code, irrespective of the level at which they are developed,
are inter-dependent. The structural and information architecture of code has a significant impact on both complexity
measures and quality measures (McCall 1977).
Complexity in software code can be defined as the attribute of a code that affects the effort required to develop, change, or debug a piece of software.
Many different methods have been suggested in the literature for the quantitative characterization of the complexity inherent in software. These metrics, when captured quantitatively, serve as anchors for software design, development, and re-engineering efforts. Complexity metrics can be broadly characterized into information based metrics and size and structure based metrics. Table 1 lists contributions in the development of software code complexity metrics.
2 Entropy based software complexity measures
Corresponding author: Aditya Akundi, sakundi@utep.edu
1 Industrial Manufacturing and Systems Engineering Department, The University of Texas at El Paso, El Paso, TX 79968, USA
Entropy in thermodynamics represents the inherent disorder in a system over time as the system heads towards thermodynamic equilibrium. In information theory, according to Shannon, entropy helps to quantify information (Shannon et al. 1951). Quantifying information
Table 1 Software complexity metric contributions

Category | Contributing authors | Complexity metric
Information based | G. M. Weinberg | Complexity measure based on number of lines of code (Van Vliet et al. 1993)
Information based | Maurice Howard Halstead | Based on number of operators and operands (Van Vliet et al. 1993; Hamer and Frewin 1982)
Information based | Scott W. Woodfield | Based on conceptually unique operands (Woodfield 1979)
Information based | Eli Berlinger | Information based complexity measure (Berlinger 1980)
Size and structure based | Maurice Howard Halstead | Program effort and difficulty measure based on program vocabulary, number of distinct operators, total number of operators, total number of operands and volume (Hamer and Frewin 1982; Weyuker 1988)
Size and structure based | Fitzsimmons and Love | Number of delivered errors based on Halstead's effort metric (Fitzsimmons and Love 1978)
Size and structure based | E. I. Oviedo | Complexity measure based on control and data flow using number of available definitions of variables in blocks of program body (Weyuker 1988; Oviedo 1993)
Size and structure based | Thomas J. McCabe | Cyclomatic complexity metric (Wallace et al. 1993)
Size and structure based | Sallie Henry and Dennis Kafura | Metrics based on global, local and indirect flow relations (Van Vliet et al. 1993; Henry and Kafura 1981)
Size and structure based | Martin Shepperd | Metric based on fan-in and fan-out measures (Van Vliet et al. 1993)
implies analyzing the information present and measuring its associated uncertainty. Higher values of entropy signify less order in a system, and lower values of entropy signify a more ordered system. Entropy has found broad application in many fields and can also be applied in software engineering to quantify the uncertainty associated with a software code.
Entropy H of a system, according to statistical mechanics, is defined as (Shannon et al. 1951):

H = − K Σ_i P_i log P_i    (1)

where P_i is the probability of a particular state and K is Boltzmann's constant.
Shannon entropy H is given as (Shannon et al. 1951):

H = − Σ_i P_i log2 P_i    (2)

where P_i is the probability of a symbol showing up in a given stream of symbols, and the use of the logarithm base two corresponds to expressing information entropy in terms of bits.
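As a quick illustration (ours, not from the paper), Eq. (2) can be computed directly from a stream of symbols by estimating each symbol's probability from its frequency:

```python
from collections import Counter
from math import log2

def shannon_entropy(symbols):
    """Shannon entropy H = -sum(p_i * log2(p_i)) of a symbol stream, in bits."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A fair coin (two equally likely symbols) carries one bit of entropy.
print(shannon_entropy("HTHT"))  # 1.0
```

A stream with a single repeated symbol yields zero entropy, matching the intuition that a fully predictable message carries no information.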
There are a number of studies in which entropy is used as the basic foundation for software complexity measures. Several entropy based measures have been proposed that are sensitive to probability values calculated from the frequency of usage of symbols, sets of software inputs, sets of outputs, sets of links of nodes, the frequency of string occurrence in a code, the frequency of names occurring in a code, the frequency of operator occurrence, the number of attributes, the reuse ratio, frequently occurring operators, the number of leaf nodes, and also a few object oriented design metrics (Selvarani et al. 2009; Berlinger 1980; Jung et al. 2011; Chaturvedi et al. 2014; Snider 2001; Harrison 1992; Bansiya et al. 1999; Mills 1999; Roca 1996).
Berlinger (1980) provided an information theory based complexity measure built on entropy. The defined complexity measure is sensitive to the frequency of occurrence of all the tokens in a program; tokens such as operators and operands here refer to elements of the programming language being used. According to Berlinger, there are several possible interpretations of this measure: from an information point of view it represents the total information contained in the code, while under an ideal coding scheme it represents the total length required to develop the program. Berlinger further suggests that, irrespective of the interpretation used, the measure is sensitive to the frequency of the symbols' occurrence and the proportion of times each symbol has occurred in the past (Berlinger 1980).
Snider (2001) provided a complexity metric using structural graphs. This entropy based metric for measuring the entropy of large software systems is based on the number of leaf nodes, the number of dependency edges, and the distance between two leaf nodes (the minimum number of interior nodes traversed) (Snider 2001). Refer to Snider (2001) for more information.
Harrison (1992) provided an entropy based measure of software complexity on the basis of information theory. This metric is developed from the hypothesis that a program with high average information content should, on the whole, be less complex than a program with low average information content. Harrison calls the complexity metric an Average Information Content Classification (AICC) measure, which depends upon the
total number of operators used in the program and the frequency with which a considered operator appears in the code (Harrison 1992). Refer to Harrison (1992) for more information on this metric.
According to Bansiya et al. (1999), an entropy based complexity metric for object oriented designs can be applied in the early stages of development to ensure that a developer analyzes and reiterates the internal characteristics that lead to a quality oriented design. The entropy measure developed is solely a measure of class complexity: it measures information content as a function of the number of strings in a class and of how frequently a string repeats within class definitions, irrespective of the language being used. The Class Definition Entropy (CDE) is developed on the basis of Shannon entropy, where CDE is characterized by the probabilities of the most frequently occurring strings (Bansiya et al. 1999). Refer to Bansiya et al. (1999) for more information on this metric. Solé and Valverde (2004) proposed entropy and mutual information measures for networks based on degree distributions, applied to a range of real world and software networks, specifically software class diagrams.
Although there are many contributions and studies in this field of entropy based software complexity, few authors consider the structural and logical flows of input and output variables among the developed software modules when calculating software complexity, which directly or indirectly relates to software quality. A logical flow here is defined as a representation of the decision-making processes coded into software modules, and a structural flow as a representation of the interaction patterns among the statements in a software code. Though topological interactions and logical characteristics have been considered individually, for example in Solé and Valverde (2004), characterizing complexity based on structural and logical flow along with time has not yet been observed. Therefore, we develop a complexity measure for software which considers logical decision-making processes, software statement interaction patterns, and time together, to create an improved software complexity measure based on the concept of entropy.
3 Proposed entropy based complexity metric
3.1 Definition of proposed metric
For analyzing large software codes, the techniques used are often important, and learning about the code based on its structure is deemed important (Mens 2016). The proposed software complexity metric is developed based on the concept of Shannon entropy. The modules of a software code, when similarly contemplated in terms of symbols, will have input and output flows that provide for information transfer from one module to another. Assuming that the modules are fully functional without any uncertainties associated with them, the proposed metric considers the data flow relationships of a module (a module represents a decision control structure, loop control structure, case control structure, subroutine, or a function).
The entropy associated with a module depends upon the output data flows from the module, the input data flows to the module, and the execution times at the module. Depending upon the structure of the module considered, it has corresponding paths for input and output flows. Such modules, embedded in a software code and mapped graphically, form the foundation of the proposed metric.
Let a given piece of code be characterized by a control flow graph wherein the nodes represent the various modules associated with the code and the edges represent the input and output flows associated with them. Each edge originating at a node has a time factor associated with it, based on the time taken to successfully complete the task it is coded for. The greater the number of input and output flows associated with a module, depending upon its characterization, the greater the associated uncertainty. The proposed complexity measure is thus defined as:
H = − Σ_{j=1}^{n} Σ_{i=1}^{k} l_j P_ji log2 P_ji    (3)

where n is the number of nodes characterizing the software; k is the number of outputs associated with a node (the number of outputs corresponds to the different outgoing edges representing all possible distinct outputs leading to different nodes); l_j is the likelihood of occurrence associated with node j, based on the number of arcs incoming to the node; and P_ji is the probability distribution of output i associated with node j.
When a software code is transformed into a control flow graph (CFG), the value of n is obtained from the number of nodes in the graph. It is assumed here that all the edges in the CFG have a travel time of one unit each. The likelihood of occurrence l_j associated with a node depends upon the number of inputs (i.e., the number of arcs incoming) to node j. This is based on the assumption that the more inputs a node has, the more likely it is to occur in a graph, which may be because it has more feedback loops, controls from predecessor nodes, loops entering the node, loops ending at the node, or multiple possible executions. The probability distribution P_ji of output i associated with node j is based on its outputs according to the Principle of Indifference: suppose there are k indistinguishable possible outputs coming out of a node
represented on a control flow graph, then each outcome will be assigned a probability 1/k. For a given node, each outgoing edge represents an outgoing flow of control after some part of the node's execution. The probabilities here are based on the outgoing flows observed.
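Under these assumptions, the per-node contribution to Eq. (3) reduces to l_j · log2(k): with k equiprobable outputs, −Σ (1/k) log2(1/k) = log2 k. A minimal sketch (ours; the function name is illustrative):

```python
from math import log2

def node_entropy(incoming, outgoing):
    """Entropy contribution of one CFG node under Eq. (3):
    likelihood l_j = number of incoming arcs, and the k outgoing
    edges are equiprobable (principle of indifference), so
    H_j = -l_j * sum_i (1/k) * log2(1/k) = l_j * log2(k)."""
    if outgoing == 0:
        return 0.0               # terminal node: nothing left to predict
    l_j = max(incoming, 1)       # start nodes are assumed to have inflow 1
    return l_j * log2(outgoing)

print(node_entropy(incoming=1, outgoing=2))  # 1.0 (a simple "if" node)
print(node_entropy(incoming=2, outgoing=2))  # 2.0 (a branch node reached twice)
```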
Fig. 2 Sample CFG for metric application (nodes 1–7)
3.2 Characterization of the proposed metric
The proposed entropy measure is additive; that is, the
amount of total entropy in a piece of code characterized by
a control flow graph is the sum of the individual entropies
of all the associated nodes.
This metric can be characterized in different ways to support its application in software code systems. As discussed previously, nodes are associated with their individual input and output flows, which influence their software behavior. Since the nodes of a control flow graph at the structural level are required to represent decision logic and flows, the primitive formulations can be established as illustrated in Fig. 1. The value of H(1) in Fig. 1 implies that the particular node has a single outflow that occurs with a probability of 1.
According to the principle of Shannon entropy, considering the example of a coin toss, there is an equal chance of heads or tails, and the outcome of this experiment has an entropy, or information content, of one bit. Similarly, if there are two outputs from a node, the associated entropy, based on the principle of indifference, will be equal to one:

E = H(1/2, 1/2) = −(1/2 log2 1/2 + 1/2 log2 1/2) = 1
Figure 2 shows a sample control flow graph (CFG) to illustrate the application of the proposed complexity metric. The metric is computed for all the nodes of the graph based on their input and output flows. The computation is based on the entropy value calculated at each and every node identified using the CFG. The software complexity measure is obtained by summing the entropy values at each node of the CFG.
As mentioned previously, the metric is solely based on the output distribution and likelihood of occurrence of the nodes. All the edges associated with the CFG are assumed to be of one time unit each. Table 2 shows the metric calculation for the CFG illustrated in Fig. 2.
The complexity metric computes a number to represent the complexity: the higher the entropy value, the more complex the considered piece of code, and vice versa. Information can also be drawn at each node to identify the complexity associated with that particular node, by measuring its entropy value based on the number of outputs and the likelihood of occurrence of each output. Based on the computed metric values in Table 2, node 4 (H = 2) is more complex than node 1 (H = 1), and node 1 (H = 1) is complex relative to all the other nodes (with H = 0) in the CFG of Fig. 2.
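The whole-graph computation can be sketched as follows (our illustration; the edge list is a hypothetical one consistent with the in/out-degrees reported for Fig. 2, not taken from the paper):

```python
from collections import defaultdict
from math import log2

def cfg_entropy(edges, nodes):
    """Total metric of Eq. (3) over a CFG: sum of l_j * log2(k_j) per node,
    where l_j is the in-degree (start nodes count as 1) and k_j the
    out-degree, with equiprobable outputs per the principle of indifference."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    total = 0.0
    for node in nodes:
        k = outdeg[node]
        if k:  # terminal nodes contribute nothing
            total += max(indeg[node], 1) * log2(k)
    return total

# A hypothetical edge list matching the degrees of Fig. 2 (assumed here):
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6), (6, 7)]
print(cfg_entropy(edges, range(1, 8)))  # 3.0, matching H_total in Table 2
```

Only nodes 1 and 4 contribute (both have two outputs), reproducing the per-node values of Table 2.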
Fig. 1 Basic programming primitive definitions (Note: all starting nodes are assumed to have an inflow of 1):

"Single I/O Node": E = H(1) = 0
"Sequence": E = H(1) + H(1) = 0 + 0 = 0
"if": E = H(1/2, 1/2) + H(1) + H(1) = 1 + 0 + 0 = 1
"Case": E = H(1/3, 1/3, 1/3) + H(1) + H(1) + H(1) = 1.58 + 0 + 0 + 0 = 1.58
"while": E = H(1) + 2 × H(1/2, 1/2) + H(1) = 0 + 2 + 0 = 2
"until loop": E = H(1) + H(1/2, 1/2) = 0 + 1 = 1

Table 2 Entropy based metric values for CFG

Node | Likelihood of occurrence (incoming edges) | No. of outputs (data outputs) | Entropy H_j = − l_j Σ_i P_ji log2 P_ji
1 | 1 | 2 | 1 · H(1/2, 1/2) = 1
2 | 1 | 1 | 1 · H(1) = 0
3 | 1 | 1 | 1 · H(1) = 0
4 | 2 | 2 | 2 · H(1/2, 1/2) = 2
5 | 1 | 1 | 1 · H(1) = 0
6 | 2 | 1 | 2 · H(1) = 0
7 | 1 | 1 | 1 · H(1) = 0

H_total = 3

The CFG complexity metric equals the sum of the values in column 4 of the above table: E = H_total = 3
4 Validation of proposed metric
In order to verify and validate the proposed metric and its ability to determine a complexity measure based on the CFG of a given piece of code, it is here correlated with the well-known and frequently cited Cyclomatic Complexity measure of Thomas J. McCabe. This validation is based on calculating the complexity of 12 different control flows using both the proposed metric and McCabe's complexity measure.

The Cyclomatic Complexity measure is based on the number of edges, the number of connected components, and the number of vertices in a CFG. The 12 control flow graphs used are adapted from McCabe's paper, which establishes a Cyclomatic Complexity measure for a given program based on its characterization as a control flow graph (McCabe 1976). McCabe's Cyclomatic Complexity numbers, along with the complexity values obtained using the proposed metric, are tabulated in Table 3.

Table 3 Complexity metric values of CFGs adapted from McCabe's paper

Control flow graph | Number of nodes | McCabe's cyclomatic complexity measure | Proposed complexity metric
1 | 3 | 0 | 0.00
2 | 7 | 3 | 3.00
3 | 10 | 5 | 6.00
4 | 12 | 6 | 7.00
5 | 12 | 8 | 7.58
6 | 13 | 8 | 8.16
7 | 19 | 9 | 12.16
8 | 20 | 10 | 13.00
9 | 23 | 10 | 11.58
10 | 25 | 11 | 16.58
11 | 18 | 10 | 16.00
12 | 36 | 19 | 22.16

Spearman's rank correlation method was used to measure the correlation between the two complexity measures obtained in Table 3. Table 4 shows the relative rankings of the complexity metrics.

Table 4 Relative rankings of complexity metric measures in Table 3

McCabe's complexity measure (x_i) | Complexity measure using proposed metric (y_i) | Ranks of (x_i) | Ranks of (y_i)
0 | 0.00 | 1.0 | 1.0
3 | 3.00 | 2.0 | 2.0
5 | 6.00 | 3.0 | 3.0
6 | 7.00 | 4.0 | 4.0
8 | 7.58 | 5.5 | 5.0
8 | 8.16 | 5.5 | 6.0
9 | 12.16 | 7.0 | 8.0
10 | 13.00 | 9.0 | 9.0
10 | 11.58 | 9.0 | 7.0
11 | 16.58 | 11.0 | 11.0
10 | 16.00 | 9.0 | 10.0
19 | 22.16 | 12.0 | 12.0

A value of r_s = 0.9771 is obtained using Spearman's correlation coefficient formulation for a sample size of 12, implying that the proposed complexity metric and McCabe's complexity metric are very strongly correlated.

To further validate the metric, a set of seven different Matlab codes is converted to control flow graphs. These codes are adapted and randomly chosen from a freely available online database provided by the Massachusetts Institute of Technology (Web.mit.edu 2014), and are programmed to perform basic linear algebraic computations.

Validation here follows the same guidelines as before: both the Cyclomatic Complexity and the proposed entropy based complexity measure are calculated for the control flow graphs, and Spearman's rank correlation method is used to measure the correlation between the two complexity measures. Table 5 illustrates the complexity metric values calculated from the control flow graphs developed from the Matlab codes, and the relative rankings of the obtained measures are tabulated in Table 6.

A value of r_s = 0.8818 is obtained using Spearman's correlation coefficient, implying that the proposed complexity metric and McCabe's complexity metric are again strongly correlated. Based on the above, it can be seen that the proposed metric is strongly correlated with the well-established McCabe Cyclomatic Complexity measure.
Table 5 Complexity metric values of CFGs based on the Matlab codes considered

Control flow graph | Number of nodes | McCabe's cyclomatic complexity measure | Proposed complexity metric
1 | 15 | 5 | 5
2 | 26 | 11 | 19
3 | 28 | 13 | 23
4 | 11 | 4 | 7
5 | 6 | 2 | 2
6 | 10 | 2 | 4
7 | 9 | 3 | 2
Table 6 Relative rankings of complexity metric measures in Table 5

McCabe's complexity measure (x_i) | Complexity measure using proposed metric (y_i) | Ranks of (x_i) | Ranks of (y_i)
5 | 5 | 5.0 | 4.0
11 | 19 | 6.0 | 6.0
13 | 23 | 7.0 | 7.0
4 | 7 | 4.0 | 5.0
2 | 2 | 1.5 | 1.5
2 | 4 | 1.5 | 3.0
3 | 2 | 3.0 | 1.5
To further expand the analysis, eight different engineering tasks, each coded in C, Java, Python, and Matlab, are converted to CFGs, and the proposed metric is applied to see which language is likely to be ranked more complex. Table 7 shows the results, where a given row across the table illustrates the complexity calculated for a given program.

Based on this small and simple set of programs, it is observed that a program coded in C or Java is associated with a higher complexity value than the same program coded in either Python or Matlab.
5 Metric improvement to include time of execution
The metric defined in Sect. 3 is here modified to include the time of execution at each and every module, incorporating execution time into the complexity analysis. The execution time of a program can be defined as the time taken by the program to process its inputs and execute its tasks (Puschner and Koza 1989).
A software code module's execution time depends upon several factors, such as the instruction set used, the type of compiler, the processor speed, and other similar factors (Adamchik 2009). This implies that the execution time associated with a module depends upon its implementation. The time complexity of a program is defined as a measurement of how fast the time taken by the program grows with an increase in the input size, implying that for a given input vector n = {n1, n2, n3, ...}, the execution time t will be proportional to n, which can be represented as t ∝ n.
It is here assumed that a software code, when represented by a control flow graph, has exponentially many paths for its execution. The execution time of a module (represented as a node in the control flow graph) remains the same, and the total execution time is based on the count of the individual node occurrences. Therefore, the run time of the program equals the summation of the total execution times at each node. The improved complexity measure, which also incorporates individual node execution times, is now represented as shown below, based on the following assumptions.
Let n be the number of nodes in the CFG, m the number of outputs originating from a node, and r the number of inputs converging to a node:

H = − Σ_{i=1}^{n} Σ_{j=1}^{m} Σ_{k=1}^{r} T(i) cnt(i_k) P_ij log2 P_ij    (4)

where T(i) = {t(i_1), t(i_2), t(i_3), ..., t(i_k)} is a vector of execution times for each node associated with the CFG, cnt(i_k) is the count of the number of inputs to node i, and P_ij is the probability distribution of output j associated with node i.
From the perspective of a programmer, the execution time of a node, while either active (executing its functions) or inactive (waiting for an input), depends upon the computational algorithms and processes of the node, with units ranging from nanoseconds to milliseconds or seconds. To overcome the effect of the units in the complexity analysis, we suggest normalizing these values on a scale of 0–1. Although normalizing the execution times maps them to the range 0–1, the effect of execution time on the complexity analysis remains the same. For convenience, the metric shown in (4) can be represented in the following terms:
Table 7 Calculated complexity for engineering tasks coded in C, Java, Python and Matlab

Program vs calculated complexity | Python | C/C++ | Matlab | Java
Greatest common denominator | 7 | 7 | 4 | 7
Matrices addition | 5 | 12 | 4 | 16
Linear search | 3 | 7 | 1 | 7
Binary search | 5 | 8 | 5 | 8
Floyd's triangle | 5 | 4 | 2 | 4
Transpose of a matrix | 6 | 11 | 4 | 12
Multiplication of matrices | 8 | 19 | 5 | 19
Bubble sort | 5 | 9 | 2 | 9
H = − Σ_{i=1}^{n} Σ_{j=1}^{m} Σ_{k=1}^{r} C(i_k) (P_ij log2 P_ij)    (5)

where

C(i_k) = cnt(i_k) × (t(i_k) − min(T(i))) / (max(T(i)) − min(T(i)))
The data and procedure used for validating the initially suggested metric in (3) remain valid for the improved version of the metric in (5), under the assumption of a unit execution time for each node, since data on specific node execution times was unavailable.
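A per-node sketch of the time-weighted form in Eq. (5), with min–max normalized execution times (our illustration; the function name and input values are hypothetical):

```python
from math import log2

def time_weighted_node_entropy(times, incoming_counts, k_outputs):
    """Sketch of Eq. (5) for a single node: execution times are min-max
    normalized to [0, 1] and scaled by the input count cnt(i_k); outputs
    are assumed equiprobable, so -sum_j P_ij log2 P_ij = log2(k)."""
    lo, hi = min(times), max(times)
    span = (hi - lo) or 1.0          # guard: all-equal times normalize to 0
    h = 0.0
    for t, cnt in zip(times, incoming_counts):
        c = cnt * (t - lo) / span    # C(i_k) from Eq. (5)
        h += c * log2(k_outputs) if k_outputs else 0.0
    return h

# Hypothetical inputs: two incoming arcs with execution times 2 ms and 8 ms,
# feeding a node with two equally likely outputs.
print(time_weighted_node_entropy([2.0, 8.0], [1, 1], 2))  # 1.0
```

Note that with unit (equal) execution times the normalized weights vanish, which is consistent with the validation above falling back to Eq. (3) behavior.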
6 Evaluation of proposed metric
In order to evaluate the proposed entropy based metric, we use a set of eight axioms formulated and proposed by Elaine J. Weyuker. Weyuker suggests that these axioms are a set of conclusive evaluation measures to be satisfied by any syntactic complexity measure. Four well-known software complexity measures, the Cyclomatic Complexity number, Halstead's programming effort, statement count, and Oviedo's data flow complexity, were evaluated by Weyuker against the proposed axioms, and it was found that none of the measures satisfies all the properties (Weyuker 1988).
In this paper, these axioms are used to intuitively understand the properties of the proposed metric, to identify possible scenarios where it can be applied, and to identify its weaknesses, which helps decide whether the proposed metric is useful in a given scenario. This section introduces each axiom that Weyuker (1988) deemed necessary for any complexity measure and shows whether the proposed metric satisfies it.
Notations:

• P, Q, and R represent program bodies
• |P| represents the complexity measure of P
• |Q| represents the complexity measure of Q
• P ≡ Q represents that P and Q have the same functionality
• |P;Q| represents the complexity measure obtained by concatenation of P and Q
• |P;R| represents the complexity measure obtained by concatenation of P and R
• |R;P| represents the complexity measure obtained by concatenation of R and P
• |R;Q| represents the complexity measure obtained by concatenation of R and Q
• |Q;R| represents the complexity measure obtained by concatenation of Q and R
Property 1 (∃P)(∃Q)(|P| ≠ |Q|), i.e., there exist a program body P and a program body Q that a given complexity measure should not rank as equally complex.

This property requires the metric to measure the complexity of each program distinctly, thereby ensuring that not all programs are calculated to be equally complex. From Eq. (5) it can be seen that the proposed metric depends on the type of control flows in the program, the number of inputs to each node, the output probability distribution of the nodes, and the execution times, which are unique for a given program, thereby satisfying this property. To portray further applicability of this property, we consider two programs implemented in C++, an implementation of a sorted circularly doubly linked list (P) and of the Tower of Hanoi problem (Q), which are converted into control flow graphs using freely available online converters so that the proposed complexity metric can be applied. The control flow graphs used are illustrated in Fig. 3. Calculating the complexity measures for program bodies P and Q in Fig. 3 using the proposed metric gives |P| = 19 and |Q| = 37, thereby satisfying the property.
Property 2 (∃P)(∃Q)(P ≡ Q & |P| ≠ |Q|), i.e., there may exist two programs with the same functionality, but the complexity of each program depends upon its implementation.

This property places emphasis on the effect of different implementations of a program on its complexity. We consider two functionally equivalent programs that differ in their implementation procedures, which are converted to CFGs for calculating complexity using the proposed metric. Figure 4 illustrates two control flow graphs developed from a C program and its optimized version. These programs are adapted from Venkatachalam et al. (2012), where the authors optimize a C program using graph mining techniques. Calculating the complexity measures for program bodies P and Q in Fig. 4 using the proposed metric gives |P| = 3 and |Q| = 2.58.

Further, we consider two program bodies based on the Fisher–Yates problem, one implemented in C++ (P) and the other in JavaScript (Q). Calculating the complexity measures for program bodies P and Q in Fig. 5 using the proposed metric gives |P| = 9 and |Q| = 4.

Therefore, for P ≡ Q, |P| ≠ |Q|, showing that the proposed metric satisfies this property.
Property 3 For any non-negative number c, there are only finitely many programs with complexity c.
Fig. 3 Two program bodies with two uniquely different functionalities, shown with their control flow graphs (Bhojasia 2017)

Fig. 4 Two programs with the same functionality and different implementations (a C program and its optimized three-address-code version), along with their control flow graphs (Adamchik 2009)
This property builds on Property 1, addressing a complexity metric's ability to distinguish between programs with the same decision structure that perform few computations and those which perform many. The proposed metric, which considers the execution time at nodes, can distinguish the complexity of a computation, based on the fact that nodes that perform few computations take less time than nodes that perform many computations. Therefore, this property is satisfied: (∀P), |P| ≥ 0.
Fig. 5 Two programs with same functionality and different implementations along with their control flow graphs (Bhojasia 2017)
Property 4 (∀P)(∀Q)(|P| ≤ |P;Q| and |Q| ≤ |P;Q|), i.e., the individual complexities of given program bodies should always be less than or equal to the complexity obtained when they are concatenated.

The emphasis here is on the increase in complexity when a program body is composed by combining two programs (child programs): the individual complexities of the two programs are always less than or equal to that of their parent. To illustrate this, we consider three different control flow graphs, for which the individual complexities are calculated along with the complexity when the graphs are combined.
Fig. 6 Illustrations of CFGs for (P), (Q), and (P;Q)
From Fig. 6, the complexities calculated using the proposed metric are |P| = 3, |Q| = 3.58, and |P;Q| = 6.58, which illustrates that this property is satisfied.
To elaborate further, we consider two different program bodies, where program body P corresponds to a C++ script to find whether a given matrix is invertible, and program body Q corresponds to a C++ script to find the determinant of a given matrix. When concatenated, program bodies P and Q are represented as P;Q. From Fig. 7, the complexities calculated using the proposed metric are |P| = 14, |Q| = 13, and |P;Q| = 17, which illustrates that this property is satisfied.
Hence, whenever two different control flows (extracted
from a program) are concatenated, an increase in total
number of inputs, outputs and decision structure are
observed. This increase in the number of inputs, outputs
Fig. 7 Illustrations of CFG’s
for (P), (Q) and (P; Q) (Bhojasia
2017)
123
Property 5.1 (∃P)(∃Q)(∃R) (|P| = |Q| and |P;R| ≠ |Q;R|); 5.2 (∃P)(∃Q)(∃R) (|P| = |Q| and |R;P| ≠ |R;Q|), i.e. there exist two program bodies P and Q with the same complexity such that, when a new program body R is concatenated with each of them, the resulting complexities differ.
This property places emphasis on identifying the interactions that may significantly impact the complexity of a program when it is concatenated with an external program body. To illustrate this, we consider three different program bodies P, Q and R, where P is a C code that identifies whether a given number is even or odd, Q identifies whether a given number is greater or less than 10, and R identifies whether a given number is prime. Figure 8 illustrates program bodies P, Q and R, and their respective CFGs.
To check whether the proposed complexity measure holds this property, we concatenate program body R to program body P and to program body Q. Figure 9 illustrates program bodies (P;R) and (Q;R) and their CFGs.
From Fig. 8, |P| = |Q| = 1 and |R| = 5. When program body R is concatenated to P and Q, based on the control flow graphs illustrated in Fig. 9, it is observed that |P;R| = |Q;R| = 6. Although the complexity calculated from the control flow graphs in Fig. 9 is the same for both program bodies, note that program body (Q;R) contains an additional assignment of the number being considered ('num = x') at node 6. This assignment increases the execution time at that node compared to its execution time in (P;R). Therefore |P;R| ≠ |Q;R|, which implies that this property holds for the proposed metric.
Fig. 8 Program bodies P, Q, R and their respective control flow graphs
A. Program Body P (to check if a number is even or odd):

#include <stdio.h>
int main()
{
    int num;
    printf("Enter a number to be checked: ");
    scanf("%d", &num);
    if ((num % 2) == 0)
        printf("%d is even", num);
    else
        printf("%d is odd", num);
    return 0;
}

B. Program Body Q (to check if a number is greater or less than 10):

#include <stdio.h>
int main()
{
    int x;
    printf("Enter a number to be checked: ");
    scanf("%d", &x);
    if (x > 10)
        printf("%d is greater", x);
    else
        printf("%d is lesser to 10", x);
    return 0;
}

C. Program Body R (to check if a number is prime or not):

#include <stdio.h>
int main()
{
    int num, i, count = 0;
    printf("Enter a number: ");
    scanf("%d", &num);
    for (i = 2; i <= num / 2; i++) {
        if (num % i == 0) {
            count++;
            break;
        }
    }
    if (count == 0 && num != 1)
        printf("%d is a prime number", num);
    else
        printf("%d is not a prime number", num);
    return 0;
}

D. CFG of Program body P; E. CFG of Program body Q; F. CFG of Program body R
Fig. 9 Program bodies (P;R) and (Q;R) and their respective control flow graphs
#include <stdio.h>
int main()
{
    int num, i, count = 0;
    printf("Enter an integer you want to check: ");
    scanf("%d", &num);
    if ((num % 2) == 0)  /* checking whether remainder is 0 or not */
        printf("%d is even.", num);
    else
        printf("%d is odd.", num);
    for (i = 2; i <= num / 2; i++) {
        if (num % i == 0) {
            count++;
            break;
        }
    }
    if (count == 0 && num != 1)
        printf("%d is a prime number", num);
    else
        printf("%d is not a prime number", num);
    return 0;
}

A. Program body (P;R) and its CFG
#include <stdio.h>
int main()
{
    int x;
    printf("Enter an integer you want to check: ");
    scanf("%d", &x);
    if (x > 10)  /* checking whether it is greater than 10 or not */
        printf("%d is greater.", x);
    else
        printf("%d is lesser.", x);
    int num, i, count = 0;
    num = x;
    for (i = 2; i <= num / 2; i++) {
        if (num % i == 0) {
            count++;
            break;
        }
    }
    if (count == 0 && num != 1)
        printf("%d is a prime number", num);
    else
        printf("%d is not a prime number", num);
    return 0;
}

B. Program body (Q;R) and its CFG

Property 6 There exist program bodies P and Q such that Q is formed by permuting the order of the statements of P, and |P| ≠ |Q|.

This property signifies the importance of permuting program statements, whose effect should be considered when quantifying a program's complexity. This property does not hold for the proposed metric, as the nodes in the control flow graphs are independent of where the program statements are placed. Also, the execution time remains the same when a given set of statements is reordered.
Property 7 If program bodies P and Q are almost identical, then |P| = |Q|.

This property clearly holds for the metric: if only the names chosen for identifiers (different mnemonics) differ, the interaction and control flow patterns, along with the execution times, remain the same. This also holds if the operators or the constants used in two otherwise identical program bodies change while all other factors remain the same.

Table 8 Metric comparison to other complexity measures using Weyuker's criteria

Weyuker property number | Statement count | Cyclomatic number | Effort measure | Data flow complexity | Proposed entropy-based complexity metric
1 | YES | YES | YES | YES | YES
2 | YES | YES | YES | YES | YES
3 | YES | NO  | YES | NO  | YES
4 | YES | YES | NO  | NO  | YES
5 | NO  | NO  | YES | YES | YES
6 | NO  | NO  | NO  | YES | NO
7 | YES | YES | NO  | YES | YES
8 | YES | YES | NO  | NO  | YES

'YES' indicates the property is satisfied and 'NO' indicates it is not satisfied
Property 8 (∀P)(∀Q) (|P| + |Q| ≥ |P;Q|), i.e. the complexity of two concatenated program bodies never exceeds the sum of their individual complexities.

This can be clearly observed from Fig. 6, where |P| = 3, |Q| = 3.58 and |P;Q| = 6.58, and from Fig. 7, where |P| = 14, |Q| = 13 and |P;Q| = 17, illustrating that this property holds for the proposed complexity measure.
As observed, the proposed metric is largely compliant with Weyuker's criteria. Table 8 presents the evaluation, according to Weyuker's criteria, of measures based on statement count, cyclomatic number, effort, and data flow complexity [please refer to Weyuker (1988) for a detailed analysis], together with the proposed metric.
When closely examined, this evaluation helped to identify the key properties of the proposed metric:
• It is sensitive to how components interact based on control and data flow,
• It does not rank all programs as equally complex,
• It divides programs into various classes of complexity,
• It is sensitive to program syntax, and
• The complexity measure increases as a program grows.
It also identifies one key weakness of the proposed measure: it is unable to distinguish the pattern in which the statements of a program appear.
7 Conclusion

In this paper, we describe a new information entropy-based complexity measure for software control flow graphs, defined according to the program component (node) interactions (i.e. their control and data flows), the likelihood of interaction, and the execution time at each node. A positive correlation was observed for FORTRAN-based CFGs adapted from McCabe's complexity measure paper and for CFGs based on MATLAB code written to perform basic linear algebraic computations. Evaluation against Weyuker's criteria helped to support the metric's validity for use. Further validation of the metric is required, taking into consideration the execution times of each node. The authors are also currently applying the proposed metric to answer how complexity varies with the size of software, whether software complexity increases over time, and how the complexity of code written today compares with that of code written previously.
Compliance with ethical standards
Conflict of interest The author(s) declare(s) that there is no conflict
of interest.
References

Adamchik VS (2009) Algorithmic complexity. School of Computer Science, Carnegie Mellon University. http://www.cs.cmu.edu/~adamchik/15121/lectures/Algorithmic%20Complexity/complexity.html

Bansiya J, Davis C, Etzkorn L (1999) An entropy-based complexity measure for object-oriented designs. Theory Pract Object Syst 5(2):111–118

Berlinger E (1980) An information theory based complexity measure. In: Proceedings of the national computer conference, May 19–22, 1980. ACM, New York

Bhojasia M (2017) C++ programming examples on numerical problems & algorithms. http://www.sanfoundry.com/cpp-programming-examples-numerical-problems-algorithms/. Accessed Oct 2017

Chaturvedi KK et al (2014) Predicting the complexity of code changes using entropy based measures. Int J Syst Assur Eng Manag 5(2):155–164

Fitzsimmons A, Love T (1978) A review and evaluation of software science. ACM Comput Surv (CSUR) 10(1):3–18

Hamer PG, Frewin GD (1982) M.H. Halstead's software science: a critical examination. In: Proceedings of the 6th international conference on software engineering. IEEE Computer Society Press, Washington

Harrison W (1992) An entropy-based measure of software complexity. IEEE Trans Softw Eng 18(11):1025–1029

Henry S, Kafura D (1981) Software structure metrics based on information flow. IEEE Trans Softw Eng 5:510–518

Jung W-S et al (2011) An entropy-based complexity measure for web applications using structural information. J Inf Sci Eng 27(2):595–619

McCabe TJ (1976) A complexity measure. IEEE Trans Softw Eng 4:308–320

McCall JA (1977) Factors in software quality. US Rome Air Development Center reports

Mens T (2016) Research trends in structural software complexity. arXiv:1608.01533

Mills HD (1999) The management of software engineering, Part I: principles of software engineering. IBM Syst J 38(2.3):289–295

Oviedo EI (1980) Control flow, data flow and program complexity. In: Proceedings of IEEE COMPSAC, Chicago, IL, pp 146–152

Puschner P, Koza C (1989) Calculating the maximum execution time of real-time programs. Real-Time Syst 1(2):159–176

Roca JL (1996) An entropy-based method for computing software structural complexity. Microelectron Reliab 36(5):609–620

Selvarani R et al (2009) Software metrics evaluation based on entropy. In: Ramachandran M (ed) Handbook of research on software engineering and productivity technologies: implications of globalization. IGI Global, Hershey, p 139

Shannon CE, Weaver W, Burks AW (1951) The mathematical theory of communication. Wiley, New York

Snider G (2001) Measuring the entropy of large software systems. HP Laboratories Palo Alto, Tech Rep

Solé RV, Valverde S (2004) Information theory of complex networks: on evolution and architectural constraints. In: Ben-Naim E, Frauenfelder H, Toroczkai Z (eds) Complex networks. Lecture notes in physics, vol 650. Springer, Berlin, Heidelberg

Van Vliet H (1993) Software engineering: principles and practice, vol 3. Wiley, New York

Venkatachalam S, Sairam N, Srinivasan B (2012) Code optimization using graph mining. Res J Appl Sci Eng Technol 4(19):3618–3622

Wallace D, Watson AH, McCabe TJ (1993) Structured testing: a testing methodology using the cyclomatic complexity metric. NIST Special Publication 500-235

Web.mit.edu, MATLAB Teaching Codes. http://web.mit.edu/18.06/www/Course-Info/Tcodes.html. Accessed 14 July 2014

Weyuker EJ (1988) Evaluating software complexity measures. IEEE Trans Softw Eng 14(9):1357–1365

Woodfield SN (1979) An experiment on unit increase in problem complexity. IEEE Trans Softw Eng 2:76–79

code2flow, The simplest way to create flowcharts: online interactive code to flowchart converter. code2flow.com