Int J Syst Assur Eng Manag (October 2018) 9(5):1080–1091
https://doi.org/10.1007/s13198-018-0740-y

ORIGINAL ARTICLE

Information entropy applied to software based control flow graphs

Aditya Akundi, Eric Smith, Tzu-Liang Tseng
Industrial Manufacturing and Systems Engineering Department, The University of Texas at El Paso, El Paso, TX 79968, USA
Correspondence: Aditya Akundi, sakundi@utep.edu

Received: 27 April 2017 / Revised: 27 February 2018 / Published online: 24 July 2018
© The Society for Reliability Engineering, Quality and Operations Management (SREQOM), India and The Division of Operation and Maintenance, Lulea University of Technology, Sweden 2018

Abstract  Information theory, introduced by Shannon in the context of information transfer in communication channels, is used as a foundation for research in many diverse fields. In information theory, entropy is seen as the average amount of information, or the rate at which information is produced, when a message is formed element by element. Entropy has found broad application in many research fields and can also be applied in software engineering to quantify the uncertainty associated with a software code. In this paper, information entropy and its application to measuring software complexity are explored, along with the formulation of an information entropy based complexity measure that considers logical decision making, processes, and software statement interaction patterns in control flow graphs mapped from actual software code. To broaden the application of the proposed metric, the execution times of the nodes in the control flow graphs are also incorporated. Further, the metric is evaluated against eight axioms that a software complexity measure should satisfy.

Keywords  Complexity · Control flow graph · Entropy · Information theory · Software complexity

1 Introduction

Software code complexity measures are mainly used, or adapted, in the design and implementation phases of a code. They measure the inherent complexity of individual code modules, components (a software component being an element composed according to a set of pre-defined standards that conform to a specific behavior), and procedures. Modules, procedures and components of a code, irrespective of the level at which they are developed, are interdependent. The structural and information architecture of code has a significant impact on both complexity measures and quality measures (McCall 1977). Complexity in software code can be defined as the attribute of a code that affects the effort required to develop, change, or debug a piece of software. Many different methods have been suggested in the literature for the quantitative characterization of the complexity inherent in software. These metrics, when captured quantitatively, serve as anchors for software design, development and re-engineering efforts. Complexity metrics can be broadly divided into information based metrics and size and structure based metrics. Table 1 lists contributions to the development of software code complexity metrics.

2 Entropy based software complexity measures

Entropy in thermodynamics represents the inherent disorder in a system over time as the system heads towards thermodynamic equilibrium. In information theory, according to Shannon, entropy quantifies information (Shannon et al. 1951).
Quantifying information implies analyzing the information present and measuring its associated uncertainty. Higher values of entropy signify less order in a system, and lower values signify a more ordered system. Entropy has found broad application in many fields and can also be applied in software engineering to quantify the uncertainty associated with a software code.

Table 1  Software complexity metric contributions (categories: information based; size and structure based)

Contributing author(s)            | Complexity metric
G. M. Weinberg                    | Complexity measure based on the number of lines of code (Van Vliet et al. 1993)
Maurice Howard Halstead           | Based on the number of operators and operands (Van Vliet et al. 1993; Hamer and Frewin 1982)
Scott W. Woodfield                | Based on conceptually unique operands (Woodfield 1979)
Eli Berlinger                     | Information based complexity measure (Berlinger 1980)
Maurice Howard Halstead           | Program effort and difficulty measure based on program vocabulary, number of distinct operators, total number of operators, total number of operands, and volume (Hamer and Frewin 1982; Weyuker 1988)
Fitzsimmons and Love              | Number of delivered errors, based on Halstead's effort metric (Fitzsimmons and Love 1978)
E. I. Oviedo                      | Complexity measure based on control and data flow, using the number of available definitions of variables in blocks of the program body (Weyuker 1988; Oviedo 1980)
Thomas J. McCabe                  | Cyclomatic complexity metric (Wallace et al. 1993)
Sallie Henry and Dennis Kafura    | Metrics based on global, local and indirect flow relations (Van Vliet et al. 1993; Henry and Kafura 1981)
Shepperd                          | Metric based on fan-in and fan-out measures (Van Vliet et al. 1993)

The entropy H of a system, according to statistical mechanics, is defined as (Shannon et al. 1951)

H = -K \sum_{i} P_i \log P_i    (1)

where P_i is the probability of a particular state and K is Boltzmann's constant. Shannon entropy H is given as (Shannon et al. 1951)

H = -\sum_{i} P_i \log_2 P_i    (2)

where P_i is the probability of a symbol appearing in a given stream of symbols; the base-two logarithm expresses information entropy in bits.
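As a minimal illustration of Eq. (2), the following Python sketch (the function name shannon_entropy is ours, not from the paper) computes the entropy, in bits, of a given probability distribution:

    from math import log2

    def shannon_entropy(probs):
        """Shannon entropy in bits, Eq. (2): H = -sum_i P_i log2 P_i.
        Zero-probability symbols contribute nothing (0 log 0 = 0 convention)."""
        return -sum(p * log2(p) for p in probs if p > 0)

    print(shannon_entropy([0.5, 0.5]))  # fair coin: 1.0 bit
    print(shannon_entropy([0.9, 0.1]))  # biased coin: ~0.47 bits (more order, lower entropy)
    print(shannon_entropy([1.0]))       # certain outcome: 0.0 bits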
There are a number of studies in which entropy is used as the basic foundation for software complexity measures. Several entropy based measures have been proposed whose probability values are calculated from the frequency of usage of symbols, the set of software inputs, the set of outputs, the set of links of nodes, the frequency of string occurrence in a code, the frequency of names occurring in a code, the frequency of operator occurrence, the number of attributes, the reuse ratio, frequently occurring operators, the number of leaf nodes, and a few object oriented design metrics (Selvarani et al. 2009; Berlinger 1980; Jung et al. 2011; Chaturvedi et al. 2014; Snider 2001; Harrison 1992; Bansiya et al. 1999; Mills 1999; Roca 1996).

Berlinger (1980) provided an information theory based complexity measure grounded in entropy. The measure is sensitive to the frequency of occurrence of all the tokens in a program, where tokens such as operators and operands refer to elements of the programming language being used. According to Berlinger, there are several possible interpretations of this measure: from an information point of view it represents the total information contained in the code, while under an ideal coding scheme it represents the total length required to encode the program. Berlinger adds that, irrespective of the interpretation used, the measure is sensitive to the frequency of each symbol's occurrence, i.e., to the proportion of times the symbol has occurred so far (Berlinger 1980).

Snider (2001) provided a complexity metric using structural graphs. This entropy based metric for measuring the entropy of large software systems is based on the number of leaf nodes, the number of dependency edges, and the distance between two leaf nodes (the minimum number of interior nodes traversed); refer to Snider (2001) for more information.

Harrison (1992) provided an entropy based measure of software complexity on the basis of information theory. The metric is developed from the hypothesis that a program with high average information content should, on the whole, be less complex than a program with lower average information content. Harrison calls this metric the Average Information Content Classification (AICC) measure; it depends on the total number of operators used in the program and the frequency with which each operator appears in the code (Harrison 1992); refer to Harrison (1992) for more information on this metric.

According to Bansiya et al. (1999), an entropy based complexity metric for object oriented designs can be applied in the early stages of development to ensure that a developer analyzes and reiterates the internal characteristics that lead to a quality oriented design. Their entropy measure is solely a measure of class complexity: it measures information content as a function of the number of strings in a class and of how frequently a string repeats within class definitions, irrespective of the language being used. The Class Definition Entropy (CDE) is developed on the basis of Shannon entropy and is characterized by the probabilities of the most frequently occurring strings (Bansiya et al. 1999); refer to Bansiya et al. (1999) for more information on this metric.

Solé and Valverde (2004) proposed entropy and mutual information measures for networks based on degree distributions, applied to a range of real world and software networks, specifically software class diagrams.

Although there are many contributions in the field of entropy based software complexity, few authors consider the structural and logical flows of input and output variables among the developed software modules when calculating software complexity, which relates directly or indirectly to software quality. A logical flow is defined here as a representation of the decision making processes coded into software modules, and a structural flow as a representation of the interaction patterns among the statements in a software code. Although topological interactions and logical characteristics have previously been considered individually, for example in Solé and Valverde (2004), characterizing complexity based on structural and logical flow together with time has not been observed. We therefore develop a complexity measure for software that considers logical decision making processes, software statement interaction patterns, and time, used together to create an improved software complexity measure based on the concept of entropy.
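Before turning to the proposed metric, the frequency-sensitive measures discussed above can be made concrete with a short sketch. The following Python fragment (our illustration, not code from the cited papers) applies Eq. (2) to the empirical token frequencies of a small code fragment, in the spirit of Berlinger's token based measure and Harrison's operator based AICC:

    from collections import Counter
    from math import log2

    def token_entropy(tokens):
        """Entropy of a token stream, using each token's empirical
        frequency as its probability (Berlinger/Harrison-style)."""
        total = len(tokens)
        return -sum((c / total) * log2(c / total)
                    for c in Counter(tokens).values())

    # Tokens of the statement "a = a + b ;": 'a' occurs twice, the rest once.
    print(token_entropy(["a", "=", "a", "+", "b", ";"]))  # ~2.25 bits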
3 Proposed entropy based complexity metric

3.1 Definition of proposed metric

When analyzing large software codes, the techniques used matter, and learning about the code from its structure is deemed important (Mens 2016). The proposed software complexity metric is developed from the concept of Shannon entropy. When the modules of a software code are viewed analogously as symbols, they have input and output flows that transfer information from one module to another. Assuming that the modules are fully functional, without any uncertainties associated with them, the proposed metric considers the data flow relationships of a module (a module represents a decision control structure, loop control structure, case control structure, subroutine, or function). The entropy associated with a module depends upon the output data flows from the module, the input data flows to the module, and the execution time at the module. Depending upon its structure, a module has corresponding paths for input and output flows. These modules, when the code is mapped graphically, form the foundation of the proposed metric.

Let a given piece of code be characterized by a control flow graph in which the nodes represent the modules of the code and the edges represent the input and output flows associated with them. Each edge originating at a node has an associated time factor, based on the time taken to successfully complete the task it is coded for. The more input and output flows are associated with a module, depending on its characterization, the greater the associated uncertainty. The proposed complexity measure is thus defined as

H = -\sum_{j=1}^{n} \sum_{i=1}^{k} l_j P_{ji} \log_2 P_{ji}    (3)

where n is the number of nodes characterizing the software; k is the number of outputs associated with a node (the outgoing edges representing all possible distinct outputs leading to different nodes); l_j is the likelihood of occurrence of node j, based on the number of arcs incoming to the node; and P_{ji} is the probability distribution of output i associated with node j.

When a software code is transformed into a control flow graph, the value of n is the number of nodes in the graph. It is assumed here that every edge in the CFG has a travel time of one unit. The likelihood of occurrence l_j of node j depends on its number of inputs (the number of arcs incoming to the node). This reflects the assumption that the more inputs a node has, the more likely it is to occur in the graph, whether because it has more feedback loops, controls from predecessor nodes, loops entering or ending at the node, or multiple possible executions. The probability distribution P_{ji} of output i at node j is assigned according to the principle of indifference: if there are k indistinguishable possible outputs leaving a node of the control flow graph, each outcome is assigned probability 1/k. For a given node, each outgoing edge represents an outgoing flow of control after some part of the node's execution; the probabilities are based on the outgoing flows observed.
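A minimal Python sketch of Eq. (3), assuming, as stated above, that l_j is the node's incoming-arc count and that the k outgoing edges are equiprobable (P_ji = 1/k, by the principle of indifference); the function names are ours. Under these assumptions each node contributes l_j * log2(k):

    from math import log2

    def node_entropy(in_degree, out_degree):
        """Entropy term of one CFG node under Eq. (3):
        l_j * -(sum over the k outputs of (1/k) log2(1/k)) = l_j * log2(k).
        Entry nodes should be passed in_degree = 1 (cf. the note to Fig. 1)."""
        if out_degree == 0:  # terminal node: no outgoing flow, no uncertainty
            return 0.0
        return in_degree * log2(out_degree)

    def cfg_entropy(degrees):
        """Proposed metric: the sum of node entropies over the whole CFG.
        `degrees` is a list of (in_degree, out_degree) pairs, one per node."""
        return sum(node_entropy(i, o) for i, o in degrees)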
Fig. 2  Sample CFG for metric application

3.2 Characterization of the proposed metric

The proposed entropy measure is additive: the total entropy of a piece of code characterized by a control flow graph is the sum of the individual entropies of all the associated nodes. The metric can be characterized in different ways to support its application to software code systems. As discussed previously, nodes are associated with their individual input and output flows, which influence their software behavior. Since the nodes of a control flow graph at the structural level must represent decision logic and flows, the primitive formulations shown in Fig. 1 can be established.

Fig. 1  Basic programming primitive definitions (note: all starting nodes are assumed to have an inflow of 1)
• "Single I/O node":  E = H(1) = 0
• "Sequence":         E = H(1) + H(1) = 0 + 0 = 0
• "if":               E = H(1/2, 1/2) + H(1) + H(1) = 1 + 0 + 0 = 1
• "Case":             E = H(1/3, 1/3, 1/3) + H(1) + H(1) + H(1) = 1.58 + 0 + 0 + 0 = 1.58
• "while":            E = H(1) + 2 × H(1/2, 1/2) + H(1) = 0 + 2 + 0 = 2
• "until loop":       E = H(1) + H(1/2, 1/2) = 0 + 1 = 1

The value H(1) in Fig. 1 indicates that the particular node has a single outflow that occurs with probability 1. By the principle of Shannon entropy, in the example of a fair coin toss there is an equal chance of heads or tails, and the outcome of the experiment has an entropy, or information content, of one bit. Similarly, if there are two outputs from a node, the associated entropy, based on the principle of indifference, equals one:

E = H(1/2, 1/2) = -\left(\tfrac{1}{2} \log_2 \tfrac{1}{2} + \tfrac{1}{2} \log_2 \tfrac{1}{2}\right) = 1

Figure 2 shows a sample control flow graph (CFG) illustrating the application of the proposed complexity metric. The metric is computed for every node of the graph from its input and output flows, i.e., from the entropy value calculated at each node identified in the CFG. The software complexity measure is then obtained by summing the entropy values over all nodes of the CFG. As mentioned previously, the metric is based solely on the output distribution and the likelihood of occurrence of the nodes, and all edges of the CFG are assumed to take one time unit each. Table 2 shows the metric calculation for the CFG illustrated in Fig. 2.

Table 2  Entropy based metric values for the CFG of Fig. 2 (entropy per node as in Eq. (3))

Node | Likelihood of occurrence (incoming edges) | No. of outputs (data outputs) | Entropy
1    | 1                                         | 2                             | 1 × H(1/2, 1/2) = 1
2    | 1                                         | 1                             | 1 × H(1) = 0
3    | 1                                         | 1                             | 1 × H(1) = 0
4    | 2                                         | 2                             | 2 × H(1/2, 1/2) = 2
5    | 1                                         | 1                             | 1 × H(1) = 0
6    | 2                                         | 1                             | 2 × H(1) = 0
7    | 1                                         | 1                             | 1 × H(1) = 0
SUM  |                                           |                               | H_total = 3

The CFG complexity metric equals the sum of the values in the entropy column: E = H_total = 3.

The complexity metric thus computes a single number representing complexity: the higher the entropy value, the more complex the considered piece of code, and vice versa. Information can also be drawn at each node to identify the possible complexity associated with that particular node, by measuring its entropy value from the number of outputs and the likelihood of occurrence. Based on the computed metric values in Table 2, node 4 (H = 2) is more complex than node 1 (H = 1), and node 1 (H = 1) is more complex than all the other nodes (H = 0) of the CFG in Fig. 2.
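Using the cfg_entropy sketch from Sect. 3.1, the degree sequence read off Table 2 reproduces the total obtained for the CFG of Fig. 2:

    # (in_degree, out_degree) for nodes 1-7 of Fig. 2, as listed in Table 2.
    fig2_degrees = [(1, 2), (1, 1), (1, 1), (2, 2), (1, 1), (2, 1), (1, 1)]
    print(cfg_entropy(fig2_degrees))  # 1 + 0 + 0 + 2 + 0 + 0 + 0 = 3.0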
4 Validation of proposed metric

In order to verify and validate the proposed metric and its ability to determine a complexity measure from the CFG of a given piece of code, it is correlated here with the well-known and frequently cited cyclomatic complexity measure of Thomas J. McCabe. The validation is based on calculating the complexity of 12 different control flow graphs using both the proposed metric and McCabe's measure. The cyclomatic complexity measure is based on the number of edges, the number of connected components, and the number of vertices of a CFG. The 12 control flow graphs are adapted from McCabe's paper, which establishes a cyclomatic complexity measure for a given program based on its characterization as a control flow graph (McCabe 1976). McCabe's cyclomatic complexity numbers, along with the complexity values obtained using the proposed metric, are tabulated in Table 3.

Table 3  Complexity metric values of CFGs adapted from McCabe's paper

Control flow graph | Number of nodes | McCabe's cyclomatic complexity measure | Proposed complexity metric
1                  | 3               | 0                                      | 0.00
2                  | 7               | 3                                      | 3.00
3                  | 10              | 5                                      | 6.00
4                  | 12              | 6                                      | 7.00
5                  | 12              | 8                                      | 7.58
6                  | 13              | 8                                      | 8.16
7                  | 19              | 9                                      | 12.16
8                  | 20              | 10                                     | 13.00
9                  | 23              | 10                                     | 11.58
10                 | 25              | 11                                     | 16.58
11                 | 18              | 10                                     | 16.00
12                 | 36              | 19                                     | 22.16

Spearman's rank correlation method was used to measure the correlation between the two complexity measures obtained in Table 3. Table 4 shows the relative rankings of the complexity metrics.

Table 4  Relative rankings of the complexity metric measures in Table 3

McCabe's complexity measure (x_i) | Proposed metric measure (y_i) | Rank of x_i | Rank of y_i
0                                 | 0.00                          | 1.0         | 1.0
3                                 | 3.00                          | 2.0         | 2.0
5                                 | 6.00                          | 3.0         | 3.0
6                                 | 7.00                          | 4.0         | 4.0
8                                 | 7.58                          | 5.5         | 5.0
8                                 | 8.16                          | 5.5         | 6.0
9                                 | 12.16                         | 7.0         | 8.0
10                                | 13.00                         | 9.0         | 9.0
10                                | 11.58                         | 9.0         | 7.0
11                                | 16.58                         | 11.0        | 11.0
10                                | 16.00                         | 9.0         | 10.0
19                                | 22.16                         | 12.0        | 12.0

A value of r_s = 0.9771 is obtained using Spearman's correlation coefficient for a sample size of 12, implying that the proposed complexity metric and McCabe's complexity metric are very strongly correlated.
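The reported rank correlation can be checked against the Table 3 data. The sketch below uses SciPy's spearmanr (our choice of tool; any rank-correlation routine would do, and tie handling may shift the result slightly relative to the r_s = 0.9771 quoted in the text):

    from scipy.stats import spearmanr

    # (McCabe cyclomatic complexity, proposed metric) pairs from Table 3.
    mccabe   = [0, 3, 5, 6, 8, 8, 9, 10, 10, 11, 10, 19]
    proposed = [0.00, 3.00, 6.00, 7.00, 7.58, 8.16,
                12.16, 13.00, 11.58, 16.58, 16.00, 22.16]

    r_s, p_value = spearmanr(mccabe, proposed)
    print(round(r_s, 4))  # ~0.977, i.e. a very strong positive correlation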
To further validate the metric, a set of seven Matlab codes programmed to perform basic linear algebraic computations is converted to control flow graphs. These codes are randomly chosen from the freely available online database provided by the Massachusetts Institute of Technology (Web.mit.edu 2014). Validation follows the same procedure as before: both the cyclomatic complexity and the proposed entropy based complexity measure are calculated for each control flow graph, and Spearman's rank correlation method is used to measure the correlation between the two complexity measures. Table 5 lists the complexity metric values calculated from the control flow graphs developed from the Matlab codes, and Table 6 the relative rankings of the obtained measures. A value of r_s = 0.8818 is obtained using Spearman's correlation coefficient (Bansiya et al. 1999), implying that the proposed complexity metric and McCabe's complexity metric are again strongly correlated. Based on the aforementioned results, the proposed metric correlates strongly with the well-established McCabe cyclomatic complexity measure.

Table 5  Complexity metric values of CFGs based on the Matlab codes considered

Control flow graph | Number of nodes | McCabe's cyclomatic complexity measure | Proposed complexity metric
1                  | 15              | 5                                      | 5
2                  | 26              | 11                                     | 19
3                  | 28              | 13                                     | 23
4                  | 11              | 4                                      | 7
5                  | 6               | 2                                      | 2
6                  | 10              | 2                                      | 4
7                  | 9               | 3                                      | 2

Table 6  Relative rankings of the complexity metric measures in Table 5

McCabe's complexity measure (x_i) | Proposed metric measure (y_i) | Rank of x_i | Rank of y_i
5                                 | 5                             | 5.0         | 4.0
11                                | 19                            | 6.0         | 6.0
13                                | 23                            | 7.0         | 7.0
4                                 | 7                             | 4.0         | 5.0
2                                 | 2                             | 1.5         | 1.5
2                                 | 4                             | 1.5         | 3.0
3                                 | 2                             | 3.0         | 1.5

To further expand the analysis, eight different engineering tasks, each coded in C, Java, Python and Matlab and converted to CFGs, are used to apply the proposed metric and see which language is likely to be ranked more complex. Table 7 shows the results; each row gives the complexity calculated for a given program across the four languages. Based on this small and simple set of programs, it is observed that a program coded in C or Java is associated with a higher complexity value than the same program coded in either Python or Matlab.

5 Metric improvement to include time of execution

The metric defined in Sect. 3 is here modified to include the time of execution at each module, incorporating execution time into the complexity analysis. The execution time of a program can be defined as the time taken by the program to process its inputs and execute its tasks (Puschner and Koza 1989). A software code module's execution time depends upon several factors, such as the instruction set used, the type of compiler, the processor speed, and other similar factors (Adamchik 2009); the execution time associated with a module therefore depends on its implementation. The time complexity of a program measures how fast the time taken by the program grows with an increase in the input size: for a given input vector n = {n1, n2, n3, ...}, the execution time t is proportional to n, represented as t ∝ n. It is assumed here that a software code represented by a control flow graph has exponentially many paths for its execution, that the execution time of a module (represented as a node in the control flow graph) remains the same across executions, and that the total execution time is based on the count of individual node occurrences, so that the run time of the program equals the sum of the total execution times at the nodes. Under the following assumptions, let n be the number of nodes in the CFG, m the number of outputs originating from a node, and r the number of inputs converging on a node; the improved complexity measure incorporating individual node execution times is then

H = -\sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{r} T(i)\, cnt(i_k)\, P_{ij} \log_2 P_{ij}    (4)

where T(i) = {t(i_1), t(i_2), t(i_3), ..., t(i_k)} is the vector of execution times of the nodes of the CFG, cnt(i_k) is the count of the number of inputs to node i, and P_{ij} is the probability distribution of output j associated with node i.

Table 7  Calculated complexity for engineering tasks coded in C, Java, Python and Matlab

Program                     | Python | C/C++ | Matlab | Java
Greatest common divisor     | 7      | 7     | 4      | 7
Matrices addition           | 5      | 12    | 4      | 16
Linear search               | 3      | 7     | 1      | 7
Binary search               | 5      | 8     | 5      | 8
Floyd's triangle            | 5      | 4     | 2      | 4
Transpose of a matrix       | 6      | 11    | 4      | 12
Multiplication of matrices  | 8      | 19    | 5      | 19
Bubble sort                 | 5      | 9     | 2      | 9
From the perspective of a programmer, the execution time of a node, whether active (executing its functions) or inactive (waiting for an input), depends on the computational algorithms and processes of the node, and so may be expressed in units ranging from nanoseconds to milliseconds or seconds. To remove the effect of the units on the complexity analysis, we suggest normalizing these values to a scale of 0–1. Although normalization maps the execution times onto the range 0–1, their effect on the complexity analysis remains the same. For convenience, the metric formulated in (4) can be represented as

H = -\sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{r} C(i_k)\, (P_{ij} \log_2 P_{ij})    (5)

where

C(i_k) = \frac{t(i_k) - \min(T(i))}{\max(T(i)) - \min(T(i))} \cdot cnt(i_k)

The data and procedure used to validate the initially suggested metric in (3) remain credible for the improved version of the metric in (5), under the assumption of a unit execution time at each node, since data on specific node execution times was unavailable.
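A sketch of the time-weighted metric of Eqs. (4) and (5), again assuming equiprobable outputs per node; the handling of the degenerate case in which all nodes take the same time (every weight set to 1, which recovers Eq. (3) and matches the unit-time assumption used for validation) is our own choice, since min-max normalization is undefined there:

    from math import log2

    def minmax_normalize(times):
        """Map node execution times onto [0, 1], as suggested in Sect. 5."""
        lo, hi = min(times), max(times)
        if hi == lo:  # assumption: equal times all weigh 1.0, recovering Eq. (3)
            return [1.0] * len(times)
        return [(t - lo) / (hi - lo) for t in times]

    def timed_cfg_entropy(degrees, times):
        """Eq. (5): each node's entropy term is weighted by
        C(i_k) = normalized execution time * incoming-arc count."""
        total = 0.0
        for (in_deg, out_deg), t in zip(degrees, minmax_normalize(times)):
            if out_deg > 0:
                total += (t * in_deg) * log2(out_deg)  # equiprobable outputs
        return total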
6 Evaluation of proposed metric

To evaluate the proposed entropy based metric, we use the set of eight axioms formulated and proposed by Elaine J. Weyuker, who suggests them as a set of conclusive evaluation criteria to be satisfied by any syntactic complexity measure. Weyuker evaluated four well-known software complexity measures (the cyclomatic complexity number, Halstead's programming effort, statement count, and Oviedo's data flow complexity) against these axioms and found that none of the measures satisfies all of the properties (Weyuker 1988). In this paper, these axioms are used to understand the properties of the proposed metric intuitively, to identify possible scenarios where it can be applied, and to identify its weaknesses, which helps decide whether the proposed metric is useful in a given scenario or not. This section introduces each axiom that Weyuker (1988) deemed necessary for any complexity measure and shows whether the proposed metric satisfies it.

Notation:
• P, Q and R denote program bodies
• |P| denotes the complexity measure of P (similarly |Q| and |R|)
• P ≡ Q denotes that P and Q have the same functionality
• |P;Q| denotes the complexity measure obtained by the concatenation of P and Q (similarly |P;R|, |R;P|, |R;Q| and |Q;R|)

Property 1: (∃P)(∃Q)(|P| ≠ |Q|), i.e., there exist program bodies P and Q that a given complexity measure should not rank as equally complex. This property requires the metric to measure the complexity of each program distinctly, ensuring that not all programs are computed to be equally complex. From Eq. (5) it can be seen that the proposed metric depends on the type of control flows in the program, the number of inputs to each node, the output probability distribution of the nodes, and the execution times, which are unique to a given program, thereby satisfying this property. To portray the applicability of this property further, we consider two programs implemented in C++, an implementation of a sorted circular doubly linked list (P) and of the Tower of Hanoi problem (Q), which are converted into control flow graphs using freely available online converters; the control flow graphs used are illustrated in Fig. 3. Calculating the complexity measures for program bodies P and Q in Fig. 3 using the proposed metric gives |P| = 19 and |Q| = 37, thereby satisfying the property.

Fig. 3  Two program bodies with two uniquely different functionalities (Bhojasia 2017)

Property 2: (∃P)(∃Q)(P ≡ Q & |P| ≠ |Q|), i.e., there may exist two programs with the same functionality whose complexities differ, because the complexity of each program depends upon its implementation. This property places emphasis on the effect of different implementations of a program on its complexity. We consider two functionally equivalent programs with uniquely different implementation procedures, which are converted to CFGs for calculating the complexity using the proposed metric. Figure 4 shows the two program bodies, a C program (P) and its optimized version (Q), adapted from Venkatachalam et al. (2012), where the authors optimize a C program using graph mining techniques.

Fig. 4  Two programs with the same functionality and different implementations, along with their control flow graphs (Adamchik 2009)

Program body P:

    #include <stdio.h>
    #include <conio.h>

    void main() {
        int counter, a, b, n, c;
        n = 10;
        a = 1;
        b = 2;
        for (counter = 0; counter < n; counter++) {
            c = 20;
            a = b + 1;
            b = a + b;
        }
        c = c + a;
        if (a < b)
            a = 100;
        else
            a = 1000;
        printf("%d", a);
        getch();
    }

Program body Q (the optimized version, as basic blocks):

    BLOCK 1:  n = 10; a = 1; b = 2; counter = 0
    BLOCK 2:  L1: t1 = counter < n; if t1 goto L2
    BLOCK 3:  goto L3
    BLOCK 4:  L4: counter++; goto L1
    BLOCK 5:  L2: c = 20; t2 = b + 1; a = t2; t3 = a + b; b = t3; goto L4
    BLOCK 6:  L3: t4 = c + a; c = t4; t5 = a < b; if !t5 goto L5
    BLOCK 7:  a = 100; goto L6
    BLOCK 8:  L5: a = 1000
    BLOCK 9:  L6: print a

Calculating the complexity measures for program bodies P and Q in Fig. 4 using the proposed metric gives |P| = 3 and |Q| = 2.58. Further, we consider two program bodies implementing the Fisher–Yates shuffle, one in C++ (P) and the other in JavaScript (Q). Calculating the complexity measures for program bodies P and Q in Fig. 5 using the proposed metric gives |P| = 9 and |Q| = 4. Therefore, for P ≡ Q we have |P| ≠ |Q|, showing that the proposed metric satisfies this property.

Property 3: For any non-negative number c, there are only finitely many programs of complexity c. This property builds further on Property 1, addressing a complexity metric's ability to distinguish between programs with the same decision structure that perform few computations and those that perform many computations. The proposed metric, which considers the execution time at the nodes, can distinguish the complexity of a computation, based on the fact that nodes performing few computations take less time than nodes performing many computations. Therefore, this property is satisfied, with (∀P) |P| ≥ 0.
Fig. 5  Two programs with the same functionality and different implementations, along with their control flow graphs (Bhojasia 2017)

Property 4: (∀P)(∀Q)(|P| ≤ |P;Q| and |Q| ≤ |P;Q|), i.e., the individual complexity of a given program body is always less than or equal to the complexity of a concatenation containing it. The emphasis here is on the increase in complexity when a program body is composed by combining two (child) programs: the individual complexities of the two children are always less than or equal to that of their parent. To illustrate this, we consider three control flow graphs, calculating the individual complexities of two CFGs along with the complexity of their combination. From Fig. 6, the complexities calculated using the proposed metric are |P| = 3, |Q| = 3.58 and |P;Q| = 6.58, which illustrates that the property is satisfied.

Fig. 6  Illustrations of the CFGs for (P), (Q) and (P;Q)

To elaborate further, we consider two program bodies where P is the C++ script that determines whether a given matrix is invertible and Q is the C++ script that finds the determinant of a given matrix; their concatenation is denoted (P;Q). From Fig. 7, the complexities calculated using the proposed metric are |P| = 14, |Q| = 13 and |P;Q| = 17, which again illustrates that the property is satisfied.

Fig. 7  Illustrations of the CFGs for (P), (Q) and (P;Q) (Bhojasia 2017)

Hence, whenever two different control flows extracted from a program are concatenated, an increase in the total number of inputs, outputs and decision structures is observed, and this increase results in an increased complexity. Therefore, (∀P)(∀Q)(|P| ≤ |P;Q| and |Q| ≤ |P;Q|).
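As an illustration of Property 4 (and of the additivity noted in Sect. 3.2), the following hypothetical example, not the paper's Fig. 6, concatenates two small CFGs using the cfg_entropy sketch from Sect. 3.1:

    # P: an if-then-else (entry, two branches, join/exit).
    # Q: a three-way case (entry, three cases, join/exit).
    P = [(1, 2), (1, 1), (1, 1), (2, 0)]
    Q = [(1, 3), (1, 1), (1, 1), (1, 1), (3, 0)]
    print(cfg_entropy(P), cfg_entropy(Q))  # 1.0 and log2(3) ~ 1.585

    # (P;Q): P's exit node gains one outgoing edge into Q's entry node.
    PQ = [(1, 2), (1, 1), (1, 1), (2, 1)] + Q
    print(cfg_entropy(PQ))  # 2.585 >= |P| and >= |Q| (here also = |P| + |Q|)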
Property 5.1: (∃P)(∃Q)(∃R)(|P| = |Q| & |P;R| ≠ |Q;R|); Property 5.2: (∃P)(∃Q)(∃R)(|P| = |Q| & |R;P| ≠ |R;Q|), i.e., there exist two program bodies P and Q of the same complexity such that, when a new program body R is concatenated with each of them, the resulting complexities differ. This property places emphasis on identifying the interactions that may significantly impact the complexity of a program when it is concatenated with an external program body. To illustrate this, we consider three program bodies P, Q and R, where P is a C code that identifies whether a given number is even or odd, Q identifies whether a given number is greater or less than 10, and R identifies whether a given number is prime. Figure 8 illustrates program bodies P, Q and R and their respective CFGs. To check whether the proposed complexity measure holds this property, we concatenate program body R with program body P and with program body Q; Fig. 9 illustrates the program bodies (P;R) and (Q;R) and their CFGs.

Fig. 8  Program bodies P, Q, R and their respective control flow graphs (panels D-F of the figure show the CFGs of P, Q and R)

A. Program body P (to check whether a number is even or odd):

    #include <stdio.h>
    int main() {
        int num;
        printf("Enter a number to be checked: ");
        scanf("%d", &num);
        if ((num % 2) == 0)
            printf("%d is even", num);
        else
            printf("%d is odd", num);
        return 0;
    }

B. Program body Q (to check whether a number is greater or lesser than 10):

    #include <stdio.h>
    int main() {
        int x;
        printf("Enter a number to be checked: ");
        scanf("%d", &x);
        if (x > 10)
            printf("%d is greater", x);
        else
            printf("%d is lesser to 10", x);
        return 0;
    }

C. Program body R (to check whether a number is prime or not):

    #include <stdio.h>
    int main() {
        int num, i, count = 0;
        printf("Enter a number: ");
        scanf("%d", &num);
        for (i = 2; i <= num / 2; i++) {
            if (num % i == 0) {
                count++;
                break;
            }
        }
        if (count == 0 && num != 1)
            printf("%d is a prime number", num);
        else
            printf("%d is not a prime number", num);
        return 0;
    }

Fig. 9  Program bodies (P;R) and (Q;R) and their respective control flow graphs

A. Program body (P;R) and its CFG:

    #include <stdio.h>
    int main() {
        int num, i, count = 0;
        printf("Enter an integer you want to check: ");
        scanf("%d", &num);
        if ((num % 2) == 0)  /* checking whether the remainder is 0 or not */
            printf("%d is even.", num);
        else
            printf("%d is odd.", num);
        for (i = 2; i <= num / 2; i++) {
            if (num % i == 0) {
                count++;
                break;
            }
        }
        if (count == 0 && num != 1)
            printf("%d is a prime number", num);
        else
            printf("%d is not a prime number", num);
        return 0;
    }

B. Program body (Q;R) and its CFG:

    #include <stdio.h>
    int main() {
        int x;
        printf("Enter an integer you want to check: ");
        scanf("%d", &x);
        if (x > 10)  /* checking whether it is greater than 10 or not */
            printf("%d is greater.", x);
        else
            printf("%d is lesser.", x);
        int num, i, count = 0;
        num = x;
        for (i = 2; i <= num / 2; i++) {
            if (num % i == 0) {
                count++;
                break;
            }
        }
        if (count == 0 && num != 1)
            printf("%d is a prime number", num);
        else
            printf("%d is not a prime number", num);
        return 0;
    }

From Fig. 8, |P| = |Q| = 1 and |R| = 5. When program body R is concatenated with P and with Q, the control flow graphs illustrated in Fig. 9 give |P;R| = |Q;R| = 6. Although the complexity calculated from the control flow graphs of Fig. 9 is the same for both program bodies, note that in program body (Q;R) there is an additional assignment ('num = x') at node 6, copying the number being considered from the variable x into num. This assignment increases the execution time at this node compared with its execution time in (P;R). Therefore, once execution time is accounted for, |P;R| ≠ |Q;R|, which implies that this property holds for the proposed metric.

Property 6: There exist two program bodies P and Q such that Q is formed by permuting the order of the statements of P and |P| ≠ |Q|. This property signifies the importance of the permutation of program statements, whose effect should be considered when quantifying a program's complexity. This property does not hold for the proposed metric, since the nodes of the control flow graph are independent of the placement of the program statements; moreover, the execution time remains the same when a given set of statements is reordered.

Property 7: If program bodies P and Q are identical except for renaming (for example, different mnemonics chosen for identifiers), then |P| = |Q|. This property clearly holds for the metric.
This is because, if the names chosen for identifiers (different mnemonics) are indeed different, the interaction and control flow patterns, along with the time of execution, still remain the same. The same holds if the operators or the constants used in two otherwise identical program bodies are changed while all other factors remain the same.

Property 8: (∀P)(∀Q)(|P| + |Q| ≤ |P;Q|), i.e., the interaction of any two program bodies always increases complexity. This can be observed from Fig. 6, where |P| = 3, |Q| = 3.58 and |P;Q| = 6.58, illustrating that this property holds for the proposed complexity measure.

As observed, the proposed metric is largely compliant with Weyuker's criteria. Table 8 shows the complexity metric evaluation according to Weyuker's criteria for measures based on statement count, the cyclomatic number, the effort measure, and data flow complexity [refer to Weyuker (1988) for a detailed analysis], alongside the proposed metric.

Table 8  Metric comparison to other complexity measures using Weyuker's criteria

Weyuker's property number | Statement count | Cyclomatic number | Effort measure | Data flow complexity | Proposed entropy based complexity metric
1                         | YES             | YES               | YES            | YES                  | YES
2                         | YES             | YES               | YES            | YES                  | YES
3                         | YES             | NO                | YES            | NO                   | YES
4                         | YES             | YES               | NO             | NO                   | YES
5                         | NO              | NO                | YES            | YES                  | YES
6                         | NO              | NO                | NO             | YES                  | NO
7                         | YES             | YES               | NO             | YES                  | YES
8                         | YES             | YES               | NO             | NO                   | YES

('YES' indicates the property is satisfied; 'NO' indicates it is not.)

When closely examined, this evaluation helped to identify the key properties of the proposed metric:
• it is sensitive to how components interact, based on control and data flow;
• it does not rank all programs as equally complex;
• it divides programs into various classes of complexity;
• it is sensitive to program syntax; and
• the complexity measure increases as a program grows.

The evaluation also identifies one key weakness of the proposed measure: it is unable to distinguish the pattern in which the statements of a program appear.

7 Conclusion

In this paper, we describe a new information entropy based complexity measure for software based control flow graphs, defined according to the program component (node) interactions (i.e., their control and data flows), the likelihood of interaction, and the time of execution at each node, to calculate software code complexity. A positive correlation with McCabe's cyclomatic complexity was observed both for the FORTRAN based CFGs adapted from McCabe's complexity measure paper and for the CFGs based on Matlab code programmed to perform basic linear algebraic computations. Evaluation against Weyuker's criteria helped to support the metric's validity for use. Further validation of the metric is required that takes into consideration measured execution times at each node. The authors are also currently applying the suggested metric to help answer how complexity varies with the size of software, whether software complexity increases over time, and how the complexity of a piece of code written today compares with that of code written previously.

Compliance with ethical standards

Conflict of interest  The authors declare that there is no conflict of interest.

References
Adamchik VS (2009) Algorithmic complexity. School of Computer Science, Carnegie Mellon University. http://www.cs.cmu.edu/~adamchik/15121/lectures/Algorithmic%20Complexity/complexity.html
Bansiya J, Davis C, Etzkorn L (1999) An entropy-based complexity measure for object-oriented designs. Theory Pract Object Syst 5(2):111–118
Berlinger E (1980) An information theory based complexity measure. In: Proceedings of the national computer conference, May 19–22, 1980. ACM, New York
Bhojasia M (2017) C++ programming examples on numerical problems & algorithms. http://www.sanfoundry.com/cpp-programming-examples-numerical-problems-algorithms/. Accessed Oct 2017
Chaturvedi KK et al (2014) Predicting the complexity of code changes using entropy based measures. Int J Syst Assur Eng Manag 5(2):155–164
Fitzsimmons A, Love T (1978) A review and evaluation of software science. ACM Comput Surv (CSUR) 10(1):3–18
Hamer PG, Frewin GD (1982) M.H. Halstead's software science - a critical examination. In: Proceedings of the 6th international conference on software engineering. IEEE Computer Society Press, Washington
Harrison W (1992) An entropy-based measure of software complexity. IEEE Trans Softw Eng 18(11):1025–1029
Henry S, Kafura D (1981) Software structure metrics based on information flow. IEEE Trans Softw Eng 5:510–518
Jung W-S et al (2011) An entropy-based complexity measure for web applications using structural information. J Inf Sci Eng 27(2):595–619
McCabe TJ (1976) A complexity measure. IEEE Trans Softw Eng 4:308–320
McCall JA (1977) Factors in software quality. US Rome Air Development Center reports
Mens T (2016) Research trends in structural software complexity. arXiv:1608.01533
Mills HD (1999) The management of software engineering, Part I: principles of software engineering. IBM Syst J 38(2.3):289–295
Oviedo EI (1980) Control flow, data flow and program complexity. In: Proceedings of IEEE COMPSAC, Chicago, IL, pp 146–152
Puschner P, Koza C (1989) Calculating the maximum execution time of real-time programs. Real-Time Syst 1(2):159–176
Roca JL (1996) An entropy-based method for computing software structural complexity. Microelectron Reliab 36(5):609–620
Selvarani R et al (2009) Software metrics evaluation based on entropy. In: Ramachandran M (ed) Handbook of research on software engineering and productivity technologies: implications of globalization. IGI Global, Hershey, p 139
Shannon CE, Weaver W, Burks AW (1951) The mathematical theory of communication. Wiley, New York
Snider G (2001) Measuring the entropy of large software systems. HP Laboratories Palo Alto, Tech Rep, Burbank
Solé RV, Valverde S (2004) Information theory of complex networks: on evolution and architectural constraints. In: Ben-Naim E, Frauenfelder H, Toroczkai Z (eds) Complex networks. Lecture notes in physics, vol 650. Springer, Berlin, Heidelberg
Van Vliet H (1993) Software engineering: principles and practice, vol 3. Wiley, New York
Venkatachalam S, Sairam N, Srinivasan B (2012) Code optimization using graph mining. Res J Appl Sci Eng Technol 4(19):3618–3622
Wallace D, Watson AH, McCabe TJ (1993) Structured testing: a testing methodology using the cyclomatic complexity metric. NIST Special Publication 500-235
Web.mit.edu, MATLAB Teaching Codes. http://web.mit.edu/18.06/www/Course-Info/Tcodes.html. Accessed 14 July 2014
Weyuker EJ (1988) Evaluating software complexity measures. IEEE Trans Softw Eng 14(9):1357–1365
Woodfield SN (1979) An experiment on unit increase in problem complexity. IEEE Trans Softw Eng 2:76–79
code2flow: The simplest way to create flowcharts. Online interactive code to flowchart converter. code2flow.com