Build Code Analysis with Symbolic Evaluation
Ahmed Tamrawi, Hoan Anh Nguyen, Hung Viet Nguyen, Tien N. Nguyen
Electrical and Computer Engineering Department
Iowa State University
{atamrawi,hoan,hungnv,tien}@iastate.edu
I. INTRODUCTION
Software building is the process that converts and integrates source code, libraries, and other data in a software
project into stand-alone deliverables and executable files.
The build process is managed by a build tool, i.e., a program
that coordinates and controls others [1]. A build tool
executes the build commands according to the rules specified
in build files, which are written in a build language supported
by the tool. Popular build tools are make, ant, and maven.
Prior research found that build maintenance could impose
a 12% to 36% overhead on software development [20]. In a
large-scale system, build files grow quickly and become very
complex because they must support building the same
software on multiple platforms with various configuration
and environment parameters [4]. McIntosh et al. [5] found
that 4% to 27% of tasks involving source code changes
require an accompanying change in the related build code.
They concluded that build code continually evolves and is
likely to have defects due to its high churn rate [5]. Importantly,
those studies call for better tool support for build code.
Toward providing automatic tool support for developers to
deal with complex build code, we have developed SYMake,
an infrastructure and tool for the analysis of build code in
GNU make. make is a scripting language in which a build file
(called a Makefile) is used to specify the build dependencies
among the configuration files in a project via make's program
entities. With a specific input/environment, make first evaluates a Makefile into a dependency graph among concrete file
names and commands. Then, it executes the commands with
those files. Given the dynamic nature of make's evaluation,
it is challenging for developers to understand and maintain
multiple large, complex, and interdependent Makefiles over time.
Importantly, errors are hard to detect statically and even
at run time, as the evaluation result depends on the input, the
operating environment, and the files in the file system.
To address those challenges in the maintenance of build
code in Makefiles, SYMake provides a symbolic evaluation
algorithm that processes Makefiles and produces a single
symbolic dependency graph (SDG) to represent the build
rules and the dependencies among files via build commands. It
differs from a concrete dependency graph of make in that file
names and commands in an SDG might not be completely
resolved into strings. Instead, the SDG's node for a file
refers to a data structure called a V-model, i.e., a graph-based
representation of the symbolic string values of the file's name.
A V-model often contains symbols to represent the inputs
or data retrieved from the user environment. The SDG enables static
analysis on Makefiles and supports program understanding.
During the symbolic evaluation, for each resulting string
value that represents a part of a file name or a command of
a rule in an SDG, SYMake also provides an acyclic graph
(called a T-model) to represent its symbolic evaluation trace.
That is, the T-model shows how that string value is initialized
and manipulated via various Makefile program entities.
We used SYMake to develop algorithms and a tool to detect several types of code smells and errors in Makefiles, e.g.,
cyclic dependencies, rule inclusion, duplicate prerequisites,
recursive variable loops, etc. The tool also supports build
code refactoring, e.g., rule extraction/removal, target creation,
target/variable renaming, prerequisite extraction, etc.
Our empirical evaluation of SYMake's renaming on
several real-world systems has shown that it can achieve high
accuracy in entity renaming. We also conducted a controlled
experiment whose results showed that with SYMake, human
subjects were able to understand the Makefiles better,
detect more code smells, and perform refactoring
more accurately in a shorter time. Our contributions include:
© 2012 IEEE
978-1-4673-1067-3/12/$31.00
 1
 2  OS := $(shell uname)
 3  ifeq ($(OS),Linux)
 4  ext = o
 5  cmd = build.sh
 6  else
 7  ext = exe
 8  cmd = build.bat
 9  endif
10
11  serverNM := server.$(ext)
12  clientNM := client.$(ext)
13  programs := $(serverNM) $(clientNM)
14
15  $(serverNM)_libs = priv protocol $(wildcard *.conf)
16  $(serverNM)_objs = server_impl.$(ext) server_access.$(ext)
17  $(clientNM)_objs = client_impl.$(ext) client_api.$(ext)
18  $(clientNM)_libs = protocol
19
20  all: $(programs)
21
22  define ProgramTmp =
23  $(1): $$($(1)_objs) $$($(1)_libs)
24  endef
25
26  $(foreach prog,$(programs),$(eval $(call ProgramTmp,$(prog))))
27
28  $(programs):
29      $(cmd) $@ $^
30
31  %.conf : %.$(ext)
32      genConf $^ -o $@
33
34  ifeq ($(OS),Linux)
35  demo.o : demo.c linux.conf
36      install $^ -o $@
37  else
38  demo.exe : demo.c win.conf
39      install.bat $^ -o $@
40  endif
Figure 1.
Figure 2.
server.conf : server.o
	genConf server.o -o server.conf
 1) Makefile      → {Statement|Rule}
 2) Statement     → Assignment|Definition|FunctionCall|Foreach|If|Directive
 3) Assignment    → [private|export|override] (Id|Expr) (+=|:=|=) Expr
 4) Id            → IdPart ((WS) IdPart)
 5) IdPart        → [^WS = : ; \n]+
 6) Definition    → [private|export|override] define Id [+=|:=|=] \n \n endef
 7) FunctionCall  → $(FunctionName [Expr[{,Expr}]])
 8) FunctionName  → subst|patsubst|strip|findstring|filter|...
 9) Expr          → Term{[WS]Term}
10) Term          → FunctionCall|ELiteral|Evaluation|Foreach|If
11) ELiteral      → WLiteral ((\WS) WLiteral)
12) WLiteral      → [^WS \n]+
13) Evaluation    → $(Id|Expr)
14) Rule          → Expr (: | ::) (Assignment| [|]Expr)[Recipe]
15) Recipe        → (;|\n\t)RecipeExpr{\n\t RecipeExpr}
16) RecipeExpr    → RecipeTerm{[WS]RecipeTerm}
17) RecipeTerm    → FunctionCall|Evaluation|RecipeLiteral|AutoEval|Foreach|If
18) RecipeLiteral → [^\n]+
19) AutoEval      → $@|$<|$?|$^|$+|...
20) Foreach       → $(foreach Id,Expr,Expr|RecipeExpr)
21) If            → (((ifeq|ifneq) (Expr,Expr) | ((ifdef|ifndef) Expr))
                     {Rule|Statement}|RecipePart [else {Rule|Statement}|RecipePart] endif) |
                     $(if Expr,Expr|RecipeExpr[,Expr|RecipeExpr])
22) RecipePart    → \n\t RecipeExpr{\n\t RecipeExpr}
23) Directive     → Include|Vpath|Export|Undefine
24) Include       → (include|sinclude|-include) Expr
25) Vpath         → vpath [Expr]
26) Export        → (unexport|export) [Expr]
27) Undefine      → [override] undefine Id
Figure 3.
Table I
SYMBOLIC EVALUATION RULES TO BUILD SDG FROM A MAKEFILE

Makefile Syntax                       Evaluation Rule
1. var := E
2. var = E
3. var += E                           if (var is defined)
                                        if (var.simple = true)
                                          var.V = new RefNode(new Concat(Expand(var.V), Expand(E.V)))
                                        else
                                          var.V = new RefNode(new Concat(var.V.child, E.V))
                                      else var.V = new RefNode(E.V), var.simple = false
8. E → $(eval E)
9. E → $(func {Ei})
10. E → $(var)
11. E1 → $(E2)
13. E → WLiteral
14. E → ELiteral
15. E → RecipeLiteral
16. undefine var
17. include E
20. Recipe → (;|\n\t)E1{\n\t Ei}      Add a new RecipeNode(new Select({new Concat(Tokenize(rj)) :
                                        rj ∈ Flatten(new Concat(Ei.V))}))
21. RecipePart → \n\t E1{\n\t Ei}     RecipePart.V = new Select({new Concat(Tokenize(ri)) :
                                        ri ∈ Flatten(new Concat(Ei.V))})
22. E → RecipePart                    E.V = RecipePart.V
24. E1 → $(if E2,E3,E4)               eval E2, E.V = new Select(Expand(E3.V), Expand(E4.V))
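Rule 3 of Table I can be read as the following minimal Python sketch. The class names (Literal, Symbol, Concat, Select, Ref, Var) and the helpers expand and append_assign are hypothetical stand-ins for the node types named in the table, and expand is a simplified assumption about what Expand does (it only strips reference wrappers); this is not SYMake's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Base class for V-model nodes (hypothetical names)."""


@dataclass
class Literal(Node):
    text: str


@dataclass
class Symbol(Node):
    name: str              # unresolved input, e.g. the result of a shell call


@dataclass
class Concat(Node):
    children: List[Node]


@dataclass
class Select(Node):
    children: List[Node]   # alternative values, one per evaluation path


@dataclass
class Ref(Node):
    child: Node            # plays the role of RefNode in Table I


@dataclass
class Var:
    V: Optional[Node] = None
    simple: bool = False


def expand(v: Node) -> Node:
    """Simplified Expand (assumption): strip Ref wrappers to get the value."""
    while isinstance(v, Ref):
        v = v.child
    return v


def append_assign(env: Dict[str, Var], name: str, e_v: Node) -> None:
    """Rule 3 of Table I: var += E."""
    var = env.get(name)
    if var is not None and var.V is not None:
        if var.simple:
            # Simple variable: both sides are expanded at assignment time.
            var.V = Ref(Concat([expand(var.V), expand(e_v)]))
        else:
            # Recursive variable: keep the unexpanded right-hand side.
            var.V = Ref(Concat([var.V.child, e_v]))
    else:
        # Undefined variable: += behaves like a recursive assignment.
        env[name] = Var(V=Ref(e_v), simple=False)
```

The split between the two defined cases mirrors the table: a simple variable captures expanded values, while a recursive variable keeps references so that later changes to the referenced variables remain visible.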
Figure 4. [Graph with panels a) and b). Node kinds: target/prerequisite node, recipe node, symbolic node, literal node; operator nodes Concat, Select, and Ref:ext. Edge labels: depends, refers-to.]
18. E1 : E2 (Recipe): For a single-colon rule, SYMake flattens the V-model of E1 into all possible values Ti. Each string
Ti is tokenized to find the list of targets. Then, for each
resulting target tij, it checks whether a rule R with
the same target name exists via getRule(tij). If R exists, it updates
R using UpdateRule. UpdateRule flattens the V-model of E2 into
all possible values, which represent all possible prerequisites of
the rule, and combines them with R's prerequisites. Then, for
each possible combination of prerequisites, it builds
a rule graph as follows: the Recipe node is connected to all
combined prerequisites, and the target tij is connected to Recipe
through a new Select node (see Figure 4). Each path of the
Select node represents a set of recipes and prerequisites of
the rule for tij. If no such rule exists, SYMake creates
a new Makefile rule, but without a Select node.
19. E1 :: E2 (Recipe): Similar to the sub-case of case 18 where
the rule is not defined earlier and no combination is needed.
20. Recipe → (;|\n\t)E1{\n\t Ei}: For a recipe, SYMake
creates a new RecipeNode, which refers to its V-model rooted
at a Select node. Each child node represents a possible recipe
string rj resulting from flattening the V-model that starts at a
Concat connecting all the V-models of the expressions Ei.
Each rj is then represented by a V-model rooted at a new
Concat node connecting the different tokens of rj.
21. RecipePart → \n\t E1{\n\t Ei}: Similar to rule 20;
however, SYMake does not create a RecipeNode.
22. E → RecipePart: E gets the V-model of RecipePart.
23. E1 → if (E2) E3 else E4: SYMake proceeds in three steps.
First, it collects into a set VARS* all variables modified or
initialized in either branch. Let var_E3.V and var_E4.V
denote the V-models of var after evaluating each branch,
respectively. For each var in VARS*, SYMake updates its
value with a new V-model rooted at a new reference node
Ref:var, whose child is a new Select node whose children are
var_E3.V and var_E4.V. If the else branch is empty, the latest
V-model for var before the if is used in place of var_E4.V.
Second, it collects into a set (RULES_E3 ∪ RULES_E4) all rules
defined in either E3 or E4. For each rule in that set, if
there exists a single-colon rule with the same target name
(found via getRule(rule.t)), it updates the existing rule R with the rules
rule_E3 and rule_E4 (using Update If Rule). That function adds
a Select node after the target node. Each path of the Select
node represents a possible recipe and prerequisites for the
target node corresponding to the rules rule_E3 and rule_E4. If
no previously defined rule has the same target
name as rule, SYMake creates a new rule with a Select node
after the target node, and each of its branches represents the
recipe and prerequisites of either rule_E3 or rule_E4.
Third, if E3 and E4 are of type RecipePart, SYMake
creates a new V-model rooted at a Select node whose
children are the V-models of the expressions E3 and E4.
24. E1 → $(if E2,E3,E4): SYMake creates a new V-model rooted
at a Select node whose children are the expanded V-models
of E3 and E4. If the condition is known, rule 11 is used.
Figure 5.
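The flattening used by rule 18 and the branch merging in the first step of rule 23 can be illustrated with a small Python sketch. The tuple encoding of V-model nodes and the function names flatten, targets_of, and merge_branches are assumptions made for this illustration only; in particular, this flatten treats every Select independently, so it may enumerate combinations that no single evaluation path produces.

```python
from itertools import product

# Hypothetical tuple encoding of V-model nodes (not SYMake's own classes):
#   ("lit", text), ("sym", name), ("concat", [children]),
#   ("select", [children]), ("ref", var_name, child)


def flatten(node):
    """Enumerate the strings a V-model can denote; symbols stay as <name>."""
    kind = node[0]
    if kind == "lit":
        return [node[1]]
    if kind == "sym":
        return ["<" + node[1] + ">"]
    if kind == "ref":
        return flatten(node[2])
    if kind == "select":
        values = []
        for child in node[1]:
            for value in flatten(child):
                if value not in values:
                    values.append(value)
        return values
    if kind == "concat":
        parts = [flatten(child) for child in node[1]]
        return ["".join(combo) for combo in product(*parts)]
    raise ValueError("unknown node kind: " + str(kind))


def targets_of(node):
    """Rule 18, first step: flatten E1, then tokenize each value into targets."""
    return [value.split() for value in flatten(node)]


def merge_branches(before, then_env, else_env=None):
    """Rule 23, first step: join per-branch variable values under Ref/Select."""
    if else_env is None:
        else_env = before      # empty else: reuse the value from before the if
    changed = {
        name
        for env in (then_env, else_env)
        for name, value in env.items()
        if before.get(name) != value
    }
    merged = dict(before)
    for name in sorted(changed):
        then_v = then_env.get(name, before.get(name))
        else_v = else_env.get(name, before.get(name))
        merged[name] = ("ref", name, ("select", [then_v, else_v]))
    return merged


# The ifeq at lines 3-9 of the example Makefile assigns ext in both branches.
merged = merge_branches({}, {"ext": ("lit", "o")}, {"ext": ("lit", "exe")})
server_nm = ("concat", [("lit", "server."), merged["ext"]])
print(flatten(server_nm))  # ['server.o', 'server.exe']
```

Keeping both branch values under one Select is what lets a single graph cover the Linux and non-Linux configurations at once.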
Figure 6. [T-model graph tracing server.o back to the literals server. (line 11) and o (line 4) through the variables ext, serverNM, programs, and prog and the call, eval, and foreach nodes at line 26. Legend: literal, variable, if condition, Concat, foreach, built-in function call, user-defined function call.]
2) Operator/action node:
A Concat node is the same as in a V-model.
An Evaluation node models the evaluation operation ($)
on a variable. If two variable nodes are connected via
an evaluation node, it is a simple variable assignment.
A Function call node represents a call to a built-in or
user-defined function F . T-models connected to this
node represent the arguments passed to F . A node eval
represents a call to eval (evaluating to rules/statements).
3) Control node: if and foreach in the evaluation process.
Building the T-model. The T-model for any node in a rule
in an SDG (e.g., part of a prerequisite, a recipe, or a target)
is built during the symbolic evaluation. Generally, for each
evaluation rule in Table I that constructs a new V-model,
SYMake creates the corresponding new T-model:
1) For a simple assignment, it connects the T-model of the
right-hand side expression to a variable node via a $ node
(e.g., programs and serverNM in Figure 6). For a recursive assignment, the T-model and the variable node are connected directly.
2) On encountering a literal or symbolic value, it creates a new
literal/symbolic node (e.g., server. and SYM01 in Figure 6).
3) For a Concat operation, it creates a Concat node whose
children are the T-models corresponding to the two operands.
4) For a built-in function (e.g., $(wildcard *.conf)), it connects
the T-models of the parameters to a newly created function
call node (Figure 6). For an eval call, it connects the T-model
of the computed value from eval to a new eval node.
5) For a call function, the T-model of the returned value
is linked to the new call node (e.g., the T-model between prog and call).
6) For a foreach statement, the T-model of its body is
linked to a new foreach node (e.g., the T-model between eval and foreach).
7) For an if, it creates an if node to represent the branch it
took to build the traced value (e.g., the true branch in Figure 6).
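As a rough illustration of the steps above, a T-model can be sketched in Python as a tree of nodes that each record the Makefile line they came from. The TNode class, its kind labels, and the helper literal_sources are hypothetical; the exact node shapes and edge directions in SYMake may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TNode:
    """One node of an evaluation trace (hypothetical structure)."""
    kind: str        # "literal", "symbol", "variable", "concat", "eval", "if", ...
    label: str
    line: int
    children: List["TNode"] = field(default_factory=list)


def literal_sources(node: TNode) -> List[Tuple[str, int]]:
    """Collect (text, line) for every literal leaf, left to right."""
    if node.kind == "literal":
        return [(node.label, node.line)]
    found = []
    for child in node.children:
        found.extend(literal_sources(child))
    return found


# Trace of serverNM := server.$(ext) on the path where the ifeq at line 3 is true.
ext = TNode("variable", "ext", 4, [
    TNode("if", "true", 3, [TNode("literal", "o", 4)]),
])
server_nm = TNode("variable", "serverNM", 11, [
    TNode("eval", "$", 11, [                      # simple assignment: via a $ node
        TNode("concat", "+", 11, [TNode("literal", "server.", 11), ext]),
    ]),
])
print(literal_sources(server_nm))  # [('server.', 11), ('o', 4)]
```

Walking the trace down to its literal leaves recovers where each piece of the final string was written, which is the information the renaming support in Section VII relies on.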
comp1.conf : comp1.o
	genConf comp1.o -o comp1.conf
SYMake first takes the specific rule (without %) and generalizes its recipe. E.g., the recipe above becomes genConf $^ -o
$@. Then, it tries to match the target and each prerequisite to
its counterpart in the implicit rule. If the pattern matches
for all of them (e.g., comp1.conf and %.conf), an inclusion is detected.
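The inclusion check just described can be sketched in Python as follows. The function names and the (target, prerequisites, recipe) tuple layout are assumptions for illustration, the pattern rule is taken with ext already resolved to o, and only the $@ and $^ automatic variables are handled.

```python
def generalize_recipe(target, prereqs, recipe):
    """Replace concrete names in a recipe with automatic variables ($@, $^)."""
    words = recipe.split()
    out = []
    i = 0
    while i < len(words):
        if prereqs and words[i:i + len(prereqs)] == prereqs:
            out.append("$^")          # the full prerequisite list
            i += len(prereqs)
        elif words[i] == target:
            out.append("$@")          # the target name
            i += 1
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)


def stem_of(pattern, name):
    """Return the non-empty stem matched by % in pattern, or None."""
    prefix, _, suffix = pattern.partition("%")
    if (len(name) > len(prefix) + len(suffix)
            and name.startswith(prefix) and name.endswith(suffix)):
        return name[len(prefix):len(name) - len(suffix)]
    return None


def is_included(specific, pattern_rule):
    """True if the specific rule is an instance of the pattern rule."""
    target, prereqs, recipe = specific
    p_target, p_prereqs, p_recipe = pattern_rule
    stem = stem_of(p_target, target)
    if stem is None or len(prereqs) != len(p_prereqs):
        return False
    for pattern, name in zip(p_prereqs, prereqs):
        if pattern.replace("%", stem, 1) != name:
            return False
    return generalize_recipe(target, prereqs, recipe) == " ".join(p_recipe.split())


specific = ("comp1.conf", ["comp1.o"], "genConf comp1.o -o comp1.conf")
pattern = ("%.conf", ["%.o"], "genConf $^ -o $@")
print(generalize_recipe(*specific))    # genConf $^ -o $@
print(is_included(specific, pattern))  # True
```

A specific rule that passes this check duplicates what the pattern rule already provides, which is why it is reported as a smell.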
In those cases, GNU make might not be able to detect
errors/smells since it could run on a different path that does
not involve the duplication or cycle. In some cases, the
errors are not revealed until the projects and Makefiles are
deployed in user environments, directories, or configurations.
SYMake performs symbolic evaluation to generalize the possible evaluation results and is thus able to detect them statically.
VII. REFACTORING SUPPORT WITH SYMAKE
Renaming Variable. Another application is automatic renaming, where SYMake needs to find all code locations
where a variable is initialized or referenced. The key challenges are listed in Scenario 2, Section II. E.g., the variable
server.o_libs at line 15 has its name composed of the value of
variable serverNM (i.e., server.o) and the substring _libs. The
variable server.o_libs is then referenced when ProgramTmp is
called at line 26, which leads to line 23 being evaluated,
Figure 7. [T-models of a) server.o_libs at line 15, b) server.o_libs at line 26, c) client.o_libs at line 26, and d) client.o_libs at line 18.]

Table II
SUBJECT SYSTEMS AND BUILD CODE INFORMATION

System     MakeF  LOCs   Locsets  Loc.   Frag  NonFrag  Vars  Rules  Paths  Incl
SCST [34]  49     1786   870      2230   39    831      876   112    154    0
LINN [35]  67     4020   3417     7169   53    3364     3425  134    536    0
GCC [36]   68     5350   1972     16546  11    1961     1980  804    75     5
MIN [37]   95     2374   632      3324   0     632      632   121    95     95
LINS [38]  98     1255   973      1563   5     968      973   135    98     0
FIRE [39]  156    6374   1960     4668   12    1948     1991  2635   621    130
TS [40]    232    12950  2655     9711   50    2605     2655  2541   235    210
For example, if libs in $(serverNM)_libs at line 15 is renamed to LIBS, then $($(1)_libs) at line
23 must be changed to $($(1)_LIBS). Also, $(clientNM)_libs at
line 18 must be changed to $(clientNM)_LIBS since at line
23, $($(1)_libs) is evaluated into $(clientNM)_libs at foreach's
second iteration (line 26). Thus, the 3 locations of libs at lines 15,
18, and 23 need consistent renaming. We call them a locset.
To support renaming of variables whose names
are fragmented (called fragmented variables), SYMake
determines a locset as follows. During the
evaluation, SYMake keeps track of all reference locations of
the same variable, and for each reference string s, it builds the
corresponding T-model to keep track of where all substrings
of s come from. Since the strings of all references to the
same variable must match, SYMake is able to identify the
matched substrings from the literals of the T-models of those
references and group those substrings into a locset.
For example, during the evaluation at line 15, SYMake
builds the T-model for the variable server.o_libs as in Figure 7a. At line 26, it sees a reference to server.o_libs, and the
corresponding T-model is in Figure 7b. Since the two references
at lines 15 and 26 are to the same variable, it can match
the literal nodes in the two T-models in Figures 7a and 7b. The
string server.o_libs at line 15 comes from 3 literals: 1) server.
(line 11), 2) o (line 4), and 3) _libs (line 15). The string
server.o_libs at line 26 comes from 3 literals: 1) server. (line
11), 2) o (line 4), and 3) _libs (line 23). Thus, _libs (line 15)
and _libs (line 23) belong to the same locset. Similar reasoning
applies to Figures 7c and 7d, so the two strings _libs at lines
18 and 23 are in the same locset. Thus, all three strings _libs
at lines 15, 18, and 23 belong to the same locset.
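The grouping in this example can be sketched in Python with a union-find over literal locations. This is only an illustration under simplifying assumptions: each reference is given as the ordered list of (literal, line) pairs read off its T-model, two literals are matched only when they cover exactly the same character span of the same variable name, and the function names are hypothetical.

```python
from collections import defaultdict


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    parent[find(parent, a)] = find(parent, b)


def locsets(references):
    """Group literal locations that must be renamed together.

    references: one trace per variable reference; a trace is the list of
    (text, line) literals whose concatenation is the variable's name.
    """
    parent = {}
    by_span = defaultdict(list)    # (name, start, end) -> literal locations
    for trace in references:
        name = "".join(text for text, _ in trace)
        offset = 0
        for text, line in trace:
            loc = (text, line)
            parent.setdefault(loc, loc)
            by_span[(name, offset, offset + len(text))].append(loc)
            offset += len(text)
    # Literals covering the same span of the same name belong together.
    for locs in by_span.values():
        for other in locs[1:]:
            union(parent, locs[0], other)
    groups = defaultdict(set)
    for loc in parent:
        groups[find(parent, loc)].add(loc)
    return sorted(sorted(group, key=lambda loc: loc[1]) for group in groups.values())


# The four traces of Figure 7 (a, b: server.o_libs; c, d: client.o_libs).
traces = [
    [("server.", 11), ("o", 4), ("_libs", 15)],
    [("server.", 11), ("o", 4), ("_libs", 23)],
    [("client.", 12), ("o", 4), ("_libs", 23)],
    [("client.", 12), ("o", 4), ("_libs", 18)],
]
for group in locsets(traces):
    print(group)
```

Because the literal at line 23 takes part in references to both server.o_libs and client.o_libs, the union step links the literals at lines 15 and 18 through it, which gives the three-location locset described above.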
The T-models are created/updated over time. After the
evaluation of a variable reference r, the entry for r in the
entity table refers to its latest T-model. All locsets are main-
Table III
LOCSET DETECTION ACCURACY RESULT

         Locsets              Locations
System   Prec   Rec    Total  Corr    Inc-Text  T (s)  MB   |SDG|
SCST     100%   100%   870    2230    724       12     51   19184
LINN     100%   100%   3417   7169    323       27     133  19539
GCC      100%   100%   1972   16546   2188      19     156  10909
MIN      100%   100%   632    3324    611       13     259  9666
LINS     100%   100%   973    1563    167       12     248  10623
FIRE     100%   100%   1960   4668    893       34     691  55043
TS       100%   100%   2655   9711    1659      22     552  27302

Table IV
CONTROLLED EXPERIMENT'S RESULT

Tasks  Time
1      37-23
2      42-29
3      35-22
4      36-25
5      33-23
6      42-23
X. CONCLUSIONS
We introduce SYMake, an infrastructure for make code
analysis. SYMake includes an AST-building module, a symbolic evaluation algorithm, and an evaluation trace building
algorithm. We used SYMake to develop a tool to detect code
smells and to support refactoring in Makefiles. Our evaluation on real-world Makefiles showed that our renaming tool
is accurate and efficient, and that with SYMake, users could
detect code smells and refactor Makefiles more accurately.
ACKNOWLEDGMENT
This project is funded by the US National Science Foundation
(NSF) grant CCF-1018600. It was also funded in part by
the Vietnam Education Foundation for the third author.
REFERENCES
[1] "Software Building," en.wikipedia.org/wiki/Software_build.
[9] C. Gunter, "Abstracting dependencies between software configuration items," TOSEM, vol. 9, no. 1, pp. 94-131, 2000.
[12] C. S. Pasareanu, P. C. Mehlitz, D. H. Bushnell, K. Gundy-Burlet, M. Lowry, S. Person, and M. Pape, "Combining unit-level symbolic execution and system-level concrete execution
for testing NASA software," in ISSTA'08, pp. 15-26. ACM.
[20] G. Kumfert and T. Epperly, "Software in the DOE: The hidden overhead of 'the build'," Lawrence Livermore National
Laboratory, Tech. Rep. UCRL-ID-147343, 2002.
[23] H. Kegel and F. Steimann, "Systematically refactoring inheritance to delegation in Java," in ICSE'08, pp. 431-440. ACM.
[24] A. Kiezun, M. D. Ernst, F. Tip, and R. M. Fuhrer, "Refactoring for parameterizing Java classes," in ICSE'07. IEEE CS.
[25] J. Liu, D. Batory, and C. Lengauer, "Feature oriented refactoring of legacy applications," in ICSE'06, pp. 112-121. ACM.
[28] S. Ducasse, M. Rieger, and S. Demeyer, "A language independent approach for detecting duplicated code," in ICSM'99.