An XML C Source Code Interchange Format for CASE Tools

Noritoshi Atsumi

2011 35th IEEE Annual Computer Software and Applications Conference
0730-3157/11 $26.00 © 2011 IEEE. DOI 10.1109/COMPSAC.2011.102

Noritoshi Atsumi, Dept. of Information Engineering, Nagoya University, Japan. Email: atsumi@nagoya-u.jp
Takashi Kobayashi, Dept. of Information Engineering, Nagoya University, Japan. Email: tkobaya@is.nagoya-u.ac.jp
Shinichiro Yamamoto, Dept. of Information Science and Technology, Aichi Prefectural University, Japan. Email: yamamoto@ist.aichi-pu.ac.jp
Kiyoshi Agusa, Dept. of Information Engineering, Nagoya University, Japan. Email: agusa@is.nagoya-u.ac.jp

Abstract: We propose an XML representation of C source code to support the development of CASE tools. Since source code is a main artifact of software development, most CASE tools have some features related to source code, such as an editor, a static analyzer, or a profiler. To develop such tools, detailed information related to source code is needed. However, it is quite difficult to reuse program analysis features because they do not have common interfaces, even for parsing and data/control-flow analysis, which are the most common features of such CASE tools. To address this issue, we focus on XML as an intermediate representation for source code information. Existing XML representations only represent the structure of syntax trees and lack some important information required for CASE tools. We propose two models for representing source code: one is for intra-file information, which consists of syntax structure, flow, and type information; the other is for inter-file relations, which are cross-reference information. We also introduce CASE tools built with our representation and demonstrate its efficacy in CASE tool development. To evaluate the efficacy, we show that a coding rule checker and a cross-referencer can be easily implemented using common XML processing technologies such as XSLT and XPath.

Keywords: Program Analysis; CASE Tool; Program Understanding; Coding Checker

I. INTRODUCTION

To efficiently develop dependable software, various lower CASE tools have been proposed and implemented. Recently, an integrated development environment (IDE) is not only a collection of programming tools but also an extensible tool platform. Developers can build CASE tools using the internal information of the IDE.

To support various kinds of lower CASE tools, a tool platform must collect detailed information about source code. Existing tool platforms store information about source code by using either a proprietary representation or the typical abstract syntax tree (AST) [1]. These representations contain sufficient information, and several powerful tool platforms provide well-designed application programming interfaces (APIs) for accessing the information.

We have been developing a CASE tool platform, “Sapid” [2], [3], since 1991. Sapid provides all fundamental features for lower CASE tools, such as lexical analysis, syntax analysis, semantic analysis, and control/data flow analysis. Sapid handles the fine-grained information needed to develop various kinds of lower CASE tools. The analysis results of Sapid, which consist of fine-grained elements and the relations among them, are stored in a relational database. We have also developed many APIs to obtain the analysis results efficiently. Some CASE tools [4], [5], [6] have been built using these APIs.

To develop CASE tools on a CASE tool platform, developers must understand how to use the APIs that access the internal information of the platform. It is difficult to select the appropriate APIs to obtain the desired information, since there is an enormous number of APIs. Therefore, developers often build a CASE tool for a specific target or domain without a CASE tool platform. However, building a lower CASE tool requires implementing many features and passing a large amount of data among them. To address these problems, it is necessary to reduce the cost of learning the APIs and the data structures of the analysis results and, in addition, to enable collaboration with other CASE tools.

In this paper, we propose two XML representation models (CX-model and XREF-model) of C source code analysis results for lower CASE tools. In our XML representation, source code is marked up only by adding tags, without changing any fragment of the source code. Although our representation does not contain all of the information required for every kind of CASE tool, it contains the information needed by many fundamental features.

XML is simple, easy to understand, and interoperable, and there is a wide range of tool support for manipulating, transforming, and searching XML documents. APIs for handling XML documents are provided in various languages such as C, Java, Ruby, Python, and PHP. It is easy to select specific elements in XML documents and to transform the structure of the documents using XML-related technologies. Therefore, an XML representation of source code helps to reduce the cost of ensuring compatibility with other tools and of learning how to use the analysis results.

Some XML representations for source code have been proposed [7], [8], [9], [10], [11], [12], [13]. However, they represent only the structure of syntax trees and lack some important information required for CASE tools. We represent both the intra-file information and the inter-file relations using XML. The intra-file information consists of syntax structure, flow, and type information. The inter-file relations are cross-reference information. We also introduce CASE tools built with our representation and demonstrate the efficacy of our representation in CASE tool development.

The contributions of this paper are to show the following:
• how to represent not only the syntax structure but also flow information and type information using XML
• how to represent cross-reference information across multiple files collectively
• the efficacy of our representation, by developing a coding checker and a cross-referencer using it

The rest of this paper is organized as follows. Section II discusses existing XML representations of source code. Section III introduces our XML representation for C source code. Section IV discusses the application and evaluation of the proposed method. Section V concludes the paper.

II. XML REPRESENTATION OF SOURCE CODE

Typical technologies for representing source code using XML include JavaML 2.0 [7], JX-model (XSDML) [14], ACML [10], and srcML [12]. By using XML, we can:
1) perform text manipulation using sed and grep
2) manipulate programs based on the XML document structure using XSLT, DOM, SAX, XPath, and XQuery
3) perform manipulations using general programming languages such as Java, C++, and Ruby
4) use existing user interfaces, such as Web browsers

Therefore, an XML representation brings a number of advantages to the development of CASE tools, including support for multiple query languages, support for complex transformations, support for a document format as well as a data format, and broad-based usage and acceptance of standards [15]. Some software engineering tools [16], [17], [18], [19] have been proposed and developed by exploiting these advantages. To develop CASE tools efficiently, it is important to determine how to represent the structure of source code.

A. JavaML 2.0

JavaML 2.0 is an XML representation of Java source code. The JavaML schema does not depend on Java syntax and is defined based on an abstract model that is used in typical object-oriented programming languages. About 90 XML elements are defined and modeled for the fine-grained elements in source code, such as classes, methods, fields, string literals, numeric literals, and comments.

B. ACML

ACML is an XML document model that represents the structure and semantics of C source code. ACML contains a variety of analytical information that is useful for developing CASE tools. The efficacy of the representation was shown by developing a program slicer within a short time [20]. ACML uses the semantic program elements directly as XML elements and is defined by about 50 XML elements. The XML representation has a complex structure, like the symbol tables generated by compilers, and the file size is very large.

C. XML-based Framework

R. Al-Ekram et al. proposed a framework for language-neutral program representation [21]. This framework is based on a multilayered abstraction of source code artifacts represented using several XML applications. In the framework, existing XML applications are used for source code representation, and new applications are defined to represent higher-level program abstractions.

D. srcML

The srcML format adds structural information to raw source code. The document view of source code is supported by preserving all lexical information from the original source code, including comments, white space, and preprocessor directives. This permits transformation equality between the srcML representation and the related source code document.

E. Problems of Existing XML Representations

Many CASE tools need to extract not only intra-file information but also inter-file information. For example, a cross-referencer and a refactoring tool need to extract information from all of the source files of a program. In addition, a refactoring tool and a slicer need the results of dependency analysis. To provide the information that is required by many CASE tools, it is necessary to support analysis across multiple XML files and to represent the results of dependency analysis.

Because the existing XML representations represent relations over multiple XML files only by the IDs of elements, it is not easy to obtain information that spans multiple files. The World Wide Web Consortium (W3C) published XLink and XPointer as specifications for representing relations among XML resources. However, XLink and XPointer are not widely used, because few technologies support them.

In our study, we represent information across multiple files as cross-reference information that is separate from the XML representation of a source file. The two XML document models for source code represent the correspondence between program elements by using the same IDs.

III. XML C SOURCE CODE REPRESENTATION

In this paper, we propose two XML representation models as intermediate representations for CASE tools. Previously, we used a relational database. XML is more suitable for representing syntax structure, because a syntax tree is a tree representation of source code.

A. Feature of Our Representation

CX-model is an XML representation model for C source code before preprocessing. Before the C compiler starts compiling a source file, the file is processed by a preprocessor. To extract the details of the program behavior, the preprocessed source code is required. However, developers edit and read the source code before preprocessing. Therefore, CASE tools must provide information about not only the preprocessed code but also the source code before preprocessing. The information about preprocessed code is important for CASE tools, and it lies not only in the fully preprocessed code but also in the preprocessing process itself. However, it is difficult to record the information about that process. We therefore gave preference to an XML representation of the source code before preprocessing.

While CX-model is similar to srcML, there are three main differences between them. First, CX-model encloses all fragments of source code (operators, separators, identifiers, keywords, white spaces, and new lines) in dedicated tags. These tags allow developers or tools to add extra white spaces and new lines that were not contained in the original source code. The original white spaces and new lines are always enclosed in terminal elements, while extra ones are enclosed in non-terminal elements. Second, CX-model aggressively exploits many kinds of attributes, while very few attributes are used in srcML. The verbose attributes spare developers and tools additional lexical analysis of element contents and time-consuming traversal of several elements when they obtain the properties of code. Finally, CX-model contains several useful links obtained through semantic analysis of the whole source code.

Sapid provides the following source code analyzers: a preprocessor, a syntax analyzer, a semantic analyzer, a control flow analyzer, and a data flow analyzer. Before the syntax analyzer of Sapid starts analyzing a source file, the file is processed by the preprocessor. At this time, the preprocessor records the correspondence between file offsets before and after preprocessing. Using the analysis results through the APIs of Sapid, we implemented the generators of CX-model and XREF-model.

B. CX-model

CX-model consists of syntax structure, flow, and type information. The XML element File represents syntax structure and flow information, and the XML element TypeInfos represents type information for each identifier. The flow information is closely related to the syntax structure. The data type of an identifier is described only at its definition, not at its occurrences in the source code, although the data type is closely related to each occurrence. Therefore, we designed these kinds of information as a single representation model. The DTD of CX-model can be downloaded at http://www.sapid.org/DTD/CX-model.dtd.

Structure of Syntax Trees

Our XML representation preserves the structure of the source code and the markup in the program. It is possible to return from our XML representation to the original source code by removing the XML elements.

It is worth discussing the problems Badros pointed out in [22]. He stated that a representation marked up only by adding tags would need further lexical analysis of the textual contents and would not sufficiently abstract the original source code. To alleviate these problems, CX-model introduces fine-grained tagging and slightly verbose attributes. The attribute sort indicates the kind of an element. The attribute defid of the element ident indicates the identifier ID of the definition. For each identifier, the correspondence between the definition and the references is represented by this ID. The attribute type_id of the element ident indicates the data type of the identifier, and the type itself is described in the element TypeInfos.

Flow Information

The data and control flows are represented using the element flow as a sibling element of the element Stmt. We represent the flow information for each statement (Stmt element of CX-model).

Figure 1 shows our representation of the flow information for the sample program in Figure 2. The flow elements in lines 1 and 2 represent the flow information for the while statement in line 3 of Figure 2. The flow element in line 1 (sort="branch_true") represents the control flow when the condition of the while statement is true. The flow element in line 2 (sort="branch_false") represents the control flow when the condition is false. The flow element in line 6 shows that the statement “ans += x--” depends on the variable “x”.

     1  <flow id="f001" stmt_id="s50331650" next="s50331649" sort="branch_true"/>
     2  <flow id="f002" stmt_id="s50331650" next="s58720265" sort="branch_false"/>
     3  <Stmt sort="While" id="s50331650">...<!-- while (...) {...} -->
     4  <flow id="f004" stmt_id="s50331649" next="s58720263" sort="control_normal"/>
     5  <Stmt sort="Block" id="s50331649"> <!-- { ... } -->
     6  <flow id="f008" stmt_id="s58720263" next="s58720263" sort="data_dependence" expr_id="e58720261" dep_id="s33554432"/><flow...>...</flow>...
     7  <Stmt id="s58720263">...</Stmt> <!-- ans += x--; -->
     8  </Stmt> ...
     9  <Stmt id="s58720265"> ... </Stmt> <!-- return ans; -->
    10  </Stmt>

Figure 1. XML Representation of Flow Information

     1  int sum(int x) {
     2    int ans = 0;
     3    while (x > 0) {
     4      ans += x--;
     5    }
     6    return ans;
     7  }

Figure 2. Sample Program

Type Information

The type information is represented using the element TypeInfos, which is not embedded in the element File, so that the syntax structure does not become too complex. Each type in TypeInfos is marked up by the XML element TypeInfo and has an identification ID. The XML element ident has an attribute type_id, and the value of this attribute is associated with a TypeInfo by the ID.

An element TypeInfo represents the information on one type. The attribute sort represents the category to which the type belongs, i.e., whether the type is struct, union, enum, typedef, basic, pointer, array, function, or variable argument. If the category of the type is typedef, the attribute ref shows the ID of the defined type. If it is pointer, ref shows the type to which the pointer points. If it is array, ref shows the type of the data elements. If it is function, ref shows the type of the return value, and the child elements typeRef represent the types of the arguments. If it is struct, union, or enum, the child elements typeRef represent the types of the members.
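The way the ref attribute chains one TypeInfo to another can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not Sapid's generator or API: the TypeInfos fragment, the IDs t1 to t4, and the name attribute are hypothetical and were made up for this example; only the element name TypeInfo and the attributes id, sort, and ref come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical TypeInfos fragment. The IDs and the "name" attribute are
# assumptions for illustration; the real CX-model DTD may differ.
SAMPLE = """
<TypeInfos>
  <TypeInfo id="t1" sort="basic" name="int"/>
  <TypeInfo id="t2" sort="pointer" ref="t1"/>
  <TypeInfo id="t3" sort="array" ref="t2"/>
  <TypeInfo id="t4" sort="typedef" ref="t3" name="table_t"/>
</TypeInfos>
"""


def describe(type_id, index):
    """Follow the ref chain of a TypeInfo and return a readable description."""
    info = index[type_id]
    sort = info.get("sort")
    if sort == "basic":
        return info.get("name")
    if sort == "pointer":
        return "pointer to " + describe(info.get("ref"), index)
    if sort == "array":
        return "array of " + describe(info.get("ref"), index)
    if sort == "typedef":
        return "typedef of " + describe(info.get("ref"), index)
    # struct, union, enum, function, etc. are not expanded in this sketch.
    return sort


# Index every TypeInfo by its ID so that a type_id or ref resolves in one lookup.
index = {t.get("id"): t for t in ET.fromstring(SAMPLE).iter("TypeInfo")}
print(describe("t4", index))
```

A tool would start from the type_id of an ident element and call describe with that ID; the point of the sketch is that no parsing of declarators is needed once the types are linked by ID.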
C. XREF-model

XREF-model represents the definitions, references, and related information of each program element in all program files as XML documents. The spec element in Figure 3 shows the cross-reference information on the function sum. The attributes id and ref use the same IDs as the identification IDs in CX-model.

    <specs sort="func">
      <description>Function</description>
      <program>a.out</program>
      <spec>
        <ident id="s33554433">sum</ident>
        <pos file="8388608" offset="4" line="1">sample.c</pos>
        <type static="false" ref="s83886080">int</type>
        <funcVar sort="argument">
          <varRef>
            <reference ref="s33554432">x</reference>
            <type static="false" ref="s83886081">int</type>
          </varRef>
        </funcVar>
        <funcVar sort="local">
          <varRef>
            <reference ref="s33554434">ans</reference>
            <type static="false" ref="s83886081">int</type>
          </varRef>
        </funcVar>
      </spec>...<!-- snip -->
    </specs>

Figure 3. XREF-model for Function sum in Fig. 2

D. Comparison with Other XML Representations

The comparison with other XML representations in terms of the accessibility of each type of information is presented in Table I.

Table I. Comparison of Ability to Acquire Information

                              JavaML    ACML    Our Representation
    Definition Information      △        △             ○
    Reference Information       △        △             ○
    Type Information            △        ○             ○
    Flow Information            △        △             ○

JavaML and ACML represent the relationships between definitions and references using IDs. However, when the definition for a reference ID is not in the same file, it must be searched for in all other XML files. In our representation, it is easy to access information across multiple files, because the definitions and references in all program files are represented by XREF-model.

To extract the type information for a referred identifier in JavaML, it is necessary to refer to the definition of the identifier using the ID and to extract the type from the definition. In ACML, it is easy to obtain the type information on a referred identifier, because the relation between the type information and the referred identifiers is represented by using ID and IDREF. The relations in our representation are similar to those in ACML.

JavaML and ACML do not represent flow information. To obtain the flow information, it is necessary to analyze the structure of the syntax trees represented by the XML documents. Because our representation embeds the flow information in the structure of the syntax trees, it is possible to obtain the flow information simply by tracing the next attribute of the flow elements.

If type and flow information is removed from CX-model, it is almost the same as JavaML and ACML. By combining type and flow information with XREF-model, it is possible to reduce the cost of developing CASE tools, because it is not necessary to analyze multiple XML documents or the syntax structures.

IV. APPLICATION AND EVALUATION

We developed CX-Checker [5] (http://www.sapid.org/cxc/, in Japanese) and an XML version of SPIE [6] using CX-model and XREF-model. In this section, we introduce these tools and show the efficacy of our representation.

A. CX-Checker

CX-Checker is a coding checker for detecting instances of non-compliance with coding rules. CX-Checker provides a function for adding rules. The rules can be implemented in three ways: with XPath, with the APIs, or with DOM. The following rule is an example of the use of XPath.

    //Stmt[@sort="Switch" and (count(Label[kw/text()="case"])=0)]

This example shows the rule used to detect switch statements without a case label. “//Stmt” is the XPath expression for selecting all Stmt nodes in the document, starting from the root node. “[. . .]” is a predicate for selecting nodes. “Stmt[@sort="Switch"]” selects the nodes for which the sort attribute is Switch, and “Label[kw/text()="case"]” selects the Label elements whose child element kw has the text case. “count” is the function that returns the number of selected elements. Therefore, the XPath expression selects switch statements without a case label. In this manner, XPath facilitates the selection of elements with a specific structure. Such selections are the basic components of many coding rules.

In rules that use the APIs, the target XML document is handled as a set of objects that represent the elements of CX-model, and the rule is described as a logical manipulation of these objects. The following program is an example in which the APIs are used.

    CFileElement cfile = new CFileElement(file.getDOM());
    for (CFunctionElement function : cfile.getFunctions()) {
      System.out.println(function.getName());
    }

In this example, the Function elements are obtained from CFileElement by using the getFunctions method, and the function names are output by using the getName method. CFileElement represents the File element of CX-model. By using these APIs, developers who are not experts in CX-model and the DOM APIs can describe rules.

We tried to implement the MISRA-C [23] rules using our representation. The MISRA-C rules include rules on types and on flow. We could easily implement 102 of the 127 rules. If we had used an XML representation in which type and flow information is removed from CX-model, it would have been hard to implement the rules on type and flow, because we would have had to implement functions to extract the type information and the flow information from the syntax structure of the XML documents.

The coding rules concern the naming of identifiers, the format, the syntax elements, the relations among syntax elements, and so on. To detect code that violates these rules, information on the fine-grained elements and the semantic relations among the elements is needed. Because the fine-grained elements are tagged and many semantic relations are represented in our representation, we could implement many coding rules easily.

B. XML version of SPIE

The XML version of SPIE was implemented by transforming CX-model and XREF-model to HTML using XSLT. The ID of each syntax element is identical to the ID in XREF-model. Therefore, it is possible to link a syntax element in the source code to the corresponding one in the cross-reference information by using the ID.

Figure 4 shows a snapshot of the cross-reference information in SPIE. The line numbers and syntax elements in the figure are linked to the corresponding part of the source code or cross-reference information. The links among program elements are represented by IDs in CX-model and XREF-model. Therefore, it is easy to transform the links into hyperlinks using XSLT. By using XML-related technologies, CASE tool developers can easily develop CASE tools using CX-model and XREF-model.

Figure 4. Screenshot of Cross Reference by SPIE

V. CONCLUSION AND REMARKS

This paper proposed two XML representation models (CX-model and XREF-model) to provide both intra-file and inter-file information. CX-model is a model for representing intra-file information, which consists of the structure of syntax trees, flow information, and type information. XREF-model is a model for representing inter-file relations, which are cross-reference information. By using this representation, developers can easily build CASE tools that collaborate with each other.

We introduced a coding checker and a cross-referencer built with our representation and showed that it is easy to implement lower CASE tools using it. The cost of developing CASE tools using other representations will be higher than the cost of using ours, because other representations do not include the relations across multiple files and do not contain flow information. We believe that the relations across multiple files, flow information, and type information are important for developing CASE tools, and that it is important to represent them in XML documents.

The topics we plan to study in the future are as follows:

1) XML representation for preprocessed programs
For developers, it is important to provide information on the program before preprocessing. By contrast, for CASE tools, it is useful to provide the information on the preprocessed program. To achieve an ideal balance between the two, it is necessary to associate both. However, this is difficult to implement.

2) APIs for CX-model and XREF-model
Sapid only supports manipulations based on general XML technologies such as DOM, XSLT, and XQuery. Standard XML technologies support only low-level manipulations of XML documents. Such manipulations of CX-model and XREF-model alone are not sufficient for developing CASE tools. To develop various CASE tools, it is necessary to provide APIs for frequently used manipulations.

3) Collaboration with other tools
There are many powerful CASE tools that were built without using our representation. To build more advanced CASE tools, it is necessary to collaborate with other powerful tools. We are planning to integrate our representation and its converter into popular IDEs (e.g., Eclipse).

ACKNOWLEDGEMENT

The development of CX-Checker and the extension of CX-model were carried out as an On-the-Job Learning theme of the Advanced IT Specialist Course in the Graduate School of Information Science, Nagoya University. The development was supported by AISIN SEIKI Co., Ltd. The authors sincerely thank all members involved in the project.

REFERENCES

[1] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.
[2] N. Fukuyasu, S. Yamamoto, and K. Agusa, “An evolution framework based on fine grained repository,” in Proc. of IWPSE, 1999, pp. 43–47.
[3] Sapid Project, “Sophisticated APIs for CASE tool development,” http://www.sapid.org/, 2011.
[4] N. Atsumi, S. Yamamoto, and K. Agusa, “Categorization of library function call patterns,” in Proc. of Workshop on Software Product Archiving and Retrieving System, 2004, pp. 11–20.
[5] T. Osuka, T. Kobayashi, J. Mase, N. Atsumi, S. Yamamoto, N. Suzumura, and K. Agusa, “CX-Checker: A customizable coding checker for C (in Japanese),” in Proc. of IPSJ/SIGSE SES, 2009, pp. 119–126.
[6] H. Ohashi, S. Yamamoto, and K. Agusa, “Hypertext-based CASE tool for source program review (in Japanese),” vol. 98-295, no. 295, pp. 15–22, 1998.
[7] A. Aguiar, G. David, and G. J. Badros, “JavaML 2.0: Enriching the markup language for Java source code,” in XML: Aplicações e Tecnologias Associadas, 2004.
[8] C. Reichel and R. Oberhauser, “XML-based programming language modeling: An approach to software engineering,” in Proc. of the IASTED Conf. on Software Engineering and Applications, 2004, pp. 424–429.
[9] H. Yoshida, S. Yamamoto, and K. Agusa, “A generic fine-grained software repository using XML (in Japanese),” IPSJ Journal, vol. 44-6, no. 6, pp. 1509–1516, 2003.
[10] H. Kawashima and K. Gondow, “Experience with ANSI C markup language for a cross-referencer,” in Proc. of HICSS, 2003, p. 324.
[11] M. L. Collard, J. I. Maletic, and A. Marcus, “Supporting document and data views of source code,” in Proc. of ACM Symposium on Document Engineering, 2002, pp. 34–41.
[12] J. I. Maletic, M. L. Collard, and A. Marcus, “Source code files as structured documents,” in Proc. of IWPC, 2002, pp. 289–292.
[13] J. F. Power and B. A. Malloy, “Program annotation in XML: A parse-tree based approach,” in Proc. of WCRE, 2002, pp. 190–198.
[14] K. Maruyama and S. Yamamoto, “A CASE tool platform using an XML representation of Java source code,” in Proc. of SCAM, 2004, pp. 158–167.
[15] J. I. Maletic, M. Collard, and H. Kagdi, “Leveraging XML technologies in developing program analysis tools,” in Proc. of CASCON, 2004, pp. 80–85.
[16] G. McArthur, J. Mylopoulos, and S. K. K. Ng, “An extensible tool for source code representation using XML,” in Proc. of WCRE, 2002, pp. 199–210.
[17] J. I. Maletic and M. L. Collard, “Supporting source code difference analysis,” in Proc. of ICSM, 2004, pp. 210–219.
[18] Y. X. Sun, H. Y. Chen, and T. H. Tse, “Lean implementations of software testing tools using XML representations of source codes,” in Proc. CSSE(2), 2008, pp. 708–711.
[19] M. L. Collard, H. H. Kagdi, and J. I. Maletic, “An XML-based lightweight C++ fact extractor,” in Proc. of IWPC, 2003, pp. 134–143.
[20] K. Gondow and H. Kawashima, “Towards ANSI C program slicing using XML,” Electr. Notes Theor. Comput. Sci., vol. 65-3, no. 3, pp. 30–49, 2002.
[21] R. Al-Ekram and K. Kontogiannis, “An XML-based framework for language neutral program representation and generic analysis,” in Proc. of CSMR, 2005, pp. 42–51.
[22] G. J. Badros, “JavaML: A markup language for Java source code,” Computer Networks, vol. 33, no. 1-6, pp. 159–177, 2000.
[23] The Motor Industry Software Reliability Association, Guidelines for the use of the C language in vehicle based software. The Motor Industry Research Association, 1998.
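As a supplementary illustration of the switch-without-case rule discussed for CX-Checker in Section IV-A, the same check can be approximated with Python's standard library. This is a sketch under stated assumptions, not CX-Checker's implementation: xml.etree.ElementTree supports only a subset of XPath (no count() and no "and" in predicates), so the predicate is evaluated in Python instead, and the File fragment with IDs s1 to s3 is a hypothetical, heavily simplified stand-in for real CX-model output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified CX-model fragment (IDs and nesting are assumptions).
SAMPLE = """
<File>
  <Stmt sort="Switch" id="s1">
    <Label><kw>case</kw></Label>
    <Label><kw>default</kw></Label>
  </Stmt>
  <Stmt sort="Switch" id="s2">
    <Label><kw>default</kw></Label>
  </Stmt>
  <Stmt sort="While" id="s3"/>
</File>
"""


def switches_without_case(root):
    """Return the IDs of Switch statements that have no direct 'case' Label.

    Approximates //Stmt[@sort="Switch" and (count(Label[kw/text()="case"])=0)].
    """
    hits = []
    for stmt in root.iter("Stmt"):           # all Stmt elements, like //Stmt
        if stmt.get("sort") != "Switch":
            continue
        case_labels = [
            label for label in stmt.findall("Label")   # direct Label children
            if any(kw.text == "case" for kw in label.findall("kw"))
        ]
        if not case_labels:                   # count(...) = 0
            hits.append(stmt.get("id"))
    return hits


print(switches_without_case(ET.fromstring(SAMPLE)))
```

With an XPath 1.0 engine that implements count(), the original expression could be passed as a string unchanged; the manual loop here only stands in for that engine.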
