An extension of association rule mining for software engineering data repositories

H. Tamada

Shuji Morisaki, Akito Monden, Haruaki Tamada, Tomoko Matsumura, and Ken-ichi Matsumoto
Graduate School of Information Science, Nara Institute of Science and Technology
8916-5, Takayama, Ikoma, Nara, Japan

Abstract

This paper proposes a method to mine rules from software engineering data repositories that contain a number of quantitative attributes such as staff months and SLOC. The proposed method extends conventional association analysis methods to treat quantitative variables in two ways: (1) the distribution of a given quantitative variable is described in the consequent part of a rule by its mean value and standard deviation, so that conditions producing distinctive distributions can be discovered; and (2) to discover optimized conditions, quantitative values appearing in the antecedent part of a rule are divided into contiguous fine-grained partitions in preprocessing, and rules are then merged after mining so that adjacent partitions are combined.

1. Introduction

Many software development companies collect data from software projects (records of product size, development duration, staff-hours, numbers of bugs, metrics for risk assessment, customer satisfaction, and the like), with the goal of improving productivity, meeting deadlines, and improving quality in software development. Generally, companies collect and store such software engineering data for use by production engineering divisions, quality assurance divisions, project management offices (PMOs), and other support divisions. Companies may use this information for purposes such as estimating developer effort, predicting reliability, and determining a wide range of development standards (such as bug density and productivity). For such purposes, a number of conventional analysis methods have been widely researched, including cost models [3] [11] [14], reliability models [9], and orthogonal defect classification [4].

This paper focuses on a new approach to association analysis that uses the software project data described above. Researchers have used association analysis [1] effectively to analyze point-of-sales (POS) data for retailers and Website traffic logs, and to discover association rules hidden in the data [15]. There has also been research on software project data: through association analysis, Amasaki et al [2] mined the conditions under which software projects fall into disorder (combinations of risk assessment values) using the assessments of large numbers of risk variables.

General association analysis methods and rules, however, are not always applicable to software project data because they do not provide for scalar values. The values in software project data generally mix nominal measurements with ordinal and scalar measurements, so it is not possible to handle these values uniformly as they are. Software project data contains a number of quantitative measurements of particular interest, so we would like to extend the general association analysis approach to take advantage of the scalar values instead of simply reducing them to nominal values. Identifying relationships among these values can lead to improved productivity, reduced bug density, and process improvements, as well as elimination of defect causes. Using their means and variances can help to tune process improvements and cause identification more finely. Finding a rule that identifies situations associated with higher bug density may make it possible to eliminate the causes of these bugs by eliminating the situations expressed by the rule. Similarly, finding rules common to projects with large variance in productivity may make it possible to reduce the variance by eliminating the situations common to the rules.

This paper proposes a method for mining rules appropriate for software project data by extending conventional association analysis methods. To handle staff-months, LOC, and other quantitative variables, the proposed method extends association rules to include quantitative variables in the consequent parts of the rules. The proposed method divides these variables into contiguous fine-grained partitions for the antecedent parts of the rules. After mining extended association rules, the method merges rules by joining partitions next to each other.

In Section 2, below, we describe conventional association analysis and the issues in applying it to software project data. In Section 3 we describe the proposed method. Section 4 presents related research. Section 5 summarizes the findings and describes future topics.

2. Association Analysis and Its Issues

2.1 Association Analysis

Researchers have used association analysis to discover associations hidden in the POS product-purchasing logs of retail stores [1], Website traffic logs [15], proteins [10], and the like. For example, in the case of POS logs, researchers have mined rules about products purchased together, such as "purchases product A ∧ purchases product B ⇒ purchases product C." There are a number of possible uses for the rule in this example: the retailer could place products A, B, and C near each other in the store so that customers can find them easily; or it could protect revenues by setting the prices of antecedent products A and B to make up for discounts on the sale price of consequent product C.

Association analysis is defined as follows [1]. Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of binary attribute values, called items. A set $A \subset I$ is called an item set. Let a database $D$ be a multiset of subsets of $I$. Each $T \in D$ is called a transaction. An association rule is denoted by an expression $A \Rightarrow B$, where $B = I_k\ (1 \le k \le m)$ and $A \cap B = \emptyset$.

With data like POS logs, however, which have huge numbers of items, it is not realistic to mine all rules: it takes inordinate amounts of computer processing time, and it is not feasible to interpret the huge number of mined rules manually. For this reason, conditions are placed on rule mining by setting minimum values for one or all of three key indicators of rule importance (support, confidence, and lift). Rules that are not likely to be important are generally pruned.

Support: Support is an indicator of rule frequency. It is expressed as $\mathrm{support}(A \Rightarrow B) = a / n$, where $a = |\{T \in D \mid A \subset T \wedge B \subset T\}|$ and $n = |\{T \in D\}|$.

Confidence: Confidence is the probability that consequent B will follow antecedent A. It is expressed as $\mathrm{confidence}(A \Rightarrow B) = a / b$, where $a$ is defined as in Support and $b = |\{T \in D \mid A \subset T\}|$.

Lift: Lift is an indicator of the contribution antecedent A makes to consequent B. It is expressed as $\mathrm{lift}(A \Rightarrow B) = \mathrm{confidence}(A \Rightarrow B) / (c / n)$, where $c = |\{T \in D \mid B \subset T\}|$.

For example, assume that the number of projects is n = 20, the number of projects that contain A is 10, the number of projects that contain B is 8, and the number of projects that contain both A and B is 6. For A ⇒ B, the support is 0.3 (6/20), the confidence is 0.6 (6/10), and the lift is 1.5 (0.6 / (8/20)).

2.2 Issues with Association Analysis for a Software Engineering Data Repository

This paper envisions collecting software engineering data as the project progresses, and assumes that attributes include values such as staff effort and LOC as defined in the ISBSG repository [8] and IPA SEC [7]. Table 1 shows sample project data. In Table 1, row 1 is the attribute category, and row 2 is the attribute name. Each of the rows 3 and beyond corresponds to a single project. Many attribute values are measured and logged for each project. Note that all values in the table are made-up examples. Although the number of variables per project differs depending on the organization and projects in question, there are typically several hundred. The number of projects, on the other hand, ranges roughly from several tens to several thousands. A company rarely has more than 10,000 projects.

As shown in Table 1, a major characteristic of software project data is that it mixes nominal measurements such as platform type, target industry, and target process, ordinal measurements such as performance requirements and security, and scalar measurements such as source lines of code (SLOC) and staff-hours (human costs). Association analysis normally is applied to qualitative variables (nominal or ordinal measurements); scalar measurements are generally converted to ordinal measurements via preprocessing. For example, it would be possible to convert SLOC into an ordinal measurement consisting of three categories (high, medium, and low) depending on its value, but the optimum partition must be determined via trial and error, and it is a nontrivial task to discover the optimum partition points for multiple variables.

Table 1.
An example of software development project data
(values are listed per attribute; the project-by-project alignment of the original table is not preserved)

Management Attributes
  Project ID: 06S101, 06S201, 06G01
  Dept. code: Industrial Dept. 1, Industrial Dept. 2, Public Work Dept.
Project Attributes
  Development Type: New Development, Enhancement, Re Development
  Business Area Type: Finance, Retail, Government
  Application Type: Customer management, Ordering, Personnel affairs
Architecture
  Platform: UNIX, Windows
  Job: Interaction, Batch
  Database: DB2, Oracle, MySQL
Requirements
  Capability, Security, Portability: ordinal values (High, Medium, Low, N/A)
Size
  SLOC (Planned): 10000, 28000
  SLOC (Recorded): 14239, 30940
  Other SLOC values: 7900, 8000
  Effort (Planned): 12, 60, … staff months
  Effort (Recorded): 68, 8, 16 staff months

Sometimes, the variables in the software project data that most interest us in our analysis are quantitative variables. The variables that interest us are the ones that tie in directly to process improvement and elimination of defect causes. Some examples are productivity (ratio of LOC or FP to staff-hours worked), bug density, bugs detected per test case, and rate of outsourcing of the coding and testing phases. If we can discover conditions (rules) that shift values or distributions in an undesired direction, we can create countermeasures to those conditions. Below, we describe how the proposed method handles quantitative variables (scalar measurements) contained in the target data.

3. Extension of Association Rule

3.1 Preliminary Definitions

Each value in Table 1 is expressed as an <attribute, value> pair. Let the set of projects be $P = \{P_1, P_2, \ldots, P_n\}$, with $P_i = \{\langle attr_1, p_{i1}\rangle, \langle attr_2, p_{i2}\rangle, \ldots, \langle attr_m, p_{im}\rangle\}\ (1 \le i \le n)$, where $attr_k$ is the kth attribute and $p_{ik}$ is the value of the kth attribute of project $P_i$. Further, let the set of attribute values be $V = \{V_1, V_2, \ldots, V_m\}$, with $V_k = \{\langle attr_k, v_{1k}\rangle, \langle attr_k, v_{2k}\rangle, \ldots, \langle attr_k, v_{n_k k}\rangle\}$. Here, the $v_{ik} \in V_k\ (1 \le i \le n_k)$ are either qualitative variables (nominal or ordinal measurements) or quantitative variables (scalar measurements). Note that in the case of quantitative variables and ordinal measurements, $v_{ik} < v_{(i+1)k}$.

Using Table 1 as an example, the third row in the table (the item with project ID 06S101) is $P_1$, and $P_1$ = {<project ID, 06S101>, <dept. code, industrial dept. 1>, <development type, new development>, <business area type, finance>, …}. $attr_1$ is project ID, $p_{11}$ is "06S101," $V_{14}$ = {<effort (planned), 12>, <effort (planned), 60>, …}, and $v_{1,14}$ is "12."

3.2 Handling Quantitative Variables

To resolve the issue of applying association rules to software project data described in Section 2, the proposed method handles quantitative variables using methods S1 and S2, as follows. S1 is an extension of association rules that uses statistics of a quantitative variable (mean and standard deviation) in the consequent part B without conversion. S2 can be applied to one or more quantitative variables in antecedent part A. S2 finds optimal partitions by logically ORing predetermined fine-grained partitions.

[S1] Extension of consequent part

S1 uses the attribute, the mean value, and the standard deviation of a quantitative variable in the consequent part B to create an extended association rule expressed as $A \Rightarrow attr_k(\mu, \sigma)$, where

$$\mu = \frac{1}{l}\sum_{i=1}^{l} p_{ik}, \qquad \sigma = \sqrt{\frac{1}{l}\sum_{i=1}^{l} (p_{ik} - \mu)^2}, \qquad l = |\{P_i \in P \mid A \subset P_i\}|,$$

and the sums run over the $l$ projects that satisfy antecedent A. The analyst specifies $attr_k$ for rule mining. Rules are mined by calculating the mean and standard deviation of $attr_k$ in projects that meet antecedent A. An example would be "<industry, finance> ⇒ SLOC (84304, 163.565)."

We define the indicators below (lift of mean and lift of standard deviation) by comparing against the mean and standard deviation over all projects, written here as $\mu_1$ and $\sigma_1$ to match Figure 1:

$$\mu_1 = \frac{1}{n}\sum_{i=1}^{n} p_{ik}, \qquad \sigma_1 = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (p_{ik} - \mu_1)^2}.$$

Lift of mean: The lift of mean is $\mu$ divided by the mean of the kth attribute of all projects: $\text{lift of mean} = \mu / \mu_1$. For example, given a quantitative rule "<development language, C> ⇒ productivity (2.0, 0.864)," if the mean productivity of all projects is 0.5, then the lift of mean is 2.0 / 0.5 = 4.0. The higher this value, the greater the effect of the antecedent on the consequent in this rule.

Lift of standard deviation: Similarly, the lift of standard deviation is $\sigma$ divided by the standard deviation of the kth attribute of all projects: $\text{lift of standard deviation} = \sigma / \sigma_1$.

Figure 1. Distributions of attribute value: (a) distribution of a certain attribute value over all projects (standard deviation $\sigma_1$); (b) distribution under antecedent part A (standard deviation $\sigma_2$). Horizontal axis: value of $attr_k$; vertical axis: frequency.

Figure 1 shows an example to explain the lift of standard deviation. Solid line (a) is the distribution of $p_{ik}$ over all projects ($1 \le i \le n$). Dotted line (b) is the distribution of $p_{ik}$ over the projects that meet antecedent part A ($A \subset P_i$). The lift of standard deviation is the ratio of $\sigma_2$ to $\sigma_1$. A lift of standard deviation smaller than 1 ($\sigma_2 / \sigma_1 < 1$) indicates that the situations expressed by antecedent part A are drivers for smaller deviation. Reinforcing the situations expressed by A may lead to smaller deviation in the values of the kth attribute.

[S2] Partitioning and joining via conversion for antecedent part

S2 is applied to the antecedent part A. Using the method proposed by Srikant et al [13], quantitative variables are divided into multiple partitions that are converted into categories. The method mines association rules from the converted categories, searches the obtained rule set for rules whose partitions can be joined, and ORs them to join the converted partitions. It is expected that the optimum partitioning will be found by creating a sufficiently large number of partitions. There are two partitioning methods, as described below. Both create $d$ ($d \le n$) partitions.

(1) For a given quantitative variable $attr_k$, divide the range of $v_{ik}$ into $d$ equal parts. $V_{lk}$ is a set partitioning the elements of $V_k$ into $d$ parts, where

$$V_{lk} = \{\langle attr_k, v_{ik}\rangle \in V_k \mid v_{ik} \ge v_{1k} + u(l-1) \ \wedge\ v_{ik} < v_{1k} + ul\}\ (1 \le l \le d), \qquad u = \frac{v_{n_k k} - v_{1k}}{d}.$$

(2) Partition the values so that, as closely as possible, an equal number of $v_{ik}$ fall in each interval. $V_{lk}$ is a set partitioning the elements of $V_k$ into $d$ parts, where

$$V_{lk} = \{\langle attr_k, v_{(l-1)\cdot u_l + 1}\rangle, \ldots, \langle attr_k, v_{l u_l}\rangle\}\ (1 \le l \le d), \qquad u_l = \begin{cases} \dfrac{n}{d} & (l = 1) \\[2ex] \dfrac{n - \sum_{i=1}^{l-1} u_i}{d - l + 1} & (l \ne 1). \end{cases}$$

Quantitative variables are split into the partitions $V_{lk}$ and converted. Pairs of mined rules meeting the following criterion are then found:

$$V_{lk} \wedge A' \Rightarrow B \quad\text{and}\quad V_{(l+1)k} \wedge A' \Rightarrow B \quad (1 \le l \le d-1),$$

and the logical OR (∨) is used to join $V_{lk}$ and $V_{(l+1)k}$, like so:

$$(V_{lk} \vee V_{(l+1)k}) \wedge A' \Rightarrow B \quad (1 \le l \le d-1).$$

Although the antecedents of rules are joined, their consequents are not. This process continues until no joinable rules are found. If two rules are joined, the support, the lift of mean, and the lift of standard deviation are recalculated.
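As a rough illustration of S1, the following Python sketch computes the consequent statistics of an extended rule $A \Rightarrow attr_k(\mu, \sigma)$ together with its lift of mean and lift of standard deviation. The function name `extended_rule`, the dictionary-based project records, and the sample values are assumptions made for illustration only; they do not come from the paper. The population standard deviation (division by the number of matching projects) is used to match the definition of σ given for S1.

```python
from statistics import fmean, pstdev


def extended_rule(projects, antecedent, attr):
    """Return S1 statistics for the rule `antecedent => attr(mu, sigma)`.

    projects:   list of dicts mapping attribute name -> value (hypothetical layout)
    antecedent: dict of attribute name -> required value; all pairs must match
    attr:       name of the quantitative attribute used as the consequent
    """
    all_values = [p[attr] for p in projects]
    matched = [
        p[attr]
        for p in projects
        if all(p.get(name) == value for name, value in antecedent.items())
    ]
    if not matched:
        return None  # the antecedent matches no project, so no rule exists

    mu, sigma = fmean(matched), pstdev(matched)
    # Statistics over all projects. This sketch assumes both are non-zero;
    # otherwise the lifts below are undefined.
    mu_all, sigma_all = fmean(all_values), pstdev(all_values)
    return {
        "support": len(matched) / len(projects),
        "mean": mu,
        "std": sigma,
        "lift_of_mean": mu / mu_all,
        "lift_of_std": sigma / sigma_all,
    }


# Made-up sample data, in the spirit of the paper's made-up Table 1.
sample = [
    {"language": "C", "productivity": 4.0},
    {"language": "C", "productivity": 2.0},
    {"language": "Java", "productivity": 0.0},
    {"language": "Java", "productivity": 2.0},
]
rule = extended_rule(sample, {"language": "C"}, "productivity")
```

A lift of standard deviation below 1 in the returned dictionary corresponds to the case $\sigma_2 / \sigma_1 < 1$ discussed for Figure 1.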
Support after joining:

$$\mathrm{support}((V_{lk} \vee V_{(l+1)k}) \wedge A' \Rightarrow B) = \mathrm{support}(V_{lk} \wedge A' \Rightarrow B) + \mathrm{support}(V_{(l+1)k} \wedge A' \Rightarrow B)$$

S1 and S2 are not mutually exclusive methods. If the target data has multiple quantitative variables, it is possible to specify one quantitative variable as the consequent handled by S1, and apply S2 to the rest of the quantitative variables (appearing in the antecedent). In other words, it is possible to obtain joined rules of the form $A_1 \vee A_2 \Rightarrow attr_k(\mu, \sigma)$. Here,

$$\mu = \frac{\sum p_{i_1 k} + \sum p_{i_2 k}}{l_1 + l_2}, \qquad \sigma = \sqrt{\frac{\sum (p_{i_1 k} - \mu)^2 + \sum (p_{i_2 k} - \mu)^2}{l_1 + l_2}},$$

where $1 \le i_1 \le l_1$ and $1 \le i_2 \le l_2$. Note that $l_1 = |\{P_i \in P \mid A_1 \subset P_i\}|$ and $l_2 = |\{P_i \in P \mid A_2 \subset P_i\}|$.

Table 3. Examples of Mined Rules

R1: (customer = existing customer) ∧ (target industry = experienced) ⇒ ratio of outsourcing (mean: 0.368, standard deviation: 0.113)
    Support: 0.216; lift of mean: 1.510; lift of standard deviation: 0.832
R2: (development type = new development) ∧ (maximum number of staff = smallest (1)) ⇒ ratio of outsourcing (mean: 0.118, standard deviation: 0.0630)
    Support: 0.216; lift of mean: 0.482; lift of standard deviation: 0.463
R3: (customer = existing customer) ∧ (use of commercial packages = without using) ∧ (proportion of staff month (coding and unit testing phase) = large (5 ∨ 6)) ⇒ proportion of staff month (integration and system testing) (mean: 0.210, standard deviation: 0.0352)
    Support: 0.216; lift of mean: 0.785; lift of standard deviation: 0.353
R4: (development type = new development) ∧ (target industry = experienced) ∧ (outsourcer = second or later trading) ∧ (ratio of outsourcing = large (5 ∨ 6)) ⇒ proportion of staff month (integration and system testing) (mean: 0.262, standard deviation: 0.150)
    Support: 0.216; lift of mean: 0.979; lift of standard deviation: 1.51

3.3 Procedure

Figure 2. Procedure for extended association rule mining: project data → conversion → converted data → rule mining → rules → partition merging → joined rules. The analyst supplies the quantitative variables to use with S2 together with a conversion method and a partition count, supplies the quantitative variable to use with S1, and interprets the joined rules.

Figure 2 shows the procedure for extended association rule mining. The cylinders in the figure represent the data, and the squares represent processing. The solid arrows in the figure represent the flow of data, and the dotted arrows represent operations by the analyst. Processing proceeds in the following sequence: conversion, rule mining, and partition joining.

The analyst specifies the quantitative variables to use with S2, assigns a partition count $d$ and a partition method, and executes the "conversion" procedure. Conversion categorizes quantitative variables into discrete data (converts them to ordinal measurements). The analyst then executes the "rule mining" procedure, specifying which quantitative variable to use with S1 and a minimum support level. If the analyst has specified any quantitative variables for S2, the "partition joining" procedure merges rules with adjacent partitions. If the procedure finds rules whose partitions can be joined, the rules are combined via a logical OR. When joining, the support, lift of mean, and lift of standard deviation of the rules are recalculated.

4. Related Research

Fukuda et al [6] have proposed a method for mining association rules that include quantitative variables as antecedents. This method is capable of calculating optimal intervals; for example, given the quantitative variable age, it is able to calculate the values x1, x2 for which the rule "age interval [x1, x2] ⇒ purchased given service A" has the highest support. Reference [5] also extends this method so that it can handle two quantitative variables. Although these methods can only mine rules with quantitative variables in the antecedent, they are one solution to the issue of handling quantitative variables in association-rule analysis. The present research can also calculate the interval with higher support, as Fukuda et al do, by converting quantitative variables into qualitative variables (ordinal measurements) and joining rules via logical ORs.
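The conversion and joining steps can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the function names are hypothetical, the remaining antecedent A′ is taken to be empty, one rule is mined per partition, and two adjacent partitions are treated as joinable whenever both of their rules meet the minimum support. The paper's own joining criterion requires matching A′ and consequent, which this simplification does not model. Support after joining is the sum of the supports, and the mean and standard deviation are recomputed over the union of the matching projects.

```python
from math import sqrt


def equal_width_partition(values, d):
    """Assign each value a partition index 1..d using d equal-width intervals."""
    lo, hi = min(values), max(values)
    u = (hi - lo) / d  # interval width
    if u == 0:
        return [1 for _ in values]
    # Assumption: the maximum value is placed in the last partition.
    return [min(int((v - lo) / u) + 1, d) for v in values]


def mine_partition_rules(part_ids, targets, min_support):
    """Keep, per partition, the consequent values if the rule meets min_support."""
    n = len(targets)
    rules = {}
    for l in sorted(set(part_ids)):
        ys = [y for p, y in zip(part_ids, targets) if p == l]
        if len(ys) / n >= min_support:
            rules[l] = ys
    return rules  # partition index -> consequent values of matching projects


def join_adjacent(rules, n):
    """OR together runs of adjacent partitions and recompute the statistics."""
    runs, run = [], []
    for l in sorted(rules):
        if run and l != run[-1] + 1:
            runs.append(run)
            run = []
        run.append(l)
    if run:
        runs.append(run)

    joined = []
    for run in runs:
        ys = [y for l in run for y in rules[l]]
        mu = sum(ys) / len(ys)
        sigma = sqrt(sum((y - mu) ** 2 for y in ys) / len(ys))
        joined.append(
            {"partitions": run, "support": len(ys) / n, "mean": mu, "std": sigma}
        )
    return joined


# Made-up data: an antecedent variable (e.g. SLOC in KLOC) and a consequent.
sizes = [0, 1, 2, 3, 8, 9]
effort = [1.0, 3.0, 2.0, 2.0, 10.0, 12.0]
ids = equal_width_partition(sizes, d=5)
rules = mine_partition_rules(ids, effort, min_support=0.3)
joined = join_adjacent(rules, n=len(effort))
```

In this made-up data, partitions 1 and 2 are adjacent and are merged into a single rule, while partition 5 stays on its own because partitions 3 and 4 produced no rule.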
A number of case studies have reported association-analysis methods applied to actual software project data. Amasaki et al [2] evaluate risk items for each development phase from collected questionnaires, and conduct association analysis for project-confusion factors (whether development budgets or deadline standards will be overrun), with the goal of revealing the factors leading to disorder in software-development projects. Their analysis data, however, does not include quantitative variables, and effective rules are only mined within the scope of conventional association analysis.

Song et al [12] mine association rules from defect data logged during development (type of defect cause, correction effort, etc.) to predict defects with a high likelihood of simultaneous occurrence and to predict defect-correction effort (staff-hours). Although they convert correction effort, a quantitative variable, into ordinal form, the discrete partitions are hard-wired into four categories: 1 hour or less, 1 hour to one day, one to three days, and longer than three days. Applying S2 to Song et al's data should enable more fine-grained categories to be obtained. Additionally, method S1 could provide new knowledge by mining rules with the mean and standard deviation of correction effort in the consequent.

5. Conclusions

This paper proposes a method to mine rules from software engineering data repositories that contain a number of quantitative attributes such as staff months, LOC, defect density, test case density, and outsourcing cost. The proposed method extends conventional association analysis methods to treat quantitative variables in two ways. First, the proposed method extends association rules to include a single specified quantitative variable's mean value and standard deviation in the consequent part. Second, to treat other quantitative variables, the proposed method divides quantitative variables appearing in the antecedent into contiguous fine-grained partitions in preprocessing. Partitions next to each other are joined after rules are mined.

Since the consequent parts of mined rules show distributions under the conditions expressed by the antecedent parts, finding a difference in distribution can lead to quick cause identification, systematic process improvements, better planning, and more precise estimations. If a certain antecedent part increases the mean value of the consequent undesirably, eliminating the situation expressed in the antecedent part may decrease the mean value of the consequent part, supporting quick cause identification and systematic process improvement. If a certain antecedent part increases the standard deviation of the consequent part, we can take the variation associated with that antecedent into account during project planning and estimation, which supports better planning and more precise estimations.

The proposed method can be applied to very large software-project repositories, including those with missing data. Furthermore, the proposed method can be applied to existing repositories. We are planning further investigation on larger software project repositories and other kinds of repository.

Acknowledgements

This work is supported by the Comprehensive Development of e-Society Foundation Software program of the Japanese Ministry of Education, Culture, Sports, Science and Technology.

References

[1] Agrawal R., Imielinski T., Swami A.: Mining Association Rules between Sets of Items in Large Databases, Proceedings of ACM SIGMOD Conference on Management of Data, pp. 207-216 (1993).
[2] Amasaki S., Hamano Y., Mizuno O., Kikuno T.: Characterization of Runaway Software Projects Using Association Rule Mining, Proceedings of 7th International Conference on Product Focused Software Process Improvement, pp. 402-407 (2006).
[3] Boehm B. W.: Software Engineering Economics, Prentice Hall (1981).
[4] Chillarege R., Bhandari I.S., Chaar J.K., Halliday M.J., Moebus D.S., Ray B.K., Wong M.Y.: Orthogonal Defect Classification - A Concept for In-Process Measurements, IEEE Transactions on Software Engineering, Vol. 18, No. 11, pp. 943-956 (1992).
[5] Fukuda T., Morimoto Y., Morishita S., Tokuyama T.: Data Mining Using Two-Dimensional Optimized Association Rules: Scheme, Algorithms, and Visualization, Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 13-23 (1996).
[6] Fukuda T., Morimoto Y., Morishita S., Tokuyama T.: Mining Optimized Association Rules for Numeric Attributes, Proceedings of the 15th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 182-191 (1996).
[7] IPA SEC, http://www.ipa.go.jp/english/sec/first.html
[8] International Software Benchmarking Standards Group: Repository Information, http://www.isbsg.org/isbsg.nsf/weben/Repository%20info
[9] Ramamoorthy C. V., Bastani F. B.: Software Reliability - Status and Perspectives, IEEE Transactions on Software Engineering, Vol. 8, No. 4, pp. 354-371 (1982).
[10] She R., Chen F., Wang K., Ester M., Gardy J.L., Brinkman F.L.: Frequent-Subsequence-Based Prediction of Outer Membrane Proteins, Proceedings of 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 436-445 (2003).
[11] Shepperd M., Schofield C.: Estimating Software Project Effort Using Analogies, IEEE Transactions on Software Engineering, Vol. 23, No. 12, pp. 736-743 (1997).
[12] Song Q., Shepperd M., Cartwright M., Mair C.: Software Defect Association Mining and Defect Correction Effort Prediction, IEEE Transactions on Software Engineering, Vol. 32, No. 2, pp. 69-82 (2006).
[13] Srikant R., Agrawal R.: Mining Quantitative Association Rules in Large Relational Tables, Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 1-12 (1996).
[14] Srinivasan K., Fisher D.: Machine Learning Approaches to Estimating Software Development Effort, IEEE Transactions on Software Engineering, Vol. 21, No. 2, pp. 126-137 (1995).
[15] Yang Q., Zhang H.H., Li T.: Mining Web Logs for Prediction Models in WWW Caching and Prefetching, Proceedings of Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 473-478 (2001).
