
Evaluating Software Engineering Standards

Computer, September 1994
Shari Lawrence Pfleeger, Norman Fenton, and Stella Page
Centre for Software Reliability

Given the more than 250 software engineering standards, why do we sometimes still produce less than desirable products? Are the standards not working, or being ignored?

Software engineering standards abound; since 1976, the Software Engineering Standards Committee of the IEEE Computer Society has developed 19 standards in areas such as terminology, documentation, testing, verification and validation, reviews, and audits.¹ In 1992 alone, standards were completed for productivity and quality metrics, software maintenance, and CASE (computer-aided software engineering) tool selection. If we include work of the major national standards bodies throughout the world, there are in fact more than 250 software engineering standards.

The existence of these standards raises some important questions. How do we know which practices to standardize? Since many of our projects produce less-than-desirable products, are the standards not working, or being ignored? Perhaps the answer is that standards have codified approaches whose effectiveness has not been rigorously and scientifically demonstrated. Rather, we have too often relied on anecdote, "gut feeling," the opinions of experts, or even flawed research, rather than on careful, rigorous software engineering experimentation.

This article reports on the results of the Smartie project (Standards and Methods Assessment Using Rigorous Techniques in Industrial Environments), a collaborative effort to propose a widely applicable procedure for the objective assessment of standards used in software development. We hope that, for a given environment and application area, Smartie will enable the identification of standards whose use is most likely to lead to improvements in some aspect of software development processes and products. In this article, we describe how we verified the practicality of the Smartie framework by testing it with corporate partners.

Suppose your organization is considering the implementation of a standard. Smartie should help you to answer the following questions:

- What are the potential benefits of using the standard?
- Can we measure objectively the extent of any benefits that may result from its use?
- What are the related costs necessary to implement the standard?
- Do the costs exceed the benefits?
To that end, we present Smartie in three parts. First, we analyze what typical standards look like, both in software engineering and in other engineering disciplines. Next, we discuss how to evaluate a standard for its applicability and objectivity. Finally, we describe the results of a major industrial case study involving the reliability and maintainability of almost two million lines of code.

Software engineering standards

Standards organizations have developed standards for standards, including a definition of what a standard is. For example, the British Standards Institute defines a standard as

A technical specification or other document available to the public, drawn up with the cooperation and consensus or general approval of all interests affected by it, based on the consolidated results of science, technology and experience, aimed at the promotion of optimum community benefits.²

Do software engineering standards satisfy this definition? Not quite. Our standards are technical specifications available to the public, but they are not always drawn up with the consensus or general approval of all interests affected by them. For example, airline passengers were not consulted when standards were set for building the A320's fly-by-wire software, nor were electricity consumers polled when software standards for nuclear power stations were considered. Of course, the same could be said for other standards; for example, parents may not have been involved in the writing of safety standards for pushchairs (strollers). Nevertheless, the intention of a standard is to reflect the needs of the users or consumers as well as the practices of the builders.
More importantly, our standards are not based on the consolidated results of science, technology, and experience.³ Programming languages are declared to be corporate or even national standards without case studies and experiments to demonstrate the costs and benefits of using them. Techniques such as cleanroom, formal specification, or object-oriented design are mandated before we determine under what circumstances they are most beneficial. Even when scientific analysis and evaluation exist, our standards rarely reference them. So even though our standards are laudably aimed at promoting community benefits, we do not insist on having those benefits demonstrated clearly and scientifically before the standard is published. Moreover, there is rarely a set of objective criteria that we can use to evaluate the proposed technique or process.

Thus, as Smartie researchers, we sought solutions to some of the problems with software engineering standards. We began our investigation by posing three simple questions that we wanted Smartie to help us answer:

- On a given project, what standards are used?
- To what extent is a particular standard followed?
- If a standard is being used, is it effective? That is, is it making a difference in quality or productivity?

What is a standard - and what does it mean for software engineering?

Often, a standard's size and complexity make it difficult to determine whether a particular organization is compliant. If partial compliance is allowed, measurement of the degree of compliance is difficult, if not impossible - consider, for example, the ISO 9000 series and the 14 major activities it promotes.⁴ The Smartie project suggests that large standards be considered as a set of smaller "ministandards." A ministandard is a standard with a cohesive, content-related set of requirements. In the remaining discussion, the term standard refers to a ministandard.

What is a good standard?

We reviewed dozens of software engineering standards, including international, national, corporate, and organizational standards, to see what we could learn. For each standard, we wanted to know

- How good is the standard?
- What is affected by the standard?
- How can we determine compliance with the standard?
- What is the basis for the standard?

"Goodness" of the standard was difficult to determine, as it involved at least three distinct aspects. First, we wanted to know whether and how we can tell if the standard is being complied with. That is, a standard is not a good standard if there is no way of telling whether a particular organization, process, or piece of code complies with the standard. There are many examples of such "bad" standards. For instance, some testing standards require that all statements be tested "thoroughly"; without a clear definition of "thoroughly," we cannot determine compliance. Second, a standard is good only in terms of the success criteria set for it. In other words, we wanted to know what attributes of the final product (such as reliability or maintainability) are supposed to be improved by using the standard. And finally, we wanted to know the cost of applying the standard. After all, if compliance with the standard is so costly as to make its use impractical, or practical only in certain situations, then cost contributes to "goodness."

We developed a scheme to evaluate the degree of objectivity inherent in assessing compliance. We can classify each requirement being evaluated into one of four categories: reference only, subjective, partially objective, and completely objective. A reference-only requirement declares that something will happen, but there is no way to determine compliance; for example, "Unit testing shall be carried out." A subjective requirement is one in which only a subjective measure of conformance is possible; for example, "Unit testing shall be carried out effectively." A subjective requirement is an improvement over a reference-only requirement, but it is subject to the differing opinions of experts. A partially objective requirement involves a measure of conformance that is somewhat objective but still requires a degree of subjectivity; for example, "Unit testing shall be carried out so that all statements and the most probable paths are tested." An objective requirement is the most desirable kind, as conformance to it can be determined completely objectively; for example, "Unit testing shall be carried out so that all statements are tested."
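The classification can be applied mechanically once each requirement has been tagged. The following Python sketch is purely illustrative and is not part of the Smartie tooling; the enum and the tallying code are ours. It tags the four unit-testing requirements quoted above and counts requirements per category, which is the kind of objectivity profile that Figure 1 summarizes for whole standards.

```python
from collections import Counter
from enum import Enum

class Objectivity(Enum):
    REFERENCE_ONLY = "reference only"
    SUBJECTIVE = "subjective"
    PARTIALLY_OBJECTIVE = "partially objective"
    OBJECTIVE = "completely objective"

# The four unit-testing requirements quoted above, tagged with the
# objectivity category each one illustrates.
requirements = [
    ("Unit testing shall be carried out.", Objectivity.REFERENCE_ONLY),
    ("Unit testing shall be carried out effectively.", Objectivity.SUBJECTIVE),
    ("Unit testing shall be carried out so that all statements and the most "
     "probable paths are tested.", Objectivity.PARTIALLY_OBJECTIVE),
    ("Unit testing shall be carried out so that all statements are tested.",
     Objectivity.OBJECTIVE),
]

# Tallying the categories across a whole standard gives an objectivity profile.
profile = Counter(category for _, category in requirements)
for category in Objectivity:
    print(f"{category.value}: {profile[category]}")
```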
Clearly, our goal as a profession should be to produce standards with requirements that are as objective as possible. However, as Figure 1 illustrates, the Smartie review of the requirements in software engineering standards indicates that we are a long way from reaching that goal.

Figure 1. Degree of objectivity in software engineering standards' requirements.

To what do our standards apply?

To continue our investigation, Smartie researchers reviewed software engineering standards to determine what aspect of software development is affected by each standard. We considered four distinct categories of requirements in the standards: process, internal product, external product, and resources. Internal product requirements refer to such items as the code itself, while external product requirements refer to what the user experiences, such as reliability.

For examples of these categories, we turn to the British Defence Standard DEF STD 00-55 (interim),⁵ issued by the Ministry of Defence (second revision in 1992) for the procurement of safety-critical software in defense equipment. Some are internal product requirements:

- Each module should have a single entry and exit.
- The code should be indented to show its structure.

Others are process requirements:

- The Design Team shall validate the Software Specification against Software Requirements by animation of the formal specification.

while some are resource requirements:
- All tools and support software . . . shall have sufficient safety integrity.
- The Design Authority shall demonstrate . . . that the seniority, authority, qualifications and experience of the staff to be employed on the project are satisfactory for the tasks assigned to them.

Typical of many software standards, DEF STD 00-55 has a mixture of all four types of requirements.

Are software standards like other standards?

Standardization has made life easier in many disciplines. Because of standard voltage and plugs, an electrical appliance from Germany will work properly in Italy. A liter of petrol in one country is the same as a liter in another, thanks to standard measurement. These standards, products of other engineering disciplines, offer lessons that we can learn as software engineers. So the next step in the Smartie process was to examine other engineering standards to see how they differ from those in software engineering. In particular, we asked

- Is the mix of product, process, and resource requirements roughly the same?
- Is the mix of objective and nonobjective compliance evaluation roughly the same?

The answer to both questions is a resounding no. To show just how different software engineering standards are, Figure 2 compares the British standard for pushchair safety with DEF STD 00-55, a software safety standard. The figure shows what is true generally: Software engineering standards are heavy on process and light on product, while other engineering standards are the reverse. That is, software engineering standards reflect the implicit assumption that using certain techniques and processes, in concert with "good" tools and people, will necessarily result in a good product. Other engineering disciplines have far less faith in the process; they insist on evaluating the final product in their standards.

Figure 2. A comparison of (a) the BS 4792 standard for safe pushchairs, with 29 requirements, and (b) DEF STD 00-55 for safe software, with 115 requirements, shows that software standards place more emphasis on process than on the final product.

Another major difference between our standards and those of other engineering disciplines is in the method of compliance assessment. Most other disciplines include in their standards a description of the method to be used to assess compliance; we do not. In other words, other engineers insist that the proof of the pudding is in the eating: Their standards describe how the eating is to be done, and what the pudding should taste like, look like, and feel like. By contrast, software engineers prescribe the recipe, the utensils, and the cooking techniques, and then assume that the pudding will taste good. If our current standards are not effective, it may be because we need more objective standards and a more balanced mix of process, product, and resource requirements.

The proof of the pudding: Case studies

The Smartie framework includes far more than we can describe here - for example, guidelines for evaluating the experiments and case studies on which the standards are based. We address all of these issues in Smartie technical reports, available from the Centre for Software Reliability. For the remainder of this article, we focus on an aspect of Smartie that distinguishes it from other research on standards: its practicality. Because Smartie includes industrial partners, we have evaluated the effectiveness of Smartie itself by applying it to real-life situations. We present here two examples of the Smartie "reality check": (1) applying the framework to written standards for a major company and (2) evaluating the use of standards to meet specified goals. Both examples involve Company X, a large, nationwide company whose services depend on software.
The company is interested in using standards to enhance its software's reliability and maintainability. In the first example, we examine some of the company's programming standards to see if they can be improved. In the second example, we recommend changes to the way data is collected and analyzed, so that management can make better decisions about reliability and maintainability.

Reality check 1: How good are the written standards?

We applied the Smartie techniques to a ministandard for using Cobol. The Cobol standard is part of a larger set of mandated standards, called programming guidelines, in the company's system development manual. Using the guidelines reputedly "facilitate[s] the production of clear, efficient and maintainable Cobol programs." The guidelines were based on expert opinion, not on experiments and case studies demonstrating their effectiveness in comparison with not following the guidelines. This document is clearly designed as a standard rather than a set of guidelines, since "enforceability of the standards is MANDATORY," with "any divergence" being "permanently recorded."

We focused on the layout and naming conventions, items clearly intended to make the code easier to maintain. Layout requirements such as the following can be measured in a completely objective fashion:

- Each statement should be terminated by a full stop.
- Only one verb should appear on any one line.
- Each sentence should commence in column 12 and on a new line, second and subsequent lines being neatly indented and aligned vertically. . . . Exceptions are ELSE which will start in the same column as its associated IF and which will appear on a line of its own.

Each line either conforms or does not, and the proportion of lines conforming to all layout requirements represents overall compliance with the standard.
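Because each layout rule yields a yes or no answer for every line, compliance can be expressed as a simple proportion. The Python sketch below is only an illustration of that idea, not Company X's checker: the two checks are simplified stand-ins for two of the quoted rules (full-stop termination and commencement in column 12), and the sample lines are invented.

```python
def ends_with_full_stop(line: str) -> bool:
    return line.rstrip().endswith(".")

def starts_in_column_12(line: str) -> bool:
    stripped = line.lstrip()
    # Column 12 (1-based) means exactly 11 leading spaces before the statement.
    return bool(stripped) and len(line) - len(stripped) == 11

def layout_compliance(lines):
    """Proportion of non-blank lines that satisfy every layout check."""
    checks = (ends_with_full_stop, starts_in_column_12)
    candidates = [line for line in lines if line.strip()]
    if not candidates:
        return 1.0
    conforming = sum(all(check(line) for check in checks) for line in candidates)
    return conforming / len(candidates)

sample = [
    " " * 11 + "MOVE TOTAL-COST TO REPORT-LINE.",
    " " * 11 + "ADD 1 TO RECORD-COUNT.",
    " " * 6 + "PERFORM PRINT-REPORT",  # wrong column and no full stop
]
print(f"Layout compliance: {layout_compliance(sample):.0%}")  # 67%
```

The company's real layout rules are more involved (continuation lines, ELSE alignment, and so on), but the measurement principle is the same: count conforming lines and divide by the total.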
On the other hand, measuring conformance to some naming conventions can be difficult, because such measurements are subjective, as is the case with

- Names must be meaningful.

The Smartie approach recommends that the standard be rewritten to make it more objective. For example, improvements might include

- Names must be English or scientific words which themselves appear as identifiable concepts in the specification document(s).
- Abbreviations of names must be consistent.
- Hyphens must be used to separate component parts of names.

Conformance measures can then use the proportion of names that conform to the standard. Analysis of the commenting requirements also led to recommendations that would improve the degree of objectivity in measuring conformance.

Reality check 2: Do the standards address the goals?

Company X collects reliability and maintainability data for many of its systems. The company made available to Smartie all of its data relating to a large system essential to its business. Initiated in November 1987, the system had had 27 releases by the end of 1992. The 1.7 million lines of code for this system involve two programming languages: Cobol (both batch Cobol and CICS Cobol) and Natural (a 4GL). Less than a third of the code is Natural; recent growth (15.2 percent from 1991 to 1992) has been entirely in Cobol.

Three corporate and organizational goals are addressed by measuring this system: (1) monitoring and improving product reliability, (2) monitoring and improving product maintainability, and (3) improving the overall development process. The first goal requires information about actual operational failures, while the second requires data on discovering and fixing faults. The third goal, process improvement, is at a higher level than the other two, so Smartie researchers focused primarily on reliability and maintainability as characteristics of process improvement.

The system runs continuously. Users report problems to a help desk whose staff determines whether the problem is a user error or a failure of the system to do something properly. Thus, all the data supplied to Smartie related to software failures rather than to documentation failures. The Smartie team received a complete set of failure information for 1991-92, so the discussion in this section refers to all 481 software failures recorded and fixed during that period. We reviewed the data to see how data collection and analysis standards addressed the overall goal of improving system reliability and maintainability. In many cases, we recommended a simple change that should yield additional, critical information in the future. The remainder of this section describes our findings.

A number is assigned to each "fault" report. We distinguish a fault (what the developer sees) from a failure (what the user sees).⁶ Here we use "fault" in quotation marks, since failures are labeled as faults. A typical data point is identified by a "fault" number, the week it was reported, the system area and fault type, the week the underlying cause was fixed and tested, and the actual number of hours to repair the problem (that is, the time from when the maintenance group decides to clear the "fault" until the time when the fix is tested and integrated with the rest of the system). Smartie researchers analyzed this data and made several recommendations about how to improve data collection and analysis to get a better picture of system maintainability. Nevertheless, the depth of data collection practiced at Company X is to be applauded. In particular, the distinction between hours-to-repair and time between problem-open ("week in") and problem-close ("week out") is a critical one that is not usually made in maintenance organizations.

The maintenance group designated 28 system areas to which underlying faults could be traced. Each system area name referred to a particular function of the system rather than to the system architecture. There was no documented mapping of programs or modules to system areas. A typical system area involved 80 programs, with each program consisting of 1,000 lines of code. The fault type indicated one of 11, many of which were overlapping. In other words, the classes of faults were not orthogonal, so it was possible to find more than one fault class appropriate for a given fault. In addition, there was no direct, recorded link between "fault" and program in most cases. Nor was there information about program size or complexity.

Given this situation, we made two types of recommendations. First, we examined the existing data and suggested simple changes to clarify and separate issues. Second, we extracted additional information by hand from many of the programs.
We used the new data to demonstrate that enhanced data collection could provide valuable management information not obtainable with the current forms and data.

Issue 1: Faults versus failures. Because the cause of a problem (that is, a fault) is not always distinguished from the evidence to the user of that problem (that is, a failure), it is difficult to assess a system's reliability or the degree of user satisfaction. Furthermore, with no mapping from faults to failures, we cannot tell which particular parts or aspects of the system are responsible for most of the problems users are encountering.

Recommendation: Define fault and failure, and make sure the maintenance staff understands the difference between the two. Then, consider failure reports separate from fault reports. For example, a design problem discovered during a design review would be described in a fault report; a problem in function discovered by a user would be described in a failure report.

Issue 2: Mapping from program to system area. Use of system areas to describe faults is helpful, but a mapping is needed from program name to system area. The current information does not reveal whether code in one system area leads to problems in another system area. The batch reporting and integration into the system of problem repairs compounds this difficulty because there is then no recorded link from program to fault. This information must have existed at some point in the maintenance process in order for the problem to be fixed; capturing it at the time of discovery is much more efficient than trying to elicit it well after the fact (and possibly incorrectly).

Recommendation: Separate the system into well-defined system areas and provide a listing that maps each code module to a system area. Then, as problems are reported, indicate the system area affected. Finally, when the cause of the problem is identified, document the names of the program modules that caused the problem.

Issue 3: Ambiguity and informality inherent in the incident closure reports. The description of each problem reflects the creativity of the recorder rather than standard aspects of the problem. This lack of uniformity makes it impossible to amalgamate the reports and examine overall trends.

Recommendation: The problem description should include the manifestation, effect, and cause of the problem, as shown in Figure 3. Such data would permit traceability and trend analysis.

Figure 3. Examples of an existing closure report and a proposed revision.

Existing closure report
  Fault ID: F752
  Reported: 18/6/92
  Definition: Logically deleted work done records appear on enquiries
  Description: Causes misleading information to users. Amend Additional Work Performed RDVIPG2A to ignore work done records with flag-amend = 1 or 2

Revised closure report
  Fault ID: F752
  Reported: 18/6/92
  Definition: Logically deleted work done records appear on enquiries
  Effect: Misleading information to users
  Cause: Omission of appropriate flag variables for work done records
  Change: Amend Additional Work Performed RDVIPG2A to ignore work done records with flag-amend = 1 or 2
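To make the contrast concrete, the revised report of Figure 3 can be thought of as a fixed record structure rather than free text. The dataclass below is our own illustration, not a Company X artifact; its fields follow the revised report, and the example values are taken from the figure.

```python
from dataclasses import dataclass

@dataclass
class RevisedClosureReport:
    fault_id: str    # report identifier
    reported: str    # date or week the problem was reported
    definition: str  # manifestation: what was observed
    effect: str      # consequence for users
    cause: str       # underlying fault
    change: str      # repair actually made

report = RevisedClosureReport(
    fault_id="F752",
    reported="18/6/92",
    definition="Logically deleted work done records appear on enquiries",
    effect="Misleading information to users",
    cause="Omission of appropriate flag variables for work done records",
    change="Amend Additional Work Performed RDVIPG2A to ignore work done "
           "records with flag-amend = 1 or 2",
)
```

Because every report then carries the same fields, reports can be amalgamated and trends in effects and causes examined, which is exactly what the free-text descriptions prevented.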
Issue 4: Fault classification scheme. Because the scheme contains nonorthogonal categories, it is difficult for the maintainer to decide in which category a particular fault belongs. For this reason, some of the classifications may be arbitrary, resulting in a misleading picture when the faults are aggregated and tracked.

Recommendation: Redefine fault categories so that there is no ambiguity or overlap between categories.

Issue 5: Unrecoverable data. By unrecoverable, we mean that the information we need does not exist in some documented form in the organization. For example, most of the problem report forms related a large collection of faults to a large collection of programs that were changed as a result. What appears to be unrecoverable is the exact mapping of program changes to a particular fault. On the other hand, some information was recoverable, but with great difficulty. For example, we re-created size information manually from different parts of the data set supplied to us, and we could have related problem severity to problem cause if we had had enough time.

Recommendation: The data in Table 1 would be useful if it were explicit and available to the analysts.

Table 1. Recoverable (documented) data versus nonrecoverable (undocumented) data. The items covered include size information for each module, static/complexity information for each module, mapping of faults to programs, severity categories, operational usage per system (needed for reliability assessment), success/failure of fixes (needed to assess the effectiveness of the maintenance process), and number of repeated failures (needed for reliability assessment).

Figures 4 through 8 show what we can learn from the existing data; Figures 9 through 11 show how much more we can learn using the additional data. Since we have neither mean-time-between-failure data nor operational usage information, we cannot depict reliability directly. As an approximation, we examined the trend in the number of "faults" received per week. Figure 4 shows that there is great variability in the number of "faults" per week, suggesting that there is no general improvement in system reliability.

Figure 4. Reliability trend charting the number of faults received per week.

The chart in Figure 5 contrasts the "faults" received with the "faults" addressed and resolved ("actioned") in a given week. Notice that there is wide variation in the proportion of "faults" that are actioned each week. In spite of the lack-of-improvement trend, this chart provides managers with useful information; they can use it to begin an investigation into which "faults" are handled first and why.

Figure 5. Charting the faults received and acted upon in the same week helps show how Company X deals with software failures.

Examining the number of "faults" per system area is also useful, and we display the breakdown in Figure 6. However, there is not enough information to know why particular system areas generate more "faults" than others. Without information such as size, complexity, and operational usage, we can draw no definitive conclusions. Similarly, an analysis of "faults" by fault type revealed that data and program faults dominated user, query, and other faults. However, the fault types are not orthogonal, so again there is little that we can conclude.

Figure 6. Plotting the number of faults per system area helps isolate fault-prone system areas.

Figures 7 and 8 show, respectively, mean time to repair fault by system area and by fault type. This information highlights interesting variations, but our conclusions are still limited because of missing information about size.

Figure 7. Mean time to repair fault (by system area).

Figure 8. Mean time to repair fault (by fault type).

The previous charts contain only the information supplied to us explicitly by Company X. The following charts reflect additional information that was recovered manually. As you can see, this recovered information enriches the management decisions that can be made on the basis of the charts.
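The charts in Figures 4, 7, and 8 are simple aggregations of the fault records described earlier (week reported, system area, fault type, hours to repair). The sketch below illustrates only those aggregations; the field names and the three sample records are invented, not Company X data.

```python
from collections import defaultdict
from statistics import mean

faults = [
    {"week_in": 37, "area": "C2", "type": "data", "repair_hours": 6.0},
    {"week_in": 37, "area": "G1", "type": "program", "repair_hours": 2.5},
    {"week_in": 38, "area": "C2", "type": "program", "repair_hours": 9.0},
]

# Figure 4: number of "faults" received per week.
per_week = defaultdict(int)
for fault in faults:
    per_week[fault["week_in"]] += 1

# Figures 7 and 8: mean time to repair, grouped by system area or fault type.
def mean_repair_time(records, key):
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["repair_hours"])
    return {group: mean(hours) for group, hours in groups.items()}

print(dict(per_week))                    # {37: 2, 38: 1}
print(mean_repair_time(faults, "area"))  # {'C2': 7.5, 'G1': 2.5}
print(mean_repair_time(faults, "type"))  # {'data': 6.0, 'program': 5.75}
```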
By manually investigating the (poorly documented) link between individual programs and system areas, we examined the relationships among size, language, and system area. Figure 9 shows the variation between CICS Cobol and Natural in each of the main system areas examined.

Figure 9. System structure showing system areas with more than 25,000 lines of code and types of programming languages.

Recall that Figures 4, 5, and 6 revealed limited information about the distribution of "faults" in the overall system. However, by adding size data, the resulting graph in Figure 10 shows the startling result that C2 - one of the smallest system areas (with only 4,000 lines of code) - has the largest number of "faults." If the fault rates are graphed by system area, as in Figure 11, it is easy to see that C2 dominates the chart. In fact, Figure 11 shows that, compared with published industry figures, each system area except C2 is of very high quality; C2, however, is much worse than the industry average. Without size measurement, this important information would not be visible. Consequently, we recommended that the capture of size information be made standard practice at Company X.

Figure 10. System area size versus number of faults.

Figure 11. Normalized fault rates.
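The normalization behind Figure 11 is simply faults divided by size. In the sketch below, only the 4,000-line size of C2 and the roughly 80,000-line size of a typical area (80 programs of about 1,000 lines each) come from the text; the fault counts and the other area labels are hypothetical, chosen only to show how normalization exposes C2.

```python
areas = {
    # area name: (recorded "faults", lines of code)
    "C2": (70, 4_000),   # size from the text; fault count hypothetical
    "G1": (30, 80_000),  # hypothetical typical system area
    "W":  (25, 80_000),  # hypothetical typical system area
}

for name, (fault_count, loc) in areas.items():
    rate = fault_count / (loc / 1_000)  # faults per thousand lines of code
    print(f"{name}: {rate:.2f} faults per KLOC")
```

Raw counts alone (Figure 6) would rank C2 alongside much larger areas; it is the per-KLOC rate that makes it stand out.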
These charts represent examples of our analysis. In each case, improvements to standards for measurement and collection are suggested in light of the organizational goals. Our recommendations reflect the need to make more explicit a small degree of additional information that can result in a very large degree of additional management insight. The current amount of information allows a manager to determine the status of the system; the additional data would yield explanatory information that would allow managers to be proactive rather than reactive during maintenance.

Lessons learned in case studies

The Company X case study was one of several intended to validate the Smartie methodology, not only in terms of finding missing pieces in the methodology, but also by testing the practicality of Smartie for use in an industrial environment (the other case studies are not complete as of this writing). The first and most serious lesson learned in performing the case studies involved the lack of control. Because each investigation was retrospective, we could not

- require measurement of key productivity and quality variables,
- require uniformity or repetition of measurement,
- choose the project, team, or staff characteristics that might have eliminated confounding effects,
- choose or rewrite standards so that they were easy to apply and assess,
- choose the type of standard, or
- establish a baseline condition or environment against which to measure change.

The last point is the most crucial. Without a baseline, we cannot describe with confidence the effects of using (or not using) the standards. As a consequence, a great deal of expert (but nevertheless highly subjective) judgment was necessary in assessing the results of the case studies. It is also clear that a consistent level of control must be maintained throughout the period of the case study. There were many events, organizational and managerial as well as technical, that affected the outcome of the case study, and about which we had no input or control.

In particular, lack of control led to incomplete or inconsistent data. For example, a single problem report usually included several problems related only by the time period in which the problems occurred. Or the description of a single type of problem varied from report to report, depending on the documentation style of the maintainer and the time available to write the description. With such inconsistency, it is impossible to aggregate the problem reports or fault information in a meaningful way; it is also impossible to evaluate the root causes of problems and relate them to the use of standards. Indeed, the very lack of standards in data collection and reporting inhibits us from doing a thorough analysis.

A final difficulty with our assessment derives from the lack of information about cost. Although we have Company X data on the time required to fix a problem, the company did not keep careful records on the cost of implementation or maintenance at a level that allows us to understand the cost implications of standards use. That is, even if we can show that using standards is beneficial for product quality, we cannot assess the trade-offs between the increase in quality and the cost of achieving that quality. Without such information, managers in a production environment would be loath to adopt standards, even if the standards were certifiably effective according to the Smartie (or any other) methodology.

We learned a great deal from reviewing standards and administering case studies. The first and most startling result of our work is that many standards are not really standards at all. Many "standards" are reference or subjective requirements, suggesting that they are really guidelines (since degree of compliance cannot be evaluated). Organizations with such standards should revisit their goals and revise the standards to address the goals in a more objective way.

We also found wide variety in conformance from one employee to another as well as from one module to another. In one of our case studies, management assured us that all modules were 100 percent compliant with the company's own structured programming standards, since it was mandatory company practice. Our review revealed that only 58 percent of the modules complied with the standards, even though the standards were clearly stated and could be objectively evaluated.

A related issue is that of identifying the portion of the project affected by the standard and then examining conformance only within that portion. That is, some standards apply only to certain types of modules, so notions of conformance must be adjusted to consider only that part of the system that is subject to the standard in the first place. For example, if a standard applies only to interface modules, then 50 percent compliance should mean that only 50 percent of the interface modules comply, not that 50 percent of the system is comprised of interface modules and that all of them comply.
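That adjustment is easy to get wrong when compliance is reported as a single system-wide percentage. The short sketch below (module names and predicates are invented for illustration) computes compliance over only the modules to which the standard applies.

```python
def compliance(modules, applies_to, complies):
    """Fraction of the modules covered by a standard that actually comply."""
    covered = [m for m in modules if applies_to(m)]
    if not covered:
        return None  # the standard does not apply to this system at all
    return sum(1 for m in covered if complies(m)) / len(covered)

modules = [
    {"name": "ui_form", "kind": "interface", "follows_standard": True},
    {"name": "ui_menu", "kind": "interface", "follows_standard": False},
    {"name": "payroll", "kind": "batch", "follows_standard": False},  # out of scope
]

rate = compliance(
    modules,
    applies_to=lambda m: m["kind"] == "interface",
    complies=lambda m: m["follows_standard"],
)
print(f"Compliance with the interface-module standard: {rate:.0%}")  # 50%
```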
More generally, we found that we have a lot to learn from standards in other engineering disciplines. Our standards lack objective assessment criteria, involve more process than product, and are not always based on rigorous experimental results. Thus, we recommend that software engineering standards be reviewed and revised. The resulting standards should be cohesive collections of requirements to which conformance can be established objectively. Moreover, there should be a clearly stated benefit to each standard and a reference to the set of experiments or case studies demonstrating that benefit. Finally, software engineering standards should be better balanced, with more product requirements in relation to process and resource requirements. With standards expressed in this way, managers can use project objectives to guide standards' intention and implementation.

The Smartie recommendations and framework are practical and effective in identifying problems with standards and in making clear the kinds of changes that are needed. Our case studies have demonstrated that small, simple changes to standards writing, and especially to data collection standards, can improve significantly the quality of information about what is going on in a system and with a project. In particular, these simple changes can move the project from assessment to understanding.

Acknowledgments

We gratefully acknowledge the assistance of other participants in the SERC/DTI-funded Smartie project: Colum Devine, Jennifer Thornton, Katie Perrin, Derek Jaques, Danny McComish, Eric Trodd, Bev Littlewood, and Peter Mellor.

References

1. IEEE Software Engineering Technical Committee Newsletter, Vol. 11, No. 3, Jan. 1993, p. 4.
2. British Standards Institute, British Standards Guide: A Standard for Standards, London, 1981.
3. N. Fenton, S.L. Pfleeger, and R.L. Glass, "Science and Substance: A Challenge to Software Engineers," IEEE Software, Vol. 11, No. 4, July 1994, pp. 86-95.
4. International Standards Organization, ISO 9000: Quality Management and Quality Assurance Standards - Guidelines for Selection and Use, 1987 (with ISO 9001-9004).
5. Ministry of Defence Directorate of Standardization, Interim Defence Standard 00-55: The Procurement of Safety-Critical Software in Defence Equipment, Parts 1 and 2, Glasgow, Scotland, 1992.
6. P. Mellor, "Failures, Faults, and Changes in Dependability Measurement," J. Information and Software Technology, Vol. 34, No. 10, Oct. 1992, pp. 640-654.

Shari Lawrence Pfleeger is president of Systems/Software Inc., where she consults with industry and government on issues involving software engineering, process improvement, and measurement. At present, she is a visiting professorial research fellow at both the City University of London's Centre for Software Reliability and the University of North London's Computer Science Department; her work there includes evaluating the extent and effect of standards, and writing guidelines for software engineers on how to perform experiments and case studies. Pfleeger has been a principal scientist at the Contel Technology Center and at Mitre Corporation's Software Engineering Center. She holds a PhD in information technology and engineering from George Mason University and is a member of the IEEE, the IEEE Computer Society, and the ACM. She is a member of the editorial board for IEEE Software and of the advisory board of IEEE Spectrum, and is the founder of the ACM's Committee on the Status of Women and Minorities.

Norman Fenton is professor of computing science in the Centre for Software Reliability at City University. He was previously director of the Centre for Systems and Software Engineering at the South Bank University and a postdoctoral research fellow at Oxford University's Mathematical Institute. His research interests include software measurement and formal methods of software development. Fenton has written three books on these subjects and published many papers, has consulted widely to industry about metrics programs, and has also led numerous collaborative projects. He is editor of the Chapman and Hall Computer Science Research and Practice Series and is on the editorial board of the Software Quality Journal. He is a chartered engineer (member of the IEE), an associate fellow of the Institute of Mathematics and Its Applications, and a member of the IEEE Computer Society.

Stella Page is a research assistant in the Centre for Software Reliability at City University. She holds an MSc in computer science from University College London. She has participated in several collaborative projects, including Smartie.

Readers can contact the authors at the Centre for Software Reliability, City University, Northampton Square, London EC1V 0HB, England; e-mail {shari, nf, sp}@csr.city.ac.uk.