Evaluating Software Engineering Standards
Shari Lawrence Pfleeger, Norman Fenton, and Stella Page
Centre for Software Reliability
Software engineering standards abound; since 1976, the Software Engineering Standards Committee of the IEEE Computer Society has developed 19 standards in areas such as terminology, documentation, testing, verification and validation, reviews, and audits.¹ In 1992 alone, standards were completed for productivity and quality metrics, software maintenance, and CASE (computer-aided software engineering) tool selection. If we include work of the major national standards bodies throughout the world, there are in fact more than 250 software engineering standards. The existence of these standards raises some important questions. How do we know which practices to standardize? Since many of our projects produce less-than-desirable products, are the standards not working, or being ignored? Perhaps the answer is that standards have codified approaches whose effectiveness has not been rigorously and scientifically demonstrated. Rather, we have too often relied on anecdote, "gut feeling," the opinions of experts, or even flawed research, rather than on careful, rigorous software engineering experimentation.
This article reports on the results of the Smartie project (Standards and Methods
Assessment Using Rigorous Techniques in Industrial Environments), a collaborative
effort to propose a widely applicable procedure for the objective assessment of standards used in software development. We hope that, for a given environment and application area, Smartie will enable the identification of standards whose use is most
likely to lead to improvements in some aspect of software development processes and
products. In this article, we describe how we verified the practicality of the Smartie
framework by testing it with corporate partners.
Suppose your organization is considering the implementation of a standard. Smartie should help you to answer the following questions:
• What are the potential benefits of using the standard?
• Can we measure objectively the extent of any benefits that may result from its use?
• What are the related costs necessary to implement the standard?
• Do the costs exceed the benefits?
To that end, we present Smartie in three parts. First, we analyze what typical standards look like, both in software engineering and in other engineering disciplines. Next, we discuss how to evaluate a standard for its applicability and objectivity. Finally, we describe the results of a major industrial case study involving the reliability and maintainability of almost two million lines of code.

Software engineering standards

Standards organizations have developed standards for standards, including a definition of what a standard is. For example, the British Standards Institute defines a standard as

A technical specification or other document available to the public, drawn up with the cooperation and consensus or general approval of all interests affected by it, based on the consolidated results of science, technology and experience, aimed at the promotion of optimum community benefits.²

Do software engineering standards satisfy this definition? Not quite. Our standards are technical specifications available to the public, but they are not always drawn up with the consensus or general approval of all interests affected by them. For example, airline passengers were not consulted when standards were set for building the A320's fly-by-wire software, nor were electricity consumers polled when software standards for nuclear power stations were considered. Of course, the same could be said for other standards; for example, parents may not have been involved in the writing of safety standards for pushchairs (strollers). Nevertheless, the intention of a standard is to reflect the needs of the users or consumers as well as the practices of the builders. More importantly, our standards are not based on the consolidated results of science, technology, and experience.³ Programming languages are declared to be corporate or even national standards without case studies and experiments to demonstrate the costs and benefits of using them. Techniques such as cleanroom, formal specification, or object-oriented design are mandated before we determine under what circumstances they are most beneficial. Even when scientific analysis and evaluation exist, our standards rarely reference them. So even though our standards are laudably aimed at promoting community benefits, we do not insist on having those benefits demonstrated clearly and scientifically before the standard is published. Moreover, there is rarely a set of objective criteria that we can use to evaluate the proposed technique or process.

Thus, as Smartie researchers, we sought solutions to some of the problems with software engineering standards. We began our investigation by posing some simple questions that we wanted Smartie to help us answer:

• How good is the standard?
• What is affected by the standard?
• How can we determine compliance with the standard?
• What is the basis for the standard?

What is a standard and what does it mean for software engineering?

Often, a standard's size and complexity make it difficult to determine whether a particular organization is compliant. If partial compliance is allowed, measurement of the degree of compliance is difficult, if not impossible; consider, for example, the ISO 9000 series and the 14 major activities it promotes.⁴ The Smartie project suggests that large standards be considered as a set of smaller "ministandards." A ministandard is a standard with a cohesive, content-related set of requirements. In the remaining discussion, the term standard refers to a ministandard.

What is a good standard?

We reviewed dozens of software engineering standards, including international, national, corporate, and organizational standards, to see what we could learn. For each standard, we wanted to know

• On a given project, what standards are used?
• To what extent is a particular standard followed?
• If a standard is being used, is it effective? That is, is it making a difference in quality or productivity?

"Goodness" of the standard was difficult to determine, as it involved at least three distinct aspects. First, we wanted to know whether and how we can tell if the standard is being complied with. That is, a standard is not a good standard if there is no way of telling whether a particular organization, process, or piece of code complies with the standard. There are many examples of such "bad" standards. For instance, some testing standards require that all statements be tested "thoroughly"; without a clear definition of "thoroughly," we cannot determine compliance. Second, a standard is good only in terms of the success criteria set for it. In other words, we wanted to know what attributes of the final product (such as reliability or maintainability) are supposed to be improved by using the standard. And finally, we wanted to know the cost of applying the standard. After all, if compliance with the standard is so costly as to make its use impractical, or practical only in certain situations, then cost contributes to "goodness."

We developed a scheme to evaluate the degree of objectivity inherent in assessing compliance. We can classify each requirement being evaluated into one of four categories: reference only, subjective, partially objective, and completely objective. A reference-only requirement declares that something will happen, but there is no way to determine compliance; for example, "Unit testing shall be carried out." A subjective requirement is one in which only a subjective measure of conformance is possible; for example, "Unit testing shall be carried out effectively." A subjective requirement is an improvement over a reference-only requirement, but it is subject to the differing opinions of experts. A partially objective requirement involves a measure of conformance that is somewhat objective but still requires a degree of subjectivity; for example, "Unit testing shall be carried out so that all statements and the most probable paths are tested." An objective requirement is the most desirable kind, as conformance to it can be determined completely objectively; for example, "Unit testing shall be carried out so that all statements are tested."
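To make the scheme concrete, the short Python sketch below (ours, not part of Smartie or of any published standard) records a category judgment for each requirement of a ministandard and summarizes how objective the ministandard is overall. The example requirements are the unit-testing variants quoted above; the category labels remain human judgments.

from collections import Counter

CATEGORIES = ("reference only", "subjective", "partially objective", "objective")

# Hypothetical requirement list for a testing ministandard; each requirement
# carries the objectivity category a reviewer assigned to it.
requirements = [
    ("Unit testing shall be carried out.", "reference only"),
    ("Unit testing shall be carried out effectively.", "subjective"),
    ("Unit testing shall be carried out so that all statements and "
     "the most probable paths are tested.", "partially objective"),
    ("Unit testing shall be carried out so that all statements are tested.",
     "objective"),
]

def objectivity_profile(reqs):
    """Count requirements per category; the goal is to push mass toward 'objective'."""
    counts = Counter(category for _, category in reqs)
    return {c: counts.get(c, 0) for c in CATEGORIES}

print(objectivity_profile(requirements))
# {'reference only': 1, 'subjective': 1, 'partially objective': 1, 'objective': 1}

The more of a ministandard's requirements that end up in the last category, the more its compliance can be audited rather than argued.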
Clearly, our goal as a profession should be to produce standards with requirements that are as objective as possible. However, as Figure 1 illustrates, the Smartie review of the requirements in software engineering standards indicates that we are a long way from reaching that goal.

Figure 1. Degree of objectivity in software engineering standards' requirements.
To what do our standards apply?

To continue our investigation, Smartie researchers reviewed software engineering standards to determine what aspect of software development is affected by each standard. We considered four distinct categories of requirements in the standards: process, internal product, external product, and resources. Internal product requirements refer to such items as the code itself, while external product requirements refer to what the user experiences, such as reliability. For examples of these categories, we turn to the British Defence Standard DEF STD 00-55 (interim),⁵ issued by the Ministry of Defence (second revision in 1992) for the procurement of safety-critical software in defence equipment. Some are internal product requirements:

• Each module should have a single entry and exit.
• The code should be indented to show its structure.

Others are process requirements:

• The Design Team shall validate the Software Specification against Software Requirements by animation of the formal specification.

while some are resource requirements:

• All tools and support software ... shall have sufficient safety integrity.
• The Design Authority shall demonstrate ... that the seniority, authority, qualifications and experience of the staff to be employed on the project are satisfactory for the tasks assigned to them.

Typical of many software standards, DEF STD 00-55 has a mixture of all four types of requirements.

Are software standards like other standards?

Standardization has made life easier in many disciplines. Because of standard voltage and plugs, an electrical appliance from Germany will work properly in Italy. A liter of petrol in one country is the same as a liter in another, thanks to standard measurement. These standards, products of other engineering disciplines, offer lessons that we can learn as software engineers. So the next step in the Smartie process was to examine other engineering standards to see how they differ from those in software engineering. In particular, we asked

• Is the mix of product, process, and resource roughly the same?
• Is the mix of objective and nonobjective compliance evaluation roughly the same?

The answer to both questions is a resounding no. To show just how different software engineering standards are, Figure 2 compares the British standard for pushchair safety with DEF STD 00-55, a software safety standard.

Figure 2. A comparison of (a) BS4792 standard for safe pushchairs, with 29 requirements, and (b) DEF STD 00-55 for safe software, with 115 requirements, shows that software standards place more emphasis on process than on the final product.

The figure shows what is true generally: Software engineering standards are heavy on process and light on product, while other engineering standards are the reverse. That is, software engineering standards reflect the implicit assumption that using certain techniques and processes, in concert with "good" tools and people, will necessarily result in a good product. Other engineering disciplines have far less faith in the process; they insist on evaluating the final product in their standards.

Another major difference between our standards and those of other engineering disciplines is in the method of compliance assessment. Most other disciplines include in their standards a description of the method to be used to assess compliance; we do not. In other words, other engineers insist that the proof of the pudding is in the eating: Their standards describe how the eating is to be done, and what the pudding should taste like, look like, and feel like. By contrast, software engineers prescribe the recipe, the utensils, and the cooking techniques, and then assume that the pudding will taste good. If our current standards are not effective, it may be because we need more objective standards and a more balanced mix of process, product, and resource requirements.
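The same bookkeeping can be applied to what a requirement constrains rather than how objectively it can be checked. The Python sketch below is ours; the requirement texts are the DEF STD 00-55 examples quoted above, but the per-requirement labels and the percentage report are illustrative choices, not part of the standard. It tallies requirements by target category, which is essentially the comparison Figure 2 makes for BS4792 and DEF STD 00-55.

from collections import Counter

TARGETS = ("process", "internal product", "external product", "resource")

def target_profile(labelled_requirements):
    """Return the number of requirements aimed at each target category."""
    counts = Counter(target for _, target in labelled_requirements)
    return {t: counts.get(t, 0) for t in TARGETS}

def profile_as_percentages(profile):
    total = sum(profile.values()) or 1
    return {t: round(100.0 * n / total, 1) for t, n in profile.items()}

def_std_00_55_sample = [
    ("Each module should have a single entry and exit.", "internal product"),
    ("The code should be indented to show its structure.", "internal product"),
    ("The Design Team shall validate the Software Specification against "
     "Software Requirements by animation of the formal specification.", "process"),
    ("All tools and support software shall have sufficient safety integrity.",
     "resource"),
]

print(profile_as_percentages(target_profile(def_std_00_55_sample)))
# {'process': 25.0, 'internal product': 50.0, 'external product': 0.0, 'resource': 25.0}

A profile like this, computed over a whole standard rather than four sample requirements, makes the process-heavy bias of software standards visible at a glance.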
The proof of the pudding: Case studies
The Smartie framework includes far
more than we can describe here - for
example, guidelines for evaluating the
experiments and case studies on which
the standards are based. We address all
of these issues in Smartie technical reports, available from the Centre for Software Reliability. For the remainder of
this article, we focus on an aspect of
Smartie that distinguishes it from other
research on standards: its practicality.
Because Smartie includes industrial partners, we have evaluated the effectiveness
of Smartie itself by applying it to real-life situations. We present here two examples of the Smartie "reality check":
(1) applying the framework to written
standards for a major company and (2)
evaluating the use of standards to meet
specified goals.
Both examples involve Company X, a
large, nationwide company whose services depend on software. The company is
interested in using standards to enhance
its software’s reliability and maintainability. In the first example, we examine
some of the company’s programming
standards to see if they can be improved.
In the second example, we recommend
changes to the way data is collected and
analyzed, so that management can make
better decisions about reliability and
maintainability.
Reality check 1: How good are the written standards? We applied the Smartie techniques to a ministandard for using Cobol. The Cobol standard is part of a larger set of mandated standards, called programming guidelines, in the company's system development manual. Using the guidelines reputedly "facilitate[s] the production of clear, efficient and maintainable Cobol programs." The guidelines were based on expert opinion, not on experiments and case studies demonstrating their effectiveness in comparison with not following the guidelines. This document is clearly designed as a standard rather than a set of guidelines, since "enforceability of the standards is MANDATORY," with "any divergence" being "permanently recorded."

We focused on the layout and naming conventions, items clearly intended to make the code easier to maintain. Layout requirements such as the following can be measured in a completely objective fashion:

• Each statement should be terminated by a full stop.
• Only one verb should appear on any one line.
• Each sentence should commence in column 12 and on a new line, second and subsequent lines being neatly indented and aligned vertically. ... Exceptions are ELSE, which will start in the same column as its associated IF and which will appear on a line of its own.

Each line either conforms or does not, and the proportion of lines conforming to all layout requirements represents overall compliance with the standard.

On the other hand, measuring conformance to some naming conventions can be difficult, because such measurements are subjective, as is the case with

• Names must be meaningful.

The Smartie approach recommends that the standard be rewritten to make it more objective. For example, improvements might include

• Names must be English or scientific words which themselves appear as identifiable concepts in the specification document(s).
• Abbreviations of names must be consistent.
• Hyphens must be used to separate component parts of names.

Conformance measures can then use the proportion of names that conform to the standard. Analysis of the commenting requirements also led to recommendations that would improve the degree of objectivity in measuring conformance.
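Because the layout requirements above are completely objective, checking them can in principle be automated. The sketch below is our simplified illustration of that idea; the individual rule checks are deliberately crude stand-ins (they are not Company X's rules or any real Cobol checker), but the final proportion is exactly the compliance measure the ministandard calls for.

def terminated_by_full_stop(line: str) -> bool:
    # Simplified check for "each statement should be terminated by a full stop."
    return line.rstrip().endswith(".")

def at_most_one_verb(line: str) -> bool:
    # Simplified check for "only one verb should appear on any one line."
    verbs = ("MOVE", "PERFORM", "ADD", "DISPLAY", "COMPUTE")
    return sum(line.upper().count(v) for v in verbs) <= 1

def starts_in_column_12(line: str) -> bool:
    # Simplified check for the column-12 rule; ignores continuation lines.
    stripped = line.lstrip()
    return (not stripped) or (len(line) - len(stripped) + 1 == 12)

RULES = (terminated_by_full_stop, at_most_one_verb, starts_in_column_12)

def layout_compliance(source_lines):
    """Proportion of non-blank lines satisfying every layout rule."""
    lines = [l for l in source_lines if l.strip()]
    if not lines:
        return 1.0
    conforming = sum(all(rule(l) for rule in RULES) for l in lines)
    return conforming / len(lines)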
Reality check 2: Do the standards address the goals? Company X collects reliability and maintainability data for many of its systems. The company made available to Smartie all of its data relating to a large system essential to its business. Initiated in November 1987, the system had had 27 releases by the end of 1992. The 1.7 million lines of code for this system involve two programming languages: Cobol (both batch Cobol and CICS Cobol) and Natural (a 4GL). Less than a third of the code is Natural; recent growth (15.2 percent from 1991 to 1992) has been entirely in Cobol. Three corporate and organizational goals are addressed by measuring this system: (1) monitoring and improving product reliability, (2) monitoring and improving product maintainability, and (3) improving the overall development process. The first goal requires information about actual operational failures, while the second requires data on discovering and fixing faults. The third goal, process improvement, is at a higher level than the other two, so Smartie researchers focused primarily on reliability and maintainability as characteristics of process improvement.

The system runs continuously. Users report problems to a help desk whose staff determines whether the problem is a user error or a failure of the system to do something properly. Thus, all the data supplied to Smartie related to software failures rather than to documentation failures. The Smartie team received a complete set of failure information for 1991-92, so the discussion in this section refers to all 481 software failures recorded and fixed during that period. We reviewed the data to see how data collection and analysis standards addressed the overall goal of improving system reliability and maintainability. In many cases, we recommended a simple change that should yield additional, critical information in the future. The remainder of this section describes our findings.

A number is assigned to each "fault" report. We distinguish a fault (what the developer sees) from a failure (what the user sees).⁶ Here we use "fault" in quotation marks, since failures are labeled as faults. A typical data point is identified by a "fault" number, the week it was reported, the system area and fault type, the week the underlying cause was fixed and tested, and the actual number of hours to repair the problem (that is, the time from when the maintenance group decides to clear the "fault" until the time when the fix is tested and integrated with the rest of the system). Smartie researchers analyzed this data and made several recommendations about how to improve data collection and analysis to get a better picture of system maintainability. Nevertheless, the depth of data collection practiced at Company X is to be applauded. In particular, the distinction between hours-to-repair and time between problem-open ("week in") and problem-close ("week out") is a critical one that is not usually made in maintenance organizations.

The maintenance group designated 28 system areas to which underlying faults could be traced. Each system area name referred to a particular function of the system rather than to the system architecture. There was no documented mapping of programs or modules to system areas. A typical system area involved 80 programs, with each program consisting of 1,000 lines of code. The fault type indicated one of 11 classes, many of which were overlapping. In other words, the classes of faults were not orthogonal, so it was possible to find more than one fault class appropriate for a given fault. In addition, there was no direct, recorded link between "fault" and program in most cases. Nor was there information about program size or complexity.

Given this situation, we made two types of recommendations. First, we examined the existing data and suggested simple changes to clarify and separate issues. Second, we extracted additional information by hand from many of the programs. We used the new data to demonstrate that enhanced data collection could provide valuable management information not obtainable with the current forms and data.

Figure 3. Examples of an existing closure report and a proposed revision.

Existing closure report
Fault ID: F752
Reported: 18/6/92
Definition: Logically deleted work done records appear on enquiries
Description: Causes misleading information to users. Amend Additional Work Performed RDVIPG2A to ignore work done records with flag-amend = 1 or 2

Revised closure report
Fault ID: F752
Reported: 18/6/92
Definition: Logically deleted work done records appear on enquiries
Effect: Misleading information to users
Cause: Omission of appropriate flag variables for work done records
Change: Amend Additional Work Performed RDVIPG2A to ignore work done records with flag-amend = 1 or 2
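The revised closure report in Figure 3 is easy to capture as a structured record rather than as free text. The sketch below shows one possible shape; the class and field names are ours (Company X used paper-style forms, not this code), populated with the F752 example from the figure.

from dataclasses import dataclass

@dataclass
class ClosureReport:
    fault_id: str       # e.g. "F752"
    week_reported: str  # e.g. "18/6/92"
    definition: str     # what was observed
    effect: str         # consequence for users (added in the revised report)
    cause: str          # underlying fault (added in the revised report)
    change: str         # repair actually made

f752 = ClosureReport(
    fault_id="F752",
    week_reported="18/6/92",
    definition="Logically deleted work done records appear on enquiries",
    effect="Misleading information to users",
    cause="Omission of appropriate flag variables for work done records",
    change="Amend Additional Work Performed RDVIPG2A to ignore work done "
           "records with flag-amend = 1 or 2",
)

Separating effect from cause in distinct fields is what later makes amalgamation and trend analysis possible.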
Issue 1: Faults versus failures. Because
the cause of a problem (that is, a fault) is
not always distinguished from the evidence to the user of that problem (that
is, a failure), it is difficult to assess a system’s reliability or the degree of user satisfaction. Furthermore, with no mapping
from faults to failures, we cannot tell
which particular parts or aspects of the
system are responsible for most of the
problems users are encountering.
Recommendation: Define fault and
failure, and make sure the maintenance
staff understands the difference between
the two. Then, consider failure reports
separate from fault reports. For example,
a design problem discovered during a design review would be described in a fault
report; a problem in function discovered
by a user would be described in a failure
report.
Issue 2: Mapping from program to system area. Use of system areas to describe
faults is helpful, but a mapping is needed
from program name to system area. The
current information does not reveal
whether code in one system area leads to
problems in another system area. The
batch reporting and integration into the
system of problem repairs compounds
this difficulty because there is then no
recorded link from program to fault. This
information must have existed at some
point in the maintenance process in order
for the problem to be fixed; capturing it at
the time of discovery is much more efficient than trying to elicit it well after the
fact (and possibly incorrectly).
Recommendation: Separate the system into well-defined system areas and
provide a listing that maps each code
module to a system area. Then, as problems are reported, indicate the system
area affected. Finally, when the cause of
the problem is identified, document the
names of the program modules that
caused the problem.
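The recommendation above amounts to maintaining one explicit lookup table. A minimal sketch follows; the module-to-area assignments here are invented for illustration (the real assignments were exactly what Company X had not documented).

MODULE_TO_AREA = {
    "RDVIPG2A": "C2",   # illustrative assignment, not Company X's actual mapping
    "RDVIPG3B": "C2",
    "BATCH017": "G1",
}

def areas_implicated(changed_modules, module_to_area=MODULE_TO_AREA):
    """Return the set of system areas whose modules were changed for a repair."""
    return {module_to_area[m] for m in changed_modules if m in module_to_area}

print(areas_implicated(["RDVIPG2A", "BATCH017"]))  # the areas touched: C2 and G1

With such a table kept current, every closed problem can be attributed to a system area at the moment the causing modules are identified, rather than reconstructed after the fact.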
Issue 3: Ambiguity and informality inherent in the incident closure reports. The
description of each problem reflects the
creativity of the recorder rather than
standard aspects of the problem. This
lack of uniformity makes it impossible to
amalgamate the reports and examine
overall trends.
Recommendation: The problem description should include the manifestation, effect, and cause of the problem, as
shown in Figure 3. Such data would permit traceability and trend analysis.
Issue 4: Fault classification scheme. Because the scheme contains nonorthogonal categories, it is difficult for the maintainer to decide in which category a particular fault belongs. For this reason, some of the classifications may be arbitrary, resulting in a misleading picture when the faults are aggregated and tracked.
Recommendation: Redefine fault categories so that there is no ambiguity or
overlap between categories.
Issue 5: Unrecoverable data. By unrecoverable, we mean that the information
we need does not exist in some documented form in the organization. For example, most of the problem report forms
related a large collection of faults to a
large collection of programs that were
changed as a result. What appears to be
unrecoverable is the exact mapping of
program changes to a particular fault. On
the other hand, some information was recoverable, but with great difficulty. For
example, we re-created size information
manually from different parts of the data set supplied to us, and we could have related problem severity to problem cause if we had had enough time.
Recommendation: The data in Table 1 would be useful if it were explicit and available to the analysts.

Table 1. Recoverable (documented) data versus nonrecoverable (undocumented) data.

Recoverable:
• Size information for each module
• Static/complexity information for each module
• Mapping of faults to programs
• Severity categories

Nonrecoverable:
• Operational usage per system (needed for reliability assessment)
• Success/failure of fixes (needed to assess effectiveness of maintenance process)
• Number of repeated failures (needed for reliability assessment)
Figures 4 through 8 show what we can
learn from the existing data; Figures 9
through 11 show how much more
we can learn using the additional data.
Since we have neither mean-time-between-failure data nor operational
usage information, we cannot depict
reliability directly. As an approximation,
we examined the trend in the number of
“faults” received per week. Figure 4
shows that there is great variability in the
number of “faults” per week, suggesting
that there is no general improvement in
system reliability.
The chart in Figure 5 contrasts the
“faults” received with the “faults” addressed and resolved (“actioned”) in a
given week. Notice that there is wide
variation in the proportion of “faults”
that are actioned each week. In spite of
the lack-of-improvement trend, this chart
provides managers with useful information; they can use it to begin an investigation into which “faults” are handled
first and why.
Examining the number of “faults” per
system area is also useful, and we display
the breakdown in Figure 6. However,
there is not enough information to know
why particular system areas generate
more “faults” than others. Without information such as size, complexity, and
operational usage, we can draw no definitive conclusions. Similarly, an analysis of
“faults” by fault type revealed that data
and program faults dominated user,
query, and other faults. However, the
fault types are not orthogonal, so again
there is little that we can conclude.
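The summaries plotted in Figures 4 through 6 need nothing more than the closure records themselves. The sketch below shows the three aggregations in Python; the record field names (week_in, week_out, system_area) are illustrative rather than Company X's actual schema, and unresolved records are assumed to carry week_out = None.

from collections import Counter

def faults_per_week(records):
    """Figure 4: number of "faults" received in each week."""
    return Counter(r["week_in"] for r in records)

def proportion_actioned(records):
    """Figure 5: "faults" actioned in a week as a proportion of those received that week."""
    received = Counter(r["week_in"] for r in records)
    actioned = Counter(r["week_out"] for r in records if r["week_out"] is not None)
    return {week: actioned.get(week, 0) / n for week, n in received.items()}

def faults_per_area(records):
    """Figure 6: number of "faults" attributed to each system area."""
    return Counter(r["system_area"] for r in records)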
Figure 4. Reliability trend charting the number of faults received per week.

Figure 5. Charting the faults received and acted upon in the same week helps show how Company X deals with software failures.

Figure 6. Plotting the number of faults per system area helps isolate fault-prone system areas.

Figures 7 and 8 show, respectively, mean time to repair fault by system area and by fault type. This information highlights interesting variations, but our conclusions are still limited because of missing information about size.

Figure 7. Mean time to repair fault (by system area).

Figure 8. Mean time to repair fault (by fault type).

The previous charts contain only the information supplied to us explicitly by Company X. The following charts reflect additional information that was recovered manually. As you can see, this recovered information enriches the management decisions that can be made on the basis of the charts.

By manually investigating the (poorly documented) link between individual programs and system areas, we examined the relationships among size, language, and system area. Figure 9 shows the variation between CICS Cobol and Natural in each of the main system areas examined. Recall that Figures 4, 5, and 6 revealed limited information about the distribution of "faults" in the overall system. However, by adding size data, the resulting graph in Figure 10 shows the startling result that C2, one of the smallest system areas (with only 4,000 lines of code), has the largest number of "faults." If the fault rates are graphed by system area, as in Figure 11, it is easy to see that C2 dominates the chart. In fact, Figure 11 shows that, compared with published industry figures, each system area except C2 is of very high quality; C2, however, is much worse than the industry average. Without size measurement, this important information would not be visible. Consequently, we recommended that the capture of size information be made standard practice at Company X.

Figure 9. System structure showing system areas with more than 25,000 lines of code and types of programming languages (CICS Cobol, batch Cobol, and Natural).

Figure 10. System area size versus number of faults.

Figure 11. Normalized fault rates.

These charts represent examples of our analysis. In each case, improvements to standards for measurement and collection are suggested in light of the organizational goals. Our recommendations reflect the need to make more explicit a small degree of additional information that can result in a very large degree of additional management insight. The current amount of information allows a manager to determine the status of the system; the additional data would yield explanatory information that would allow managers to be proactive rather than reactive during maintenance.
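The step that makes Figure 11 possible is a single division: faults per system area over size in thousands of lines of code. The sketch below illustrates it; C2's size of 4,000 lines of code comes from the discussion above, while the fault counts, the other area names, and the flagging threshold are invented for illustration only.

def fault_density(faults_by_area, kloc_by_area):
    """Faults per thousand lines of code (KLOC) for each system area."""
    return {area: faults_by_area[area] / kloc_by_area[area] for area in faults_by_area}

faults_by_area = {"C2": 70, "X1": 12, "Y2": 9}        # invented counts
kloc_by_area = {"C2": 4.0, "X1": 150.0, "Y2": 120.0}  # C2 size from the text; others invented

for area, density in sorted(fault_density(faults_by_area, kloc_by_area).items(),
                            key=lambda item: item[1], reverse=True):
    flag = "  <-- far above the others" if density > 1.0 else ""
    print(f"{area}: {density:.2f} faults/KLOC{flag}")

Raw counts alone would hide C2; only the normalized rate exposes it as the outlier.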
Lessons learned in case studies
The Company X case study was one of several intended to validate the Smartie methodology, not only in terms of finding missing pieces in the methodology, but also by testing the practicality of Smartie for use in an industrial environment (the other case studies are not complete as of this writing). The first and most serious lesson learned in performing the case studies involved the lack of control. Because each investigation was retrospective, we could not

• require measurement of key productivity and quality variables,
• require uniformity or repetition of measurement,
• choose the project, team, or staff characteristics that might have eliminated confounding effects,
• choose or rewrite standards so that they were easy to apply and assess,
• choose the type of standard, or
• establish a baseline condition or environment against which to measure change.
The last point is the most crucial.
Without a baseline, we cannot describe
with confidence the effects of using (or
not using) the standards. As a consequence, a great deal of expert (but nevertheless highly subjective) judgment
was necessary in assessing the results of
the case studies. It is also clear that a consistent level of control must be maintained throughout the period of the case
study. There were many events, organizational and managerial as well as technical, that affected the outcome of the
case study, and about which we had no
input or control. In particular, lack of
control led to incomplete or inconsistent
data. For example, a single problem report usually included several problems
related only by the time period in which
the problems occurred. Or the description of a single type of problem varied
from report to report, depending on the
documentation style of the maintainer
and the time available to write the description. With such inconsistency, it is
impossible to aggregate the problem reports or fault information in a meaningful way; it is also impossible to evaluate
the root causes of problems and relate
them to the use of standards. Indeed, the
very lack of standards in data collection
and reporting inhibits us from doing a
thorough analysis.
A final difficulty with our assessment
derives from the lack of information
about cost. Although we have Company
X data on the time required to fix a
problem, the company did not keep
careful records on the cost of implementation or maintenance at a level that
allows us to understand the cost implications of standards use. That is, even if
we can show that using standards is beneficial for product quality, we cannot assess the trade-offs between the increase
in quality and the cost of achieving that
quality. Without such information,
managers in a production environment
would be loath to adopt standards, even
if the standards were certifiably effective according to the Smartie (or any
other) methodology.
We learned a great deal from
reviewing standards and administering case studies. The
first and most startling result of our work
is that many standards are not really standards at all. Many “standards” are reference or subjective requirements, suggesting that they are really guidelines
(since degree of compliance cannot be
evaluated). Organizations with such standards should revisit their goals and revise
the standards to address the goals in a
more objective way.
We also found wide variety in conformance from one employee to another as
well as from one module to another. In
one of our case studies, management assured us that all modules were 100 percent compliant with the company’s own
structured programming standards, since
it was mandatory company practice. Our
review revealed that only 58 percent of
the modules complied with the standards,
even though the standards were clearly
stated and could be objectively evaluated.
A related issue is that of identifying the
portion of the project affected by the standard and then examining conformance
only within that portion. That is, some
standards apply only to certain types of
modules, so notions of conformance must
be adjusted to consider only that part of
the system that is subject to the standard
in the first place. For example, if a standard applies only to interface modules,
then 50 percent compliance should mean
that only 50 percent of the interface modules comply, not that 50 percent of the
system is comprised of interface modules
and that all of them comply.
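Put as a calculation, the scoping rule simply says that the denominator contains only the modules the standard applies to. A small sketch, with invented module data, follows.

def compliance(modules, applies_to, complies):
    """Proportion of applicable modules that comply with the standard."""
    applicable = [m for m in modules if applies_to(m)]
    if not applicable:
        return None  # the standard does not apply to this system at all
    return sum(1 for m in applicable if complies(m)) / len(applicable)

modules = [
    {"name": "ui_main",  "kind": "interface", "complies": True},
    {"name": "ui_help",  "kind": "interface", "complies": False},
    {"name": "calc_tax", "kind": "batch",     "complies": False},
    {"name": "calc_pay", "kind": "batch",     "complies": False},
]

score = compliance(modules,
                   applies_to=lambda m: m["kind"] == "interface",
                   complies=lambda m: m["complies"])
print(f"{score:.0%} of the interface modules comply")
# Reports 50%, even though only one of the four modules in the system complies.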
More generally, we found that we have
a lot to learn from standards in other engineering disciplines. Our standards lack
objective assessment criteria, involve
more process than product, and are not
always based on rigorous experimental
results.
Thus, we recommend that software engineering standards be reviewed and revised. The resulting standards should be
cohesive collections of requirements to
which conformance can be established
objectively. Moreover, there should be a
clearly stated benefit to each standard
and a reference to the set of experiments
or case studies demonstrating that benefit. Finally, software engineering standards should be better balanced, with
more product requirements in relation to
process and resource requirements. With
standards expressed in this way, managers can use project objectives to guide
standards’ intention and implementation.
The Smartie recommendations and
framework are practical and effective in
identifying problems with standards and
in making clear the kinds of changes that
are needed. Our case studies have
demonstrated that small, simple changes
to standards writing, and especially to
data collection standards, can improve
significantly the quality of information
about what is going on in a system and
with a project. In particular, these simple
changes can move the project from assessment to understanding.
Acknowledgments
We gratefully acknowledge the assistance of other participants in the SERC/DTI-funded Smartie project: Colum Devine, Jennifer Thornton, Katie Perrin, Derek Jaques, Danny McComish, Eric Trodd, Bev Littlewood, and Peter Mellor.

References
1. IEEE Software Engineering Technical Committee Newsletter, Vol. 11, No. 3, Jan. 1993, p. 4.
2. British Standards Institute, British Standards Guide: A Standard for Standards, London, 1981.
3. N. Fenton, S.L. Pfleeger, and R.L. Glass, "Science and Substance: A Challenge to Software Engineers," IEEE Software, Vol. 11, No. 4, July 1994, pp. 86-95.
4. International Standards Organization, ISO 9000: Quality Management and Quality Assurance Standards - Guidelines for Selection and Use, 1987 (with ISO 9001-9004).
5. Ministry of Defence Directorate of Standardization, Interim Defence Standard 00-55: The Procurement of Safety-Critical Software in Defence Equipment, Parts 1-2, Glasgow, Scotland, 1992.
6. P. Mellor, "Failures, Faults, and Changes in Dependability Measurement," J. Information and Software Technology, Vol. 34, No. 10, Oct. 1992, pp. 640-654.

Shari Lawrence Pfleeger is president of Systems/Software Inc., where she consults with industry and government on issues involving software engineering, process improvement, and measurement. At present, she is a visiting professorial research fellow at both the City University of London's Centre for Software Reliability and the University of North London's Computer Science Department; her work there includes evaluating the extent and effect of standards, and writing guidelines for software engineers on how to perform experiments and case studies.
Pfleeger has been a principal scientist at the Contel Technology Center and at Mitre Corporation's Software Engineering Center. She holds a PhD in information technology and engineering from George Mason University and is a member of the IEEE, the IEEE Computer Society, and the ACM. She is a member of the editorial board for IEEE Software and of the advisory board of IEEE Spectrum, and is the founder of the ACM's Committee on the Status of Women and Minorities.

Norman Fenton is professor of computing science in the Centre for Software Reliability at City University. He was previously director of the Centre for Systems and Software Engineering at the South Bank University and a postdoctoral research fellow at Oxford University's Mathematical Institute. His research interests include software measurement and formal methods of software development.
Fenton has written three books on these subjects and published many papers, has consulted widely to industry about metrics programs, and has also led numerous collaborative projects. He is editor of the Chapman and Hall Computer Science Research and Practice Series and is on the editorial board of the Software Quality Journal. He is a chartered engineer (member of the IEE), an associate fellow of the Institute of Mathematics and Its Applications, and a member of the IEEE Computer Society.

Stella Page is a research assistant in the Centre for Software Reliability at City University. She holds an MSc in computer science from University College London. She has participated in several collaborative projects, including Smartie.
Readers can contact the authors at the Centre for Software Reliability, City University, Northampton Square, London EC1V 0HB, England; e-mail {shari, nf, sp}@csr.city.ac.uk.