Measuring Software Reliability in Practice: An Industry Case Study
our way through with the various stakeholders, clarifying and establishing the difference between hardware reliability and software reliability measurement, and explaining the quantification process for each one on its own merits. Software failure modes, as opposed to hardware failure modes, are either the result of design or logic design defects, or of unforeseen side effects arising from operational code and/or data bindings. Adding to the ambiguous relation between the two, a good fraction of software failure modes, especially when observed in the field, are cleared by hardware-controllable behavior. In most software failure effects leading to a processing halt, the hardware uses a watchdog timer to reset the system. While tracking software problems or defects is a useful methodology for monitoring the software development process and making sure the product is on track for on-time delivery at a controllable cost, it does not result in a predictive, user-oriented reliability metric. Software reliability quantification models, which relate failure data to the inherent software failure behavior, are used to specify, predict, estimate, and assess the reliability of software systems. A number of models are in use with good results [2], [3]. Customarily, these models are applied either during system test or during maintenance. They rely on collecting failure data, performing curve fitting to the appropriate model, and deriving assessment results or predictive measures for future behavior. In this respect we have faced a few challenges:
• setting release criteria that are consistently quantifiable without hindering the software release process
• re-using the selected model(s) applicable to one product across products with comparable software behaviors
• measuring the field behavior and mapping it to the predicted behavior
In our case, and given the time and cost constraints, we have devised an empirical method that helps to systematically collect the data, analyze it, and then demonstrate the target reliability metrics at hand. This method has been applied equally to the system under measurement in the testing phase and to field release deployments. The testing measurements help in working with the design and business teams on negotiating and assuring the software system release criteria. The field measurements help in negotiating, de-risking and demonstrating a set of target SLAs. The method relies on home-grown reliability expertise that uses both home-grown tracking tools and the CASRE tool [4]. The method uses the following well-known reliability engineering steps [2], [4]:
• Step 1 - Data Collection: collect and study the data (type and format) as well as the design process, to determine its nature and how the data relate to both the product and the process used to generate it
• Step 2 - Model Selection: select the appropriate reliability prediction model(s) type
• Step 3 - Model Parameters Estimation: generate the selected model(s) parameters using MLE (Maximum Likelihood Estimation) or LSE (Least Squares Estimation)
• Step 4 - Goodness-of-fit test: run the fitted model(s) and verify goodness-of-fit to check how appropriate the selected model(s) are
• Step 5 - Compute target measures: undetected remaining defects, time to next failure, and reliability, i.e. the probability that the next failure will show up after a given period of time, e.g. a million hours of system operation
• Step 6 - Tracking: set and validate measures for decision making and business tradeoffs over the product design and deployment life cycle

2.3. Measurement and tracking practices
As opposed to hardware, software reliability measurement and tracking is an open domain where no agreed-upon standards, guidelines or methods drive systematic product target setting, such that these targets drive the product design decisions, the product life cycle management criteria and milestones, the design process improvements, etc. For example, what would be the typical failure rate of a networking system as opposed to a business management system? Or how should a software system be designed for the following targets:
The system should meet 99.999% availability
Or the system shall have less than 5.2 minutes of cumulative outage time per year
Or the system shall have no more than 1 unplanned reset per year.
A number of quality assurance measures, e.g. defect density, code complexity and test coverage factor, are routinely used to assess the quality of software design practices [8]. However, to our knowledge none of the QA metrics have been used to drive software availability or MTBF estimations in a systematic way. For hardware systems, three parameters drive the service availability/reliability measurement: the failure rate or MTTF/MTBF, the fault coverage, i.e. the probability to detect and recover from hardware-rooted faults, and the MTTR, or mean time to restore service. For each of these metrics a number of proven and agreed-upon engineering practices exist [7].
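To make the link between these three parameters and the availability targets quoted earlier concrete, a small back-of-the-envelope calculation can be sketched as follows. This is illustrative only and not part of the paper's method; the 30-minute MTTR is a purely hypothetical value.

# Illustrative only: how the availability targets quoted above relate to MTBF and
# MTTR. The steady-state availability formula is standard; the five-nines figure is
# the example from the text, the MTTR is hypothetical.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def allowed_downtime_min_per_year(availability):
    return (1.0 - availability) * MINUTES_PER_YEAR

def availability(mtbf_hours, mttr_hours):
    # steady-state availability = MTBF / (MTBF + MTTR)
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(allowed_downtime_min_per_year(0.99999))   # ~5.26 minutes/year for five nines
# With a hypothetical 30-minute MTTR, five nines requires an MTBF of roughly
# 0.5 h * 0.99999 / (1 - 0.99999), i.e. about 50,000 hours.
print(availability(50_000, 0.5))                # ~0.99999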
These practices help in setting hardware system availability/reliability budgets, and in assessing and tracking them for improvements and business tradeoffs. When it comes to software systems, it is currently a challenge to set similar availability/reliability targets that help in assessing and tracking the software reliability behavior against a set of measurable targets all along its life cycle. For example, what is the typical MTTR of a software system given a certain type of software fault?
The scope of our case study stems from these challenges, where our goal was twofold:
• Get a good understanding of the systematic application of the method described above, with the aim of laying the path for its viability as a measurement method for large scale products.
• Try to establish a relationship between predicted and field-measured reliability so we get a good understanding of how feasible and viable product life cycle management can be.

3. Practical SRE Measurement of a Large Scale Networking System

The system under study is a large scale networking software system measuring hundreds of thousands to a few million LOC (excluding comments and blanks). The system is mostly written in C and it follows some of the best design engineering practices, such as strong architectural reuse, separation of concerns and high functional-level APIs, strong stress and robustness testing, as well as continuous QA monitoring.
From this section on, the focus of the paper will be on determining the product MTBF, assuming we can derive its availability based on a target MTTR. The MTTR estimation is based on the type of the failures and their effects on the system. The target SLA to be tracked is 99.999% availability.

3.1. Case study: Setting and de-risking product release criteria
The scope of our study was to set traceable and verifiable software reliability release criteria targets so that they fit 3 main business requirements: design requirements, time to market and cost of testing. In this respect we have faced a number of issues and questions in order to balance the cost of the SRE activity with its return on investment:
• Rely exclusively on existing design and testing resources, i.e. no additional effort other than what is already budgeted for design and testing could be used
• Define the approach to the defect discovery rates so it can be assessed to show acceptable target(s). In particular, if it takes longer than the time to market or costs more than the expected testing budget, what release decision should be made?
• How should the deployability criteria be handled when the data does not fit a given model or does not agree across different models?
• How should the criteria be handled when the goodness-of-fit of the models varies with the observed data, i.e. it takes different times to settle on target for the different models?
Given the business and product development constraints and targets, applying a full measurement process based on operational profiling [3] was not feasible. We have rather followed the steps described in the previous section to run an analytical modeling approach so we can estimate the target release criteria in terms of the most probable number of remaining defects. The latter in turn sets the expected MTBF for a target release time, from which we could derive the acceptable levels of the criteria for product release.

Step1: Data Collection

The raw data was collected with close to no overhead, directly from the test process defect tracking system. Based on a careful and close understanding of the software testing process and testing coverage, as well as the knowledge we have developed about the system reliability behavior through a detailed FMEA of its composing elements, we have used the failure counts method. Each defect effect was classified as either critical, causing a total failure of the system, or major, causing a partial failure of the system. A critical failure is a failure mode that causes complete impairment of the system, so a critical failure may cause a complete system failure or a complete traffic cut/corruption on all active interfaces. A major failure is a failure mode that causes a partial impairment of the node, so a major failure may cause a partial or complete traffic cut/loss or corruption on a subset of the active interfaces, or may affect network management, billing or alarming functions, or communication between cards.
For the system under study the number of testers and testing machines was constant, so the interval length between 2 defect counts was constant per calendar week.
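The paper does not show the data extraction itself; as a rough sketch of the preprocessing Step 1 implies, the snippet below (hypothetical record fields and illustrative records, not the study's defect tracking system) turns a defect export into the weekly cumulative critical-plus-major counts that the models are fitted to.

# Sketch only: building the weekly cumulative failure-count series used by the
# failure counts method. Field names ("week", "severity") are hypothetical and
# the records are illustrative.
from collections import Counter
from itertools import accumulate

defects = [
    {"week": 1, "severity": "critical"},
    {"week": 1, "severity": "major"},
    {"week": 2, "severity": "major"},
    {"week": 4, "severity": "critical"},
    # one record per defect found in system test
]

# Count critical + major defects per constant-length test interval (calendar week).
per_week = Counter(d["week"] for d in defects if d["severity"] in ("critical", "major"))
last_week = max(per_week)
weekly_counts = [per_week.get(w, 0) for w in range(1, last_week + 1)]
cumulative_counts = list(accumulate(weekly_counts))  # input to the SRGM curve fitting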
Figure 1 shows the cumulative failure count vs. the cumulative testing time measured in weeks. Figure 2 shows the behavior of the discovery rate. Before week 15 the data did not fit any exponential or binomial distribution, as it was still showing an increasing rate. It then started fitting an exponential distribution with a close to constant discovery rate.

[Figure 1 plot: cumulative discovered defects (Cum Sev 1, Cum Sev 2, Cum Sev 1+2) vs. testing weeks]
Figure 1 Defect count per testing week

[Figure 2 plot: total failures vs. cumulative time between failures in weeks, raw data with Jelinski-Moranda, Musa Basic and NHPP (TBE) fits, illustrating the defect discovery rate]

We have used the CASRE tool to run the various exponential and non-exponential model curve fittings and model parameter estimations. Figure 3 shows the curve fitting of two of the exponential models (Musa-MLE and NHPP-MLE). Both models show the failure count decaying and settling on a constant rate of less than 2 defects per week, with an estimated number of remaining defects of less than 18 and a probability of finding more defects equal, respectively, to 2% and 5%. So both models show a releasable product, as the probability to find more defects is very low (< 2% and 5%) with high confidence, and the time required to find the remaining defects is 44 testing weeks, or about 9 calendar weeks.

[Figure 3 plot: cumulative defect counts vs. testing weeks with fitted model curves; legend entries include Jelinski-Moranda - MLE, Poisson - MLE, Yamada - MLE and Musa Basic - MLE]
Figure 3 Curve fitting to Musa-MLE & NHPP Models

Step4: Goodness of fit analysis

We have spent about 2 to 3 weeks trialing all the models available in the CASRE tool in order to identify the ones suitable for our case study data. Figure 4 shows a summary of the 4 models, among those we have run, which converged on our data: Musa Basic, NHPP, Jelinski-Moranda and Yamada. Their parameter estimation results are ranked from the most to the least predictive model.

[Figure 4 table (excerpt): Musa Basic - MLE: 165, 18, 90%, 43; Jelinski-Moranda - MLE: 164, 17, 69%, 41]
Figure 4 Summary of Models Parameters Estimation

The models show a total expected number of critical and major defects varying between 150 and 165, and an estimated time to find the remaining defects of 2 to 9 weeks after 24 weeks of testing. As we can see, the Yamada model shows a very high probability of release, whereas Musa Basic and NHPP show a somewhat lower probability. The J-M model, though it converged with good confidence levels, shows a very low probability that the number of defects found is settling on a constant rate.
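The study relies on CASRE for the curve fitting; purely as a self-contained illustration of the kind of quantities Steps 2, 3 and 5 compute, the sketch below fits an exponential NHPP mean value function (Goel-Okumoto form, close in spirit to the exponential models named above) to a cumulative weekly defect count series by least squares (the LSE option of Step 3) and derives the expected remaining defects and the additional testing time to a residual target. The data is synthetic and all parameter values are illustrative, not the study's.

# Sketch, not the authors' tooling: least-squares (LSE) fit of an exponential NHPP
# mean value function m(t) = a * (1 - exp(-b t)) to cumulative weekly defect counts,
# then the release-criteria quantities discussed in the text. Synthetic data.
import numpy as np
from scipy.optimize import curve_fit

def mean_value(t, a, b):
    # a: total expected defects, b: per-week defect detection rate
    return a * (1.0 - np.exp(-b * t))

weeks = np.arange(1, 25)                                   # 24 weeks of system test
rng = np.random.default_rng(1)
cum_defects = mean_value(weeks, 170.0, 0.09) + rng.normal(0.0, 3.0, weeks.size)

(a_hat, b_hat), _ = curve_fit(mean_value, weeks, cum_defects, p0=[100.0, 0.05])

T = weeks[-1]
remaining = a_hat - mean_value(T, a_hat, b_hat)            # expected undetected defects
intensity = a_hat * b_hat * np.exp(-b_hat * T)             # current failure rate, defects/week
mtbf_weeks = 1.0 / intensity                               # rough MTBF in testing weeks

target_residual = 5.0                                      # hypothetical release criterion
t_release = -np.log(target_residual / a_hat) / b_hat       # week at which residual = target
extra_weeks = t_release - T                                # additional testing time needed
print(round(a_hat), round(remaining, 1), round(extra_weeks, 1))

Maximum likelihood estimation on the same data, the other option named in Step 3, would be the alternative estimation route.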
[Figure 5 plot: total failures for the predictive models goodness-of-fit analysis]
Figure 5 Predictive Models Goodness of fit analysis

We have run for each of these models a goodness-of-fit test to derive the confidence levels. Figure 5 shows the goodness-of-fit analysis results. We have then derived an expected initial field MTBF of 3.024E+04 hours for this release.

3.2. Correlating Predictive Measurement to Field Measurement
Setting reliability targets and tracking them in the field is one of the best proven practices we have been deploying with the hardware products. Our goal is to build a set of similar practices applicable to the software products. One of the questions we started investigating in this respect is the validation and calibration of the reliability predictive models for systematic use, so we can use them as an improvement feedback tool for design and for the various product life cycle business stakeholders. We have collected the release field defects based on the same definitions of defect categorization into severity levels described in the previous section. The data was retrieved from service-impacting outage reports. These have been extracted from the field problem reports that have been classified as service-impacting events.

[Figure 6 plot: Field Defect Arrival Behavior, critical and major defect arrivals with logarithmic trend lines vs. cumulative field operation hours per month]

The field release defect data has uncovered a total of 143 defects, among which 11 were critical and the rest were major. The field-measured MTBF was 5.91E+04 hours and the ratio of critical to major defects was 8%. The trend of the software release field defect arrival behavior is shown in Figure 6.
A first observation is that we had predicted a total of about 150 to 165 defects in the released software, and we have observed a total of 143. A second one is that the models had predicted an initial MTBF of 3.024E+04 hours and we have observed 2.52E+04 hours. This led us to run the analytical models on the field data to get a prediction for the next release.
Running the models on the field data, only Musa Basic, NHPP and the geometric model show settling to a constant failure rate (less than 3 defects). The rest of the models did not converge at all. Figure 7 captures the model parameter estimation analysis results, where these 3 models have settled on a constant failure rate of less than 3 defects after a cumulative operating time of 2.58E+06 hours. The converged models show that at that point the number of remaining defects is expected to be 8 and the time to find them is 6.E+06 hours, with very high confidence. From the models we then derived an expected field MTBF of higher than 8.4E+06 hours with very high confidence.
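The paper does not detail which fit statistics CASRE reports for Figure 5; as a generic illustration of the kind of check a goodness-of-fit analysis involves, the sketch below applies a Kolmogorov-Smirnov test to interfailure times against the exponential distribution implied by a fitted constant failure rate. The data and the MTBF value are synthetic, not the field measurements above.

# Sketch only: a generic goodness-of-fit check for a fitted constant failure rate.
# Once the arrival process settles, interfailure times should look exponential with
# mean MTBF; a Kolmogorov-Smirnov test gives a p-value for that hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
interfailure_hours = rng.exponential(scale=5.9e4, size=40)   # synthetic field data

mtbf_hat = interfailure_hours.mean()                          # fitted scale (MTBF)
statistic, p_value = stats.kstest(interfailure_hours, "expon", args=(0.0, mtbf_hat))
# (strictly, estimating the MTBF from the same sample calls for a Lilliefors-type
#  correction; this is only meant to show the shape of the check)

# A large p-value means the exponential (constant-rate) assumption is not rejected,
# which is one quantitative reading of "settled on a constant failure rate".
print(f"MTBF ~ {mtbf_hat:.3g} h, KS statistic {statistic:.3f}, p-value {p_value:.2f}")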
[Figure 7 plot: total failures vs. cumulative time between failures in hours for the main release field data, with raw data, fitted model curves (NHPP (TBE), Musa-Okumoto) and points at infinity]
Figure 7 Software Main Release - Field based Model Estimation

The subsequent release was a maintenance release and it showed a shorter time to settle on a constant failure rate, but it did display a typical early-life behavior. The first failure came only after 2.71E+04 hours, so it did not hit the expected target. The maintenance release started to settle on a constant failure rate around 6 months after its release, as compared to the main release which took about a year to settle.
An observation worth noting as part of our investigation is that for every release in this trial we have tried to run all the CASRE models. The model parameter estimations run on the maintenance release data are shown in Figure 9, and again only the Musa Basic, NHPP and geometric models did converge.

[Figure 8 plot: Field Defects Behavior - Maintenance Load, software field problem arrivals per month]

[Figure 9 plot: total failures vs. cumulative time between failures in hours for the maintenance release field data, with raw data, fitted model curves (Musa Basic, NHPP (TBE)) and points at infinity]
Figure 9 Maintenance Release - Models Parameters Estimation

4. Lessons Learned

A number of lessons learned through this case study are worth mentioning. We list them as we go through the steps of the reliability engineering method we have followed in our study.

Step1:
Problem: In this step the open question was: what if none of the models converge and the system test is already well advanced against its committed target deployment timelines? In fact we have faced this very question as we were conducting another case study in parallel, where the data took a very long time to converge. We went through a labor-intensive analysis of the data that took us almost half of the effort deployed to complete the work of delivering the deployment criteria.
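One practical complement to the Step 1 problem above, not used in the paper, is to run a trend test on the raw failure times before any model fitting: if the data shows no reliability growth yet, no growth model can be expected to converge. A minimal sketch of the standard Laplace trend test, on synthetic data, is given below.

# Sketch of a Laplace trend test (time-truncated case), a standard check that is
# not part of the paper's method: u well below 0 indicates reliability growth
# (so growth-model fitting makes sense), u near 0 a stable constant rate, and
# u above 0 deterioration. Failure times here are synthetic.
import math

def laplace_factor(failure_times, observation_end):
    # failure_times: cumulative times of the n failures observed in (0, T]
    n = len(failure_times)
    mean_time = sum(failure_times) / n
    return (mean_time - observation_end / 2.0) / (observation_end * math.sqrt(1.0 / (12.0 * n)))

failure_times = [5, 9, 12, 20, 26, 35, 48, 70, 95, 130]   # e.g. cumulative test weeks
u = laplace_factor(failure_times, observation_end=150.0)
print(round(u, 2))   # negative here: later failures are increasingly spread out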
A related question was how to get a reasonable and realistic common view of the system reliability behavior, given that some of these models may be overly conservative and may prevent the product from being released.
Elements of Solution: Run all the models and classify their results based not only on their goodness of fit to the type of data they run on, but also on the knowledge of the software testing process and practices as well as of the product functional behavior. A particular element that has been of great help in this step is the hardware FMEA performed on the system. Another element was the test coverage factor used to apportion test cases to functional, stress and robustness testing.

5. Conclusion and Future Directions

In this paper we have drawn a view of the state of the art of software reliability measurement in practice. We went through a detailed networking system case study in which we have applied software reliability engineering methods in an industrial setting, with the aim of identifying a few of the obstacles that impede their systematic deployment in practice and for large scale use. We have paved the path to the feasibility and viability of a systematic deployment for use as a field feedback loop improvement tool for product design and life cycle management. To this aim we have used target setting, assessment and tracking of software product reliability measures. We have also shown through our case study some of the lessons learned from our practice and how to overcome some of the still open issues we have encountered.
There are a number of investigation paths we are currently pursuing. Among these, our next step is to run various experiments with our approach to validate and calibrate its viable use for product reliability estimation in terms of product life cycle management, where realistic and measurable availability/reliability targets have to be set across releases of the same product. For example, one open question is how to tie this measurement together across releases, especially when the bulk of the software code stays the same and only a few parts have been added; the added parts can nevertheless generate a completely different behavior of the system. This is similar to a hardware product with a newly added module: each of the product and the module is measured separately, but the new assembled product may end up with a much higher failure rate than the sum of the two. For hardware systems the increase is quantifiable in terms of the complexity of the support circuitry that glues or puts in communication the two pieces. Second, in validating and calibrating our measurement approach we also have to verify its feasibility and viability across variants of architectures of the same product line. A third investigation area is how to use the same approach to set, measure and assess in a consistent way the reliability behavior of products coming from different product lines.

6. References

1. S. Benlarbi, "Estimating SLAs Availability/Reliability in Multi-Services IP Networks", 3rd International Service Availability Symposium, Helsinki, Finland, May 15-16, 2006.
2. M. Lyu (Ed.), Handbook of Software Reliability Engineering, McGraw-Hill, 1996, ISBN 0-07-039400-8.
3. J. Musa, Software Reliability Engineering: More Reliable Software Faster and Cheaper, 2nd Edition, AuthorHouse, ISBN 1-4184-9388-0.
4. A. P. Nikora, M. R. Lyu, "CASRE - A Windows Tool for Software Reliability Measurement", invited speaker, meeting of IFIP Working Group 10.4, Lake Tahoe, CA, June 22-25, 1995.
5. H. Penti and H. Atti, Failure Modes and Effects Analysis of Software-Based Automation Systems, Technical Report STUK-YTO-TR 190, Helsinki, Finland, August 2002.
6. N. F. Schneidewind, "Life Cycle Core Knowledge Requirements for Software Reliability Measurement", Reliability Review, The R & M Engineering Journal, Volume 23, Number 2, June 2003.
7. Telcordia GR-512-CORE, issue 2, 1998; GR-929-CORE, issue 8, 2002.
8. J. Tian, Software Quality Engineering: Testing, Quality Assurance, and Quantifiable Improvement, Wiley-IEEE Computer Society Press, 2005, ISBN 0471713457.
9. M. Tortorella, "Reliability Challenges of Converged Networks and Packet-Based Services", Industrial Engineering Working Paper, February 5, 2003.
10. K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, John Wiley & Sons, Second Edition, 2002.