

18th IEEE International Symposium on Software Reliability Engineering

Measuring Software Reliability in Practice: An Industry Case Study

Saida Benlarbi, Ph.D. and David Stortz


System Reliability Engineering Group
Alcatel-Lucent – IP Division
600 March Road, Kanata - Ottawa, Ontario, Canada K2K 2E6
Email: Saida.Benlarbi@alcatel-lucent.com

Abstract: Software reliability modeling techniques have been touted as a way of measuring and tracking software systems reliability. However, a number of issues make it difficult to use and apply these models in practice. In this paper we show some of the challenges and issues that we have encountered in applying these techniques to track and predict the reliability behavior of a networking software system at two different stages of its life cycle. Through our case study we show some of the practical solutions we have adopted to overcome these challenges. We also try to establish a relationship between the testing-phase-based reliability prediction and the field software reliability measurement in order to derive a systematic tracking approach.

Key Words: Software Reliability, Software Measurement, Networking Systems Reliability, Service Availability

1. Introduction

Compared to hardware reliability engineering, software reliability is an engineering field in its infancy. Contrary to the former, software reliability is not a widely used engineering practice in which a body of common knowledge and practices is known and applied on a routine basis to measure and predict software systems behavior [6]. On the other hand, more and more products and services rely on software for most of their functions. In particular, multi-service IP networks are being required to deliver unprecedented high volumes of diverse traffic that span two ends of the reliability requirements spectrum. On one end of the spectrum, real-time services such as voice and real-time TV are highly sensitive to delays and jitter. On the other end of the spectrum, best-effort services such as data content delivery require zero traffic loss and high traffic integrity. IP networking systems need to meet a set of specific design and reliability requirements in order to answer these requirements and be carrier- or premium-grade performance and reliability products. Such products will in turn secure low costs and increased profits for vendors, network operators and service providers [1],[7]. A number of issues and challenges hinder the setting of these requirements and their assessment against predictable and/or guaranteed levels of service reliability and availability. In order to deliver predictable services that are bound to meet Service Level Agreements (SLAs), every service class has to be defined in terms of availability, reliability and serviceability targets. These are to be measured in terms of the network architecture design and traffic management parameters such as topology and network configuration dimensioning, latency, packet loss, and bandwidth.

Over decades of development the Telecom industry came to agree on well established and standardized communication systems reliability estimation methods and techniques which allow quantifying their behavior and measuring it against SLA targets, e.g., Telcordia [6]. However, these estimation practices present a number of limits and challenges that need to be overcome in order to be able to leverage their use for multi-service IP networks SLA estimation. In [1] we have provided a summary of some of these challenges, in particular the issue of measuring the reliability of multi-service IP networking software systems. It is worth mentioning that most of the industry's established reliability engineering measurement practices rely on tracking hardware-rooted faults, for the obvious reason that the latter are well-known and predictable faults spawned from known hardware failure mechanisms. Despite the wealth of proposed software reliability engineering models and measurements [3], networking software systems reliability quantification remains an open issue for a number of reasons. These include confusion as to what to measure, when to measure it and how to measure it.

1071-9458/07 $25.00 © 2007 IEEE


DOI 10.1109/ISSRE.2007.33
The second major challenge is the lack of a well-known and established list of common software fault types that translate into predictable failure modes [4]. The third major issue is the lack of a common and agreed-upon set of quantitative methods that allow measuring, predicting and tracking the reliability behavior of these failure modes both in the lab and in the field, so that a consistent set of metrics can be tracked, quantified and used as a predictive tool as well as an improvement feedback tool.

In this paper we focus on one of the three challenges mentioned above. We present a case study in which we investigate the practical usage of software reliability engineering quantification as a means to demonstrate that a networking system meets a defined set of reliability SLAs and to correlate it to the field measured reliability. In particular, our focus was on trying to examine the relationship between the predictive measurements and the field measurements obtained from the same set of quantitative software reliability measurement methods, and to see if we can establish such method(s) as a systematic, cost-effective tracking method for the product life cycle.

Section 2 provides a brief framework for software reliability measurement in practice. Section 3 details our networking system case study and the steps we went through to gather, measure and analyze the networking system data. In Section 4 we use our case study to show some lessons learned from trying to firstly establish a common measurement method and secondly establish a relationship between the prediction measures and the field measures we have observed.


2. Software Reliability Measurement in Practice

2.1. Definition of Reliability Target Requirements

Compared to hardware reliability, software reliability engineering is not a well-known and established discipline relying on a set of practices that one would readily deploy all along the product life cycle [6]. Most software engineers can understand and grasp hardware reliability quantitative measurements. Also, arguably, software engineers do not have difficulties grasping software quality measurement in terms of design processes and best practices, e.g., measuring code complexity. However, when it comes to the intrinsic software reliability behavior, the question is how to define a set of measurable and verifiable reliability requirements that are applicable to a software system and that can be tested in the lab and then tracked in the field to be used as a feedback loop for performance tracking and improvement all along the product life cycle. For network vendors, one particular aspect of the software reliability field measurement is to be able to demonstrate to their customers proven reliability SLAs that allow both sides a high business gain and a high return on investment. Networking system failure modes attributable to a particular hardware infrastructure, software system or network element running a specific network protocol or networking service have to be identified and assessed in terms of their effects on the network service layers, which in turn have to be mapped to specific types of provider/end-user service failure effects.

In our case study, the first challenge we faced with software reliability SLA setting and assessment was the mapping of the telecom vendor's view to the service supplier's view, and how to get them to agree so that a common set of agreed-upon metrics is defined and can be quantified and verified. For example, the networking system has to meet 99.999% availability, or equivalently the system will have less than 5.2 minutes per year of cumulative outage time. In order to have a common view between the vendor and the service supplier as to how and where to attribute this reliability budget to the software system, a second challenge we worked through was defining how a service failure will be isolated to a given network element. The latter runs the software system which in turn transports the service between an ingress and an egress point in the system. We also made sure to distinguish between a physical port and/or channel and a logical channel/pipe, i.e., the software functionality supporting the IP interface as well as its connections transporting the running service. A third important challenge we worked through was to show that hardware metrics are not suitable for software quantification and tracking, and hence the software metrics have to be defined separately. They are based on a different methodology than the well-established hardware methodology in the Telecom industry, and they have to be properly identified in terms of their contribution to the target networking service SLA.
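As a point of reference for this five-nines target, the cumulative yearly outage budget follows directly from the availability figure; the short derivation below is illustrative and assumes a 365.25-day year.

\[
\text{Downtime per year} = (1 - A)\times 365.25 \times 24 \times 60 \ \text{min}
= (1 - 0.99999)\times 525{,}960 \ \text{min} \approx 5.26 \ \text{min},
\]

which is consistent with the roughly 5.2 minutes of cumulative outage time per year quoted above.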
2.2. Software Reliability Quantification

As in any other engineering discipline, software reliability engineering comes with a cost that has to be traded off against the business added value it brings. The cost and difficulty of establishing a software reliability measurement process deployed along the product life cycle is a second major challenge we have faced in our experience. In order to overcome this challenge we have worked, and still are working, our way through with the various stakeholders on clarifying and establishing the difference between hardware reliability and software reliability measurement, as well as explaining the quantification process for each one as it relates to its own merit. Software failure modes, as opposed to hardware failure modes, are either the result of design/logic defects or unforeseen side effects resulting from operational code and/or data bindings. Adding to the ambiguous relation between the two, a good fraction of software failure modes, especially when observed in the field, are cleared by hardware-controlled behavior: in most software failure effects leading to a processing halt, the hardware uses a watchdog timer to reset the system. While tracking software problems or defects is a useful methodology for monitoring the software development process and making sure the product is on track for on-time delivery with controllable cost, it does not result in a predictive, user-oriented reliability metric. Software reliability quantification models, which relate failure data to the inherent software failure behavior, are used to specify, predict, estimate, and assess the reliability of software systems. A number of models are in use with good results [2], [3]. Customarily, these models are applied either during system test or maintenance. They rely on collecting failure data, performing curve fitting to the appropriate model, and deriving assessment results or predictive measures for future behavior. In this respect we have faced a few challenges:
• setting a set of release criteria that would consistently be quantifiable without hindering the software release process;
• re-using the selected model(s) applicable to one product across products with comparable software behaviors;
• measuring the field behavior and mapping it to the predicted behavior.
In our case, and given the time and cost constraints, we have determined an empirical method that helps in systematically collecting the data, analyzing it and then demonstrating the target reliability metrics at hand. This method has been applied equally to the system under measurement in the testing phase and to field release deployments. The testing measurements help in working with the design and business teams on negotiating and assuring the software system release criteria. The field measurements help in negotiating, de-risking and demonstrating a set of target SLAs. The method relies on home-grown reliability expertise that uses both home-grown tracking tools and the CASRE tool [4]. The method uses the following well-known reliability engineering steps [2], [4] (an illustrative sketch of Steps 2, 3 and 5 follows the list):
• Step 1 - Data Collection: Collect and study the data (type and format) as well as the design process to determine its nature and how the data relate to both the product and the process used to generate it.
• Step 2 - Model Selection: Select the appropriate reliability prediction model(s) type.
• Step 3 - Model Parameters Estimation: Generate the selected model(s) parameters using MLE (Maximum Likelihood Estimation) or LSE (Least Squares Estimation).
• Step 4 - Goodness-of-Fit Test: Run the fitted model(s) and verify the goodness of fit to check how appropriate the selected model(s) is.
• Step 5 - Compute Target Measures: Undetected remaining defects, time to next failure, reliability, i.e., the probability that the next failure will show up after a given period of time, e.g., a million hours of system operation.
• Step 6 - Tracking: Set and validate measures for decision making and business tradeoffs for the product design and deployment life cycle.

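To make Steps 2, 3 and 5 concrete, the sketch below fits a simple exponential (Musa basic / Goel-Okumoto style) model to cumulative weekly defect counts by maximum likelihood and derives the expected remaining defects and a short-horizon reliability figure. It is a minimal illustration written in Python with SciPy, not the CASRE tooling used in the study, and the weekly counts are synthetic placeholders rather than the project's data.

```python
# Minimal sketch: fit an exponential NHPP (Goel-Okumoto style) software
# reliability growth model to weekly defect counts via MLE, then derive
# remaining-defect and short-horizon reliability estimates (Steps 2, 3, 5).
# Synthetic placeholder data; the study used CASRE on the product's counts.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

weekly = np.array([14, 13, 12, 11, 11, 10, 9, 9, 8, 7, 7, 6,
                   6, 5, 5, 4, 4, 3, 3, 3, 2, 2, 2, 1])   # defects per week
t = np.arange(1, len(weekly) + 1)                          # end of each testing week

def neg_log_lik(params):
    """Negative Poisson log-likelihood of m(t) = a * (1 - exp(-b t))
    for grouped (weekly) defect counts."""
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf
    m = a * (1.0 - np.exp(-b * t))
    dm = np.diff(np.concatenate(([0.0], m)))               # expected defects per week
    return -np.sum(weekly * np.log(dm) - dm - gammaln(weekly + 1))

# Step 3: maximum likelihood estimation of the model parameters
# (a = total expected defects, b = defect discovery rate).
fit = minimize(neg_log_lik, x0=[200.0, 0.05], method="Nelder-Mead")
a_hat, b_hat = fit.x

# Step 5: expected undetected remaining defects and the probability of
# seeing no further defect over a look-ahead window of 'delta' weeks.
found = weekly.sum()
remaining = a_hat - found
T, delta = t[-1], 4.0                                       # 4-week window (assumption)
expected_next = a_hat * (np.exp(-b_hat * T) - np.exp(-b_hat * (T + delta)))
p_no_failure = np.exp(-expected_next)

print(f"a (total expected defects) = {a_hat:.1f}, b = {b_hat:.3f}")
print(f"defects found = {found}, expected remaining = {remaining:.1f}")
print(f"P(no new defect in next {delta:.0f} weeks) = {p_no_failure:.2f}")
```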
2.3. Measurement and Tracking Practices

As opposed to hardware, software reliability measurement and tracking is an open domain where no agreed-upon standards, guidelines or methods drive the systematic setting of product targets, so that these targets drive the product design decisions, the product life cycle management criteria and milestones, the design process improvements, etc. For example, what would be the typical failure rates of a networking system as opposed to a business management system? Or how to design a software system for the following targets:
• the system should meet 99.999% availability;
• or the system will have less than 5.2 minutes of cumulative outage time per year;
• or the system will have no more than 1 unplanned reset per year.
A number of quality assurance measures, e.g., defect density, code complexity and test coverage factor, are routinely used to assess the quality of software design practices [8]. However, to our knowledge none of these QA metrics have been used to drive the software availability or MTBF estimations in a systematic way. For hardware systems, three parameters drive the service availability/reliability measurement: the failure rate or MTTF/MTBF, the fault coverage, i.e., the probability to detect and recover from hardware-rooted faults, and the MTTR or mean time to restore service. For each of these metrics a number of proven and agreed-upon engineering practices exist [7]. They help in setting hardware systems availability/reliability budgets, assessing them and tracking them for improvements and business tradeoffs. When it comes to software systems, it is currently a challenge to set similar availability/reliability targets that help in assessing and tracking the software reliability behavior against a set of measurable targets all along its life cycle. For example, what is the typical MTTR of a software system given a certain type of software faults?

The scope of our case study stems from these challenges, where our goal was twofold:
• get a good understanding of the systematic application of the method described above, with the aim to lay the path for its viability as a measurement method for large scale products;
• try to establish a relationship between predicted and field measured reliability so we get a good understanding of how feasible and viable a product life cycle management approach can be.


3. Practical SRE Measurement of a Large Scale Networking System

The system under study is a large scale networking software system measuring in the hundreds of thousands to a few millions of LOC (excluding comments and blanks). The system is mostly written in C and it follows some of the best design engineering practices, such as strong architectural reuse, separation of concerns and high-level functional APIs, strong stress and robustness testing, as well as continuous QA monitoring.

From this section on, the focus of the paper will be on determining the product MTBF, assuming we can derive its availability based on a target MTTR. The MTTR estimation is based on the type of the failures and their effects on the system. The target SLA to be tracked is 99.999% availability.
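The link between the MTBF determined in the remainder of this section and the five-nines availability target follows the standard steady-state relation; the numerical value below is purely illustrative and assumes a hypothetical 6-minute (0.1 hour) MTTR, not a figure taken from the product.

\[
A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
\quad\Rightarrow\quad
\text{MTBF} = \frac{A}{1 - A}\,\text{MTTR}
= \frac{0.99999}{10^{-5}} \times 0.1\ \text{h} \approx 10^{4}\ \text{hours}.
\]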
3.1. Case Study: Setting and De-risking Product Release Criteria

The scope of our study was to set traceable and verifiable software reliability release criteria targets so they fit three main business requirements: design requirements, time to market and cost of testing. In this respect we have faced a number of issues and questions in order to be able to balance the cost of the SRE activity with its return on investment:
• rely exclusively on existing design and testing resources, i.e., no additional effort other than the already budgeted design and testing effort could be used;
• define the approach to the defect discovery rates so it can be assessed to show acceptable target(s); in particular, if it takes longer than the time to market or more than the expected testing cost, what release decision should be made?
• how should the deployability criteria be handled when the data does not fit a given model or does not agree across different models?
• how to handle the criteria when the goodness of fit of the models varies with the observed data, i.e., it takes different times to settle on target for the different models?
Given the business and product development constraints and targets, applying a full measurement process based on operational profiling [3] was not feasible. We have rather followed the steps described in the previous section to run an analytical modeling approach so we can estimate the target release criteria in terms of the most probable number of remaining defects. The latter in turn sets the expected MTBF for a target release time, from which we could derive the acceptable levels of the criteria for product release.

Step 1: Data Collection

The raw data was collected with close to no overhead and directly from the test process defect tracking system. Based on a careful and close understanding of the software testing process and testing coverage, as well as the knowledge we have developed about the system reliability behavior through a detailed FMEA of its composing elements, we have used the failure counts method. Each defect effect was classified as either critical, causing a total failure of the system, or major, causing a partial failure of the system. A critical failure is a failure mode that causes a complete impairment of the system; so a critical failure may cause a complete system failure or a complete traffic cut/corruption on all active interfaces. A major failure is a failure mode that causes a partial impairment of the node; so a major failure may cause a partial or complete traffic cut/loss or corruption on a subset of the active interfaces, or may affect network management, billing, alarming functions or communication between cards.

For the system under study, the number of testers and testing machines was constant, so the interval length between two defect counts was constant per calendar week.

Figure 1 shows the cumulative failure count vs. cumulative testing time measured in weeks. Figure 2 shows the behavior of the discovery rate. Before week 15 the data was not fitting any exponential or binomial distribution, as it was still showing an increasing rate. Then it started fitting an exponential distribution with a close to constant discovery rate.

[Figure 1: Defect count per testing week. Cumulative discovered defects (Cum Sev 1, Cum Sev 2, Cum Sev 1+2) vs. testing weeks, with fitted Jelinski-Moranda, Poisson, Yamada and Musa Basic MLE curves]

[Figure 2: Behavior of the defects discovery rate. Total failures vs. cumulative time between failures (weeks), raw data with Jelinski-Moranda, Musa Basic and NHPP (TBE) fits]

Step 2: Model Selection & Step 3: Model Parameters Estimation

We have used the CASRE tool to run the curve fitting and model parameters estimation for the various exponential and non-exponential models. Figure 3 shows one of the exponential models' curve fittings (Musa-MLE and NHPP-MLE). Both models show a failure count decaying and settling on a constant rate of less than 2 defects per week, with an estimated number of remaining defects below 18 and probabilities equal respectively to 2% and 5%. So both models show a releasable product, as the probability to find more defects is very low (< 2% and 5%) with high confidence, and the time required to find the remaining defects is 44 testing weeks, or about 9 calendar weeks.

[Figure 3: Curve fitting to the Musa-MLE and NHPP-MLE models]

Step 4: Goodness-of-Fit Analysis

We have spent about 2 to 3 weeks trialing all the models available in the CASRE tool in order to identify the ones suitable for our case study data. Figure 4 shows a summary of the 4 models, among the ones we have run, which converged on our data. Their parameter estimation results are ranked from the most to the least predictive model. The four models are:
• Musa Basic, NHPP: exponential failure distribution;
• Jelinski-Moranda: binomial distribution;
• Yamada: Gamma S-shaped distribution.
The models started to converge at 15 testing weeks, and 3 out of the 4 show a releasable product, i.e., a low probability to find more defects in the subsequent weeks after 24 testing weeks, i.e., a quasi-constant discovery rate.

Figure 4: Summary of Models Parameters Estimation (Release Criteria Definition - Model Fitting Estimates)

  Model                     Total # defects   Remaining defects   Prob. to find remaining defects   Time needed to find remaining defects (testing weeks)
  Yamada - MLE              150               3                   98%                               10
  Poisson - MLE             165               18                  95%                               44
  Musa Basic - MLE          165               18                  90%                               43
  Jelinski-Moranda - MLE    164               17                  69%                               41

The models show a total expected number of critical and major defects varying between 150 and 165, and an estimated time to find the remaining defects of 2 to 9 weeks after 24 weeks of testing. As we can see, the Yamada model shows a very high probability of release, whereas Musa Basic and NHPP have a somewhat lower probability. The Jelinski-Moranda model, though it converged with good confidence levels, shows a very low probability that the number of defects found is settling on a constant rate.

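As an illustration of the kind of check performed in Step 4, the following sketch applies a chi-square goodness-of-fit test comparing weekly counts against the expected counts of a fitted exponential model. It is an assumed, simplified stand-in for the fit statistics CASRE reports, reusing the a_hat and b_hat parameter names and the synthetic weekly counts from the earlier fitting sketch.

```python
# Minimal sketch of a goodness-of-fit check: compare observed weekly defect
# counts with the expected counts of a fitted Goel-Okumoto model using an
# approximate chi-square statistic. Much simpler than CASRE's fit reports.
import numpy as np
from scipy.stats import chi2

def goodness_of_fit(weekly, a_hat, b_hat):
    """Approximate chi-square statistic and p-value for the fitted model
    m(t) = a * (1 - exp(-b t)) against observed weekly counts."""
    t = np.arange(1, len(weekly) + 1)
    m = a_hat * (1.0 - np.exp(-b_hat * t))
    expected = np.diff(np.concatenate(([0.0], m)))    # expected defects per week
    chi_sq = np.sum((weekly - expected) ** 2 / expected)
    dof = len(weekly) - 2                             # minus the 2 fitted parameters
    # Note: bins with very small expected counts would normally be pooled;
    # omitted here for brevity, so the p-value is only indicative.
    return chi_sq, chi2.sf(chi_sq, dof)

# Example with the synthetic counts and illustrative parameter values
# from the earlier fitting sketch.
weekly = np.array([14, 13, 12, 11, 11, 10, 9, 9, 8, 7, 7, 6,
                   6, 5, 5, 4, 4, 3, 3, 3, 2, 2, 2, 1])
chi_sq, p = goodness_of_fit(weekly, a_hat=170.0, b_hat=0.08)
print(f"chi-square = {chi_sq:.1f}, p-value = {p:.2f} "
      "(a high p-value means no evidence of lack of fit)")
```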
[Figure 5: Predictive models goodness-of-fit analysis. Total failures vs. test interval number, raw data with NHPP (intervals) and Yamada S-shaped fits]

We have run for each of these models a goodness-of-fit test to derive the confidence levels. Figure 5 shows the goodness-of-fit analysis results. We have then derived an expected initial field MTBF of 3.024E+04 hours for this release.

3.2. Correlating Predictive Measurement to Field Measurement

Setting reliability targets and tracking them in the field is one of the best proven practices we have been deploying with the hardware products. Our goal is to build a set of similar practices applicable to the software products. One of the questions we started investigating in this respect is the validation and calibration of the reliability predictive models for systematic use, so we can use them as an improvement feedback tool for design and the various product life cycle business stakeholders. We have collected the release field defects based on the same definitions of defect categorization in severity levels described in the previous section. The data was retrieved from service-impacting outage reports. These have been extracted from the field problem reports that have been classified as service-impacting events.

[Figure 6: Field release early-life defects arrival. SW field problems arrival per month (critical, major and their log trends) vs. cumulative field operation hours per month, showing a sharply decreasing number of failures per period of time, typical of early-life behavior]

The field release defect data has uncovered a total of 143 defects, among which 11 were critical and the rest were major. The field measured MTBF was 5.91E+04 hours and the ratio of critical to major defects was 8%. The trend of the software release field defects arrival behavior is shown in Figure 6.

A first observation is that we had predicted a total of about 150 to 165 defects in the released software, and we have observed a total of 143. A second one is that the models had predicted an initial MTBF of 3.024E+04 hours and we have observed 2.52E+04 hours. This led us to run the analytical models on the field data to get a prediction for the next release.

Running the models on the field data, only the Musa Basic, NHPP and geometric models show a settling to a constant failure rate (less than 3 defects). The rest of the models did not converge at all. Figure 7 captures the models parameters estimation analysis results, where these 3 models have settled on a constant failure rate behavior of less than 3 defects after a cumulative operating time of 2.58E+06 hours. The converged models show that at that point the remaining defects are expected to be 8, and the time to find them is 6.E+06 hours, with very high confidence. From the models we then derived an expected field MTBF higher than 8.4E+06 hours with a very high confidence.
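For the field side of the comparison, the observed MTBF figures above are of the simple form "cumulative in-service hours divided by number of service-impacting failures"; the small sketch below illustrates that bookkeeping with made-up monthly figures rather than the release's actual data.

```python
# Minimal sketch: derive an observed field MTBF from monthly in-service
# hours and counts of service-impacting failures (critical + major).
# The figures below are illustrative placeholders, not the release data.
from dataclasses import dataclass

@dataclass
class MonthRecord:
    in_service_hours: float   # operating hours accumulated that month
    critical: int             # critical (total-impairment) failures
    major: int                # major (partial-impairment) failures

months = [
    MonthRecord(1.2e4, 1, 4),
    MonthRecord(3.5e4, 0, 3),
    MonthRecord(6.0e4, 1, 2),
    MonthRecord(9.0e4, 0, 1),
]

total_hours = sum(m.in_service_hours for m in months)
total_failures = sum(m.critical + m.major for m in months)
mtbf = total_hours / total_failures
critical_ratio = sum(m.critical for m in months) / total_failures

print(f"observed field MTBF = {mtbf:.2e} hours")
print(f"critical-to-total failure ratio = {critical_ratio:.0%}")
```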
[Figure 7: Software Main Release - Field based Model Estimation. Total failures vs. cumulative time between failures (hours), raw field data with the converged model fits]

The subsequent release was a maintenance release, and it has shown a shorter time to settle on a constant failure rate but did display a typical early-life behavior. The first failure came only after 2.71E+04 hours, so it did not hit the expected target. The maintenance release started to settle on a constant failure rate only around 6 months after its release, as compared to the main release, which took about a year to settle.

An observation worth noting as part of our investigation is that for every release in this trial we have tried to run all the CASRE models. The models parameters estimations run on the maintenance release data are shown in Figure 9. Again, only the Musa Basic, NHPP and geometric models did converge.

[Figure 8: Maintenance load field defects arrival. SW field problems arrival per month (critical, major and their log trends) vs. cumulative field operation hours per month]

[Figure 9: Maintenance Release - Models Parameters Estimation. Total failures vs. cumulative time between failures (hours), raw field data with the converged model fits]

4. Lessons Learned

A number of lessons learned through this case study are worth mentioning. We list them as we go through the steps of the reliability engineering method we have followed in our study.

Step 1:
Problem: In this step the open question was: what if none of the models converge and the system test is already well advanced against its committed deployment timelines? In fact we have faced this very question as we were conducting another case study in parallel, where the data did take a very long time to converge. We went through a labor-intensive analysis of the data that took us almost half of the effort deployed to complete the work of delivering the deployment criteria.
Elements of Solution: To solve the labor-intensive analysis and classification process of the data, we started automating its generation based on various sources of information. For example, we have started using the product FMEA results and the product hardware data collection and analysis tools that are already streamlined for the hardware systems, so we speed up the understanding of the data behavior.

Step 2:
Problem: Two main issues emerged from this step:
1- We had to run all the models available in the CASRE tool all the time to make sure we get the highest number of converged models.
2- The disparity between the fits of the various classes of models was obvious as soon as the data started converging. The issue we faced was how to get a reasonable and realistic common view of the system reliability behavior, given that some of these models may be overly conservative and may prevent the product from being released.
Elements of Solution: Run all the models and classify their results not only based on their goodness of fit to the type of data they run on, but also on the knowledge of the software testing process and practices as well as the product functional behavior. A particular element that has been of great help in this step is the hardware FMEA performed on the system. Another element was the test coverage factor used to apportion test cases to functional, stress and robustness testing.

5. Conclusion and Future Directions

In this paper we have drawn a view on the state of the art of software reliability measurement in practice. We went through a detailed networking system case study in which we have applied software reliability engineering methods in an industrial setting, with the aim to identify a few of the obstacles that impede their systematic deployment in practice and for large scale use. We have paved the path to the feasibility and viability of a systematic deployment for use as a field feedback loop improvement tool for product design and life cycle management. To this aim we have used target setting, assessment and tracking of software product reliability measures. We have also shown through our case study some of the lessons learned from our practice and how to overcome some of the still open issues we have encountered.

There are a number of investigation paths we are currently pursuing. Among these, our next step is to run various experiments with our approach to validate and calibrate its viable use for product reliability estimation in terms of product life cycle management, where realistic and measurable availability/reliability targets have to be set across releases for the same product. For example, one open question is how to tie up this measurement across releases, especially when the bulk of the software code stays the same and only a few parts have been added. The added parts can nonetheless generate a completely different behavior of the system. This is similar to a hardware product with a newly added module: each of the product and the module is measured separately, but the newly assembled product may end up with a much higher failure rate than the sum of the two. For hardware systems the increase is quantifiable in terms of the complexity of the support circuitry that glues together or puts in communication the two pieces. Second, in validating and calibrating our measurement approach we also have to verify its feasibility and viability across variants of architectures of the same product line. A third investigation area is how to use the same approach to set, measure and assess in a consistent way the reliability behavior of products coming from different product lines.

6. References

1. S. Benlarbi, "Estimating SLAs Availability/Reliability in Multi-Services IP Networks", 3rd International Service Availability Symposium, Helsinki, Finland, May 15-16, 2006.
2. M. Lyu (Ed.), Handbook of Software Reliability Engineering, McGraw-Hill, 1996, ISBN 0-07-039400-8.
3. J. Musa, Software Reliability Engineering: More Reliable Software Faster and Cheaper, 2nd Edition, AuthorHouse, ISBN 1-4184-9388-0.
4. A. P. Nikora, M. R. Lyu, "CASRE - A Windows Tool for Software Reliability Measurement", invited speaker, meeting of IFIP Working Group 10.4, Lake Tahoe, CA, June 22-25, 1995.
5. H. Penti and H. Atti, Failure Modes and Effects Analysis of Software-Based Automation Systems, Technical Report STUK-YTO-TR 190, Helsinki, Finland, August 2002.
6. N. F. Schneidewind, "Life Cycle Core Knowledge Requirements for Software Reliability Measurement", Reliability Review, The R & M Engineering Journal, Volume 23, Number 2, June 2003.
7. Telcordia GR-512-CORE, Issue 2, 1998; GR-929-CORE, Issue 8, 2002.
8. J. Tian, Software Quality Engineering: Testing, Quality Assurance, and Quantifiable Improvement, Wiley-IEEE Computer Society Press, 2005, ISBN 0471713457.
9. M. Tortorella, "Reliability Challenges of Converged Networks and Packet-Based Services", Industrial Engineering Working Paper, February 5th, 2003.
10. K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, John Wiley & Sons, 2nd Edition, 2002.

