

Software reliability in real-time systems

by BHARAT BHARGAVA
University of Pittsburgh
Pittsburgh, Pennsylvania

ABSTRACT

This paper investigates techniques to enhance the continuity of operations of the enroute air traffic control system. First, the issues of software reliability and fault tolerance in real-time systems are discussed. Next, a list of problems associated with nonstop operation of the flight data processing (FDP) subsystem of the enroute air traffic control system is assembled, based on limited knowledge of the system, and possible solutions are suggested and analyzed. Implementation issues of the recovery block scheme, such as architecture, design of alternates and acceptance tests, and cost vs. reliability, are studied. Four architectures of the recovery block scheme are analyzed, and the results of a simulation study using the flight data processing subsystem as a test case are discussed.

* This work was performed under contract DOT-RC-92031 with the U.S. Department of Transportation.

INTRODUCTION

Most large software systems are error-prone. It is expected that the many efforts to improve software quality and reliability will reduce failures, but it is hard to say whether they will completely eliminate them. It is sometimes believed that maturity can provide freedom from software errors, but that is not borne out by the experience of extensively used operating systems. In the Bell Laboratories' Electronic Switching Systems (which employ hardware redundancy and thoroughly tested software), software faults accounted for approximately 20% of all failures. There are many types of errors that manifest themselves during some unusual data or machine state and lead to a system failure. Some of these errors are: computational (divide by zero), logical, definitional (array not subscripted properly), operational (wrong transaction entered and accepted), etc. For a complete list see the appendix.

Large software systems are also under constant modification for improving efficiency, and this leads to additional faults. In many applications, such as a computerized airline reservation system, isolated small breakdowns can be tolerated as long as the overall system remains operational. But in transportation applications such as airborne computers, air traffic control systems, and mass transit systems, only momentary cessation of service can be tolerated, no maintenance or manual repair activity is feasible, and incorrect results are unacceptable. In addition, the enroute air traffic control system collects and processes extensive amounts of valuable data, and safeguarding the data is more important than providing continuity of access at the risk of damage to such data. The architectures of nontrivial computer systems involve careful consideration of the tradeoffs among reliability, performance, and cost.

The need for reliability of operations in large automated real-time systems is becoming increasingly important, particularly in transportation applications and the nuclear industry. For such systems, it is important to have high confidence that the system will behave as expected in all possible environments. Of course, the development methodology has a great impact on the quality of the software as well as on the effort required for a thorough validation. An example of the use of a proper methodology in a nuclear power system is discussed in Ramamoorthy et al.8 Though it would be nice to start from scratch and develop reliable software with present technology, many systems currently in daily operational use cannot afford to wait for a new software system and architecture. And what if the design and development methodology is further advanced during the time the system is rebuilt? This "catch-22" situation requires us to deal with the present software and its problems in such a way that continuity of operations is maintained in the current system while the cost of providing reliability is reduced as parts of the software system are renovated. So software structures must be investigated that provide fault tolerance in addition to fault avoidance. A survey of fault-tolerant techniques is available10.

FAULT-TOLERANCE (ERROR DETECTION, DIAGNOSIS, RECOVERY, AND BYPASS)

The various steps of fault tolerance are illustrated in Figure 1. A detailed discussion of these steps is available10.

[Figure 1—Steps of fault-tolerant processing: error detection, then hardware diagnosis and reconfiguration, then recovery to a correct state, then software reconfiguration]

The purpose of error detection is to prevent or to recognize system failures. This can usually be achieved by designing proper checks on every critical step.

Reading the data, executing a loop or a decision branch, conversing with another program, and writing the data are examples of such critical functions. There is usually a high cost associated with such checks, and vigorous checking is usually avoided during normal operations. Only when a failure has already occurred must the complexity of the checks be increased. Some examples of checks used in software are as follows:

1. Dual or triple modular redundancy
2. Reversal check
3. Inference check
4. Watchdog timer
5. Address-in-bounds check
6. Acceptance check
7. Path testing
8. Branch testing
9. Structure testing
10. Special value testing
11. Symbolic testing

A detected error is only a symptom of the fault that caused it and does not necessarily identify that fault. Usually there is a many-to-many mapping between errors and their possible causes. In real-time systems, on-line diagnosis is not usually possible. It is important to observe that off-line diagnosis can be made easier if mechanisms based on diagnosis techniques are included in the software architecture to collect the necessary information before and during the time the fault occurs and thus create a list of suspicious program modules, data, and messages. This information can be used off-line for a thorough investigation of errors and the reasons behind them, and at the same time can drive the reconfiguration and bypass algorithms.

The variety of undetected errors which can exist in the design of a nontrivial software component is essentially infinite. Because of the complexity of the component, the relationship between any such error and its effect at run time may be very obscure. For these reasons, diagnosis of the original cause of software errors should be left to humans.

A different strategy is to ignore the fault and try to continue to provide service despite its continued presence. Given that a component has been designated as faulty and its further use should be avoided, one can either replace it (by a standby spare) or reconfigure the system so that its responsibilities are taken over by other components available in the system. Reconfiguration necessarily involves some degree of performance and/or function degradation.

Once the system goes into an erroneous state, its resources (program states, databases) should be brought to a correct state before further processing can continue. One approach is forward error recovery, in which an attempt is made to correct the erroneous states; compensation is a prime example of such a mechanism. Backward error recovery, in contrast, depends on the provision of recovery points, i.e., a means by which the state of processes can be recorded and later reinstated. This is a popular mechanism used by many practitioners, and its merit is due to two facts:

1. The questions of damage assessment and repair are treated quite separately from those of how to continue further service.
2. Damage assessment can be made independent of the type of fault.

The recovery block scheme9, checkpoints with an audit trail11, and complete database dumps11 are examples of backward error recovery. The implementation issues of some of these fault-tolerant techniques in the enroute air traffic control center software are the topic of discussion in the following sections.
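As a concrete illustration of backward error recovery, the following Python fragment (a minimal sketch of our own; the state layout and names are invented and not taken from the enroute software) records a recovery point before a critical step and reinstates it when an error is detected:

```python
import copy

class RecoveryPoint:
    """Backward error recovery: record a process state, reinstate it on error."""

    def __init__(self):
        self._saved = None

    def record(self, state):
        # Deep copy so that later updates cannot corrupt the recovery point.
        self._saved = copy.deepcopy(state)

    def reinstate(self):
        # No damage assessment is needed: the damaged state is discarded
        # wholesale, independent of the type of fault that produced it.
        return copy.deepcopy(self._saved)

# Usage: process a message; fall back to the recovery point on failure.
state = {"flight_plans": {"AA10": "FL330"}, "pending": []}
rp = RecoveryPoint()
rp.record(state)
try:
    state["flight_plans"]["AA10"] = None      # erroneous update
    raise ValueError("acceptance check failed")
except ValueError:
    state = rp.reinstate()                    # state is correct again
assert state["flight_plans"]["AA10"] == "FL330"
```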

PROBLEMS OF CONTINUITY OF OPERATIONS IN FLIGHT DATA PROCESSING SUBSYSTEM

The enroute air traffic control center (ARTCC) software consists of programs, databases, and messages (or transactions). We have tried to identify the problems encountered in continuing the operations of the flight data processing (FDP) subsystem of the enroute system by studying the reports12,13 and by discussing experiences with the staff of the Cleveland Enroute Center. We feel that there are four problem areas.

Program Errors

The programs of the FDP system have been designed and coded by a large number of programmers over a period of time. Though they have reached a mature stage, the possibility of some hidden errors still exists. These errors cannot be easily identified, and complete debugging cannot be ensured. There are also problems regarding the maintenance of the programs. Only an object version of each program module exists at an enroute center. When the ARTCC staff identifies an error, the object program is patched to fix the bug. The error is documented and sent back to the FAA technical center in Atlantic City for correction of the source program. The programmers at the Federal Aviation Administration (FAA) technical center determine the necessary JOVIAL statements to patch the source. They compile the new source version and send the new object version to the enroute center. Sometimes the new object version is not quite equivalent to the object version that was patched earlier. Moreover, any side effects of the compiler are not well known to the enroute center staff. The problem arises from the non-availability of a compiler and source code at each enroute center, though the reasons for this are understandable. Careful patching of bugs, with coordinated testing at the FAA technical center and the enroute center, has eliminated some such problems.

Database Inconsistency

There are four types of databases used in the FDP subsystem: static (airspace and airway definitions, bulk flight plans, etc.), dynamic (daily flight plans), real-time generated (changing tracks), and Compool tables (interprogram communications). There are two distinct reliability issues regarding databases: they should be protected against possible loss during a failure, and they should be consistent during normal processing. Since the flight plan database is under constant update traffic, it must be thoroughly checked at the time of retrieval (before actual use) and of update (before any changes are made).

Unexpected Inputs

Sometimes an incorrect or illegal input (or message) is entered into the system for processing. This sometimes causes an abort to be initiated. Since the checking and abort decisions are not always made at the source of the input, this leads to repeated entry of such input, causing repeated aborts and system failure. The problem is further complicated by the variety of input types and sources. Sometimes the erroneous input is passed in from another program (where it might be a legal output).

Synchronization of Multiple Concurrent Processes

In the multiprocessing/multiprogramming environment, processing sequence errors are a possibility. For example, flight data is supposed to be brought into core for processing; sometimes the processing of the data can start although not all the necessary data is yet available. Since locking/unlocking and interprogram communication are done by user programs, the interleaved operations on data by different processes are error-prone. There are three dimensions to such interaction: data objects, processes, and time. A complex case arises when m objects are being processed by n processes and the interaction is to be governed by a time order; for example, Object i should not be referenced by Process x until Objects j and k have been referenced by Process y. The possibility of failure of one or more processes makes this model all the more complex.

Because of space limitations in this paper, I limit the discussion to solutions dealing with program errors, unexpected inputs, and database inconsistency. Details of these, and possible techniques for ensuring the correctness of synchronization of multiple concurrent processes, are available4.

AVOIDANCE OF FAILURE DUE TO PROGRAM ERRORS

There are various mechanisms to protect against program errors. First of all, proper specification, design, and testing should lead to improved reliability and availability. An example of this research technique for developing the software of a nuclear power plant is available8. Secondly, we can study error diagnosis and correction mechanisms, but they are quite inefficient for real-time processing. However, with the development of the concepts of recovery lines6,10, two-phase commit5,6, and atomic actions6,10, system structures which are more fault-tolerant will become possible. A possibility exists that the system may keep runtime graphs describing the processes, operations, and interleaved interactions. This can provide suspicion lists with which errors can be diagnosed and corrected at runtime. Investigation of these techniques is the goal of our future research.

The third mechanism provides error bypass and software reconfiguration and is the most suitable for continuing operations in real-time systems. The recovery block scheme9 proposed by Randell is an example of this mechanism. The simplest structure of the recovery block is

Ensure AT
by P
Else by A1
...
Else by An
Else Error

where AT is the acceptance test condition that is expected to be met by the successful execution of either the primary program module P or one of the alternate modules A1, ..., An. The internal control structure of the recovery block transfers control to the next alternate if the test condition is not met by the previous primary or alternate. A hierarchy of acceptance tests based on their complexity can also be used; moreover, different tests can be used for different alternates. The acceptance test can also be augmented by a watchdog timer that monitors whether an acceptable result is furnished within a specified period. The timer can be implemented in either hardware or software.
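To make the control structure concrete, here is a minimal Python sketch of a recovery block executor (our own illustration; the enroute software itself is written in JOVIAL, and a production version would also need the watchdog timer and the state restoration described earlier):

```python
def recovery_block(acceptance_test, primary, *alternates):
    """Ensure AT by P, else by A1 ... else by An, else error."""
    for module in (primary, *alternates):
        try:
            result = module()
        except Exception:
            continue                      # a raised fault counts as a failed try
        if acceptance_test(result):       # AT met: deliver the result
            return result
        # AT not met: control transfers to the next alternate
    raise RuntimeError("recovery block failed: no module passed AT")

def insertion_sort(xs):
    """A simple, time-tested alternate for the efficient primary."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

data = [3, 1, 2]
ordered = recovery_block(
    lambda out: len(out) == len(data) and all(a <= b for a, b in zip(out, out[1:])),
    lambda: sorted(data),          # primary P: efficient library sort
    lambda: insertion_sort(data),  # alternate A1: simpler, independent method
)
```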

COST/RELIABILITY ANALYSIS OF RECOVERY BLOCK SCHEME

Even though the idea of the recovery block scheme is attractive, its implementation poses some challenges. Some of these are listed below and have been topics of our research.

1. Analysis of the recovery block scheme to identify the selection criteria for alternates and acceptance tests.
2. Selective implementation of the recovery block scheme to maximize the reliability/cost ratio.
3. Application of the recovery block scheme to the software structure and processing of a real application.
4. Design of proper alternates and acceptance tests for a given primary module.

We have studied the first two questions2. The problem is formulated as follows. Given a software structure with its cost of processing (in terms of execution time), failure probability, and processing requirements, and given several alternatives for implementing the recovery block scheme (choice of alternate and acceptance test characteristics, granularity of testing, etc.), how do we decide on the best architectural implementation, one which has

1. a low overhead during normal processing, and
2. a high potential for reconfiguration (in case of failure) at a low cost?

The evaluation standard used to select among implementation choices was as follows:

1. Reliability should achieve some minimum.
2. Absolute processing time should not exceed some maximum.
3. The cost/reliability index (CRI) should be as low as possible.

The first two evaluation standards are dictated by the performance criteria of the ARTCC software. The third reflects the marginal reliability gain and the extra processing cost required to obtain it. As a general rule, reliability is increased by early detection of errors (i.e., increased granularity of testing) and by adding a number of alternates, but the costs of testing, recovery, and running are simultaneously increased. Hence the cost-reliability index (CRI), defined as the ratio of the cost of executing the module to its reliability, is a suitable evaluation criterion. If CRI' and CRI are the cost-reliability indices of the new and the original software architectures, then the new design is considered better if CRI'/CRI < 1.

We make the following assumptions for the cost/reliability analysis of software architectures utilizing recovery blocks:

1. Acceptance tests are perfect. This assumption is made to eliminate complexity from the analysis. Since it is made for all types of architectures, it factors out for comparison purposes. Since perfect tests with a low cost are seldom available, we relax this assumption in our simulation study.
2. The probabilities of failure of the primary and the alternates are independent. This assumption is made because it is very difficult to obtain the dependencies between the failures of the primary and the alternates. Of course, this assumption can generally be true for independently designed modules.
3. The cost of recovering states after a failure is small compared with the cost of executing the module. Usually the state recovery cost involves popping stacks and resetting variables; moreover, use of the recovery cache7 makes state recovery possible at low cost.

We have studied the cost, reliability, and CRI of four types of software structures:

I. A primary module P with an acceptance test AT. The probability of failure of P is P_P, and the cost of executing P and testing it by AT is C_P.
II. A primary module P decomposed into m submodules P_1, P_2, ..., P_m. Each submodule has an acceptance test. The probability of failure of each submodule is P_{P_i}, and the cost of executing and testing P_i is C_{P_i} (for i = 1, ..., m).
III. A primary module P supported by n - 1 alternates A_1, A_2, ..., A_{n-1}. The probability of failure of each alternate is P_{A_i}, and the cost of executing and testing the alternate is C_{A_i} (for i = 1, ..., n - 1).
IV. A primary module P decomposed into m submodules P_1, P_2, ..., P_m. Each submodule has an alternate and an acceptance test. A primary submodule and its alternate form a block M_i (for i = 1, ..., m). The probabilities of failure of the components of each block M_i are defined as in Types II and III.

The reliability and cost equations for each type of software structure are given below. More details on arriving at these equations are available2.

Type I

  R_I = 1 - P_P        (P_P = probability of failure of P)
  C_I = C_P            (C_P = cost of executing and testing P)

Type II

[The original shows a diagram of the submodules P_1, P_2, ..., P_m, each followed by its acceptance test.]

  R_{II} = (1 - P_{P_1})(1 - P_{P_2}) \cdots (1 - P_{P_m})

  C_{II} = C_{P_1} + (1 - P_{P_1}) C_{P_2} + \cdots + (1 - P_{P_1})(1 - P_{P_2}) \cdots (1 - P_{P_{m-1}}) C_{P_m}

Type III

[The original shows a diagram of the primary P backed by the alternates A_1, A_2, ..., A_{n-1} behind a single acceptance test.]

  R_{III} = 1 - P_P P_{A_1} P_{A_2} \cdots P_{A_{n-1}}

  C_{III} = C_P + P_P C_{A_1} + \cdots + P_P P_{A_1} \cdots P_{A_{n-2}} C_{A_{n-1}}

Type IV

[The original shows a diagram of the blocks M_1, M_2, ..., M_m, where each block M_i consists of the submodule P_i, its alternate A_i, and an acceptance test.]

  R_{M_i} = 1 - P_{P_i} P_{A_i}
  C_{M_i} = C_{P_i} + P_{P_i} C_{A_i}

  R_{IV} = R_{M_1} R_{M_2} \cdots R_{M_m}

  C_{IV} = C_{M_1} + R_{M_1} C_{M_2} + \cdots + R_{M_1} R_{M_2} \cdots R_{M_{m-1}} C_{M_m}

Let us now see under what conditions the application of a recovery block scheme will improve the CRI of the software architecture.

Case 1 (Type III vs. Type I)

Selection criterion for employing alternates: It is obvious that using more alternates will increase reliability as well as execution cost, so the CRI is the most appropriate evaluation index. Let

  P_{A_i} / P_P = X   (typically X ≤ 1)
  C_{A_i} / C_P = Y   (typically Y ≥ 1)

Then

  CRI_{III} / CRI_I < 1   if   (1 - P_P) Y + P_P X < 1.

This condition is plotted in Figure 2.

[Figure 2—Domains for better cost/reliability index, plotted against the probability ratio X = P_A/P_P and the cost ratio Y = C_A/C_P]

Note that CRI_{III} = CRI_I when X = 1 and Y = 1; i.e., we gain nothing on the CRI if we use as the alternate a module of exactly the same power as the primary. We gain recovery power (CRI_{III} < CRI_I) if (X, Y) lies in the triangle with vertices (0, 0), (1/P_P, 0), and (0, 1/(1 - P_P)). Usually an alternate will be more reliable (X < 1) and cost more (Y > 1); this operating range appears as the shaded area of the figure. Moreover, contrary to our intuition, a recovery block may have a smaller CRI even with a less reliable alternate (X > 1), provided the alternate is cheap enough: the triangle with vertices (1, 0), (1/P_P, 0), and (1, 1). This implies that a less reliable alternate may be used to improve the CRI.
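The Case 1 criterion is easy to check numerically. The following Python sketch (the parameter values are our own, chosen purely for illustration) evaluates the Type I and Type III equations above and compares the outcome with the criterion:

```python
def cri_type1(Pp, Cp):
    """CRI of a plain primary with an acceptance test: C_I / R_I."""
    return Cp / (1.0 - Pp)

def cri_type3(Pp, Cp, X, Y, n):
    """CRI of a primary plus n-1 alternates with P_Ai = X*Pp, C_Ai = Y*Cp."""
    fail = Pp     # probability that the primary and all alternates so far failed
    cost = Cp
    for _ in range(n - 1):
        cost += fail * Y * Cp   # an alternate runs only if everything before failed
        fail *= X * Pp          # ... and itself fails with probability X*Pp
    return cost / (1.0 - fail)

Pp, Cp, X, Y = 0.3, 1.0, 0.5, 1.2
print(cri_type3(Pp, Cp, X, Y, n=2))    # 1.4241... : adding the alternate helps
print(cri_type1(Pp, Cp))               # 1.4286...
print((1 - Pp) * Y + Pp * X < 1)       # True: the criterion predicts the gain
```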

Case 2 (Type I vs. Type II)

Selection criterion for granularity of testing: Let

  P_{P_1} = P_{P_2} = \cdots = P_{P_m} = \bar{P}_P
  C_{P_1} = C_{P_2} = \cdots = C_{P_m} = \bar{C}_P

and note that

  1 - P_P = (1 - \bar{P}_P)^m   and   C_P \cong m \bar{C}_P.

Then

  \frac{CRI_{II}}{CRI_I} = \frac{\bar{C}_P [1 - (1 - \bar{P}_P)^m] (1 - P_P)}{C_P \bar{P}_P (1 - \bar{P}_P)^m} = \frac{P_P}{m \bar{P}_P} < 1,

since 1 - (1 - \bar{P}_P)^m < m \bar{P}_P always. Obviously the CRI will decrease as the granularity of testing (m) increases.

Case 3 (Type IV vs. Type II)

The selection criterion for employing alternates if the testing granularity can be increased: Let

  P_{P_1} = P_{P_2} = \cdots = P_{P_m} = \bar{P}
  C_{P_1} = C_{P_2} = \cdots = C_{P_m} = \bar{C}
  P_{A_i} / \bar{P} = X   (X ≤ 1)
  C_{A_i} / \bar{C} = Y   (Y ≥ 1)

Then CRI_{IV} / CRI_{II} < 1 if

  \frac{(1 + Y\bar{P}) [1 - (1 - X\bar{P}^2)^m] (1 - \bar{P})^m}{X\bar{P} [1 - (1 - \bar{P})^m] (1 - X\bar{P}^2)^m} < 1.

This condition is a function of three variables: X, Y, and m. If X = Y = 1 (the alternates have the same probability of failure and cost as the primary submodules), then CRI_{IV} / CRI_{II} < 1 if

  \frac{(1 + \bar{P}) [1 - (1 - \bar{P}^2)^m] (1 - \bar{P})^m}{\bar{P} [1 - (1 - \bar{P})^m] (1 - \bar{P}^2)^m} < 1,

or, equivalently, if

  (1 + \bar{P}) [1 - (1 - \bar{P}^2)^m] < \bar{P} [(1 + \bar{P})^m - (1 - \bar{P}^2)^m].

The left-hand side of this inequality becomes smaller relative to the right-hand side as m increases or, for a given m, as \bar{P} approaches zero. Thus the decision must be based on meeting the maximum allowable execution cost. Obviously the reliability of the Type IV architecture is always greater than that of Type II, but the execution cost of the Type IV architecture increases linearly with m.

Case 4 (Type III vs. Type IV)

Selection criterion (more alternates vs. more granularity): This is the most interesting case for the implementation of a recovery block scheme. We know that both adding an alternate and increasing granularity increase reliability as well as cost. It is more interesting to see how fast the CRI grows as the number (n) of alternates increases in the Type III architecture and as the granularity (m) of testing increases in the Type IV architecture. Let

  P_{P_1} = P_{P_2} = \cdots = P_{P_m} = P_P
  C_{P_1} = C_{P_2} = \cdots = C_{P_m} = C_P

For 1 ≤ j ≤ m:

  P_{A_j} / P_{P_j} = X   (X ≤ 1)   and   C_{A_j} / C_{P_j} = Y.

For 1 ≤ i ≤ n - 2:

  P_{A_{i+1}} / P_{A_i} = X   and   C_{A_{i+1}} / C_{A_i} = Y.

Let us further analyze the cost, reliability, and CRI of the Type III and Type IV architectures, considering their increments with n as the variable for Type III and with m as the variable for Type IV. ΔCRI is the increment in the cost-reliability index when n or m is increased; for example, ΔCRI from n to n + 1 is the difference between the CRI with n + 1 alternates and the CRI with n alternates.

For a Type III module, increasing the number of alternates from n to n + 1 gives the following increments:

  Δcost = P_P^n X^{n(n-1)/2} Y^n C_P
  Δreliability = (1 - P_P X^n) P_P^n X^{n(n-1)/2}

  ΔCRI = Δcost / Δreliability = \frac{C_P Y^n}{1 - P_P X^n}

This increment grows very fast as n increases.

For a Type IV module, increasing the granularity (which may not always be possible) from m to m + 1 gives the following increments. Let

  g_m = 1 - (1 - P_P)^{1/m}

(the failure probability of each of the m submodules). Then

  Δreliability = (1 - X g_{m+1}^2)^{m+1} - (1 - X g_m^2)^m

  Δcost = \frac{C_P}{m+1} \cdot \frac{(1 + Y g_{m+1}) [1 - (1 - X g_{m+1}^2)^{m+1}]}{X g_{m+1}^2} - \frac{C_P}{m} \cdot \frac{(1 + Y g_m) [1 - (1 - X g_m^2)^m]}{X g_m^2}

  ΔCRI = Δcost / Δreliability

This increment grows very slowly as m increases.
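The contrasting growth rates in Case 4 can also be checked numerically. The sketch below (Python; the parameter values are ours, chosen only for illustration) evaluates the two increment formulas:

```python
Pp, Cp, X, Y = 0.10, 1.0, 0.8, 1.2

def dcri_type3(n):
    """Increment in CRI when a Type III block goes from n to n+1 modules."""
    return Cp * Y**n / (1.0 - Pp * X**n)

def dcri_type4(m):
    """Increment in CRI when Type IV granularity goes from m to m+1."""
    def g(k):     # failure probability of each of k submodules
        return 1.0 - (1.0 - Pp) ** (1.0 / k)
    def rel(k):   # block reliability: (1 - X g^2)^k
        return (1.0 - X * g(k) ** 2) ** k
    def cost(k):  # execution cost from the C_IV equation
        return (Cp / k) * (1 + Y * g(k)) * (1 - rel(k)) / (X * g(k) ** 2)
    return (cost(m + 1) - cost(m)) / (rel(m + 1) - rel(m))

for k in (1, 2, 4, 8):
    print(k, round(dcri_type3(k), 2), round(dcri_type4(k), 2))
```

With these values the Type III increment grows steadily with n (about 1.30 at n = 1 to 4.37 at n = 8), while the Type IV increment stays essentially flat in m; under this cost model it is even negative, since finer granularity also cuts the cost of restarts.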

From the above analysis, we draw the following conclusions:

1. Increase the number of alternates until the minimum required reliability is achieved.
2. Increase the granularity of testing until the maximum cost has been reached.
3. If additional reliability is required beyond a certain value of m and n (depending on X, Y, P_P, and C_P), it will be preferable to increase m.

Relaxing the assumption of perfect acceptance tests introduces four new parameters into our cost analysis:

AT accepts the result when the module is correct: probability u.
AT rejects the result when the module is correct: probability v.
AT accepts the result when the module is incorrect: probability s.
AT rejects the result when the module is incorrect: probability t.

We are currently working on generalizing our analysis, and this will be the subject of a future paper.

The fourth issue regarding the implementation of the recovery block scheme is to develop a methodology for systematically designing proper alternates and acceptance tests for a given primary module. We outline our present understanding briefly in the following paragraphs.

One can design alternates which:

1. Execute in a subset of the domain of the primary, with the failure probabilities of the primary and the alternate independent of each other (alternate more specialized than the primary);
2. Execute in a domain which does not overlap with the domain of the primary; or
3. Execute in a domain which covers the domain of the primary and more (alternate more general than the primary).

The three cases are shown in Figure 3.

[Figure 3—Possible domains of primary and alternate program modules: the domain of the primary overlapped by the domains of alternates (a), (b), and (c)]

In addition, one would like to design acceptance tests which cost 10 to 15% of the execution of the primary but cover precisely the domain of the primary (or the union of the domains of the primary and the alternates).

We give three strategies for designing an alternate. One strategy is to utilize the most recently developed module as the primary and the one which has been in use and is time-tested as the first alternate. The primary module will be very efficient and its components optimized, but its reliability is questionable.

Another strategy is to have the primary and alternate modules utilize different methods to achieve the same (or a similar) function. For example, the primary module may use the complex but efficient quicksort; the alternates may be bubble sort, merge sort, or tree sort.

A third approach is to allow limited capabilities in the alternates and provide extensive facilities in the primary. As a deliberately simple example, the primary module might sort negative and positive numbers and also manage duplicates while the alternates handle only positive numbers; or the primary might handle numerical as well as erroneous (nonnumerical) data while the alternate works on a limited range of numerical data.

As regards the design of proper acceptance tests, we agree that the test has to be simpler than the primary block; moreover, the testing procedure must be simple and extremely reliable. But then how well can the acceptance test check the results? For example, if we use a checksum as the acceptance test for a sort module, we can be sure that the output contains all the data that came in as input, but we cannot be sure that the numbers are properly sorted. If instead the acceptance test checks that the last number is greater than the middle one, which is greater than the first, we can be sure that the output is possibly in ascending order, but we cannot be sure that the output contains all the numbers properly sorted. If we go to the extreme and check all numbers, the acceptance test becomes as complex as the primary module and very costly.

A different approach to the design of acceptance tests could be based on the concept of time-out, which is used in the enroute system hardware. For example, we could obtain worst-case bounds on the algorithms of the primary module (e.g., quicksort takes about 2n log2 n comparisons, and binary search requires log2 n comparisons) and translate them into the maximum time required to complete the execution of the module. If there is a fault in the module, the time-out mechanism can be used to switch to the alternates.

Another interesting question is whether the acceptance test of the whole block should be unique, or whether we should use a hierarchy of tests based on graceful degradation of the system's performance, with one level for the primary and another level for the alternate module, and so on. We also believe that the acceptance tests have to be adaptive and can change as we gain more and more confidence in the system.
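The strength/cost tradeoff for acceptance tests on a sort module can be made concrete. Below is an illustrative Python sketch (the tests and the time-out wrapper are our own renderings of the ideas above, not code from the enroute system):

```python
import time

def at_checksum(inp, out):
    """Cheap test: output holds exactly the input total; order is unchecked.
    A checksum can even be fooled: sum([1, 2]) == sum([0, 3])."""
    return len(out) == len(inp) and sum(out) == sum(inp)

def at_spot_order(inp, out):
    """Cheap test: first <= middle <= last; completeness is unchecked."""
    return len(out) == len(inp) and out[0] <= out[len(out) // 2] <= out[-1]

def at_full(inp, out):
    """Extreme test: as complex (and as costly) as the sort module itself."""
    return out == sorted(inp)

def at_timeout(module, inp, bound_seconds):
    """Time-out test: reject a result that took longer than the worst-case
    bound derived from the algorithm (e.g., about 2n*log2(n) comparisons)."""
    start = time.monotonic()
    out = module(inp)
    return out if time.monotonic() - start <= bound_seconds else None
```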

While considering the feasibility of employing the recovery block architecture in the enroute system, we observe that some of the subsystems are computation oriented (e.g., they deal with numerical analysis issues such as correlation, convergence, etc.) while others are database-update and transaction oriented. The application of acceptance tests as discussed earlier could be inexpensive for the computation-oriented subsystems, but for the heavy-traffic database system, which is under constant modification and heavy reference, integrity monitoring could be expensive. So an interesting question arises: Is it simpler and cheaper to check the input stream or the program flow? In a report3 I have partially answered this question by concluding that compile-time validation is better than run-time or post-execution validation.

APPLICATION OF RECOVERY BLOCK SCHEME ON A SIMULATED FLIGHT DATA PROCESSING SUBSYSTEM

Even though the analytical approach allows us to identify key variables such as the cost of alternates and acceptance tests, the probability of failure, etc., it makes substantial assumptions about the structure of the system to keep the analysis simple. In order to study the impact of the recovery block scheme on the software structure of ARTCC, one must relax the following two assumptions:

1. All modules have identical probability of failure and execution cost.
2. Acceptance tests are perfect.

To do this, we adopted a simulation approach in which the above parameters were made variables. We selected the flight data message processing (FDP) subsystem of ARTCC for the simulation study. The FDP consists of 11 re-entrant subprograms which call on a pool of 23 subroutines. The memory limitations of the DEC-10 forced us to consider only a part of FDP, and the simulated FDP contained five subprograms: FDP, DAM, DSP, DFA, and DUZ. A detailed description is available13. Consequently, only the 15 subroutines called by these subprograms were used. It is important to note that only the processing requirements of the subprograms and subroutines were simulated; i.e., the subprograms and subroutines were treated as black boxes triggered by messages generated randomly by the flight plan generator. Thus no attempt was made to code the actual functions of the subprograms or subroutines. The simulation study undertaken here can easily be used for studying any subsystem of the ARTCC software. Figure 4 shows the flight data message processing subprograms and subroutines. Details on the execution time and size of each subprogram and subroutine are available1.

[Figure 4—FDP subsystem subprograms flow, showing the applications subsystem and disk storage together with the track data processing, route conversion, posting determination, inquiry processing (supervisory and interfacility outputs), preliminary processing, flight status alerts, display channel outputs, and supervisory and interfacility outputs subsystems]

Two modes of FDP processing were simulated:

1. FDP processing with an acceptance test at each subroutine (Type II module).
2. FDP processing with an acceptance test at each subroutine and, in addition, an alternate for each subroutine; the probabilities of failure of the primary and the alternate were independent (Type IV module).

In the first mode, the following actions are performed:

1. A message is generated randomly by the flight plan generator.
2. Based on FDP processing requirements, appropriate messages are generated to initiate the processing of subprograms (for example, one message from the flight plan generator produces 2.2 amendment messages). Subprograms in turn call subroutines.
3. The processing of a subroutine fails randomly.
4. If a subprogram call finds the subroutine in a failed state, all partial processing completed by the subprogram for the current message is lost, and the subprogram starts executing from the beginning (as if the current message were reissued). Note that all previous subroutine calls are also repeated. One could also abort the system at this point; in our simulation, an abort simply means reexecution.
5. Action 4 is repeated until the processing of the current message is completed.

In the second mode of FDP processing, the following actions are performed:

1.-3. Same as in mode 1.
4. If a subroutine fails, its alternate is executed. If the alternate subroutine also fails, another alternate can be tried, or else the subprogram loses its partial processing and restarts from the beginning. If the alternate succeeds, normal processing continues.
5. Action 4 is repeated until the processing of the current message is completed.
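A toy version of the two simulated modes can be written in a few lines. In the Python sketch below (the original study used SIMULA on a DEC-10; here a random driver and made-up failure rates stand in for the flight plan generator and the black-box subroutines):

```python
import random

def run_message(n_subroutines, p_fail, with_alternates, p_alt_fail=0.0):
    """Cost, in subroutine executions, of completing one message.

    Mode 1 (with_alternates=False): a subroutine failure loses all partial
    work and the subprogram restarts from the beginning (Type II).
    Mode 2 (with_alternates=True): the failed subroutine's independently
    failing alternate is tried before resorting to a restart (Type IV).
    """
    cost = 0
    while True:
        for _ in range(n_subroutines):
            cost += 1
            if random.random() < p_fail:          # primary subroutine fails
                if with_alternates:
                    cost += 1                      # run the alternate
                    if random.random() >= p_alt_fail:
                        continue                   # alternate succeeded
                break                              # partial work lost: restart
        else:
            return cost                            # message completed

random.seed(1)
trials = 10_000
mode1 = sum(run_message(15, 0.05, False) for _ in range(trials)) / trials
mode2 = sum(run_message(15, 0.05, True, 0.05) for _ in range(trials)) / trials
print(mode1, mode2)   # mode 2 typically completes a message at lower total cost
```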

In both modes, the acceptance test used to detect the failure of a subroutine is assumed imperfect. If the subroutine failed and this fact is not detected by the acceptance test, the failure is assumed always to be detected at the end of the execution of the subprogram. In such a case, all processing is lost and execution is restarted. Note that in case of a loss of the execution, the time already spent is added to the new processing cost.

The following parameters were used as input variables:

1. Probability of failure of the subroutine or its alternate (P_P or P_A).
2. Execution cost ratio of alternate to primary (C_A/C_P). C_A/C_P = 1.2 means that the cost of restart and execution of the alternate is 1.2 times the cost of executing the primary.
3. Cost-accuracy function of the acceptance test.
4. Accuracy of the acceptance test (test strength). An accuracy of 0.8 means that the acceptance test will catch only 80% of errors; the remaining 20% will be caught at the end of the subprogram's execution.

The following parameters were measured for the two modes of FDP processing:

1. Execution cost of the FDP.
2. Cost of testing.
3. Reliability of the FDP.
4. Cost-reliability index of the FDP.

In the simulation study:

1. The probability of failure of primary or alternates was varied from 0.0 to 0.10.
2. The execution cost ratio of alternate to primary was varied from 1.0 to 1.2.
3. The cost-accuracy functions were
   a. Y = exp(X) - 1
   b. Y = tan(X)
   c. Y = X^2
   d. Y = 0.1 X/k for X ≤ k, and Y = 0.1 + [0.9/(1.0 - k)](X - k) for X > k
   (These functions are plotted in Figure 5; Y is the cost of testing and X is the accuracy of the test, or test strength.)
4. The accuracy of the acceptance test was varied from 0.7 to 0.9.

[Figure 5—Cost-accuracy functions for acceptance tests]
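For reference, the four cost-accuracy functions can be written down directly (a Python sketch; the function names and the default knee k of the step function are our own choices):

```python
import math

def cost_of_testing(x, kind, k=0.8):
    """Cost Y of an acceptance test as a function of its accuracy x (0..1)."""
    if kind == "exp":
        return math.exp(x) - 1.0
    if kind == "tan":
        return math.tan(x)
    if kind == "square":
        return x * x
    if kind == "step":
        # Cheap up to accuracy k, then steeply more expensive.
        return 0.1 * x / k if x <= k else 0.1 + 0.9 / (1.0 - k) * (x - k)
    raise ValueError(kind)
```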

The probability of failure of the largest subroutine was taken as the variable (the failure rate). The probabilities of failure of all other subroutines were normalized based on the ratio of their execution times to that of the largest subroutine.

The graphs showing the cost-reliability index vs. failure rate and the reliability vs. failure rate, for acceptance test strengths varying from 0.8 to 0.9 and cost ratios of alternate to primary varying from 1.0 to 1.2, are shown in Figures 6a through 7d.

[Figure 6a—FDP reliability vs. probability of failure of a subroutine (test accuracy = 0.8; C_A/C_P = 1.2)]
[Figure 6b—FDP reliability vs. probability of failure of a subroutine (test accuracy = 0.9; C_A/C_P = 1.2)]
[Figure 7a—FDP cost-reliability index vs. probability of failure of a subroutine (test accuracy = 0.8; C_A/C_P = 1.0)]
[Figure 7b—FDP cost-reliability index vs. probability of failure of a subroutine (test accuracy = 0.8; C_A/C_P = 1.2)]
[Figure 7c—FDP cost-reliability index vs. probability of failure of a subroutine (test accuracy = 0.9; C_A/C_P = 1.0)]
[Figure 7d—FDP cost-reliability index vs. probability of failure of a subroutine (test accuracy = 0.9; C_A/C_P = 1.2)]

We note from our simulation that the reliability of the simulated flight data processing subsystem with recovery blocks increasingly exceeds the reliability without recovery blocks (Figures 6a, 6b) as the probability of failure increases. When the probability of failure of a subroutine exceeds 0.10, the reliability with recovery blocks is twice the reliability without recovery blocks. Such reliability improvement can increase the mean time between failures.

We note also that for cost-accuracy functions of the exponential and tangent type, the CRI without recovery blocks does not exceed the CRI with recovery blocks until the probability of failure of the subroutines exceeds 0.15. This holds over the test strength range 0.8-0.9 and the C_A/C_P range 1.0-1.2.

For the square and step cost-accuracy functions, the CRI without recovery blocks reaches the CRI with recovery blocks when the probability of failure of the subroutines is around 0.10 (see Figures 7a-7d). In particular, if C_A/C_P = 1.0, the test strength is 0.8, and the square cost-accuracy function is used, the CRI without recovery blocks exceeds the CRI with recovery blocks when the probability of failure of the subroutines exceeds 0.08 (see Figure 7a). From this we conclude that the recovery block scheme gives a lower CRI only if the probability of failure exceeds 0.08.

One underlying assumption, namely that a failed execution can be restarted in the flight data processing subsystem without any penalty (except the loss of the processing up to the current state), is not true in general. One would expect the CRI of the system with no recovery blocks to increase much faster than shown in our simulation. This could result in better justification of the recovery block scheme even at lower probabilities of failure.

The simulation model has been developed to try different scenarios regarding the implementation of the recovery block scheme in the flight data processing subsystem. The results presented in this report are illustrative rather than conclusive. More experimentation with the actual parameters of the FDP software is needed. The simulation programs were written in the language SIMULA and run on a DEC-10.

AVOIDANCE OF FAILURE DUE TO DATABASE INCONSISTENCY

One of the key issues in improving the reliability of ARTCC software is to ensure that incorrect data is not stored in the databases (flight plan database, Compool tables, etc.). One way to achieve this is to define integrity assertions on the structure and semantics of the database and to surround the database with an integrity monitor. Any access to the database must pass through the integrity monitor for verification. Transactions violating the assertions are disallowed. There are three research questions regarding this approach to ensuring the integrity of the database:

1. Design of integrity assertions
2. Language of integrity assertions
3. Monitoring of integrity assertions

These issues have been reported by us and others3,6 and are briefly discussed below.

There are two types of integrity assertions that can be defined on a database. One type is based on structural constraints. For example, we can declare that duplicate keys or records are not allowed, that every table must contain only items which are fully dependent on the key attributes, and that no transitive dependencies among attributes are allowed. The second type of constraint concerns the actual values stored in the database. Some examples are as follows:

1. The value of an item must lie between a lower and an upper bound, or some arithmetic relationship must hold among various items (time of flight arrival < time of flight departure).
2. There must be a trend in the change of values over a period of time (while an aircraft is ascending, new altitude > old altitude; when an aircraft is handed over to the next ARTCC, it is not handed back to the old ARTCC).
3. Certain records must exist in the database if some other record is already in the database (flight plan data must exist if the plane is in the ARTCC airspace).

More examples can be found3.

The language used to express integrity assertions could be the same as the one used for accessing the data. One can always use tables (such as a header to a data file) to describe integrity assertions. These tables are brought into core at the time of access to these files. This mechanism has been used in many other applications5 and is a subject of my further research.
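As an illustration, integrity assertions of both kinds can be expressed as predicates enforced by a monitor through which every update must pass. In the Python sketch below, the record layout and the two sample assertions are invented, not the Compool schema:

```python
def no_duplicate_keys(records):
    """Structural assertion: flight identifiers must be unique."""
    keys = [r["flight"] for r in records]
    return len(keys) == len(set(keys))

def altitude_trend_ok(old, new):
    """Value assertion: while an aircraft is ascending, altitude must grow."""
    return (not old["ascending"]) or new["altitude"] > old["altitude"]

def monitored_update(db, old, new):
    """Integrity monitor: disallow any update that violates an assertion."""
    proposed = [new if r is old else r for r in db]
    if not (no_duplicate_keys(proposed) and altitude_trend_ok(old, new)):
        raise ValueError("transaction violates an integrity assertion")
    return proposed

db = [{"flight": "AA10", "altitude": 310, "ascending": True}]
db = monitored_update(db, db[0],
                      {"flight": "AA10", "altitude": 330, "ascending": True})
```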

The monitoring or validation of integrity assertions can be done before executing the transaction, at run time, or after executing the transaction. The three methods are briefly described below.

1. Pre-execution validation. The method requires
   a. simulating the transaction to find the results that would be written if the assertions are not violated (what is to be written?),
   b. checking the assertions, and
   c. executing the transaction if all assertions were found true.
2. Run-time validation. The method requires
   a. executing the transaction while ignoring its "write" operations,
   b. checking the assertions, and
   c. performing the "write" operations if all assertions were found true.
3. Post-execution validation. The method requires
   a. executing the transaction completely,
   b. checking the assertions, and
   c. performing correction actions.

In ARTCC, we know a priori the types of transactions (or messages) that will be entered into the system. For each message, we have the list of items it will read (the readset) and the list of items it will write (the writeset). Under the assumption that the readset and writeset are determined before the transaction starts executing, we found that the pre-execution validation cost is less than or equal to the run-time validation cost, which is less than or equal to the post-execution validation cost. These results have been obtained3, and I briefly list three lemmas for comparing the validation methods.

Lemma 1. The cost of pre-execution validation is never larger than the cost of run-time validation.
Lemma 2. The cost of pre-execution validation is never larger than the cost of post-execution validation.
Lemma 3. The cost of run-time validation is never larger than the cost of post-execution validation.

These lemmas are important because we can design our ARTCC software such that no execution violating the integrity assertions ever takes place. The problems of run-time and post-execution validation, such as storing the original states for backup, are thus avoided.

These validations can be very expensive if the probability that integrity assertions are violated is low. The cost of these validations, as a ratio of the cost of executing a transaction, as the probability of error varies, is a topic of further research.
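Because each message's readset and writeset are known in advance, pre-execution validation can test the assertions against the values that would be written, without ever materializing an inconsistent state. A Python sketch (with invented field names and an arbitrary assertion bound):

```python
def pre_execution_validate(db, transaction, assertions):
    """Simulate the writes, check all assertions, execute only if they hold."""
    proposed = dict(db)
    proposed.update(transaction["writeset"])    # what would be written
    if not all(check(proposed) for check in assertions):
        return False                            # rejected; db was never touched
    db.update(transaction["writeset"])          # safe to execute for real
    return True

db = {"AA10.altitude": 310}
txn = {"readset": {"AA10.altitude"}, "writeset": {"AA10.altitude": 330}}
ok = pre_execution_validate(db, txn, [lambda d: 0 < d["AA10.altitude"] <= 600])
```

Run-time and post-execution validation would, by contrast, have to retain the original state in order to undo a rejected transaction; this saving is what Lemmas 1-3 capture.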
gorithm4 which does not commit a process unless it has com-
These lemmas are important because we can design our pleted and its effects have been validated. I call the approach
optimistic because I believe that in general, few processes will
ARTCC software such that no execution takes place without
interfere and conflict with each other and hence more concur-
any violation of integrity assertions. The problems of run-time
rent executions can be allowed by the system.
and post-execution validation such as storing the original
states for backup are thus avoidable. In this paper41 discuss the details of our algorithm and its
performance against the locking algorithms in a database
transaction processing environment. I note that the optimistic
approach will perform as well as the locking approach when
no conflicts exist. When conflicts increase, the optimistic ap-
proach does better than locking in both crash and non-crash
environment. Further research is being done on this approach
to study its fault-tolerant capabilities, and this will be a topic
of my future research.

CONCLUSIONS AND PLANS FOR FURTHER WORK


J1
0.00
1
0.02
1 :
0.04
1
0 06
r
0 08
—i
0.10
-i
0.12
1
0.14
The goal of this research is to investigate the required archi-
Figure 7d—FDP cost-reliability index vs. probability of failure of a subroutine tecture of the automated enroute air traffic control system
(Test accuracy = 0.9; CA/Cp=l,2) that will increase its reliability and will provide capabilities to
308 National Computer Conference, 1981
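The heart of the optimistic approach is the validation step: a process runs without locks and is committed only if validation shows that it was not interfered with. Below is a deliberately simplified Python sketch (backward validation against processes that committed during the run); it illustrates the idea rather than reproducing the algorithm of reference 4:

```python
class Process:
    def __init__(self, readset, writeset):
        self.readset, self.writeset = set(readset), set(writeset)

def validate(process, committed_during_run):
    """Commit only if no concurrently committed process wrote what we read."""
    return all(not (other.writeset & process.readset)
               for other in committed_during_run)

p1 = Process(readset={"AA10"}, writeset={"AA10"})   # updates flight AA10
p2 = Process(readset={"UA22"}, writeset={"UA22"})   # updates flight UA22
print(validate(p2, [p1]))    # True: no interference, both may commit
```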

CONCLUSIONS AND PLANS FOR FURTHER WORK

The goal of this research is to investigate the required architecture of the automated enroute air traffic control system that will increase its reliability and provide capabilities to handle errors and degrade gracefully, so that some minimum level of continuity of operations can be maintained. We are interested in designing algorithms that are robust and efficient, and we have so far investigated techniques that are useful for handling program errors, database loss and inconsistency, and concurrent processing. Our emphasis has been on the implementation and performance issues of such techniques in the flight data processing subsystem.

We have just begun to understand the reliability issues of the enroute system, and we plan to investigate software structures that are fault-tolerant and lead to robust processing. My short-range goal is to study the concepts of atomic actions, commitment levels for backup and recovery, and the assurance of consistency of the database under concurrent processing.

REFERENCES

1. Bhargava, B., H. Chuang, C. Hua, L. Lilien, and T. Altman. "Software and Processing Structures with Performance Requirements of Enroute Air Traffic Control System." Interim report to the Department of Transportation, Department of Computer Science, University of Pittsburgh, December 1979.
2. Bhargava, Bharat, and Cecil Hua. "Cost Analysis of Recovery Block Scheme and Selection Criterion for Alternates." Technical Report, April 1980.
3. Bhargava, Bharat, and Leszek Lilien. "On Optimal Placement of Integrity Assertions in a Transaction Processing System." Technical Report, January 1980.
4. Bhargava, Bharat. "An Optimistic Concurrency Control Algorithm and Its Performance Evaluation Against Locking Approach." Paper presented at the International Computer Symposium, Taipei, December 1980.
5. Gray, J., P. McJones, M. Blasgen, et al. "The Recovery Manager of a Data Management System." IBM Technical Report RJ 2623.
6. Gray, J. N. "Notes on Database Operating Systems." In Operating Systems: An Advanced Course. Berlin: Springer-Verlag, 1978.
7. Lee, P. A., et al. "A Recovery Cache for the PDP-11." IEEE Transactions on Computers, 1980, pp. 546-549.
8. Ramamoorthy, C. V., et al. "A Systematic Approach to the Development and Validation of Critical Software for Nuclear Power Plants." Paper presented at the 4th International Conference on Software Engineering, September 17-19, 1979.
9. Randell, B. "System Structure for Software Fault Tolerance." IEEE Transactions on Software Engineering, SE-1, 2 (1975), pp. 220-232.
10. Randell, B., P. A. Lee, and P. C. Treleaven. "Reliability Issues in Computing System Design." ACM Computing Surveys (1978), pp. 123-166.
11. Verhofstad, J. S. M. "Recovery Techniques for Database Systems." ACM Computing Surveys (1978), pp. 167-196.
12. "Design Specifications—Application Subsystem." U.S. Dept. of Transportation, NASP-5105, Vol. 2.
13. "Subsystem Design Data: Flight Data Processing." U.S. Dept. of Transportation, NAS Enroute Stage A (Model A3d2.8), NASP-5154-11, April 1979.
14. Zellweger, Andres. "Productivity and Safety of the Control Process." Proceedings of the Consultative Planning Conference, U.S. Department of Transportation, March 1978.

ACKNOWLEDGMENT

I would like to thank the members of the software reliability project, Cecil Hua, Tom Altman, Leszek Lilien, Redda Bourna, and Professor Henry Chuang, at the University of Pittsburgh for their help in this study. Professor Chuang also provided the information included in the appendix.

I would also like to thank Roy Smith and Ed Mayhard of the Cleveland Enroute Air Traffic Center, and David Clapp of the Transportation Systems Center at Cambridge, for information about the enroute system.

APPENDIX—SOFTWARE ERRORS AND THEIR FREQUENCY OF OCCURRENCE IN REAL-TIME SOFTWARE

The types of errors can be grouped into the following major classes:

1. Computation errors: errors in or resulting from coded equations, both equations that produce values directly for the physical problem being solved and equations used in a bookkeeping sense. Typical errors involve mathematical modeling, indexing, conversion, and mixed-mode arithmetic.
2. Logic errors: incorrect logic code, missing condition test, flag not tested, etc.
3. Data input errors: format errors, input read from an incorrect data file, invalid input read from a correct data file, etc.
4. Data output errors: format errors, data written on the wrong file, incomplete or missing output, output field size too small, etc.
5. Data-handling errors: errors made in reading, writing, moving, storing, and modifying data, etc.
6. Interface errors: routine/routine interface errors, routine/system software interface errors, wrong routine called, incompatibilities between the database and the routines using it, etc.
7. Definition errors: errors in the specification of global variables and constants, data not properly defined or dimensioned, etc.
8. Preset database errors: data not initialized, data initialized to wrong values, incorrect data units, etc.
9. Documentation errors: errors in design and operational documents.

10. Operation errors: wrong database used, wrong tapes used, configuration control errors, etc.
11. Others: time limit exceeded, storage limit exceeded, compilation errors, etc.

The frequency of occurrence of each type of error is determined by the factors mentioned above. Conclusive results about error occurrence are difficult to obtain. The extensive Software Reliability Study, performed by TRW for the Rome Air Development Center, revealed the results shown in the table below for real-time software, to which ATC software belongs. The table shows the percentage breakdown of major error types resulting from analysis of error data obtained in a large state-of-the-art real-time software project. The software was developed using a top-down structured programming approach under rigorously enforced standards and procedures. The application software is in FORTRAN, and the operating system is in assembly language.

TABLE—Percentage breakdown of major error types in real-time software (at final stage of development)

  Major Error Types      Real-Time Application Software   Real-Time Operating System
  Computational (1)                  11.2                            2.5
  Logic (2)                          18.1                           34.6
  Data input (3)                      1.1                            3.7
  Data output (4)                     2.2                            4.9
  Data handling (5)                   6.7                           21.0
  Interface (6)                       6.7                            7.4
  Data definition (7)                 7.9                            7.4
  Database (8)                       32.6                            4.9
  Others (9, 10, 11)                 13.5                           13.6
