Reliability Models of NMR Systems

Paulo Sousa

IEEE TRANSACTIONS ON RELIABILITY, VOL. R-24, NO. 2, JUNE 1975, p. 108

Francis P. Mathur, Member IEEE
Paulo T. de Sousa, Member IEEE

Abstract: Majority voted redundant systems are widely used. A reliability model is developed and analyzed for N-tuple Modular Redundancy (NMR), i.e. (n + 1)-out-of-(2n + 1), where the units are subject to stuck-at-0, stuck-at-1 or stuck-at-X failures and where failures can occur in a mutually compensatory manner. A reconfiguration of the NMR redundancy, the NMR/Simplex strategy, is proposed and evaluated and its model shown to be included in the general model for the compensated NMR.

Reader Aids:
Purpose: Widen state of the art
Special math needed for explanations: Probability, combinatorial analysis
Special math needed for results: Same
Results useful to: Theoretically inclined reliability engineers, designers of fault-tolerant computers.

1. INTRODUCTION

1.1 Statement of the Problem

The use of protective redundancy to enhance reliability has found wide acceptance as a fundamental procedure [2, 3, 9]. Majority voted systems are among the best known redundant structures. Triple Modular Redundancy (TMR) was the earliest of these systems [10]. The simplex unit is triplicated and each of the three independent units feeds into a 2-out-of-3 voter. The system fails if more than one unit fails. Variations of this strategy have been developed, such as the TMR/Simplex [1, 6]. N-tuple Modular Redundancy (NMR) is a generalization of TMR [8]. The simplex unit is replicated N times (N = 2n + 1 with n an integer). At least n + 1 out of the N units have to be operational for the structure to survive.

This classical interpretation of NMR systems underestimates their reliability. A majority of units can fail and the system will still survive. These cases are called compensating failures. Their consideration provides a more realistic evaluation of the NMR reliability.

Section 2 introduces the compensating failures approach and generalizes the TMR/Simplex concept. Section 3 develops the mathematical equations to model these systems, and further analysis is undertaken in section 4. The remainder of section 1 gives notation and nomenclature.

1.2 Nomenclature and Notation

$\lambda$: Constant failure rate of a nonredundant active unit; ($\lambda > 0$). It includes stuck-at-0, stuck-at-1 and stuck-at-X failures.
$R_0$: $\exp(-\lambda T)$.
$N$: Total number of active redundant units at the beginning of the time interval of interest; ($N = 2n + 1$, $N \ge 1$).
$n$: Degree of active redundancy, $N = 2n + 1$ in majority logic; ($n \ge 0$).
$T$: Mission time; ($T \ge 0$).
$t$ or $z$: Dummy variables for time; ($0 \le t, z \le T$).
$R_v$: Reliability of the voter.
Simplex: A nonredundant unit or system, ($N = 1$).
TMR: Triple Modular Redundant system, ($N = 3$).
NMR: N-tuple Modular Redundant system.
(TMR)sim, (NMR)sim: TMR/Simplex system, NMR/Simplex system.
(TMR)comp, (NMR)comp: Compensated TMR system, Compensated NMR system.
R("System Characterization")["Time"]: The format of a compact notation for simplifying the writing of reliability equations. Here "R", the reliability, is followed in parentheses by the "System Characterization" such as (NMR), (TMR), or (Simplex) and is then succeeded in square brackets by the parameter "Time". The parameter "Time" is usually the mission time T and can be omitted; e.g., R(NMR)[T] is the reliability of an NMR system for a mission duration T.
Stuck-at-1: Module failure, where the output is stuck at a constant logical 1.
Stuck-at-0: Module failure, where the output is stuck at a constant logical 0.
Stuck-at-X: Module failure, where the output is indeterminate, viz., not stuck at a constant logical value.
$p_1$: Pr{stuck-at-1 | module fails}
$p_0$: Pr{stuck-at-0 | module fails}
$p_x$: Pr{stuck-at-X | module fails}

2. BACKGROUND AND ARCHITECTURAL CONCEPTS

2.1 Compensating Failures

System reliability is the sum of the probabilities of the mutually exclusive success paths through the system [9]. The reliability of a TMR system with an infallible voting device is (under classical assumptions [2, 3, 8]):

$$R(\mathrm{TMR}) = 3R_0^2 - 2R_0^3 \tag{1}$$

This classical model is an underestimation [4], because it does not consider all the nonfailure situations. It only considers two success paths:
1) all units survive mission time T.
2) one unit fails and two units do not fail during mission time T.

A third successful event can happen when a unit fails to a stuck-at-0 (stuck-at-1) situation between the times 0 and T. The system does not fail, because the majority of the units did not fail. Suppose that there is now a second failure during the mission time. If this second unit fails to a stuck-at-1 (stuck-at-0) situation, it compensates the first one. The output will be given correctly by the third nonfailed unit. The probability of occurrence of this third success path should be added to (1). It is:

$$3R_0(1 - R_0)^2 \cdot \Pr\{\text{2 units compensate each other} \mid \text{2 units fail}\}$$

These results will be generalized to the NMR case. The N-tuple Modular Redundant design consists of N replicated units feeding a (n + 1)-out-of-N voter and has a reliability:

$$R(\mathrm{NMR}) = \sum_{i=n+1}^{N} \binom{N}{i} R_0^{i} (1 - R_0)^{N-i} \tag{2}$$

Equation (2) is the sum of the probabilities of all the cases where at least n + 1 (a majority among the N units) of the replicas will survive. Equation (2) is just a lower bound, since in many cases the failures can compensate one another. A more general NMR reliability model will be developed.

In general, three types of failures can be defined: stuck-at-1, stuck-at-0 and stuck-at-X. Since these three are the only types of failures considered:

$$p_1 + p_0 + p_x = 1 \tag{3}$$

It is assumed that an indeterminate failure such as stuck-at-X cannot compensate a determinate failure such as stuck-at-0 or stuck-at-1.

Definition: Two failed units are said to be compensated if one of them is stuck-at-0 and the other stuck-at-1. This situation is called one-compensation-out-of-two modules and abbreviated as "1-comp-2". In a NMR structure where two modules fail in a compensating manner the system becomes effectively a (N - 2)MR structure.

Definition: There are m-compensations-out-of-M failed units (m-comp-M) if among the M failed units there are m pairs of compensated modules. This phenomenon is governed by a multinomial distribution with pdf:

$$\frac{M!}{i_1!\, i_0!\, i_x!}\; p_1^{i_1} p_0^{i_0} p_x^{i_x} \tag{4}$$

where $i_1$, $i_0$ and $i_x$ are the number of units respectively stuck-at-1, stuck-at-0, and stuck-at-X; $i_1 + i_0 + i_x = M$. The probability of having m-comp-M is the sum of the probabilities of all the cases where there are at least m units stuck-at-1 and m units stuck-at-0, i.e.:

$$\Pr\{m\text{-comp-}M\} = \sum_{i_1=m}^{M-m} \; \sum_{i_0=m}^{M-i_1} \frac{M!}{i_1!\, i_0!\, (M - i_1 - i_0)!}\; p_1^{i_1} p_0^{i_0} p_x^{M - i_1 - i_0} \tag{5}$$

These results will be applied in section 3.2 to develop the reliability expression of a compensated NMR system.

2.2 NMR/Simplex

The variant of the TMR scheme called TMR/Simplex has been analyzed in [6]:

$$R(\mathrm{TMR})_{sim}[T] = 1.5R_0 - 0.5R_0^3 \tag{6}$$

The generalization of this scheme from the TMR/Simplex to the NMR/Simplex structure will now be shown. The notation NMR/Simplex is an abbreviation for a sequence such as NMR/(N - 2)MR/.../TMR/Simplex. Initially there is an odd number N of units operating in a voted system. Whenever a unit fails, that unit as well as one of the remaining good units are discarded. From that moment, the system is in a (N - 2)MR/Simplex mode. This process will repeat itself and eventually will lead to a simplex system. An expression for the reliability of such a system will be derived in section 3.1. In section 4.3 it will be shown that such an expression is a particular case of the compensated NMR reliability.
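The compensation probability of section 2.1 can be checked numerically. The following is a minimal sketch, not the authors' program: it evaluates the double sum of equation (5) directly and adds the third success path to the classical TMR expression (1). The function names (`pr_m_comp`, `r_tmr_classical`, `r_tmr_compensated`) are hypothetical and chosen only for this illustration.

```python
from math import factorial


def pr_m_comp(m, M, p1, p0, px):
    """Equation (5): probability that, among M failed units, at least m are
    stuck-at-1 and at least m are stuck-at-0 (i.e. m compensating pairs)."""
    total = 0.0
    for i1 in range(m, M - m + 1):          # units stuck-at-1
        for i0 in range(m, M - i1 + 1):     # units stuck-at-0
            ix = M - i1 - i0                # remaining units are stuck-at-X
            coeff = factorial(M) // (factorial(i1) * factorial(i0) * factorial(ix))
            total += coeff * p1**i1 * p0**i0 * px**ix
    return total


def r_tmr_classical(r0):
    """Equation (1): classical TMR reliability with a perfect voter."""
    return 3 * r0**2 - 2 * r0**3


def r_tmr_compensated(r0, p1, p0, px):
    """Equation (1) plus the third success path: one unit survives and the
    two failed units compensate each other (1-comp-2)."""
    third_path = 3 * r0 * (1 - r0) ** 2 * pr_m_comp(1, 2, p1, p0, px)
    return r_tmr_classical(r0) + third_path
```

As a hand-checkable case, with p1 = p0 = 1/2, px = 0 and R0 = 0.9, Pr{1-comp-2} = 2 p1 p0 = 0.5, so the third path contributes 3(0.9)(0.01)(0.5) = 0.0135 on top of the classical value 0.972. Setting p1 = 0 (or p0 = 0) removes the contribution and recovers the classical model.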
3. MODELING

3.1 NMR/Simplex

In a NMR/Simplex redundant system there are two cases leading to mission success:

Case 1: All units survive the mission time T. The probability of this event is $R_0^N$.

Case 2: One unit fails at time $z \in (0, T)$; that unit and another one are discarded. The probability of this event is:

$$\int_0^T N\lambda\, e^{-\lambda z}\, e^{-(N-1)\lambda z}\; R((N-2)\mathrm{MR})_{sim}[T - z]\; dz \tag{7}$$

The reliability equation for the system is then:

$$R(\mathrm{NMR})_{sim}[T] = R_0^N + N\lambda \int_0^T e^{-N\lambda z}\; R((N-2)\mathrm{MR})_{sim}[T - z]\; dz \tag{8}$$

$$= R_0^N + N\lambda R_0^N \int_0^T e^{\lambda N t}\; R((N-2)\mathrm{MR})_{sim}[t]\; dt \tag{9}$$

where (9) follows from (8) by the change of variable $t = T - z$. It is shown in the Appendix that this recursive integral equation (9) has the solution:

$$R(\mathrm{NMR})_{sim}[T] = A_n \sum_{j=0}^{n} B_{j,n}\, R_0^{2j+1}; \qquad A_n = \frac{(2n+1)}{2^{2n}} \binom{2n}{n}, \qquad B_{j,n} = \binom{n}{j} \frac{(-1)^j}{2j+1} \tag{10}$$

3.2 Compensated NMR

According to section 2.1, the reliability of a NMR system with compensating failures is:

$$R(\mathrm{NMR})_{comp} = R(\mathrm{NMR}) + \sum_{i=1}^{n} \binom{N}{i} R_0^{i} (1 - R_0)^{N-i}\; \Pr\{(n+1-i)\text{-comp-}(N-i)\} \tag{11}$$

This formula is the sum of the reliability of the classical NMR and the probabilities of the compensating events. Compensating events are the ones where only a minority of the units survive (i units, i = 1 to n), but among the N - i units that failed there are n + 1 - i compensations. The failed units have an effective number of votes of [(N - i) - 2(n + 1 - i)] = i - 1, and the vote of the i nonfailed units will determine the output. There are no restrictions about the time, between 0 and T, that the several failures occur. The system will never be in a failure situation, regardless of the order in which the unit failures occur.

Substituting (2) and (5) into (11) yields:

$$R(\mathrm{NMR})_{comp} = \sum_{k=1}^{N} \binom{N}{k} R_0^{k} (1 - R_0)^{N-k}\; V_k(N, p_1, p_0) \tag{12}$$

where, for $k \le n$,

$$V_k(N, p_1, p_0) = \sum_{i_1=n+1-k}^{n} \; \sum_{i_0=n+1-k}^{N-k-i_1} \binom{N-k}{i_1} \binom{N-k-i_1}{i_0}\; p_1^{i_1} p_0^{i_0} (1 - p_1 - p_0)^{N-k-i_1-i_0}$$

is just another form for the multinomial distribution depicting the factor Pr{(n + 1 - k)-comp-(N - k)}. For $k \ge n + 1$, $V_k$ is 1.

Expanding the powers of $(1 - R_0)$ in (12) gives a polynomial in $R_0$:

$$R(\mathrm{NMR})_{comp} = \sum_{i=1}^{N} b_i R_0^{i}, \qquad b_i = \sum_{k=1}^{i} (-1)^{i-k} \binom{N}{k} \binom{N-k}{i-k} V_k(N, p_1, p_0) \tag{13}$$

[Figure 1. Compensated TMR, with several values for p1, p0 and px. System reliability versus R0; the classical TMR and simplex curves are shown for comparison.]
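The closed forms of section 3 lend themselves to a quick numerical cross-check. The sketch below is an illustration under stated assumptions rather than the authors' code: it evaluates the classical NMR sum (2), the NMR/Simplex solution (10), and the compensated NMR sum (12) with the factor V_k computed from the double sum of (5), taking px = 1 - p1 - p0 as in (3). All function names are hypothetical.

```python
from math import comb


def r_nmr_classical(n, r0):
    """Equation (2): at least n + 1 of the N = 2n + 1 units survive."""
    N = 2 * n + 1
    return sum(comb(N, i) * r0**i * (1 - r0) ** (N - i) for i in range(n + 1, N + 1))


def r_nmr_simplex(n, r0):
    """Equation (10): NMR/Simplex reliability, A_n * sum_j B_{j,n} * R0^(2j+1)."""
    a_n = (2 * n + 1) * comb(2 * n, n) / 4**n
    return a_n * sum(
        comb(n, j) * (-1) ** j * r0 ** (2 * j + 1) / (2 * j + 1) for j in range(n + 1)
    )


def v_k(k, n, p1, p0):
    """Factor V_k of equation (12): Pr{(n+1-k)-comp-(N-k)} for k <= n, else 1."""
    N = 2 * n + 1
    if k >= n + 1:
        return 1.0
    M, m = N - k, n + 1 - k          # M failed units, m compensations required
    px = 1.0 - p1 - p0               # equation (3)
    total = 0.0
    for i1 in range(m, M - m + 1):
        for i0 in range(m, M - i1 + 1):
            total += (
                comb(M, i1) * comb(M - i1, i0)
                * p1**i1 * p0**i0 * px ** (M - i1 - i0)
            )
    return total


def r_nmr_compensated(n, r0, p1, p0):
    """Equation (12): compensated NMR reliability with a perfect voter."""
    N = 2 * n + 1
    return sum(
        comb(N, k) * r0**k * (1 - r0) ** (N - k) * v_k(k, n, p1, p0)
        for k in range(1, N + 1)
    )
```

With p1 = 0 or p0 = 0 every V_k with k <= n vanishes, so `r_nmr_compensated` reduces to `r_nmr_classical`. With p1 = p0 = 1/2 the compensated value coincides numerically with `r_nmr_simplex`, which is the equivalence that section 2.2 announces for section 4.3.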
4. RESULTS AND APPLICATIONS

4.1 Crossover Point

The crossover point is the minimum value of the reliability of a nonredundant component for which there is improvement in the reliability using a redundant system. It is geometrically interpreted as the point where the curves for the redundant and the nonredundant systems cross. For a compensated TMR system, the crossover situation is defined by:

$$6R_0 p_1 p_0 + 3R_0^2 (1 - 4p_1 p_0) + 2R_0^3 (3p_1 p_0 - 1) = R_0 \tag{14}$$

where the left-hand side is obtained from (13) with N = 3. The lower bound of applicability of a TMR system is the nontrivial root of (14):

$$R_0 = \frac{6p_1 p_0 - 1}{6p_1 p_0 - 2} \tag{15}$$

The classical model does not consider compensating failures, which is equivalent to making $p_1 = 0$ or $p_0 = 0$ in the compensated model. Therefore, the crossover point occurs for the classical TMR system at $R_0 = 1/2$, which agrees with previous results [7, 3].

There is no crossover point in (15) for any values of $p_1$, $p_0$ and $p_x$ such that $p_1 p_0 \ge 1/6$. For other values there is a crossover point, but lower than the classical case (Figure 1). For the cases ($p_1$ or $p_0 = 0$), ($p_1 = p_0 = 1/2$, $p_x = 0$), ($p_1 = p_0 = p_x = 1/3$), the bound (15) holds for all NMR systems (Figures 2 and 3).

[Figure 2. Compensated NMR systems with p1 = p0 = px = 1/3. System reliability versus R0; the crossover for all curves is at system reliability = R0 = 0.25.]

[Figure 3. Compensated NMR, with p1 = p0 = 1/2, px = 0 (also NMR/Simplex systems). System reliability versus R0; no crossover.]

4.2 Voter Reliability

In a majority decision structure there is a voting mechanism. To assume this voting mechanism or monitor is perfect is to oversimplify a problem. In order for the structure to operate properly, the voter has to give accurate results, whether or not there are faults in the units. Regarding the voter as a series element in the reliability block diagram [6, 3], the reliability of a NMR structure is:

$$R(\mathrm{NMR})^{*} = R(\mathrm{NMR}) \cdot R_v \tag{16}$$

It is useful to know the minimum value of the voter reliability for which there is gain in the system reliability over a simplex design:

$$R_{v(min)} = \frac{R(\mathrm{Simplex})}{R(\mathrm{NMR})} = \left[ \sum_{i=1}^{N} b_i R_0^{i-1} \right]^{-1} \tag{17}$$

Figures 4 and 5 plot $R_{v(min)}$ versus $R_0$. Figure 4 shows the influence of N in NMR systems, with $p_1 = p_0 = p_x = 1/3$.

[Figure 4. Rv(min) for compensated NMR systems with p1 = p0 = px = 1/3, plotted against R (Simplex).]

[Figure 5. Rv(min) for compensated TMR with several values of p1, p0 and px, plotted against R (Simplex); the classical TMR curve is shown for comparison.]

For a compensated TMR system:
1) if $R_0 < (6p_1 p_0 - 1)/(6p_1 p_0 - 2)$ then $R(\mathrm{TMR}) < R_0$, irrespective of $R_v$;
2) if $R_v < (24p_1 p_0 - 8)/(24p_1 p_0 - 9)$ then $R(\mathrm{TMR}) < R_0$, irrespective of $R_0$.

Therefore, the applicability constraints of a compensated TMR system become tighter when the product of the parameters $p_1$ and $p_0$ becomes smaller. The maximum value of that product is 1/4 ($p_1 = p_0 = 1/2$).

4.3 NMR/Simplex as a Particular Case of Compensated NMR

For $p_1 = p_0 = 1/2$ and $p_x = 0$, (13) becomes:

$$R(\mathrm{NMR})_{comp} = \sum_{k=1}^{N} \binom{N}{k} R_0^{k} (1 - R_0)^{N-k}\, V_k, \qquad V_k = \begin{cases} \left(\tfrac{1}{2}\right)^{N-k} \displaystyle\sum_{i=n+1-k}^{n} \binom{2n+1-k}{i}, & k \le n \\ 1, & k \ge n+1 \end{cases} \tag{18}$$

Equations (10) and (18) yield the same results. Thus, the model of the NMR/Simplex system is a particular case of the more general Compensated NMR system where $p_1 = p_0 = 1/2$ and $p_x = 0$. The behavior of NMR/Simplex systems is depicted in Figure 3 for several values of N. The system reliability always increases with N, because these curves do not have a crossover point other than 0 or 1.

REFERENCES

[1] M. Ball and F.H.
Hardie, "Architecture for Extended Mission Aerospace Computer", IBM Report 66-825-1753, Oswego, N.Y.
[2] J.L. Bricker, "A Unified Method for Analyzing Mission Reliability for Fault-Tolerant Computer System", IEEE Transactions on Reliability, Vol. R-22, pp. 72-77, June 1973.
[3] N.G. Dennis, "Reliability Analyses of Combined Voting and Standby Redundancies", IEEE Transactions on Reliability, Vol. R-23, pp. 66-75, June 1974.
[4] P.H. Giroux, Comments on "Estimates for Best Placement of Voters in a Triplicated Logic Network", IEEE Transactions on Electronic Computers, Vol. EC-15, p. 382, June 1966.
[5] P.G. Hoel, S.C. Port, and C.J. Stone, Introduction to Probability Theory, Houghton Mifflin Company, Boston, 1971.
[6] F.P. Mathur, "On Reliability Modeling and Analysis of Ultra-Reliable Fault-Tolerant Digital Systems", IEEE Transactions on Computers, Vol. C-20, pp. 1376-1382, Nov. 1971, (Special Issue on Fault-Tolerant Computing).
[7] F.P. Mathur, "Automation of Reliability Evaluation Procedures Through CARE - The Computer-Aided Reliability Estimation Program", AFIPS Conference Proceedings of the FJCC, Vol. 41, pp. 65-82, Anaheim, California, Dec. 1972.
[8] F.P. Mathur and A. Avizienis, "Reliability Analysis and Architecture of a Hybrid Redundant Digital System: Generalized Triple Modular Redundancy with Self-Repair", in 1970 Spring Joint Computer Conference, AFIPS Conference Proceedings, Vol. 36, Montvale, N.J., pp. 375-383, May 1970.
[9] D.S. Taylor, "A Reliability and Comparative Analysis of Two Standby System Configurations", IEEE Transactions on Reliability, Vol. R-22, pp. 13-19, April 1973.
[10] J. von Neumann, "Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components", Automata Studies, C.E. Shannon and J. McCarthy, eds., Princeton University Press, Princeton, N.J., 1956, pp. 43-98.

ACKNOWLEDGMENT

The authors appreciate the many helpful suggestions of the Editor and a referee.

APPENDIX

Proof that NMR/Simplex Reliability Expression (10) is the solution for (9). The proof will be by induction on n. Expression (10) can be written as:

$$R(\mathrm{NMR})_{sim} = \frac{2n+1}{2^{2n}} \binom{2n}{n} \sum_{j=0}^{n} \frac{(-1)^j}{2j+1} \binom{n}{j} R_0^{2j+1} \tag{A-1}$$

1. For n = 1, (10) yields:

$$R(\mathrm{TMR})_{sim} = \tfrac{3}{2} \left( R_0 - \tfrac{1}{3} R_0^3 \right) \tag{A-2}$$

which is the same as (6).

2. For the induction step, assume (A-1) is true for (n - 1), i.e., for a (N - 2)MR system:

$$R((N-2)\mathrm{MR})_{sim} = \frac{2n-1}{2^{2n-2}} \binom{2n-2}{n-1} \sum_{j=0}^{n-1} \frac{(-1)^j}{2j+1} \binom{n-1}{j} R_0^{2j+1} \tag{A-3}$$

Substituting (A-3) into (9):

$$R(\mathrm{NMR})_{sim} = R_0^N + N\lambda R_0^N\, \frac{2n-1}{2^{2n-2}} \binom{2n-2}{n-1} \sum_{j=0}^{n-1} \frac{(-1)^j}{2j+1} \binom{n-1}{j} \int_0^T e^{\lambda t (N - 2j - 1)}\, dt = R_0^N + A_n \sum_{j=0}^{n-1} B_{j,n} \left( R_0^{2j+1} - R_0^N \right) \tag{A-4}$$

$A_n$ and $B_{j,n}$ are defined in (10). Equation (A-4) reduces easily to (10), since

$$A_n \sum_{j=0}^{n} B_{j,n} = 1 \tag{A-5}$$

Equation (10) is still valid for n = 0; it yields the Simplex reliability $R_0$ (N = 1). Q.E.D.

Manuscript received March 11, 1974; revised August 10, 1974, and November 7, 1974.

Francis P. Mathur/239 Electrical Engineering/University of Missouri/Columbia, Missouri 65201 USA

Francis Parkash Mathur (M'65) was born in California, on October 2, 1940. He received the B.E.E. (honors) degree from the National University of Ireland, University College, Dublin in 1963, the M.S.E.E. from the University of California at Los Angeles (UCLA) in 1967, and the PhD with distinction in Computer Science also from UCLA in 1970. From 1963 to 1966 he was an Industrial Engineer with Consolidated Electrodynamics Corp., Pasadena, California. In 1966 he joined the Jet Propulsion Laboratory, California Institute of Technology where he worked on the development of the Strapdown Electrostatic Aerospace Navigator system before joining the Self Test And Repair computer development project, with responsibilities in software development and reliability analysis. In 1969 he was a recipient of NASA's Apollo Achievement Award.
He left JPL as a member of the technical staff in '72 to accept his current position as Associate Professor at the University of Missouri, College of Engineering, Bioengineering and Advanced Automation program. He is an advisor to the Sri Aurobindo International Center of Education, Pondichery, India where he spent the fall of '70 and summers of '72, '73 and '74 assisting in the development of a Computing Center and in instituting a computer engineering department. He is a member of the ACM and Sigma Xi, and is a Distinguished Visitor of the IEEE.

Paulo T. de Sousa/Electrical Engineering Department/University of Luanda/Luanda, Angola

Paulo Teixeira de Sousa was born in Nova Lisboa, Angola on January 25, 1947. He received the "licenciatura" degree in Electrical Engineering from the University of Luanda, Angola in 1971 and the M.S. degree from the University of Missouri at Columbia in 1972. After graduation in 1971, he joined the University of Luanda as an instructor in the Department of Electrical Engineering. He held a Research Assistantship in the Bioengineering Program, Electrical Engineering Department, at the University of Missouri, where he is a candidate for the Ph.D degree. His current research interest is Fault-Tolerant Computing.

Mr. de Sousa is a member of Tau Beta Pi, Eta Kappa Nu, and ACM and a past Rotary Foundation Fellow.

Abstracts of Reliability Dissertations and Theses

Educational Institutions are invited to submit copies of their students' Master's Degree Thesis or PhD dissertations which deal specifically with some aspect of Reliability. As a service to Transactions readers and the educational institutions, the dissertations and theses will be reviewed and their abstracts published.

Title: Integrated Circuit Reliability Prediction
Author: H.R. Goldenberg
Degree: Master of Science in Electrical Engineering
Thesis Advisor: Dr. M.L.
Shooman

In this thesis a method is developed for calculating the average hazard rate for a catastrophic-failure test using the test duration, the number of failures, and the sample population. Author develops an equivalency criterion which allows a transformation between hazard model shapes. The model shapes examined in detail are the constant, the Weibull, and the piece-wise linear shapes. It is shown that significant errors in reliability can result when constant hazard is assumed.

The principal failure modes of digital integrated circuits are examined, and are related to the complexity of the circuit by qualitative arguments. Various proposed reliability and complexity models are examined to determine what measures of complexity have previously been used. These measures, and the additional ones formulated by the author, are compared in a pairwise regression analysis to determine the degree of correlation present. The analysis is performed for the TTL 7400 series, and the Schottky 74S00 series devices. Author shows that the hazard function of the integrated circuits is a Weibull function of time with a scale parameter proportional to the number of gates in the device.

Title: Reliability Analysis of Transmission Systems of Regular Distributed Structures
Author: R.J. Morgan
Degree: Dr. of Philosophy
School: University of New South Wales
Dissertation Directed by: Dr. W.H. Holmes, Associate Professor of Electrical Engineering

This two-part dissertation addresses the problem of reliability analysis of a redundant transmission distributed system whose structure is more general than a series-parallel scheme. In part I, the author treats the system as a four-state, homogeneous Markov chain. The main contribution developed here concerns the limit theorem of system reliability. The theorem states that in the limit as system length becomes infinite, the reliability of the distributed system approaches the reliability of the series system of equivalent elements.
The significance of the theorem lies in the fact that it extends the Messinger-Shooman results to a more general class of structures. The theorem also contributes to the field of graph theory in that it describes a "weak" connectivity property of cascade connected bipartite graphs, under the constraint of system "growth" in a single dimension.

In part II, the author treats the system in continuous domain where system branches represent amplifiers, and nodes perform as dual input repeaters. Corresponding to each amplifier is a pdf describing the uncertainty associated with amplifier gains. The problem considered is one of obtaining the best decision at each repeater in order to maximize reliability as length increases. It is shown that the decision function derived for the series-parallel system is quite different from the one for the distributed system due to the presence of an additional failure mechanism in the latter case. The generation of the optimum decision function is the major contribution contained in the second part of this dissertation. Portions of this work appear in Morgan & Holmes, "Reliability analysis of regular chain structures," IEEE Trans. Reliability vol. R-23, April 1974, pp 11-16.

T. L. Regulinski, Senior Member IEEE

Title: Deteriorating Markov Processes Under Uncertainty
Author: D.B. Rosenfield
Degree: Dr. of Philosophy in Operations Research
School: Stanford University
Dissertation Directed by: Dr. Gerald Lieberman

A model is developed that represents a deteriorating Markov process with imperfect or costly information. The process might be, for example, a deteriorating machine or inventory system with several states. In the context of the machine example, at each time period, the machine operator has three possible actions to choose from. If repair is chosen, an expected repair cost is incurred and the system reverts to the best state.
If inspection is chosen, an inspection cost and an expected operating cost are incurred, and the operator determines exactly which state the system will be in at the beginning of the next time period. Finally, if no action is chosen, an expected operating cost is incurred, and the operator obtains no new information about the state of the process. Of course, such processes have been studied by others; however, under the imperfect information assumption the results are incomplete.

Under the structure assumed, author characterizes a state space of the process as observed by the operator. In observed state (i, k), the operator knows that k time units ago, the underlying Markov process was in state i, and that no new information has been gathered in k time units. Under straightforward assumptions on costs and under the assumptions that the Markov matrix P is IFR or stochastically increasing and $P_{ij} = 0$ for $j < i$, author shows that there are numbers k*(i) non-increasing in i such that it is optimal to repair if k > k*(i) in state (i, k) and optimal not to repair otherwise. That optimality holds for the n-period, infinite-horizon (discounted), and average-cost criteria. Under the stronger assumption that P is totally positive of order two (TP2), author further shows that, under the latter two criteria, the interval $k \in [0, k^*(i) - 1]$ for state (i, k) can be broken into at most three additional regions: a no-action optimal region, an inspection-optimal region and a second no-action optimal region.
