LECTURE 6
Reliability & Maintainability
II. Computational tools
Reliability – Maintainability – Availability
Overview
• Reliability concept.
• Computational methods.
• Procedures.
• Markov analysis.
• Human reliability assessment.
• Special topics.
[Concept map: reliability estimation, drawing on computational methods (practical approach, math functions, exact methods, approximations, data bases, formulas), procedures (ETA, RCA, HRA), Markov analysis, and specials (growth models, EOH)]
Reliability concept
• Reliability
the probability that a component or a system will function
properly, without any failure, under the process conditions
imposed and for the desired period
• Reliability theory
• Originated in the US Army in the 50s
• Boomed in the 60s
• Very important in both manufacturing and service industries
• Often neglected in engineering education
• R&M (reliability & maintainability) programs
• Asset management (“asset healthcare”) viewpoint
  – Availability of the equipment (equipment/asset can be used, seldom a down state): equipment design ~70% impact, equipment maintenance ~10% impact
  – Simple and fast maintenance: equipment design ~70% impact, equipment maintenance ~20% impact
  – Resource management: equipment design ~10% impact, equipment maintenance ~70% impact
(Based on: SAMI, 2008)
Computational methods
Empirical data
• Based on f(t)
  R(t) = Prob(T > t)
  R(t) = 1 − F(t) (see previous chapter)
  h(t) = f(t)/R(t)
  – The failure rate characterizes DFR, CFR and IFR behaviour (bath-tub curve)
  – The failure rate can help to predict impending failures (P-F curve)
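As a quick numerical illustration (not from the slides), here is a minimal Python sketch of these functions, assuming a Weibull failure density; the shape parameter beta selects the DFR, CFR or IFR regime of the bath-tub curve.

  import math

  def weibull_f(t, beta, eta):
      """Failure density f(t) of an assumed Weibull(beta, eta) distribution."""
      return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-(t / eta) ** beta)

  def weibull_R(t, beta, eta):
      """Reliability R(t) = Prob(T > t) = 1 - F(t)."""
      return math.exp(-(t / eta) ** beta)

  def hazard(t, beta, eta):
      """Failure (hazard) rate h(t) = f(t) / R(t)."""
      return weibull_f(t, beta, eta) / weibull_R(t, beta, eta)

  # beta < 1: decreasing failure rate, beta = 1: constant, beta > 1: increasing
  for beta in (0.5, 1.0, 3.0):
      print(beta, [round(hazard(t, beta, eta=1000.0), 6) for t in (100.0, 500.0, 900.0)])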
[Figure: bath-tub curve with DFR, CFR and IFR regions]

[Figure: P-F curve]
• Practical approach
  – Availability formula:
    A = MTBF / (MTBF + MTTR)
– Example of availability computation
  • Consider the system below.
    – Compute the MTTR, MTBF and availability for the system
    – Compute the system's reliability for a 6-month and a 1-year period
  • Answer: A = 0.9997
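The slide's system data sit in a figure that did not survive extraction, so the following minimal sketch redoes the computation on hypothetical uptime/downtime records; every number here is an assumption.

  import math

  uptimes   = [2800.0, 3100.0, 2500.0]   # hours between failures (assumed)
  downtimes = [1.0, 0.5, 1.5]            # repair durations in hours (assumed)

  mtbf = sum(uptimes) / len(uptimes)     # mean time between failures
  mttr = sum(downtimes) / len(downtimes) # mean time to repair
  A = mtbf / (mtbf + mttr)
  print(f"MTBF = {mtbf:.0f} h, MTTR = {mttr:.1f} h, A = {A:.4f}")

  # Under the CFR assumption, mission reliability is R(t) = exp(-t/MTBF):
  for t in (6 * 730.0, 8760.0):          # roughly 6 months and 1 year, in hours
      print(f"R({t:.0f} h) = {math.exp(-t / mtbf):.3f}")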
– Remarks
  • MTTF vs MTBF: the difference is the repair time (MTBF = MTTF + MTTR)
  • Use of a single MTBF assumes a constant failure rate (CFR)!
  • MTBF
    – Continuous case: MTBF = ∫0∞ R(t) dt
    – Discrete case: MTBF = (Σi ti)/n, the average of n observed times between failures
Physics-of-failure
Standard data
– Reliability databases (see Failure statistics)
– Formulas
System structure
RBD (reliability block diagrams)
• Exact formulas
  – Series: Rseries = ∏ Ri
  – Parallel: Rparallel = 1 − ∏ (1 − Ri)
– Example
  • Compute the reliability of the 2-component systems below; each component's reliability is 0.9
  • Answer:
    – Series (1 → 2): Rseries = R1 · R2 = 0.9 · 0.9 = 0.81
    – Parallel (1 ‖ 2): Rparallel = R1 + R2 − R1 · R2 = 1 − (1 − R1)(1 − R2) = 1 − (1 − 0.9)(1 − 0.9) = 0.99
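These formulas generalize to any number of independent components; a minimal sketch:

  from math import prod

  def r_series(rs):
      """Series structure: every component must work."""
      return prod(rs)

  def r_parallel(rs):
      """Parallel (redundant) structure: at least one component must work."""
      return 1.0 - prod(1.0 - r for r in rs)

  print(r_series([0.9, 0.9]))    # 0.81
  print(r_parallel([0.9, 0.9]))  # ~0.99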
– Example
• Compute the reliability of the following system
• Answer
• Approximations
– Cut and tie sets (minimal cut sets give a lower bound, minimal tie/path sets an upper bound on system reliability; see the sketch after the example)
– Example
• Compute for the following simple system the upper
bound and lower bound on the system reliability.
Compare with the exact reliability. Components have an
identical reliability of 0.9.
• Answer
  [Figure: the system's minimal tie (path) sets and minimal cut sets over components 1–4, from which the bounds are computed]
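Since the original diagram is lost, the sketch below assumes a plausible structure for components 1–4: components 1 and 2 in series, that pair in parallel with component 3, and the result in series with component 4. The bounds follow the usual cut/tie-set formulas.

  from math import prod

  R = {1: 0.9, 2: 0.9, 3: 0.9, 4: 0.9}   # identical component reliabilities
  paths = [{1, 2, 4}, {3, 4}]            # minimal tie (path) sets (assumed)
  cuts  = [{1, 3}, {2, 3}, {4}]          # minimal cut sets (assumed)

  # Upper bound from tie sets, lower bound from cut sets:
  upper = 1.0 - prod(1.0 - prod(R[i] for i in p) for p in paths)
  lower = prod(1.0 - prod(1.0 - R[i] for i in c) for c in cuts)

  # Exact value for the assumed structure, for comparison:
  exact = (1.0 - (1.0 - R[1] * R[2]) * (1.0 - R[3])) * R[4]
  print(f"lower {lower:.4f} <= exact {exact:.4f} <= upper {upper:.4f}")

For highly reliable components the cut-set lower bound is typically close to the exact value, as it is here.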
Potential Failure Mode and Effects Analysis (FMEA)
[Worksheet columns: Item/Function · Potential Failure Mode · Potential Effect(s) of Failure · Severity (Sev) · Class · Potential Causes/Mechanism(s) of Failure · Occurrence (Occ) · Current Design Controls · Detection (Det) · RPN · Recommended Action(s) · Responsibility & Target Completion Date · Action Results (Actions Taken, Sev, Occ, Det, RPN)]
  – What can go wrong?
    » no function
    » partial / over / degraded function
    » intermittent function
    » unintended function
  – What are the cause(s)?
  – How often does it happen?
  – How can this be found, and how good is this method of finding it?
  – Actions taken may include changes to standards, procedures, or guides
– Risk Priority Number (RPN)
  • A measure used when assessing risk to help identify critical failure modes associated with your design or process. RPN values range from 1 (absolute best) to 1000 (absolute worst).
  • The FMEA RPN is commonly used in the automotive industry (SAE J1739 FMEA) and is somewhat similar to the criticality numbers used in Mil-Std-1629A.
  • RPN = S × O × D
    – S = severity
    – O = occurrence
    – D = detection
  • Use of RPN (see the sketch after the example below)
    – Find S = 9 or 10 failures and immediately do something about them.
    – Prioritize the other failures by decreasing RPN.
• Example
– http://www.sigmazone.com/gondola_lift_fmea.htm
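A minimal sketch of RPN scoring and ranking, with made-up failure modes and ratings:

  failure_modes = [
      # (description, severity, occurrence, detection), each rated 1..10
      ("seal leak",       7, 4, 3),
      ("bearing seizure", 9, 2, 5),
      ("sensor drift",    4, 6, 7),
  ]

  # Severity 9-10 items are handled immediately regardless of RPN;
  # the rest are worked in order of decreasing RPN.
  urgent = [m for m in failure_modes if m[1] >= 9]
  ranked = sorted(failure_modes, key=lambda m: m[1] * m[2] * m[3], reverse=True)
  for name, s, o, d in ranked:
      print(f"{name}: RPN = {s * o * d}")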
– Severity (S) - Severity is a numerical subjective estimate of how severe
the customer (next user) or end user will perceive the EFFECT of a failure.
– Occurrence (O) - Occurrence or sometimes termed LIKELIHOOD, is a
numerical subjective estimate of the LIKELIHOOD that the cause, if it
occurs, will produce the failure mode and its particular effect.
– Detection (D) - Detection is sometimes termed EFFECTIVENESS. It is
a numerical subjective estimate of the effectiveness of the controls to
prevent or detect the cause or failure mode before the failure reaches
the customer. The assumption is that the cause has occurred.
• FTA
– Fault tree analysis
– Steps
• Define the system: components, functional relationships,
requirements
• Define the top event (primary failure)
• Construct the tree (top-down), with AND, OR gates
• Estimate the probability for each primary fault
• Calculate the probability of occurrence of the top event (bottom-up)
– Software support
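Bottom-up quantification is easy to sketch in code; the gate functions below assume independent basic events, and the example tree is hypothetical:

  from math import prod

  def AND(*ps):
      """Output fails only if all inputs fail: product of probabilities."""
      return prod(ps)

  def OR(*ps):
      """Output fails if any input fails: 1 - product of survivals."""
      return 1.0 - prod(1.0 - p for p in ps)

  # Hypothetical top event: a single cause OR the joint failure of a
  # redundant pair (basic-event probabilities assumed).
  p_top = OR(0.01, AND(0.05, 0.05))
  print(f"P(top event) = {p_top:.6f}")   # ~0.012475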
• ETA
– Event tree analysis
– Visualization (cf. FTA)
– Binary logic
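The binary branch logic multiplies probabilities along each path from the initiating event; a minimal sketch with an assumed initiating frequency and two hypothetical safety functions:

  init_freq = 0.1                               # initiating events per year (assumed)
  p_fail = {"alarm": 0.01, "shutdown": 0.05}    # assumed failure probabilities

  # Enumerate the four branches (each safety function succeeds or fails).
  for alarm_ok in (True, False):
      for shutdown_ok in (True, False):
          p = (1 - p_fail["alarm"] if alarm_ok else p_fail["alarm"]) * \
              (1 - p_fail["shutdown"] if shutdown_ok else p_fail["shutdown"])
          label = (("alarm ok" if alarm_ok else "alarm fails") + ", " +
                   ("shutdown ok" if shutdown_ok else "shutdown fails"))
          print(f"{label}: {init_freq * p:.5f} / year")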
• RCA
– Root cause analysis
– DOE-NE-STD-1004-92
– Many tools and approaches
– Principles
• Usually more than one root cause
• Steps
– Define scope
– Gather quantitative data
– Identify causal factors
– Find root cause for each causal factor
– Develop solution recommendations
– Implement solutions and follow-up on results
(http://ecmweb.com/mag/electric_explaining_motor_failure/)
(http://www.onesixsigma.com/article/what-can-the-nhs-learn-from-toyota)
• HAZOP
– Hazard and operability study
(http://pie.che.ufl.edu/guides/hazop/hazop2.gif)
Markov analysis
• Hypotheses
• Memoryless – exponential failure distributions
• Basics
• Method
– Identify all possible states of the system
– Determine all possible transitions between these
states and quantify them
– Work out the system of differential equations or
draw up the transition matrix
– Calculate the probability of a given state prevailing
by solving the differential equations or by
multiplying the relevant probabilities
– Determine the limiting conditions of the probabilities (a minimal sketch follows)
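The sketch below works the method out for the simplest case, a two-state (up/down) component with assumed constant failure rate lam and repair rate mu:

  lam, mu = 1e-3, 1e-1      # failure and repair rates per hour (assumed)

  # State probabilities [up, down] obey:
  #   dP_up/dt   = -lam * P_up + mu * P_down
  #   dP_down/dt =  lam * P_up - mu * P_down
  p_up, p_down = 1.0, 0.0   # start in the up state
  dt, t_end, t = 0.1, 1000.0, 0.0
  while t < t_end:          # simple Euler integration of the equations
      d_up = (-lam * p_up + mu * p_down) * dt
      p_up, p_down = p_up + d_up, p_down - d_up
      t += dt

  print(f"P_up({t_end:.0f} h) ~ {p_up:.5f}")
  print(f"limiting availability = {mu / (lam + mu):.5f}")

The numerical solution converges to the limiting availability mu/(lam + mu), which is what the last step of the method refers to.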
• Example
Answer
Human reliability assessment (HRA)
• Hi-tech and low-tech situations
• Hybrid area
• Objectives
– Error identification
– Error quantification
– Error reduction
• Step 1: Problem definition
– Purpose
» to precisely define the problem and its setting in terms
of the system goals and the overall forms of human-
caused deviations from those goals.
– Human-system interaction: examples
» maintenance/testing errors
» operator errors initiating the incident
» recovery actions by operators (incident termination,
restoration, …)
» errors by which operators make the incident worse
Problem: To effect emergency shutdown (ESD) of a chemical plant during a loss of power scenario.

Problem Setting: A computer-controlled, operator-supervised plant suffers a sudden loss of main power. The VDU display system will also fail, and so the operator, backed up by the supervisor, must initiate ESD manually using hardwired controls in the Central Control Room. However, due to valve failures on plant, these actions are only partially successful, and so the operator must send out another operator onto plant to determine which ESD valves have not closed. The CCR operator, via engineering drawings, can then determine which manual valves must be closed on plant. The outside operator must then go to close these valves, completing this action successfully within 2 hours from the onset of the scenario.

System Goals: The overall system goals are safe shutdown of all feeds to the plant within 2 hours of loss of power. In this scenario there are no production goals once the event occurs, since safety is clearly under threat. Prior to the event, the operator is concerned with achieving steady feed throughput via monitoring the top two levels of a VDU display hierarchy, and notifying the supervisor of any alarms higher than level 2. The outside operator (on plant) will have various duties associated with maintenance tasks.

Overall Human Error Considerations: No operator initiating events were identified, and maintenance errors were not relevant, except that identifying and moving the local manual valves could prove difficult. Recovery actions involve identifying the appropriate valves to close and closing them. Errors of failing to realise that ESD has not been 100% effective, and of mis-identifying the valves, appear most likely. Loss of power is so evident that misdiagnosis or failure to diagnose is not considered credible.
• Step 2: task analysis
  • Purpose
    – complete and comprehensive description of the tasks that have to be performed by the operators to achieve the system goals
  • Types of task analysis
    – sequential
      » operator actions in chronological order
    – hierarchical
      » in terms of a hierarchy of goals
    – tabular
      » cognitive decision-making aspects, knowledge and expectations on the situation
Example (hierarchical task analysis): warming up a furnace
  0. Warm up furnace
    1. Prepare plant and services (1.1 ensure plant is ready; 1.2 ensure gas-oil available; 1.3 ensure oxygen analysing system is OK)
    2. Start air blower
    3. Start oil pump
    4. Heat up to 800 degrees (4.1 increase temperature controller; 4.2 monitor oxygen as per chart; 4.3 monitor temperature as per diagram; 4.4 switch to automatic)
• Step 3: human error analysis
– Purpose
» To identify all significant human errors affecting performance
of the system, and ways in which human errors can be
recovered
» CRITICAL STEP
– Examples: external error modes
» error of omission: act omitted (not carried out)
» error of commission: act carried out inadequately, act carried out in
wrong sequence, act carried out too early/late, error of quality (too
little/too much)
» extraneous error: wrong (unrequired) act performed
– Techniques
» SHERPA, PHECA
Systematic Human Error Reduction & Prediction Approach (SHERPA)
1. Failure to consider special circumstances. A task is similar to other tasks but special circumstances prevail which are ignored, and the task is carried out inappropriately.
2. Short cut involved. A wrong intention is formed based on familiar cues which activate a short cut (an inappropriate rule).
3. Stereotype takeover. Owing to a strong habit, actions are diverted along some familiar but unintended pathway.
4. Need for information not prompted. Failure of external or internal cues to prompt the need to search for information.
5. Misinterpretation. Response is based on wrong apprehension of information such as misreading of text or an instrument, or misunderstanding of a verbal message.
6. Assumption. Response is inappropriately based on information supplied by the operator (by recall, guesses, etc.) which does not correspond with information available from outside.
7. Forget isolated act. Operator forgets to perform an isolated item, act or function, i.e. an act or function which is not cued by the functional context, or which does not have an immediate effect on the task sequence. Alternatively it may be an item which is not an integrated part of a memorised structure.
8. Mistake among alternatives. A wrong intention causes the wrong object to be selected and acted upon, or the object presents alternative modes of operation and the wrong one is chosen.
9. Place losing error. The current position in the action sequence is misidentified as being later than the actual position.
10. Other slip of memory (as can be identified by the analysis).
11. Motor variability. Lack of manual precision, too big/small force applied, inappropriate timing (including deviations from "good craftsmanship").
12. Topographic or spatial orientation inadequate. In spite of the operator's correct intention and correct recall of identification marks, tagging, etc., he unwittingly performs a task/act in the wrong place or on the wrong object. This occurs because of following an immediate sense of locality where this is inapplicable or not updated, perhaps due to surviving imprints of old habits, etc.
• Step 4: representation
– Purpose
» logical format for errors, so that their effects on the
system goals can be evaluated
– Examples
  » fault tree
  » operator event tree

[Fig. 6.9: FTA example: a power supply feeds pumps A and B in parallel, then a filter and pipeline. Top event "total loss of output" = OR(filter blocked (1), pipeline fails (2), power failure (3), both pumps defective = AND(pump A fails (4), pump B fails (5)))]
• Step 5: screening
  – Purpose
    » to decide which errors may be ignored and which errors should be taken into account in the rest of the study
  – Example (SHARP)
    » a human error combined with an unlikely hardware error may be screened out
    » set p(human error) = 1 and check the effect; only where the effect matters, refine with p(human error) < 1
• Step 6: quantification
  • Purpose
    – quantification of human error and error recovery probabilities, in order to define the likelihood of success in achieving the system goals
  • Definition HEP
    human error probability = (number of errors occurred) / (number of opportunities for error to occur)
    – data are scarce!
      » denominator problem, confidentiality, unwillingness to publish data, lack of awareness to collect appropriate data
      » legislation, “near-misses”, …
• Many techniques
– SLIM, HEART, THERP
– Selection criteria: accuracy, validity, usefulness, effective use of
resources, acceptability, maturity
[Figure: human error probability and its performance shaping factors]

• Selection criteria
a) Accuracy: numerical accuracy (in comparison with known HEPs); consistency between experts and assessors.
b) Validity: use of ergonomics factors/PSF to aid quantification; theoretical basis in ergonomics, psychology; empirical validity; validity as perceived by assessors, experts, etc.; comparative validity (comparing results of one technique with results of another for the same scenario).
c) Usefulness: qualitative usefulness in determining error reduction mechanisms; sensitivity analysis capability, allowing assessment of effects on HEPs of error reduction mechanisms.
d) Effective use of resources: equipment and personnel requirements; data requirements, e.g. SLIM and PC require calibration data (at least two known HEPs); training requirements of assessors and/or experts.
e) Acceptability: to regulatory bodies; to the scientific community; to assessors; auditability of the quantitative assessment.
f) Maturity: current maturity; development potential.
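As an illustration of what quantification produces (a generic sketch, not a rendering of THERP, SLIM or any other specific technique), here is a minimal HEP computation and a propagation of per-step HEPs through a task; all figures are assumed:

  errors, opportunities = 3, 2000
  hep = errors / opportunities        # HEP = errors occurred / opportunities
  print(f"HEP = {hep:.4f}")

  # Probability of completing a multi-step task with no unrecovered error,
  # assuming independent steps and a uniform chance of error recovery:
  step_heps = [1e-3, 5e-3, 2e-2]      # assumed per-step HEPs
  p_recovery = 0.5                    # assumed probability an error is caught
  p_success = 1.0
  for h in step_heps:
      p_success *= 1.0 - h * (1.0 - p_recovery)
  print(f"P(task success) = {p_success:.4f}")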
• Step 7: impact assessment
  • Purpose
    – assess the impact of errors on system reliability (or risk) and see if the level is acceptable or if improvement is necessary
  • Action
    – find the major contributors to system non-performance and act according to the findings
      » human error
      » hardware or software error
      » combination
• Step 8: error reduction
  • Purpose
    – to identify error reduction mechanisms, means of supporting error recovery likelihood, and ways of improving human performance
  • Techniques
    – prevention by hardware or software changes
    – increase system tolerance
    – enhance error recovery
    – error reduction at source
  • PSF (performance shaping factors)
[Diagram: performance shaping factors acting on the operator within the system]
  – process: technology, process chemistry, materials
  – personnel: training, experience, mental model, personality, health
  – personnel interactions: communication, information, company policy
  – man-machine interface: controls, displays
  – equipment: clothing, aids
Special topics
• Duane's reliability growth model
  – formula
    log θc = log θ0 + α (log T − log T0)
    where θc is the cumulative MTBF at total test time T, θ0 the cumulative MTBF at the reference time T0, and α the growth factor
– Example
  • A company is developing a new type of electronic test equipment. The results of the first reliability qualification test were encouraging: 11 failures/600 hours. Before starting the production of the equipment the MTBF (in operation) should be 500 hours.
  • How much more testing is needed if α = 0.3 and if α = 0.5?
  Answer:
    for α = 0.3: 297,000 hours
    for α = 0.5: 12,670 hours
  How to solve? (see the derivation and the sketch below)
Duane equation (with cumulative MTBF θc = T/n, n = cumulative number of failures at test time T):
  log θc = log θ0 + α (log T − log T0), i.e. θc = θ0 · (T/T0)^α

Taking the derivative:
  n = T/θc = (T0^α/θ0) · T^(1−α)
  dn/dT = (1 − α) · (T0^α/θ0) · T^(−α) = (1 − α)/θc

Instantaneous MTBF:
  θi = dT/dn = θc/(1 − α), or θc = θi · (1 − α)

Final step, solving for the total test time T:
  T = T0 · (θc/θ0)^(1/α) = T0 · (θi · (1 − α)/θ0)^(1/α)

Solution to the problem: take T0 = 600 h, θ0 = 600/11 h and θi = 500 h, and evaluate T for each α.
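A minimal sketch evaluating this closed-form solution for the example; it reproduces the quoted answers to within about 1% (the residual difference appears to be rounding on the slide):

  T0 = 600.0                 # test time accumulated so far [h]
  theta_0 = T0 / 11.0        # cumulative MTBF after 11 failures (~54.5 h)
  theta_i = 500.0            # required instantaneous (operational) MTBF [h]

  for alpha in (0.3, 0.5):
      # T = T0 * (theta_i * (1 - alpha) / theta_0) ** (1 / alpha)
      T = T0 * (theta_i * (1.0 - alpha) / theta_0) ** (1.0 / alpha)
      print(f"alpha = {alpha}: total test time ~ {T:,.0f} h")
  # -> roughly 2.9e5 h for alpha = 0.3 and 1.26e4 h for alpha = 0.5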
• EOH concept
• Equivalent operating hours
• (Gas) turbines
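The EOH bookkeeping can be sketched as below; the weighting factors are purely illustrative, since real factors are specified by the turbine OEM for each model:

  operating_hours = 6000.0   # fired hours this period (assumed)
  starts, trips = 40, 2      # start-ups and trips (assumed)
  load_factor = 1.1          # >1 for peak load / elevated firing temperature (assumed)

  # Each start or trip is charged as a fixed number of equivalent hours:
  EOH = operating_hours * load_factor + starts * 10.0 + trips * 100.0
  print(f"EOH = {EOH:,.0f} h")   # maintenance intervals run on EOH, not raw hours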
• RPN computation is a very important issue in FMEA ...
Wrap-up
• Reliability concept.
• Computational methods.
• Procedures.
• Markov analysis.
• Human reliability assessment.
• Special topics.