Practical Guide to IT Problem Management
IT Pro Practice Notes
Andrew Dixon
First Edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
© 2022 Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, LLC
Reasonable efforts have been made to publish reliable data and information, but the
author and publisher cannot assume responsibility for the validity of all materials or
the consequences of their use. The authors and publishers have attempted to trace
the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted,
reproduced, transmitted, or utilized in any form by any electronic, mechanical, or
other means, now known or hereafter invented, including photocopying, microfilming,
and recording, or in any information storage or retrieval system, without written
permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks
and are used only for identification and explanation without intent to infringe.
ISBN: 978-1-032-21729-1 (hbk)
ISBN: 978-0-367-63622-7 (pbk)
ISBN: 978-1-003-11997-5 (ebk)
DOI: 10.1201/9781003119975
Typeset in Berling
by SPi Technologies India Pvt Ltd (Straive)
CONTENTS
Chapter 3 Failure Modes 15
Possible Failure Modes for a Desktop Windows PC 16
Case Study: A Virus Outbreak 18
Summary 19
Chapter 6 Drill Down 31
Hypothetical Case Study 32
Summary 35
Identifying Value 67
Creative Problem Solving 67
Six Sigma 69
Summary 69
Conclusion 77
A Glossary 79
B Sample Checklists 85
Index 87
BIOGRAPHY
INTRODUCTION
Now, here, you see, it takes all the running you can do, to keep in
the same place. If you want to get somewhere else, you must run
at least twice as fast as that!
– The Red Queen, Through the Looking Glass, Lewis Carroll
taught. It is hoped that this practical guide will encourage you to adopt the tools presented within it, understand how and, just as importantly, when to apply them, and in so doing upskill your staff and improve your processes.
ACKNOWLEDGEMENTS
I would like to thank Hannah for her encouragement to write this book
and for providing the introduction with the publisher, Mark for his
comments and advice on the draft version and John for his guidance.
CHAPTER 1
GETTING YOUR PRIORITIES RIGHT
This book is not designed to help you pass an exam in problem manage-
ment, although it may help you set up a problem management process
within your organisation (Chapter 10 looks at formal processes). Above
and beyond that, this book looks at the bigger picture of how problem
management adds value to an organisation and why it is important.
Take, for example, the Apollo 13 moon mission. This is an oft-quoted example because of the famous expression "Houston, we've had a problem."
WORKAROUNDS
In problem management a solution which addresses the immediate
impact issues without addressing the underlying causes is known as a
workaround. In the case of Apollo 13, they realised that they had a spare
source of oxygen – the Lunar Module and a spare source of propulsion –
the gravity of the moon. The astronauts lived in the Lunar Module
for the next four days whilst the spacecraft travelled to the moon and
back, using the Lunar Module’s propulsion system to guide the whole
craft. This preserved the resources in the Command Module, so that
it could be used for re-entry. The astronauts survived and were hailed
as heroes, as were the staff of Mission Control who had assessed the
impact and provided the workaround.
ITIL 4 defines an incident to be "an unplanned interruption to a service or reduction in the quality of a service".
The incident was over once the mission was over and the astronauts
were safe. The tank which had exploded was somewhere in space – so
it couldn’t be repaired.
The problem remained. Before another Apollo mission could take
place, they needed to understand what had happened and how they
could remove or reduce the risk of it happening again. This is called
root cause analysis.
Note that although evidence was gathered, it was not important that
this analysis was done until after the workaround had brought the astro-
nauts home. In any problem, mitigating the impact is the first priority.
Sometimes, this can only be done by identifying the root cause and
addressing it. However, that is not always the case, and it is a judgement call to be made.
The review board determined that Oxygen Tank 2 was faulty before
the mission and that activating a fan within the tank caused an electric
arc which caused the fire and explosion.5 There were a number of con-
tributing factors. The tank was later redesigned to remove the risk from
all of the contributing factors. Performing the review was critical to the
success of later Apollo missions – any one of which could have ended in
disaster if the root cause analysis had not been done correctly.
The root cause analysis identified both a sequence of events which
led to the accident and a design fault:
1. Tank 2 was originally in Apollo 10, but was removed to fix a
fault. It was dropped when it was removed.
2. There were thermostats which were designed to operate at 28 volts,
but were powered with 65 volts – they failed to operate correctly.
3. The temperature gauge was only rated up to 29° Celsius (84°
Fahrenheit), so failed to detect the failed thermostats.
4. During testing, tank 2 needed to be emptied and the drain
system didn’t work, so they boiled off the oxygen. Without the
functioning thermostats, temperatures may have reached 540°
Celsius (1004° Fahrenheit).
5. The high temperatures appear to have damaged the Teflon
insulation.
Tests on similarly configured tanks produced telemetry readings which
were in accord with the telemetry readings captured during Apollo 13’s
flight, which gave the investigators confidence that this is what had
happened.
PREVENTING PROBLEMS
Problem management does not occur in a vacuum. When I trained to
do First Aid at Work, one of the things I was taught was that it was bet-
ter to avoid an accident than to pick up the pieces afterwards. If I saw
a trip hazard, I could remove it or wait until someone tripped and then
administer first aid. If I saw a drawing pin on the floor, then I could pick
it up and put it back on the noticeboard, or I could treat someone with
a drawing pin in the foot.
The cost of the Apollo series of missions is estimated at $25.4 billion,
so it can be argued that this mission cost in excess of $1 billion and
failed to achieve its primary objective of reaching the moon. The mis-
takes which led up to this were, therefore, very expensive mistakes.
The thermostatic switches used in Oxygen Tank 2 should have been
replaced when the operating specifications were changed.
When the tank was dropped, it should have been fully tested in an end-to-end lifecycle test.
Oxygen Tank 2 was filled during a countdown demonstration test.
When it could not be emptied using the correct procedure, a work-
around was applied of boiling off the oxygen (which would normally
be stored in liquid form).
At each point, if a different decision had been taken, then this disaster might not have happened and a $1 billion mission might not have failed.
Problem management exists in the context of providing an end-to-end service and needs to operate alongside enterprise architecture, continual improvement and risk management.
Workarounds should not be used to pass the problem further down
the line. If the drain pipe did not work, this should have indicated that
there was a more serious issue in existence. Just removing the oxygen
ignored the issue.
A NO BLAME CULTURE
There is no suggestion in this case that people covered up a story,
but it is good practice in problem management and in its sister major
SUMMARY
Problems are the causes of incidents. If the impact of the incident is
sufficient to warrant it, the problem needs to be investigated.
Rule 1: Assess the impact first.
Rule 2: Provide a workaround when appropriate. Sometimes
it is more important to address the incident rather than the
underlying cause.
Rule 3: Understanding the root cause allows reoccurrences of
problems to be avoided.
Notes
1 https://en.wikipedia.org/wiki/Apollo_13
2 https://en.wikipedia.org/wiki/Apollo_13
3 ITIL Foundation: ITIL 4 Edition, Axelos Ltd, Stationery Office, 2019. https://www.axelos.com/store/book/itil-foundation-itil-4-edition
4 ITIL Foundation: ITIL 4 Edition, Axelos Ltd, Stationery Office, 2019. https://www.axelos.com/store/book/itil-foundation-itil-4-edition
5 https://en.wikipedia.org/wiki/Apollo_13
CHAPTER 2
TIMELINES
Let us assume that you have identified a problem, but you do not know
what is causing the problem. A good place to start is by collecting a
history or sequence of events and creating a timeline.
When you go to see a doctor because you are ill, they will typically
do three things: they will ask you for your symptoms (what you think is
wrong with you, what you can feel), they will look for signs (by taking
your temperature and your blood pressure and by prodding and pok-
ing you), and they will take a history. In taking a history, they want to
collect two sets of data. They want to know when the pain started and
how it has developed. Is the pain the same at all times of the day or
does it vary according to your activities? It may ease overnight and then
return during the day. It may be worse when you are lying down and
improve when you are upright. Collecting the history of the pain and
other symptoms is important but there is another set of data which the
doctor will also typically ask about. They also want to know whether
your routine has changed. Have you been out of the country? Did you
just run a marathon? A change in your routine may have caused or con-
tributed to the illness. I am highly unlikely to have contracted malaria
if I have stayed in the UK all my life, but if the doctor is told that I have
just returned from a country where mosquitoes are prevalent, then
they will pay more attention to that possible cause of my ailment.
In problem management, the collecting of a history and the creation
of a timeline are also interested in those two sets of data. Is the problem
continuous or does it vary according to the time of day or according to
the patterns of business activity (see below)? Just as importantly, what is
the context in which this is happening? Problem management is closely
aligned to change enablement and arguably you will struggle to do prob-
lem management well if you do not implement change enablement.
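One way to assemble such a timeline is to merge dated entries from several sources (system logs, the ITSM tool, monitoring alerts) into a single ordered list. The following is a minimal sketch, not taken from any particular tool; the entries and the (timestamp, source, message) record format are invented for illustration:

```python
import heapq

def merge_timeline(*sources):
    """Each source is a chronologically ordered list of
    (timestamp, source, message) tuples; merge them into one timeline.
    ISO-style timestamps sort correctly as plain strings."""
    return list(heapq.merge(*sources))

# Invented sample entries
syslog = [("2021-03-04 09:02", "syslog", "disk warning"),
          ("2021-03-04 11:30", "syslog", "service restart")]
itsm   = [("2021-03-04 10:15", "itsm", "change CHG-101 implemented")]
alerts = [("2021-03-04 11:25", "monitoring", "response time breach")]

for event in merge_timeline(syslog, itsm, alerts):
    print(event)
```

Because each source is already in time order, a streaming merge is enough; there is no need to sort the combined list from scratch.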
Some problems only occur during busy periods of the year. If you
consider the Tax Office web application for people to submit their tax
returns (in the UK this is operated by HMRC–Her Majesty’s Revenue
and Customs), then although people can submit their tax returns at
any point in the year, the web site becomes busier and busier as the
deadline date approaches (the self-assessment deadline for online sub-
missions is the 31st January each year for everyone). It is essential that
IT staff understand the patterns of business activity for their organisa-
tion in order to anticipate surges in demand such as this. In the next
chapter, we look at failure modes; the inability to cope with demand needs to be considered alongside other failure modes.
Case Study: Database Corruption
caused incidents. The change had been identified and corrected and the
incidents had stopped. A while later, the incidents had started again. It was clear that, although the team believed it was the same error, the symptoms and the signs were in fact different. I refined my broad-brush timeline and concentrated on a narrower time window based on when the second set of incidents had started.
By the end of my analysis, I was fairly confident that something had
happened on a particular day, even though it was not clear what that
event was. I turned to the change enablement records in their ITSM
tool. They had a good system for recording changes to the service and
to the servers which it ran on. However, the implementation dates
were sometimes vague. A Windows update was recorded as having
been applied sometime during the week. A configuration change had
been made sometime between two dates within the month. I identified
five changes which might have been made on that date. None of them
appeared to be a smoking gun. None of them seemed likely candidates
for the errors being seen.
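This cross-referencing step can be sketched in code. The sketch below is an illustration, not from the book: it assumes each change record carries an earliest and latest possible implementation date (a vaguely recorded change simply spans a wider window) and returns the changes whose window covers the suspected date. The change identifiers and dates are invented.

```python
from datetime import date

def candidate_changes(changes, suspected_date):
    """Return changes whose implementation window could include the
    suspected date. A precisely recorded change has earliest == latest."""
    return [c for c in changes
            if c["earliest"] <= suspected_date <= c["latest"]]

# Hypothetical records, loosely modelled on the case study
changes = [
    {"id": "CHG-101", "summary": "Windows update, applied sometime that week",
     "earliest": date(2021, 3, 1), "latest": date(2021, 3, 7)},
    {"id": "CHG-102", "summary": "Config change, sometime that month",
     "earliest": date(2021, 3, 1), "latest": date(2021, 3, 31)},
    {"id": "CHG-103", "summary": "Firmware upgrade, precise date",
     "earliest": date(2021, 4, 2), "latest": date(2021, 4, 2)},
]

print([c["id"] for c in candidate_changes(changes, date(2021, 3, 4))])
# prints ['CHG-101', 'CHG-102']
```

The wider the recorded windows, the more candidate changes each query returns, which is exactly why vague implementation dates make this analysis harder.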
I arranged a second consultation with the technical members of the
team and explained my thinking. I explained why I thought that this
was the most likely date for a change. I listed the changes from their
ITSM tool which could have been made, expressing doubt that any
were likely candidates. The team considered each change in turn and
ruled them out. They then referred to their team log and noted that
an engineer from the vendor had been granted access to the system on
that date.
I asked what the engineer had been doing. The engineer was not
meant to be making changes in that part of the system, but it became
apparent that they had. The company went back to the vendor with a
complaint and a demand that they fix the issue. The issue was fixed in
due course.
In my report I emphasised that a change enablement system is only
as good as the data in it. It is really important that vendors follow the
same process as the company. Vendor changes must be pre-approved
and they must be recorded within the change enablement process.
When a vendor is making a change, it is a good idea to ask for a method
statement explaining what they are going to do and how they are going
to do it.
SUMMARY
Rule 1: Create a timeline for the problem. Start with a broad-brush
timeline and then fill in details where it adds value.
Rule 2: Cross-reference your timeline with changes from
your ITSM tool and with system logs and other sources of
information.
Rule 3: Look for negative evidence as well as positive evidence.
Look further back in time to ensure that you are not missing
evidence.
Rule 4: Try to distinguish between different problems which may
have similar symptoms but different root causes.
Rule 5: All changes to a system should be recorded, including
changes made by the vendor.
CHAPTER 3
FAILURE MODES
Failure modes are descriptions of the various ways in which a system can
fail. Suppose you have a Windows PC on your desk and you get a blue
screen of death – when the PC stops working and displays an error mes-
sage on a blue background. You will restart the PC and carry on. This is
a single incident and it is normally not worth worrying about. Suppose
you keep getting a blue screen of death. You clearly have a problem and
you will want to address that problem so that your work is not constantly
interrupted. So, the first question to ask is, what different components
of your Windows PC could cause a blue screen of death? Understanding
what components could fail is the study of failure modes.
There are two reasons why it is valuable to understand failure modes.
The first is to know where to look when a system fails. The second is
to determine whether the reliability of a system could be improved by
reducing the likelihood of certain failures.
We can break a Windows PC down into at least three layers – the
hardware, the operating system and the applications which run on the
operating system. Any of those three layers could cause a blue screen
of death. One possible cause is a memory fault. Memory faults could
happen at any of the three layers. Running the Windows Memory
Diagnostic Tool might reveal which layer is at fault. Equally, it may be
a case of following the timeline and establishing what behaviour causes
the crash. It may occur only when a particular application is run. It may
occur only when a certain amount of memory has been used.
For people working on a Service Desk, it is useful to list the different
failure modes for a Windows PC. At first glance there only appear to
be a few failure modes – I listed only three above. In practice, there are
multiple failure modes within each layer. Each of these can be further
broken down. The following list may be helpful:
Possible Failure Modes for a Desktop Windows PC
The failure modes for a simple server are not dissimilar, but we typi-
cally add complexity to server infrastructure in order to add resilience.
I define resilience as the ability of a system to continue to provide its main service despite the failure of an individual component. If the hard
disk of my Windows PC fails, then I have lost the use of my Windows
PC until it is repaired and I may have lost all data which was written
to that hard disk since my last backup. Whilst this may be acceptable
and tolerable for a single desktop computer, it is not desirable for a
corporate server. To improve resilience, servers typically have RAID
disks (Redundant Array of Inexpensive Disks), dual network connec-
tions and dual power supplies. RAID disks protect against the failure
of a single disk within the system by splitting the data across multiple
disks with a layer of redundancy in the data (so that one or more disks
may be removed from the array without losing any data). Dual network
connections protect against the loss of the network interface or the loss
of a single network switch. Dual power supplies which are plugged into
separate power supply paths protect against the loss of the power sup-
ply unit, but may also protect against the loss of raw mains or of the
fuseboard. Many commercial data centres certify 100% mains power
supply, but only on the condition that each system is plugged into both
the ‘A’ feed and the ‘B’ feed. They are permitted to remove either feed
without notifying the customer.
The first important point is that if you wish to achieve a resilient
system, then you need to map all of your failure modes and ensure that
they are all resilient. Suppose dual power supplies are both plugged into
the same power feed. This protects against the failure of the switched
mode power supply unit, but not against power loss. Suppose dual
power supply–fed servers are plugged into network switches which are fed from a single power supply. The servers would stay up during a power interruption but could be disconnected from the network.
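The effect of mapping failure modes onto serial and redundant components can be illustrated with a rough availability calculation. This is a sketch under the simplifying assumption that component failures are independent, and the availability figures are invented:

```python
def in_series(*avail):
    # Every component must work: multiply availabilities.
    result = 1.0
    for a in avail:
        result *= a
    return result

def in_parallel(*avail):
    # At least one redundant component must work.
    all_fail = 1.0
    for a in avail:
        all_fail *= (1.0 - a)
    return 1.0 - all_fail

# Dual PSUs on separate feeds vs. both on the same feed (invented figures)
psu, feed = 0.999, 0.995
separate_feeds = in_parallel(in_series(psu, feed), in_series(psu, feed))
same_feed = in_series(feed, in_parallel(psu, psu))
print(f"separate feeds: {separate_feeds:.6f}, same feed: {same_feed:.6f}")
```

The same-feed figure can never exceed the availability of the single feed, which is the point made above: duplicating the power supply unit without duplicating the feed leaves the feed as an unprotected failure mode.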
A second important point is that it is essential to monitor for component failure and to respond swiftly; failing to do so merely delays incidents rather than preventing them. For a large RAID array, a replacement timeline needs to be agreed, so that failed disks are replaced and the RAID array rebuilt before the next disk fails. Suppose network routers are
linked by two fibres following different paths. If one fibre fails without
anyone realising, then if the other fibre is damaged the link would be
broken.
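The monitoring point can be made concrete with a small sketch (an illustration, not any particular monitoring product): given the status of each member of a redundant group, flag groups that are still up but have silently lost their redundancy. The group names and statuses are invented.

```python
def redundancy_alerts(groups):
    """groups maps a component group name to the status of each
    redundant member, e.g. {"fibre-link": ["ok", "failed"]}."""
    alerts = []
    for name, members in groups.items():
        working = sum(1 for status in members if status == "ok")
        if working == 0:
            alerts.append(f"{name}: OUTAGE - no working members")
        elif working < len(members):
            # The service is still up, but the weakest-link protection is gone.
            alerts.append(f"{name}: DEGRADED - {working}/{len(members)} "
                          "working, a further failure breaks the service")
    return alerts

print(redundancy_alerts({
    "raid-array": ["ok", "ok", "failed", "ok"],
    "fibre-link": ["ok", "ok"],
}))
```

The DEGRADED state is the one that is easy to miss in practice: nothing is visibly broken, yet the next single failure causes an incident.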
When a complicated problem arises (see next chapter for a discussion
of different types of problem), the standard method of problem resolu-
tion is to seek the root cause and then either fix the root cause or find a
workaround to mitigate or minimise the problem. In order to find the
root cause, it is important to understand the different failure modes, so
that they can be eliminated in turn and the root cause identified.
zero-day virus, but due to a high rate of re-infection, it was only when
the failures could be addressed (with virus signatures provided by the
anti-virus companies) that the problem was resolved. The clean-up
operation was hampered because initially the infected computers were
quarantined but the failure modes were not adequately understood.
When the user, whose user profile had been infected, moved to another
computer, their profile was downloaded from the server and infected
the second computer. Collecting timelines and infection statistics
quickly showed this up and the user profiles were also quarantined.
During the review process, each of the failure modes was considered
and, where possible, addressed.
SUMMARY
Understanding how computers fail, their failure modes, assists in the
root cause analysis of problems.
Rule 1: A resilient service is only as resilient as its weakest link.
Rule 2: It is best practice to improve security at multiple levels so that if one level fails, you do not lose all security. This is known as Defence in Depth.
CHAPTER 4
COMPLEXITY THEORY
OBVIOUS
Many incidents are never classified as problems since it is obvious how
to fix them, they are unlikely to re-occur and there is no benefit in
doing problem management on them. An obvious example of this is
a paper jam in a printer or photocopier due to the paper being fed in
incorrectly. How to remove the jammed paper is usually obvious and
the user then needs to insert the remaining paper correctly. If this is
an infrequent occurrence, then it does not need problem management.
If it occurs frequently, then there may be a training issue. Just occa-
sionally the obvious problem turns out not to be obvious and standard
problem-solving techniques are required.
Complexity Theory says that for obvious problems, the resolver
should sense the issue, categorise it according to previous experience
and respond accordingly.
COMPLICATED
A problem is defined to be complicated if it has a single root cause for
which the resolution is not obvious. The root cause may not be known,
so it is not always possible to state up front that a problem should be
classified as complicated. The standard method of resolving compli-
cated problems is to do root cause analysis. Later chapters of this book
will look at a number of techniques for conducting root cause analysis
in order to resolve complicated problems.
could not find a fault. Then the engineer looked at the paper that they
were using and asked when the company had changed paper suppliers.
During a quiet period of the year, the company had indeed changed
paper suppliers to one offering a slightly lower grade at a cheaper
price. This paper was rated for use by normal photocopiers, but the
engineer explained that it was inappropriate for the high-speed unit
in reprographics which ran at a much higher temperature than normal
photocopiers do, especially during long runs. The paper was degrading
during these long runs, and this was causing the paper jams. The engi-
neer had identified the root cause and recommended that the company
purchase a higher grade of paper for this particular unit.
COMPLEX
As was mentioned in the previous chapter, computer systems are
becoming ever more complex. It is increasingly the case that some
problems do not have a single root cause. It may be that a number of
components have all partially deteriorated or failed. It may be the case
that no individual component has failed at all, but that the interaction
between components is no longer within tolerance. Problems where
there is no single root cause are called complex problems.
Whilst there are standard techniques for discovering the root cause
of a complicated problem, it is more difficult to deal with a complex
CHAOTIC
It is sometimes the case that any change to a system seems to make it
worse. In this case, the problem is described as chaotic. The Cynefin
Framework recommends that where a chaotic problem arises the objec-
tive is to make changes which will transform it into a complex problem,
which may then be addressed. We should also consider why chaotic
problems occur. It may be the case that a chaotic problem is the cause
of poor management or deliberate, malicious action.
A standard method of recovery from a chaotic problem is to wipe the
system and rebuild. This is particularly important if the chaos has been
caused maliciously. Consider a virus infestation. If an individual PC has
been infected, then it is theoretically possible to examine the system
and clean it. However, if the behaviour of the virus is poorly under-
stood and the infestation has not been caught quickly, the quickest
route is to wipe the hard drive and reinstall from a known good backup
(or from scratch). It is important to consider network drives within
this. Some viruses will infect files on network drives. If the local system
is rebuilt, but the network drives are left intact, then as soon as the user
logs back in they can re-infect the PC. Macro-enabled documents and spreadsheets or applications which are shared across network drives should be checked before users are allowed to continue to use them.
In an Agile or DevOps situation, poor change enablement could result
in multiple changes to the different levels of a system being made at
the same time. A network change, a database change, an application
change and a web interface change could all occur at the same time and
cause a chaotic system. Reverting to a known good situation is the best
approach in situations like this.
It is worth remembering that a timeline is still worth establishing even when a chaotic problem is believed to exist. If a system has to be rebuilt from scratch, it is important to know to which point in time it should be rebuilt, and therefore which version of each software item should be used.
Complexity Theory says that for a chaotic problem the resolver should
act first and then sense before responding.
SUMMARY
Understanding the nature of the problem will assist with which tech-
niques are used and how they are used.
Note
1 https://en.wikipedia.org/wiki/Cynefin_framework
CHAPTER 5
AUTOMATION AND ARTIFICIAL
INTELLIGENCE
IDENTIFICATION
Automated tools are good at two activities which lend them to problem
solving. The first is large-scale processing of data, and the second is pat-
tern matching. Keeping a diverse fleet of end user devices patched and
updated to the latest software and device drivers is a complicated and
time-consuming task. At their simplest, automated tools can offer to
automatically identify software which does not have the latest patches
and, if directed, to apply those patches to the software. Note that there
is a complication here in that some updates are free and should always
be applied and others have licencing implications. An automated tool
cannot apply those updates without authorisation.
Given that patches are typically released to address bugs in software
which could otherwise lead to problems, this is a form of proactive prob-
lem management.
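The identification step described above can be sketched as a toy comparison of installed versions against a known-latest list. This is an illustration only: the device names, application names and version scheme are invented, and a real tool would compare version numbers properly rather than testing simple inequality.

```python
def missing_patches(installed, latest):
    """installed: {device: {app: version}}; latest: {app: version}.
    Returns, per device, the apps whose version differs from the latest."""
    report = {}
    for device, apps in installed.items():
        stale = {app: (ver, latest[app])
                 for app, ver in apps.items()
                 if app in latest and ver != latest[app]}
        if stale:
            report[device] = stale
    return report

# Invented fleet inventory
fleet = {
    "pc-001": {"office-suite": "12.1", "pdf-reader": "9.0"},
    "pc-002": {"office-suite": "12.4", "pdf-reader": "9.0"},
}
latest = {"office-suite": "12.4", "pdf-reader": "9.0"}
print(missing_patches(fleet, latest))
# prints {'pc-001': {'office-suite': ('12.1', '12.4')}}
```

Whether the tool then applies the update automatically, or merely reports it, is exactly the licensing and authorisation question raised above.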
DIAGNOSIS
It is not clear that current automated tools are conducting intelligent
diagnostics for themselves. They may flag an event for further investiga-
tion by the local support team, but the tool does not need to understand
the cause of a blue screen of death (BSOD) event to know that there is an underlying problem.
If the automated tool does provide a higher level of diagnosis, then it
may appear to the organisation using it to be highly intelligent, but this
higher level of diagnosis may in fact be the activity of engineers work-
ing for the third party who provide and support the automated tool.
In addition to keeping the patching list up to date, engineers may also
diagnose problems which their system has identified. Edging into the
area of artificial intelligence, the automated tool may conduct pattern
matching on all the errors which it has detected within a given period of
time (e.g. all the BSOD on a given day) and look for common attributes
which connect these systems and which are more prevalent within this
population than in the wider computer population. Engineers can then
look at this information in order to try to discern a root cause. As a
theoretical example, it may be that the pattern recognition software
identifies that BSOD events are more likely to occur if an older ver-
sion of an office productivity suite is used in combination with an
up-to-date version of a marketing tool application. The pattern recog-
nition software may also identify that the systems concerned also have
social media management applications in common; however, if this
marketing tool application utilises the office productivity suite to send
emails, then a causal link may be established between those two appli-
cations whereas the other applications may have no relationship to each
other. The engineers could then check compatibility notes (which an
artificial intelligence tool would not typically be able to do) in order to
establish whether the functional requirements for the marketing tool
application mention specific versions of the office productivity suite. If
this is a known error, the engineers can configure the automation tool
to flag up the inconsistencies to the local support teams. Note that it
is not a problem which can be fixed by the automation tool in isolation
since there are licensing implications to address. Upgrading the office
productivity suite to the latest version is not free and would require
new software licences.
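The pattern-matching idea can be sketched as a simple prevalence comparison (an illustration, not how any particular product works): flag attributes that are markedly more common among the failing machines than across the whole fleet. The application names and figures are invented.

```python
from collections import Counter

def overrepresented(failing, fleet, min_ratio=2.0):
    """failing and fleet are lists of attribute sets, one set per machine.
    Return attributes whose prevalence among failing machines is at
    least min_ratio times their prevalence across the whole fleet."""
    fail_counts = Counter(attr for m in failing for attr in m)
    fleet_counts = Counter(attr for m in fleet for attr in m)
    flagged = []
    for attr, n in fail_counts.items():
        p_fail = n / len(failing)
        p_fleet = fleet_counts[attr] / len(fleet)
        if p_fail >= min_ratio * p_fleet:
            flagged.append(attr)
    return sorted(flagged)

# Invented data: both failing machines run the older office suite
failing = [
    {"office-v11", "marketing-tool"},
    {"office-v11", "marketing-tool", "social-app"},
]
others = [
    {"office-v11", "pdf-reader"},
    {"office-v12", "marketing-tool"},
    {"office-v12", "marketing-tool"},
    {"office-v12", "marketing-tool", "social-app"},
    {"office-v12", "marketing-tool", "social-app"},
    {"office-v12", "marketing-tool"},
]
fleet = failing + others
print(overrepresented(failing, fleet))
# prints ['office-v11']
```

Note that the marketing tool is not flagged, because it is common across the whole fleet; over-representation, not mere presence, is what points the engineers towards a candidate cause.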
RESOLUTION
An automation tool can be involved in resolution at different levels.
If a software patch is identified during the diagnostics stage, then the
automation tool can deploy that patch without a support person need-
ing to visit the end user device. Whereas identification of a hardware
fault would involve the local support team. Within the context of a
data centre, the automation tool may identify that a server is at risk
of failing and remove the server from the production cluster, moving
services onto other servers. The data centre team can then repair the
server in their own time.
SHIFT LEFT
There are a number of ways in which automated tools offer a shift left
capability to organisations.
SUMMARY
The use of automated tools provides a level of proactive problem man-
agement which should not be dismissed, even though it may appear
to only address minor issues. Automated tools are very dependent on
the support infrastructure which surrounds them. If they are not kept
up-to-date, then their value will deteriorate extremely quickly.
CHAPTER 6
DRILL DOWN
There are obvious problems where the root cause has been seen before
and is well understood, but for which it is important to be able to
quickly identify which of a range of root causes is presenting on any
given occasion. This is particularly important for Service Desk analysts
who will be presented with a particular scenario and need to work
methodically through that scenario to identify which root cause has
resulted in the behaviour being exhibited. Service Desks have a num-
ber of aims, of which one is to resolve, on first contact, as many inci-
dents as possible (the first contact fix rate). They also wish to minimise
the number of incidents which result in a second line engineer need-
ing to visit on site to resolve the incident. If the Service Desk ana-
lyst (SDA) can identify and resolve an incident during the first contact
(whether that is a phone call, a chat session or a walk in), then this is
an efficient use of resources. One common way to achieve this is the
use of the Drill Down technique, and this is commonly applied with
the aid of conditional checklists. We will explore the use of checklists in
more detail in Chapter 14, but the conditional checklist is a technique
whereby the SDA will ask a series of questions from a checklist and
the answer to one question will determine the next question asked.
In theory, at the end of the checklist the SDA will know which root
cause has exhibited this behaviour and therefore what course of action
should be followed. In some cases, this will be an escalation to a second
or third line team; in others it will necessitate a remote session to the
client PC in order to apply a fix. Drill Down is the technique which
is employed to facilitate this. Drill Down can be used without condi-
tional checklists but is particularly efficient when used with their aid.
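As an illustration, a conditional checklist can be represented as a simple decision tree, where each answer selects the next question. The following sketch uses invented questions and causes purely for illustration; a real Service Desk checklist would be far richer.

```python
# A conditional checklist as a decision tree: each node holds a question
# and maps the user's answer to the next node, terminating at a probable
# root cause. All questions and causes here are illustrative inventions.

CHECKLIST = {
    "question": "Can you open any website in a browser?",
    "answers": {
        "no": {
            "question": "Does the network icon show as connected?",
            "answers": {
                "no": "Probable cause: network cable/Wi-Fi disconnected",
                "yes": "Probable cause: IP address or DNS issue",
            },
        },
        "yes": {
            "question": "Can you log in to the email web client?",
            "answers": {
                "no": "Probable cause: email account or password issue",
                "yes": "Probable cause: local email client misconfiguration",
            },
        },
    },
}

def drill_down(node, answers):
    """Walk the checklist, consuming one answer per question."""
    for answer in answers:
        node = node["answers"][answer]
        if isinstance(node, str):       # reached a leaf: a probable cause
            return node
    return node["question"]             # more questions still to ask

# e.g. the user cannot browse, but the icon shows connected:
# drill_down(CHECKLIST, ["no", "yes"])
#   -> "Probable cause: IP address or DNS issue"
```

The point of the structure is that the answer to one question determines the next question asked, exactly as described above.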
The objective of Drill Down is to take an exhibiting condition which
could have multiple causes and drill down into the issue to understand
DOI: 10.1201/9781003119975-7
Hypothetical Case Study
since the PC obtained its IP address and that the DHCP server has allo-
cated the same address to another PC causing a conflict. In situations
such as this the computer may have popped up an alert informing the
user. However, users do not always understand these alerts and may
have ignored it, not realising that it was related.
Note that the diagnostics command will typically indicate what state
the network is in as far as the computer is concerned, but may not
indicate why.
Once the SDA has completed the Drill Down, they should have col-
lected sufficient information either to effect a remote fix or to triage
the incident to the correct team to address.
It is worth remembering that this incident started with the user report-
ing an email issue, but after Drill Down it has been established that it
was in fact a network issue totally unrelated to the email service. It is
quite common for the presenting issue to not be related to the root cause.
This is one of the reasons why the Drill Down technique is so useful.
SUMMARY
Drill Down works very well in situations where the resolver is try-
ing to narrow down an issue which is likely to be a well-known issue,
but which cannot be immediately identified for whatever reason (for
example because the resolver is remote and is relying on the user to
explain their issue).
Rule 1: Use a standardised checklist (see Chapter 14 for more
information about checklists) to ensure a common approach.
Rule 2: Be prepared to short cut the checklist if there is good
evidence for a probable cause or for eliminating certain
options – don’t treat users as idiots.
Rule 3: Each question asked should have a purpose and should
enable the possible causes to be narrowed down further.
Rule 4: Remember that the presenting issue may be unrelated to
the root cause.
Rule 5: Drill Down does not always find the cause. If a first time
fix is not going to be possible, collect the appropriate amount of
information and pass the incident to the correct team.
CHAPTER 7
DIVIDE AND CONQUER
DOI: 10.1201/9781003119975-8
the desktop computer. The two obvious ones are the switched mode
power supply and the motherboard. If both of these are working, then
one would expect some visual or audible clue that the computer is not
totally dead. The repair engineer will isolate these two components by
unplugging the power connection to the motherboard and testing it
with a multi-meter. This should give a clear indication as to whether
the switched mode power supply is working or not. If it isn’t, then the
repair engineer can replace it. There is a possibility that the evidence
will point towards the motherboard being faulty, when in fact it is a
component connected to the motherboard. Typically, before replacing
the motherboard, everything else will be disconnected or unplugged to
confirm that it is the motherboard itself.
My next example is domestic rather than computer-related. A kitchen
has many appliances and in this scenario, the homeowner is regularly
suffering power cuts. Typically twice per week the fuse board RCD1
protection will trip for the kitchen circuit suggesting that there is a fault.
However, it has been impossible to identify a single appliance which is
causing this. How can we diagnose this fault? Divide and Conquer is an
ideal diagnostics tool for a task like this (although PAT2 testing would
be advised as well). Suppose the homeowner leaves half their appliances
plugged in and only plugs the other appliances in when they are actually
using them. They keep a diary of which appliances are plugged in when
the RCD trips. Each time the RCD trips they change the combination
of appliances left plugged in (obviously always leaving the fridge and
freezer on, but changing combinations of other appliances). It is almost
certainly the case that the RCD trips are being caused by poor insula-
tion within two or more appliances, each not sufficient in its own right
to cause the trip, but together sufficient. With the use of the diary, it
should be possible to identify two appliances which together cause the
trip. It may be that there are three appliances contributing, but never-
theless the task is to identify two which are contributing first and then
see if they together are sufficient. If the two on their own are not suffi-
cient then a third needs to be identified. Once the appliances have been
identified they need to either be repaired or replaced.
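The diary analysis above can be sketched in a few lines: each diary entry records the set of appliances plugged in when the RCD tripped, and intersecting the entries narrows the suspects, since any pair jointly causing the trips must appear in every entry. The appliance names are invented for illustration.

```python
# Each diary entry is the set of appliances plugged in at the moment
# the RCD tripped. Intersecting the entries leaves only the appliances
# present at every trip; the always-on fridge and freezer carry no
# information and are removed from the suspect list.

trip_diary = [
    {"fridge", "freezer", "kettle", "toaster", "dishwasher"},
    {"fridge", "freezer", "kettle", "dishwasher", "microwave"},
    {"fridge", "freezer", "kettle", "dishwasher", "washing machine"},
]

always_on = {"fridge", "freezer"}   # never unplugged, so not informative

suspects = set.intersection(*trip_diary) - always_on
print(sorted(suspects))  # -> ['dishwasher', 'kettle']
```

With enough diary entries the suspect set shrinks to the contributing pair, which can then be tested together to confirm they are sufficient to cause the trip.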
Software defects can also be identified using Divide and Conquer.
Software developers may put break points in their code in order to
examine the state of memory at a given point in order to establish
whether a fault is occurring before or after that point.
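The break point approach amounts to a binary search over the steps of a computation: if the state is already wrong at a midpoint, the fault lies at or before it; otherwise it lies later. A minimal sketch, with the invented `state_is_correct_after` check standing in for inspecting memory at a break point:

```python
# Binary search for the first step at which the program state goes
# wrong. state_is_correct_after(step) is a stand-in for the developer
# examining memory at a break point placed after that step.

def find_faulty_step(n_steps, state_is_correct_after):
    """Return the first step after which the state is wrong."""
    lo, hi = 0, n_steps - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if state_is_correct_after(mid):
            lo = mid + 1        # fault occurs after this point
        else:
            hi = mid            # fault occurs at or before this point
    return lo

# e.g. if the defect corrupts the state from step 42 of 100 onwards:
# find_faulty_step(100, lambda step: step < 42) -> 42
```

Each check halves the remaining range, so even a long computation needs only a handful of break points to localise the fault.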
Case Study: Corrupted Surnames
longer fields fixed the issue. The identity card system field lengths were
also checked, and it was confirmed that they matched those in the HR
system; it was only the intermediary system which could not cope with
the exceptionally long surname.
SUMMARY
Complex systems may be isolated into separate components using a
process of Divide and Conquer in order to assess which component
is suffering the problem. Understanding the failure modes within the
system is very valuable in determining how to divide the components.
Rule 1: Identify possible failure modes
Rule 2: Design a test which will distinguish between one set
of failure modes and another, thereby reducing the possible
number of failure modes
Rule 3: Repeat this until it is possible to identify a single failure
mode as a contributing factor
Notes
1 Residual Current Device
2 Portable Appliance Testing is an electrical appliance safety test
process
CHAPTER 8
CAUSE AND EFFECT
DOI: 10.1201/9781003119975-9
• Processes
• Equipment
If we look again at the five factors which together caused the Apollo
disaster, then we can see how they may be classified according to the
above:
1. Tank 2 was originally in Apollo 10, but was removed to fix a
fault. It was dropped when it was removed.
2. There were thermostats which were designed to operate at 28
volts, but were powered with 65 volts – they failed to operate
correctly.
3. The temperature gauge was only rated up to 29° Celsius, so
failed to detect the failed thermostats.
4. During testing, Tank 2 needed to be emptied and the drain
system didn’t work, so they boiled off the oxygen. Without the
functioning thermostats, temperatures may have reached 540°
Celsius.
5. The high temperatures appear to have damaged the Teflon
insulation.
We may see that there is a failing of strategy at multiple levels here. The
re-use of equipment without a thorough understanding of the impli-
cations of that re-use lies at the heart of this disaster, contributing to
factors 1, 2, 3 and 4 to a greater or lesser extent.
It may be argued that it is a failing of strategy when staff ignore
mistakes (such as dropping the tank) and apply undocumented work-
arounds when things go wrong (such as boiling off the oxygen).
There were failings of environment with the use of equipment rated
at 28 volts being used in a 65-volt environment and also of not being
clear what the temperature range for equipment was (factors 2 and 3).
There was a clear lack of communication between teams in order to
arrive at these factors, even though that poor communication may not
have been spotlighted in the official report. Not all staff who work in
IT enjoy writing documentation, but so many problems arise because
when a change needs to be made no one understands the original speci-
fication. If we want to reduce the number of problems and increase the
speed of resolution of problems, then we need to tackle poor commu-
nication and documentation.
The culture in which people work is a huge driver for the rate of prob-
lems within an organisation. Whilst I strongly believe in a no blame
Case Study: Data Centre Failure
culture where we do not blame an individual for doing their job (unless
they have been clearly negligent or wilfully disobedient), it is important
to understand that the incidence of problems will reduce only if the
culture which allows them to occur is tackled. In an industry such as
space travel the culture must be safety first, but the Apollo 13 disaster
emphasised that NASA did not operate such a culture. Unfortunately,
later disasters would suggest they did not learn this lesson.
Processes are put in place for a reason. It may be argued with hind-
sight that factor 4 failed to follow process, although it was signed off
at the time. This raises the interesting question of when alternative
options should be considered and when they should not be. Finding a
workaround for a failed process at the time of the failure can be fraught
with danger. It is very easy to choose a process which appears on the
surface to work but has unexpected consequences. Applying quick fix
workarounds should always be seen as a risk-based activity where the
risks of not doing it need to be balanced with the risks of doing it and
it going wrong. In both this disaster (which wasn’t fatal) and the space
shuttle Challenger disaster (which was), the risk of launch delay was
balanced with the risk of a catastrophic loss of the mission. It was later
said of the Challenger disaster, ‘Violating a couple of mission rules was
the primary cause of the Challenger accident.’1
The equipment failed – that was the presenting fault, but it only failed
because of other failures which had already occurred, and which had
been ignored.
We may combine all these contributing factors together in a fishbone
diagram (Figure 8.1).
In Chapter 3 we looked at failure modes and concentrated mainly on
the technical/equipment failure modes. It is important to remember
that there are wider issues to consider and not all failure modes are
associated with the equipment itself.
SUMMARY
When considering failure modes and when tackling issues, do not only
consider the technical, equipment-based vulnerabilities, but also con-
sider other factors.
Rule 1: Consider all factors
Rule 2: If undiagnosed failures have already occurred, then
problems may be compounded. Identify and fix them before
they affect services.
Note
1 https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster#cite_note-18
CHAPTER 9
RESOLUTION EVALUATION
METHODS
In Chapter 4 we looked at complexity theory and noted that not all prob-
lems have a single root cause. It is tempting to think that if a root cause
(or multiple root causes) can be identified, then that is the end goal, the
cause(s) can be fixed and normal operations resumed. Problem manage-
ment in the real world does not end with the identification of the root
cause.
ALTERNATIVE SOLUTIONS
There are some problems which are caused by something which breaks
or something which changes and breaks something. In cases like this, it
is usually possible to identify the obvious fix. As the complexity of prob-
lems increases, even understanding the root cause of the problem does
not guarantee a single solution. There may be different ways of fixing the
problem (or applying a workaround). Where different options could exist,
these need to be identified and a risk analysis undertaken for each one.
The objective for each option is to determine both the likelihood that it
will resolve the problem and any potential problems or negative conse-
quences, in order to minimise the risk. Some options may fix the present-
ing problem but cause secondary problems or move the problem from
one part of the system to another. Remember that in the Apollo disaster,
the oxygen was boiled off to fix one problem, but caused a greater one.
A plumbing analogy for this is that if a pipe is blocked, applying water
at high pressure carries two risks – one is that the increased pressure
might break a joint or alternatively you might just move the blockage
DOI: 10.1201/9781003119975-10
from one part of the system to another, maybe more inaccessible, part.
It is tempting to think that if a disk fills up, moving the data to another
disk will resolve the problem, but care needs to be taken that the sec-
ond disk has the storage capacity required (not just at that moment but
for a period of time), and that it also has the throughput capacity to
cope with the additional demands being placed on it. One methodol-
ogy available for this type of analysis is called Kepner-Tregoe.1
CONTINUAL IMPROVEMENT
Some root cause resolutions may be achievable within the normal
Business As Usual (BAU) operations of the organisation. If this is the
case, then the proposed resolution should be added to the Continual
Improvement register and prioritised alongside all the other service
improvements being considered. The cost benefit of resolving this
problem needs to be quantified in such a way that it can be compared
with other priorities within the improvements list. As an example of
the issues which may be encountered – suppose a problem is causing
an hour’s additional work to the operations team once a month, but is
otherwise not adversely affecting the service. Clearly it would be ben-
eficial to resolve this problem. It is not clear, however, that this should
be done to the exclusion of all else. There may be two other competing
claims on the time of the operations team. One may be a more obscure
problem which only manifests itself every few months but is service
affecting, whilst the other may be an efficiency suggestion rather than
a problem. The implementation of the efficiency suggestion may save
2 hours per month. At this point, the cost benefit of each option needs
to be evaluated together with the team capacity required for implemen-
tation. If the implementation of the efficiency suggestion is as cheap as,
if not cheaper than, fixing the initial problem, it may be deemed better
value for money since the net saving is 1 hour a month. Conversely, fix-
ing the problem may only take 3 hours whilst the implementation of
the efficiency suggestion might take 30 hours. If this is
the case, the payback period may be deemed too long or the operations
team may not have the resource to dedicate to such a piece of work.
The service affecting problem will also have a cost benefit which may
be more difficult to evaluate in the same way but it may be deemed a
higher priority than either of the other two depending on the nature of
the organisation and the predictability and impact of the outage.
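The trade-offs described above can be made concrete with a simple payback-period calculation (implementation cost divided by the saving per month), using the figures from the text:

```python
# Payback period in months: hours to implement divided by hours saved
# per month. The figures match the worked example in the text.

def payback_months(cost_hours, saving_hours_per_month):
    return cost_hours / saving_hours_per_month

fix = payback_months(3, 1)           # fix: 3 hours to implement, saves 1 h/month
efficiency = payback_months(30, 2)   # suggestion: 30 hours, saves 2 h/month

print(fix, efficiency)  # -> 3.0 15.0  (the fix pays back five times faster)
```

A single figure like this never captures service impact, which is why the service-affecting problem may still be prioritised above both, but it does make the comparison between the two routine items objective.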
Equally, it is sometimes acceptable to live with a workaround for a
significant period of time rather than expend the energy involved in
resolving the problem.
An example of this was a Java memory leak in an ITSM tool. The Java
memory leak resulted in the application periodically crashing. Eventually
two workarounds were applied. The first was to increase the amount of
memory available to the application, which ensured that the application
could run reliably for at least a week without crashing. The second was
to restart the application once a week out of normal business hours. It is
important to note that in this case a problem resolution was outside of
the organisation’s capability since it was a commercial product. It is not
clear how quickly the supplier could have fixed the root cause, but the
workarounds removed the immediate need for a resolution.
PROJECTS
Even when the resolution is within the control of the organisation itself,
it is not always possible to implement a resolution within BAU opera-
tions. A new project may need to be established in order to fund and
SUMMARY
Identifying the root cause of a problem does not mean that a fix should
be applied immediately. An options analysis exercise should be con-
ducted to consider all options and then a cost-benefits analysis exercise
should consider this problem resolution in the context of the total work
backlog of the team concerned. It is often the case that proposed fixes
are added to a Continual Improvement register for consideration as
part of a wider process.
Note
1 https://www.kepner-tregoe.com/
CHAPTER 10
ITIL PROBLEM MANAGEMENT
In their ITIL 4 framework, Axelos Ltd define the practice of problem man-
agement as being distinct from the incident management practice. Reactive
problem management involves responding to incidents which have already
occurred in order to understand the underlying causes and address these.
Proactive problem management is about identifying risks and responding
to those risks before they manifest themselves in incidents.
PROBLEM CONTROL
ITIL 4 recommends that a key aspect of problem management is the
process developed for controlling and managing problems. Each prob-
lem which is identified (either through reactive or proactive problem
management) should be recorded in a problem record within an ITSM
tool or similar system. Problem records should be linked to related
resources. In reactive problem management, the related incidents
should be linked to the problem record. Configuration Items (CIs) such
as desktops, servers, printers and software assets should also be linked
to the problem record as required. The problem record is a way to:
• collate the information
• prioritise the effort
• coordinate who is involved
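As an illustrative sketch only (the field names are invented; a real ITSM tool defines its own schema), a problem record linking incidents and Configuration Items might look like this:

```python
# A minimal problem record: collates the information, carries a
# priority, and links related incidents, CIs and people.

from dataclasses import dataclass, field

@dataclass
class ProblemRecord:
    problem_id: str
    summary: str
    priority: int                                       # prioritise the effort
    incidents: list = field(default_factory=list)       # related incident IDs
    config_items: list = field(default_factory=list)    # linked CIs
    assignees: list = field(default_factory=list)       # coordinate who is involved

record = ProblemRecord("PRB0001", "Intermittent email failures", priority=2)
record.incidents += ["INC1234", "INC1240"]
record.config_items += ["mailserver01"]
print(len(record.incidents))  # -> 2
```

Linking incidents to the record in this way is what later allows the reoccurrence of incidents to be spotted and the record re-activated.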
staff effort and if the impact of this problem had been less, this
might not have been considered cost effective.
• Resolving the root cause: Having evaluated the optimum means of
fixing the root cause, this needs to be added to the work queue
for the relevant teams, appropriately prioritised alongside their
other work. Adequate testing of any changes to the system need
to be done before the fix is implemented and normal change
enablement processes followed. Once the fix is in place, the
result on users who have been affected needs to be evaluated.
Sometimes the fix at the server end will not resolve the issue
for the end users, who may also need to make a change on
their desktops (e.g. clearing the cache). If users are still using a
workaround, they need to be notified that the permanent fix is
now in place. The Known Error may be removed from the Service
Desk list of current problems once this has been completed.
• Long-term monitoring: Unlike incidents which should be marked
as resolved as soon after resolution as possible, a problem record
will typically be left in a semi-open state for a period of time in
order to assess whether the fix which has been applied has been
effective. Not all fixes address all issues. If the incidents reoccur,
then the problem record should be re-activated and moved back to
the identification stage. However, it should be noted that it is often
the case that the incidents for two related problems will all be
linked to the first problem record. If there is evidence that the first
problem has been successfully fixed, but that a second problem
exists with a different root cause, then a new problem record
should be created and the relevant incidents moved across. As a
general rule of thumb, an incident should not be linked to two
problem records as there should not be two independent problems
causing it (as distinct from one problem with multiple root causes).
• Closed: a problem record which has been monitored for a reasonable
length of time, with no recurrences may be marked as closed.
KNOWLEDGE MANAGEMENT
One key aspect of both proactive problem management and reactive
problem management is knowing how data is meant to flow between
SUMMARY
A formal practice and process for problem management, such as
the ITIL 4 practice, is a good way of methodically keeping track of
problems.
Note
1 https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us
CHAPTER 11
PROBLEM BOARDS AND
PROBLEM RECORDS
DOI: 10.1201/9781003119975-12
How Problem Boards Work
• Management liaison
• Note taker
• Service Desk representative
THE 5 WHYS
The 5 whys is a technique which I find particularly useful in the con-
text of problem boards. In essence it is the same as drill down (Chapter
6). In drill down an expert will ask the customer a series of questions in
order to narrow the possible range of causes. In the 5 whys technique,
a non-technical person (such as the chair of the problem board) will
ask a technical expert a series of questions in order to home in on the
root cause. When a technical expert is asked why an event has hap-
pened, they will typically give a fairly general statement as to which
subsystem of the service they believe to be at fault. Note that the 5
whys technique asks why this has happened rather than what has hap-
pened. If the storage has failed, it is usually fairly easy to state that the
storage has gone off-line. From the point of view of root cause analysis,
the important question is why? An example might be as follows (not-
ing that this may span multiple problem board meetings as more data
is collected):
Chair: Why has the service failed?
Expert: The storage system became unavailable
Chair: Why did the service not fail over to the other site?
Expert: The virtual machines continued to run, but hung waiting for
their storage
Chair: Why did the storage system become unavailable, when there
are two controllers?
What Happens When Problem Management Doesn’t Work?
Expert: Both controllers ran out of memory at more or less the same
time and the first was still restarting when the second failed
Chair: Why did both controllers run out of memory?
Expert: We are talking to the supplier about this, it looks like a bug in
the controller firmware
Notice that although it is described as a 5 whys process, it is not essen-
tial to ask precisely 5 whys. The point is that the dialogue should con-
tinue until it is clear why the behaviour occurred in order to be able
to ask the question of what should be done about it. In this case, there
were two actions which could be taken: one to reduce the likelihood
that the memory capacity would be exceeded, and the other was to
request a firmware patch from the supplier.
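The dialogue above can be captured as a simple chain of why/answer pairs, with the final answer standing as the working root cause until it too can be questioned:

```python
# The 5 whys exchange from the text recorded as an ordered chain.
# The last answer in the chain is the current working root cause.

five_whys = [
    ("Why has the service failed?",
     "The storage system became unavailable"),
    ("Why did the service not fail over to the other site?",
     "The virtual machines hung waiting for their storage"),
    ("Why did the storage system become unavailable?",
     "Both controllers ran out of memory at almost the same time"),
    ("Why did both controllers run out of memory?",
     "Suspected bug in the controller firmware"),
]

root_cause = five_whys[-1][1]
print(root_cause)  # -> Suspected bug in the controller firmware
```

Keeping the chain as data, rather than as meeting notes, makes it easy for a problem board to extend it as more evidence arrives across multiple meetings.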
and fully patched version, but was not the very latest version, so it was
agreed that the telephony team would trial the newer version. Key staff
being affected would be contacted and asked to collect more detailed
information about the exact time the incidents occurred, which the
networks team would then be able to use to track what was happen-
ing by comparing this with the events log for the VPN servers. The
desktop support team would evaluate a diagnostics tool which could
run on the users’ computers and might collect more detailed routing
information.
All of this is good problem management process. We never found
out whether it would have got to the root cause, because before the
second meeting of the problem board the issue was resolved via a dif-
ferent route. A separate incident came in from a project team con-
cerning the call centre development system and when the networks
team diagnosed that incident, they discovered that there was a rout-
ing mismatch and that some traffic destined for the production system
was being re-directed to the development system in error. The rout-
ing table was adjusted and the problem disappeared. The timeline was
double-checked. The routing change had been correctly logged some
months previously. It was only when there was increased activity from
the project team on the development system that the error impacted
the production service.
The conclusion is that problem solving techniques are not perfect
and sometimes they won’t help. However, having a consistent method-
ology is far better than searching for a needle in a haystack, even if it
is not always the quickest method. In my experience, if people can fix
it quickly, they will do so without recourse to using these techniques.
Problem management is valuable when that option has failed.
SUMMARY
Where a problem spans the expertise of multiple teams it may be
advantageous to form a problem board to coordinate the effort and
encourage effective collaboration.
CHAPTER 12
THE DRIVE FOR EFFICIENCY
DOI: 10.1201/9781003119975-13
Defects
Lean uses the term ‘defects’, but these may be thought of as problems
affecting the output of a system. Anything which causes the product
the user consumes to fail to conform to the standards and specifica-
tions required may be considered to be a defect. This may be the result
of incorrect source data, software bugs which incorrectly process that
data or hardware issues which corrupt the data.
Excess Processing
There is a maxim which states that data should be processed only once.
Many people are aware of systems which should talk to each other but
which are not integrated and the result is that operators have to manu-
ally re-type data in order to transfer it from one system to another.
This is an extreme example of excess processing, but there are other
examples where a poorly designed system expects the operator to
make calculations which the system could do for them, or they have to
remember information between different screens. There is a significant
risk where data has to be re-typed that errors will be introduced. Even
the number of clicks required to achieve a simple task may be classified
as excess processing.
An example from the ITSM world was a university Service Desk
which was required to log an incident for every person who came to
their walk up desk. Many of the visits could be classified under one of
ten headings and did not require further follow up (headings included
purchasing consumables, resetting passwords, providing routine
advice). Quick action links were provided which enabled the Service
Desk analysts to record these events with the click of one button, rather
Using Lean to Optimise Processes
than multiple clicks. The data was collected, but not at the expense of
the Service Desk staff.
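The quick action idea can be sketched as follows (the headings and record fields are invented for illustration): one function call, representing one click, logs a pre-filled record with no further data entry.

```python
# One click, one pre-filled incident record: the SDA selects a heading
# and everything else is filled in automatically.

import datetime

QUICK_ACTIONS = [
    "Purchasing consumables",
    "Resetting passwords",
    "Providing routine advice",
]

incident_log = []

def quick_log(heading):
    """Record a walk-up visit with a single action."""
    incident_log.append({
        "heading": heading,
        "logged_at": datetime.datetime.now().isoformat(timespec="seconds"),
        "follow_up": False,        # these visits need no further follow up
    })

quick_log("Resetting passwords")
print(len(incident_log))  # -> 1
```

The data the organisation wants is still collected in full; only the excess processing by the Service Desk analyst is removed.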
Overproduction
In a factory setting, it is obvious that producing too many products is
unnecessary and creates both cash flow issues and storage issues. It may
be less clear in an IT scenario what is meant by overproduction. Lean
is not just interested in the end product, but in the overproduction of
components. In IT teams, it may be helpful to think about the layers
of a typical client server application. Typically, a client interface (often
web-based) will talk to a web server, which will in turn talk to an
application server, which will in turn talk to a database server. In order
to scale up a service and provide resilience, there are usually multiple
servers in each layer. Assuming that demand for this service fluctuates
both according to time of day and according to the seasons, then there
is no need to run the maximum number of servers all the time. Clearly,
the service needs to be able to cope with peak demand when neces-
sary, but it should be able to scale up and down to meet that demand.
Some infrastructures will do this automatically, but other infrastruc-
tures require this to be a manual process. The typical consequence of it
being a manual process is that the number of servers is rarely changed
and there is ‘overproduction’ of servers for much of the time.
A potential problem is caused by this inflexibility if the anticipated
peak demand is exceeded and performance drops as a result.
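The scale-up and scale-down behaviour described above can be sketched as a simple rule: provision enough servers for current demand plus headroom, within fixed bounds. The thresholds and capacities here are invented for illustration.

```python
# Scale a server tier to current demand plus headroom, bounded by a
# resilience floor and a cost ceiling. All figures are illustrative.

import math

def servers_needed(requests_per_sec, per_server_capacity=100,
                   headroom=1.25, min_servers=2, max_servers=20):
    wanted = math.ceil(requests_per_sec * headroom / per_server_capacity)
    return max(min_servers, min(max_servers, wanted))

print(servers_needed(80))    # -> 2   (quiet period: resilience floor of two)
print(servers_needed(950))   # -> 12  (near peak: scale up)
```

The `max_servers` ceiling is exactly the inflexibility mentioned above: if anticipated peak demand is exceeded, the tier cannot grow further and performance drops.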
Waiting
The obverse of overproduction is waiting – where a user has to wait for
a service because there is insufficient resource available at the time that
they wish to use it.
This may on occasion be measured in fractions of a second, when
they have to wait for a screen to update because the servers or other
infrastructure have not been scaled to match demand. Another form
of waiting is for maintenance. Older computer systems were often
designed in such a way that maintenance and updates require the
whole service to be taken off-line whilst the change is made. It may be
possible to do this at a quiet period of the day, but it is sometimes the
case that a few users still have to wait for the service to be restored.
Inventory
Holding excess stock does not at first sight seem to be related to prob-
lem management. It is, however, a real problem for large IT organisa-
tions who purchase desktops, laptops, servers and storage arrays in large
quantities. Economies of scale mean that they are encouraged to buy
in bulk. However, this needs to be balanced with the demand and also
with the ability to commission and deploy the equipment. In a badly
organised company, it is possible for excess computer equipment to sit
in store rooms beyond the warranty period or even in extreme situa-
tions beyond the supported lifetime. Universities in the UK suffer from
a poor accounting technique which does not allow individual depart-
ments to carry over surplus revenue from year to year. As a result, it is
quite common for IT departments who have an excess at the end of a
year to advance purchase equipment for the next year. Sometimes, that
equipment will sit in boxes for much of the year because staff do not
have the time to commission it, or the project it was bought for was not
ready to use it. The consequence of this is that the lifespan of the equip-
ment is shortened or the equipment is made to run beyond its support
contract, increasing the risk of it failing outside of its warranty period.
Transportation
Processes which involve the frequent or routine movement of physical
equipment need to be evaluated to consider whether the transportation
of those goods is optimal. There are organisations where business units
purchase their own IT equipment and have it delivered locally. The
IT department then needs to collect that equipment and take it some-
where else to be commissioned before returning it to the right business
unit. This is highly inefficient. Even the location of storerooms within
a building should be considered to determine whether the processing
route is optimal.
Creative Problem Solving
Motion
Related to transportation is the movement of people: the number of
visits to the data centres which are required for routine operations.
Many data centres
now operate on a ‘lights out’ basis where there is no need to visit the
data centre to manage the equipment at all. Other data centres see
daily trips to change backup tapes, commission new equipment on a
piecemeal basis and a wide variety of other activities which could be
managed in a more structured way. The cost of these daily visits should
be compared with the cost of additional equipment which could negate
the need.
Non-Utilised Talent
Computer systems lend themselves to automation and scripting, and
yet highly technical teams often find themselves doing routine and
repetitive tasks because it is easier to keep doing the repetitive tasks
than to find the time to upskill the staff and develop better ways of
working.
One of the consequences of manual repetitive tasks is that mistakes
creep in and problems result. Checklists are a way of both reducing
this type of error and enabling more junior members of a team to per-
form these tasks.
Identifying Value
One of the emphases of Lean is to identify where the value is derived
within a process and within a value stream and increase that value
whilst reducing the waste.
A side product of undertaking Lean is that when a problem does
arise, the overall process should be better understood and it should be
easier to identify the root cause.
In the first stage of this technique, the problem is identified and goals
agreed. Root cause analysis is then applied to the problem. Resolution
evaluation methods are treated in two sections in this technique as the
ideal state is first articulated and then options are evaluated against this
ideal state. When the problem is a clear case of something not working,
then an ideal state is usually obvious, but in the real world this is not
always the case and it can be important to ensure that all members of
the problem team are agreed on what ‘good’ looks like. Only once the
options have been evaluated and a best option chosen can a solution
be implemented. The results then need to be measured to ensure that
the desired outcome has been achieved. If it hasn’t, then the cycle may
need to be repeated. If it has, then the next highest-priority problem
may be tackled.
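The cycle described above — identify the problem, analyse the root cause, articulate the ideal state, evaluate options, implement, then measure and repeat if necessary — can be summarised as a simple loop. This is only an illustrative sketch; the function names and the toy demonstration are my own, not part of any formal methodology.

```python
def improvement_cycle(problems, solve, measure):
    """Tackle problems in priority order; repeat each cycle until the
    measured outcome shows the desired result has been achieved."""
    for problem in problems:            # next highest-priority problem
        while not measure(problem):     # outcome not yet achieved: repeat
            solve(problem)              # analyse, evaluate options, implement

# Toy demonstration: 'solving' records the problem as fixed, and the
# measurement simply checks that record.
fixed = set()
improvement_cycle(["slow backups", "disk alerts"],
                  solve=fixed.add,
                  measure=lambda p: p in fixed)
print(fixed)
```

The point of the loop structure is that measurement gates progress: the next problem is only tackled once the current cycle's results have been verified.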
SIX SIGMA
The Six Sigma methodology also grew out of manufacturing and places
emphasis on the quality of the output of production. In other aspects
there is commonality between this and Lean. This book will not
explore Six Sigma further.
SUMMARY
Making systems more efficient is an effective way of doing proactive
problem management.
CHAPTER 13
APPLYING THE PRINCIPLES TO
THE WORLD OUTSIDE OF IT
DOI: 10.1201/9781003119975-14
• the pump circulates the water around the system – this is either on or off and often has a light to indicate which state it is in; it may also have three speeds, but note that faster may not be better
• the room thermostat will send a signal to the control unit to
turn off the heating of the radiators once the ideal temperature
has been reached
• the cylinder thermostat will send a signal to the control unit
to turn off the heating of the hot water cylinder once the ideal
temperature has been reached
• the gate valve is placed at a T junction in the pipework with the
pump upstream from it, the cylinder off one leg of the T and
the central heating radiators off the other leg. The gate valve can
be in one of three positions, allowing water to flow either just
to the cylinder, just to the radiators or to both at the same time.
The gate valve is the part which fails most frequently in my
personal experience, and it is useful to note that it has a manual
bypass feature which allows the system to continue to operate
(albeit sub-optimally) should the gate valve fail. The manual
bypass places the gate valve in the mid-position, providing heat
to both the cylinder and the radiators.
If your heating system suddenly stops working, then divide and conquer
techniques may be applied to try to identify which of the above com-
ponents is at fault. Obviously if there is a water leak, then that needs
to be attended to and the system re-filled with water before anything else can be done.
If the heating system fails, then the first task is to collect data and a
timeline. Is the system still producing hot water at all? Is it still produc-
ing central heating at all? What temperature are various pipes around
the different components? Is the temperature constant or does it vary
according to time of day?
It is worth switching the control unit between various options and
collecting data for each setting:
• Hot water on; central heating (radiators) off
• Hot water off; central heating on
• Both on
• Both off
If the pipes get hot, then this would suggest that the gas boiler is work-
ing and that the pump is working. If the gas boiler comes on but no
pipes get hot, just lukewarm, then the pump may have failed. If either
heating or hot water is working fine but the other is not working at
all, this usually indicates that the gate valve has failed. At this point,
using the manual override lever on the gate valve should restore service
to both halves (albeit that it may not reach full temperature). If the manual override restores service, then either the gate valve itself or its power supply is at fault.
If using the manual override makes no difference, then the thermo-
stats should be considered and ruled out.
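The fault-isolation logic described above can be captured as a small decision function. This is purely an illustrative sketch: the observation flags and fault labels below are my own names for the symptoms described in the text, not standard heating terminology.

```python
def diagnose(boiler_fires, pipes_get_hot, heating_ok, hot_water_ok,
             manual_override_restores_both):
    """Map the observations from the control-unit tests to a likely fault.

    Each argument is a boolean observation gathered by switching the
    control unit between its settings and checking pipe temperatures.
    """
    if not boiler_fires:
        return "boiler (or its power/gas supply)"
    if not pipes_get_hot:
        # Boiler fires but pipes stay lukewarm: water is not circulating.
        return "pump"
    if heating_ok != hot_water_ok:
        # One half works and the other does not: classic gate valve symptom.
        if manual_override_restores_both:
            return "gate valve (or its power supply)"
        return "thermostats (rule these out next)"
    return "no fault isolated; collect more data"

# Example: boiler fires, pipes get hot, hot water works but heating does
# not, and the manual override restores both halves.
print(diagnose(True, True, False, True, True))
# → gate valve (or its power supply)
```

Writing the divide-and-conquer steps out like this makes the order of elimination explicit: supply first, then circulation, then the component that separates the two halves of the system.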
When we first moved into our current house the heating system
was working, but sub-optimally. After having the radiators fitted with individual thermostats, with no discernible improvement, I
determined that one of the active components must be faulty. I had a
spare control unit, so I tried replacing that, but it made little differ-
ence. I brought in a plumber and took advice. They suggested that the
pump probably needed changing and replaced it. It didn’t make much
difference. I contacted another plumber by email and gave them the
symptoms and asked them what they would do next, without telling
them that the pump had already been replaced. They said that the
pump needed replacing. I contacted the original plumber and asked
them to look at the gate valve. They said that they had looked at it
during the first visit, and they didn’t think the gate valve was faulty
but agreed to replace it (since I was paying). It was only when they
removed the wet side of the gate valve that we found out that the flap
which should form the gate within the pipework had broken in half
and was not making a good seal. Although the motor was rotating the
paddle, the paddle was making no difference to the system without
the flap on the end of it. Replacing the whole gate valve unit fixed the
problem.
Two years later, the electromechanical side of the gate valve failed
and I was able to diagnose and replace it for myself at considerably less
cost than the visit from the plumber.
If one has a toolbox of problem solving techniques, they may be
applied in all walks of life.
SUMMARY
Although this book describes problem management and problem solv-
ing from an IT perspective, the same techniques may be applied in all
walks of life.
CHAPTER 14
USING CHECKLISTS
In his book The Checklist Manifesto: How to Get Things Right, Atul
Gawande describes how checklists are a valuable resource in all walks
of life. They are a valuable tool in both proactive and reactive problem
management. The obvious example is in the use of the Drill Down
technique. One can apply the Drill Down technique using personal
experience in order to know what question to ask next. There are
two reasons why this is sub-optimal. The first, which is a main theme
of Gawande’s book, is that we all forget items from a list. If my wife
gives me a shopping list of items to buy from the store, then if I try to
remember the list it is quite likely that I will remember most of the
items, but I will sometimes forget one or two items. The same will be
true of applying the Drill Down technique just using personal experi-
ence – one will forget certain questions or ask them in the wrong order.
In comparison, if I write a shopping list for the store which is laid out in
the order in which I walk around the store and my wife ticks the items
we want this week, then my accuracy is improved. This is the principle
for the Drill Down technique – the right questions in the right order.
The second reason is that using personal experience works extremely
well for the expert who has been doing the job for many years, but it is
very difficult to transfer that knowledge to the new starter. If there is a written checklist of questions to ask, then that list can be given
to the new starter and they will be at near full speed within a very short
period of time.
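A Drill Down checklist is simply an ordered list of questions, and even a trivial data structure preserves both the wording and the order — which is the whole point of the technique. The questions below are invented examples for illustration; they are not taken from this book's appendices.

```python
# An ordered checklist: a new starter asks the same questions in the
# same sequence as the expert would.
drill_down_checklist = [
    "Is the service down for everyone or just one user?",
    "When did it last work?",
    "What changed between then and now?",
    "Can the server be reached on the network?",
]

def run_checklist(questions, answers):
    """Pair each question with its recorded answer, in checklist order."""
    return list(zip(questions, answers))

record = run_checklist(drill_down_checklist,
                       ["just one user", "yesterday", "password reset", "yes"])
for question, answer in record:
    print(f"{question} -> {answer}")
```

Because the list is data rather than memory, it can be reviewed, reordered, and handed to a new starter unchanged — the same benefit the shopping-list example illustrates.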
In terms of my shopping list, this proved useful during the Covid-19
pandemic of 2020/21 when I had to self-isolate for a week and could
not do my weekly shop. The shop offered an online facility, but on a
three-week lead time, which was of no use. A friend offered to do the
shopping for me. I was able to send them a printed list with not only
DOI: 10.1201/9781003119975-15
the items we required but also the aisles in the store where they would
find them.
Checklists are also useful in other stages of problem management,
to remind people to consider all the options. We typically look at six different possible categories of cause and effect (Chapter 8), and a checklist may be useful to ensure that all have been considered. Checklists are really
useful during problem boards to ensure that all items of the agenda
have been covered off.
Some sample checklists are provided in Appendix B.
SUMMARY
Building a set of checklists for your own circumstances is a valuable
way of ensuring that each problem is dealt with in a consistent way and
that steps are not forgotten.
CONCLUSION
‘Oh, how I wish I could shut up like a telescope! I think I could, if only
I knew how to begin.’ For, you see, so many out-of-the-way things had
happened lately, that Alice had begun to think that very few things
indeed were really impossible.
– Down the Rabbit-Hole, Alice’s Adventures in Wonderland,
Lewis Carroll
Begin at the beginning and go on till you come to the end: then stop.
DOI: 10.1201/9781003119975-16
APPENDIX A
GLOSSARY
DOI: 10.1201/9781003119975-17
APPENDIX B
SAMPLE CHECKLISTS
DOI: 10.1201/9781003119975-18
INDEX
5 Whys, 60
Artificial Intelligence, 27
Automation, 27
Impact, 3, 59
Incident Management, 54
Incidents, 4
ITIL 4, 51
Shift Left, 29
Workaround, 4, 59