Hack and - Linux Troubleshooting, Part I - High Load - Linux Journal PDF
Hack and / Linux Troubleshooting, Part I: High Load
May 01, 2010 By Kyle Rankin
What do you do when you get an alert that your system
load is high? Tracking down the cause of high load just
takes some time, some experience and a few Linux tools.
From Issue #191, March 2010
This column is the first in a series of columns dedicated to one of my favorite
subjects: troubleshooting. I'm a systems administrator during the day, and although
I enjoy many aspects of my job, it's hard to beat the adrenaline rush of tracking
down a complex server problem when downtime is being measured in dollars. Although
it's true that there are about as many different reasons for downtime as there are
Linux text editors, and just as many approaches to troubleshooting, over the years,
I've found I perform the same sorts of steps to isolate a problem. Because my column is generally
aimed more at tips and tricks and less on philosophy and design, I'm not going to
talk much about overall approaches to problem solving. Instead, in this series I
describe some general classes of problems you might find on a Linux system, and
then I discuss how to use common tools, most of which probably are already on your
system, to isolate and resolve each class of problem.
For this first column, I start with one of the most common problems you will run into
on a Linux system. No, it's not getting printing to work. I'm talking about a sluggish
server that might have high load. Before I explain how to diagnose and fix high load
though, let's take a step back and discuss what load means on a Linux machine and
how to know when it's high.
Uptime and Load
When administrators mention high load, generally they are talking about the load
average. When I diagnose why a server is slow, the first command I run when I log in
to the system is uptime:
$ uptime
18:30:35 up 365 days, 5:29, 2 users, load average: 1.37, 10.15, 8.10
As you can see, it's my server's uptime birthday today. You also can see that my load
average is 1.37, 10.15, 8.10. These numbers represent my average system load
during the last 1, 5 and 15 minutes, respectively. Technically speaking, the load
average represents the average number of processes that were running on or waiting
for a CPU during the last 1, 5 or 15 minutes. On Linux, it also counts processes
blocked in uninterruptible sleep, which usually means waiting on disk I/O. For
instance, if I have a current load of 0, the system is completely idle. If I have a
load of 1, on average one process has been using or waiting for the CPU. If I do have a load of 1 and then spawn
another process that normally would tie up a CPU, my load should go to 2. With a
load average, the system will give you a good idea of how consistently busy it has
been over the past 1, 5 and 15 minutes.
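If you want to pull those three numbers out of uptime output in a script, a minimal Python sketch might look like the following. The function name parse_load_averages is my own invention for illustration, and the pattern assumes the English-locale output format shown above (a decimal point rather than a decimal comma).

```python
import re


def parse_load_averages(uptime_line):
    """Return the 1-, 5- and 15-minute load averages from a line of uptime output.

    Hypothetical helper: assumes the "load average: a, b, c" layout shown in
    the sample output, with "." as the decimal separator.
    """
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


# The sample line from the uptime output above.
sample = "18:30:35 up 365 days, 5:29, 2 users, load average: 1.37, 10.15, 8.10"
one, five, fifteen = parse_load_averages(sample)
print("1 min: %.2f, 5 min: %.2f, 15 min: %.2f" % (one, five, fifteen))
```

On the local machine you don't need to parse anything: Python's os.getloadavg() returns the same three values directly on Unix-like systems.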
Another important thing to keep in mind when you look at a load average is that it
isn't normalized according to the number of CPUs on your system. Generally
speaking, a consistent load of 1 means one CPU on the system is tied up. In simplified
terms, this means that a single-CPU system with a load of 1 is roughly as busy as a
four-CPU system with a load of 4. So in my above example, let's assume that I have a
single-CPU system. If I were to log in and see the above load average, I'd probably
assume that the server had pretty high load (8.10) during the last 15 minutes that
spiked around 5 minutes ago (10.15), but recently, at least during the last 1 minute,
the load has dropped significantly. If I saw this, I might even assume that the real
cause of the load has subsided. On the other hand, if the load averages were 20.68,
5.01, 1.03, I would conclude that the high load had likely started in the last 5 minutes
and was getting worse.
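The two ideas above, scaling load by the number of CPUs and reading the three averages as a trend, can be sketched in a few lines of Python. Both function names are hypothetical, and the trend rules are a deliberately crude assumption of mine, not a standard; they only encode the reasoning applied to the two examples in this section.

```python
def load_per_cpu(load, cpu_count):
    """Scale a raw load figure by the number of CPUs (hypothetical helper)."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load / cpu_count


def describe_trend(one, five, fifteen):
    """Roughly classify how load has moved across the three windows.

    Assumed rules: strictly increasing toward the present is "rising"; a
    1-minute figure below both longer windows is "falling"; anything else
    is reported as "steady or mixed".
    """
    if one > five > fifteen:
        return "rising"
    if one < five and one < fifteen:
        return "falling"
    return "steady or mixed"


# A four-CPU system at load 4 is about as busy as a single-CPU system at load 1.
print(load_per_cpu(4.0, 4), load_per_cpu(1.0, 1))

# The two sets of load averages discussed above.
print(describe_trend(1.37, 10.15, 8.10))
print(describe_trend(20.68, 5.01, 1.03))
```

In a real script, os.cpu_count() is one way to supply the CPU count, although it can return None on platforms where the count can't be determined.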
How High Is High?
After you understand what load average means, the next logical question is “What
load average is good and what is bad?” The answer to that is “It depends.” You see, a
lot of different things can cause load to be high, each of which affects performance
differently. One server might have a load of 50 and still be pretty responsive, while
another server might have a load of 10 and take forever to log in to. I've had servers
with load averages in the hundreds that were certainly slow, but didn't crash, and I
had one server that consistently had a load of 50 that was still pretty responsive and
stayed up for years.
What really matters when you troubleshoot a system with high load is why the load
is high. When you start to diagnose high load, you find that most load seems to fall
into three categories: CPU-bound load, load caused by out-of-memory issues and
I/O-bound load. I explain each of these categories in detail below and how to use
tools like top and iostat to isolate the root cause.
top
If the first tool I use when I log in to a sluggish system is uptime, the second tool I
use is top. The great thing about top is that it's available for all major Linux systems,
and it provides a lot of useful information in a single screen. top is a quite complex
tool with many options that could warrant its own article. For this column, I stick to
how to interpret its output to diagnose high load.
To use top, simply type top on the command line. By default, top will run in
interactive mode and update its output every few seconds. Listing 1 shows sample top
output from a terminal.
______________________
Kyle Rankin is a director of engineering operations in the San Francisco Bay Area,
the author of a number of books including DevOps Troubleshooting and The Official
Ubuntu Server Book, and is a columnist for Linux Journal.
Comments
Load averages
Submitted by Chris Sardius (http://linuxblock.com) (not verified) on Fri, 10/29/2010 01:51.
I manage some quad-core Linux systems and basically get a spike in load averages every now
and then. I'm looking forward to deploying a permanent fix for this. I
uptime
Submitted by slashbob on Thu, 06/10/2010 17:21.
Well, my Ubuntu 10.04 uptime only lasted about 35 days and then
there was a kernel security update so I had to reboot.
From what I've seen Debian/Ubuntu and CentOS have kernel
security updates about every three or four months.
Maybe it's time for FreeBSD...
in reply to uptime question
Submitted by ayam666@hotmail.com on Wed, 06/09/2010 00:53.
"...what distro are you using to get 365 days of uptime..."
Just to comment on the question, I used to manage a few FreeBSD
boxes (as internet gateway, mail server, DNS server and Samba
server) for a company that virtually had no budget for their IT department.
We did not have brand-name hardware. One of the servers was even built using parts I scavenged from old
boxes.
Most of them were very stable with uptime of more than 365 days.
The only time we had to switch off the server was because of a planned electricity upgrade by the electricity
department, with a possibility of an outage of more than 5 hours just after midnight. Other than that, there was one
case of a bad sector on one of the mirrored hard disks.
In my opinion, it might not be Linux or the OS that causes low uptime. Rather, it's the applications that we
run on it that bring down the system most of the time.
Like Kyle, I only upgraded the OS for security reasons.
Regards,
Yance
iotop on centos
Submitted by Anonymous on Thu, 06/03/2010 12:27.
# rpm -i iotop-0.4-1.noarch.rpm
error: Failed dependencies:
python(abi) = 2.6 is needed by iotop-0.4-1.noarch
Setting up Install Process
Package python-2.4.3-27.el5.i386 already installed and latest version
1. Where can I get an iotop compatible with Python 2.4?
2. If not, how can I upgrade my python on Centos 5?
Bob, maybe it's not the distro
Submitted by Anonymous on Sun, 05/16/2010 09:14.
Bob, maybe it's not the distro itself that's causing your problem.
uptime
Submitted by slashbob on Sat, 05/15/2010 10:53.
Jerry,
Thanks for the info. I guess Debian stable has, from my two years
of experience, had kernel security updates at least every 3 months.
I was waiting to see which would come out first, Ubuntu 10.04 or CentOS 5.5, and had decided that
whichever was released first is what I would use (for now). Knowing that Ubuntu would be supported for 5 years
and CentOS for roughly the same time frame for 5.5 or 7 years from release.
Anyway, I've installed Ubuntu and it's been running nicely for 14 days now. When I ssh into the server it has
an interesting summary of system information like: System Load, Swap usage, CPU temperature, Users
logged in, Usage of /home, and then it tells me how many packages can be updated and how many are
security updates, which I thought was kind of nice.
During install it asked if I wanted to set up unattended-upgrades for security updates. I know it can be a little
scary, but this is just a home server so I agreed. So every day it checks for security updates and if there are
any it installs them and sends me an email of what was done (Thanks must go to Kyle for his great tutorials
on setting up a mail server with postfix. Thanks Kyle!).
We'll see what kinds of uptimes I get now.
Uptime...
Submitted by Jerry McBride on Thu, 05/13/2010 10:40.
Say Bob,
I've been running Gentoo these last few years and the only thing that shortens
the uptime on my servers is when a kernel comes out with new features or
security updates. That aside, I can/could break the 365 day uptime with ease
on these boxes.
The only real time I had problems getting a box to run for any length of time was when I finally figured out
that it had some badly manufactured memory in it. Swapped the junk out for some name brand sticks and it
ran as I expected it to.
With Linux, for any problems with operation, I would suspect hardware issues before digging around the OS.
Just my two cents...
Jerry
Jerry McBride
I use a mix of distributions
Submitted by Kyle Rankin on Thu, 04/29/2010 09:48.
I use a mix of distributions and have gotten 1-2 year uptimes on most of
them. In production I'm also resistant to update something just so the version
number is higher, so as long as I have reasonably stable applications and
reliable power, it's not too difficult to maintain a high uptime. The main
enemy of it these days is kernel upgrades, which again, I resist unless there is a good reason (security) to
do so.
uptime
Submitted by slashbob on Sat, 02/27/2010 17:21.
So, the age old question ...
Kyle, what distro are you using to get 365 days of uptime?
I've been running Debian for a couple of years and never seen anything greater than 77 days. Thinking of
switching my server to Slackware or something else.
/bob