High Performance Computing For Dummies
Sun and AMD Special Edition

Making Everything Easier!

Learn to:
Pick out hardware and software
Find the best vendor to work with
Get your people up to speed on HPC
High Performance Computing For Dummies, Sun and AMD Special Edition
Published by
Wiley Publishing, Inc.
111 River Street
Hoboken, NJ 07030-5774
Copyright 2009 by Wiley Publishing, Inc., Indianapolis, Indiana
Published by Wiley Publishing, Inc., Indianapolis, Indiana
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise,
except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the
prior written permission of the Publisher. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, the Wiley Publishing logo, For Dummies, the Dummies Man logo, A Reference
for the Rest of Us!, The Dummies Way, Dummies.com, Making Everything Easier, and related trade
dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the
United States and other countries, and may not be used without written permission. AMD, the AMD
Arrow logo, AMD Opteron, AMD Virtualization, AMD-V, and combinations thereof are registered
trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the
HyperTransport Technology Consortium. Sun, the Sun logo, Solaris, StorageTek, Sun Fire, Sun xVM
Ops Center, and Sun Ray are trademarks or registered trademarks of Sun Microsystems, Inc. in the
United States and other countries. All other trademarks are the property of their respective owners.
Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE
NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES,
INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE.
NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS.
THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT
ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL
PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE
FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS
REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER
INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE
INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT
MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN
THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our Customer Care
Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For
details on how to create a custom For Dummies book for your business or organization, contact
bizdev@wiley.com. For information about licensing the For Dummies brand for products or services, contact BrandedRights&Licenses@Wiley.com.
ISBN: 978-0-470-49008-2
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
Publisher's Acknowledgments
Project Editor: Jennifer Bingham
Editorial Manager: Rev Mengle
Sun Contributors: Tony Warner, Patrice Brancato, Allison Michlig, Frances Sun
AMD Contributors: Jeff Jones, Jeff Underhill
Table of Contents
Introduction
About This Book
Icons Used in This Book
Introduction

Chapter 1
HPC: It's Not Just for Rocket Scientists Any More
The Top500

No, the Top500 isn't a car race. It is a list of the world's fastest computers. Of course, some background is needed. To get on the Top500 list, you must run a single benchmark program on your HPC system and submit the result to the Top500 organization. The list is created twice a year and includes some rather large systems. Not all Top500 systems are clusters, but many of them are built from the same technology. Of course, as with all lists, there are some limitations. First, the list is based on a single benchmark (HPL, or High Performance Linpack). Results for other benchmarks may shuffle the standings, but systems are likely to remain in roughly the same place in the list. Second, the list only includes those systems that were submitted for consideration.
Table 1-1

System Size                              Market Share
Supercomputer (over $500K)               $2.7B
Technical Divisional ($250K-$500K)       $1.6B
Technical Departmental ($100K-$250K)     $3.4B
Technical Workgroup (under $100K)        $2.4B
Chapter 2
Getting to HPC
In This Chapter
Examining the commodity cluster
Looking at what's available
Getting to know the cluster
Drilling down into some details
The head node has one or more networks with which it communicates with the worker (or compute) nodes. These networks are private and are only accessible inside the cluster.

All clusters have worker nodes that do the bulk of the computing. These nodes are almost always identical throughout the cluster (although there is no strict requirement that all nodes be the same). Indeed, in some cases it is advisable to configure your cluster with different compute nodes. This all depends on what applications you are trying to run. For example, some applications require more local memory than others, while some require a specific processor architecture in order to run at peak performance.
In terms of networks, a cluster may have as few as one private network, which is usually Gigabit Ethernet (GigE), although InfiniBand is becoming a popular option. Almost all servers have at least two GigE connections on the motherboard, and therefore all clusters have at least one private GigE network. A cluster will often generate three types of traffic:

Computation traffic between compute nodes.

File system traffic, often from an NFS (Network File System) server (but not always; direct-attached storage has its advantages).

Administration traffic that provides node monitoring and job control across the cluster.

Depending on your applications, compute and/or file system traffic may dominate the cluster network and cause the nodes to become idle. For this reason, additional networks are added to increase overall cluster network throughput. In many cases, a high-speed interconnect is used for the second network. Currently the two most popular choices for this interconnect are 10 Gigabit Ethernet (10GigE) and InfiniBand. A lower-cost solution is to use a second GigE network; however, this does not offer anywhere near the network performance of 10GigE or InfiniBand.
Figure 2-1 shows a typical cluster configuration. There's a head node that may contain a large amount of storage that is shared via the network with all the worker nodes (using some form of Network File System). A varying number of worker nodes communicate with the head node over the private cluster network.
Figure 2-1: A typical cluster configuration. The head node shares its local disk via NFS with the worker nodes.
Image courtesy of the Texas Advanced Computing Center and Advanced Micro Devices.
Figure 2-3: Typical 1U cluster node, the Sun Fire X2200 M2 server (1U indicates that the server is 1.75 inches high).
Figure 2-4: The Sun Blade 6048 chassis holds up to 48 blade server modules (1,152 cores), delivering up to 12 TFLOPS in a single rack.
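As a rough sanity check on that 12 TFLOPS figure (the clock speed used here is an assumption for illustration, not a number taken from this book), a quad-core AMD Opteron of that generation can retire four floating point operations per core per clock cycle, so at an assumed 2.6 GHz:

1{,}152 \text{ cores} \times 4 \ \text{FLOPs/cycle} \times 2.6 \times 10^{9} \ \text{cycles/second} \approx 12 \times 10^{12} \ \text{FLOPS} = 12 \ \text{TFLOPS}

That is a theoretical peak; real applications typically sustain only a fraction of it.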
When the resources become available, the job gets executed and the results are returned to the user. Users who need to run many similar jobs with different parameters or data sets find clusters ideal for this kind of work. They can submit hundreds of jobs and allow the cluster to manage the workflow. Depending on the resources, all the jobs may run at the same time or some may wait in the queue while other jobs finish. This type of computing is local to a cluster node, which means the node doesn't communicate with other nodes, but it may need high-speed file system access. An example of this usage mode is given in Figure 2-5. Each job can be from a different user or from the same user. The jobs can run concurrently, thus increasing computing throughput. Even though more jobs can be run in the same time period, no individual job runs any faster than normal. Because each job is independent, there is no communication between the jobs.
Figure 2-5: A cluster used to simultaneously run many independent single-core jobs (each small square box is a cluster node).
Figure 2-6: A cluster used to run a single parallel program (each small square box is a cluster node).
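A program of the kind shown in Figure 2-6 is usually written with a message passing library such as MPI (the Message Passing Interface). What follows is only a minimal sketch, assuming an MPI implementation such as Open MPI or MPICH is installed; the build and run commands in the comments are typical, but the details vary from site to site.

/* A minimal MPI sketch: one program running as many copies across the cluster.
   Build (typically):  mpicc hello.c -o hello
   Run   (typically):  mpirun -np 8 ./hello                                     */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* start the MPI runtime           */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which copy of the program am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many copies are there?      */
    MPI_Get_processor_name(host, &len);     /* which node am I running on?     */

    printf("Process %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();                         /* shut the MPI runtime down       */
    return 0;
}

Each copy of the program gets its own rank, and real applications use that rank to decide which piece of the problem to work on and which messages to exchange with the other copies.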
Amdahl's law

In the early days of parallel computing, Gene Amdahl threw a bucket of cold water on those who got overly excited about piling on processors. Amdahl showed that you can't keep throwing more and more cores at a problem and expect it to keep speeding up. His law works as follows: the parallel speedup of any program is limited by the time needed for the sequential portions of the program to execute.
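Written as a formula (this is the standard statement of Amdahl's law, not something specific to this book), if a fraction p of a program's work can be parallelized and it runs on N cores, the best possible speedup is

S(N) = \frac{1}{(1 - p) + p/N}

so no matter how many cores you add, the speedup can never exceed 1/(1 - p). A program that is 90 percent parallel, for example, tops out at a speedup of 10.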
Chapter 3

Crunching numbers: Processors and nodes

The processor is the workhorse of the cluster, and keeping the workhorse busy is the key to good performance. Parallel programs are often distributed across many nodes of the cluster (for more info on this, see Chapter 2). However, multi-core has changed this situation a bit. Cluster nodes may now have 8 or even 16 cores per node (for example, the Sun Fire X4440 server with four Quad-Core AMD Opteron processors).
Operating systems
The predominant operating system for HPC can be summed
up in one word: Linux. Prior to the advent of Linux, the HPC
or supercomputing market used UNIX exclusively. Linux
represents a plug-and-play alternative and doesn't add any
licensing fees for the compute nodes (which can be quite
large in number). In addition to the Linux kernel, much of the
important supporting software has been developed as part of
the GNU project.
The GNU/Linux core software is open-source (see Chapter 5)
and can be freely copied and used by anyone. There are,
however, requirements to ensure source code is shared. The
openness and shareability of GNU/Linux have made it an
ideal HPC operating system. It has allowed HPC developers
to create applications, build drivers, and make changes that
would normally not be possible with closed source.
Virtually all Linux installations are done with a commercial
or freely available software distribution package. While the
commercial availability of free software may seem puzzling,
most commercial open source vendors use a support-based
model. You're free to look at and alter the source code, but if
you need support, you have to open your wallet.
Users may recognize some of the commercial GNU/Linux
distributions such as Red Hat, SUSE, and others. There are
community versions (no support options) available as well.
Red Hat Fedora, openSUSE, and CentOS are examples of this
approach. It should be noted that although these distributions are highly polished in their own right, they don't contain
all the software needed to support an HPC cluster. Cluster
distributions are available that fill this gap.
No free lunch

In order to use the combined power of an HPC cluster, your software needs to be made parallel. (By the way, the same goes for multi-core as well.) A typical computer program is written to run on a single CPU or core. It will not automatically use extra cores or nodes in the cluster, because there is no free lunch in HPC. To run in parallel, you must change the internal workings of the program. There are several ways to accomplish this task. If you want to run your program on multiple cores, then using pthreads or OpenMP is a good solution. If, on the other hand, you want to run your program across multiple nodes of the cluster, then a message passing library such as MPI is the usual solution. A sketch of the OpenMP route follows this sidebar.
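Here is what the multi-core (shared memory) route can look like. This is only a minimal sketch, assuming a compiler with OpenMP support (for example, gcc with its -fopenmp flag); the loop and the array are invented for the example.

/* A minimal OpenMP sketch: one node, many cores.
   Build (typically):  gcc -fopenmp sum.c -o sum   */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;
    int i;

    /* The cores split the loop iterations among themselves; the
       reduction clause safely combines the per-core partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}

The single pragma line is all that tells the compiler to spread the loop across the available cores; remove it and the same program runs correctly, just on one core.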
File systems

Almost all clusters use the standard NFS file system to share information across nodes. This is a good solution; however, NFS wasn't designed for parallel file access (for instance, multiple processes reading and writing to the same file). This limitation has become a bottleneck for HPC systems. For this reason, parallel file systems were developed.

One of the areas where the open GNU/Linux approach has served the HPC community well is file systems. There are a multitude of choices, all of which depend on your application demands. HPC file systems are often called parallel file systems because they allow for aggregate (multi-node) input and output. Instead of centralizing all storage on a single device, parallel file systems spread the load across multiple separate storage devices. Parallel file systems often must be designed, or at least configured, to match a particular cluster.
One popular and freely available parallel file system is Lustre
from Sun Microsystems. Lustre is a vetted, high-performance
parallel file system. Other options include PVFS2, which is
designed to work with MPI. Cluster file systems cover a large
area. In addition to massive amounts of input, scratch, and
checkpoint data, most HPC applications produce large amounts
of output data that are later visualized on specialized systems.
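To give a flavor of what parallel I/O looks like from inside an application, here is a minimal MPI-IO sketch (MPI-IO is the I/O portion of the MPI standard). The file name and the data written are invented for the example; every process writes its own block of one shared file rather than funneling everything through a single node.

/* A minimal MPI-IO sketch: many processes writing one shared file.
   Build (typically):  mpicc pario.c -o pario                       */
#include <mpi.h>
#include <stdio.h>

#define COUNT 1024   /* number of doubles written by each process */

int main(int argc, char *argv[])
{
    int rank, i;
    double buf[COUNT];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < COUNT; i++)            /* fill with rank-specific data */
        buf[i] = rank + i * 0.001;

    /* All processes open the same (made-up) file; each writes at its own offset. */
    MPI_File_open(MPI_COMM_WORLD, "scratch.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    offset = (MPI_Offset)rank * COUNT * sizeof(double);
    MPI_File_write_at(fh, offset, buf, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

On a parallel file system such as Lustre or PVFS2, those per-process writes can land on different storage servers at the same time, which is exactly the aggregate bandwidth described above.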
One thing to keep your eye on is pNFS (NFS Version 4.1), which is designed for parallel NFS access. Most of the existing parallel file systems plan to support the pNFS specification, which should bring some standardization to the parallel file system arena. ZFS, a file system designed by Sun, offers some exciting possibilities for HPC because it is the first 128-bit file system with many advanced features (for all intents and purposes, 128 bits means it will never hit any storage size limits). ZFS is also designed to detect and repair silent data corruption.
Chapter 4
Some companies can even help you assess your business needs. Sun Microsystems, for example, has Solutions Centers worldwide to help with this. To find out more, visit www.sun.com/solutioncenters/index.jsp.
Does it scale?

The term scalability is often used with clusters (and parallel computing). It basically means how many processors you can throw at a program before it stops going any faster. Some programs have difficulty using even two processors, while others can use thousands. The difference is scalability. Scalability depends on the nature of the program (the algorithm) and on the underlying hardware it runs on.
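Scalability is usually quantified with two simple ratios (these are the standard definitions, not something specific to this book): the speedup on N processors and the parallel efficiency,

S(N) = \frac{T(1)}{T(N)}, \qquad E(N) = \frac{S(N)}{N}

where T(N) is the run time on N processors. A program is said to scale well as long as its efficiency E(N) stays close to 1 as N grows; once the efficiency starts to collapse, adding more processors is wasted money.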
Fortran, really?

When most people ask about HPC software, they're surprised to learn that most HPC codes are written in Fortran. Although many consider Fortran to be an ancient language, it actually enjoys quite a bit of use within the HPC community. The reason for using one of the oldest computer languages is largely historical. Many HPC programs were originally written in Fortran, and users are very reluctant to replace what works. Indeed, some HPC programs are composed of over one million lines of source code.
What is a Beowulf?

If you have heard the name Beowulf mentioned in relation to HPC clusters, you aren't alone. However, what is surprising to some people is that a Beowulf is not a cluster. It was the name of a project at NASA where researcher Jim Fisher accepted Tom Sterling's offer to create a personal supercomputer.
Understanding Benchmarks

The purpose of running a benchmark is to eliminate faulty assumptions. Choosing a processor based solely on its SPEC benchmark numbers, for example, may say little about how your own application will perform on it.
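The most telling benchmark is your own application (or a representative kernel of it) timed on the hardware you are considering. A minimal wall-clock timing sketch in C is shown below; the kernel being timed is an invented stand-in for your real code, and the build command is typical for a Linux system.

/* A minimal wall-clock timing sketch for benchmarking a kernel.
   Build (typically):  gcc -O2 bench.c -o bench                  */
#include <stdio.h>
#include <time.h>

#define N 50000000L

/* Stand-in for the real computation you care about. */
static double kernel(void)
{
    double sum = 0.0;
    long i;
    for (i = 1; i <= N; i++)
        sum += 1.0 / (double)i;
    return sum;
}

int main(void)
{
    struct timespec t0, t1;
    double result, seconds;

    clock_gettime(CLOCK_MONOTONIC, &t0);   /* start the wall clock */
    result = kernel();
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* stop the wall clock  */

    seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("result = %f, elapsed = %.3f seconds\n", result, seconds);
    return 0;
}

Run the same measurement on each candidate system, with your real application rather than a toy loop, and the faulty assumptions tend to take care of themselves.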
HPC coffee

Creating a good-tasting coffee may be an art, but keeping it fresh is a science. A leading food packager has found that instead of metal cans, plastic containers maintain coffee freshness longer after they're opened. The simple solution was to switch to plastic coffee containers, right? Well, not so fast. Plastic created another problem. Coffee continues to release gasses after packaging. A metal container can easily sustain the pressure, but it doesn't keep the coffee as fresh. The solution was to design a check valve that releases the pressure that builds up in the plastic coffee container. This solution, while fixing the first problem, created a second issue when the coffee is shipped.
Chapter 5

Open source software is used extensively with HPC clusters. In one sense, openness has helped to foster the growth of commodity HPC by lowering the cost of entry. It has always been possible to cobble together some hardware and use freely available cluster software to test the whole cluster thing for little or no cost.
Chapter 6