2 Basic Computer Architecture
The main components in a typical computer system are the processor, memory, input/output devices, and the communication channels that connect them.
The processor is the workhorse of the system; it is the component that executes a program by performing arithmetic and logical operations on data. It is the only component that creates new information by combining or modifying current information. In a typical system there will be only one processor, known as the central processing unit, or CPU. Modern high performance systems, for example vector processors and parallel processors, often have more than one processor. Systems with only one processor are serial processors, or, especially among computational scientists, scalar processors.
Memory is a passive component that simply stores information until it is requested by another part of the system. During normal operations it feeds instructions and data to the processor, and at other times it is the source or destination of data transferred by I/O devices. Information in a memory is accessed by its address. In programming language terms, one can view memory as a one-dimensional array M. A processor's request to the memory might be ``send the instruction at location M[1000]'' or a disk controller's request might be ``store the following block of data in locations M[0] through M[255].''
Input/output (I/O) devices transfer information, without altering it, between the external world and one or more internal components. I/O devices can be secondary memories, for example disks and tapes, or devices used to communicate directly with users, such as video displays, keyboards, and mice.
The communication channels that tie the system together can either be simple links that connect two devices or more complex switches that interconnect several components and allow any two of them to communicate at a given point in time.
When a switch is configured to allow two devices to exchange information, all other devices that rely on the switch are blocked, i.e. they must wait until the switch can be reconfigured.
A common convention used in drawing simple ``stick figures'' of computer systems is the PMS notation [32]. In a PMS diagram each major component is represented by a single letter, e.g. P for processor, M for memory, or S for switch. A subscript on a letter distinguishes different types of components, e.g. one subscript marks primary memory and another marks cache memory. Lines connecting two components represent links, and lines connecting more than two components represent a switch. Although they are primitive and might appear at first glance to be too simple, PMS diagrams convey quite a bit of information and have several advantages, not the least of which is that they are independent of any particular manufacturer's notations.
As an example of a PMS diagram and a relatively simple computer architecture, Figure 1 shows the major components of the original Apple Macintosh personal computer. The first thing one notices is a single communication channel, known as the bus, that connects all the other major components. Since the bus is a switch, only two of these components can communicate at any time. When the switch is configured for an I/O transfer, for example from main memory to the disk (via K), the processor is unable to fetch data or instructions and remains idle. This organization is typical of personal computers and low end workstations; mainframes, supercomputers, and other high performance systems have much richer (and thus more expensive) structures for connecting I/O devices to internal main memory that allow the processor to keep working at full speed during I/O operations.
2.1 Processors
The operation of a processor is characterized by a fetch-decode-execute cycle. In the first phase of the cycle, the processor fetches an instruction from memory. The address of the instruction to fetch is stored in an internal register named the program counter, or PC. As the processor is waiting for the memory to respond with the instruction, it increments the PC. This means the fetch phase of the next cycle will fetch the instruction in the next sequential location in memory (unless the PC is modified by a later phase of the cycle). In the decode phase the processor stores the information returned by the memory in another internal register, known as the instruction register, or IR. The IR now holds a single machine instruction, encoded as a binary number. The processor decodes the value in the IR in order to figure out which operations to perform in the next stage. In the execution stage the processor actually carries out the instruction. This step often requires further memory operations; for example, the instruction may direct the processor to fetch two operands from memory, add them, and store the result in a third location (the addresses of the operands and the result are also encoded as part of the instruction). At the end of this phase the machine starts the cycle over again by entering the fetch phase for the next instruction. Instructions can be classified as one of three major types: arithmetic/logic, data transfer, and control. Arithmetic and logic instructions apply primitive functions of one or two arguments, for example addition, multiplication, or logical AND. In some machines the arguments are fetched from main memory and the result is returned to main memory, but more often the operands are all in registers inside the CPU. Most machines have a set of general purpose registers that can be used for holding such operands. For example the HP-PA processor in Hewlett-Packard workstations has 32 such registers, each of which holds a single number.
The data transfer instructions move data from one location to another, for example between registers, or from main memory to a register, or between two different memory locations. Data transfer instructions are also used to initiate I/O operations. Control instructions modify the order in which instructions are executed. They are used to construct loops, if-then-else constructs, etc. For example, consider the following DO loop in Fortran:
      DO 10 I=1,5
         ...
 10   CONTINUE
To implement the bottom of the loop (at the CONTINUE statement) there might be an arithmetic instruction that adds 1 to I, followed by a control instruction that compares I to 5 and branches to the top of the loop if I is less than or equal to 5. The branch operation is performed by simply setting the PC to the address of the instruction at the top of the loop.
The timing of the fetch, decode, and execute phases depends on the internal construction of the processor and the complexity of the instructions it executes. The quantum time unit for measuring operations is known as a clock cycle. The logic that directs operations within a processor is controlled by an external clock, which is simply a circuit that generates a square wave with a fixed period. The number of clock cycles required to carry out an operation determines the amount of time it will take. One cannot simply assume that if a multiplication can be done in $t$ nanoseconds then it will take $nt$ nanoseconds to perform $n$ multiplications, or that if a branch instruction takes $t$ nanoseconds the next instruction will begin execution $t$ nanoseconds following the branch. The actual timings depend on the organization of the memory system and the communication channels that connect the processor to the memory; these are the topics of the next two sections.
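The fetch-decode-execute cycle and the DO loop above can be sketched as a toy interpreter. This is only an illustration under stated assumptions: the opcodes (LOADI, ADD, ADDI, BLE, HALT) are hypothetical and do not belong to any real processor, and the loop body, which the Fortran fragment elides, is assumed here to accumulate a running sum of I.

```python
def run(program, max_steps=1000):
    """Execute a list of instruction tuples; return (registers, instructions executed)."""
    regs = {}
    pc = 0                       # program counter: address of the next instruction
    executed = 0
    while pc < len(program) and executed < max_steps:
        ir = program[pc]         # fetch: copy the instruction into the instruction register
        pc += 1                  # the PC already points at the next sequential instruction
        op, *args = ir           # decode: separate the opcode from its operands
        if op == "HALT":
            break
        if op == "LOADI":        # data transfer: place a constant in a register
            regs[args[0]] = args[1]
        elif op == "ADD":        # arithmetic: dest = src1 + src2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "ADDI":       # arithmetic: dest = src + constant
            regs[args[0]] = regs[args[1]] + args[2]
        elif op == "BLE":        # control: if reg <= limit, overwrite the PC
            if regs[args[0]] <= args[1]:
                pc = args[2]
        else:
            raise ValueError("unknown opcode: " + op)
        executed += 1
    return regs, executed


# Hypothetical encoding of DO 10 I=1,5 with an assumed body SUM = SUM + I.
LOOP = [
    ("LOADI", "I", 1),            # 0: I = 1
    ("LOADI", "SUM", 0),          # 1: SUM = 0
    ("ADD", "SUM", "SUM", "I"),   # 2: top of the loop (assumed body)
    ("ADDI", "I", "I", 1),        # 3: add 1 to I
    ("BLE", "I", 5, 2),           # 4: branch to address 2 while I <= 5
    ("HALT",),                    # 5
]

regs, executed = run(LOOP)
print(regs["SUM"], regs["I"], executed)
```

Counting executed instructions, as this sketch does, says nothing about elapsed time: each instruction may need a different number of clock cycles, which is the point made in the preceding paragraph.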
2.2 Memories
Memories are characterized by their function, capacity, and response times. Operations on memories are called reads and writes, defined from the perspective of a processor or other device that uses a memory: a read transfers information from the memory to the other device, and a write transfers information into the memory. A memory that performs both reads and writes is often just called a RAM, for random access memory. The term ``random access'' means that if location M[x] is accessed at time $t$, there are no restrictions on the address of the item accessed at time $t+1$. Other types of memories commonly used in systems are read-only memory, or ROM, and programmable read-only memory, or PROM (information in a ROM is set when the chips are designed; information in a PROM can be written later, one time only, usually just before the chips
are inserted into the system). For example, the Apple Macintosh, shown in Figure 1, had a PROM called the ``toolbox'' that contained code for commonly used operating system functions. The smallest unit of information is a single bit, which can have one of two values. The capacity of an individual memory chip is often given in terms of bits. For example one might have a memory built from 64Kb (64 kilobit) chips. When discussing the capacity of an entire memory system, however, the preferred unit is a byte, which is commonly accepted to be 8 bits of information. Memory sizes in modern systems range from 4MB (megabytes) in small personal computers up to several billion bytes (gigabytes, or GB) in large high-performance systems. Note the convention that lower case b is the abbreviation for bit and upper case B is the symbol for bytes. The performance of a memory system is defined by two different measures, the access time and the cycle time. Access time, also known as response time or latency, refers to how quickly the memory can respond to a read or write request. Several factors contribute to the access time of a memory system. The main factor is the physical organization of the memory chips used in the system. This time varies from about 80 ns in the chips used in personal computers to 10 ns or less for chips used in caches and buffers (small, fast memories used for temporary storage, described in more detail below). Other factors are harder to measure. They include the overhead involved in selecting the right chips (a complete memory system will have hundreds of individual chips), the time required to forward a request from the processor over the bus to the memory system, and the time spent waiting for the bus to finish a previous transaction before initiating the processor's request. The bottom line is that the response time for a memory system is usually much longer than the access time of the individual chips. 
Memory cycle time refers to the minimum period between two successive requests. For various reasons the time separating two successive requests is not always 0, i.e. a memory with a response time of 80 ns cannot satisfy a request every 80 ns. A simple, if old, example of a memory with a long cycle time relative to its access time is the magnetic core used in early mainframe computers. In order to read the value stored in memory, an electronic pulse was sent along a wire that was threaded through the core. If the core was in a given state, the pulse induced a signal on a second wire. Unfortunately the pulse also erased the information that used to be in memory, i.e. the memory had a destructive read-out. To get around this problem designers built memory systems so that each time something was read a copy was immediately written back. During this write the memory cell was unavailable for further requests, and thus the memory had a cycle time that was roughly twice as long as its access time. Some modern semiconductor memories have destructive reads, and there may be several other reasons why the cycle time for a memory is longer than the access time.
Although processors have the freedom to access items in a RAM in any order, in practice the pattern of references is not random, but in fact exhibits a structure that can be exploited to improve performance. The fact that instructions are stored sequentially in memory (recall that unless there is a branch, PC is incremented by one each time through the fetch-decode-execute cycle) is one source of regularity. What this means is that if a processor requests an instruction from location $x$ at time $t$, there is a high probability that it will request an instruction from location $x+1$ at time $t+1$. References to data also show a similar pattern; for example if a
program updates every element in a vector inside a small loop the data references will be to v[0], v[1], ... This observation that memory references tend to cluster in small groups is known as locality of reference.
Locality of reference can be exploited in the following way. Instead of building the entire memory out of the same material, construct a hierarchy of memories, each with different capacities and access times. At the top of the hierarchy there will be a small memory, perhaps only a few KB, built from the fastest chips. The bottom of the hierarchy will be the largest but slowest memory. The processor will be connected to the top of the hierarchy, i.e. when it fetches an instruction it will send its request to the small, fast memory. If this memory contains the requested item, it will respond, and the request is satisfied. If a memory does not have an item, it forwards the request to the next lower level in the hierarchy. The key idea is that when the lower levels of the hierarchy send a value from location $x$ to the next level up, they also send the contents of $x+1$, $x+2$, etc. If locality of reference holds, there is a high probability there will soon be a request for one of these other items; if there is, that request will be satisfied immediately by the upper level memory.
The following terminology is used when discussing hierarchical memories:
The memory closest to the processor is known as a cache. Some systems have separate caches for instructions and data; such a system is said to have a split cache. An instruction buffer is a special cache for instructions that also performs other functions that make fetching instructions more efficient.
The main memory is known as the primary memory.
The low end of the hierarchy is the secondary memory. It is often implemented by a disk, which may or may not be dedicated to this purpose.
The unit of information transferred between items in the hierarchy is a block. Blocks transferred to and from cache are also known as cache lines, and units transferred between primary and secondary memory are also known as pages.
Eventually the top of the hierarchy will fill up with blocks transferred from the lower levels. A replacement strategy determines which block currently in a higher level will be removed to make room for the new block. Common replacement strategies are random replacement (throw out any current block at random), first-in-first-out (FIFO; replace the block that has been in memory the longest), and least recently used (LRU; replace the block that was last referenced the furthest in the past).
A request that is satisfied is known as a hit, and a request that must be passed to a lower level of the hierarchy is a miss. The percentage of requests that result in hits determines the hit rate. The hit rate depends on the size and organization of the memory and to some extent on the replacement policy. It is not uncommon to have a hit rate near 99% for caches on workstations and mainframes.
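The interplay of blocks, LRU replacement, and locality of reference can be explored with a small simulation. The sketch below assumes a fully associative cache (any block may occupy any slot), which is a simplification not stated in the text; the function name, the cache dimensions, and the two reference patterns are hypothetical choices made only for illustration.

```python
from collections import OrderedDict


def hit_rate(addresses, num_blocks, block_size):
    """Fraction of references that hit in a fully associative LRU cache."""
    cache = OrderedDict()                 # keys are block numbers, least recently used first
    hits = 0
    for addr in addresses:
        block = addr // block_size        # the block (cache line) containing this address
        if block in cache:
            hits += 1
            cache.move_to_end(block)      # mark the block as most recently used
        else:
            if len(cache) >= num_blocks:
                cache.popitem(last=False) # evict the least recently used block
            cache[block] = True           # a miss brings in the whole block
    return hits / len(addresses)


# Sequential references, like v[0], v[1], ... in a vector loop.
sequential = list(range(1024))
# A hypothetical pattern that jumps 64 locations at a time and wraps around.
scattered = [(i * 64) % 1024 for i in range(1024)]

print(hit_rate(sequential, num_blocks=4, block_size=16))
print(hit_rate(scattered, num_blocks=4, block_size=16))
```

With 16 items per block, the sequential pattern misses once per block and then hits on the 15 neighbors that arrived with it. The scattered pattern cycles through more blocks than the cache can hold, so under LRU each block is evicted before it is referenced again.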
The performance of a hierarchical memory is defined by the effective access time, which is a function of the hit rate and the relative access times between successive levels of the hierarchy.
For example, suppose the cache access time is 10ns, main memory access time is 100ns, and the cache hit rate is 98%. Then the average time for the processor to access an item in memory is $0.98 \times 10\,\mathrm{ns} + 0.02 \times 100\,\mathrm{ns} = 11.8\,\mathrm{ns}$.
Over a long period of time the system performs as if it had a single large memory with an 11.8ns access time, thus the term ``effective access time.'' With a 98% hit rate the system performs nearly as well as if the entire memory were constructed from the fast chips used to implement the cache, i.e. the average access time is 11.8ns, even though most of the memory is built using less expensive technology that has an access time of 100ns. Although a memory hierarchy adds to the complexity of a memory system, it does not necessarily add to the latency for any particular request. There are efficient hardware algorithms for the logic that looks up addresses to see if items are present in a memory and to help implement replacement policies, and in most cases these circuits can work in parallel with other circuits so the total time spent in the fetch-decode-execute cycle is not lengthened.
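The effective access time in this example is a weighted average of the two access times, with the hit rate as the weight. A minimal sketch of that calculation follows; the function name is hypothetical, and the model assumes, as the 11.8ns figure implies, that a miss costs only the main memory access time.

```python
def effective_access_time(hit_rate, cache_ns, memory_ns):
    """Weighted average access time for a two-level memory hierarchy."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


# The example from the text: 10ns cache, 100ns main memory, 98% hit rate.
example = effective_access_time(0.98, 10, 100)
print(round(example, 1))

# How sensitive the result is to the hit rate (the rates below are illustrative).
for rate in (0.90, 0.95, 0.98, 0.99):
    print(rate, round(effective_access_time(rate, 10, 100), 1))
```

Under this model, dropping the hit rate from 98% to 90% raises the effective access time from 11.8ns to 19ns, which shows why the hit rate matters so much.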
2.3 Buses
A bus is used to transfer information between several different modules. Small and mid-range computer systems, such as the Macintosh shown in Figure 1, have a single bus connecting all major components. Supercomputers and other high performance machines have more complex interconnections, but many components will have internal buses.
Communication on a bus is broken into discrete transactions. Each transaction has a sender and receiver. In order to initiate a transaction, a module has to gain control of the bus and become (temporarily, at least) the bus master. Often several devices have the ability to become the master; for example, the processor controls transactions that transfer instructions and data between memory and CPU, but a disk controller becomes the bus master to transfer blocks between disk and memory. When two or more devices want to transfer information at the same time, an arbitration protocol is used to decide which will be given control first. A protocol is a set of signals exchanged between devices in order to perform some task, in this case to agree which device will become the bus master.
Once a device has control of the bus, it uses a communication protocol to transfer the information. In an asynchronous (unclocked) protocol the transfer can begin at any time, but there is some overhead involved in notifying potential receivers that information needs to be transferred. In a synchronous protocol transfers are controlled by a global clock and begin only at well-known times.
The performance of a bus is defined by two parameters, the transfer time and the overall bandwidth (sometimes called throughput). Transfer time is similar to latency in memories: it is the amount of time it takes for data to be delivered in a single transaction. For example, the transfer time defines how long a processor will have to wait when it fetches an instruction from memory. Bandwidth, expressed in units of bits per second (bps), measures the capacity of the bus. It is defined to be the product of the number of bits that can be transferred in parallel in any one transaction and the number of transactions that can occur in one second. For example, if the bus has 32 data lines and can complete 1,000,000 transactions per second, it has a bandwidth of 32Mbps.
At first it may seem these two parameters measure the same thing, but there are subtle differences. The transfer time measures the delay until a piece of data arrives. As soon as the data is present it may be used while other signals are passed to complete the communication protocol. Completing the protocol will delay the next transaction, and bandwidth takes this extra delay into account. Another factor that distinguishes the two is that in many high performance systems a block of information can be transferred in one transaction; in other words, the communication protocol may say ``send $n$ items starting at location $x$.'' There will be some initial overhead in setting up the transaction, so there will be a delay in receiving the first piece of data, but after that information will arrive more quickly.
Bandwidth is a very important parameter. It is also used to describe processor performance, when we count the number of instructions that can be executed per unit time, and the performance of networks.
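The bandwidth definition, and the benefit of moving a block in a single transaction, can be checked with a few lines of arithmetic. In the sketch below the function names and the cost model (a fixed setup delay plus a fixed time per item) are hypothetical simplifications, and the 100ns and 20ns figures are invented purely for illustration.

```python
def bandwidth_bps(data_lines, transactions_per_second):
    """Bits moved in parallel per transaction times transactions per second."""
    return data_lines * transactions_per_second


def block_transfer_ns(n_items, setup_ns, per_item_ns):
    """Time for one transaction that delivers n_items (assumed cost model)."""
    return setup_ns + n_items * per_item_ns


# The example from the text: 32 data lines, 1,000,000 transactions per second.
print(bandwidth_bps(32, 1_000_000))          # bits per second

# Illustrative numbers only: 100ns to set up a transaction, 20ns per item.
one_at_a_time = 8 * block_transfer_ns(1, setup_ns=100, per_item_ns=20)
as_one_block = block_transfer_ns(8, setup_ns=100, per_item_ns=20)
print(one_at_a_time, as_one_block)
```

In this model the first item of a block arrives no sooner than it would alone, so the transfer time is unchanged, but the setup cost is paid once instead of eight times, so the sustained rate is higher.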
2.4 I/O
Many computational science applications generate huge amounts of data which must be transferred between main memory and I/O devices such as disk and tape. We will not attempt to characterize file I/O in this chapter since the devices and their connections to the rest of the system tend to be idiosyncratic. If your application needs to read or write large data files you will need to learn how your system organizes and transfers files and tune your application to fit that system. It is worth reiterating, though, that performance is measured in terms of bandwidth: what counts is the volume of data per unit of time that can be moved into and out of main memory. The rest of this section contains a brief discussion of video displays. These output devices and their capabilities also vary from system to system, but since scientific visualization is such a prominent part of this book we should introduce some concepts and terminology for readers who are not familiar with video displays. Most users who generate high quality images will do so on workstations configured with extra hardware for creating and manipulating images. Almost every workstation manufacturer includes in its product line versions of their basic systems that are augmented with extra processors that are dedicated to drawing images. These extra processors work in parallel with the main processor in the workstation. In most cases data generated on a supercomputer is saved in a file and later viewed on a video console attached to a graphics workstation. However there are situations that make use of high bandwidth connections from supercomputers directly to video displays; these are useful when the computer is generating complex data that should be viewed in ``real time.'' For
example, a demonstration program from Thinking Machines, Inc. allows a user to move a mouse over the image of a fluid moving through a pipe. When the user pushes the mouse button, the position of the mouse is sent to a parallel processor which simulates the path of particles in a turbulent flow at this position. The results of the calculations are sent directly to the video display, which shows the new positions of the particles in real time. The net effect is as if the user is holding a container of fluid that is being poured into the pipe.
There are many different techniques for drawing images with a computer, but the dominant technology is based on a raster scan. A beam of electrons is directed at a screen that contains a quick-fading phosphor. The beam can be turned on and off very quickly, and it can be bent in two dimensions via magnetic fields. The beam is swept from left to right (from the user's point of view) across the screen. When the beam is on, a small white dot will appear on the screen where the beam is aimed, but when it is off the screen will remain dark. To paint an image on the entire screen, the beam is swept across the top row; when it reaches the right edge, it is turned off, moved back to the left and down one row, and then swept across to the right again. When it reaches the lower right corner, the process repeats again in the upper left corner.
The number of times per second the full screen is painted determines the refresh rate. If the rate is too low, the image will flicker, since the bright spots on the phosphor will fade before the gun comes back to that spot on the next pass. Refresh rates vary from 30 times per second up to 60 times per second.
The individual locations on a screen that can be either painted or not are known as pixels (from ``picture element''). The resolution of the image is the number of pixels per inch.
A high resolution display will have enough pixels in a given area that from a reasonable distance (an arm's length away) the gaps between pixels are not visible and a sequence of pixels that are all on will appear to be a continuous line. A common screen size is 1280 pixels across and 1024 pixels high on a 16'' or 19'' monitor. The controller for the electron gun decides whether a pixel will be black or white by reading information from a memory that has one bit per pixel. If the bit is a 1, the pixel will be painted, otherwise it will remain dark. From the PMS diagram in Figure 1 you can see that the display memory on the Macintosh was part of the main memory. The operating system set aside a portion of the main memory for displays, and all an application had to do to paint something on the screen was to write a bit pattern into this portion of memory. This was an economical choice for the time (early 1980s), but it came at the cost of performance: the processor and video console had to alternate accesses to memory. During periods when the electron gun was being moved back to the upper left hand corner, the display did not access memory, and the processor was able to run at full speed. Once the gun was positioned and ready for the next scan line, however, the processor and display went back to alternating memory cycles. With the fall in memory prices and the rising demand for higher performance, modern systems use a dedicated memory known as a frame buffer for holding bit patterns that control the displays. On inexpensive systems the main processor will compute the patterns and transfer them to the frame buffer. On high performance systems, though, the main processor sends information to the ``graphics engine'', a dedicated processor that performs the computations. For example, if the user
wants to draw a rectangle, the CPU can send the coordinates to the graphics processor, and the latter will figure out which pixels lie within the rectangle and turn on the corresponding bits in the frame buffer. Sophisticated graphics processors do all the work required in complex shading, texturing, overlapping of objects (deciding what is visible and what is not), and other operations required in 3D images.
The discussion so far has dealt only with black and white images. Color displays are based on the same principles: a raster scan illuminates regions on a phosphor, with the information that controls the display coming from a frame buffer. However, instead of one gun there are three, one for each primary color. When combining light, the primary colors are red, green, and blue, which is why these displays are known as RGB monitors. Since we need to specify whether or not each gun should be on for each pixel, the frame buffer will have at least three bits per pixel. To have a wide variety of colors, though, it is not enough just to turn a gun on or off; we need to control its intensity. For example, a violet color can be formed by painting a pixel with the red gun at 61% of full intensity, green at 24%, and blue at 80%. Typically a system will divide the range of intensities into 256 discrete values, which means the intensity can be represented by an 8-bit number. 8 bits times 3 guns means 24 bits are required for each pixel.
Recall that high resolution displays have 1024 rows of 1280 pixels each, for a total of 1.3 million pixels. Dedicating 24 bits to each pixel would require almost 32Mb, or about 4MB, of RAM for the frame buffer alone. What is done instead is to create a color map with a fixed number of entries, typically 256. Each entry in the color map is a full 24 bits wide.
Each pixel only needs to identify a location in the map that contains its color, and since a color map of 256 entries requires only 8 bits per pixel to specify one of the entries there is a savings of 16 bits per pixel. The drawback is that only 256 different colors can be displayed in any one image, but this is enough for all applications except those that need to create highly realistic images.
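The frame buffer sizes discussed above, and the indirection through a color map, can be worked out directly. The following sketch uses hypothetical names throughout; it scales the percentage intensities of the violet example to the 0 to 255 range and stores them in an arbitrarily chosen map entry.

```python
def frame_buffer_bytes(width, height, bits_per_pixel):
    """Memory needed to hold one full screen image."""
    return width * height * bits_per_pixel // 8


full_color = frame_buffer_bytes(1280, 1024, 24)   # 24 bits stored in every pixel
indexed = frame_buffer_bytes(1280, 1024, 8)       # 8-bit index into a color map
color_map_bytes = 256 * 3                         # 256 entries of 24 bits each
print(full_color, indexed + color_map_bytes)

# A 256-entry color map; each entry is a (red, green, blue) triple of 8-bit values.
color_map = [(0, 0, 0)] * 256
# The violet from the text: 61% red, 24% green, 80% blue, scaled to 0..255.
color_map[1] = (round(0.61 * 255), round(0.24 * 255), round(0.80 * 255))

# A tiny 2 x 4 indexed image: each pixel holds only a position in the map.
pixels = [[0, 1, 1, 0],
          [0, 0, 1, 0]]
rgb = color_map[pixels[0][1]]    # the display controller's lookup for one pixel
print(rgb)
```

The indexed frame buffer plus its color map needs roughly a third of the memory of the full-color buffer, at the price of limiting an image to 256 distinct colors.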
The user's view of a computer system is of a complex set of services that are provided by a combination of hardware (the architecture and its organization) and software (the operating system). Attributes of the operating system also affect the performance of user programs. Operating systems for all but the simplest personal computers are multi-tasking operating systems. This means the computer will be running several jobs at once. A program is a static description of an algorithm. To run a program, the system will decide how much memory it needs and then start a process for this program; a process (also known as a task) can be viewed as a dynamic copy of a program. For example, the C compiler is a program. Several different users can be compiling their code at the same time; there will be a separate process in the system for each of these invocations of the compiler. Processes in a multi-tasking operating system will be in one of three states. A process is active if the CPU is executing the corresponding program. In a single processor system there will be only
one active process at any time. A process is idle if it is waiting to run. In order to allocate time on the CPU fairly to all processes, the operating system will let a process run for a short time (known as a time slice; typically around 20ms) and then interrupt it, change its status to idle, and install one of the other idle tasks as the new active process. The previous task goes to the end of a process queue to wait for another time slice. The third state for a process is blocked. A blocked process is one that is waiting for some external event. For example, if a process needs a piece of data from a file, it will call the operating system routine that retrieves the information and then voluntarily give up the remainder of its time slice. When the data is ready, the system changes the process's state from blocked to idle, and it will be resumed again when its turn comes.
The predominant operating system for workstations is Unix, developed in the 1970s at Bell Labs and made popular in the 1980s by the University of California at Berkeley. Even though there may be just one user, and that user is executing only one program (e.g. a text editor), there will be dozens of tasks running. Many Unix services are provided by small systems programs known as daemons that are dedicated to one special purpose. There are daemons for sending and receiving mail, using the network to find files on other systems, and several other jobs.
The fact that there may be several processes running in a system at the same time as your computational science application has ramifications for performance. One is that it makes it slightly more difficult to measure performance. You cannot simply start a program, look at your watch, and then look again when the program stops to measure the time spent. This measure is known as real time or ``wall-clock time,'' and it depends as much on the number of other processes in the system as it does on the performance of your program.
Your program will take longer to run on a heavily-loaded system since it will be competing for CPU cycles with those other jobs. To get an accurate assessment of how much time is required to run your program you need to measure CPU time. Unix and other operating systems have system routines that can be called from an application to find out how much CPU time has been allocated to the process since it was started. Another impact of having several other jobs in the process queue is that as they are executed they work themselves into the cache, displacing your program and data. During your application's time slice its code and data will fill up the cache. But when the time slice is over and a daemon or other user's program runs, its code and data will soon replace yours, so that when yours resumes it will have a higher miss rate until it reloads the code and data it was working on when it was interrupted. This period during which your information is being moved back into the cache is known as a reload transient. The longer the interval between time slices and the more processes that run during this interval the longer the reload transient. Supercomputers and parallel processors also use variants of Unix for their runtime environments. You will have to investigate whether or not daemons run on the main processor or a ``front end'' processor and how the operating system allocates resources. As an example of the range of alternatives, on an Intel Paragon XPS with 56 processors some processors will be dedicated to system tasks (e.g. file transfers) and the remainder will be split among users so that applications do not have to share any one processor. The MasPar 1104 consists of a front-end (a DEC workstation) that handles the system tasks and 4096 processors for user applications. Each processor has its own
64KB RAM. More than one user process can run at any one time, but instead of allocating a different set of processors to each job the operating system divides up the memory. The memory is split into equal size partitions, for example 8KB, and when a job starts the system figures out how many partitions it needs. All 4096 processors execute that job, and when the time slice is over they all start working on another job in a different set of partitions.
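The difference between wall-clock time and CPU time discussed earlier in this section can be observed from inside a program. The following is a minimal sketch in Python, assuming the standard library routines `time.perf_counter` (elapsed real time) and `time.process_time` (CPU time charged to the current process). The helper names `measure`, `busy_sum`, and `wait_for_event` are hypothetical and chosen only for this illustration.

```python
import time

def measure(func, *args):
    """Run func(*args) once; return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real ("wall-clock") time
    cpu_start = time.process_time()    # CPU time charged to this process
    result = func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu

def busy_sum(n):
    """CPU-bound work: the process is active for the whole call."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def wait_for_event(seconds):
    """Stand-in for a blocked process waiting on an external event."""
    time.sleep(seconds)
    return seconds

if __name__ == "__main__":
    for label, func, arg in [("compute", busy_sum, 200_000),
                             ("blocked", wait_for_event, 0.1)]:
        _, wall, cpu = measure(func, arg)
        print(f"{label}: wall {wall:.4f} s, cpu {cpu:.4f} s")
```

For the compute-bound call the two times should be close on a lightly loaded system, while for the blocked call the wall-clock time includes the wait but the CPU time does not.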
Another important interaction between user programs and computer architecture is in the representation of numbers. This interaction does not affect performance as much as it does portability. Users must be extremely careful when moving programs and/or data files from one system to another because numbers and other data are not always represented the same way. Recently programming languages have begun to allow users to have more control over how numbers are represented and to write code that does not depend so heavily on data representations that it fails when executed on the ``wrong'' system. The binary number system is the starting point for representing information. All items in a computer's memory - numbers, characters, instructions, etc. - are represented by strings of 1's and 0's. These two values designate one of two possible states for the underlying physical memory. It does not matter to us which state corresponds to 1 and which corresponds to 0, or even what medium is used. In an electronic memory, 1 could stand for a positively charged region of semiconductor and 0 for a neutral region, or on a device that can be magnetized a 1 would represent a portion of the surface that has a flux in one direction, while a 0 would indicate a flux in the opposite direction. It is only important that the mapping from the set {1,0} to the two states be consistent and that the states can be detected and modified at will. Systems usually deal with fixed-length strings of binary digits. The smallest unit of memory is a single bit, which holds a single binary digit. The next largest unit is a byte, now universally recognized to be eight bits (early systems used anywhere from six to eight bits per byte). A word is 32 bits long in most workstations and personal computers, and 64 bits in supercomputers. A double word is twice as long as a single word, and operations that use double words are said to be double precision operations. 
Storing a positive integer in a system is trivial: simply write the integer in binary and use the resulting string as the pattern to store in memory. Since numbers are usually stored one per word, the number is padded with leading 0's first. For example, the number 52 is represented in a 16-bit word by the pattern 0000000000110100. The meaning of an $n$-bit string $b_{n-1} \ldots b_1 b_0$ when it is interpreted as a binary number is defined by the formula $\sum_{i=0}^{n-1} b_i 2^i$, i.e. the bit in position $i$ (counting from 0 at the right) has weight $2^i$.
Compiler writers and assembly language programmers often take advantage of the binary number system when implementing arithmetic operations. For example, if the pattern of bits is ``shifted left'' by one, the corresponding number is multiplied by two. A left shift is performed by moving every bit left and inserting 0's on the right side. In an 8-bit system, for example, the pattern 00000110 represents the number 6; if this pattern is shifted left, the resulting pattern is 00001100, which is the representation of the number 12. In general, shifting left by $n$ bits is equivalent to multiplying by $2^n$.
Shifts such as these can be done in one machine cycle, so they are much faster than multiplication instructions, which usually take several cycles. Other ``tricks'' are using a right shift to implement integer division by a power of 2, in which the result is an integer and the remainder is ignored (e.g. $15 \div 4 = 3$), and taking the modulus or remainder with respect to a power of 2 (see problem 8).
A fundamental relationship about binary patterns is that there are $2^n$ distinct $n$-digit strings. For example, for $n = 8$ there are $2^8 = 256$ different strings of 1's and 0's. From this relationship it is easy to see that the largest integer that can be stored in an $n$-bit word is $2^n - 1$: the $2^n$ patterns are used to represent the integers in the interval $[0, 2^n - 1]$.
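The positional interpretation of bit strings and the shift ``tricks'' described above can be checked with a few lines of Python. This is only a sketch: the function names `to_bits` and `from_bits` are hypothetical helpers introduced for illustration, and Python's built-in shift and mask operators stand in for the corresponding machine instructions.

```python
def to_bits(value, width):
    """Return the width-bit pattern for a nonnegative integer, padded with leading 0's."""
    if not 0 <= value < 2 ** width:
        raise OverflowError(f"{value} does not fit in {width} bits")
    return format(value, f"0{width}b")

def from_bits(pattern):
    """Interpret a string of 1's and 0's as a binary number: the sum of b_i * 2**i."""
    n = len(pattern)
    return sum(int(bit) * 2 ** (n - 1 - pos) for pos, bit in enumerate(pattern))

print(to_bits(52, 16))          # 0000000000110100
print(from_bits("00000110"))    # 6
print(6 << 1)                   # left shift by 1 multiplies by 2: 12
print(15 >> 2)                  # right shift by 2 divides by 4, dropping the remainder: 3
print(15 & (4 - 1))             # the low-order 2 bits are the remainder modulo 4: 3
print(2 ** 8, 2 ** 8 - 1)       # 256 patterns; the largest 8-bit integer is 255
```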
An overflow occurs when a system generates a value greater than the largest integer. For example, in a 32-bit system, the largest positive integer is $2^{32} - 1$ = 4,294,967,295. If a program tries to add 3,000,000,000 and 2,000,000,000 it will cause an overflow. Right away we can see one source of problems that can arise when moving a program from one system to another: if the word size is smaller on the new system a program that runs successfully on the original system may crash with an overflow error on the new system.
There are two different techniques for representing negative values. One method is to divide the word into two fields, i.e. represent two different types of information within the word. We can use one field to represent the sign of the number, and the other field to represent the value of the number. Since a number can be just positive or negative, we need only one bit for the sign field. Typically the leftmost bit represents the sign, with the convention that a 1 means the number is negative and a 0 means it is positive. This type of representation is known as a sign-magnitude representation, after the names of the two fields. For example, in a 16-bit sign-magnitude system, the pattern 1000000011111111 represents the number $-255$ and the pattern 0000000000000101 represents $+5$.
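Both the overflow example and the sign-magnitude patterns above can be reproduced in a short sketch. Python integers never overflow, so the fixed word size is imitated here by masking; that masking, and the helper names `add_unsigned` and `sign_magnitude_decode`, are assumptions made for this illustration rather than a description of any particular machine.

```python
WORD = 32
MAX_UNSIGNED = 2 ** WORD - 1          # 4,294,967,295

def add_unsigned(a, b, width=WORD):
    """Add two unsigned integers as a fixed-width machine would.

    Returns (stored_result, overflowed). Only the low-order `width`
    bits of the true sum are kept.
    """
    total = a + b
    mask = 2 ** width - 1
    return total & mask, total > mask

def sign_magnitude_decode(pattern):
    """Decode a sign-magnitude pattern: leftmost bit is the sign, the rest the magnitude."""
    magnitude = int(pattern[1:], 2)
    return -magnitude if pattern[0] == "1" else magnitude

print(MAX_UNSIGNED)                                   # 4294967295
print(add_unsigned(3_000_000_000, 2_000_000_000))     # (705032704, True)
print(sign_magnitude_decode("1000000011111111"))      # -255
print(sign_magnitude_decode("0000000000000101"))      # 5
```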
The other technique for representing both positive and negative integers is known as two's complement. It has two compelling advantages over the sign-magnitude representation, and is now universally used for integers, but as we will see below sign-magnitude is still used to represent real numbers. The two's complement method is based on the fact that binary arithmetic in fixed-length words is actually arithmetic over a finite cyclic group. If we ignore overflows for a moment, observe what happens when we add 1 to the largest possible number in an $n$-bit system (this number is represented by a string of $n$ 1's):

$$\underbrace{11\ldots1}_{n} + 1 = 1\underbrace{00\ldots0}_{n}$$

The result is a pattern with a leading 1 and $n$ 0's. In an $n$-bit system only the low order $n$ bits of each result are saved, so this sum is functionally equivalent to 0. Operations that lead to sums with very large values ``wrap around'' to 0, i.e. the system is a finite cyclic group. Operations in this group are defined by arithmetic modulo $2^n$.
For our purposes, what is interesting about this type of arithmetic is that $2^n$, which is represented by a 1 followed by $n$ 0's, is equivalent to 0, which means $2^n - x \equiv -x$ for all $x$ between 0 and $2^n$. A simple ``trick'' that has its roots in this fact can be applied to the bit pattern of a number in order to calculate its additive inverse: if we invert every bit (turn a 1 into a 0 and vice versa) in the representation of a number $x$ and then add 1, we come up with the representation of $-x$. For example, the representation of 5 in an 8-bit system is 00000101. Inverting every bit and adding 1 to the result gives the pattern 11111011. This is also the representation of 251, but in arithmetic modulo $2^8$ we have $251 \equiv -5$, so this pattern is a perfectly acceptable representation of $-5$ (see problem 7).
In practice we divide all $n$-bit patterns into two groups. Patterns that begin with 0 represent the positive integers and zero, $0 \ldots 2^{n-1} - 1$, and patterns beginning with 1 represent the negative integers $-2^{n-1} \ldots -1$. To determine which integer is represented by a pattern that begins with a 1, compute its complement (invert every bit and add 1). For example, in an 8-bit two's complement system the pattern 11100001 represents $-31$, since the complement is $00011110 + 1 = 00011111 = 31$. Note that the leading bit determines the sign, just as in a sign-magnitude system, but one cannot simply look at the remaining bits to ascertain the magnitude of the number. In a sign-magnitude system, the same pattern represents $-97$.
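The invert-and-add-one rule can be tried directly on bit strings. The sketch below is one possible way to write it in Python; the function names `negate` and `twos_complement_decode` are hypothetical, and bit patterns are held as strings purely for readability.

```python
def negate(pattern):
    """Two's complement negation: invert every bit, then add 1 (modulo 2**n)."""
    n = len(pattern)
    inverted = "".join("1" if bit == "0" else "0" for bit in pattern)
    return format((int(inverted, 2) + 1) % 2 ** n, f"0{n}b")

def twos_complement_decode(pattern):
    """Return the signed integer represented by an n-bit two's complement pattern."""
    if pattern[0] == "0":
        return int(pattern, 2)
    return -int(negate(pattern), 2)

print(negate("00000101"))                   # 11111011
print(int("11111011", 2))                   # 251, which is congruent to -5 modulo 256
print(twos_complement_decode("11111011"))   # -5
print(twos_complement_decode("11100001"))   # -31
```

Applying `negate` twice returns the original pattern, which is one way to see that the operation really computes an additive inverse in arithmetic modulo $2^n$.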
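The first step in defining a representation for real numbers is to realize that binary notation can be extended to cover negative powers of two, e.g. the string ``110.101'' is interpreted as $1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} = 4 + 2 + 0.5 + 0.125 = 6.625$.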
Thus a straightforward method for representing real numbers would be to specify some location within a word as the ``binary point'' and give bits to the left of this location weights that are positive powers of two and bits to the right weights that are negative powers of two. For example, in a 16-bit word, we can dedicate the rightmost 5 bits for the fraction part and the leftmost 11 bits for the whole part. In this system, the representation of 6.625 is 0000000011010100 (note there are leading 0's to pad the whole part and trailing 0's to pad the fraction part). This representation, where there is an implied binary point at a fixed location within the word, is known as a fixed point representation.
There is an obvious tradeoff between range and precision in fixed point representations. Using $n$ bits for the fraction part means there will be $2^n$ numbers in the system between any two successive integers. With 5 bit fractions there are 32 numbers in the system between any two integers; e.g. the numbers between 5 and 6 are $5\frac{1}{32}$ (5.03125), $5\frac{2}{32}$ (5.0625), etc. To allow more precision, i.e. smaller divisions between successive numbers, we need more bits in the fraction part. The number of bits in the whole part determines the magnitude of the largest positive number we can represent, just as it does for integers. With 11 digits in the whole part, as in the example above, the largest number we can represent in 16 bits is just under $2^{11} = 2048$. Moving one bit from the whole part to the fraction part in order to increase precision cuts the range in half, and the largest number is now just under $2^{10} = 1024$.
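The 16-bit fixed point layout used in this example (11 bits for the whole part, 5 for the fraction) amounts to storing the value multiplied by $2^5$ as an ordinary integer. The following Python sketch illustrates that idea under the assumption of unsigned values only; `fixed_encode` and `fixed_decode` are hypothetical names, not routines from any library.

```python
from fractions import Fraction

def fixed_encode(value, whole_bits=11, frac_bits=5):
    """Encode a nonnegative number with the binary point fixed before the last frac_bits bits."""
    scaled = Fraction(value) * 2 ** frac_bits
    if scaled.denominator != 1:
        raise ValueError(f"{value} is not representable with {frac_bits} fraction bits")
    if scaled >= 2 ** (whole_bits + frac_bits):
        raise OverflowError(f"{value} needs more than {whole_bits} whole-part bits")
    return format(int(scaled), f"0{whole_bits + frac_bits}b")

def fixed_decode(pattern, frac_bits=5):
    """Recover the value: the stored integer divided by 2**frac_bits."""
    return int(pattern, 2) / 2 ** frac_bits

print(fixed_encode(6.625))                    # 0000000011010100
print(fixed_decode("0000000011010100"))       # 6.625
print(fixed_decode("1" * 16))                 # largest value with an 11/5 split: 2047.96875
print(fixed_decode("1" * 16, frac_bits=6))    # largest value with a 10/6 split: 1023.984375
```

The last two lines show the tradeoff: giving one more bit to the fraction halves the largest representable value.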
To allow for a larger range without sacrificing precision, computer systems use a technique known as floating point. This representation is based on the familiar ``scientific notation'' for expressing both very large and very small numbers in a concise format as the product of a small real number and a power of 10, e.g. $6.022 \times 10^{23}$. This notation has three components: a base (10 in this example); an exponent (in this case 23); and a mantissa (6.022). In computer systems, the base is either 2 or 16. Since it never changes for any given computer system it does not have to be part of the representation, and we need only two fields to specify a value, one for the mantissa and one for the exponent. As an example of how a number is represented in floating point, consider again the number 6.625. In binary, it is $110.101_2 = 1.10101_2 \times 2^2$.
If a 16-bit system has a 10-bit mantissa and 6-bit exponent, the number would be represented by the string 1101010000 000010. The mantissa is stored in the first ten bits (padded on the right with trailing 0's), and the exponent is stored in the last six bits. As the above example illustrates, computers transform the numbers so the mantissa is a manageable number. Just as $6.022 \times 10^{23}$ is preferred to $60.22 \times 10^{22}$ or $0.6022 \times 10^{24}$ in scientific notation, in binary the mantissa should be between $1.0_2$ and $10.0_2$ (i.e. between 1 and 2). When the mantissa is in this range it is said to be normalized. The definition of the normal form varies from system to system, e.g. in some systems a normalized mantissa is between $0.1_2$ and $1.0_2$ (i.e. between 0.5 and 1).
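To make the normalization step concrete, here is a small Python sketch of the 16-bit example format with a 10-bit mantissa and 6-bit exponent. Two details are assumptions made for this illustration, chosen to agree with the 6.625 example: the binary point is taken to follow the first mantissa bit, and the exponent is stored as a plain nonnegative integer (so values below 1 are rejected). The names `toy_float_encode` and `toy_float_decode` are hypothetical.

```python
from fractions import Fraction

def toy_float_encode(value, mantissa_bits=10, exponent_bits=6):
    """Encode a positive number as a normalized mantissa (1 <= m < 2) and an exponent."""
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    m, e = Fraction(value), 0
    while m >= 2:            # shift the binary point left for large values
        m, e = m / 2, e + 1
    while m < 1:             # shift the binary point right for small values
        m, e = m * 2, e - 1
    if not 0 <= e < 2 ** exponent_bits:
        raise OverflowError("exponent does not fit in this toy format")
    scaled = m * 2 ** (mantissa_bits - 1)
    if scaled.denominator != 1:
        raise ValueError("mantissa needs more bits than the format provides")
    return format(int(scaled), f"0{mantissa_bits}b"), format(e, f"0{exponent_bits}b")

def toy_float_decode(mantissa, exponent):
    """Invert toy_float_encode: value = mantissa * 2**exponent."""
    m = Fraction(int(mantissa, 2), 2 ** (len(mantissa) - 1))
    return float(m * 2 ** int(exponent, 2))

print(toy_float_encode(6.625))                      # ('1101010000', '000010')
print(toy_float_decode("1101010000", "000010"))     # 6.625
```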
Since we need to represent both positive and negative real numbers, the complete representation for a real number in a floating point format has three fields: a one-bit sign, a fixed number of bits for the mantissa, and the remainder of the bits for the exponent. Note that the exponent is an integer, and that this integer can be either positive or negative, e.g. we will want to represent very small numbers such as $6.022 \times 10^{-23}$. Any method such as two's complement that can represent both positive and negative integers can be used within the exponent field. The sign bit at the front of the number determines the sign of the entire number, which is independent of the sign of the exponent, e.g. it indicates whether the number is $+6.022 \times 10^{-23}$ or $-6.022 \times 10^{-23}$.
In the past every computer manufacturer used their own floating point representation, which made it a nightmare to move programs and datasets from one system to another. A recent IEEE standard is now being widely adopted and will add stability to this area of computer architecture. For 32-bit systems, the standard calls for a 1-bit sign, 8-bit exponent, and 23-bit mantissa. The largest number that can be represented is about $3.4 \times 10^{38}$, and the smallest positive number (closest to 0.0) is about $1.4 \times 10^{-45}$. Details of the standard are presented in an appendix to this chapter.
Figure 2: Distribution of Floating Point Numbers
Figure 2 illustrates the numbers that can be stored in a typical computer system with a floating point representation. The figure shows three disjoint regions: positive numbers $r \ldots R$, 0.0, and negative numbers $-R \ldots -r$. $R$ is the largest number that can be stored in the system; in the IEEE standard representation $R \approx 3.4 \times 10^{38}$. $r$ is the smallest positive number, which is about $1.4 \times 10^{-45}$ in the IEEE standard.
Programmers need to be aware of several important attributes of the floating point representation that are illustrated by this figure. The first is the magnitude of the range between the most negative number, about $-3.4 \times 10^{38}$, and the largest number, about $3.4 \times 10^{38}$. There are about $6.8 \times 10^{38}$ integers in this range. However there are only $2^{32} \approx 4.3 \times 10^{9}$ different 32-bit patterns. What this means is there are numbers in the range that do not have representations. Whenever a calculation results in one of these numbers, a round-off error will occur when the system approximates the result by the nearest (we hope) representable number. The arithmetic circuitry will produce a binary pattern that is close to the desired result, but not an exact representation. An interesting illustration of just how common these round-off errors are is the fact that $1/10$ does not have a finite representation in binary, but is instead the infinitely repeating pattern $0.000110011001100\ldots_2$.
The next important point is that there is a gap between the smallest positive number and 0.0. A round-off error in a calculation that should produce a small non-zero value but instead results in 0.0 is called an underflow. One of the strengths of the IEEE standard is that it allows a special denormalized form for very small numbers in order to stave off underflows as long as possible. This is why the exponents in the largest and smallest positive numbers are not symmetrical. Without denormalized numbers, the smallest positive number in the IEEE standard would be around $1.2 \times 10^{-38}$.
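The limits of the 32-bit IEEE format, the round-off in $1/10$, and an underflow can all be observed from Python by building single precision values from their bit patterns. This sketch assumes the standard library `struct` module with its `f` (32-bit float) and `I` (32-bit unsigned integer) format codes; the helper names `float32_from_bits` and `to_float32` are hypothetical.

```python
import struct
from fractions import Fraction

def float32_from_bits(bits):
    """Interpret a 32-bit integer as an IEEE single precision number."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def to_float32(x):
    """Round a Python float (double precision) to the nearest single precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

largest = float32_from_bits(0x7F7FFFFF)            # largest finite value
smallest_normal = float32_from_bits(0x00800000)    # smallest normalized value
smallest_denormal = float32_from_bits(0x00000001)  # smallest denormalized value

print(f"{largest:.3e}")            # 3.403e+38
print(f"{smallest_normal:.3e}")    # 1.175e-38
print(f"{smallest_denormal:.3e}")  # 1.401e-45

# 1/10 has no finite binary expansion, so the stored value is only close to it.
stored = to_float32(0.1)
print(Fraction(stored) == Fraction(1, 10))              # False
print(float(abs(Fraction(stored) - Fraction(1, 10))))   # size of the round-off error

# A result smaller than the smallest denormalized number underflows to 0.0.
print(to_float32(1e-46))           # 0.0
```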
Finally, and perhaps most important, is the fact that the numbers that can be represented are not distributed evenly throughout the range. Representable numbers are very dense close to 0.0, but then grow steadily further apart as they increase in magnitude. The dark regions in Figure 2 correspond to parts of the number line where representable numbers are packed close together. It is easy to see why the distribution is not even by asking what two numbers are represented by two successive values of the mantissa for any given exponent. To make the calculations easier, suppose we have a 16-bit system with a 7-bit mantissa and 8-bit exponent. No matter what the exponent is, the distance between any two successive values of the mantissa, e.g. between $1.0000001_2$ and $1.0000010_2$, will be $2^{-7}$. For numbers closest to 0.0, the exponent will be a negative number, e.g. $-100$, and the distance between two successive numbers will be about $2^{-7} \times 2^{-100} = 2^{-107}$; when exponents are large, e.g. $+100$, the distance between two numbers will be approximately $2^{-7} \times 2^{100} = 2^{93}$.
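The uneven spacing can be measured rather than computed by hand. The sketch below uses the 32-bit IEEE format (23-bit mantissa) instead of the 16-bit example system, since that is what the standard `struct` module provides; the helper names `next_float32` and `gap` are hypothetical. The step between neighbors is the mantissa step $2^{-23}$ scaled by the power of two given by the exponent.

```python
import struct

def next_float32(x):
    """Return the next larger single precision number after positive x.

    Adding 1 to the bit pattern steps to the next mantissa value
    (carrying into the exponent field when the mantissa is all 1's).
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits + 1))[0]

def gap(x):
    """Distance from x to the next representable single precision number."""
    return next_float32(x) - x

for exponent in (-100, -10, 0, 10, 100):
    x = 2.0 ** exponent
    print(f"near 2^{exponent}: gap = {gap(x):.3e}")
```

The printed gaps grow by the same factor as the numbers themselves, which is why the representable values crowd together near 0.0 and thin out toward the ends of the range.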
The most widely recognized aspect of a machine's internal organization that relates to performance is the clock cycle time, which controls the rate of internal operations in the CPU (Section 2.1). A
shorter clock cycle time, or equivalently a larger number of cycles per second, implies more operations can be performed per unit time. For a given architecture, it is often possible to rank systems according to their clock rates. For example, the HP 9000/725 and 9000/735 workstations have basically the same architecture, meaning they have the same instruction set and, in general, appear to be the same system as far as compiler writers are concerned. The 725 has a 66MHz clock, while the 735 has a 99MHz clock, and indeed the 735 has a higher performance on most programs.
There are several reasons why simply comparing clock cycle times is an inadequate measure of performance. One reason is that processors don't operate ``in a vacuum'', but rely on memories and buses to supply information. The size and access times of the memories and the bandwidth of the bus all play a major role in performance. It is very easy to imagine a program that requires a large amount of memory running faster on an HP 725 that has a larger cache and more main memory than a 735. We will return to the topic of memory organization and processor-memory interconnection in later sections on vector processors and parallel processors since these two aspects of systems organization are even more crucial for high performance in those systems.
A second reason clock rate by itself is an inadequate measure of performance is that it doesn't take into account what happens during a clock cycle. This is especially true when comparing systems with different instruction sets. It is possible that a machine might have a lower clock rate, but because it requires fewer cycles to execute the same program it would have higher performance. For example, consider two machines, A and B, that are almost identical except that A has a multiply instruction and B does not. A simple loop that multiplies a vector by a scalar (the constant 3 in this example) is shown in the table below.
The number of cycles for each instruction is given in parentheses next to the instruction.
Table 3
The first instruction loads an element of the vector into an internal processor register X. Next, machine A multiplies the vector element by 3, leaving the result in the register. Machine B does the same operation by shifting and adding, i.e. $3x = 2x + x$. B copies the contents of X to another register Y, shifts X left one bit (which multiplies it by 2), and then adds Y, again leaving the result in X. Both machines then store the result back into the vector in memory and branch back to the top of the loop if the vector index is not at the end of the vector (the comparison and branch are done by the dbr instruction). Machine A's clock might be slightly slower than B's, but since A takes fewer cycles it will execute the loop faster. For example if A's clock rate is 9 MHz (.11 $\mu$s per cycle) and B's clock rate is 10 MHz (.10 $\mu$s per cycle), A will execute one pass through the loop in 1.1 $\mu$s but B will require 1.2 $\mu$s.
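The comparison of machines A and B can be redone numerically. In the sketch below the cycle totals per pass (10 for A, 12 for B) are inferred from the times quoted in the text, $1.1 / 0.11$ and $1.2 / 0.10$; they are an assumption, since the per-instruction counts are not repeated here. The function names are hypothetical.

```python
def times_three_shift_add(x):
    """Machine B's approach: 3x = 2x + x, using a copy, a left shift, and an add."""
    y = x          # copy X to Y
    x = x << 1     # shift X left one bit (multiply by 2)
    return x + y   # add Y, leaving 3 times the original value

def loop_time_us(cycles_per_pass, clock_mhz):
    """Time for one pass through the loop, in microseconds."""
    return cycles_per_pass / clock_mhz

time_a = loop_time_us(10, 9)     # inferred 10 cycles at 9 MHz
time_b = loop_time_us(12, 10)    # inferred 12 cycles at 10 MHz

print(times_three_shift_add(7))                    # 21
print(f"A: {time_a:.2f} us, B: {time_b:.2f} us")   # A: 1.11 us, B: 1.20 us
```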
As a historical note, microprocessor and microcomputer designers in the 1970s tended to build systems with instruction sets like those of machine A above. The goal was to include instructions with a large ``semantic content,'' e.g. multiplication is relatively more complex than loading a value from memory or shifting a bit pattern. The payoff was in reducing the overhead to fetch instructions, since fewer instructions could accomplish the same job. By the 1980s, however, it became widely accepted that instruction sets such as those of machine B were in fact a better match for VLSI chip technology. The move toward simpler instructions became known as RISC, for Reduced Instruction Set Computer. A RISC has fewer instructions in its repertoire, but more importantly each instruction is very simple. The fact that operations are so simple and so uniform leads to some very powerful implementation techniques, such as pipelining, and opens up room on the processor chip for items such as on-chip caches or multiple functional units, e.g. a CPU that has two or more arithmetic units. We will discuss these types of systems in more detail later, in the section on superscalar designs (Section 3.5.2). Another benefit to simple instructions is that cycle times can also be much shorter; instead of being only moderately faster, e.g. 10MHz vs. 9MHz as in the example above, cycle times on RISC machines are often much faster, so even though they fetch and execute more instructions they typically outperform complex instruction set (CISC) machines designed at the same time.
In order to compare performance of two machines with different instruction sets, and even different styles of instruction sets (e.g. RISC vs. CISC), we can break the total execution time into constituent parts [11]. The total time to execute any given program is the product of the number of machine cycles required to execute the program and the processor cycle time:

$$\text{execution time} = \text{cycles} \times \frac{\text{seconds}}{\text{cycle}}$$
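The number of cycles executed can be rewritten as the number of instructions executed times the average number of cycles per instruction:

$$\text{execution time} = \text{instructions} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}}$$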
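The middle factor in this expression describes the average number of machine cycles the processor devotes to each instruction. It is the number of cycles per instruction, or CPI. The basic performance model for a single processor computer system is thus

$$T = N \times \mathrm{CPI} \times t_c$$

where $T$ is the total execution time, $N$ is the number of instructions executed, and $t_c$ is the processor cycle time.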
The three factors each describe different attributes of the execution of a program. The number of instructions depends on the algorithm, the compiler, and to some extent the instruction set of the machine. Total execution time can be reduced by lowering the instruction count, either through a
better algorithm (one that executes an inner loop fewer times, for example), a better compiler (one that generates fewer instructions for the body of the loop), or perhaps by changing the instruction set so it requires fewer instructions to encode the same algorithm. As we saw earlier, however, a more compact encoding as a result of a richer instruction set does not always speed up a program since complex instructions require more cycles. The interaction between instruction complexity and the number of cycles to execute a program is very involved, and it is hard to predict ahead of time whether adding a new instruction will really improve performance.
The second factor in the performance model is CPI. At first it would seem this factor is simply a measure of the complexity of the instruction set: simple instructions require fewer cycles, so RISC machines should have lower CPI values. That view is misleading, however, since it concerns a static quantity. The performance equation describes the average number of cycles per instruction measured during the execution of a program. The difference is crucial. Implementation techniques such as pipelining allow a processor to overlap instructions by working on several instructions at one time. These techniques will lower CPI and improve performance since more instructions are executed in any given time period. For example, the average instruction in a system might require three machine cycles: one to fetch it from cache, one to fetch its operands from registers, and one to perform the operation and store the result in a register. Based on this static description one might conclude the CPI is 3.0, since each instruction requires three cycles. However, if the processor can juggle three instructions at once, for example by fetching instruction $i$ while it is locating the operands for instruction $i-1$ and executing instruction $i-2$, then the effective CPI observed during the execution of the program is just a little over 1.0 (Figure 3). Note that this is another illustration of the difference between speed and bandwidth. Overall performance of a system can be improved by increasing bandwidth, in this case by increasing the number of instructions that flow through the processor per unit time, without changing the execution time of the individual instructions.
The third factor in the performance model is the processor cycle time $t_c$. This is usually in the realm of computer engineering: a better layout of the components on the surface of the chip might shorten wire lengths and allow for a faster clock, or a different material (e.g. gallium arsenide vs. silicon based semiconductors) might have a faster switching time. However, the architecture can also affect cycle time. One of the reasons RISC is such a good fit for current VLSI technology is that if the instruction set is small, it requires less logic to implement. Less logic means less space on the chip, and smaller circuits run faster and consume less power [12]. Thus the design of the instruction set, the organization of pipelines, and other attributes of the architecture and its implementation can impact cycle time.
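The performance model and the effect of pipelining on CPI can be put into a few lines of Python. This is a sketch under simplifying assumptions: the pipeline is idealized (one cycle per stage, no stalls), and the 20 ns cycle time and the instruction count are illustrative values, not measurements. The function names `execution_time` and `pipelined_cpi` are hypothetical.

```python
def execution_time(instructions, cpi, cycle_time_ns):
    """Basic performance model: time = instructions x CPI x cycle time (result in seconds)."""
    return instructions * cpi * cycle_time_ns * 1e-9

def pipelined_cpi(instructions, stages=3):
    """Effective CPI for an idealized pipeline with one cycle per stage and no stalls.

    The first instruction needs `stages` cycles; each later instruction
    completes one cycle after its predecessor.
    """
    total_cycles = stages + (instructions - 1)
    return total_cycles / instructions

n = 1_000_000
print(pipelined_cpi(n))                            # 1.000002, just over 1.0
print(execution_time(n, 3.0, 20))                  # no overlap: about 0.06 s
print(execution_time(n, pipelined_cpi(n), 20))     # pipelined: about 0.02 s
```

Each instruction still takes three cycles from start to finish; only the rate at which instructions complete has changed, which is the speed versus bandwidth distinction made above.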
Figure 3: Pipelined execution
We conclude this section with a few remarks on some metrics that are commonly used to describe the performance of computer systems. MIPS stands for ``millions of instructions per second.'' With the variation in instruction styles, internal organization, and number of processors per system
it is almost meaningless for comparing two systems. As a point of reference, the DEC VAX 11/780 executed approximately one million instructions per second. You may see a system described as having performance rated at ``X VAX MIPS.'' This is a measure of performance normalized to VAX 11/780 performance. What this means is someone ran a program on the VAX, then ran the same program on the other system, and the ratio of the execution times is X. The term ``native MIPS'' refers to the number of millions of instructions of the machine's own instruction set that can be executed per second.
MFLOPS (pronounced ``megaflops'') stands for ``millions of floating point operations per second.'' This is often used as a ``bottom-line'' figure. If you know ahead of time how many operations a program needs to perform, you can divide the number of operations by the execution time to come up with a MFLOPS rating. For example, the standard algorithm for multiplying $n \times n$ matrices requires $2n^3$ operations ($n^2$ inner products, with $n$ multiplications and $n$ additions in each product). Suppose you compute the product of two $n \times n$ matrices in 0.35 seconds. Your computer achieved $2n^3 / 0.35$ operations per second, i.e. $2n^3 / (0.35 \times 10^6)$ MFLOPS.
Obviously this type of comparison ignores the overhead involved in setting up loops, checking terminating conditions, and so on, but as a ``bottom line'' it gets to the point: what you care about (in this example) is how long it takes to multiply two matrices, and if that operation is a major component of your research it makes sense to compare machines by how fast they can multiply matrices. A standard set of reference programs known as LINPACK (linear algebra package) is often used to compare systems based on their MFLOPS ratings by measuring execution times for Gaussian elimination on matrices [8].
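A MFLOPS rating of this kind is just an operation count divided by a measured time. The sketch below works through the matrix multiplication case; the matrix dimension `n = 100` is an assumed value chosen only for illustration, while the 0.35 second timing is the one used in the text. The function names are hypothetical.

```python
def matmul_flops(n):
    """Operation count for the standard algorithm: n*n inner products, each with n multiplies and n adds."""
    return 2 * n ** 3

def mflops(operations, seconds):
    """Millions of floating point operations per second."""
    return operations / seconds / 1e6

n = 100            # hypothetical matrix dimension, for illustration only
elapsed = 0.35     # seconds, the timing used in the text

print(matmul_flops(n))                              # 2000000
print(round(mflops(matmul_flops(n), elapsed), 2))   # 5.71
```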
The term ``theoretical peak MFLOPS'' refers to how many operations per second would be possible if the machine did nothing but numerical operations. It is obtained by calculating the time it takes to perform one operation and then computing how many of them could be done in one second. For example, if it takes 8 cycles to do one floating point multiplication, the cycle time on the machine is 20 nanoseconds, and arithmetic operations are not overlapped with one another, it takes 160ns for one multiplication, and

$$\frac{1\ \text{operation}}{160 \times 10^{-9}\ \text{s}} = 6.25 \times 10^{6}\ \text{operations per second}$$

so the theoretical peak performance is 6.25 MFLOPS. Of course, programs are not just long sequences of multiply and add instructions, so a machine rarely comes close to this level of performance on any real program. Most machines will achieve less than 10% of their peak rating, but vector processors or other machines with internal pipelines that have an effective CPI near 1.0 can often achieve 70% or more of their theoretical peak on small programs.
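The peak calculation above can be written as a one-line function, which makes it easy to try other cycle counts and cycle times. This is a minimal sketch that assumes, as the text does, that operations are not overlapped; `peak_mflops` is a hypothetical name.

```python
def peak_mflops(cycles_per_operation, cycle_time_ns):
    """Theoretical peak rate when operations run back to back with no overlap."""
    seconds_per_operation = cycles_per_operation * cycle_time_ns * 1e-9
    return 1.0 / seconds_per_operation / 1e6

peak = peak_mflops(8, 20)
print(round(peak, 2))            # 6.25
print(round(0.10 * peak, 3))     # 10% of peak: 0.625
print(round(0.70 * peak, 3))     # 70% of peak: 4.375
```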
Using metrics such as CPI, MIPS, or MFLOPS to compare machines depends heavily on the programs used to measure execution times. A benchmark is a program written specifically for this purpose. There are several well-known collections of benchmarks. One that is particularly interesting to computational scientists is LINPACK, which contains a set of linear algebra routines written in Fortran. MFLOPS ratings based on LINPACK performance are published regularly [8]. Two collections of a wider range of programs are SPEC (System Performance Evaluation Cooperative) and the Perfect Club, which is oriented toward parallel processing. Both include widely used programs such as a C compiler and a text formatter, not just small special purpose subroutines, and are useful for comparing systems such as high performance workstations that will be used for other jobs in addition to computational science modelling.