CS 903 Advanced Computer Architecture Unit - I
UNIT – I
UNIT – II
UNIT – III
Bus, Cache, and Shared Memory - Backplane Bus Systems, Cache Memory Organizations,
Shared-Memory Organizations, Sequential and Weak Consistency Models. Pipelining and
Superscalar Techniques - Linear Pipeline Processors, Nonlinear Pipeline Processors, Instruction
Pipeline Design, Arithmetic Pipeline Design, Superscalar and Superpipeline Design.
UNIT – IV
UNIT – V
Parallel Models, Languages and Compilers - Parallel Programming Models, Parallel Languages
and Compilers. Dependence Analysis of Data Arrays, Code Optimization and Scheduling, Loop
Parallelization and Pipelining. Parallel Program Development and Environments - Parallel
Programming Environments, Synchronization and Multiprocessing Models, Shared-Variable
Program Structures, Message-Passing Program Development, Mapping Programs onto
Multicomputers.
1. INTRODUCTION
From an application point of view, the mainstream usage of computers is experiencing a
trend of four ascending levels of sophistication:
• Data processing
• Information processing
• Knowledge processing
• Intelligence processing
With more and more data structures developed, many users are shifting their use of
computers from pure data processing to information processing. A high degree of
parallelism has been found at these levels. As accumulated knowledge bases have
expanded rapidly in recent years, a strong demand has grown to use computers for
knowledge processing. Intelligence is very difficult to create; its processing is even more
so. Today's computers are very fast and obedient and have many reliable memory cells,
which qualifies them for data, information, and knowledge processing.
1st generation of computers (1945-54)
The first generation computers were based on vacuum tube technology. The first
large electronic computer was ENIAC (Electronic Numerical Integrator and Calculator),
which used high-speed vacuum tube technology and was designed primarily to calculate
the trajectories of missiles. It used separate memory blocks for program and data.
Later, in 1946, John von Neumann introduced the concept of the stored program, in which
data and program are stored in the same memory block. Based on this concept, EDVAC
(Electronic Discrete Variable Automatic Computer) was built in 1951. On this concept the
IAS (Institute for Advanced Study, Princeton) computer was built, whose main
characteristic was a CPU consisting of two units (program flow control and execution unit).
In general, the key features of this generation of computers were:
1) The switching device used was the vacuum tube, with a switching time between 0.1 and 1
millisecond.
2) A major concern for computer manufacturers of this era was that each computer had a
unique design. Because each computer had a unique design, one could not upgrade or
replace a component with one from another computer. Programs that were written for
one machine could not execute on another machine, even when the other computer was
designed by the same company. This was a major concern for designers, as there
were no upward-compatible machines or computer architectures with multiple, differing
implementations. Designers therefore always tried to manufacture a new machine that would
be upward compatible with the older machines.
3) The concept of specialized registers was introduced: for example, index registers were
introduced in the Ferranti Mark I, the concept of a register that saves the return address
was introduced in the UNIVAC I, and the concepts of immediate operands in the IBM
704 and the detection of invalid operations in the IBM 650 were introduced.
4) Punch cards or paper tape were the devices used at that time for storing programs. By
the end of the 1950s the IBM 650 had become one of the most popular computers of the time;
it used drum memory onto which programs were loaded from punch cards or paper tape. Some
high-end machines also introduced the concept of core memory, which was able to
provide higher speeds. Hard disks also started becoming popular.
5) In the early 1950s, as said earlier, computers were design-specific, and hence most of them
were designed for particular numerical processing tasks. Many of them even used
decimal numbers as their base number system for designing the instruction set. In such
machines there were actually ten vacuum tubes per digit in each register.
6) Software used was machine level language and assembly language.
7) Mostly designed for scientific calculation and later some systems were developed for
simple business systems.
8) Architecture features
Vacuum tubes and relay memories
CPU driven by a program counter (PC) and accumulator
Machines had only fixed-point arithmetic
9) Software and Applications
Machine and assembly language
Single user at a time
No subroutine linkage mechanisms
Programmed I/O required continuous use of CPU
10) Examples: ENIAC, Princeton IAS, IBM 701
2nd generation of computers (1954-64)
The transistor was invented by Bardeen, Brattain, and Shockley in 1947 at Bell Labs,
and by the 1950s these transistors had made an electronic revolution, as the transistor is
smaller, cheaper, and dissipates less heat than a vacuum tube. Transistors were now
used instead of vacuum tubes to construct computers. Another major invention was
that of magnetic cores for storage. These cores were used to build large random-access
memories. This generation of computers had better processing speed, larger memory
capacity, and smaller size compared to the previous generation.
The key features of this generation of computers were:
1) The 2nd generation computers were designed using germanium transistors; this
technology was much more reliable than vacuum tube technology.
2) Use of transistor technology reduced the switching time to 1 to 10 microseconds, thus
providing an overall speedup.
3) Magnetic cores were used as main memory, with a capacity of about 100 KB. Tape and disk
peripheral memory were used as secondary memory.
4) Introduction of the concept of instruction sets, so that the same program could be
executed on different systems.
5) High-level languages: FORTRAN, COBOL, ALGOL; batch operating systems.
6) Computers were now used for extensive business applications, engineering design,
optimization using linear programming, and scientific research.
7) The binary number system was widely used.
8) Technology and Architecture
Discrete transistors and core memories
I/O processors, multiplexed memory access
Floating-point arithmetic available
Register Transfer Language (RTL) developed
9) Software and Applications
High-level languages (HLL): FORTRAN, COBOL, ALGOL with compilers and
subroutine libraries
Batch operating system was used, although mostly single user at a time
10) Examples: CDC 1604, UNIVAC LARC, IBM 7090
3rd generation computers (1965 to 1974)
In the 1950s and 1960s the discrete components (transistors, resistors, capacitors) were
manufactured and packaged in separate containers. To design a computer, these discrete
units were soldered or wired together on circuit boards. Another revolution in computer
design came in the 1960s, when the Apollo guidance computer and the Minuteman
missile programs drove the development of the integrated circuit (commonly called the IC).
ICs made circuit design more economical and practical. IC-based computers are
called third generation computers. As integrated circuits consist of transistors, resistors, and
capacitors on a single chip, eliminating wired interconnections, the space required for the
computer was greatly reduced. By the mid-1970s, the use of ICs in computers had become
very common, and the price of transistors had fallen dramatically. It was now possible to put all
the components required for designing a CPU on a single printed circuit board. This
advancement of technology resulted in the development of minicomputers, usually with a 16-
bit word size; these systems had memory in the range of 4 KB to 64 KB. This began a new era
of microelectronics in which it became possible to design small identical chips (thin wafers
of silicon). Each chip has many gates plus a number of input/output pins.
Key features of 3rd generation computers:
1) The use of silicon-based ICs led to major improvements in computer systems. The switching
speed of transistors went up by a factor of 10, size was reduced by a factor of 10,
reliability increased by a factor of 10, and power dissipation was reduced by a factor of 10.
The cumulative effect of this was the emergence of extremely powerful CPUs with the
capacity to carry out 1 million instructions per second.
2) The size of main memory reached about 4 MB by improving the design of magnetic
core memories; hard disks of 100 MB also became feasible.
3) Online systems became feasible. In particular, dynamic production control systems,
airline reservation systems, interactive query systems, and real-time closed-loop process
control systems were implemented.
4) The concept of integrated database management systems emerged.
5) 32-bit instruction formats.
6) Time-shared operating systems.
7) Technology and Architecture features
Integrated circuits (SSI/MSI)
Microprogramming
Pipelining, cache memories, lookahead processing
benchmarks were developed. Computer architects have come up with a variety of metrics
to describe the computer performance.
Clock rate and CPI / IPC: Since I/O and system overhead frequently
overlap processing by other programs, it is fair to consider only the CPU time used by a
program, and the user CPU time is the most important factor. The CPU is driven by a clock
with a constant cycle time (usually measured in nanoseconds), which controls the rate of
internal operations in the CPU. The clock has a constant cycle time (τ, in
nanoseconds). The inverse of the cycle time is the clock rate (f = 1/τ, measured in
megahertz). A shorter clock cycle time, or equivalently a larger number of cycles per
second, implies more operations can be performed per unit time. The size of a program
is determined by its instruction count, Ic, the number of machine instructions to be executed
by the program. Different machine instructions require different numbers of clock cycles to
execute. CPI (cycles per instruction) is thus an important parameter.
Average CPI
It is easy to determine the average number of cycles per instruction for a particular
processor if we know the frequency of occurrence of each instruction type.
Of course, any estimate is valid only for a specific set of programs (which defines the
instruction mix), and then only if there is a sufficiently large number of instructions.
In general, the term CPI is used with respect to a particular instruction set and a given
program mix. The time required to execute a program containing Ic instructions is just
T = Ic × CPI × τ.
Each instruction must be fetched from memory, decoded, then operands fetched from
memory, the instruction executed, and the results stored.
The time required to access memory is called the memory cycle time, which is usually k
times the processor cycle time τ. The value of k depends on the memory technology and
the processor-memory interconnection scheme. The processor cycles required for each
instruction (CPI) can be attributed to cycles needed for instruction decode and execution
(p) and cycles needed for memory references (m × k).
The total time needed to execute a program can then be rewritten
as T = Ic × (p + m × k) × τ.
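As a rough numerical illustration of how these formulas combine, the following Python sketch uses an assumed, purely hypothetical instruction mix and machine parameters (none of these figures come from the text) to compute the average CPI and the execution time T = Ic × (p + m × k) × τ:

    # Illustrative sketch only: the instruction mix and machine parameters below
    # are hypothetical, chosen just to show how the formulas combine.

    tau = 10e-9          # processor cycle time in seconds (10 ns)
    k = 4                # memory cycle time = k * tau

    # Assumed instruction mix: (fraction of Ic, decode/execute cycles p, memory references m)
    mix = [
        (0.60, 1, 0),    # ALU operations
        (0.30, 2, 1),    # loads/stores
        (0.10, 2, 0),    # branches
    ]

    Ic = 1_000_000       # assumed instruction count

    # Average CPI = sum over the mix of fraction * (p + m*k)
    avg_cpi = sum(frac * (p + m * k) for frac, p, m in mix)

    # Total execution time T = Ic * CPI * tau
    T = Ic * avg_cpi * tau

    print(f"average CPI = {avg_cpi:.2f}")          # 2.60 for this assumed mix
    print(f"execution time T = {T * 1e3:.3f} ms")  # 26.000 ms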
MIPS: The millions of instructions per second rating is calculated by dividing the
number of instructions executed in a running program by the time required to run the
program. The MIPS rate is directly proportional to the clock rate and inversely proportional
to the CPI. All four system attributes (instruction set, compiler, processor, and memory
technologies) affect the MIPS rate, which also varies from program to program. MIPS
does not prove to be an effective measure, as it does not account for the fact that different
systems often require different numbers of instructions to implement the same program. It
does not indicate how many instructions are required to perform a given task. With the
variation in instruction styles, internal organization, and number of processors per system,
it is almost meaningless for comparing two systems.
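Continuing the hypothetical figures from the previous sketch, the MIPS rate can be computed as f/(CPI × 10^6):

    # Minimal sketch, reusing the assumed figures from the previous example.
    f = 100e6            # assumed clock rate in Hz (i.e. tau = 10 ns)
    avg_cpi = 2.6        # average CPI from the previous sketch

    mips = f / (avg_cpi * 1e6)          # MIPS rate = f / (CPI * 10^6)
    print(f"MIPS rate = {mips:.1f}")    # about 38.5 MIPS for these assumed numbers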
MFLOPS (pronounced ``megaflops'') stands for ``millions of floating point
operations per second.'' This is often used as a ``bottom-line'' figure. If one knows ahead of
time how many operations a program needs to perform, one can divide the number of
operations by the execution time to come up with a MFLOPS rating. For example, the
standard algorithm for multiplying n×n matrices requires 2n³ – n operations (n² inner
products, with n multiplications and n-1 additions in each product). Suppose you compute
the product of two 100×100 matrices in 0.35 seconds. Then the computer achieves
(2(100)³ – 100)/0.35 = 5,714,000 ops/sec = 5.714 MFLOPS
The term ``theoretical peak MFLOPS'' refers to how many operations per second would
be possible if the machine did nothing but numerical operations. It is obtained by
calculating the time it takes to perform one operation and then computing how many of
them could be done in one second. For example, if it takes 8 cycles to do one floating
point multiplication, the cycle time on the machine is 20 nanoseconds, and arithmetic
operations are not overlapped with one another, it takes 160 ns for one multiplication, and
(1,000,000,000 ns / 1 sec) × (1 multiplication / 160 ns) = 6.25×10⁶
multiplications/sec, so the theoretical peak performance is 6.25 MFLOPS. Of course,
programs are not just long sequences of multiply and add instructions, so a machine
rarely comes close to this level of performance on any real program. Most machines will
achieve less than 10% of their peak rating, but vector processors or other machines with
internal pipelines that have an effective CPI near 1.0 can often achieve 70% or more of
their theoretical peak on small programs.
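The two MFLOPS figures worked out above can be reproduced directly; the short Python sketch below simply re-evaluates the numbers given in the text:

    # Reproducing the two MFLOPS figures from the text.

    # Matrix-multiply example: 2n^3 - n operations in 0.35 s for n = 100
    n = 100
    ops = 2 * n**3 - n
    mflops = ops / 0.35 / 1e6
    print(f"achieved: {mflops:.3f} MFLOPS")          # ~5.714 MFLOPS

    # Theoretical peak: 8 cycles per multiply at a 20 ns cycle time
    cycle_time_ns = 20
    cycles_per_mult = 8
    ns_per_mult = cycles_per_mult * cycle_time_ns    # 160 ns per multiplication
    peak_mflops = (1e9 / ns_per_mult) / 1e6
    print(f"theoretical peak: {peak_mflops:.2f} MFLOPS")   # 6.25 MFLOPS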
Throughput rate: Another important factor by which a system's performance is
measured is the throughput of the system, which is basically how many programs a system can
execute per unit time, Ws. In multiprogramming, the system throughput is often lower than
the CPU throughput Wp, which is defined as
Wp = f/(Ic × CPI)
The unit of Wp is programs/second.
Ws < Wp because in a multiprogramming environment there are always additional overheads,
such as the time-sharing operating system. Ideal behavior is not achieved in parallel computers
because, while executing a parallel algorithm, the processing elements cannot devote
100% of their time to the computations of the algorithm. Efficiency is a measure of the
fraction of time for which a PE is usefully employed. In an ideal parallel system,
efficiency is equal to one. In practice, efficiency is between zero and one because of the
overheads associated with parallel execution.
Speed or Throughput (W/Tn) - the execution rate on an n-processor system, measured
in FLOPs/unit-time or instructions/unit-time.
Speedup (Sn = T1/Tn) - how much faster an actual machine with n processors will perform
the workload compared to 1 processor. The ratio T1/T∞ is called the asymptotic speedup.
Efficiency (En = Sn/n) - fraction of the theoretical maximum speedup achieved by
n processors.
Degree of Parallelism (DOP) - for a given piece of the workload, the number
of processors that can be kept busy sharing that piece of computation equally. Neglecting
overhead, we assume that if k processors work together on any workload, the workload
gets done k times as fast as a sequential execution.
Scalability - the attributes of a computer system which allow it to be gracefully
and linearly scaled up or down in size, to handle smaller or larger workloads, or to obtain
proportional decreases or increases in speed on a given application. The applications run
on a scalable machine may not scale well. Good scalability requires the algorithm and the
machine to have the right properties.
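As a small numerical illustration of these definitions (all values below are assumed, not measured), the following sketch computes the CPU throughput Wp, the speedup Sn, and the efficiency En:

    # Illustrative sketch with assumed values; not measurements.

    # CPU throughput Wp = f / (Ic * CPI), in programs/second
    f = 100e6            # assumed clock rate, Hz
    Ic = 1_000_000       # assumed instruction count per program
    cpi = 2.6            # assumed average CPI
    Wp = f / (Ic * cpi)
    print(f"Wp = {Wp:.1f} programs/second")

    # Speedup and efficiency for an assumed parallel run
    T1 = 100.0           # assumed time on 1 processor (seconds)
    Tn = 14.0            # assumed time on n processors
    n = 8
    Sn = T1 / Tn         # speedup
    En = Sn / n          # efficiency, between zero and one in practice
    print(f"speedup S{n} = {Sn:.2f}, efficiency E{n} = {En:.2f}")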
Thus, in general, there are five performance factors (Ic, p, m, k, τ) which are influenced by
four system attributes:
• instruction-set architecture (affects Ic and p)
• compiler technology (affects Ic, p, and m)
• CPU implementation and control (affects p and τ)
• cache and memory hierarchy (affects the memory access latency, k and τ)
If cache coherency is maintained, then it may also be called CC-NUMA (Cache Coherent
NUMA).
Disadvantages:
Distributed Memory
• Like shared memory systems, distributed memory systems vary widely but share
a common characteristic. Distributed memory systems require a communication
network to connect inter-processor memory.
• Processors have their own local memory. Memory addresses in one processor do
not map to another processor, so there is no concept of global address space
across all processors.
• Because each processor has its own local memory, it operates independently.
Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task
of the programmer to explicitly define how and when data is communicated.
Synchronization between tasks is likewise the programmer's responsibility.
• Modern multicomputers use hardware routers to pass messages. Based on the
interconnection, routers, and channels used, multicomputers are divided into
generations:
o 1st generation: based on board technology, using hypercube architecture
and software-controlled message switching.
o 2nd generation: implemented with mesh-connected architecture, hardware
message routing, and a software environment for medium-grained
distributed computing.
o 3rd generation: fine-grained multicomputers like the MIT J-Machine.
• The network "fabric" used for data transfer varies widely, though it can be as
simple as Ethernet.
Advantages:
Disadvantages:
• The programmer is responsible for many of the details associated with data
communication between processors.
• It may be difficult to map existing data structures, based on global memory, to
this memory organization.
• Non-uniform memory access (NUMA) times
• EREW - Exclusive read, exclusive write; any memory location may only be
accessed once in any one step. This forbids more than one processor from
reading or writing the same memory cell simultaneously.
• CREW - Concurrent read, exclusive write; any memory location may be read
any number of times during a single step, but only written to once, with the write
taking place after the reads.
• ERCW - Exclusive read, concurrent write; this allows exclusive reads but concurrent
writes to the same memory location.
• CRCW - Concurrent read, concurrent write; any memory location may be
written to or read from any number of times during a single step. A CRCW PRAM
model must define some rule for resolving multiple writes, such as giving priority
to the lowest-numbered processor or choosing amongst processors randomly. The
PRAM is popular because it is theoretically tractable and because it gives
algorithm designers a common target. However, PRAMs cannot be emulated
optimally on all architectures.
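The write-conflict rules can be made concrete with a tiny simulation. The sketch below is purely illustrative: it resolves concurrent writes to the same cell in one PRAM step under the priority rule, where the lowest-numbered processor wins.

    # Illustrative sketch of CRCW write-conflict resolution under the
    # "priority" rule: the lowest-numbered processor wins.

    def crcw_priority_write(requests):
        """requests: list of (processor_id, address, value) issued in one PRAM step.
        Returns a dict address -> value after resolving conflicts by priority."""
        memory = {}
        # Apply writes from the highest processor id down to the lowest, so that
        # the lowest-numbered processor's value is the one that finally remains.
        for pid, addr, val in sorted(requests, reverse=True):
            memory[addr] = val
        return memory

    # Three processors write to the same cell in the same step.
    step = [(2, 0, 'c'), (0, 0, 'a'), (1, 0, 'b')]
    print(crcw_priority_write(step))   # {0: 'a'} -> processor 0 (lowest id) wins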
VLSI Model:
Parallel computers rely on the use of VLSI chips to fabricate the major components, such
as processor arrays, memory arrays, and large-scale switching networks. With the rapid
advances in very large scale integration (VLSI) technology, computer architects are now trying to
implement parallel algorithms directly in hardware. The AT² model is an example of a model for
two-dimensional VLSI chips.
Keywords
MIPS one Million Instructions Per Second. A performance rating usually referring
to integer or non-floating-point instructions.
vector processor A computer designed to apply arithmetic operations to long vectors
or arrays. Most vector processors rely heavily on pipelining to achieve high performance.
pipelining Overlapping the execution of two or more operations.
PROGRAM AND NETWORK PROPERTIES
6. CONDITIONS OF PARALLELISM
The ability to execute several program segments in parallel requires each segment to be
independent of the other segments. We use a dependence graph to describe the relations.
The nodes of a dependence graph correspond to the program statements (instructions), and
directed edges with different labels are used to represent the ordered relations among the
statements. The analysis of dependence graphs shows where opportunity exists for
parallelization and vectorization.
Data and Resource Dependence
Data dependence: The ordering relationship between statements is indicated by the
data dependence. Five types of data dependence are defined below:
1. Flow dependence: A statement S2 is flow-dependent on S1 if an execution path exists
from S1 to S2 and if at least one output (variable assigned) of S1 feeds in as input to S2.
Bernstein’s Conditions - 2
In terms of data dependencies, Bernstein’s conditions imply that two processes can
execute in parallel if they are flow-independent, anti-independent, and output-
independent. The parallelism relation || is commutative (Pi || Pj implies Pj || Pi), but not
transitive (Pi || Pj and Pj || Pk does not imply Pi || Pk). Therefore, || is not an equivalence
relation. Intersection of the input sets is allowed.
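A minimal sketch of checking Bernstein's conditions for two statements, given their input and output sets (the statements S1-S3 below are hypothetical examples chosen for illustration):

    # Minimal sketch: Bernstein's conditions for two processes P1 and P2 with
    # input sets I1, I2 and output sets O1, O2. P1 || P2 holds iff
    #   I1 ∩ O2 = Ø,  I2 ∩ O1 = Ø,  O1 ∩ O2 = Ø
    # (intersection of the two input sets is allowed).

    def can_run_in_parallel(I1, O1, I2, O2):
        return not (I1 & O2) and not (I2 & O1) and not (O1 & O2)

    # Hypothetical statements:
    #   S1: a = b + c        I1 = {b, c}, O1 = {a}
    #   S2: d = b * 2        I2 = {b},    O2 = {d}
    #   S3: e = a - 1        I3 = {a},    O3 = {e}
    print(can_run_in_parallel({'b', 'c'}, {'a'}, {'b'}, {'d'}))   # True: only inputs overlap
    print(can_run_in_parallel({'b', 'c'}, {'a'}, {'a'}, {'e'}))   # False: a flows from S1 to S3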
Medium-sized grain; usually less than 2000 instructions. Detection of parallelism is more
difficult than with smaller grains; interprocedural dependence analysis is difficult and
history-sensitive. The communication requirement is less than at the instruction level. SPMD
(single procedure, multiple data) is a special case. Multitasking belongs to this level.
Subprogram-level Parallelism
Job step level; the grain typically has thousands of instructions; medium- or coarse-grain
level. Job steps can overlap across different jobs. Multiprogramming is conducted at this level.
No compilers are available to exploit medium- or coarse-grain parallelism at present.
Job or Program-Level Parallelism
Corresponds to execution of essentially independent jobs or programs on a parallel
computer. This is practical for a machine with a small number of powerful processors,
but impractical for a machine with a large number of simple processors (since each
processor would take too long to process a single job).
Communication Latency
Balancing granularity and latency can yield better performance. Various latencies are
attributed to machine architecture, technology, and the communication patterns used.
Latency imposes a limiting factor on machine scalability. For example, memory latency increases
as memory capacity increases, limiting the amount of memory that can be used with a
given tolerance for communication latency.
Interprocessor Communication Latency
• Needs to be minimized by system designer
• Affected by signal delays and communication patterns. For example, n communicating tasks
may require n(n - 1)/2 communication links, and this complexity grows
quadratically, effectively limiting the number of processors in the system.
Communication Patterns
• Determined by algorithms used and architectural support provided
• Patterns include permutation, broadcast, multicast, and conference.
• Tradeoffs often exist between granularity of parallelism and communication
demand.
Node degree reflects the number of I/O ports associated with a node, and should ideally be
small and constant.
A network is symmetric if the topology looks the same from any node; such networks are easier
to implement and to program.
Diameter: The maximum distance between any two processors in the network; in other
words, the diameter is the maximum number of (routing) processors through
which a message must pass on its way from source to destination. The diameter thus
measures the maximum delay for transmitting a message from one processor to another,
and since it determines communication time, the smaller the diameter, the better the
network topology.
Connectivity: How many paths are possible between any two processors i.e., the
multiplicity of paths between two processors. Higher connectivity is desirable as it
minimizes contention.
Arc connectivity of the network: the minimum number of arcs that must be removed
to break the network into two disconnected networks. The arc connectivity of various
networks is as follows:
• 1 for linear arrays and binary trees
• 2 for rings and 2-d meshes
• 4 for 2-d torus
• d for d-dimensional hypercubes
The larger the arc connectivity, the lower the contention and the better the network topology.
Channel width: The channel width is the number of bits that can be communicated
simultaneously by the interconnection bus connecting two processors.
Bisection Width and Bandwidth: In order to divide the network into two equal halves, we need
to remove some communication links. The minimum number of such communication
links that have to be removed is called the bisection width. The bisection width
basically gives us the largest number of messages which can be sent
simultaneously (without needing to use the same wire or routing processor at the
same time and so delaying one another), no matter which processors are sending to which
other processors. Thus, the larger the bisection width, the better the network topology is
considered to be. Bisection bandwidth is the minimum volume of communication allowed
between the two halves of the network with equal numbers of processors. This is important
for networks with weighted arcs, where the weights correspond to the link width (i.e.,
how much data it can transfer). The larger the bisection bandwidth, the better the network
topology is considered to be.
Cost: the cost of a network can be estimated on a variety of criteria; here we consider
the number of communication links or wires used to design the network as the basis of
cost estimation. The smaller the cost, the better.
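To make these parameters concrete, the sketch below tabulates the usual textbook values of node degree, diameter, and bisection width for a few common static topologies with N nodes (ring, 2-D mesh without wraparound, hypercube):

    # Illustrative sketch: standard node degree, diameter, and bisection width
    # formulas for a few common static networks with N nodes (N a perfect square
    # for the mesh, a power of two for the hypercube).
    import math

    def ring(N):
        return {'degree': 2, 'diameter': N // 2, 'bisection_width': 2}

    def mesh_2d(N):                      # sqrt(N) x sqrt(N) mesh, no wraparound
        r = int(math.isqrt(N))
        return {'degree': 4, 'diameter': 2 * (r - 1), 'bisection_width': r}

    def hypercube(N):                    # N = 2^n nodes
        n = int(math.log2(N))
        return {'degree': n, 'diameter': n, 'bisection_width': N // 2}

    for name, fn in [('ring', ring), ('2-D mesh', mesh_2d), ('hypercube', hypercube)]:
        print(f"{name:10s} N=64 -> {fn(64)}")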
Data Routing Functions: A data routing network is used for inter-PE data exchange. It
can be static, as in the case of a hypercube routing network, or dynamic, such as a multistage
network. Various types of data routing functions are shifting, rotating, permutation (one
to one), broadcast (one to all), multicast (many to many), personalized broadcast (one to
many), shuffle, exchange, etc.
Permutations
Given n objects, there are n! ways in which they can be reordered (one of which is no
reordering). A permutation can be specified by giving the rule for reordering a group of
objects. Permutations can be implemented using crossbar switches, multistage networks,
shifting, and broadcast operations. The time required to perform permutations of the
connections between nodes often dominates the network performance when n is large.
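Viewed abstractly, a permutation routing step is just a reordering of the data held by the PEs; a minimal, purely illustrative sketch:

    # Illustrative sketch: applying a permutation to data held by n PEs.
    # perm[i] gives the destination PE of the data currently held by PE i.

    def route_permutation(data, perm):
        out = [None] * len(data)
        for src, dst in enumerate(perm):
            out[dst] = data[src]
        return out

    data = ['d0', 'd1', 'd2', 'd3']
    perm = [2, 0, 3, 1]                      # one of the 4! = 24 possible reorderings
    print(route_permutation(data, perm))     # ['d1', 'd3', 'd0', 'd2']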
Perfect Shuffle and Exchange
Stone suggested the special permutation that reorders entries according to the mapping of the k-bit
binary number a b … k to b c … k a (that is, shifting 1 bit to the left and wrapping it
around to the least significant bit position). The inverse perfect shuffle reverses the effect
of the perfect shuffle.
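The perfect shuffle is thus a 1-bit left rotation of the k-bit address; a minimal sketch, together with its inverse:

    # Minimal sketch: perfect shuffle as a 1-bit left rotation of a k-bit address,
    # and the inverse perfect shuffle as the corresponding right rotation.

    def perfect_shuffle(addr, k):
        return ((addr << 1) | (addr >> (k - 1))) & ((1 << k) - 1)

    def inverse_shuffle(addr, k):
        return (addr >> 1) | ((addr & 1) << (k - 1))

    k = 3
    for a in range(2 ** k):
        s = perfect_shuffle(a, k)
        print(f"{a:0{k}b} -> {s:0{k}b}")
        assert inverse_shuffle(s, k) == a    # the inverse undoes the shuffle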
Hypercube Routing Functions
If the vertices of an n-dimensional cube are labeled with n-bit numbers so that only one bit
differs between each pair of adjacent vertices, then n routing functions are defined by the
bits in the node (vertex) address. For example, with a 3-dimensional cube, we can easily
identify routing functions that exchange data between nodes with addresses that differ in
the least significant, most significant, or middle bit.
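Each routing function simply complements one bit of the node address; the sketch below lists the node pairs connected by C0, C1, and C2 for a 3-dimensional cube:

    # Sketch: the hypercube routing function C_i complements bit i of the
    # node address, connecting each node to its neighbour across dimension i.

    def cube_route(addr, i):
        return addr ^ (1 << i)          # flip bit i

    n = 3                               # 3-dimensional cube, 8 nodes
    for i in range(n):
        pairs = [(a, cube_route(a, i)) for a in range(2 ** n)]
        print(f"C{i}:", [(f"{a:03b}", f"{b:03b}") for a, b in pairs if a < b])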
Multistage Networks
Many stages of interconnected switches form a multistage SIMD network. It basically
consists of three characteristic features:
• The switch box,
• The network topology
• The control structure
Each box is
essentially an interchange device with two inputs and two outputs. The four possible
states of a switch box, which are shown in figure 3.6, are:
• Straight
• Exchange
• Upper Broadcast
• Lower broadcast.
A two-function switch box can assume only two possible states, namely the straight or exchange
states. However, a four-function switch box can be in any of the four possible states. A
multistage network is capable of connecting any input terminal to any output terminal.
Multistage networks are basically constructed from so-called shuffle-exchange switching
elements, each of which is basically a 2 × 2 crossbar. Multiple layers of these elements are
connected to form the network.
Figure 2.5 A two-by-two switching box and its four interconnection states
A multistage network is capable of connecting an arbitrary input terminal to an arbitrary
output terminal. The number of stages is log2 N, i.e., n, and the total cost depends on the total
number of switches used, which is N log2 N.
The control structure can be individual stage control, i.e., the same control signal is used
to set all switch boxes in the same stage; thus we need n control signals. The second
control structure is individual box control, where a separate control signal is used to set
the state of each switch box. This provides flexibility but at the same time requires n²/2 control
signals, which increases the complexity of the control circuit. An in-between approach is the
use of partial stage control.
A cube network for N PEs has dimension n = log2 N. We use a binary sequence to represent the
vertex (PE) address of the cube. Two processors are neighbors if and only if their binary
addresses differ in only one digit place.
An n-dimensional cube network of N PEs is specified by the following n routing
functions:
Ci(An-1 … A1 A0) = An-1 … Ai+1 Ai' Ai-1 … A0 for i = 0, 1, 2, …, n-1
In an n-dimensional cube, each PE located at a corner is directly connected to n neighbors.
The addresses of neighboring PEs differ in exactly one bit position. Pease's binary n-cube,
the flip network used in STARAN, and the programmable switching network proposed for
Phoenix are examples of cube networks.
In a recirculating cube network, each ISA for 0 ≤ A ≤ N-1 is connected to n OSs whose
addresses are An-1 … Ai+1 Ai' Ai-1 … A0. When the PE addresses are considered as
the corners of an m-dimensional cube, this network connects each PE to its m neighbors.
The interconnections of the PEs corresponding to the three routing functions C0, C1 and
C2 are shown separately in the figure below.
• Examples
Figure 2.10 The recirculating Network
It takes n ≤ log2 N steps to route data from any PE to another.
Example: N = 8 => n = 3
Keywords
Dependence graph: A directed graph whose nodes represent calculations and whose
edges represent dependencies among those calculations. If the calculation
represented by node k depends on the calculations represented by nodes i and j, then the
dependence graph contains the edges i-k and j-k.
data dependency: a situation existing between two statements if one statement can store
into a location that is later accessed by the other statement.
granularity The size of operations done by a process between communication events. A
fine-grained process may perform only a few arithmetic operations between processing
one message and the next, whereas a coarse-grained process may perform millions.
control-flow computers refers to an architecture with one or more program counters that
determine the order in which instructions are executed.
dataflow A model of parallel computing in which programs are represented as dependence
graphs and each operation is automatically blocked until the values on which it depends are
available. The parallel functional and parallel logic programming models are very similar
to the dataflow model.
network A physical communication medium. A network may consist of one or more buses,
a switch, or the links joining processors in a multicomputer.
Static networks: point-to-point direct connections that will not change during program
execution
Dynamic networks: switched channels dynamically configured to match user program
communication demands; these include buses, crossbar switches, and multistage networks.
routing The act of moving a message from its source to its destination. A routing technique is a
way of handling the message as it passes through individual nodes.
Diameter D of a network is the maximum shortest path between any two nodes, measured by the
number of links traversed; this should be as small as possible (from a communication point of view).
Channel bisection width b = minimum number of edges cut to split a network into two parts each
having the same number of nodes. Since each channel has w bit wires, the wire bisection width B
= bw. Bisection width provides good indication of maximum communication bandwidth along
the bisection of a network, and all other cross sections should be bounded by the bisection width.
Wire (or channel) length = length (e.g. weight) of edges between nodes.