Student Notes 1
Course Objectives
No Objective
CO2 To learn the major components of a computer and their interconnections, both with
each other and the outside world with detailed discussion of internal and external
memory and of input–output (I/O) devices.
CO3 To examine the internal architecture and organization of the processor with an
extended discussion of computer arithmetic and the instruction set architecture.
CO5 To learn the Hardware Description Language and simulator to design and verify the
basic components of a computing system.
Text Book(s)
T1 Stallings William, Computer Organization & Architecture, Pearson Education, 8th Ed.,
2010.
Reference Book(s)
R1 C. Hamacher, Z. Vranesic and S. Zaky, Computer Organization, McGraw-Hill, 5th Ed.,
2002
R2 J.L. Hennessy & D.A. Patterson, Computer Organization & Design, Morgan Kaufmann, 4th
Ed., 2009
R3 The Essentials of Computer Organization and Architecture, Linda Null and Julia
Lobur, Jones and Bartlett publisher, 2003 [24x7 Online Book]
Module 1
Computer System Components and Interconnections
Function:
In general terms, there are four functions of a computer system:
• Data processing: the computer must be able to process data.
• Data storage: the computer must be able to store data.
• Data movement: the computer must be able to move data between itself and the
outside world.
• Control: the computer must be able to control these three functions.
Structure:
The main structural components are:
Central processing unit (CPU): Controls the operation of the computer and
performs its data processing functions; often simply referred to as processor.
Main memory: Stores data.
Input/Output: Moves data between the computer and its external
environment.
System interconnection: Some mechanism that provides for communication
among CPU, main memory, and I/O. A common example of system
interconnection is by means of a system bus, consisting of a number of
conducting wires to which all the other components attach.
Control unit: Controls the operation of the CPU and hence the computer
Arithmetic and logic unit (ALU): Performs the computer’s data processing
functions
Registers: Provides storage internal to the CPU
CPU interconnection: Some mechanism that provides for communication
among the control unit, ALU, and registers.
Concept of program
• Fetch instructions
• Interpret instructions
• Fetch data
• Process data
• Write/store data
8086
o Much more powerful (16-bit data)
o Instruction cache, pre-fetch few instructions
o 8088 (8-bit external bus) used in first IBM PC
80286
o 16 MByte memory addressable
o Up from 1MB (in 8086)
80386
o 32-bit processor with multitasking support
80486
o Sophisticated powerful cache and instruction pipelining
o Built-in math co-processor
Pentium
o Superscalar
o Multiple instructions executed in parallel
Pentium Pro
o Increased superscalar organization
o Aggressive register renaming
o Branch prediction and Data flow analysis
Pentium II
o MMX technology, graphics, video & audio processing
Pentium III
o Additional floating-point instructions for 3D graphics
Pentium 4
o Further floating point and multimedia enhancements
Itanium Series
o 64 bit with Hardware enhancements to increase speed
Fetch Cycle:
• Program Counter (PC) holds address of next instruction to fetch
• Increment PC
– Unless told otherwise
Execute Cycle:
– e.g. jump
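The fetch and execute cycles above can be sketched as a loop. This is a minimal illustration with an invented instruction format, opcode set, and memory layout, not any real machine:

```python
# Minimal fetch-decode-execute loop for a hypothetical machine.
# Opcodes (invented for illustration): 1 = LOAD addr, 2 = ADD addr,
# 3 = STORE addr, 4 = JUMP addr, 0 = HALT.
memory = {0: (1, 10), 1: (2, 11), 2: (3, 12), 3: (0, 0),
          10: 7, 11: 5, 12: 0}

pc, acc = 0, 0
while True:
    opcode, operand = memory[pc]   # fetch: PC holds address of next instruction
    pc += 1                        # increment PC (unless told otherwise)
    if opcode == 1:                # LOAD
        acc = memory[operand]
    elif opcode == 2:              # ADD
        acc += memory[operand]
    elif opcode == 3:              # STORE
        memory[operand] = acc
    elif opcode == 4:              # JUMP: execute stage overrides the incremented PC
        pc = operand
    elif opcode == 0:              # HALT
        break

print(memory[12])  # 7 + 5 = 12
```

Note how a jump is the "unless told otherwise" case: the execute stage overwrites the already-incremented PC.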
Interrupts
• Mechanism by which other modules (e.g. I/O) may interrupt the normal sequence of
processing
• Classes of interrupts:
– Program
– Timer
– I/O
– Hardware failure
Performance Assessment
Computer Performance
Performance is one of the key parameters in evaluating a computer system.
When we say one computer has better performance than another, what does it
mean? What is the criterion for performance?
Clock Rate
Operations performed by the processor are governed by the system clock (the
fundamental level of processor speed measurement)
Clock rate (clock cycles per second, in MHz or GHz) is the inverse of the clock cycle
time (clock period)
Performance
Instruction set
Choice of language
CPU Performance
To maximize performance, need to minimize execution time
performance = 1 / execution_time
Cycles Per Instruction (CPI)
• Program execution time (T) = Ic x CPI x t
– T = Ic x [p + (m x k)] x t, where p = processor cycles per instruction, m =
memory references per instruction, and k = ratio of memory cycle time to
processor cycle time
MIPS Rate:
• Common measure of performance
– MIPS rate = Ic / (T x 10^6)
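A quick worked example of the execution-time and MIPS formulas above; the instruction count, CPI, and clock rate are illustrative values, not from the notes:

```python
# Execution time and MIPS rate from instruction count, CPI, and clock rate.
Ic = 2_000_000          # instructions executed (illustrative)
CPI = 2.5               # average cycles per instruction (illustrative)
f = 500e6               # clock rate: 500 MHz (illustrative)
t = 1 / f               # clock cycle time is the inverse of the clock rate

T = Ic * CPI * t        # program execution time, T = Ic x CPI x t
mips = Ic / (T * 1e6)   # MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6)

print(T)     # 0.01 s
print(mips)  # 200.0
```

Note that the MIPS rate simplifies to f / (CPI x 10^6), so it depends only on clock rate and average CPI, not on program length.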
Benchmark Programs
• It is a collection of programs that provides a representative test of a computer
in a particular application area
– Memory Connections
– CPU Connections
Types of Buses:
e.g. CPU needs to read an instruction (data) from a given location in memory.
• Control lines
– Interrupt request
– Clock signals
• Synchronous
– Events determined by clock signals and synchronized on leading edge
of clock
• Asynchronous
– The occurrence of one event on a bus follows and depends on the
occurrence of a previous event
– Events on the bus are not synchronized with clock
Bus Interconnection
• More than one module controlling the bus
– Centralized
– Distributed
Arithmetic and Logic Unit (ALU)
• Handles integers
Integer Representation:
• No minus sign
Sign-Magnitude Representation
• Left most bit is sign bit
• 0 means positive
• 1 means negative
Integer Arithmetic:
Addition and Subtraction:
• Normal binary addition
– i.e. a - b = a + (-b)
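The identity a - b = a + (-b) can be demonstrated in two's-complement arithmetic, where -b is formed by inverting the bits of b and adding 1. A minimal 8-bit sketch:

```python
# Two's-complement subtraction a - b implemented as a + (-b) in an
# 8-bit word; -b is formed by inverting the bits of b and adding 1.
def twos_complement_sub(a, b, bits=8):
    mask = (1 << bits) - 1
    neg_b = ((~b) + 1) & mask        # invert and add one (two's complement of b)
    result = (a + neg_b) & mask      # normal binary addition, carry out discarded
    if result & (1 << (bits - 1)):   # interpret the sign bit
        result -= (1 << bits)
    return result

print(twos_complement_sub(7, 5))   # 2
print(twos_complement_sub(5, 7))   # -2
```

The point of the identity is that the same adder hardware handles both addition and subtraction; no separate subtractor circuit is needed.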
Multiplication:
• Works out a partial product for each digit
to a recoded value R
Division Algorithm
1. Load divisor into the M register and dividend into the A,Q registers
b) If the operation is unsuccessful and the MSB of A ≠ 0, then set Q0 ← 0 and restore
the previous value of A
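The restoring step described above (trial subtraction, then set Q0 and restore on failure) can be sketched for unsigned operands; the register width is chosen for illustration:

```python
# Restoring division sketch (unsigned, n-bit): M holds the divisor,
# Q the dividend, A the running remainder.
def restoring_divide(dividend, divisor, n=8):
    A, Q, M = 0, dividend, divisor
    for _ in range(n):
        # Shift A,Q left one bit: the MSB of Q moves into A.
        A = (A << 1) | ((Q >> (n - 1)) & 1)
        Q = (Q << 1) & ((1 << n) - 1)
        A -= M                      # trial subtraction
        if A < 0:                   # unsuccessful: set Q0 = 0 and restore A
            Q &= ~1
            A += M
        else:                       # successful: set Q0 = 1
            Q |= 1
    return Q, A                     # quotient in Q, remainder in A

print(restoring_divide(23, 5))  # (4, 3)
```

Each of the n iterations produces one quotient bit, mirroring long division by hand.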
Floating-Point Arithmetic:
• Round to nearest
• Round toward +∞, toward −∞, or toward zero (truncation)
MODULE - 3
Instruction Set Architecture
Microprocessor Registers:
• Visible registers
• Invisible registers
• AX- Accumulator
• CX – Count
• DX- Data
• FLAGS
– For controlling the microprocessor operation
• SS – Stack Segment
Instruction Set:
• The complete collection of instructions that are understood by a CPU
• ADD A, B
General Instruction Format
– Example
• MOV AL, [1234h]; Transfers one byte data from memory location
given by [DS+1234] to AL
– OPERAND(s) may be
• Part of instruction
Types of Instructions
• Data movement instructions
• Arithmetic - add, subtract, increment, decrement, convert byte/word and compare.
• Logic - AND, OR, exclusive OR, shift/rotate and test.
• String manipulation - load, store, move, compare and scan for byte/word.
• Control transfer - conditional, unconditional, call subroutine and return from
subroutine.
• Input/Output instructions.
• Other - setting/clearing flag bits, stack operations, software interrupts, etc.
Module - 4
Cache Memory Organization
Memory Hierarchy
• Computer memory exhibits the widest range of
– Physical Type
• Semiconductor (RAM), Magnetic (Disk), Optical (CD)
– Physical Characteristics
• Volatile/Non Volatile, Erasable/Non Erasable
– Organization
• Physical arrangement of bits
– Performance
• Access time (Latency), Transfer time, Memory cycle time
• Location
– CPU, Internal and External
• Capacity
– Number of Words/Bytes
• Unit of transfer
– Internal
• Usually governed by data bus width
– External
• Usually a block which is much larger than a word
• Addressable unit
– Smallest location which can be uniquely addressed
– Word or byte internally and Cluster on disks
• Access Methods
– Sequential
• Start at the beginning and read through in order
• Access time for a record is location dependent
• e.g. Magnetic Tape
– Direct
• Individual blocks have unique address based on physical location
• Access is by jumping to vicinity plus sequential search
• Access time depends on location and previous location e.g. Hard Disk
• Access Methods
– Random
• Wired-in addressing mechanism
• Individual addresses identify locations exactly
– Associative
• Data is located based on a portion of its contents rather than its address
• Also called Content Addressable Memory (CAM)
Cache Memory
It is a small amount of fast memory that sits between normal main memory and the CPU
Mapping Function:
The correspondence between main memory blocks (groups of words) and cache lines is
specified by a mapping function
Direct Mapping
• Each block of main memory maps to only one cache line
– i.e. if a block is in cache, it must be in one specific place
• Mapping Function
– jth Block of the main memory maps to ith cache line
• i = j modulo M (M = number of cache lines)
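The mapping i = j modulo M can be illustrated directly; the cache size here is an assumed value:

```python
# Direct mapping: main-memory block j maps to cache line i = j mod M.
M = 128                      # number of cache lines (illustrative)
for j in (0, 1, 128, 129, 256):
    print(j, '->', j % M)    # blocks 0, 128, 256 all compete for line 0
```

This shows the weakness of direct mapping: blocks 0, 128, and 256 all map to line 0, so alternating accesses to them cause repeated evictions (thrashing) even if the rest of the cache is empty.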
Fully Associative Mapping
A main memory block can load into any line of cache memory
Memory address is interpreted as tag and word
Tag uniquely identifies block of main memory
Set Associative Mapping
• Cache is divided into a number of sets
• Each set contains a fixed number of lines
• A given block maps to any line in a given set
– e.g. 2 lines per set
– 2 way associative mapping
– A given block can be in one of 2 lines in the set
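A sketch of the 2-way set-associative index calculation described above; the cache dimensions are assumed:

```python
# 2-way set-associative mapping: block j maps to set s = j mod S,
# and may occupy either of the K lines within that set.
M = 128                  # total cache lines (illustrative)
K = 2                    # lines per set (2-way)
S = M // K               # number of sets = 64

for j in (0, 64, 128):
    print(j, '->', 'set', j % S)   # all three map to set 0
```

Blocks 0, 64, and 128 all map to set 0, but because each set holds 2 lines, two of them can reside in the cache at once, softening the thrashing that direct mapping would cause.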
Cost
Cs = (C1S1+C2S2)/(S1+S2)
• Access Efficiency (T1/Ts)
– Measure of how close average access time is to M1 access time
– On chip cache access time is about 25 to 50 times faster than main memory
– Off chip cache access time is about 5 to 15 times faster than main memory
access time
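The cost formula above, together with the standard two-level average access time Ts = H*T1 + (1 - H)*(T1 + T2) for hit ratio H, can be exercised numerically. All values below are assumed for illustration:

```python
# Average per-bit cost Cs = (C1*S1 + C2*S2) / (S1 + S2), and access
# efficiency T1/Ts, where Ts = H*T1 + (1 - H)*(T1 + T2).
C1, S1 = 0.01, 64 * 1024           # level 1 (cache): cost/bit, size in bits
C2, S2 = 0.0001, 64 * 1024**2      # level 2 (main memory): cost/bit, size
T1, T2 = 1e-9, 50e-9               # access times: 1 ns cache, 50 ns memory
H = 0.95                           # hit ratio (fraction of accesses found in M1)

Cs = (C1 * S1 + C2 * S2) / (S1 + S2)
Ts = H * T1 + (1 - H) * (T1 + T2)  # average access time
print(Cs)        # close to C2, since S2 >> S1
print(T1 / Ts)   # access efficiency, approaches 1 as H approaches 1
```

With S2 much larger than S1, the blended cost per bit stays near cheap main memory while the average access time stays near fast cache, which is the whole point of the hierarchy.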
Static RAM:
• Memories that consist of circuits capable of retaining their state as long as power is
applied
• Bits stored as on/off switches
• Complex construction so larger per bit and more expensive
Dynamic RAM
• Bits stored as charge in capacitors; charges leak, so cells need refreshing even when
powered
• Simpler construction and smaller per bit so less expensive
• Address line active when bit read or written
– Transistor switch closed (current flows)
• Slower operations, used for main memory
Flash memory
Similar technology to EEPROM, i.e. a single transistor controlled by trapped charge
A single cell can be read, but writing is on a block basis and previous contents are erased
Greater density and a lower cost per bit, and consumes less power for operation
Used in MP3 players, Cell phones, digital cameras
Larger flash memory modules are called Flash Drives (i.e. Solid State Storage Devices)
External memory
• Semiconductor memory cannot be used to store large amounts of information or data
– Due to the high per-bit cost
• Large storage requirements are fulfilled by
– Magnetic disks, optical disks and magnetic tapes
– Called secondary storage
Magnetic Disk Structure
• Disk substrate (non magnetizable material) coated with magnetizable material (e.g.
iron oxide…rust)
• Advantage of glass substrate over aluminium
– Improved surface uniformity
– Reduction in surface defects
• Reduced read/write errors
– Lower fly heights
– Better stiffness
– Better shock/damage resistance
• Head mechanism
– Fixed head
– Movable head
• Tracks per disk = (# tracks/surface) x (# surfaces/platter) x (# platters/disk)
• Example:
– 5 platters/disk
• Rotational delay
• Transfer time
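Multiplying the track count by sectors per track and bytes per sector gives a capacity estimate. The sector count and sector size below are assumed for illustration; only the 5 platters/disk figure comes from the example above:

```python
# Disk capacity = (# tracks/surface) x (# surfaces/platter) x
# (# platters/disk) x (# sectors/track) x (bytes/sector).
tracks_per_surface = 2048          # assumed
surfaces_per_platter = 2           # both sides recordable (assumed)
platters = 5                       # as in the example above
sectors_per_track = 512            # assumed
bytes_per_sector = 512             # assumed

capacity = (tracks_per_surface * surfaces_per_platter *
            platters * sectors_per_track * bytes_per_sector)
print(capacity / 2**30, 'GiB')     # 5.0 GiB
```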
RAID Structure
• RAID (Redundant Array of Independent Disks) – multiple disk drives provide
reliability via redundancy.
• RAID schemes improve performance and improve the reliability of the storage system
by storing redundant data.
• RAID Level 0: RAID level 0 refers to disk arrays with striping at the level of
blocks but without any redundancy (such as mirroring or parity bits), as shown in
the figure.
• RAID Level 2:- RAID level 2 is also known as memory-style code error
correcting code (ECC) organization. Memory systems have long detected certain
errors by using parity bits. Each byte in a memory system may have a parity bit
associated with it that records whether the number of bits in the byte set to 1 is even
(parity = 0) or odd (parity = 1). If one of the bits in the byte is damaged (either a 1
becomes a 0, or a 0 becomes a 1), the parity of the byte changes and thus will not
match the stored parity. Similarly, if the stored parity bit is damaged, it will not match
the computed parity. Thus, all single-bit errors are detected by the memory system.
• RAID Level 4: RAID level 4, or block-interleaved parity organization, uses block-
level striping, as in RAID 0, and in addition keeps a parity block on a separate disk
for corresponding blocks from all other disks. This scheme is diagrammed in Figure (e).
If one of the disks fails, the parity block can be used with the corresponding blocks
from the other disks to restore the blocks of the failed disk.
• RAID Level 6: RAID level 6, also called the P + Q redundancy scheme, is much like
RAID level 5 but stores extra redundant information to guard against multiple disk
failures. Instead of parity, error-correcting codes such as Reed-Solomon codes are
used. In the scheme shown in the figure, 2 bits of redundant data are stored for every
4 bits of data (compared with 1 parity bit in level 5), and the system can tolerate
two disk failures.
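The parity-based reconstruction described for RAID level 4 can be sketched with XOR; the disk contents below are invented for illustration:

```python
# Parity recovery as used in RAID: the parity byte is the XOR of the
# data bytes, so any single lost byte can be reconstructed.
data = [0b1011_0010, 0b0100_1110, 0b1111_0000]   # three data disks

parity = 0
for byte in data:
    parity ^= byte                # parity disk stores XOR of all data bytes

# Suppose disk 1 fails; rebuild its byte from the survivors plus parity.
rebuilt = parity ^ data[0] ^ data[2]
print(rebuilt == data[1])         # True
```

XOR is its own inverse, so XORing the parity with all surviving bytes cancels them out and leaves exactly the missing byte, which is why one parity disk suffices for any single-disk failure.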
Module-6
Input/Output Organization
• Wide variety of peripherals
– Some I/O devices are slower than processor and memory while some are faster
than processor and memory
I/O MODULE
I/O Module Functions
• Error Detection
– Mechanical, electrical malfunctioning and transmission
I/O Methods:
• Programmed
• Interrupt driven
Programmed I/O:
• CPU has direct control over I/O
– Sensing status
– Read/write commands
– Transferring data
• Commands
• Read/Write
• Memory mapped I/O
– Devices and memory share an address space
• Isolated I/O
– Separate address spaces
I/O module gets data from peripheral whilst CPU does other work
If interrupted:-
Process interrupt
• This control circuit can be part of the I/O device interface and is called the DMA
controller
– Read/Write
– Device address
• Not an interrupt
• Slows down CPU but not as much as CPU itself doing transfer
DMA Transfer Method: Burst Mode
• DMA controller is given exclusive access to the main memory to transfer a block of
data without interruption
• It reads a block of data using burst mode from the main memory and stores it into its
input buffer
Module-7
RISC Characteristics
• Relatively few instructions
– 128 or less
• Hardwired control
CISC Characteristics
• A large number of instructions
• However, it soon became apparent that a complex instruction set has a number of
disadvantages
• These include a complex instruction decoding scheme, an increased size of the control
unit, and increased logic delays.
– CISC style (single multiply instruction):
mov ax, 20
mov bx, 5
mul bx        ; 8086 MUL takes one operand: DX:AX = AX * BX
– RISC style (multiplication by repeated addition):
mov ax, 0
mov bx, 20
mov cx, 5
again: add ax, bx
loop again    ; decrement CX; jump to again while CX != 0
Instruction Pipelining
• New inputs are accepted at one end before previously accepted inputs appear as
outputs at the other end. The same concept can be applied to instruction execution
Pipeline Performance
• Total time required to execute n instructions for a pipeline with k stages and cycle
time ‘t’
Tk,n = [k+(n-1)]t
• Speedup factor
– Sk = nk/[k+(n-1)]
• Note:
– Each stage in pipeline is expected to complete its operation in one clock cycle;
hence the clock period should be sufficiently long to complete the task being
performed in any stage.
Stalls/Hazards in Pipeline
• Any condition that causes the pipeline to stall is called a hazard
Data Hazards
Resource Hazards
• One instruction may need to access memory as part of the Write stage while another
instruction is being Fetched
Instruction/Control Hazards
– Assume that jump will not happen. Always fetch next instruction
• Predict by opcode
– Some instructions are more likely to result in a jump than others. Can get up to
75% success
• Machine Parallelism
• Note
– A program may not have enough instruction level parallelism to take full
advantage of machine parallelism
– Occurs when instruction moves from the decode stage to the first execute
stage
– Processor look ahead to locate instructions that can be brought into the
pipeline and executed
• Ordering issues
Issue Policies
• In-order issue with in-order completion
Machine Parallelism
• To enhance the performance of the Super Scalar Processors
– Register Renaming
Module-8
Control Unit Operation
Control Unit:
• Generates the control signals to execute each micro-operation
• The control signals generated by the CU cause the opening and closing of logic
gates, resulting in the execution of the micro-ops
Types of Micro-operations:
• Transfer data between registers
Inputs to the Control Unit:
• Instruction register
• Flags
– State of CPU
• Control bus
– Interrupts, acknowledgements
• Clock
– To keep time
At the end of an instruction cycle, the control unit must feed back to the counter to reinitialize
it at T1
• Output of the CU
• To the memory
It is a logic circuit that does the sequencing through microinstructions and generates
control signals to execute each microinstruction
Microinstruction Format
• Assign one bit position to each control signal in each micro instruction
– Ex. only one ALU operation can be done at a time; PCin and PCout
cannot be active at the same time
Micro-instruction Types
• Vertical micro-programming
• Horizontal micro-programming
– Each micro-instruction specifies many different micro-operations to be
performed in parallel
Microinstruction Sequencing
– For example if most of the instructions involve several addressing modes then
for each combination a separate microroutine needs to be stored, which leads
to lots of duplication
• So organization of memory should be such that common part can be shared by each
variation of a machine instruction
1. Sequencing logic issues a read command to the control memory
2. Word specified in control address register is read into control buffer register
3. Control buffer register contents generates control signals and next address information
4. Sequence logic loads new address into control address register based on next address
information from control buffer register and ALU flags
Module-9
Multiprocessor Organizations
• Simplifies synchronization
SMP Advantages
• Performance
• Availability
– Since all processors can perform the same functions, failure of a single
processor does not halt the system
• Incremental growth
• Scaling
Cache Coherence
– Multiple copies of same data in different caches exist simultaneously, can
result in an inconsistent view of memory
– Write through can also give problems unless caches monitor memory traffic
• Cache coherence is defined as the situation in which all cached copies of shared data
have the same value at all times
Software Solutions
• Compiler and operating system deal with problem
– Attractive because overhead is transferred from run time to compile time, and
design complexity is transferred from hardware to software
– Simple Approach: Prevent any shared data variables from being cached. This
leads to inefficient cache utilization
– Efficient Approach: Analyze code to determine safe periods for caching shared
variables. Compiler needs to insert instructions to enforce cache coherence
during the critical periods
– Directory protocols
– Snoopy protocols
Directory Protocols
• A directory (in main memory) contains the global state information about the contents
of the various local caches
• Individual cache controller requests are checked against directory stored in main
memory and appropriate transfers are performed
Snoopy Protocols
• These protocols distribute cache coherence responsibility among cache controllers
Write Invalidate
• Multiple readers, one writer
• When a write is required, all other caches of the line are invalidated
• Writing processor then has exclusive access until line required by another processor
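A minimal sketch of the write-invalidate policy; the cache structure, processor names, and addresses are invented for illustration:

```python
# Write-invalidate sketch: several caches hold copies of a line; a write
# invalidates every other copy before the writer proceeds.
caches = {              # cache id -> {address: (value, valid?)}
    'P0': {0x40: (7, True)},
    'P1': {0x40: (7, True)},
    'P2': {0x40: (7, True)},
}

def write(writer, addr, value):
    for cid, cache in caches.items():
        if cid != writer and addr in cache:
            v, _ = cache[addr]
            cache[addr] = (v, False)       # invalidate the other copies
    caches[writer][addr] = (value, True)   # writer now holds the only valid copy

write('P0', 0x40, 9)
print(caches['P0'][0x40])   # (9, True)
print(caches['P1'][0x40])   # (7, False): stale copy marked invalid
```

After the write, P1 and P2 must re-fetch the line before reading it, which is exactly the exclusive-access behavior described above.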
Write Update
• Multiple readers and multiple writers
• Neither of these two approaches is superior to the other under all possibilities
– Performance depends upon the number of local caches and the pattern of
memory reads and writes
MESI Protocol
• Modified- The line in the cache has been modified
• Exclusive-The line in the cache is same as in main memory and is not present in any
other cache
• Shared-The line in the cache is same as in the main memory and may be present in
another cache
Clusters
• It is a group of interconnected whole computers working together as a unified
resource, giving the illusion of being one machine
• Mainly used for server applications
• Benefits
– Superior price/performance
– High availability
– Fault tolerant
• Load balancing
– Incremental scalability
• Parallelizing Computation
– Parallelized Application
– Parametric computing
Blade Servers
• Common implementation of cluster approach
– Save space
• SMP
• Clustering
– Superior availability
– All processors have access to all parts of memory using load & store
***********************