HW SW Codesign
0. Organization
0-2
Organization (1)
Lecture: introductory course + consultations
Exercises: delivered during consultations
0-3
Organization (2)
Course materials:
slide copies, exercise sheets, papers
the slides contain material from Marco Platzner, Peter
Marwedel, Lothar Thiele, Frank Vahid, Reinhard Wilhelm
References:
P. Marwedel: Embedded System Design, Springer, 2006.
F. Vahid, T. Givargis: Embedded System Design: A Unified
Hardware/Software Introduction, John Wiley & Sons, 2002.
0-4
Textbook & slides
course based
on the book and the slides
“Embedded System Design” by
Peter Marwedel
0-5
Overview
Administration
Course synopsis
Introduction and motivation
0-6
Course Synopsis
Different Levels of Model Representation
Specifications
Models
Abstraction Levels
Dealing with Contradictory Constraints
Exploration
Simulation
• Worst-Case Execution Time
Optimization
Hardware/Software Mapping
Partitioning
Scheduling
Allocation
Software Code Optimizations
Compilation
Estimation
0-7
Benefits? Learn about …
… challenges and approaches in modern system design
… useful optimization methods
… performance estimation of embedded systems
… a current research area
0-8
Overview
Administration
Course synopsis
Introduction and motivation
0-9
What is HW/SW Codesign?
... integrated design of systems that consist of hardware and software components
0 - 10
Hardware/Software Boundaries
General purpose systems (PC, workstation)
processor design:
processor ↔ compiler, operating system
0 - 11
Target Architectures
0 - 12
Why Codesign? (1)
Modern embedded systems require “design” optimization
many functions, great variability, high flexibility
heterogeneous target systems
• processors, ASICs, FPGAs, systems-on-chip, …
many design goals
• performance, cost, power consumption, reliability, ...
0 - 13
Why Codesign? (2)
Optimization of the “design process”
0 - 14
Codesign methodologies
Different Levels of Model Representation
Dealing with Contradictory Constraints
Hardware/Software Mapping
Software Code Optimizations
Estimation
0 - 15
System Design
0 - 16
System Design
0 - 17
Motivation (1)
According to forecasts, the future of IT is characterized by terms such as
Disappearing computer,
Ubiquitous computing,
Pervasive computing,
Ambient intelligence,
Post-PC era,
Cyber-physical systems.
Basic technologies:
Embedded Systems
Communication technologies
0 - 18
Motivation (2)
0 - 19
Embedded Systems & Cyber-Physical
Systems
“Dortmund“ Definition: [Peter Marwedel]
Information processing systems embedded into a larger
product
Embedded Systems: dependability, robots, real-time, control systems, feature extraction and recognition, sensors/actors, A/D-converters
Communication Technology: optical networking, quality of service, network management, distributed applications, service provision, UMTS, DECT, Hiperlan, ATM
Combined: pervasive/ubiquitous computing, distributed systems, embedded web systems
0 - 21
Growing importance of embedded systems
Spending on GPS units exceeded $100 mln during Thanksgiving week, up 237%
from 2006 … More people bought GPS units than bought PCs, NPD found.
[www.itfacts.biz, Dec. 6th, 2007]
…, the market for remote home health monitoring is expected to generate $225
mln revenue in 2011, up from less than $70 mln in 2006, according to Parks
Associates. [www.itfacts.biz, Sep. 4th, 2007]
According to IDC the identity and access management (IAM) market in Australia
and New Zealand (ANZ) … is expected to increase at a compound annual growth
rate (CAGR) of 13.1% to reach $189.3 mln by 2012 [www.itfacts.biz, July 26th, 2008].
Accessing the Internet via a mobile device up by 82% in the US, by 49% in
Europe, from May 2007 to May 2008 [www.itfacts.biz, July 29th, 2008]
0 - 22
Automotive electronics
Functions by embedded processing:
ABS: Anti-lock braking systems
ESP: Electronic stability control
Airbags
Efficient automatic gearboxes
Theft prevention with smart keys
Blind-angle alert systems
... etc ...
Multiple networks
Body, engine, telematics, media, safety
Multiple processors
Up to 100
• 8-bit – door locks, lights, etc.
• 16-bit – most functions
• 32-bit – engine control, airbags
Processing where the action is
Sensors and actuators distributed all over the vehicle
Networked together
0 - 23
Avionics
0 - 24
Railways
Safety features
contribute significantly
to the total value of
trains, and dependability
is extremely important
0 - 25
Telecommunication
Mobile phones have been one of the fastest growing markets in recent years:
• Multiprocessor
• 8-bit/32-bit for UI
• DSP for signals
• 32-bit in IR port
• 32-bit in Bluetooth
• 8-100 MB of memory
• All custom chips
• Power consumption & battery life depend on software
Base stations
• Massive signal processing
• Several processing tasks per connected
mobile phone
• Based on DSPs
• Standard or custom
• 100s of processors
Geo-positioning systems,
Fast Internet connections,
Closed systems for police, ambulances, rescue staff.
0 - 26
Medical systems
For example:
• Artificial eye: several approaches,
e.g.:
• Camera attached to glasses;
computer worn at belt; output
directly connected to the brain,
“pioneering work by William
Dobelle”. Previously at
[www.dobelle.com]
0 - 27
Extremely Large
Functions requiring computers:
Radar
Weapons
Damage control
Navigation
basically everything
Computers:
Large servers
1000s of processors
0 - 28
Inside your PC
Custom processors
Graphics, sound
32-bit processors
IR, Bluetooth
Network, WLAN
Harddisk
RAID controllers
8-bit processors
USB
Keyboard, mouse
0 - 29
Authentication systems
0 - 30
Consumer electronics
Examples
0 - 31
Industrial automation
Examples
0 - 32
Forestry Machines
Networked computer system
Controlling arms & tools
Navigating the forest
Recording the trees harvested
Crucial to efficient work
Operator panel
Graphical display
Touch panel
Joystick
Buttons
Keyboard
“Tough enough to be out in the woods”
0 - 33
© Jakob Engblom
Smart buildings
Examples
Integrated cooling, lighting, room reservation, emergency handling, communication
Goal: “Zero-energy
building”
0 - 34
Robotics
“Pipe-climber”
Robot “Johnnie“
Lego mindstorms
Standard controller
• 8-bit processor
• 64 kB of memory
Electronics to interface
to motors and sensors
0 - 35
Estimation
Suitability of the hardware, the software and the system as a whole
0 - 36
a a ot a o si n
nt o tion
o o a a
-
Embedded Systems
Embedded systems (ES) = information processing systems embedded into a larger product
Examples
-3
Embedded Systems
external process
human interface
sensors, actuators
embedded system
-
aa an ist i t a t at o ms
ABS
ASR
ACC ESP
engine
control powertrain
control
-
Example: Cell Processor
Cell Processor (IBM) combines a general-purpose architecture core with coprocessing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation
-6
Communicating Embedded Systems
sensor networks (civil engineering, buildings, environmental monitoring, traffic, emergency situations)
smart products, wearable/ubiquitous computing
-
Trends in Information and Communication
Internet
New Applications and System Paradigms
-
Comparison
Embedded Systems | General Purpose Computing
Few applications that are known at design-time | Broad class of applications
Not programmable by end user | Programmable by end user
Fixed run-time requirements (additional computing power not useful) | Faster is better
Criteria: cost, power consumption, predictability, meeting time bounds | Criteria: cost, average speed
-
si n a n s
- 0
Challenges of Embedded Software
Dynamic environments
Capture the required behaviour
Validate specifications
Efficient translation of specifications into implementations
How can we check that we meet real-time constraints?
How do we validate embedded real-time software? (large volumes of data, testing may be safety-critical)
-
Implementation Alternatives
General-purpose processors
Programmable hardware
• Field-programmable gate arrays
-
Dependability
ES must be dependable:
Reliability R(t) = probability of system working correctly provided that it was working at t = 0
Maintainability M(d) = probability of system working correctly d time units after error occurred
Availability A(t) = probability of system working at time t
Safety: no harm to be caused
Security: confidential and authentic communication
Even perfectly designed systems can fail if the assumptions about the workload and possible errors turn out to be wrong.
Making the system dependable must not be an after-thought; it must be considered from the very beginning.
- 3
Efficiency
ES must be efficient
Code-size efficient (especially for systems on a chip)
Run-time efficient
Weight efficient
Cost efficient
Energy efficient
-
Real-time constraints
Many ES must meet real-time constraints
A real-time system must react to stimuli from the controlled object (or the operator) within the time interval dictated by the environment.
For real-time systems, right answers arriving too late are wrong.
-
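A time-constraint is called hard if not meeting that constraint could result in a catastrophe [Kopetz].
All other time-constraints are called soft.
A guaranteed system response has to be explained without statistical arguments.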
-
a im st ms
a tim
Jakob Engblom
- 6
Reactive Systems
Typically, ES are reactive systems
Dedicated Systems
Dedicated towards a certain application
Knowledge about behavior at design time can be used to minimize resources and to maximize robustness
-
Contents
What is an Embedded System?
-
st a tion o s an nt sis
- 0
so st a tions
Beha ior
st m
Process odule rchitecture
unction R L
at mo s Structure
it mo s
t o i it mo s
i mo s
a o t mo s
-
Contents
What is an Embedded System?
-
si n a oa s
Definition: Synthesis is the process of generating the description of a system in terms of related lower-level components from some high-level description of the expected behavior.
- 3
System Design
Specification
[Figure: Intellectual Prop. Code and Intellectual Prop. Block components]
application specification
design space exploration and system optimization
estimation
-
a in o m
-
a in an in
- 30
a in an in
Similarity to allocation or load distribution problem in high-level synthesis or real-time operating systems
[Figure: processes P1 to P4 partitioned between dedicated HW components and SW (processors)]
-3
Estimation
The principle of synthesis based on abstraction only makes sense if estimation methods are available.
Estimate properties of the next layer(s) of abstraction.
Design decisions are based on these estimated properties. If the estimation is not correct (or not accurate enough), the design will be sub-optimal or even not working correctly.
si n a i im in
E o ation a st a tion o
Estimation o
si n
o a
o ti s si n a
E o ation
si n a o
E o ation a st a tion
-3
Hardware/Software Codesign
Specification and Models of Computation
Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides: Design Patterns, Addison-Wesley
2-
Example: Observer pattern in Java
public void addListener(listener)
2-
Observer pattern with mutexes
Mutexes using monitors are minefields
public synchronized void addListener(listener)
for (int i = 0; i < myListeners.length; i++) {
  myListeners[i].valueChanged(newValue);
}
[Annotation partly illegible; it concerns valueChanged() requesting a lock that is held elsewhere]
2-
Simple observer pattern gets complicated
public synchronized void addListener(listener)
Problems with thread-based concurrency
2-
Contents
StateCharts
Data-Flow Models
2-
Requirements for Specification Techniques
Examples: states, processes, procedures.
Examples: processors, racks, printed circuit boards
2-
Requirements for Specification Techniques
-
Required for reactive systems.
-
Components send streams of data to each other.
No obstacles for
2- 2
Models of Computation: Definition
2-
Shared memory
Potential race conditions (inconsistent results possible)
⇒ Critical sections = sections at which exclusive access to resource r (e.g. shared memory) must be guaranteed.
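The critical-section idea above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `SharedCounter` and `run_workers` are made up for this example, and a `threading.Lock` stands in for whatever mutual-exclusion primitive the target platform offers.

```python
import threading


class SharedCounter:
    """Hypothetical shared-memory resource r, protected by a critical section."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Critical section: only one thread at a time may read-modify-write value.
        with self._lock:
            current = self.value
            self.value = current + 1


def run_workers(num_threads=4, increments=1000):
    """Start several threads that all update the same counter."""
    counter = SharedCounter()

    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

Without the lock, the read-modify-write sequence of two threads could interleave and updates could be lost; with it, the result equals the total number of increments.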
[Figure: send and receive]
Blocking/synchronous message passing
[Figure: send and receive]
Synchronous message passing: CSP
[Figure: two processes communicating over channel c; one performs an output on c, the other the matching input]
2-
Components
Von Neumann model
Components
Finite state machines
Differential equations
$\frac{\partial^2 x}{\partial t^2} = b$
2-
Example: Discrete Event (VHDL)
Sensitivity lists in VHDL
Sensitivity lists are a shorthand for a single wait on-statement at the end of the process body:
process (x, y)
begin
  prod <= x and y;
end process;
is equivalent to
process
begin
  prod <= x and y;
  wait on x, y;
end process;
2-2
No language that meets all language requirements
⇒ using compromises
2 - 22
Contents
Models of Computation
Data-Flow Models
2-2
Classical automata
Moore automata: $Y = O(Z)$, $Z^{+} = G(X, Z)$
Mealy automata: $Y = O(X, Z)$, $Z^{+} = G(X, Z)$
(X: input, Y: output, Z: internal state, $Z^{+}$: next state)
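The difference between the two automata types can be made concrete with a small Python sketch. The helper names `run_moore` and `run_mealy` and the parity example are assumptions made for illustration; O is the output function and G the next-state function, as in the formulas above.

```python
def run_moore(G, O, z0, inputs):
    """Moore automaton: the output depends on the state only, Y = O(Z)."""
    z = z0
    outputs = [O(z)]  # the initial state already produces an output
    for x in inputs:
        z = G(x, z)  # next state Z+ = G(X, Z)
        outputs.append(O(z))
    return outputs


def run_mealy(G, O, z0, inputs):
    """Mealy automaton: the output depends on input and state, Y = O(X, Z)."""
    z = z0
    outputs = []
    for x in inputs:
        outputs.append(O(x, z))
        z = G(x, z)  # next state Z+ = G(X, Z)
    return outputs


def parity_next(x, z):
    """Hypothetical example: track the parity of the 1-bits seen so far."""
    return z ^ x


moore_out = run_moore(parity_next, lambda z: z, 0, [1, 0, 1, 1])
mealy_out = run_mealy(parity_next, lambda x, z: z ^ x, 0, [1, 0, 1, 1])
```

The Moore machine emits one output per state (including the initial one), whereas the Mealy machine emits one output per input symbol and can react within the same step.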
2-2
StateCharts
⇒ StateCharts [Harel]
Introducing Hierarchy
2-2
Definitions
Current states of FSMs are also called active states.
States which are not composed of other states are called basic states.
States containing other states are called super-states.
For each basic state s, the super-states containing s are called ancestor states.
Super-states S are called OR-super-states if exactly one of the sub-states of S is active whenever S is active.
[Figure labels: superstate, ancestor state of E, substates]
2-2
Default State Mechanism
Try to hide internal structure from outside world!
⇒ Default state
Filled circle indicates sub-state entered whenever super-state is entered.
Not a state by itself!
2-2
History Mechanism
m
k
2-2
Combining History and Default State
same meaning
2-
Concurrency
Convenient ways of describing concurrency are required.
AND-super-states: FSM is in all (immediate) sub-states of a super-state.
2-
Entering and Leaving AND-Super-States
incl.
F
F M
L
M
L
2 - 33
Computation of state sets
Computation of state sets by traversing the hierarchy tree from leaves to root:
basic states: state set = state
OR-super-states: state set = union of children
AND-super-states: state set = Cartesian product of children
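The bottom-up computation of state sets can be sketched as a short recursive function. The tuple encoding of the hierarchy (`("basic", name)`, `("or", name, children)`, `("and", name, children)`) and all state names are assumptions chosen for this illustration; each element of the result is one possible configuration of active basic states.

```python
from itertools import product


def state_set(node):
    """Return the set of configurations (frozensets of basic states) of a node."""
    kind = node[0]
    if kind == "basic":
        return {frozenset([node[1]])}
    child_sets = [state_set(child) for child in node[2]]
    if kind == "or":
        # Exactly one child is active: union of the children's state sets.
        return set().union(*child_sets)
    if kind == "and":
        # All children are active: Cartesian product of the children's state sets.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown node kind: {kind}")


# Hypothetical AND-super-state with two OR-super-states as children.
example = ("and", "top", [
    ("or", "left", [("basic", "a"), ("basic", "b")]),
    ("or", "right", [("basic", "c"), ("basic", "d"), ("basic", "e")]),
])
configurations = state_set(example)
```

For this example the AND-super-state has 2 × 3 = 6 configurations, which illustrates how concurrency keeps the diagram small while the underlying state space multiplies.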
2-3
Types of States
Timers
Since time needs to be modeled in embedded systems, timers need to be modeled.
In StateCharts, special edges can be used for timeouts.
Using Timers in an Answering Machine
2-3
Representation of Computations
Besides states, arbitrarily many other variables can be defined. This way, not all states of the system are modeled explicitly.
These variables can be changed as a result of a state transition (“action”). State transitions can be dependent on these variables (“condition”).
[Figure labels: action, unstructured state space, variables, condition]
2-3
General Form of Edge Labels
event [condition] / action
Events and actions
Events can be composed of several events:
(e1 and e2): event that corresponds to the simultaneous occurrence of e1 and e2.
(e1 or e2): event that corresponds to the occurrence of either e1 or e2 or both.
(not e): event that corresponds to the absence of event e.
2-
E a ple
e a1 c a2
x y z
e:
a1:
a2:
true
c: false
e:
a1:
a2:
true
c: false
2-
The StateCharts Simulation Phases
Three phases:
1. Effect of external changes on events and conditions is evaluated,
2. The set of transitions to be made in the current step and right hand sides of assignments are computed,
3. Transitions become effective, variables obtain new values.
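A minimal sketch of these three phases, assuming a very reduced model: transitions are plain dictionaries with hypothetical keys `event`, `condition` and `assignments`, and states are left out so that only the treatment of variables is visible.

```python
def step(variables, active_events, transitions):
    """Execute one step in three phases and return the new variable values."""
    # Phase 1: evaluate events and conditions against the current values.
    enabled = [t for t in transitions
               if t["event"] in active_events and t["condition"](variables)]
    # Phase 2: compute all right hand sides using the old values only.
    pending = {}
    for transition in enabled:
        for target, rhs in transition["assignments"].items():
            pending[target] = rhs(variables)
    # Phase 3: transitions become effective, variables obtain new values.
    new_variables = dict(variables)
    new_variables.update(pending)
    return new_variables


# Two concurrent transitions triggered by the same event e: a := b and b := a.
swap = [
    {"event": "e", "condition": lambda v: True,
     "assignments": {"a": lambda v: v["b"]}},
    {"event": "e", "condition": lambda v: True,
     "assignments": {"b": lambda v: v["a"]}},
]
result = step({"a": 1, "b": 2}, {"e"}, swap)
```

Because all right hand sides are computed before any variable changes, the two assignments swap the values instead of copying one value into both variables.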
2- 2
E a ple
e
phas
phase
Status
phase
2-
Reflects Model of Clocked Hardware
Same separation into phases found in other languages as well, especially those that are intended to model hardware.
2-
re n se anti s State arts
Unfortunately, there are several time-semantics of StateCharts in use. This is another possibility:
A step is executed in arbitrarily small time.
Internal (generated) events exist only within the next step.
External events can only be detected after a stable state has been reached.
external events
stable stable
state state state
transitions t
transport of internal events step
2-
E a ples
state diagram:
stable state
2-
E a ple
Non-determinism
a a
A C E G
a a
B D F H
state diagram:
a E,H
a
A,B C,D
a F,G
2-
E a ple
state diagram (only
stable states are
represented, only a
a c and b are e ternal):
a
a a a
a
ac
a a
2-
Evaluation of StateCharts
Evaluation of StateCharts
Generated ,
Not useful for applications,
No description of - ,
No - ,
No description of .
2-
SDL
The Specification and Description Language (SDL) is a specification language targeted at the unambiguous specification and description of the behaviour of reactive and distributed systems.
2- 2
Communication among SDL-FSMs
Communication between FSMs (or “processes”) is based on message-passing, assuming a potentially indefinitely large FIFO-queue.
Each process fetches next entry from FIFO,
checks if input enables transition,
if yes: transition takes place,
if no: input is discarded (exception: SAVE-mechanism).
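The fetch/check/discard loop can be illustrated with a small Python sketch. The class name `SdlProcess`, the transition table format and the signal names are hypothetical, and the SAVE mechanism is deliberately not modeled here.

```python
from collections import deque


class SdlProcess:
    """Sketch of one FSM with an unbounded input FIFO (SAVE is not modeled)."""

    def __init__(self, initial_state, transitions):
        self.state = initial_state
        self.transitions = transitions  # {(state, signal): next_state}
        self.queue = deque()
        self.discarded = []

    def send(self, signal):
        """Message passing: append the signal to the process's FIFO."""
        self.queue.append(signal)

    def step(self):
        """Fetch the next FIFO entry; take the transition or discard the input."""
        if not self.queue:
            return False
        signal = self.queue.popleft()
        key = (self.state, signal)
        if key in self.transitions:
            self.state = self.transitions[key]
        else:
            self.discarded.append(signal)
        return True


process = SdlProcess("idle", {("idle", "start"): "running",
                              ("running", "stop"): "idle"})
for signal in ("stop", "start", "stop"):
    process.send(signal)
while process.step():
    pass
```

The first `stop` arrives while the process is idle, enables no transition and is therefore dropped; the remaining two signals drive the process to `running` and back to `idle`.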
2- 3
Deterministic?
Let tokens be arriving at FIFO at the same time:
Order in which they are stored is unknown
StateCharts
2-
Data Flow Language Model
Processes communicating through FIFO buffers
[Figure: three processes connected by FIFO buffers]
2-
Philosophy of Data Flow Languages
:
Imperative language style: program counter is king
Dataflow language: movement of data is the priority
Scheduling responsibility of the system, not the programmer
:
All processes run “simultaneously”
Processes can be described with imperative code
Processes can only communicate through buffers
Sequence of read tokens is identical to the sequence of written tokens
2-
Data Flow Languages
Appropriate for applications that deal with streams of data:
Fundamentally concurrent: maps easily to parallel hardware
Perfect fit for block-diagram specifications (control systems, signal processing)
Matches well current and future trend towards multimedia applications
:
Host Language (process description), e.g. C, C++, Java, ....
Coordination Language (network description), usually “home made”, e.g. XML.
2-
E a ple E - vide de der
2-
Kahn Process Networks
A Kahn Process
From Kahn's original paper
A Kahn Process
From Kahn's original paper:
int i = init;
send(i, v);
for (;;) {
  i = wait(u);
  send(i, v);
}
What does this do?
Process sends initial value, then passes through values.
[Figure: process h with input channel u and output channel v]
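The process above can be mimicked in Python with blocking queues standing in for Kahn channels. This is a sketch under assumptions: `queue.Queue` replaces the unbounded FIFO channels, `u.get()` plays the role of the blocking `wait(u)`, and the extra `count` parameter is added only so that the demonstration terminates (the original loops forever).

```python
import queue
import threading


def h(u, v, init, count):
    """Send the initial value, then copy `count` tokens from channel u to channel v."""
    i = init
    v.put(i)                 # send(i, v)
    for _ in range(count):   # the original process uses an endless loop
        i = u.get()          # wait(u): blocking read on the input channel
        v.put(i)             # send(i, v)


def demo():
    """Run h in its own thread and feed it three tokens."""
    u, v = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=h, args=(u, v, 0, 3))
    worker.start()
    for token in (7, 8, 9):
        u.put(token)
    worker.join()
    return [v.get() for _ in range(4)]
```

Whatever the thread scheduling, the output sequence is the initial value followed by the input tokens in order, which is the timing independence that Kahn networks are designed to provide.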
2- 3
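A Kahn Process Network
What does this do?
Prints an alternating sequence of 0s and 1s.
[Figure: network of processes g, f and two instances of h, each with its own init value]
Emits its initial value once and then copies input to output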
2-
Determinacy
Random:
A system is random if the information about the system and its inputs is not sufficient to determine its outputs.
Determinate:
Define the history of a channel to be the sequence of tokens that have been both written and read. A process network is said to be determinate if the histories of all channels depend only on the histories of the input channels.
Importance:
Functional behavior is independent of timing (scheduling, communication time, execution time of processes).
Separation of functional properties and timing.
2-
Determinacy
[x1,x2,x3,…] [y1,y2,y3,…]
F
monotonic mapping
x x
y
,
2 - 66
Determinacy
[x1,x2,x3,…] [y1,y2,y3,…]
F
orma de inition
[x1, x2, x3, ]
x [x1] [x1, x2] [x1, x2, x3, ]
, 1, ,
F o
F F
2-6
r Determini m
determinate
y
Rea oning
, y y
y
y y, ,
y
, y
y
2-6
in n eterminacy
y
amp e
2-6
in n eterminacy
1 [ ]
F [ , ]
F 1, 2
2 [ ]
1 [ , ]
F [ , , ]
F 1 , 2
2 [ ]
[ ], [ ] [ , ], [ ]
F F [ , ][ , , ]
2-
c e in a n et r
2-
Deman ri en c e in
y y
2- 2
m ar rit m
o nded memor
itho t dead o
y ontin e
y dead o , in rease si e
2-
Fr m n inite t Finite er i e
y
n,
n
y
y
2-
Dea c am e
x y
2
, ,
1,
1,
1,
2- 5
am e Finite i e er in
2
1
1, 1, 1,
1,
1, 1,
1, 1,
1,
2- 6
ar rit m in cti n
1
, , ,
1 1 1
1 2 1 1 1
3 1 1
2 y
2-
ar rit m in cti n
y y
1 1 1 1
2 1 1 1 1 1
1
3 1 1
2 y
2-
a ati n a n r ce et r
ro
y
x
on
y y
x y
2-
ync r n Data DF
, y, 1
estri tion
i ed n m er o token
amp e 1
1 1 2 3 2 1
2-
DF c e in
c ed e y at compi e time
y
Re t x y
2-
a ancin ati n
3a 2
3d
1
3
2 2 a
3 3
d 2a
2 1
3
1 2
2- 2
in t e a ancin ati n
ain D c ed ing t eorem
n
y x n1
n1 x
y y
y
amp e
2-
Determine eri ic c e e
1
o i e c ed e
3 2 3
…
2 1
3
1 2
y y
e i i it
, y
2-
ar are t are e i n
De i n ace rati n
c r re r a a
-
y tem De i n
y y
-2
De i n ace rati n
icati n rc itect re
a in
timati n
-
Detai e ie De i n ace rati n
e a ati n
c n tr ct ma e timate
arc itect re a icati n er rmance
m ti ecti e
timi ati n
-
am e im e e
, 2,
1
1 3
, y,
-5
Example 1: Evolutionary Algorithms for DSE
individual
n selection binding
o recombination decode binding
p mutation
scheduling
design point
“chromosome” = encoded
allocation + binding (implementation)
fitness evaluation
fitness
1
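Definition: A specification graph is a graph GS = (VS, ES) consisting of a data flow graph GP of a problem, an architecture graph GA and mapping edges EM, with ES = EP ∪ EA ∪ EM
[Figure: specification graph with GP, EM and GA; resources include RISC, SB and HWM2]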
3-
Example 1: Mapping
1
0 1
0
1 5 RISC
8
21 3 SB
1
29 7 HWM1
20
α
1 2
1
21 6 β RISC HWM1
2 PTP bus
30 4 shared
bus
HWM2
τ
3-
Example 1: Challenges
Encoding of (allocation + binding)
simple encoding
e.g. one bit per resource, one variable per binding
easy to implement
many infeasible partitioning solutions
encoding + repair
e.g. simple encoding, modified such that for each vp ∈ VP there exists at least one va ∈ VA with β(vp) = va
reduces number of infeasible partitioning solutions
Generation of the initial population, mutation
Recombination
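The "encoding + repair" idea can be sketched as follows. Everything here is an illustrative assumption rather than the method used in the case study: the allocation is a dictionary with one bit per resource, the binding maps each task to one resource, `mapping_options` lists the resources a task may be bound to, and the repair rule simply rebinds or allocates as needed.

```python
import random


def repair(allocation, binding, mapping_options, rng):
    """Return a feasible (allocation, binding) pair derived from the given one."""
    allocation = dict(allocation)
    binding = dict(binding)
    for task, options in mapping_options.items():
        chosen = binding.get(task)
        if chosen in options and allocation.get(chosen, 0) == 1:
            continue  # binding is already feasible
        allocated = [r for r in options if allocation.get(r, 0) == 1]
        if allocated:
            binding[task] = rng.choice(allocated)  # rebind to an allocated resource
        else:
            resource = rng.choice(options)         # allocate a suitable resource
            allocation[resource] = 1
            binding[task] = resource
    return allocation, binding


def is_feasible(allocation, binding, mapping_options):
    """Every task must be bound to an allocated resource from its options."""
    return all(binding.get(task) in options
               and allocation.get(binding[task], 0) == 1
               for task, options in mapping_options.items())


options = {"t1": ["RISC", "HWM1"], "t2": ["RISC"], "t3": ["HWM2"]}
raw_allocation = {"RISC": 0, "HWM1": 1, "HWM2": 0}
raw_binding = {"t1": "RISC", "t2": "RISC", "t3": "HWM1"}
fixed_allocation, fixed_binding = repair(raw_allocation, raw_binding,
                                         options, random.Random(0))
```

A repair step like this lets mutation and recombination operate on the simple encoding while the fitness evaluation only ever sees feasible implementations.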
3-
Example 1: Case Study
Example 1: Case Study
Example 1: Case Study
frame memory, dual ported frame memory, block matching module
input module
output module
subtract/add module
DCT/IDCT module, Huffman coder
3 - 12
am le olut o
INM
INM OUTM
OUTM FM
FM RISC2
RISC2
SBS
3 - 13
am le olut o
INM
INM OUTM
OUTM DPFM
DPFM HC
HC DCTM
DCTM BMM
BMM SAM
SAM
SBF
3-1
am le o t are t es s
2 2
A B C D F
CD DAT
ec s o s
nS oC
I
ABABABCCABABA CODE(A) CALL(A) FOR 1 TO 2
CODE(B) CALL(B) CODE(A)
CODE(A) CALL(A) CALL(B)
CODE(B) CALL(B) CODE(C)
CODE(C) CALL(C) CODE(A)
3-1
am le t m at o r ter a
P
PROCEDURE A
FOR 1 TO 3
CALL(A)
CODE(B)
CODE(B)
2
A B
3-1
am le rade o s
3-1
am le rade o ur aces
3-1
am le lorat o trate
3-1
am le a rocess et or
3-2
am le ard are rc tecture
3 - 21
am le esult o u ct o al mulat o
n(p)
b(s)
3 - 22
am le esult o lat orm e c mar s
P
( )
p
(p )
3 - 23
am le ac o t e e elo e al s s
3-2
am le am le lorat o esult
2 2
3-2
Hardware/Software Codesign
System Simulation
doc dr re or a a
S I P S
-1
stem es
S
S S
S C I S H S
I I
P C P B
M C N
-2
utl e
D S
S C
S H A
-3
stem a d odel
A
I S D
T
A
-
tate
T
t
t
T
-
tate
I
A
p s
-
me
I
p s
-
e ts a d screte e t stems
A
T
I
I
-
screte e t stems
AD S
A D S
T D S
I D S
-
me dr e s e t dr e
()
-1
me dr e s e t dr e
- -
T
T
A
- 11
me dr e s e t dr e
-
S
A
- 12
utl e
S C
S C
S H A
- 13
screte e t odel a d mulat o
T Æ -
-1
om o e ts o a screte e t mulat o
I
T
C
C
A
P
-1
screte e t mulat o e
t rout e
I
le
set to
D
rocess b
call subs stem
module s remo e
e e t rom
-1
screte e t mulat o
A
P
s n Æ may “produce” new events.
Problem: Within the same simulation cycle, “cause” and “effect” events share
the same time of occurrence
Solution: The simulator uses a zero duration virtual time interval, called delta-cycle (δ)
The role of a delta-cycle is to order “simultaneous” events within a simulation cycle, i.e. identifying which event caused another; “causes” and “effects” are separated by delta-cycles.
Simulation cycles may be composed of several delta-cycles (δ)
[Figure: events A, C, D, B, C, E on a time axis, separated by δ and 2δ]
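One way to picture delta-cycles is an event queue ordered by the pair (time, delta). The sketch below is an assumption-laden illustration, not the kernel of any particular simulator: `reactions` is a hypothetical table that maps an event name to the events it causes, together with their delays.

```python
import heapq


def simulate(initial_events, reactions):
    """Process events in (time, delta) order and return the processing log.

    initial_events: list of (time, name)
    reactions: {name: [(delay, effect_name), ...]}
    """
    pending = [(time, 0, name) for time, name in initial_events]
    heapq.heapify(pending)
    log = []
    while pending:
        time, delta, name = heapq.heappop(pending)
        log.append((time, delta, name))
        for delay, effect in reactions.get(name, []):
            if delay == 0:
                # Same simulated time, but one delta-cycle after the cause.
                heapq.heappush(pending, (time, delta + 1, effect))
            else:
                # Later simulated time: the delta counter starts again at 0.
                heapq.heappush(pending, (time + delay, 0, effect))
    return log


trace = simulate([(10, "A")], {"A": [(0, "B")], "B": [(0, "C")], "C": [(5, "D")]})
```

Events A, B and C all occur at time 10, yet the delta counter records that A caused B and B caused C; D is scheduled at a genuinely later time.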
-1
Outline
System Classification
Example SystemC
4 - 18
SystemC Overview
4-1
le
4-
le O
4- 1
o ule
processes
4-
o e e
4-
o ule
4- 4
Process Communication
Processes can directly communicate through signals.
[Figure: module with input ports, output ports, two processes with sensitivity lists and an internal signal]
4-
n e o uni tion
SystemC introduces general purpose primitives
Channel
A container for communication and synchronization, e.g. can have state and private data, transport data, transport events.
They implement one or more interfaces
Interface
Specify a set of access methods to the channel
But it does not implement those methods
Event
Flexible, low level synchronization primitive, used to construct other forms of synchronization
Have no type and no value
Other comm. & sync. models can be built based on the above primitives
4-
nnel n ot
4-
Wait and Notify
Wait: halt process execution until an event is raised
wait() with arguments => dynamic sensitivity
• wait(sc_event)
• wait(time)
• wait(time_out, sc_event)
4 - 28
Simulation Elements: Main Program
Initialization phase
Update signals
Number of ready processes
Execute all the processes
Advance simulation time
until a blocking point
Update signals
4-2
Example: Simple Channel
Example: Simple Channel Interface
Example: Simple Channel
Example: Simple Producer/Consumer
Example: Kahn Process Network
Example: Kahn Process Network
Example: Kahn Process Network
output1.write(0.0);
Example: Kahn Process Network
Example: Kahn Process Network
SystemC and Models of Computation
4-
Outline
System Classification
Example SystemC
i atio at i t a tio
4-4
Multiple Levels of Abstraction
Functional (Untimed Functional Level)
Use: model (un-)timed functionality
Communication: shared variables, messages
Typical languages: Matlab
System Modeling Graph (2003 Dan Gajski and Lukai Cai)
Communication \ Computation | Un-timed | Approximate-timed | Cycle-timed
Cycle-timed | | D | F
Approximate-timed | | C | E
Un-timed | A | B |
A. "Un-timed functional model"
B. "Timed functional model"
C. "Transaction model"
D. "Cycle-accurate communication model"
E. "Cycle-accurate computation model"
F. "Register transfer model"
A: "Un-Timed Functional Model"
Computation: sequential and parallel execution, un-timed behavior
Communication: un-timed transfer through variables
[Figure: behavior B1 (v1 = a*a), then B2 (v2 = v1 + b*b) and B3 (v3 = v1 - b*b) in parallel, then B4 (v4 = v2 + v3; c = sequ(v4))]
B: “Timed Functional Model”
Computation (on processing elements, PEs): time annotation (estimate)
Communication: message-passing, no protocol implementation, un-timed transfer
Mapping: PE (architecture) allocation and process-to-PE mapping
[Figure: behaviors B1 to B4 mapped onto PEs that exchange messages; code is annotated with time estimates via DELAY() or wait()]
Annotate C code to model C code execution delay; compile the generated C and run it natively for performance estimation:
v__st_tmp = v__st;
__DELAY(LI+LI+LI+LI+LI+LI+OPc);
startup(proc);
if(events[proc][0] & 1) {
  __DELAY(OPi+LD+LI+OPc+LD+OPi+OPi+IF);
  execute(proc);
}
C: “Transaction Model”
Computation: approximate-timed (estimate)
Communication: approximate-timed (estimate), using simplified (abstract) bus protocols
Mapping: mapping of computation and communication
[Figure: PE1 to PE3 executing B1 (v1 = a*a), B2 (v2 = v1 + b*b), B3 (v3 = v1 - b*b) and B4 (v4 = v2 + v3; c = sequ(v4)), PE4 as arbiter; interfaces: 1 master, 2 slave, 3 arbiter]
4 - 46
D: “Cycle Accurate Communication Model”
Computation: approximate-timed (estimate)
Communication: protocol bus channels (time/cycle-accurate)
Mapping: mapping of computation and communication
[Figure: PE1 to PE3 executing B1 to B4, PE4 as arbiter, connected by a bus with ready/ack signals; interfaces: 1 master, 2 slave, 3 arbiter]
4-4
E: “Cycle Accurate Computation Model”
Computation: cycle-accurate
Communication: approximate-timed channels; cycle-accurate and pin-accurate interfaces
[Figure: PEs modeled as state sequences S1 to S4, attached to abstract channels through wrappers; interfaces: 1 master, 2 slave, 3 arbiter, 4 wrapper]
4-4
Example E: What is an ISS?
An Instruction Set Simulator (ISS) is a simulation model, coded in a high-level language, which mimics the behavior of a processor by “reading” instructions and maintaining internal variables which represent the processor's registers
Instruction-accurate
Cycle-accurate
Example E: Types of ISS
original C code:
…
a = b+c;
…
compilation yields the original assembly code:
…
add r1, r2, r3
…
Interpretive ISS (ISS code):
int Reg[32];
…
while(1) {
  Fetch();
  Decode();
  Execute();
  InterruptHandler();
}
switch INSN {
  case ADD: r3=r1+r2;
  case SUB: ...
}
Compiled ISS (intermediary C code generation and recompilation):
…
add(r1, r2, r3);
…
#define add(r1, r2, r3) r3=r1+r2
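The fetch/decode/execute loop of an interpretive ISS can be sketched in a few lines. The instruction format `(opcode, dst, src1, src2)` and the two opcodes are hypothetical choices for this illustration; interrupts, memory and timing are left out, so the sketch is instruction-accurate at best.

```python
def run_iss(program, registers=None):
    """Interpret a list of (opcode, dst, src1, src2) tuples; return the registers."""
    reg = list(registers) if registers is not None else [0] * 32
    pc = 0
    while pc < len(program):
        insn = program[pc]                     # Fetch
        opcode, dst, src1, src2 = insn         # Decode
        if opcode == "ADD":                    # Execute
            reg[dst] = reg[src1] + reg[src2]
        elif opcode == "SUB":
            reg[dst] = reg[src1] - reg[src2]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        pc += 1
    return reg


initial = [0] * 32
initial[1], initial[2] = 5, 7
final = run_iss([("ADD", 3, 1, 2), ("SUB", 4, 3, 1)], initial)
```

A compiled ISS would avoid the per-instruction decode by translating the whole program once into host code; the interpretive variant shown here pays that cost on every executed instruction.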
4 - 50
F: “Register Transfer Model”
Computation and Communication: cycle-timed
modeled on the level of registers (Reg.), combinatorial (stateless) functions, memory and digital signals
PE1, PE2: microprocessors
PE3, PE4: custom hardware
[Figure: PEs as state sequences S1 to S3 connected by wires, with interrupt lines]
4-5
Different Abstraction Models
Models | Communication time | Computation time | Communication scheme | PE interface
A. Un-Timed Functional Model | no | no | variables | no PE
B. Timed Functional Model | no | approximate | abstract channel | abstract
C. Transaction Model | approximate | approximate | abstract bus channel | abstract
D. Cycle Accurate Communication Model | cycle accurate | approximate | protocol bus channel | abstract
E. Cycle Accurate Computation Model | approximate | cycle accurate | abstract bus channel | pin accurate
F. Register Transfer Model | cycle accurate | cycle accurate | bus (wires) | pin accurate
4-5
Trace-Based Simulation
A (Un-Timed Functional Model) and C (Transaction Model)
Higher simulation speed (for large hardware/software systems, multiprocessors)
Uses estimates of non-functional behavior
4-5
Trace-Based Simulation: 2 Phases
Trace generation
Input: application specification
Output: execution traces = sequence of events ∈ {read; write; compute}
Method: un-timed functional simulation
Trace-based simulation
Input:
execution traces
architecture specification
mapping specification
Output: performance estimation results, e.g. execution time, processor load and bus load
Method: map abstract read, write and compute primitives onto virtual machines that reflect binding and resource sharing (mapping)
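The second phase can be illustrated with a deliberately coarse Python sketch. All names and the cost model are assumptions: traces are lists of `(kind, amount)` events, `compute_speed` gives operations per cycle for each processor, every token costs a fixed number of bus cycles, and contention and event ordering are ignored.

```python
def simulate_traces(traces, mapping, compute_speed, bus_cycles_per_token):
    """Accumulate processor and bus busy times from abstract execution traces.

    traces: {process: [(kind, amount), ...]} with kind in read/write/compute
    mapping: {process: processor}
    """
    processor_busy = {}
    bus_busy = 0
    for process, trace in traces.items():
        processor = mapping[process]
        for kind, amount in trace:
            if kind == "compute":
                cycles = amount / compute_speed[processor]
                processor_busy[processor] = processor_busy.get(processor, 0) + cycles
            elif kind in ("read", "write"):
                bus_busy += amount * bus_cycles_per_token
            else:
                raise ValueError(f"unknown trace event: {kind}")
    return processor_busy, bus_busy


example_traces = {
    "p1": [("read", 2), ("compute", 100), ("write", 2)],
    "p2": [("read", 2), ("compute", 50)],
}
load, bus = simulate_traces(example_traces, {"p1": "cpu0", "p2": "cpu0"},
                            {"cpu0": 2}, 4)
```

Because the traces are generated once by the un-timed functional simulation, different mappings and architectures can be evaluated by re-running only this inexpensive second phase.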
4 - 54
Cosimulation Motivation: Mixed Models
… and the simulation is very much dependent on the system description model
How to … several abstraction levels or several models of computation?
Motivating
1. Different abstraction levels
2. Different description languages
3. Different models of computation
C, C++
4 - 55
Cosimulation Example
Environments for multiprocessor system cosimulation:
Several ISSs coupled with HW RTL simulation: accurate, but slow (especially for multiple ISSs running in parallel)
ISSs are replaced with higher level simulation models (native execution): speed up simulation time
[Figure: left, two ISSs attached to a hardware simulator (SystemC) over an interconnect; right, tasks T1 to T3 executing natively on SW models attached to the simulator]
4-5
Cosimulation: Single vs. Multiple Engines
2
1 n 1 2 n
Simulator
Cosimulation Bus
4-5
Hardware/Software Codesign
doc dr re or apa
Intellectual Intellectual
rop. Code rop. Bloc
5-
Industrial Needs
Hard real-time systems, often in safety-critical applications, abound
Aeronautics, automotive, train industries, manufacturing control
Side airbag in car,
Reaction in 1 mSec
5-4
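Hard Real-Time Systems
Embedded controllers are expected to finish their tasks reliably within time bounds.
Essential: upper bound on the execution times of all tasks statically known.
Commonly called the Worst-Case Execution Time (WCET)
Analogously, Best-Case Execution Time (BCET)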
Measurement: Industry's “best practice”
[Figure: distribution of execution times between best-case execution time and worst-case execution time, with an upper bound beyond the worst case; execution time measurement is unsafe]
Works if either worst-case input can be determined, or exhaustive measurement is performed.
Otherwise, determine upper bound from execution times of instructions.
5-
Most of Industry's Best Practice
Measurements: determine execution times directly by observing the execution or a simulation on a set of inputs.
Does not guarantee an upper bound to all executions.
Exhaustive execution in general not possible!
The space (input domain × set of initial execution states) is too large.
5-
Sequence of Statements
A ≡ A1; A2;
Constituents of A: A1 and A2
Upper bound for A is the sum of the upper bounds for A1 and A2:
ub(A) = ub(A1) + ub(A2)
5-8
Conditional Statement
A ≡ if B then A1 else A2
Constituents of A: condition B, statements A1 and A2
ub(A) = ub(B) + max(ub(A1), ub(A2))
5-
Loops
A ≡ for i ← 1 to 100 do A1
ub(A) = ub(i ← 1) + 100 × (ub(i ≤ 100) + ub(A1)) + ub(i ≤ 100)
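The three rules (sequence, conditional, loop) compose recursively over the program structure. The sketch below assumes a hypothetical tuple encoding of the syntax tree and made-up cycle counts; it follows the loop rule as written above, so the cost of incrementing the loop variable is not counted separately.

```python
def ub(node):
    """Upper bound on execution time, computed bottom-up over the syntax tree."""
    kind = node[0]
    if kind == "basic":              # ("basic", cycles)
        return node[1]
    if kind == "seq":                # ("seq", [A1, A2, ...])
        return sum(ub(child) for child in node[1])
    if kind == "if":                 # ("if", B, A1, A2)
        _, condition, then_branch, else_branch = node
        return ub(condition) + max(ub(then_branch), ub(else_branch))
    if kind == "for":                # ("for", init, test, body, iterations)
        _, init, test, body, iterations = node
        return ub(init) + iterations * (ub(test) + ub(body)) + ub(test)
    raise ValueError(f"unknown node kind: {kind}")


# for i <- 1 to 100 do (if B then A1 else A2), with invented cycle counts
loop = ("for", ("basic", 1), ("basic", 2),
        ("if", ("basic", 1), ("basic", 5), ("basic", 3)), 100)
bound = ub(loop)
```

For this example the body costs at most 1 + max(5, 3) = 6 cycles, so the bound is 1 + 100 × (2 + 6) + 2 = 803 cycles; the scheme is only as good as the assumption of constant per-statement costs.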
5-
How to start?
Assignment x ← a + b
load a
load b
add
store x
ub(x ← a + b) = cycles(load a) + cycles(load b) + cycles(add) + cycles(store x)
Assumes constant execution times for instructions
Not applicable to modern processors!
[Table: cycles per instruction for add, load, store, move]
Modern Hardware Features
Modern processors increase performance by using:
Caches, Pipelines, Branch Prediction, Speculation
5-
ccess mes
LOAD r2, _a
x = a + b; LOAD r1, _b
ADD r3,r2,r1
2
loc ycles
1
5-
Timing Accidents and Penalties
Timing Accident: cause for an increase of the execution time of an instruction
Timing Penalty: the associated increase
Types of timing accidents
Cache misses
Pipeline stalls
Branch mispredictions
Bus collisions
Memory refresh of DRAM
TLB miss
5-
Overall Approach: Modularization
Micro-architecture Analysis:
Uses Abstract Interpretation
Excludes as many Timing Accidents as possible
Determines WCET for basic blocks (in contexts)
5- 5
Contents
Introduction
problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
must, may analysis
Pipelines
Abstract pipeline models
Integrated analyses
5-
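Control Flow Graph
what_is_this {
1 read (a,b);
2 done = FALSE;
3 repeat {
4 if (a>b)
5 a = a-b;
6 elseif (b>a)
7 b = b-a;
8 else done = TRUE;
9 } until done;
10 write (a);
}
[Figure: control flow graph of what_is_this; nodes carry the statement numbers, edges are labeled a>b, a<=b, a<b, a=b, !done, done]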
5-
Program Path Analysis
Program Path Analysis
Which sequence of instructions is executed in the worst case
(longest runtime)?
problem: the number of possible program paths grows
exponentially with the program length
Model
fixed number of cycles for each basic block (from static
analysis)
loops must be bounded
Concept
Transform structure of CFG into a set of (integer) linear
equations.
Solution of the Integer Linear Program (ILP) yields bound on
the WCET.
5- 8
Basic Block
Definition: A basic block is a sequence of instructions
where the control flow enters at the beginning and exits at
the end, without stopping in between or branching (except at
the end).
t1 := c - d
t2 := e * t1
t3 := b * t1
t4 := t2 + t3
if t4 < 10 goto L
5-
Basic Blocks
Determine basic blocks of a program
1. determine the block beginnings:
the first instruction
targets of un/conditional jumps
instructions that follow un/conditional jumps
2. determine the basic blocks:
there is a basic block for each block beginning
the basic block consists of the block beginning and runs
until the next block beginning (exclusive) or until the
program ends
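The block-beginning rules can be sketched in C as follows. The `Insn` record is a hypothetical, minimal instruction representation introduced only for this illustration; a real front end would carry far more information.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction record: only what the rules above need. */
typedef struct {
    bool is_jump;   /* conditional or unconditional jump */
    int  target;    /* index of the jump target, ignored if !is_jump */
} Insn;

/* Marks block beginnings in leader[0..n-1]; returns the number of basic blocks. */
size_t find_block_beginnings(const Insn *code, size_t n, bool *leader) {
    assert(code != NULL && leader != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) leader[i] = false;
    if (n > 0) leader[0] = true;                    /* the first instruction */
    for (size_t i = 0; i < n; i++) {
        if (!code[i].is_jump) continue;
        if (code[i].target >= 0 && (size_t)code[i].target < n)
            leader[code[i].target] = true;          /* target of a jump */
        if (i + 1 < n)
            leader[i + 1] = true;                   /* instruction following a jump */
    }
    for (size_t i = 0; i < n; i++)
        if (leader[i]) count++;
    return count;
}
```

Each basic block then runs from one marked index up to, but not including, the next marked index.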
5-
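Control Flow Graph with Basic Blocks
"Degenerated" control flow graph (CFG)
the nodes are the basic blocks
i := 0
t2 := 0
L t2 := t2 + i
i := i + 1
if i < 10 goto L
x := t2
[Figure: CFG of the three basic blocks; the back edge to L is labeled i < 10, the edge to x := t2 is labeled i >= 10]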
5-
Example
/* k >= 0 */
s = k;
WHILE (k < 10) {
IF (ok)
j++;
ELSE {
j = 0;
ok = true;
}
k++;
}
r = j;
[Figure: control flow graph of the program with the basic blocks s = k; / WHILE (k<10) / if (ok) / j++; / j = 0; ok = true; / k++; / r = j;]
5-
Calculation of the WCET
Definition: A program consists of N basic blocks, where
each basic block Bi has a worst-case execution time ci and
is executed exactly xi times. Then, the WCET is given by
$\mathrm{WCET} = \sum_{i=1}^{N} c_i \cdot x_i$
the ci values are determined using the static analysis.
how to determine xi ?
• structural constraints given by the program structure
• additional constraints provided by the programmer (bounds for
loop counters, etc.; based on knowledge of the program context)
5-
Structural Constraints
Flow equations:
d1 = d2 = x1
d2 + d8 = d3 + d9 = x2
d3 = d4 + d5 = x3
d4 = d6 = x4
d5 = d7 = x5
d6 + d7 = d8 = x6
d9 = d10 = x7
[Figure: CFG with basic blocks B1 (s = k;), B2 (WHILE (k<10)), B3 (if (ok)), B4 (j++;), B5 (j = 0; ok = true;), B6 (k++;), B7 (r = j;) and edges d1 to d10]
5 - 24
Additional Constraints
loop is executed for at most 10 times:
x3 ≤ 10 · x1
B5 is executed for at most one time:
x5 ≤ 1 · x1
[Figure: the CFG from the previous slide with basic blocks B1 to B7 and edges d1 to d10]
5 - 25
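WCET - ILP
ILP with structural and additional constraints:
$\mathrm{WCET} = \max \Big\{ \sum_{i=1}^{N} c_i \cdot x_i \;\Big|\; d_1 = 1 \;\wedge\; \sum_{j \in in(B_i)} d_j = \sum_{k \in out(B_i)} d_k = x_i,\ i = 1 \ldots N \;\wedge\; \text{additional constraints} \Big\}$
d1 = 1: program is executed once
the two flow sums are the structural constraints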
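For the small example program, the ILP can be checked by exhaustive enumeration instead of an ILP solver. The sketch below is an illustration under assumptions: the function name is made up, and the counts x1..x7 are expressed through x3 and x5 using the flow equations of the example (x1 = 1, x2 = x3 + 1, x4 = x3 - x5, x6 = x3, x7 = 1). The block costs passed in are hypothetical.

```c
#include <assert.h>

/* Maximises sum(c[i] * x[i]) over all execution counts that satisfy the
 * flow equations of the example CFG and the two additional constraints. */
long wcet_example(const long c[7]) {
    long best = -1;
    for (long x3 = 0; x3 <= 10; x3++) {                 /* x3 <= 10 * x1 */
        for (long x5 = 0; x5 <= 1 && x5 <= x3; x5++) {  /* x5 <= 1 * x1, x5 <= x3 by flow */
            long x[7] = {1, x3 + 1, x3, x3 - x5, x5, x3, 1};
            long sum = 0;
            for (int i = 0; i < 7; i++) {
                assert(c[i] >= 0);
                sum += c[i] * x[i];
            }
            if (sum > best) best = sum;
        }
    }
    return best;
}
```

Enumeration only works because the example has two free variables with tiny ranges; realistic programs need a proper ILP solver.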
5 - 26
Contents
Introduction
problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
must, may analysis
Pipelines
Abstract pipeline models
Integrated analyses
5-2
Abstract Interpretation (AI)
Semantics-based method for static program analysis
5-2
Abstract Interpretation: the Ingredients
abstract domain, related to concrete domain by
abstraction and concretization functions
e.g. L → Intervals,
where Intervals = LB × UB, LB = UB = Int ∪ {-∞, ∞}
instead of L → Int
abstract transfer functions for each statement type:
abstract versions of their semantics
e.g. + : Intervals × Intervals → Intervals where
[a,b] + [c,d] = [a+c, b+d] with + extended to -∞, ∞
a join function combining abstract values from different
control-flow paths
e.g. ⊔ : Interval × Interval → Interval where
[a,b] ⊔ [c,d] = [min(a,c),max(b,d)]
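The interval transfer function for + and the join can be sketched in C. Representing -∞ and ∞ by `LONG_MIN` and `LONG_MAX` and saturating on overflow are assumptions of this sketch, as are the type and function names.

```c
#include <assert.h>
#include <limits.h>

/* Interval [lo, hi]; LONG_MIN and LONG_MAX stand in for -inf and +inf. */
typedef struct { long lo, hi; } Interval;

/* Addition of bounds, extended to the infinite values (saturating). */
long add_bound(long a, long b) {
    if (a == LONG_MIN || b == LONG_MIN) return LONG_MIN;
    if (a == LONG_MAX || b == LONG_MAX) return LONG_MAX;
    if (b > 0 && a > LONG_MAX - b) return LONG_MAX;
    if (b < 0 && a < LONG_MIN - b) return LONG_MIN;
    return a + b;
}

/* Abstract transfer function for +: [a,b] + [c,d] = [a+c, b+d]. */
Interval interval_add(Interval x, Interval y) {
    assert(x.lo <= x.hi && y.lo <= y.hi);
    Interval r = { add_bound(x.lo, y.lo), add_bound(x.hi, y.hi) };
    return r;
}

/* Join: [a,b] join [c,d] = [min(a,c), max(b,d)]. */
Interval interval_join(Interval x, Interval y) {
    assert(x.lo <= x.hi && y.lo <= y.hi);
    Interval r = { x.lo < y.lo ? x.lo : y.lo, x.hi > y.hi ? x.hi : y.hi };
    return r;
}
```

The join over-approximates: joining [1,2] and [3,4] yields [1,4], which is safe but loses the gap between the two.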
5-2
Value Analysis
Motivation:
Provide access information to data-cache/pipeline analysis
Detect infeasible paths
Derive loop bounds
5 - 30
Value Analysis
Intervals are computed along the edges
At joins, intervals are "unioned"
move #4,D0
add D1,D0
move (A0,D0),D1
Which address is accessed here?
[Figure: intervals for D0, D1 and A0 annotated on the edges between the instructions, and the resulting interval of accessed addresses]
5-3
Contents
Introduction
problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
must, may analysis
Pipelines
Abstract pipeline models
Integrated analyses
5-3
Caches: Fast Memory on Chip
Caches are used, because
Fast main memory is too expensive
The speed gap between CPU and memory is too large and
increasing
5 - 33
Caches
Processor
Cache: access takes ~ 1 cycle; fast, small, expensive
Bus
Memory: access takes ~ 100 cycles; (relatively) slow, large, cheap
5-3
Caches
CPU wants to read/write at memory address a,
sends a request for a to the bus.
Cases:
Block m containing a in the cache (hit):
request for a is served in the next cycle.
Block m not in the cache (miss):
m is transferred from main memory to the cache,
m may replace some block in the cache,
request for a is served asap while transfer still continues.
Several replacement strategies: LRU, PLRU, FIFO,...
determine which line to replace.
5 - 35
ay e Ass ia i e a e
5-3
LRU Strategy
Each cache set has its own replacement logic => Cache sets
are independent. Everything explained in terms of one set
LRU-Replacement Strategy:
Replace the block that has been Least Recently Used
Modeled by Ages
Example: 4-way set associative cache
[Table: contents of the set, ordered by age, after a sequence of accesses: a miss, a hit, and another miss]
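LRU with ages can be sketched for a single set as follows. The 4 ways, the `EMPTY` marker and the names are assumptions of this illustration; position 0 holds the youngest block and the last position the oldest.

```c
#include <assert.h>
#include <stdbool.h>

#define WAYS 4
#define EMPTY (-1)

/* One cache set; line[a] holds the memory block with age a (0 = youngest). */
typedef struct { int line[WAYS]; } CacheSet;

void set_init(CacheSet *s) {
    for (int a = 0; a < WAYS; a++) s->line[a] = EMPTY;
}

/* Accesses memory block m under LRU; returns true on a hit. */
bool set_access(CacheSet *s, int m) {
    assert(m != EMPTY);
    int pos = WAYS - 1;             /* on a miss the oldest line is replaced */
    bool hit = false;
    for (int a = 0; a < WAYS; a++) {
        if (s->line[a] == m) { pos = a; hit = true; break; }
    }
    for (int a = pos; a > 0; a--)   /* blocks younger than pos age by one */
        s->line[a] = s->line[a - 1];
    s->line[0] = m;                 /* the accessed block becomes the youngest */
    return hit;
}
```

On a hit only the blocks younger than the accessed one age; on a miss every block ages and the oldest one is evicted.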
5-3
Cache Analysis
How to statically precompute cache contents:
Must Analysis:
For each program point (and calling context), find out which
blocks are in the cache.
Determines safe information about cache hits. Each
predicted cache hit reduces WCET.
May Analysis:
For each program point (and calling context), find out which
blocks may be in the cache. Complement says what is not in
the cache.
5-3
Contexts
Cache contents depend on the context, i.e. calls and
loops
5 - 50
Contents
Introduction
problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
must, may analysis
Pipelines
Abstract pipeline models
Integrated analyses
5-5
Comparison of Architectures
Single-cycle processing
Multi-cycle processing
Pipelined processing (pipelining)
[Figure: instruction timing for the three schemes]
5-5
Hardware Features: Pipelines
Cycle  Inst 1   Inst 2   Inst 3   Inst 4
1      Fetch
2      Decode   Fetch
3      Execute  Decode   Fetch
4      WB       Execute  Decode   Fetch
5               WB       Execute  Decode
6                        WB       Execute
7                                 WB
5 - 53
D t th o e e ch tectu e
5-5
d e Fe tu e e e
.
5 - 55
Pipeline Hazards
Pipeline Hazards:
Data Hazards: Operands not yet available
(Data Dependences)
Control Hazards: Conditional branch
5-5
o to d
5-5
D t d
5-5
t tc o h d
: prediction o cache hits on instruction or
operand etch or store
l z r4 2 r1 Hi
: analysis o data control hazards
add r4 r5 r6
l z r7 1 r1 pera d
add r8 r4 r4 read
5-5
Concrete State Machine
Processor (pipeline, cache, memory, inputs) viewed as a
state machine, performing transitions every clock cycle.
Starting in an initial state for an instruction, transitions are
performed until a final state is reached:
End state: instruction has left the pipeline
# transitions: execution time of instruction
function exec (b : basic block, s : concrete pipeline state) t : trace
interprets instruction stream of b starting in state s, producing trace t
successor basic block is interpreted starting in initial state last(t)
length(t) gives number of cycles
5-
Abstract Pipeline for a Basic Block
function exec (b : basic block, s : abstract pipeline state)
t : trace
interprets instruction stream of b (annotated with cache
information) starting in state s, producing trace t
length(t) gives number of cycles
5-
What is different?
Starting state for successor basic block? In particular, if
there are several predecessor blocks
Alternatives:
sets of states
combine by assuming that local worst case is safe
5- 2
u o te
5- 3
d e ot e ode
ut te t to
doc d e o
Intellectual Intellectual
Prop. Code Prop. Block
c to ch tectu e
E t to
-3
ectu e o
ptimization
esign
Implementation
-
E o ut o u t o ect e t to
o th
Wha are lu i ar l ri hms
H d he r
-5
he c o e
eight 75 g eight eight 3 g eight
pro it 5 15 g pro it 7 1 g
pro it 8 pro it 3
pr i
15
ei h
5 g 1 g 15 g 2 g 25 g 3 g 35 g
-
he de o F o t
e to n there is no single optimal solution but
o some solutions are better than others
pr i
selecting a
2 solution
15
inding the good
1 solutions
ei h
5 g 1 g 15 g 2 - g 25 g 3 g 35 g
Dec o e ect o ut o
o che pro it more important than cost ranking
eight must not e ceed 24 g constraint
pr i
15
too heavy
1
ei h
5 g 1 g 15 g 2 - g 25 g 3 g 35 g
Whe to e the Dec o
Be o e t to te t to
searches or a set o
ranks ob ectives green solutions
de ines constraints
pr i
searches or one selects one solution
green solution 2 considering constraints
15
too heavy
1
-
Ft e d ut e ect e
e to ed do ce ed
ei h ed sum
y2 y2
y1 y1
1 2 k
y 1y1 kyk
y1
- 3
ut eo e E o ut o o th
11 1 solution 111 itness 19
itness
evaluation mating
selection
11 1 11
environmental
selection
recombination
1 11
mutation
1 1
item 1 item 2 item 3 item 4
subset
- 5
e e c u t o ect e E
population archive
sample
update
vary
truncate
select
ne population - ne archive
E o ut o o th ct o
ma . y2
hypothetical trade o ront
-
min. y1
B c Box t to
o ect e u ct o
Stretch Module ecision Module andling Module
X lane \
1
lane a re uire
v a
D Br
v D r
v t point o
gear
ob ective
change D r
decision
t n gear
2 1
gear 4 3
lane R 5
vector
v a
lane v
a gear n
ptimization lgorithm:
only allo ed to evaluate
direct search
-
Design Space Exploration
Specification
Optimization
Evaluation
Implementation
[Figure: trade-off between power consumption, latency and cost]
-
c et oce et o
B e
Embedded
Embedded
Internet
Internet
evices
evices o
ethod
to
d
do oth
c co d e
Wearable
WearableComputing
Computing
e d o
ccess Core
Mobile
MobileInternet
Internet
-2
Network Processors
Network processor = high-performance, programmable device
designed to efficiently execute
communication
workloads [Crowley et al.]
[Figure: incoming flows (packet streams) pass through the network processor (routing/forwarding, transcoding, encryption/decryption) and leave as outgoing flows (processed packets); real-time flows, e.g. voice, and non-real-time flows, e.g. sftp]
-2
Optimization Scenario: Overview
Given: specification of the task structure (task model)
= for each flow the corresponding tasks to be executed
different usage scenarios (flow model)
= sets of flows with different characteristics
Sought: network processor implementation
= architecture + task mapping + scheduling
Objectives: maximize performance
minimize cost
(performance model)
Subject to: memory constraint
delay constraints
- 22
Ex o t o t te
or each usage
scenario separately
architecture task binding
template graph restrictions
e u to
co t uct e t te
allocation per ormance ch tectu e o e o ce
bindings cost vector
u t o ect e
o t to
- 23
ectu e o
Introduction
esign
Implementation
-2
Dominance, Pareto Points
A design point is dominated by another design point if the latter is
better than or equal to it in all criteria and
better in at least one criterion.
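The dominance test translates directly into a short C function. The sketch assumes that every criterion is to be maximized and that a design point is given as an array of k objective values; both are assumptions of this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if design point a dominates design point b.
 * Assumption: all k criteria are to be maximized. */
bool dominates(const double *a, const double *b, size_t k) {
    assert(k > 0);
    bool strictly_better = false;
    for (size_t i = 0; i < k; i++) {
        if (a[i] < b[i]) return false;      /* worse in one criterion: no dominance */
        if (a[i] > b[i]) strictly_better = true;
    }
    return strictly_better;                 /* equal points do not dominate each other */
}
```

Two points for which `dominates` is false in both directions are incomparable; the Pareto points are those not dominated by any other point.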
- 25
u t o ect e t to
-2
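Multiobjective Optimization
Maximize (y1, y2, …, yk) = g(x1, x2, …, xn)
[Figure: two plots over y1 and y2; relative to a reference solution, the regions are labeled better, worse and incomparable, and dominated solutions are marked]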
Randomized
search algorithm
t g tt
randomly choose a randomly choose a
solution 1 to start ith solution t 1 using solutions
1 t
-2
e o do ed e ch o th
e o e ect o to
mating
selection
environmental
selection
E ≥1 both
evolutionary algorithm
1 no mating selection
tabu search
1 no mating selection
simulated annealing
-2
Limitations of Randomized Search Algorithms
Remarks:
Not all functions equally likely and realistic
We cannot expect to design the algorithm beating all others
Ongoing research: which algorithm is suited for which class of
problem?
6 - 30
Course Synopsis
Introduction
Optimization
Implementation
6-3
Design Choices
representation fitness assignment mating selection
11 111
parameters
11 1
1 11 11
6 - 32
Comparison of Three Implementations
-o ective knapsack pro lem
e te ded
6 - 33
Design Choices
fitness assignment mating selection
11 111
parameters
11 1
1 11 11
6-3
Representation
search space → decoder → solution space → objectives → objective space
[Figure: bit vectors in the search space (one bit per node: selected?) are decoded into solutions and mapped by g into the objective space]
6 - 36
Example: Integer Vector Encoding
Given: graph, k colors
Goal: assign each node one of the k colors such that the
number of connected nodes with the same color is
minimized (graph coloring problem)
[Figure: example graph and its encoding as a vector with one color entry per node]
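With the integer vector encoding, the objective of the graph coloring problem can be evaluated directly on the vector. The sketch below counts conflicting edges, i.e. edges whose two end nodes carry the same color; treating this edge count as the quantity to minimize is an assumption, and the `Edge` type and function name are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t u, v; } Edge;

/* color[i] in 0..k-1 is the integer vector encoding of a solution.
 * Returns the number of edges whose end nodes share a color. */
size_t coloring_conflicts(const Edge *edges, size_t n_edges,
                          const int *color, size_t n_nodes, int k) {
    size_t conflicts = 0;
    for (size_t i = 0; i < n_nodes; i++)
        assert(color[i] >= 0 && color[i] < k);
    for (size_t e = 0; e < n_edges; e++) {
        assert(edges[e].u < n_nodes && edges[e].v < n_nodes);
        if (color[edges[e].u] == color[edges[e].v]) conflicts++;
    }
    return conflicts;
}
```

A fitness function for an evolutionary algorithm could use this count directly, with 0 meaning a proper coloring.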
6-3
Example: Real Vector Encoding
parameters x1 x x x xn
values 6.- 3 . 1. . .
Tree Example: Parking a Truck
[Figure: truck with cab and trailer backing up to a dock at constant speed; steering angle u, cab angle d, trailer angle t, position (x, y)]
Goal:
find function c with
u = c(x, y, d, t)
6 - 39
Search Space for the Truck Problem
Operators:
Arguments: position x
position y
cab angle d
trailer angle t
Search space: set of symbolic expressions using the above
operators and arguments
6 - 40
Example Solution: Tree Representation
AN
6- 2
Design Choices
representation mating selection
11 111
parameters
11 1
1 11 11
6- 3
Fitness Assignment
Fitness F: scalar value representing quality of an individual
Fitness function:
[Figure: six individuals plotted over execution time and cost, annotated with their fitness values]
F(1) = 3
F(2) = 1
F(3) = 1
F(4) = 2
F(5) = 1
F(6) = 0
6-
Constraint Handling
onstraint x1 x xn ≥ feasi le ≥
11 111
parameters
11 1
1 11 11
variation operators
6-
Selection
Two types of selection:
6-
Tournament Selection
uniformly choose individuals at random, independently of fitness
compare fitness and copy best individual into mating pool
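The two steps of tournament selection can be sketched in C. Higher fitness being better, the tournament size limit of 16 and the use of `rand()` (uniform only up to modulo bias) are assumptions of this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Step 2: compare fitness of the t candidates, return the index of the best. */
size_t tournament_winner(const double *fitness, const size_t *candidates, size_t t) {
    assert(t > 0);
    size_t best = candidates[0];
    for (size_t i = 1; i < t; i++)
        if (fitness[candidates[i]] > fitness[best]) best = candidates[i];
    return best;
}

/* Fills the mating pool by running one tournament of size t per slot. */
void fill_mating_pool(const double *fitness, size_t n, size_t t,
                      size_t *pool, size_t pool_size) {
    assert(n > 0 && t > 0 && t <= 16);
    size_t candidates[16];
    for (size_t p = 0; p < pool_size; p++) {
        /* Step 1: choose t individuals at random, independently of fitness. */
        for (size_t i = 0; i < t; i++)
            candidates[i] = (size_t)rand() % n;
        pool[p] = tournament_winner(fitness, candidates, t);
    }
}
```

A larger tournament size t increases the selection pressure, since weak individuals win fewer tournaments.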
6- 9
Design Choices
representation fitness assignment mating selection
11 111
parameters
11 1
1 11 11
environmental selection
6- 0
Vector Mutation: Examples
Bit vectors:
each bit is flipped with probability 1
1 1 1
1 1
Permutations:
swap
1 1
rearrange
6-
Mutation Operators on Trees: Grow
grow
N N
N AN N
6- 2
Mutation Operators on Trees: Shrink
shrink
N N
N AN AN AN
6- 3
Mutation Operators on Trees: Switch
switch
N AN
N AN N
6-
Mutation Operators on Trees: Replace
replace
N N
N AN N AN
6-
Vector Recombination: Examples
Bit vectors:
1 1 1
1 1 1
1 1 1
Permutations:
1 child
parents 1
1
6- 6
Recombination of Trees
exchange
[Figure: two parent trees exchange subtrees]
6-
A Generic Multiobjective EA
population archive
sample
update
vary
truncate
select
non-dominated solutions:
dominated solutions
dominated solutions
of non areto solutions
∑ strengths of dominators
y1
6 - 62
Implementation: Components
A framework that
Provides ready-to-use modules (algorithms, applications)
Is simple to use
Is independent of programming language and OS
Comes with minimum overhead
Representation
Selection
Recombination Objective functions
Fitness assignment
Archiving Mutation
6-6
The Concept of PISA
[Figure: algorithms on one side and applications (knapsack, network processor design, ...) on the other, coupled through a text-based interface]
Platform and programming language independent Interface
for Search Algorithms [Bleuler et al.]
6-6
PISA: Implementation
[Figure: selector process and variator process exchanging text files through a shared file system]
application independent (selector process):
mating / environmental selection
individuals are described by IDs and objective vectors
handshake protocol:
state / action
individual IDs
objective vectors
parameters
application dependent (variator process):
variation operators
stores and manages individuals
Hardware/Software Codesign
Intellectual Intellectual
Prop. Code Prop. Block
: select components
-3
Application Specification
Depends on the underlying model of computation.
Examples (see also next slides):
Task graphs (data flow graph, control flow graph)
Process Networks (Kahn Process Network, Synchronous
Dataflow)
State Machine Representations (SpecCharts, StateCharts,
Polis) [not covered in this course].
7-4
Data Flow Graph
x = 3*a + b*b - c;
y = a + b*x;
z = b - c*(a + b);
[Figure: data flow graph of the three assignments with inputs a, b, c and outputs x, y, z]
7-
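Control Flow Graph
what_is_this {
1 read (a,b);
2 done = FALSE;
3 repeat {
4 if (a>b)
5 a = a-b;
6 elseif (b>a)
7 b = b-a;
8 else done = TRUE;
9 } until done;
10 write (a);
}
[Figure: control flow graph of what_is_this; nodes carry the statement numbers, edges are labeled a>b, a<=b, a<b, a=b, !done, done]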
7-
Kahn Process Network
Hierarchical network for M P application:
7-7
Architecture Specification
Depends on the underlying model of the platform.
Usually a graph notation is used to describe the elements;
properties of the underlying platform are usually attached.
7-
Example Architecture Specification
- <processor name="processor1" type="DSP">
<port name="processor_port" type="duplex" />
<configuration name="clock" value="100 MHz" />
</processor>
+ <processor name="processor2" type="RISC">
+ <memory name="sharedmemory" type="DXM">
- <hw_channel name="in_tile_link" type="bus">
<port name="port1" type="duplex" />
<port name="port2" type="duplex" />
<port name="port3" type="duplex" />
<configuration name="buswidth" value="32bit" />
</hw_channel>
- <connection name="processor1link">
<origin name="processor1">
<port name="processor_port" />
</origin>
<target name="in_tile_link">
<port name="port1" />
</target>
</connection>
+ <connection name="processor2link">
+ <connection name="memorylink">
[Figure: DSP, RISC and DXM connected by a bus]
7-
Mapping Specification
Relates application and architecture specification:
maps processes to computing resources
maps communication between processes (in case of process
networks) to communication paths of the architecture
specifies resource sharing disciplines and scheduling
7-
Example
Basic model with a data flow graph and static scheduling
Problem graph (data flow graph) GP(VP,EP):
Interpretation:
• VP consists of functional nodes VPf (task, procedure) and communication nodes VPc.
• EP represent data dependencies
[Figure: problem graph with nodes 1 to 7]
7-
Example (2)
Architecture graph GA(VA,EA):
RISC HWM1
RISC HWM1
7 - 12
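Example (3)
Definition: A specification graph is a graph
GS=(VS,ES) consisting
of a problem graph GP, an architecture graph GA, and mapping edges EM.
In particular, VS=VP∪VA and ES=EP∪EA∪EM
[Figure: specification graph with the problem graph GP on the left, the architecture graph GA (RISC, SB, HWM2, ...) on the right, and the mapping edges EM between them]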
7-1
Example (4)
Three main tasks of synthesis:
• Allocation α is a subset of VA.
• Binding β is a subset of EM, i.e., a mapping of functional
nodes of VP onto resource nodes of VA.
• Schedule τ is a function that assigns a number (start time) to
each functional node.
7-1
Example (5)
Definition: Given a specification graph GS,
an implementation is a triple (α,β,τ), where α
is a feasible allocation, β is a feasible binding,
and τ is a schedule.
[Figure: implementation of the example, showing the allocation α (RISC, HWM1, HWM2, PTP bus, shared bus), the binding β, and the schedule τ as start times annotated at the nodes]
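One possible reading of "feasible binding" can be sketched as a check in C. The rule used here is an assumption, not taken from the slides: every functional node is bound to exactly one resource, that resource is part of the allocation, and the pair is a mapping edge in EM. The types and names are hypothetical, and the schedule τ is not checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A mapping edge of EM: functional node `task` may run on resource `res`. */
typedef struct { size_t task, res; } MapEdge;

/* allocated[r] encodes the allocation, binding[t] the resource chosen for task t. */
bool binding_is_feasible(const bool *allocated, size_t n_res,
                         const size_t *binding, size_t n_tasks,
                         const MapEdge *em, size_t n_em) {
    assert(n_res > 0);
    for (size_t t = 0; t < n_tasks; t++) {
        size_t r = binding[t];
        if (r >= n_res || !allocated[r]) return false;   /* resource not allocated */
        bool in_em = false;
        for (size_t e = 0; e < n_em; e++)
            if (em[e].task == t && em[e].res == r) { in_em = true; break; }
        if (!in_em) return false;                        /* binding is not an edge of EM */
    }
    return true;
}
```

A fuller check would also require that communicating tasks are bound to resources connected in the architecture graph.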
7-1
em e
pecification
ntellectual ntellectual
Prop ode Prop Bloc
Determine mapping
Determine important parameters (end-to-end delay,
throughput, buffer space, output jitter, ...)
Give feedback to optimization
ppl a e e
app
E ma
7-1
e e a el
7-1
a ae ae e
em a
e apa
ntellectual ntellectual
Prop ode Prop Bloc
-
el
(see previous lecture)
model application
define architectural template
identify possible bindings
-
The Partitioning Problem
The partitioning problem is to assign n
objects O = {o1, ..., on} to m blocks (also called
partitions) P = {p1, ..., pm}, such that
• p1 ∪ p2 ∪ ... ∪ pm = O
• pi ∩ pj = { } ∀ i,j: i ≠ j and
• costs c(P) are minimized
In the simple model:
• objects = data flow graph nodes
• blocks = architecture graph nodes
-
Cost of a design point
may include: C system cost
L latency in sec
P power consumption
requires estimation to find C, L, P
-
e e al a e
enumeration
Integer Linear Programs (ILP)
constructive methods
random mapping
hierarchical clustering
iterative methods
Kernighan-Lin Algorithm
Simulated Annealing
Evolutionary Algorithms (EA), see next lecture
-7
Integer Programming Model
Ingredients:
Cost function
Constraints
Involving linear expressions of integer variables from a set X
Cost function: $C = \sum_{x_i \in X} a_i x_i$ with $a_i \in \mathbb{R},\ x_i \in \mathbb{N}$ (1)
Constraints: $\forall j \in J: \sum_{x_i \in X} b_{i,j} x_i \ge c_j$ with $b_{i,j}, c_j \in \mathbb{R}$ (2)
Example: minimize a linear cost function C of x1, x2, x3
subject to x1 + x2 + x3 ≥ 2
x1, x2, x3 ∈ {0,1}
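A tiny instance of this model can be solved by trying all assignments. The sketch below keeps the constraint x1 + x2 + x3 ≥ 2 over binary variables from the example; the cost coefficients passed in are hypothetical, and brute force stands in for a real IP solver.

```c
#include <assert.h>
#include <limits.h>

/* Minimizes C = a[0]*x1 + a[1]*x2 + a[2]*x3 subject to x1 + x2 + x3 >= 2,
 * xi in {0,1}, by trying all 8 assignments. The best assignment is written
 * to best_x; the minimal cost is returned. */
int solve_small_ip(const int a[3], int best_x[3]) {
    int best = INT_MAX;
    for (int bits = 0; bits < 8; bits++) {
        int x[3] = { bits & 1, (bits >> 1) & 1, (bits >> 2) & 1 };
        if (x[0] + x[1] + x[2] < 2) continue;        /* constraint (2) violated */
        int cost = a[0] * x[0] + a[1] * x[1] + a[2] * x[2];
        if (cost < best) {
            best = cost;
            for (int i = 0; i < 3; i++) best_x[i] = x[i];
        }
    }
    assert(best != INT_MAX);   /* the all-ones assignment is always feasible */
    return best;
}
```

With the hypothetical coefficients 5, 6 and 4, the cheapest feasible choice sets x1 and x3 to 1, for a cost of 9.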
8-
Remark on Integer Programming
Maximizing the cost function can be done by setting C' = -C
8-
Integer Linear Program for Partitioning (1)
Binary variables x_{i,k}
x_{i,k} = 1: object o_i in block p_k
x_{i,k} = 0: object o_i not in block p_k
8-
Constructive Methods
Random mapping
each object is assigned to a block randomly
Hierarchical clustering
stepwise grouping of objects
closeness function determines how desirable it is to group
two objects
Constructive methods
are often used to generate a starting partition for iterative
methods
show the difficulty of finding proper closeness functions
8-
Hierarchical Clustering - Example (1)
v5 = v1v3
v1 v5
10 20 10
10 v2 7
v2 v3
4 4
v4
v4
8 - 14
Hierarchical Clustering - Example (2)
v6 = v2v5
v5
10 v
v2 7 5 5
4
v4 v4
8-1
Hierarchical Clustering - Example (3)
v6
v7 = v6v4
5 5 v7
v4
8-1
Hierarchical Clustering - Example (4)
step 3:
v7 = v6v4
step 1:
v5 = v1v3
v1 v2 v3 v4
8-1
Iterative Methods - Kernighan-Lin (1)
Simple greedy heuristic:
Until there is no improvement in cost: re-group a pair of
objects which leads to the largest gain in cost
v2
v1
v5
v4
v v3
v7
v v
8-1
Iterative Methods - Simulated Annealing
From physics:
metal and gas take on a minimal-energy state during cooling down
(under certain constraints):
at each temperature, the system reaches a thermodynamic equilibrium
the temperature is decreased (sufficiently) slowly
Probability that a particle "jumps" to a higher-energy state:
$P(e_i, e_{i+1}, T) = e^{\frac{e_i - e_{i+1}}{k_B T}}$
8- 1
Iterative Methods - Simulated Annealing
Cooling Down: DecreaseTemp(), Frozen()
• temp_start = 1.0
• temp = α • temp (typical: 0.8 ≤ α ≤ 0.99)
• terminate when temp < temp_min or there is no more improvement
Equilibrium: Equilibrium()
• after defined number of iterations or when there is no more
improvement
Complexity
from exponential to constant, depending on the implementation of
the functions Equilibrium(), DecreaseTemp(), and Frozen()
the longer the runtime, the better the quality of results
typical: construct functions to get polynomial runtimes
8 - 22
Hardware/Software Codesign
Allo ation
do dr regor a a
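Cost function: $C = \sum_{x_i \in X} a_i x_i$ with $a_i \in \mathbb{R},\ x_i \in \mathbb{N}$ (1)
Constraints: $\forall j \in J: \sum_{x_i \in X} b_{i,j} x_i \ge c_j$ with $b_{i,j}, c_j \in \mathbb{R}$ (2)
[Example with binary variables and its optimal solution]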
-
Remarks on integer programming
-
Simulated Annealing
General method for solving combinatorial
optimization problems.
Based on the model of slowly cooling crystal liquids.
Some configuration is subject to changes.
Special property of simulated annealing: changes
leading to a poorer configuration (with respect to
some cost function) are accepted with a certain
probability.
This probability is controlled by a temperature
parameter: the probability is smaller for smaller
temperatures.
-
Explanation
Initially, some random initial configuration is created.
Current temperature is set to a large value.
Outer loop:
• Temperature is reduced for each iteration
• Terminated if (temperature ≤ lower limit) or
(number of iterations ≥ upper limit).
Inner loop: For each iteration:
• New configuration generated from current configuration
• Accepted if (new cost ≤ cost of current configuration)
• Accepted with temperature-dependent probability if
(cost of new config. > cost of current configuration).
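The outer and inner loop can be sketched in C on a toy problem. Everything concrete here is an assumption for illustration: the cost function, the neighbourhood, the geometric cooling with factor alpha, and the acceptance probability temp / (temp + delta), which is a simple temperature-dependent stand-in for the usual Boltzmann form exp(-delta / temp).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy problem: find x in [0, 100] minimizing (x - 42)^2. */
double sa_cost(int x) { return (double)(x - 42) * (double)(x - 42); }

/* New configuration generated from the current one: step left or right. */
int sa_neighbour(int x) {
    int y = x + (rand() % 2 ? 1 : -1);
    if (y < 0) y = 0;
    if (y > 100) y = 100;
    return y;
}

int anneal(int start, double temp, double temp_min, double alpha, int inner_iters) {
    assert(temp > temp_min && temp_min > 0.0 && alpha > 0.0 && alpha < 1.0);
    int cur = start, best = start;
    while (temp > temp_min) {                      /* outer loop */
        for (int i = 0; i < inner_iters; i++) {    /* inner loop */
            int cand = sa_neighbour(cur);
            double delta = sa_cost(cand) - sa_cost(cur);
            /* accept if not worse; otherwise with a probability that
             * shrinks for smaller temperatures and larger deteriorations */
            if (delta <= 0.0 ||
                (double)rand() / RAND_MAX < temp / (temp + delta))
                cur = cand;
            if (sa_cost(cur) < sa_cost(best)) best = cur;
        }
        temp *= alpha;                             /* temperature is reduced */
    }
    return best;
}
```

The function returns the best configuration seen, so the result is never worse than the starting point, whatever the random sequence.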
-
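Multiobjective Optimization
Maximize (y1, y2, …, yk) = g(x1, x2, …, xn)
[Figure: two plots over y1 and y2; relative to a reference solution, the regions are labeled better, worse and incomparable, and dominated solutions are marked]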
9-8
Improving predictability for caches
Loop caches
Mapping code to less used part(s) of the index space
Cache locking/freezing
Changing the memory allocation for code or data
Mapping pieces of software to specific ways
Methods:
- Generating appropriate way in software
- Allocation of certain parts of the address space to a specific way
- Including way-identifiers in virtual to real-address translation
⇒ Caches behave almost like a scratch pad
9-9
Summary
9-
Hardware/Software Codesign
Code optimization
o r re or Pa a
-
Merging of tasks
-
Splitting of tasks
-
Merging and splitting of tasks
-
system
Example
-
Attributes of a system that needs
rewriting
Tasks blocking after they have already started running
-
Work by Cortadella et al.
1. Transform each of the tasks into a Petri net,
2. Generate one global Petri net from the nets of the tasks,
3. Partition global net into "sequences of transition"
4. Generate one task from each such sequence
-8
Result, as published by Cortadella
Reads only at the beginning
Initialization task
Never true
Always true
-9
tm e ers o o
ever true
Tin ()
RE ( , sample, 1)
j==i-1 sum = sample i
j)i T = sample d =
T
: (i < ) retur
T = sum d= T
d = d c R TE( T,d,1)
sum = i =
retur
lways true -
as e e o urre y ma a eme t
-
am e o a m e
Task1
Task2
Task
eadline
eadline
t t
Static (compile-time) methods can ensure WCET feasible schedules, but
waste energy in the average case.
…or they can define a probability for violating the deadline.
[Figure: task execution timelines relative to the deadline]
Mixed methods use compile-time analysis to define a set of possible
execution parameters for each task.
Runtime scheduler selects the most energy saving, deadline preserving
combination.
e um tt me e
-
Floating-point to fixed-point conversion
Pros:
Lower cost
Faster
Lower power consumption
Sufficient SQNR, if properly scaled
Suitable for portable applications
Cons:
Decreased dynamic range
Finite word-length effect, unless properly scaled
Overflow and excessive quantization noise
Extra programming effort
Floating-Point: exponent, mantissa; automatic computation and update of each exponent at run-time
Fixed-Point: implicit exponent, determined off-line
[Figure: word layouts with sign bit S for (a) Integer and (b) Fixed-Point with a hypothetical binary point]
© Ki-Il Kum, et al
-
ss me t a
t o Su tra t o
ssume y = x, with et result = x y:
- x( =2) and e ualizing each
- y( = ):
s
s
s
s
y s
y s
resu t s
© Ki-Il Kum, et al
-
ut at o
ssume result = x y,
with
- x( =2) and
- y( = ) s
- - result ( =2 ) y
s
s s
resu t s
© Ki-Il Kum, et al
-
e e o me t Pro e ure
oat Po t
Pro ram
-
a e st mat o
- Pro ram
a ua
s e at o
e Po t
Pro ram
-
© Ki-Il Kum, et al
a e st mator
oat Po t a e st mat o Pro ram
Pro ram
float
float iir1(float
iir1(float x)
x)
re ro essor
static
staticfloat
float ss ==
ro t e
float
float yy
ass me t
x
iwl= .xxxxxxxxxxxx
overflow if z
result
- 9
Floating-Point to Fixed-Point Program Converter
Fixed-Point C Program
int iir1(int x)
{
static int s = 0;
int y;
y=sll(mulh(29491,s)+ (x>> 5),1);
s=y;
return y;
}
mulh
- to access the upper half of the multiplied result
- target dependent implementation
sll
- to remove 2nd sign bit
- opt. overflow check
© Ki-Il Kum, et al
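The two helper operations can be sketched in portable C to make the converted filter executable. The 16-bit word length is an assumption of this sketch, as are the concrete bodies of `mulh` and `sll`; on a real DSP both would map to target-specific instructions. Right shifts of negative values are implementation-defined in C and are arithmetic on common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 16 bits of the 32-bit product of two 16-bit operands. */
int16_t mulh(int16_t a, int16_t b) {
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)(product >> 16);
}

/* Shift left logical by n bits; with n = 1 it removes the 2nd sign bit.
 * No overflow check is performed in this sketch. */
int16_t sll(int16_t x, int n) {
    assert(n >= 0 && n < 16);
    return (int16_t)((uint16_t)x << n);
}

/* First-order IIR filter from the slide; 29491 is 0.9 in a 16-bit
 * fractional format (29491 / 32768). */
int16_t iir1(int16_t x) {
    static int16_t s = 0;
    int16_t y = sll((int16_t)(mulh(29491, s) + (x >> 5)), 1);
    s = y;
    return y;
}
```

`mulh` followed by `sll(..., 1)` together form a fractional multiply: the product of two values with 15 fraction bits has 30 fraction bits and two sign bits, and the shift discards the redundant one.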
-
Per orma e om ar so
a e y es
ourt r er ter
Cycles
2
1
21
ixed-Point (1 b) loating-Point
© Ki-Il Kum, et al
-
Per orma e om ar so
a e y es
Cycles P
1 12 2
12
1
1 1
2 1
2
© Ki-Il Kum, et al
-
Per orma e om ar so
S
ixed-Point (1 b)
P ixed-Point ( 2b)
S R (d ) loating-Point
2
2
1
1
© Ki-Il Kum, et al
-
Impact of memory allocation on efficiency
Array
Row major order (C)
Column major order (FORTRAN)
est er orma e ermost oo
orres o s to r tmost array e
o oo s assum ro ma or or er
or (k= k<=m k ) or (j= j<=n j )
or (j= j<=n j ) ) or (k= k<=m k )
p j k = ... p j k = ...
ot always the same impact .. Till uchwald, iploma thesis, niv. ortmund, nformatik 12, 12 2
-
Loop fusion (merging), loop fission
for (j=0; j<=n; j++)
  p[j] = ... ;
for (j=0; j<=n; j++)
  p[j] = p[j] + ...
versus the merged loop
for (j=0; j<=n; j++)
  {p[j] = ... ; p[j] = p[j] + ...}
#define size 30
#define iter 40000
int a[size][size];
float b[size][size];
void ss1() {int i,j;
  for (i=0;i<size;i++){
    for (j=0;j<size;j++){
      a[i][j]+= 17;}}
  for(i=0;i<size;i++){
    for (j=0;j<size;j++){
      b[i][j]-=13;}}}
void ms1() {int i,j;
  for (i=0;i<size;i++){
    for (j=0;j<size;j++){
      a[i][j]+=17; }
    for (j=0;j<size;j++){
      b[i][j]-=13; }}}
void mm1() {int i,j;
  for(i=0;i<size;i++){
    for(j=0;j<size;j++){
      a[i][j] += 17;
      b[i][j] -= 13;}}}
esu ts s m e oo s
ss1 u t me (1 ԑ max)
))
ms1
12 Merge
Merge
1
mm1 dd
loops
loops
superi
superi
or
or
except
except
Sparc
Sparc
2
with
with
oo
gcc .2 - x gcc 2. -o Sparc gcc xo1 Sparc gcc x o
P att orm
-
Loop unrolling
for (j=0; j<=n; j++)
  p[j] = ... ;
unrolled with factor = 2:
for (j=0; j<=n; j+=2)
  {p[j] = ... ; p[j+1] = ...}
Better locality for access to p.
Fewer branches per execution of the loop. More
opportunities for optimizations.
Tradeoff between code size and improvement.
Extreme case: completely unrolled loop (no branch)
Example: matrixmult
#define s 30
#define iter 4000
int a[s][s],b[s][s],c[s][s];
void compute(){int i,j,k;
  for(i=0;i<s;i++){
    for(j=0;j<s;j++){
      for(k=0;k<s;k++){
        c[i][k]+= a[i][j]*b[j][k];
}}}}
extern void compute2()
{int i, j, k;
  for (i = 0; i < 30; i++) {
    for (j = 0; j < 30; j++) {
      for (k = 0; k <= 28; k += 2)
      {{int *suif_tmp;
        suif_tmp = &c[i][k];
        *suif_tmp= *suif_tmp+a[i][j]*b[j][k];}
       {int *suif_tmp;
        suif_tmp=&c[i][k+1];
        *suif_tmp=*suif_tmp +a[i][j]*b[j][k+1];
      }}}}
  return;}
esu ts
Pro essor Su SP te Pe t um
a tor
a tor
Benefits quite small; penalties may be large
Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12 2
-
esu ts e e ts or oo
e e e es
Pro essor
re u t o to
#define s 50
#define iter 150000
int a[s][s], b[s][s];
void compute() {
int i,k;
for (i = 0; i < s; i++) {
for (k = 1; k < s; k++) {
a[i][k] = b[i][k];
b[i][k] = a[i][k-1];
}}}
a tor
Small Till uchwald, iploma thesis, niv.
-
Loop tiling/loop blocking
Original version
for (i=1; i<=N; i++)
  for (k=1; k<=N; k++) {
    r=X[i,k]; /* to be allocated to a register */
    for (j=1; j<=N; j++)
      Z[i,j] += r*Y[k,j];
  }
Never reusing information in the cache for Y and Z if N
is large or cache is small (2 N³ references for Z).
-
Loop tiling/loop blocking
Tiled version
for (kk=1; kk<=N; kk+=B)
  for (jj=1; jj<=N; jj+=B)
    for (i=1; i<=N; i++)
      for (k=kk; k<=min(kk+B-1,N); k++) {
        r=X[i,k]; /* to be allocated to a register */
        for (j=jj; j<=min(jj+B-1,N); j++)
          Z[i,j] += r*Y[k,j];
      }
Same elements for next iteration of i
Reuse factor of B for Z, N for Y
O(N³/B) accesses to main memory
Compiler should select best option
Practical results by Buchwald are disappointing.
One of the few cases where an improvement was achieved:
Source: similar to matrix mult.
[Figure: runtime versus tiling factor on Pentium]
Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12 2
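A runnable C sketch of the untiled and the tiled loop nest makes it easy to check that tiling does not change the result. The matrix size N = 8, the tile size B = 3 and the 0-based indexing are assumptions of this illustration; the array names follow the usual presentation of this example.

```c
#include <assert.h>

#define N 8   /* hypothetical matrix size */
#define B 3   /* hypothetical tile size; need not divide N */

int min_int(int a, int b) { return a < b ? a : b; }

/* Original version: for every (i, k), one row of Y is streamed through. */
void matmul_plain(int X[N][N], int Y[N][N], int Z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            int r = X[i][k];               /* to be allocated to a register */
            for (int j = 0; j < N; j++)
                Z[i][j] += r * Y[k][j];
        }
}

/* Tiled version: the B x B tile of Y is reused for every i. */
void matmul_tiled(int X[N][N], int Y[N][N], int Z[N][N]) {
    assert(B > 0);
    for (int kk = 0; kk < N; kk += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = 0; i < N; i++)
                for (int k = kk; k < min_int(kk + B, N); k++) {
                    int r = X[i][k];
                    for (int j = jj; j < min_int(jj + B, N); j++)
                        Z[i][j] += r * Y[k][j];
                }
}
```

Both functions accumulate into Z, so Z must be zeroed by the caller. Whether the tiled version is faster depends on cache size and on the tile size chosen.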
-
Summary
Task concurrency management
Re-partitioning of computations into tasks
Dynamic exploitation of slack
Floating-point to fixed-point conversion
Range estimation
Conversion
Analysis of the results
High-level loop transformations
Fusion
Unrolling
Tiling
- 8
Transformation "Loop nest splitting"
Example: Separation of margin handling
- 9
Loop nest from MPEG-4 full search
motion estimation
for (z= z 2 z )
if (x =1 y =1 )
for (x= x x ) x1= x
for ( y y )
for (y= y y ) y1= y
for (k= k k )
for (k= k k ) x2=x1 k-
for (l= l l )
for (l= l ) y2=y1 l-
for (i= i i )
for (i= i i ) x =x1 i x =x2 i
for (j= j j )
for (j= j j ) y =y1 j y =y2 j
then block 1 then block 2
if (x x y y )
else y1= y
then block 1 else else block 1
for (k= k k ) x2=x1 k-
if (x x y y )
for (l= l ) y2=y1 l-
then block 2 else else block 2
for (i= i i ) x =x1 i x =x2 i
for (j= j j ) y =y1 j y =y2 j
analysis of polyhedral domains, if ( x y )
selection with genetic algorithm then-block-1 else else-block-1
if (x x y y )
for (z= z 2 z ) then block 2 else else block 2
for (x= x x ) x1= x
for (y= y y ) . alk et al., nf 12, ni o, 2 2
-
Results for loop nest splitting
- Execution times -
[Figure: execution times for the benchmarks Cavity, Motion Estimation and a third benchmark on several processors]
[Falk, 2002]
Array folding
Initial
arrays
-
Array folding
Unfolded
arrays
-
Intra-array
folding
Inter-array folding
-
Application
Array folding is implemented in the DTSE optimization
proposed by IMEC. Array folding adds div and mod ops.
Optimizations required to remove these costly operations.
At IMEC, ADOPT address optimizations perform this task.
For example, modulo operations are replaced by pointers
(indexes) which are incremented and reset.
Results (Mcycles for cavity benchmark)
[Figure: cycle counts per processor for the successive optimization steps]
ADOPT & DTSE required to achieve real benefit
[C.Ghez et al.: Systematic high-level Address
Code Transformations for Piece-wise Linear
Indexing: Illustration on a Medical Imaging
Algorithm, IEEE WS on Signal Processing
System: design & implementation, 2000, pp.
623-632]
-
Code adaptation
porting the description from ANSI-C to Handel-C
VHDL requires considerably more changes
si 0x0 a0 si .62
var si [ : ] var 0x0
var2 si [ :0] var2 0xa0
10 - 4
60
Adapting the program code
statements for parallel execution of parts of the code
the par statement instead of for
• where possible, depending on the contents of the loop
for i 0 i 3 i par i 0 i 3 i
se se
a[i] b[2 i]
a[i] a[i] c[i] par
b[2 i] a[i]
a[i] b[2 i]
a[i] a[i] c[i]
b[2 i] a[i]
10 - 0
Adapting the program code
adjusting the sizes of all variables
all sizes must be defined in advance
• to reduce resource usage they should be minimized
signed/unsigned must be determined in advance
when computing with variables of different sizes
• use of the concatenation operator: the missing bits are added to
the smaller variable
• use of the lower bits of the larger variable
[signed | unsigned] int n (n-bit)
o ilatio
do dr regor Pa a
11 -
ke ro le or t re e or
te
verage eed
erg Po er
Predi ta ilit
Energy
Access times
11 -
a ea
o ti i atio or ig er or a e
o
• High-performance if available memory bandwidth fully used
low-energy consumption if memories are at stand-by mode
• Reduced energy if more values are kept in registers
ADD r3,r0,r2
M V r0, 2
LD r3, [r2, 0] int
inta[
a[ 000]
000] M V r2,r 2
ADD r3,r0,r3 cc aa M V r 2,r
M V r0, 2 for
for i i ii 00
00 i i M V r ,rr 0
LD r0, [r2, r0] bb cc M V r0,r
ADD r0,r3,r0 bb cc M V r ,r
ADD r2,r2, cc M V r ,r
LD r , [r , r0]
ADD r ,r , ADD r0,r3,r
CMP r , 00 ADD r ,r ,
LT LL3 le le ADD r ,r ,
CMP r , 00
LT LL3
11 - 4
Compiler optimizations for improving energy efficiency
Energy-aware scheduling
Energy-aware instruction selection
Operator strength reduction: e.g. replace * by + and <<
Minimize the bitwidth of loads and stores
Standard compiler optimizations with energy as a cost function
E.g.: Register pipelining:
for i := 0 to 10 do
C := 2 * a[i] + a[i-1];
is transformed into
R2 := a[0];
for i := 1 to 10 do
begin
R1 := a[i];
C := 2 * R1 + R2;
R2 := R1;
end;
Exploitation of the memory hierarchy
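The effect of register pipelining can be sketched by counting array reads in both loop versions. This is an illustrative model only: the function names and the read counters are assumptions, and both loops start at index 1 so that a[i-1] is always a valid element.

```python
def original_loop(a):
    """C := 2*a[i] + a[i-1]: two array reads per iteration."""
    reads = 0
    out = []
    for i in range(1, len(a)):
        out.append(2 * a[i] + a[i - 1])
        reads += 2
    return out, reads


def register_pipelined_loop(a):
    """Keeps the previous element in a 'register' r2: one array read per iteration."""
    r2 = a[0]  # initial load before the loop
    reads = 1
    out = []
    for i in range(1, len(a)):
        r1 = a[i]
        reads += 1
        out.append(2 * r1 + r2)
        r2 = r1  # the value is passed on instead of being re-read from memory
    return out, reads
```

For an 11-element array the computed values are identical, while the number of array reads drops from 20 to 11, which is where the energy saving comes from.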
11 -
Using scratch pad memories (SPM)
Hierarchy: main memory and SPM are both part of the address space; the scratch pad memory needs no tag memory
Example: ARM7TDMI processor cores, well-known for low power consumption
11 -
Very limited support in ARMcc-based tool flows
Use pragma in C-source to allocate to specific section:
For example:
#pragma arm section rwdata = "foo", rodata = "bar"
int x2 = 5; // in foo (data part of region)
int const z2[3] = {1,2,3}; // in bar
Input scatter loading file to linker for allocating section to specific address range
11 -
Global optimization model (U. Dortmund)
Which memory object (array, loop, etc.) is to be stored in the SPM?
Non-overlaying ("static") allocation:
Gain $g_k$ and size $s_k$ for each segment $k$. Maximise gain $G = \sum_k g_k$, respecting the size of the SPM: $S_{SP} \ge \sum_k s_k$.
Solution: knapsack algorithm.
Overlaying ("dynamic") allocation:
Moving objects back and forth
[Figure: example program (for i ..., for j ..., while ..., repeat, call ..., Array ..., Int ...) with processor, main memory and scratch pad memory of capacity SSP]
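The static allocation problem above is a 0/1 knapsack, which can be sketched with dynamic programming. The function name, the tuple layout and the sample gains and sizes are assumptions made for illustration; they do not come from the slides.

```python
def allocate_to_spm(segments, capacity):
    """0/1 knapsack over memory objects.

    segments: list of (name, gain, size) with integer sizes.
    Returns (total_gain, list_of_chosen_names) for an SPM of the given capacity.
    """
    # best[c] holds the best (gain, chosen names) using at most c units of SPM
    best = [(0, [])] * (capacity + 1)
    for name, gain, size in segments:
        new_best = list(best)
        for c in range(size, capacity + 1):
            candidate_gain = best[c - size][0] + gain
            if candidate_gain > new_best[c][0]:
                new_best[c] = (candidate_gain, best[c - size][1] + [name])
        best = new_best  # each segment is considered at most once
    return best[capacity]
```

With the hypothetical segments ("array_a", 60, 10), ("loop_b", 100, 20), ("func_c", 120, 30) and a capacity of 50, the sketch selects loop_b and func_c for a total gain of 220.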
11 - 8
ILP representation: migrating functions and variables
Symbols:
S(var_k) = size of variable k
n(var_k) = number of accesses to variable k
e(var_k) = energy saved per variable access, if var_k is migrated
E(var_k) = energy saved if variable var_k is migrated (= e(var_k) n(var_k))
x(var_k) = decision variable, = 1 if variable k is migrated to SPM, = 0 otherwise
K = set of variables
11 -
Reduction in energy and average run-time
Feasible with standard compiler & postpass optimization
[Figure: cycles [x100] and energy for Multi_sort (mix of sort algorithms)]
Measured processor / external memory energy + CACTI values for SPM (combined model)
Numbers will change with technology, algorithms remain unchanged.
11 - 10
Allocation of basic blocks
Fine-grained granularity smoothens the dependency on the size of the scratch pad.
Requires additional jump instructions to return to main memory.
Statically 2 jumps, but only one is taken.
[Figure: jump instructions inserted between main memory and scratch pad for consecutive basic blocks]
11 - 11
Allocation of basic blocks, sets of adjacent basic blocks and the stack
Requires generation of additional jumps (special compiler)
[Figure: cycles [x100] and energy]
11 - 1
Savings for memory system energy alone
aiT:
WCET analysis tool
support for scratchpad memories by specifying different memory access times
also features experimental cache analysis for ARM7
11 - 14
Architectures considered
ARM7TDMI with 3 different memory architectures:
Main memory
LDR-cycles: (CPU,IF,DF) = (3,2,2)
STR-cycles: (2,2,2)
(1,2,0)
Main memory + unified cache
LDR-cycles: (CPU,IF,DF) = (3,12,6)
STR-cycles: (2,12,3)
(1,12,0)
Main memory + scratch pad
LDR-cycles: (CPU,IF,DF) = (3,0,2)
STR-cycles: (2,0,0)
(1,0,0)
11 - 1
Results for ...
Using Scratchpad: Using Unified Cache:
References:
• Wehmeyer, Marwedel: Influence of On-chip Scratchpad Memories on WCET: 4th Intl Workshop on worst-case execution time (WCET) analysis, Catania, Sicily, Italy, June 29, 2004
• Second paper on SP/Cache and WCET at DATE, March 2005
11 - 1
Multiple scratch pads
11 - 1
Optimization for multiple scratch pads
Minimize $C = \sum_j e_j \cdot \sum_i x_{j,i} \cdot n_i$
subject to $\forall i: \sum_j x_{j,i} = 1$
11 - 1
e lt or art o
oder de oder
Working set
11 - 0
Dynamic replacement within scratch pad
Effectively results in a kind of compiler-controlled segmentation/paging for SPM
Address assignment within SPM required (paging or segmentation-like)
[Figure: CPU, SPM and memory]
Solution: A → SP & T3 → SP
11 -
Summary
High-level transformations
Loop nest splitting
Array folding
Impact of memory architecture on execution times &
energy.
The SPM provides
Runtime efficiency
Energy efficiency
Timing predictability
Achieved savings are sometimes dramatic, for example:
savings of of the memory system energy
11 -
Hardware/Software Codesign
Performance Estimation
doc. dr. Gregor Papa
1 -
Outline
Overview
Performance Metrics
Subsystems
Abstraction Levels
1 -4
Performance Estimation: Global Picture
[Figure: performance estimation characterized by method, metric, abstraction level and subsystem to analyze]
PERFORMANCE ESTIMATION METHOD: analytic, simulation, statistic
METRIC: Time, Power, Area, Cost, Other (Quality, SNR, …)
ABSTRACTION LEVEL: low-level (e.g. RTL, ISA), intermediary level (e.g. TLM, OS), high-level (e.g. functional, HLL)
SUBSYSTEM TO ANALYZE: HW subsystem, CPU subsystem, SW subsystem, interconnect subsystem, MPSoC
Analytic example: x(y) = x0 * exp(-k0*y), x0 = 105, k0 = 1.2593
Note:
RTL: Register Transfer Level
ISA: Instruction Set Architecture
TLM: Transaction-Level Model
OS: Operating System
HLL: High-Level Language
Position in the System Design Flow
High-level (functional) estimation
Advantages: short simulation time, no details of implementation necessary
Drawbacks: limited accuracy, e.g. no information about timing
[Figure: system design flow from specification through mapping and partitioning to refinement]
1 -
e o t e ti atio
1 -
Outline
Overview
Performance Metrics
Subsystems
Abstraction Levels
1 -8
Performance Metrics
Performance metric = function defined on relevant non-functional properties of a system which indicates a quantitative performance of the system.
Time [second]
for example end-to-end delay, throughput, latency
Area [mm²]
for example area of an integrated circuit
Cost [$]
for example cost of parts, labor, development cost
Other metrics:
SNR (signal to noise ratio), quality of the video image/sound, size of the hardware platform
Usually, performance metrics are conflicting.
Examples of Performance Trade-offs
Mapping Domain
change the mapping of the application to the architecture
→ see example 1
Architecture Domain
change the hardware platform
→ see example 2
Application Domain
change the application implementation (e.g. degree of parallelization, partitioning into concurrent processes, use of different algorithms with a similar functional behavior)
Example 1: Trade-offs in the Mapping Domain
[Figure: alternative mappings and the resulting worst bus load]
Example 2: Trade-offs in the Hardware Platform
Overview
Performance Metrics
Subsystems
Abstraction Levels
System Composition
[Figure: communication templates and computation templates composed via interfaces; scheduling and arbitration policies: dynamic, static, fixed priority]
Why Is Estimation Difficult?
Computation and Communication
(Non-deterministic) computations in processing nodes
(Non-deterministic) communication delays
Complex resource interaction via scheduling and arbitration policies
Uncertain environment
Different load scenarios
Unknown (worst case) inputs
Illustration of Evaluation Difficulties
[Figure: input stream (events a b a c c b) processed by a task on a processor with a buffer]
Task communication
Task scheduling
Complex input:
Timing (jitter, bursts)
Different event types
Overview
Performance Metrics
Subsystems
Abstraction Levels
Brief History in Abstraction
[Figure: abstraction levels and their tools over time: transistor model (t=RC), gate level model, register-transfer level model (critical path latency), and abstract cluster models with SW tasks, communication interfaces and an on-chip communication network; the corresponding tools range from transistor-level simulators to SystemC, HW/SW codesign/cosimulation tools and formal methods]
Outline
Overview
Performance Metrics
Subsystems
Abstraction Levels
System-Level Performance Estimation Methods
e.g. delay
Worst-Case
Best-Case
12 - 26
Overview
System
How to evaluate?
12 - 2
Performance Estimation Methods
[Figure: the estimation tool (method) combines a model of the application, a model of the environment, a model of the architecture and the system model; these are derived from designers' experience, component simulation, input traces, data sheets, the specification of inputs and platform benchmarks; the output is the estimation results]
12 - 2
Analytic Models
Static analytic (symbolic) models
Describe computing, communication, and memory resources by algebraic equations, e.g.
$\text{delay} = \left\lceil \frac{\#\text{words}}{\text{burst\_size}} \right\rceil \cdot \text{comm\_time}$
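As a minimal sketch, the algebraic delay model above can be evaluated directly. The function and parameter names are assumptions mirroring the symbols in the equation, and comm_time is taken to be the time for one burst transfer.

```python
def comm_delay(num_words, burst_size, comm_time):
    """delay = ceil(num_words / burst_size) * comm_time, using integer arithmetic."""
    if burst_size <= 0:
        raise ValueError("burst_size must be positive")
    bursts = -(-num_words // burst_size)  # ceiling division without floats
    return bursts * comm_time
```

For example, transferring 10 words with a burst size of 4 needs 3 bursts, so with a hypothetical comm_time of 5 time units the estimated delay is 15 time units.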
12 - 2
Dynamic Analytic Models
Combination between
Static models possibly extended by non-determinism in run-time and event processing
Dynamic models for describing e.g. resource sharing mechanisms (scheduling and arbitration).
Existing approaches:
queuing theory (statistical bounds)
non-deterministic models (worst case / best case behavior)
12 -
Example - Queuing Systems
Clients request some service from a server over a network.
Performance of the server
Performance of the network
12 - 1
Stochastic Models - Queuing Systems
A queuing system is described by
Arrival rate
Service mechanism
Queuing discipline
Performance measures
average delay in queue
• Customer point of view
time-average number of customers in queue.
• System point of view
proportion of time server is busy
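The three performance measures can be made concrete for the simplest stochastic queue. The sketch below assumes an M/M/1 system (Poisson arrivals, exponential service times, one server, FIFO discipline); this specific model and the function name are assumptions, since the slides do not name a particular queue type.

```python
def mm1_measures(arrival_rate, service_rate):
    """Steady-state measures of an M/M/1 queue (assumed model)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival_rate must be below service_rate")
    rho = arrival_rate / service_rate  # proportion of time the server is busy
    return {
        "avg_delay_in_queue": rho / (service_rate - arrival_rate),
        "avg_number_in_queue": rho ** 2 / (1 - rho),
        "server_utilization": rho,
    }
```

With an arrival rate of 2 and a service rate of 4 customers per time unit, the server is busy half of the time, the average delay in the queue is 0.25 time units and the time-average queue length is 0.5 customers.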
12 - 32
Nondeterministic Models - Queuing Systems
A queuing system is described by
Arrival function (bounds on arrival times)
Service functions (bounds on server behavior)
Resource interaction
Performance measures
worst case delay in queue
worst-case number of customers in queue.
worst-case and best-case end-to-end delay in the system
12 - 33
Simulation
Consider the underlying hardware platform and the mapping
of the application onto that architecture
Combine functional simulation and performance data
Evaluate average-case behavior for one simulation scenario
[Figure: input trace → model (application, hardware platform, mapping) → output trace]
12 - 3
Example: Trace-Based Simulation
Abstract simulation at system-level without timing
Faster than simulation, but still based on a single input trace
Abstraction
Application - represented by abstract execution traces → graph of events: read, write, and execute
Architecture - represented by “virtual machines” and “virtual channels” including non-functional properties (timing, power, energy)
Steps
Execution trace determined by functional application simulation
Extension of the event graph by non-functional properties
Simulation of the extended model
[Figure: functional application model → complete trace → abstract event graph]
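The three steps can be sketched on a toy trace. This is a simplification under stated assumptions: the event graph is reduced to a sequential list of (kind, amount) events, and the cost model, its numbers and all function names are hypothetical.

```python
def annotate_trace(trace, cost_model):
    """Step 2: extend each event by non-functional properties (time, energy)."""
    annotated = []
    for kind, amount in trace:
        cost = cost_model[kind]
        annotated.append({
            "kind": kind,
            "time": cost["time"] * amount,
            "energy": cost["energy"] * amount,
        })
    return annotated


def simulate(annotated):
    """Step 3: simulate the extended model; here events simply run one after another."""
    total_time = sum(event["time"] for event in annotated)
    total_energy = sum(event["energy"] for event in annotated)
    return total_time, total_energy


# Step 1 (assumed given): an execution trace from functional application simulation
example_trace = [("read", 4), ("execute", 10), ("write", 2)]
example_costs = {
    "read": {"time": 2, "energy": 3},
    "execute": {"time": 1, "energy": 1},
    "write": {"time": 3, "energy": 4},
}
```

A real trace-based simulator would keep the dependencies of the event graph and map events onto virtual machines and virtual channels; the sequential sum above only shows how the annotated non-functional properties enter the estimate.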