HW SW Codesign

Hardware/Software Codesign
0. Organization
doc. dr. Gregor Papa
Jožef Stefan International Postgraduate School

0-1
Overview
Administration
Course synopsis
Introduction and motivation
0-2
Organization (1)
Lecture: introductory course + consultations
Exercises: delivered during consultations
Contact: Gregor Papa

gregor.papa@ijs.si
01-477-3514
Web page: http://csd.ijs.si/papa/courses.php
0-3
Organization (2)
Course materials:
slide copies, exercise sheets, papers
the slides contain material from Marco Platzner, Peter
Marwedel, Lothar Thiele, Frank Vahid, Reinhard Wilhelm
References:
P. Marwedel: Embedded System Design, Springer, 2006.
F. Vahid, T. Givargis: Embedded System Design: A Unified
Hardware/Software Introduction, John Wiley & Sons, 2002.
Exam: written seminar + oral, Slovenian or English
0-4
Textbook & slides
course based
on the book and the slides
“Embedded System Design” by
Peter Marwedel
on the slides “Hardware/Software

Codesign” by Lothar Thiele
0-5
Overview
Administration
Course synopsis
0-6
Course Synopsis
Different Levels of Model Representation
Specifications
Models
Abstraction Levels
Dealing with Contradictory Constraints
Exploration
Simulation
• Worst-Case Execution Time
Optimization
Hardware/Software Mapping
Partitioning
Scheduling
Allocation
Software Code Optimizations
Compilation
Estimation
0-7
Benefits ? Learn about …
… challenges and approaches in modern system design
… useful optimization methods
… performance estimation of embedded systems
… a current research area
0-8
Overview
Administration
Course synopsis
0-9
What is HW/SW Codesign?
... integrated design of systems that consist of hardware and software components
Analysis of HW/SW boundaries and interfaces

Evaluation of design alternatives
0 - 10
Hardware/Software Boundaries
General purpose systems (PC, workstation)
processor design:
processor ↔ compiler, operating system
Embedded systems (cell phone, automotive electronics)
design of specialized processors:
processor ↔ compiler, operating system
system design:
processors ↔ dedicated hardware devices
0 - 11
Target Architectures
0 - 12
Why Codesign? (1)
Modern embedded systems require “design” optimization
many functions, great variability, high flexibility
heterogeneous target systems
• processors, ASICs, FPGAs, systems-on-chip, …
many design goals
• performance, cost, power consumption, reliability, ...
Advances in formal / automated design methods

automation on the system level becomes possible
reduction of cost and time-to-market
0 - 13
Why Codesign? (2)
Optimization of the “design process”
classic design co-design
0 - 14
Codesign methodologies
Different Levels of Model Representation
Dealing with Contradictory Constraints
Hardware/Software Mapping
Software Code Optimizations
Estimation
0 - 15
System Design
0 - 16
System Design
0 - 17
Motivation (1)
According to forecasts, the future of IT is characterized by terms such as
Disappearing computer,
Ubiquitous computing,
Pervasive computing,
Ambient intelligence,
Post-PC era,
Cyber-physical systems.
Basic technologies:
Embedded Systems
Communication technologies
0 - 18
Motivation (2)
“Information technology (IT) is on the verge of another revolution. …..

networked systems of embedded computers ... have the potential to change
radically the way people interact with their environment by linking together a
range of devices and sensors that will allow information to be collected,
shared, and processed in unprecedented ways. ...
The use … throughout society could well dwarf previous milestones in the
information revolution.”
Source: Edward A. Lee, UC Berkeley, ARTEMIS Embedded Systems Conference, Graz, 5/2006
0 - 19
Embedded Systems & Cyber-Physical
Systems
“Dortmund“ Definition: [Peter Marwedel]
Information processing systems embedded into a larger
product
Berkeley: [Edward A. Lee]:

Embedded software is software integrated with physical processes. The technical problem is managing time and concurrency in computational systems.
⇒ Definition: Cyber-Physical (cy-phy) Systems (CPS) are integrations of computation with physical processes [Edward Lee, 2006].
0 - 20
Embedded Systems and ubiquitous
computing
Ubiquitous computing: Information anytime, anywhere.
Embedded systems provide fundamental technology.
Communication Technology:
• Optical networking
• Network management
• Distributed applications
• Service provision
• UMTS, DECT, Hiperlan, ATM
Embedded Systems:
• Robots
• Control systems
• Feature extraction and recognition
• Sensors/actors
• A/D-converters
Common to both: Dependability, Quality of service, Real-time
Pervasive/Ubiquitous computing, Distributed systems, Embedded web systems
0 - 21
Growing importance of embedded systems
Spending on GPS units exceeded $100 mln during Thanksgiving week, up 237%
from 2006 … More people bought GPS units than bought PCs, NPD found.
[www.itfacts.biz, Dec. 6th, 2007]
…, the market for remote home health monitoring is expected to generate $225
mln revenue in 2011, up from less than $70 mln in 2006, according to Parks
Associates. . [www.itfacts.biz, Sep. 4th, 2007]
According to IDC the identity and access management (IAM) market in Australia
and New Zealand (ANZ) … is expected to increase at a compound annual growth
rate (CAGR) of 13.1% to reach $189.3 mln by 2012 [www.itfacts.biz, July 26th, 2008].
Accessing the Internet via a mobile device up by 82% in the US, by 49% in
Europe, from May 2007 to May 2008 [www.itfacts.biz, July 29th, 2008]
0 - 22
Automotive electronics
Functions by embedded processing:
• ABS: Anti-lock braking systems
• ESP: Electronic stability control
• Airbags
• Efficient automatic gearboxes
• Theft prevention with smart keys
• Blind-angle alert systems
• ... etc ...
Multiple networks
• Body, engine, telematics, media, safety
Multiple processors
• Up to 100
• 8-bit – door locks, lights, etc.
• 16-bit – most functions
• 32-bit – engine control, airbags
Processing where the action is
• Sensors and actuators distributed all over the vehicle
• Networked together
0 - 23
Avionics
Flight control systems,

anti-collision systems,
pilot information systems,
power supply system,
flap control system,
entertainment system,
…
Dependability is of utmost importance.
0 - 24
Railways
Safety features
contribute significantly
to the total value of
trains, and dependability
is extremely important
0 - 25
Telecommunication
Mobile phones have been one of the fastest growing markets in recent years:
• Multiprocessor
• 8-bit/32-bit for UI
• DSP for signals
• 32-bit in IR port
• 32-bit in Bluetooth
• 8-100 MB of memory
• All custom chips
• Power consumption & battery life depend on software
base stations
• Massive signal processing
• Several processing tasks per connected
mobile phone
• Based on DSPs
• Standard or custom
• 100s of processors
Geo-positioning systems,
Fast Internet connections,
Closed systems for police, ambulances, rescue staff.
0 - 26
Medical systems
For example:
• Artificial eye: several approaches,
e.g.:
• Camera attached to glasses;
computer worn at belt; output
directly connected to the brain,
“pioneering work by William
Dobelle”. Previously at
[www.dobelle.com]
Translation into sound; claiming much

better resolution.
[http://www.seeingwithsound.com/etumble.htm]
0 - 27
Extremely Large
Functions requiring computers:
Radar
Weapons
Damage control
Navigation
basically everything
Computers:
Large servers
1000s of processors
0 - 28
Inside your PC
Custom processors
Graphics, sound
32-bit processors
IR, Bluetooth
Network, WLAN
Harddisk
RAID controllers
8-bit processors
USB
Keyboard, mouse
0 - 29
Authentication systems
Finger print sensors

Access control
Airport security systems
Smartpen®
Smart cards
….
0 - 30
Consumer electronics
Examples
0 - 31
Industrial automation
Examples
0 - 32
Forestry Machines
Networked computer system
Controlling arms & tools
Navigating the forest
Recording the trees harvested
Crucial to efficient work
Operator panel
Graphical display
Touch panel
Joystick
Buttons
Keyboard
“Tough enough to be out in the woods”
0 - 33
© Jakob Engblom
Smart buildings
Examples
Integrated cooling,
lighting, room
reservation, emergency
handling,
communication
Goal: “Zero-energy
building”
0 - 34
Robotics
“Pipe-climber”
Robot “Johnnie“
Lego mindstorms
Standard controller
• 8-bit processor
• 64 kB of memory
Electronics to interface
to motors and sensors
0 - 35
Estimation
Hardware, software and system as a whole suitability
0 - 36
Hardware/Software Codesign
1. Introduction
doc. dr. Gregor Papa
Jožef Stefan International Postgraduate School

-
Contents
Levels of Abstraction in Electronic System Design
Typical Design Flow of Hardware-Software Systems
-
Embedded Systems
Embedded systems (ES) = information processing systems embedded into a larger product
Examples:
Main reason for buying is not information processing
-3
Embedded Systems
external process
human interface
sensors, actuators
embedded system
-
Parallel and Distributed Target Platforms
ABS
ASR
ACC ESP
engine
control powertrain
control
-
Example: Cell Processor
Cell Processor (IBM) combines a general-purpose architecture core with coprocessing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation
-6
Communicating Embedded Systems
sensor networks (civil engineering, buildings, environmental monitoring, traffic, emergency situations)
smart products, wearable/ubiquitous computing
-
Trends in Information and Communication
Centralized Systems
Networked Systems
Large-scale Distributed Systems
Internet
New Applications and System Paradigms
-
Comparison
Embedded Systems:
Few applications that are known at design-time
Not programmable by end user
Fixed run-time requirements (additional computing power not useful)
Criteria:
• cost
• power consumption
• predictability
• meeting time bounds
General Purpose Computing:
Broad class of applications
Programmable by end user
Faster is better
Criteria:
• cost
• average speed
-
Design Challenges
increasing application complexity even in standard and large volume products
• large systems with legacy functions
• mixture of event driven and data flow tasks
• examples: multimedia, automotive, mobile communication
increasing target system complexity
• mixture of different technologies, processor types, and design styles
• large systems-on-a-chip combining components from different sources, distributed system implementations
numerous constraints and design objectives
• examples: cost, power consumption, timing constraints, dependability
- 0
Challenges for Embedded Software
Dynamic environments
Capture the required behaviour
Validate specifications
Efficient translation of specifications into implementations
How can we check that we meet real-time constraints?
How do we validate embedded real-time software? (large volumes of data, testing may be safety-critical)
-
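Implementation Alternatives
(ordered from performance and power efficiency towards flexibility)
General-purpose processors
Application-specific instruction set processors (ASIPs)
• Microcontroller
• DSPs (digital signal processors)
Programmable hardware
• FPGA (field-programmable gate arrays)
Application-specific integrated circuits (ASICs)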
-
Dependability
ES must be dependable:
Reliability R(t) = probability of system working correctly provided that it was working at t=0
Maintainability M(d) = probability of system working correctly d time units after error occurred
Availability A(t) = probability of system working at time t
Safety: no harm to be caused
Security: confidential and authentic communication
Even perfectly designed systems can fail if the assumptions about the workload and possible errors turn out to be wrong.
Making the system dependable must not be an after-thought; it must be considered from the very beginning.
- 3
Efficiency
ES must be efficient
Code-size efficient (especially for systems on a chip)
Run-time efficient
Weight efficient
Cost efficient
Energy efficient
-
Real-time Constraints
Many ES must meet real-time constraints
A real-time system must react to stimuli from the controlled object (or the operator) within the time interval dictated by the environment.
For real-time systems, right answers arriving too late are wrong.
-
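A time constraint is called hard if not meeting that constraint could result in a catastrophe [Kopetz, 1997].
All other time-constraints are called soft.
A guaranteed system response has to be explained without statistical arguments.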
-
Real-Time Systems
Embedded and Real-Time: Synonymous?
Most embedded systems are real-time
Most real-time systems are embedded
© Jakob Engblom
- 6
Reactive & Hybrid Systems
Typically, ES are reactive systems:
Behavior depends on input and current state
⇒ automata model appropriate, model of computable functions inappropriate
Hybrid systems (analog + digital parts)
-
Dedicated Systems
Dedicated towards a certain application:
Knowledge about behavior at design time can be used to minimize resources and to maximize robustness
Dedicated user interface (no mouse, keyboard and screen)
-
Contents
What is an Embedded System?
Typical Design Flow of Hardware-Software Systems
-
Abstraction, Models and Synthesis
Model: Formal description of selected properties of a system or subsystem
A model consists of data and associated methods
Degree of abstraction, granularity:
• system, architecture, logic, transistor, ...
• module, block, function, ...
View:
• behavior, structural, physical
Synthesis: Linking adjacent levels of abstraction (refinement)
Stepwise adding of structural information
- 0
so st a tions
Beha ior
st m
Process odule rchitecture
unction R L
at mo s Structure
it mo s
t o i it mo s
i mo s
a o t mo s
-
Contents
What is an Embedded System?
Levels of Abstraction in Electronic System Design
-
Design Approaches
Definition: Synthesis is the process of generating the description of a system in terms of related lower-level components from some high-level description of the expected behavior.
⇒ “describe-and-synthesize” paradigm by [Gajski, 94]
In contrast to the traditional “specify-explore-refine” approach, also known as “design-and-simulate” approach:
Manual design steps are more error-prone than automatic synthesis and, therefore, simulation is more important.
- 3
System Design
Specification
System Synthesis Estimation
SW-Compilation Instruction Set HW-Synthesis
Intellectual Intellectual
Prop. Code Prop. Block
Machine Code Net lists

-
i o sso it t
Specification
Prop ode Prop Block
achine ode et lists

-
i ation ii o
Specification
Prop ode Prop Block
achine ode et lists

- 6
i ation ii nst tion t o sso
Specification
Prop ode Prop Block
achine ode et lists

-
System Design
- is a comple synthesis tasks
software synthesis and code generation
hardware synthesis
interface and communication synthesis
hardware/software partitioning and component selection
hardware/software scheduling
:
application specification
design space exploration and system optimization
estimation
-
a in o m
-
a in an in
Partitioning of system function to programmable components (software), hard-wired or parameterized components (hardware) or application specific instruction set processors
Similarity to scheduling and load distribution problem in real-time operating systems:
time constraints, context switch and context switch overhead, process synchronization and communication
Differences to real-time operating systems:
larger design space with very different solutions
high optimization requirements (motivation for hardware design)
underlying hardware is not fixed
- 30
a in an in
Similarity to allocation or load distribution problem in high-level synthesis or real-time operating systems
dedicated
P2 HWcomponents
P1 P4
P3
SW
(processors)
-3
Estimation
The principle of synthesis based on abstraction only makes sense if there are estimation methods available:
Estimate properties of the next layer(s) of abstraction
Design decisions are based on these estimated properties
If the estimation is not correct (or not accurate enough), the design will be sub-optimal or even not working correctly
si n a i im in
E o ation a st a tion o
Estimation o
si n
o a
o ti s si n a
E o ation
si n a o
E o ation a st a tion
-3
Hardware/Software Codesign
2. Specification and Models of Computation
doc. dr. Gregor Papa
Jožef Stefan International Postgraduate School

-
System Design
Specification
SW-Compilation Instruction Set HW-Synthesis
Intellectual Intellectual
Prop. Code Prop. Block
Machine Code Net lists

2-2
Consider a simple example
The Observer pattern defines a one-to-many dependency between a subject object and any number of observer objects so that when the subject object changes state, all its observer objects are notified and updated automatically.
Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides: Design Patterns, Addison-Wesley
2-
Example: Observer pattern in Java
    public void addListener(listener) { ... }
    public void setValue(newValue) {
        myValue = newValue;
        for (int i = 0; i < myListeners.length; i++) {
            myListeners[i].valueChanged(newValue);
        }
    }
Will this work in a multithreaded context?
2-
Observer pattern with mutexes
    public synchronized void addListener(listener) { ... }
    public synchronized void setValue(newValue) {
        myValue = newValue;
        // notification loop as on the previous slide
    }
Javasoft recommends against this.
What's wrong with it?
2-
Mutexes using monitors are minefields
    public synchronized void setValue(newValue) {
        myValue = newValue;
        // notification loop as on the previous slide
    }
[Figure: valueChanged() requests a lock that is held by a thread which calls addListener() and waits for the mutex]
valueChanged() may attempt to acquire a lock on some other object and stall. If the holder of that lock calls addListener(): deadlock!
2-
Simple observer pattern gets complicated
    public void setValue(newValue) {
        synchronized(this) {
            myValue = newValue;
            // while holding lock, make a copy of listeners to avoid race conditions
            listeners = myListeners.clone();
        }
        // notify each listener outside of the synchronized block to avoid deadlock
        for (int i = 0; i < listeners.length; i++) {
            listeners[i].valueChanged(newValue);
        }
    }
This still isn't right.
What's wrong with it?
2-
Simple observer pattern: How to make it right?
    public void setValue(newValue) {
        synchronized(this) {
            myValue = newValue;
            listeners = myListeners.clone();
        }
        for (int i = 0; i < listeners.length; i++) {
            listeners[i].valueChanged(newValue);
        }
    }
Suppose two threads call setValue(). One of them will set the value last, leaving that value in the object, but listeners may be notified in the opposite order. The listeners may be alerted to the value-changes in the wrong order!
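The copy-under-lock idea from the last two slides can be tried out in a few lines of Python. This is only a sketch under stated assumptions: the class name `ValueHolder` and its method names are hypothetical, and the code mirrors the slide's "still not right" version rather than a final fix.

```python
import threading


class ValueHolder:
    """Subject of the observer pattern (hypothetical name, for illustration)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._listeners = []
        self._value = None

    def add_listener(self, listener):
        with self._lock:
            self._listeners.append(listener)

    def set_value(self, new_value):
        with self._lock:
            self._value = new_value
            # Copy the listener list while holding the lock (avoids races).
            listeners = list(self._listeners)
        # Notify outside the lock (avoids the deadlock described above).
        for listener in listeners:
            listener(new_value)


def demo():
    seen = []
    holder = ValueHolder()
    holder.add_listener(seen.append)
    holder.set_value(1)
    holder.set_value(2)
    return seen
```

With a single caller the notifications arrive in assignment order. With two threads calling `set_value` concurrently, the notification loops can interleave, which is exactly the ordering problem the slide points out.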
2-
Problems with thread-based concurrency
Nontrivial software written with threads, semaphores, and mutexes is incomprehensible to humans.
⇒ Search for non-thread-based models: which are the requirements for appropriate specification techniques?
2-
Contents
StateCharts
Data-Flow Models
2-
Requirements for Specification Techniques
Humans not capable to understand systems containing more than a few objects.
Most actual systems require more objects
⇒ Hierarchy
Behavioral hierarchy. Examples: states, processes, procedures.
Structural hierarchy. Examples: processors, racks, printed circuit boards
2-
e irements or Speci ication ec ni es
-
Required for reactive systems.
-
Components send streams of data
to each other.
o o stac es or
2- 2
Models of Computation: Definition
What does it mean, “to compute”?
Models of computation define:
Components and an execution model for computations for each component
Communication model for exchange of information between components.
• Shared memory
• Message passing
Shared memory
Potential race conditions ⇒ inconsistent results possible
⇒ Critical sections = sections at which exclusive access to resource r (e.g. shared memory) must be guaranteed.
    process a {                     process b {
      ..                              ..
      P(S) // obtain lock             P(S) // obtain lock
      ..   // critical section        ..   // critical section
      V(S) // release lock            V(S) // release lock
    }                               }
Race-free access to shared memory protected by S possible
This model may be supported by:
mutual exclusion for critical sections
cache coherency protocols
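The P(S)/V(S) scheme above can be sketched with Python's standard `threading.Semaphore`. The function name `run_processes` and the shared counter are assumptions made for illustration; they are not part of the slides.

```python
import threading


def run_processes(iterations=10000):
    """Two 'processes' increment a shared counter inside a critical section."""
    S = threading.Semaphore(1)   # binary semaphore guarding the shared memory
    shared = {"counter": 0}

    def process():
        for _ in range(iterations):
            S.acquire()              # P(S): obtain lock
            shared["counter"] += 1   # critical section
            S.release()              # V(S): release lock

    threads = [threading.Thread(target=process) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["counter"]
```

Because every read-modify-write of the counter happens between P(S) and V(S), the result does not depend on how the two threads are interleaved.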
2-
Non-blocking/asynchronous message passing
Sender does not have to wait until message has arrived;
potential problem: buffer overflow
send() receive()
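The buffer-overflow risk of non-blocking sends can be made visible with a bounded queue from Python's standard library. The helper name `send_all` and the policy of recording overflowing messages as lost are assumptions for this sketch.

```python
import queue


def send_all(messages, capacity):
    """Send without ever waiting; report what fit and what overflowed."""
    channel = queue.Queue(maxsize=capacity)
    lost = []
    for message in messages:
        try:
            channel.put_nowait(message)   # sender never blocks
        except queue.Full:
            lost.append(message)          # buffer overflow
    received = []
    while not channel.empty():
        received.append(channel.get_nowait())
    return received, lost
```

With five messages and a capacity of three, the first three are delivered and the last two overflow. A blocking `put()` would instead stall the sender until the receiver catches up.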
2-
Blocking/synchronous message passing
Sender will wait until receiver has received message
send() receive()
2-
Synchronous message passing: CSP
CSP (communicating sequential processes) [Hoare], rendez-vous-based communication:
Example:
    process A                process B
      ..                       ..
      var a ...                var b ...
      ...                      ...
      c!a; -- output           c?b; -- input
    end                      end
2-
Components
Von Neumann model
Sequential execution, program memory etc.
Discrete event model

ueue
a
time
c a c a a action
2-
Components
Finite state machines
Differential equations
$\frac{\partial^2 x}{\partial t^2} = b$
2-
Example: Discrete Event: VHDL
VHDL (hardware description language) is commonly used as a design-entry language for digital circuits.
2-2
Sensitivity lists in VHDL
Sensitivity lists are a shorthand for a single wait on-statement at the end of the process body:
    process (x, y)
    begin
      prod <= x and y;
    end process;
is equivalent to
    process
    begin
      prod <= x and y;
      wait on x, y;
    end process;
2-2
No language that meets all language requirements
⇒ using compromises
2 - 22
Contents
Models of Computation
Data-Flow Models
2-2
Classical Automata
Classical automata: input X, internal state Z, output Y, clock
Next state Z⁺ computed by function G
Output computed by function O
Moore- + Mealy automata = finite state machines (FSMs)
Moore-automata: Y = O(Z); Z⁺ = G(X, Z)
Mealy-automata: Y = O(X, Z); Z⁺ = G(X, Z)
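The difference between the two automaton types can be shown with a small simulation. The function names and the example machine (state = previous input bit) are assumptions chosen for illustration; only the equations Y = O(Z), Y = O(X, Z) and Z⁺ = G(X, Z) come from the slide.

```python
def run_moore(G, O, z0, inputs):
    """Moore automaton: the output depends on the state only."""
    z, outputs = z0, []
    for x in inputs:
        outputs.append(O(z))      # Y = O(Z)
        z = G(x, z)               # Z+ = G(X, Z)
    return outputs


def run_mealy(G, O, z0, inputs):
    """Mealy automaton: the output depends on input and state."""
    z, outputs = z0, []
    for x in inputs:
        outputs.append(O(x, z))   # Y = O(X, Z)
        z = G(x, z)               # Z+ = G(X, Z)
    return outputs


# Example machine: the state remembers the previous input bit.
remember_input = lambda x, z: x
delayed = run_moore(remember_input, lambda z: z, 0, [0, 1, 1, 0, 1])
rising = run_mealy(remember_input, lambda x, z: int(x == 1 and z == 0), 0, [0, 1, 1, 0, 1])
```

Here the Moore machine reproduces the input delayed by one step, while the Mealy machine reacts in the same step and flags rising edges.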
2-2
StateCharts
Classical automata not useful for complex systems (complex graphs cannot be understood by humans).
⇒ StateCharts [Harel, 1987]
2-2
Introducing Hierarchy
FSM will be in exactly one of the substates of S if S is active
(either in A or in B or ..)
2-2
Definitions
Current states of FSMs are also called active states.
States which are not composed of other states are called basic states.
States containing other states are called super-states.
For each basic state s, the super-states containing s are called ancestor states.
Super-states S are called OR-super-states, if exactly one of the sub-states of S is active whenever S is active.
[Figure labels: superstate, ancestor state of E, substates]
2-2
Default State Mechanism
Try to hide internal structure from outside world!
⇒ Default state
Filled circle indicates sub-state entered whenever super-state is entered.
Not a state by itself!
2-2
History Mechanism
(behavior different from last slide)
For input m, S enters the state it was in before S was left (can be A, B, C, D, or E). If S is entered for the very first time, the default mechanism applies.
History and default mechanisms can be used hierarchically.
2-2
Combining History and Default State
same meaning
2-
Concurrency
Convenient ways of describing concurrency are required.
AND-super-states: FSM is in all (immediate) sub-states of a super-state.
2-
Entering and Leaving AND-Super-States
incl.
Line-monitoring and key-monitoring are entered and left, when

service switch is operated.
2 - 32
ree representati n state sets
basic -super-state -super-state
state
F
F M
L
M
L
2 - 33
Computation of state sets
Computation of state sets by traversing the tree from leaves to root:
basic states: state set = state
OR-super-states: state set = union of children
AND-super-states: state set = Cartesian product of children
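The leaves-to-root rule can be written as a short recursive function. The tuple encoding `(kind, name, children)` and the example tree are assumptions for this sketch, not the tree drawn on the slide.

```python
from itertools import product


def state_set(node):
    """Return the set of possible active states below `node`."""
    kind, name, children = node
    if kind == "basic":
        return {name}                     # state set = state
    child_sets = [state_set(child) for child in children]
    if kind == "or":
        return set().union(*child_sets)   # union of children
    if kind == "and":
        return set(product(*child_sets))  # Cartesian product of children
    raise ValueError(f"unknown state kind: {kind}")


# Hypothetical example: AND-super-state K with two OR-super-states G and H.
tree = ("and", "K", [
    ("or", "G", [("basic", "A", []), ("basic", "B", [])]),
    ("or", "H", [("basic", "C", []), ("basic", "D", [])]),
])
configurations = state_set(tree)
```

Each element of the result for an AND-super-state is a tuple with one active state per concurrent child, which is why the number of configurations multiplies.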
2-3
Types of States
In StateCharts, states are either
basic states, or
AND-super-states, or
OR-super-states.
i ers
Since time needs to be modeled in embedded systems,
timers need to be modeled.
n State harts, special edges can be used for timeouts.
f event a does not happen while the system is in the left

state for ms, a timeout will take place.
2-3
Using Timers in an Answering Machine
2-3
Representation of Computations
Besides states, arbitrarily many other variables can be defined. This way, not all states of the system are modeled explicitly.
These variables can be changed as a result of a state transition (“action”). State transitions can be dependent on these variables (“condition”).
[Figure labels: action, unstructured state space, variables, condition]
2-3
General Form of Edge Labels
event [condition] / action
Events:
Exist only for the next evaluation of the model
Can be either internally or externally generated
Conditions:
Refer to values of variables that keep their value until they are reassigned
Actions:
Can either be assignments for variables or creation of events
service-off not in Lproc service:
2-3
Events and Actions
Events can be composed of several events:
(e1 and e2): event that corresponds to the simultaneous occurrence of e1 and e2.
(e1 or e2): event that corresponds to the occurrence of either e1 or e2 or both.
(not e): event that corresponds to the absence of event e.
Actions can also be composed:
(a1; a2): actions a1 and a2 are executed in parallel.
All events, states and actions are globally visible.
2-
E a ple
e a1 c a2
x y z
e:
a1:
a2:
true
c: false
e:
a1:
a2:
true
c: false
2-
The StateCharts Simulation Phases
How are edge labels evaluated?
Three phases:
1. Effect of external changes on events and conditions is evaluated,
2. The set of transitions to be made in the current step and right hand sides of assignments are computed,
3. Transitions become effective, variables obtain new values.
2- 2
Example
In phase 2, variables a and b are assigned to temporary variables. In phase 3, these are assigned to a and b. As a result, variables a and b are swapped.
In a single phase environment, executing the left state first would assign the old value of b to a and b. Executing the right state first would assign the old value of a to a and b. The execution would be non-deterministic.
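The effect of separating phases 2 and 3 can be reproduced with two tiny evaluators. The function names and the initial values a = 1, b = 0 are assumptions for illustration; the two concurrent assignments `a := b` and `b := a` are the ones discussed on the slide.

```python
def step_three_phase(variables, assignments):
    """Phase 2: evaluate every right-hand side on the old values.
    Phase 3: commit all results at once."""
    temporaries = {target: rhs(variables) for target, rhs in assignments}
    updated = dict(variables)
    updated.update(temporaries)
    return updated


def step_single_phase(variables, assignments):
    """Each assignment takes effect immediately, so the order matters."""
    updated = dict(variables)
    for target, rhs in assignments:
        updated[target] = rhs(updated)
    return updated


swap = [("a", lambda v: v["b"]), ("b", lambda v: v["a"])]
start = {"a": 1, "b": 0}   # assumed initial values

three_phase = step_three_phase(start, swap)
left_first = step_single_phase(start, swap)
right_first = step_single_phase(start, list(reversed(swap)))
```

The three-phase step swaps the variables regardless of evaluation order, while the single-phase results depend on which assignment runs first.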
2- 3
Steps
Execution of a StateChart model consists of a sequence of (status, step) pairs
Status = values of all variables + set of events + current time
Step = execution of the three phases
2-
Reflects Model of Clocked Hardware
In an actual clocked (synchronous) hardware system, both registers would be swapped as well.
Same separation into phases found in other languages as well, especially those that are intended to model hardware.
2-
More on Semantics of StateCharts
Unfortunately, there are several time-semantics of StateCharts in use. This is another possibility:
A step is executed in arbitrarily small time.
Internal (generated) events exist only within the next step.
External events can only be detected after a stable state has been reached.
e ternal events
stable stable
state state state
transitions t
transport of internal events step
2-
E a ples
state diagram:
stable state
2-
E a ple
on-determinism
a a
A C E G
a a
B D F H
state diagram:
a E,H
a
A,B C,D
a F,G
2-
E a ple
state diagram (only
stable states are
represented, only a
a c and b are e ternal):
a
a a a
a
ac
a a
2-
Evaluation of StateCharts
Hierarchy allows arbitrary nesting of AND- and OR-super states.
Semantics defined in a follow-up paper to original paper.
Large number of commercial simulation tools available (StateMate, StateFlow [Matlab], BetterState, UML, ...)
Available back-ends translate StateCharts into C or VHDL, thus enabling software or hardware implementations.
2-
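Evaluation of StateCharts
Generated programs frequently inefficient,
Not useful for distributed applications,
No description of non-functional behavior,
No object-orientation,
No description of structural hierarchy.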
2-
SDL
Specification and Description Language (SDL) is a specification language targeted at the unambiguous specification and description of the behaviour of reactive and distributed systems.
Used here as a (prominent) example of a model of computation based on asynchronous message passing.
Appropriate also for distributed systems
2- 2
Communication among SDL-FSMs
Communication between FSMs (or “processes”) is based on message-passing, assuming a potentially indefinitely large FIFO-queue.
Each process fetches next entry from FIFO,
checks if input enables transition,
if yes: transition takes place,
if no: input is discarded (exception: SAVE-mechanism).
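The fetch/check/discard loop of an SDL process can be sketched as follows. The class name `SdlProcess`, the transition-table encoding and the example signals are assumptions; the SAVE mechanism mentioned above is deliberately not modeled.

```python
from collections import deque


class SdlProcess:
    """One FSM with its own input FIFO (hypothetical minimal model)."""

    def __init__(self, initial_state, transitions):
        self.state = initial_state
        self.transitions = transitions   # {(state, signal): next_state}
        self.fifo = deque()              # conceptually unbounded
        self.discarded = []

    def send(self, signal):
        self.fifo.append(signal)

    def step(self):
        """Fetch the next FIFO entry; fire a transition or discard the input."""
        if not self.fifo:
            return False
        signal = self.fifo.popleft()
        key = (self.state, signal)
        if key in self.transitions:
            self.state = self.transitions[key]
        else:
            self.discarded.append(signal)
        return True


proc = SdlProcess("idle", {("idle", "start"): "busy", ("busy", "stop"): "idle"})
for s in ["stop", "start", "stop"]:
    proc.send(s)
while proc.step():
    pass
```

The first `stop` arrives while the process is idle, enables no transition and is discarded; the following `start` and `stop` take the process to busy and back to idle.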
2- 3
Deterministic?
Let tokens be arriving at FIFO at the same time:
Order in which they are stored is unknown
All orders are legal: simulators can show different behaviors for the same input, all of which are correct.
2-
Contents
Models of Computation
StateCharts
2-
Data-flow Language Model
Processes communicating through FIFO buffers
[Figure: three processes connected by FIFO buffers]
2-
Philosophy of Dataflow Languages
Drastically different way of looking at computation:
Imperative language style: program counter is king
Dataflow language: movement of data is the priority
Scheduling responsibility of the system, not the programmer
Basic characteristics:
All processes run “simultaneously”
Processes can be described with imperative code
Processes can only communicate through buffers
Sequence of read tokens is identical to the sequence of written tokens
2-
Dataflow Languages
Appropriate for applications that deal with streams of data:
Fundamentally concurrent: maps easily to parallel hardware
Perfect fit for block-diagram specifications (control systems, signal processing)
Matches well current and future trend towards multimedia applications
Representation:
Host Language (process description), e.g. C, C++, Java, ...
Coordination Language (network description), usually “home made”, e.g. XML.
2-
Example: MPEG-2 video decoder
2-
Kahn Process Networks
Proposed by Kahn in 1974 as a general-purpose scheme for parallel programming:
read: destructive and blocking (reading an empty channel blocks until data is available)
write: non-blocking
FIFO: infinite size
Unique attribute: determinate
2-
A Kahn Process
From Kahn's original paper:
    process f(in int u, in int v, out int w)
    {
      int i; bool b = true;
      for (;;) {
        i = b ? wait(u) : wait(v);
        printf("%i\n", i);
        send(i, w);
        b = !b;
      }
    }
What does this do?
Process alternately reads from u and v, prints the data value, and writes it to w
2-
A Kahn Process
From Kahn's original paper:
    process g(in int u, out int v, out int w)
    {
      int i; bool b = true;
      for (;;) {
        i = wait(u);
        if (b) send(i, v); else send(i, w);
        b = !b;
      }
    }
What does this do?
Process reads from u and alternately copies it to v and w
2- 2
A Kahn Process
From Kahn's original paper:
    process h(in int u, out int v, int init)
    {
      int i = init;
      send(i, v);
      for (;;) {
        i = wait(u);
        send(i, v);
      }
    }
What does this do?
Process sends initial value, then passes through values.
2- 3
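A Kahn Process Network
What does this do?
Prints an alternating sequence of 0's and 1's.
[Figure: f feeds g; each output of g passes through an h process back to an input of f]
h (init = 0): emits a 0 once and then copies input to output
h (init = 1): emits a 1 once and then copies input to output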
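The network of f, g and h can be simulated with threads and unbounded queues, whose blocking `get()` matches Kahn's blocking read. This is a sketch under assumptions: the wiring (which h carries which initial value) is inferred from the figure residue, and f is limited to `n` iterations so the program terminates.

```python
import queue
import threading


def f(u, v, w, out, n):
    """Alternately read u and v, record the value, forward it to w."""
    b = True
    for _ in range(n):
        i = (u if b else v).get()   # blocking, destructive read
        out.append(i)               # stands in for printf
        w.put(i)                    # non-blocking write
        b = not b


def g(u, v, w):
    """Read u and alternately copy to v and w."""
    b = True
    while True:
        i = u.get()
        (v if b else w).put(i)
        b = not b


def h(u, v, init):
    """Emit the initial value once, then copy input to output."""
    v.put(init)
    while True:
        v.put(u.get())


def run_network(n=8):
    f_to_g, g_to_h0, g_to_h1, h0_to_f, h1_to_f = (queue.Queue() for _ in range(5))
    out = []
    workers = [
        (g, (f_to_g, g_to_h0, g_to_h1)),
        (h, (g_to_h0, h0_to_f, 0)),   # assumed: this h starts with 0
        (h, (g_to_h1, h1_to_f, 1)),   # assumed: this h starts with 1
    ]
    for target, args in workers:
        threading.Thread(target=target, args=args, daemon=True).start()
    f(h0_to_f, h1_to_f, f_to_g, out, n)
    return out
```

The recorded sequence is the same on every run even though thread scheduling varies, which illustrates the determinacy property discussed on the following slides.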
2-
Determinacy
Random:
A system is random if the information about the system and its inputs is not sufficient to determine its outputs.
Determinate:
Define the history of a channel to be the sequence of tokens that have been both written and read. A process network is said to be determinate if the histories of all channels depend only on the histories of the input channels.
Importance:
Functional behavior is independent of timing (scheduling, communication time, execution time of processes).
Separation of functional properties and timing.
2-
Determinacy
[x1,x2,x3,…] [y1,y2,y3,…]
F
monotonic mapping
x x
y
,
2 - 66
Determinacy
[x1,x2,x3,…] [y1,y2,y3,…]
F
orma de inition
[x1, x2, x3, ]
x [x1] [x1, x2] [x1, x2, x3, ]
, 1, ,

F o
F F
2-6
r Determini m
determinate

y
Rea oning
, y y
y
y y, ,
y
, y
y
2-6
in n eterminacy
y

amp e
2-6
in n eterminacy
1 [ ]
F [ , ]
F 1, 2
2 [ ]
1 [ , ]
F [ , , ]
F 1 , 2
2 [ ]
[ ], [ ] [ , ], [ ]
F F [ , ][ , , ]
2-
c e in a n et r
2-
Deman ri en c e in
y y
2- 2
Tom Parks' Algorithm
Bounded memory
Start with bounded buffer sizes
Use any scheduling technique
If the network runs without deadlock, continue
If deadlock occurs, increase size
2-
Fr m n inite t Finite er i e
y
n,
n
y
y
2-
Dea c am e
x y
2
, ,
1,
1,
1,
2- 5
am e Finite i e er in
2
1
1, 1, 1,
1,
1, 1,
1, 1,
1,
2- 6
ar rit m in cti n
1
, , ,
1 1 1
1 2 1 1 1
3 1 1
2 y
2-
ar rit m in cti n
y y
1 1 1 1
2 1 1 1 1 1
1
3 1 1
2 y
2-
a ati n a n r ce et r
ro
y

x

on

y y
x y
2-
ync r n Data DF
, y, 1
estri tion
i ed n m er o token
amp e 1
1 1 2 3 2 1
2-
SDF Scheduling
Schedule determined at compile time
Establish relative execution rates
Determine periodic schedule
Re t x y
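Establishing relative execution rates amounts to solving the balancing equations: for every edge, (firings of producer) × (tokens produced) = (firings of consumer) × (tokens consumed). The sketch below is an assumption-laden illustration: the function name, the edge encoding and the three-actor example are hypothetical and do not reproduce the slide's own graph.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Smallest positive integer firing counts satisfying all balance equations.

    edges: list of (src, dst, tokens_produced, tokens_consumed).
    Assumes a connected graph; raises ValueError if rates are inconsistent.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, produced, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                changed = True
            elif src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent rates: no periodic schedule")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the rational rates to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {a: int(r * scale) for a, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {a: c // common for a, c in counts.items()}


# Hypothetical graph: A produces 2 per firing, B consumes 3; B produces 1, C consumes 2.
firings = repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)])
```

In this example A fires three times (6 tokens), B twice (consuming 6, producing 2) and C once (consuming 2), so all buffers return to their initial fill level after one period.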
2-
a ancin ati n
3a 2
3d
1
3
2 2 a
3 3
d 2a
2 1
3
1 2
2- 2
in t e a ancin ati n
ain D c ed ing t eorem
n
y x n1
n1 x
y y
y
amp e
2-
Determine eri ic c e e
1
o i e c ed e

3 2 3

…
2 1
3
1 2
y y
e i i it
, y

2-
Hardware/Software Codesign
3. Design Space Exploration
doc. dr. Gregor Papa
-
y tem De i n
y y
-2
Design Space Exploration
Application Architecture
Mapping
Estimation
-
Detai e ie De i n ace rati n
e a ati n
c n tr ct ma e timate
arc itect re a icati n er rmance
m ti ecti e
timi ati n
-
am e im e e
, 2,
1
1 3
, y,
-5
Example 1: Evolutionary Algorithms for DSE
individual
allocation decode allocation
n selection binding
o recombination decode binding
p mutation
scheduling
design point
“chromosome” = encoded
allocation + binding (implementation)
fitness evaluation
fitness
3-6 user constraints

Example 1: Basic Model
Definition: A specification graph is a graph GS=(VS,ES) consisting of a data flow graph GP of a problem, an architecture graph GA, and edges EM. In particular, VS=VP∪VA, ES=EP∪EA∪EM
[Figure: problem graph GP (nodes 1 to 7) linked by mapping edges EM to architecture graph GA (RISC, SB, HWM1, PTP, HWM2)]
3-
Example 1: Mapping
1
0 1
0
1 5 RISC
8
21 3 SB
1
29 7 HWM1
20
α
1 2
1
21 6 β RISC HWM1
2 PTP bus
30 4 shared
bus
HWM2
τ
3-
Example 1: Challenges
Encoding of (allocation+binding)
simple encoding
• e.g. one bit per resource, one variable per binding
• easy to implement
• many infeasible partitioning solutions
encoding + repair
• e.g. simple encoding and modify such that for each vp ∈ VP there exists at least one va ∈ VA with β(vp) = va
• reduces number of infeasible partitioning solutions
Generation of the initial population, mutation
Recombination
Example 1: ase Study
3-
Example 1: ase Study
3-
am le ase tud
rame memor dual orted rame memor bloc matc module
ut module
out ut module
subtract/add module
/ module u ma e coder
3 - 12
am le olut o
INM
INM OUTM
OUTM FM
FM RISC2
RISC2
SBS
3 - 13
am le olut o
INM
INM OUTM
OUTM DPFM
DPFM HC
HC DCTM
DCTM BMM
BMM SAM
SAM
SBF
3-1
am le o t are t es s
2 2
A B C D F
CD DAT
ec s o s
nS oC
I
ABABABCCABABA CODE(A) CALL(A) FOR 1 TO 2
CODE(B) CALL(B) CODE(A)
CODE(A) CALL(A) CALL(B)
CODE(B) CALL(B) CODE(C)
CODE(C) CALL(C) CODE(A)
3-1
Example: Optimization Criteria
P
PROCEDURE A
FOR 1 TO 3
CALL(A)
CODE(B)
CODE(B)
2
A B
3-1
Example: Trade-offs
3-1
Example: Trade-off Surfaces
3-1
Example: Exploration Strategy
3-1
Example: Kahn Process Network
3-2
Example: Hardware Architecture
3 - 21
Example: Result of Functional Simulation
3 - 22
Example: Result of Platform Benchmarks
3 - 23
Example: Back-of-the-Envelope Analysis
3-2
Example: Sample Exploration Result
3-2
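Hardware/Software Codesign
System Simulation
doc. dr. Gregor Papa
Jožef Stefan International Postgraduate School
-1
System Design
[Figure: Specification; SW Compilation, Instruction Set, HW Synthesis; Prop. Code, Prop. Block; Machine Code, Net lists]
-2
Outline
System Classification
Discrete Event Simulation
Example SystemC
Simulation at High Abstraction Levels
-3
System and Model
-
State
-
State
-
Time
-
Events and Discrete Event Systems
-
Discrete Event Systems
-
Time Driven vs. Event Driven
-1
Time Driven vs. Event Driven
- 11
Time Driven vs. Event Driven
- 12
Outline
System Classification
Discrete Event Simulation
Example SystemC
Simulation at High Abstraction Levels
- 13
Discrete Event Model and Simulation
-1
Components of a Discrete Event Simulation
-1
Discrete Event Simulation Engine
[Flowchart: init routine; process an event by calling the subsystem module; remove the event; update statistical information; generate simulation report]
-1
Discrete Event Simulation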
Processing an event may “produce” new events.
Problem: Within the same simulation cycle, “cause” and “effect” events share
the same time of occurrence
Solution: The simulator uses a zero-duration virtual time interval, called
delta-cycle (δ)
The role of a delta-cycle is to order “simultaneous” events within a simulation
cycle, i.e. identifying which event caused another; “causes” and “effects” are
separated by delta-cycles.
Simulation cycles may be composed of several delta-cycles (δ)
[Figure: events A, C, D, B, C, E within one simulation cycle, separated by delta-cycles]
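How delta-cycles order cause and effect can be sketched with a toy event queue keyed by (time, delta). The `Simulator` class, its method names and the event names are assumptions for illustration; this is not the SystemC kernel.

```python
import heapq


class Simulator:
    """Toy discrete-event kernel: events are ordered by (time, delta)."""

    def __init__(self):
        self.queue = []      # heap of (time, delta, seq, name, action)
        self.seq = 0         # unique tie-breaker, keeps insertion order stable
        self.now = (0, 0)    # current (time, delta)
        self.log = []

    def schedule(self, name, action=None, delay=0):
        time, delta = self.now
        if delay == 0:
            stamp = (time, delta + 1)   # same time, next delta-cycle
        else:
            stamp = (time + delay, 0)   # later time, delta restarts at 0
        heapq.heappush(self.queue, (stamp[0], stamp[1], self.seq, name, action))
        self.seq += 1

    def run(self):
        while self.queue:
            time, delta, _, name, action = heapq.heappop(self.queue)
            self.now = (time, delta)
            self.log.append((time, delta, name))
            if action is not None:
                action(self)   # processing an event may produce new events


sim = Simulator()
# "A" causes "B" with zero delay; "B" causes "C" (zero delay) and "D" (after 5 time units).
sim.schedule("A", lambda s: s.schedule(
    "B", lambda s2: (s2.schedule("C"), s2.schedule("D", delay=5))))
sim.run()
```

All of A, B and C occur at simulated time 0, yet the delta component of the time stamp records which event caused which.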
-1
Outline
System Classification
Discrete Event Simulation
Example SystemC
Simulation at High Abstraction Levels
4 - 18
S te O e ie
4-1
le
4-
le O
4- 1
o ule
processes
4-
o e e
4-
o ule
4- 4
o e o uni tion
Processes can directly communicate through signals.
[Figure: module with input ports, output ports, processes with sensitivity lists, and an internal signal]
4-
n e o uni tion
SystemC introduces general-purpose primitives
Channel
A container for communication and synchronization, e.g. can
have state and private data, transport data, transport events.
Channels implement one or more interfaces
Interface
Specifies a set of access methods to the channel
But it does not implement those methods
Event
Flexible, low-level synchronization primitive, used to construct
other forms of synchronization
Has no type and no value
Other communication and synchronization models can be built based on the
above primitives
4-
nnel n ot
4-
Wait and Notify
Wait: halt process execution until event is raised
wait() with arguments => dynamic sensitivity
• wait(sc_event)
• wait(time)
• wait(time_out, sc_event)
Notify: raise an event
notify() with arguments => delayed notification
• my_event.notify(); // notify immediately
• my_event.notify(SC_ZERO_TIME); // notify next delta cycle
• my_event.notify(time); // notify after time
4 - 28
Simulation Elements: Main Program
Initialization phase
execute all the processes until a blocking point
update signals
Clock cycle / delta cycle
compute the set of ready processes
number of ready processes: if none are ready, advance simulation time
execute all the processes until a blocking point
update signals
4-2
Example: Simple Channel
4-
Example: Simple Channel Interface
4-
Example: Simple Channel
4- 2
Example: Simple Producer/Consumer
4-
Example: Kahn Process Network
4- 4
4-
they will deadlock unless
an initial token is put into the loop:
output1.write(0.0);
4-
4- 7
4- 8
SystemC and Models of Computation
4-
Outline
System Classification
Discrete Event Simulation
Example SystemC
Simulation at High Abstraction Levels
4-4
Multiple Levels of Abstraction
Functional Level (Un-timed Functional Level)
Use: model (un-)timed functionality
Communication: shared variables, messages
Typical languages: Matlab
Transaction Level
Use: SoC architecture analysis, early development, timing estimation
Communication: method calls to channels
Typical languages: SystemC
Register Transfer Level (Pin Level)
Use: design and verification
Communication: wires and registers
Typical languages: Verilog
4-4
Abstraction Models
Time granularity for communication/computation objects can be classified
into 3 basic categories: Un-Timed, Approximate-Timed, Cycle-Timed
Models B, C, D and E could be classified as Transaction Level Models
(TLM)
A. "Un-timed functional model"
B. "Timed functional model"
C. "Transaction model"
D. "Cycle-accurate communication model"
E. "Cycle-accurate computation model"
F. "Register transfer model"
[Figure: System Modeling Graph (2003 Dan Gajski and Lukai Cai). Horizontal axis: computation; vertical axis: communication; each un-timed, approximate-timed or cycle-timed. Communication un-timed: A, B; approximate-timed: C, E; cycle-timed: D, F. Computation un-timed: A; approximate-timed: B, C, D; cycle-timed: E, F]
4 - 42
A: "Un-Timed Functional Model"
Computation
Un-timed behaviors (B1, B2, B3, B4), sequential or parallel execution
Communication
Un-timed transfer via variables
[Figure: B1 (v1 = a*a;), then B2 (v2 = v1 + b*b;) and B3 (v3 = v1 - b*b;) in parallel, then B4 (v4 = v2 + v3; c = sequ(v4);). Model A: un-timed computation, un-timed communication]
4-4
B: "Timed Functional Model"
Computation (on processing elements, PEs)
Time annotation (estimate)
Communication
Message-passing, no protocol implementation
Un-timed transfer
Mapping
PEs (architecture) allocation and process-to-PE mapping
[Figure: behaviors B1 to B4 mapped onto PEs that exchange messages over channels; the code carries time estimates via DELAY() or wait(). Model B: approximate-timed computation, un-timed communication]
4 - 44
Example B: Software Code Annotation
Input: ANSI C source code
Output: functionally equivalent C code augmented by execution times
Steps:
Analyze basic blocks, compute delays (delay characterization per operation: ld, li, op, br, ...)
Annotate C code (model C code execution delay)
Compile generated C and run natively (performance estimation)
Specification:
v__st_tmp = v__st;
startup(proc);
if(events[proc][0] & 1)
execute(proc);
Annotated C code:
v__st_tmp = v__st;
__DELAY(LI+LI+LI+LI+LI+LI+OPc);
startup(proc);
if(events[proc][0] & 1) {
__DELAY(OPi+LD+LI+OPc+LD+OPi+OPi+IF);
execute(proc);
}
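The effect of the delay annotations can be sketched as a running cycle counter that each basic block increments. The per-operation costs in `COST` are illustrative assumptions, not measured values, and `estimate` is a hypothetical stand-in for the annotated code.

```python
# Hypothetical per-operation costs in cycles (illustrative values only).
COST = {"LI": 1, "LD": 2, "OPc": 1, "OPi": 1, "IF": 2}


def block_cost(*ops):
    """Counterpart of one __DELAY(...) annotation: cost of a basic block."""
    return sum(COST[op] for op in ops)


def estimate(event_pending):
    """Estimated cycles for one pass through the annotated fragment."""
    # First basic block: always executed.
    cycles = block_cost("LI", "LI", "LI", "LI", "LI", "LI", "OPc")
    if event_pending:
        # Second basic block: executed only when the branch is taken.
        cycles += block_cost("OPi", "LD", "LI", "OPc", "LD", "OPi", "OPi", "IF")
    return cycles
```

Because the delays are charged per basic block, the estimate follows the actual control flow of each native run.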
4-4
C: “Transaction Model”
Computation
Approximate-timed (estimate)
Communication
Approximate-timed (estimate) using simplified (abstract) bus protocols
Mapping
Mapping of computation and communication
[Figure: behaviors B1 (v1 = a*a;), B2 (v2 = v1 + b*b;), B3 (v3 = v1 - b*b;) and B4 (v4 = v2 + v3; c = sequ(v4);) mapped onto PE1 to PE3; PE4 is the arbiter; abstract bus channels cv11, cv12, cv2; interfaces: 1 master, 2 slave, 3 arbiter. Model C: approximate-timed computation and communication]
4 - 46
D: “Cycle-Accurate Communication Model”
Computation
Approximate-timed (estimate)
Communication
Protocol bus channels (time/cycle-accurate and pin-accurate)
Mapping
Mapping of computation and communication
[Figure: behaviors B1 to B4 on PE1 to PE3, PE4 as arbiter, connected by a protocol bus with address, data, ready and ack signals; interfaces: 1 master, 2 slave, 3 arbiter. Model D: approximate-timed computation, cycle-timed communication]
4-4
E: “Cycle-Accurate Computation Model”
Computation
Cycle-accurate
Communication
Approximate-timed (estimate) using simplified (abstract) bus protocols
Wrappers
Simulation interfaces between cycle-accurate, pin-accurate PEs and abstract bus channel interfaces
[Figure: PE1 and PE2 as instruction sequences, PE3 and PE4 as state machines (S0 to S4), connected through wrappers to abstract bus channels cv11, cv12, cv2; interfaces: 1 master, 2 slave, 3 arbiter, 4 wrapper. Model E: cycle-timed computation, approximate-timed communication]
4-4
Example E: What is an ISS?
An Instruction Set Simulator (ISS) is a simulation model, coded in a
high-level language, which mimics the behavior of a processor by
“reading” instructions and maintaining internal variables which represent
the processor's registers
Instruction-accurate
Cycle-accurate
Simulate (execute and monitor) machine code instructions, compiled for a
target processor
4-4
Example E: Types of ISS
original C code: a = b+c;
compilation
original assembly code: add r1, r2, r3
Interpretive ISS (ISS code):
int Reg[32];
…
while(1) {
Fetch();
Decode();
Execute();
InterruptHandler();
}
switch INSN {
case ADD: r3=r1+r2;
case SUB: ...
}
Compiled ISS (intermediary C code generation and recompilation):
…
Add(r1, r2, r3);
…
#define Add(r1, r2, r3)\
r3=r1+r2
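The interpretive variant can be sketched as a fetch/decode/execute loop over a register file. The three-operand toy instruction set (`LI`, `ADD`, `SUB`, `HALT`) and the tuple encoding are assumptions for illustration, not a real target processor.

```python
def run_iss(program, num_regs=8, max_steps=1000):
    """Interpretive, instruction-accurate ISS sketch for a hypothetical toy ISA."""
    reg = [0] * num_regs     # internal variables representing the registers
    pc = 0
    steps = 0                # number of executed instructions
    while pc < len(program) and steps < max_steps:
        op, a, b, c = program[pc]          # fetch + decode
        pc += 1
        steps += 1
        if op == "LI":                     # reg[a] = immediate b
            reg[a] = b
        elif op == "ADD":                  # reg[c] = reg[a] + reg[b]
            reg[c] = reg[a] + reg[b]
        elif op == "SUB":                  # reg[c] = reg[a] - reg[b]
            reg[c] = reg[a] - reg[b]
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return reg, steps


program = [
    ("LI", 1, 5, 0),
    ("LI", 2, 7, 0),
    ("ADD", 1, 2, 3),
    ("HALT", 0, 0, 0),
]
regs, steps = run_iss(program)
```

A cycle-accurate ISS would additionally charge a cycle count per instruction and model pipeline and memory effects; this sketch only counts instructions.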
4 - 50
F: "Register Transfer Model"
Computation and Communication
cycle-timed
modeled on the register level: combinatorial (stateless)
functions, memory and digital signals
PE1, PE2: microprocessors
PE3, PE4: custom hardware
[Figure: PE1 and PE2 as instruction sequences with interrupts, PE3 and PE4 as state machines (S0 to S3). Model F: cycle-timed computation and communication]
4-5
Different Abstraction Models
Models | Communication time | Computation time | Communication scheme | PE interface
A. Un-timed functional model | no | no | variables | (no PE)
B. Timed functional model | no | approximate | abstract channel | abstract
C. Transaction model | approximate | approximate | abstract bus channel | abstract
D. Cycle-accurate communication model | cycle-accurate | approximate | protocol bus channel | abstract
E. Cycle-accurate computation model | approximate | cycle-accurate | abstract bus channel | pin-accurate
F. Register transfer model | cycle-accurate | cycle-accurate | bus (wires) | pin-accurate
4-5
Trace-Based Simulation
A (Un-timed Functional Model) and C (Transaction Model)
Higher simulation speed (for large hardware/software systems,
multiprocessors)
Uses estimates of non-functional behavior
[Figure: system modeling graph with the trace-based approach placed between models A and C]
4-5
Trace-Based Simulation: 2 Phases
Phase 1: Trace generation
Input: application specification
Output: execution traces = sequence of
events ∈ {read; write; compute}
Method: un-timed functional simulation
Phase 2: Trace-based simulation
Input:
execution traces
architecture specification
mapping specification
Output: performance estimation results, e.g.
execution time, processor load and bus load
Method: map abstract read, write and
compute primitives onto virtual machines that
reflect binding and resource sharing (mapping)
4 - 54
Cosimulation Motivation: Mixed Models
and the simulation is very much dependent on the
system description model
How to mix several abstraction levels or several models of computation?
Motivating:
1. Different abstraction levels
2. Different description languages
3. Different models of computation
[Figure: interface between a more abstract model (C/C++, packet) and a less abstract model (address, data, cmd, cnfg, status)]
4 - 55
Cosimulation Example
Environments for multiprocessor system cosimulation:
Several ISSs coupled with HW RTL simulation: accurate, but
slow (especially for multiple ISSs running in parallel)
ISSs are replaced with higher-level simulation models: speed
up simulation time
[Figure: left, ISSs and HW simulators (SystemC) joined by cosimulation interfaces and an interconnect; right, tasks T1, T2, T3 running as native execution (UNIX) on OS models, joined by the same cosimulation interfaces and interconnect]
4-5
Cosimulation: Single vs. Multiple Engines
Single simulation engine: models 1, 2, ..., n are translated into a unified model run by one simulator
Multiple simulation engines: models 1, 2, ..., n run on simulators #1, #2, ..., #n connected by a cosimulation bus
4-5
Hardware/Software Codesign
Worst-Case Execution Time Analysis
doc. dr. Gregor Papa
Jožef Stefan International Postgraduate School
5-
System Design
Specification
SW Compilation, Instruction Set, HW Synthesis
Prop. Code, Prop. Block
Machine Code, Net lists

5-
Contents
Introduction
problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
must, may analysis
Pipelines
Abstract pipeline models
Integrated analyses
5-
Industrial Needs
Hard real-time systems, often in safety-critical
applications, abound
Aeronautics, automotive, train industries, manufacturing
control
Side airbag in car:
reaction in 1 mSec
Wing vibration of airplane:
sensing every mSec
5-4
Hard Real-Time Systems
Embedded controllers are expected to finish their tasks
reliably within time bounds.
Task scheduling must be performed.
Essential: upper bound on the execution times of all
tasks statically known.
Commonly called the Worst-Case Execution Time (WCET)
Analogously, Best-Case Execution Time (BCET)
5-5
Measurement: Industry's Best Practice
[Figure: distribution of execution times with best-case execution time, worst-case execution time and upper bound marked]
Unsafe: execution time measurement
Works if either
worst-case input can be determined, or
exhaustive measurement is performed
Otherwise, determine upper bound from execution times of
instructions
5-
(Most of) Industry's Best Practice
Measurements: determine execution times directly by
observing the execution or a simulation on a set of inputs.
Does not guarantee an upper bound to all executions.
Exhaustive execution in general not possible!
Too large space of input domain x set of initial execution
states.
Compute upper bounds along the structure of the

program:
Programs are hierarchically structured.
Statements are nested inside statements.
So, compute the upper bound for a statement from the upper
bounds of its constituents
5-
Sequence of Statements
A ≡ A1; A2;
Constituents of A: A1 and A2
Upper bound for A
is the sum of the upper
bounds for A1 and A2
ub(A) = ub(A1) + ub(A2)
5-8
Conditional Statement
A ≡ if B then A1 else A2
Constituents of A: condition B and statements A1 and A2
ub(A) = ub(B) + max(ub(A1), ub(A2))
5-
Loops
A ≡ for i ← 1 to 100 do A1
ub(A) = ub(i ← 1) + 100 × (ub(i ≤ 100) + ub(A1)) + ub(i ≤ 100)
[Figure: flow chart with initialization i ← 1, test i ≤ 100 (yes: A1, no: exit)]
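The three structural rules (sequence, conditional, loop) can be sketched as one recursive function over a statement tree. The tuple encoding of statements and the cycle counts in the example are assumptions for illustration.

```python
def ub(stmt):
    """Upper bound (in cycles) of a statement tree, computed bottom-up."""
    kind = stmt[0]
    if kind == "basic":                 # ("basic", cycles)
        return stmt[1]
    if kind == "seq":                   # ("seq", s1, s2, ...)
        return sum(ub(s) for s in stmt[1:])
    if kind == "if":                    # ("if", cond, then_branch, else_branch)
        _, cond, then_b, else_b = stmt
        return ub(cond) + max(ub(then_b), ub(else_b))
    if kind == "for":                   # ("for", init, test, body, max_iterations)
        _, init, test, body, n = stmt
        # n passes through test and body, plus the final failing test.
        return ub(init) + n * (ub(test) + ub(body)) + ub(test)
    raise ValueError(f"unknown statement kind {kind!r}")


# Hypothetical program with made-up cycle counts.
program = (
    "seq",
    ("basic", 2),
    ("if", ("basic", 1), ("basic", 5), ("basic", 3)),
    ("for", ("basic", 1), ("basic", 1), ("basic", 4), 100),
)
total = ub(program)
```

The bound is safe only if every leaf cost is itself an upper bound, which is where the constant-execution-time assumption breaks down on modern processors.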
5-
How to Start?
Assignment x ← a + b
load a
load b
add
store x
Assumes constant execution times for instructions
ub(x ← a + b) = cycles(load a) + cycles(load b) + cycles(add) + cycles(store x)
[Table: cycles per instruction (add, load, store, move); values not recoverable]
Not applicable to modern processors!
5-
Modern Hardware Features
Modern processors increase performance by using:
Caches, Pipelines, Branch Prediction, Speculation
These features make WCET computation difficult:
Execution times of instructions vary widely.
Best case: everything goes smoothly: no cache miss,
operands ready, needed resources free, branch correctly
predicted.
Worst case: everything goes wrong: all loads miss the
cache, resources needed are occupied, operands are not
ready.
Span may be several hundred cycles
5-
Access Times
x = a + b;
LOAD r2, _a
LOAD r1, _b
ADD r3,r2,r1
[Figure: bar chart of execution time (clock cycles) in the best case and the worst case; values not recoverable]
5-
Timing Accidents and Penalties
Timing Accident: cause for an increase of the execution
time of an instruction
Timing Penalty: the associated increase
Types of timing accidents
Cache misses
Pipeline stalls
Branch mispredictions
Bus collisions
Memory refresh of DRAM
TLB miss
5-
Overall Approach: Modularization
Micro-architecture Analysis:
Uses Abstract Interpretation
Excludes as many Timing Accidents as possible
Determines WCET for basic blocks (in contexts)
Worst-case Path Determination
Maps control flow graph to an integer linear program
Determines upper bound and associated path
5- 5
Contents
Introduction
Program Path Analysis
Value Analysis
Caches
must, may analysis
Pipelines
Abstract pipeline models
Integrated analyses
5-
Control Flow Graph
what_is_this {
1 read (a,b);
2 done = FALSE;
3 repeat {
4 if (a>b)
5 a = a-b;
6 elseif (b>a)
7 b = b-a;
8 else done = TRUE;
9 } until done;
10 write (a);
}
[Figure: control flow graph over statements 1 to 10 with edge labels a>b / a<=b, a<b / a=b, !done / done]
5-
Program Path Analysis
Program Path Analysis
Which sequence of instructions is executed in the worst case
(longest runtime)?
problem: the number of possible program paths grows
exponentially with the program length
Model
fixed number of cycles for each basic block (from static
analysis)
loops must be bounded
Concept
Transform structure of CFG into a set of (integer) linear
equations.
Solution of the Integer Linear Program (ILP) yields bound on
the WCET.
5- 8
Basic Block
Definition: A basic block is a sequence of instructions
where the control flow enters at the beginning and exits at
the end, without stopping in between or branching (except at
the end).
t1 := c - d
t2 := e * t1
t3 := b * t1
t4 := t2 + t3
if t4 < 10 goto L
5-
Basic Blocks
Determine basic blocks of a program
1. determine the block beginnings:
the first instruction
targets of un/conditional jumps
instructions that follow un/conditional jumps
2. determine the basic blocks:
there is a basic block for each block beginning
the basic block consists of the block beginning and runs
until the next block beginning (exclusive) or until the
program ends
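The two steps above can be sketched as follows. The instruction representation (a tuple of optional label, text, and optional jump target) is an assumption made for this example.

```python
def basic_blocks(instrs):
    """Split a list of (label, text, jump_target) tuples into basic blocks.

    label:       label of the instruction, or None
    jump_target: label targeted by an un/conditional jump, or None
    """
    label_index = {lab: i for i, (lab, _, _) in enumerate(instrs) if lab}
    leaders = {0}                                   # the first instruction
    for i, (_, _, target) in enumerate(instrs):
        if target is not None:
            leaders.add(label_index[target])        # target of a jump
            if i + 1 < len(instrs):
                leaders.add(i + 1)                  # instruction following a jump
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]               # next beginning is exclusive
    return [[text for _, text, _ in instrs[s:e]] for s, e in zip(starts, ends)]


code = [
    (None, "i := 0", None),
    (None, "t2 := 0", None),
    ("L", "t2 := t2 + i", None),
    (None, "i := i + 1", None),
    (None, "if i < 10 goto L", "L"),
    (None, "x := t2", None),
]
blocks = basic_blocks(code)
```

For this six-instruction loop the function yields three blocks: initialization, loop body, and the final assignment.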
5-
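Control Flow Graph with Basic Blocks
Degenerated control flow graph (CFG)
the nodes are the basic blocks
i := 0
t2 := 0
L: t2 := t2 + i
i := i + 1
if i < 10 goto L
x := t2
[Figure: CFG with three basic blocks; the loop block branches back on i < 10 and exits on i >= 10]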
5-
Example
/* k >= 0 */
s = k;
WHILE (k < 10) {
IF (ok)
j++;
ELSE {
j = 0;
ok = true;
}
k++;
}
r = j;
[Figure: corresponding CFG with blocks s = k; WHILE (k<10); if (ok); j++; j = 0; ok = true; k++; r = j;]
5-
Calculation of the WCET
Definition: A program consists of N basic blocks, where
each basic block Bi has a worst-case execution time ci and
is executed for exactly xi times. Then, the WCET is given by
WCET = \sum_{i=1}^{N} c_i \cdot x_i
the ci values are determined using the static analysis.
how to determine xi ?
• structural constraints given by the program structure
• additional constraints provided by the programmer (bounds for
loop counters, etc.; based on knowledge of the program context)
5-
Structural Constraints
Flow equations:
d1 = d2 = x1
d2 + d8 = d3 + d9 = x2
d3 = d4 + d5 = x3
d4 = d6 = x4
d5 = d7 = x5
d6 + d7 = d8 = x6
d9 = d10 = x7
[Figure: CFG with blocks B1 (s = k;), B2 (WHILE (k<10)), B3 (if (ok)), B4 (j++;), B5 (j = 0; ok = true;), B6 (k++;), B7 (r = j;) and edges d1 to d10]
5 - 24
Additional Constraints
loop is executed for at most 10 times:
x3 ≤ 10 · x1
B5 is executed for at most one time:
x5 ≤ 1 · x1
[Figure: CFG with blocks B1 to B7 and edges d1 to d10, as on the previous slide]
5 - 25
WCET - ILP
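ILP with structural and additional constraints:
WCET = \max \{ \sum_{i=1}^{N} c_i \cdot x_i \mid d_1 = 1, \; \sum_{j \in in(B_i)} d_j = \sum_{k \in out(B_i)} d_k = x_i \; (i = 1 \dots N), \; \text{additional constraints} \}
d_1 = 1: program is executed once
flow equations: structural constraints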
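For the small example CFG the maximization can be checked by brute-force enumeration instead of an ILP solver. The block costs `c` are made-up values, and enumerating only d4 and d5 relies on the flow equations fixing all other edge counts once the program is executed once.

```python
from itertools import product

# Hypothetical worst-case cycle counts c_i for blocks B1..B7 (illustrative only).
c = {1: 2, 2: 3, 3: 2, 4: 1, 5: 4, 6: 1, 7: 2}


def check(d):
    """Return (feasible, x) for edge counts d[1..10] of the example CFG."""
    x = {
        1: d[1], 2: d[2] + d[8], 3: d[3], 4: d[4],
        5: d[5], 6: d[6] + d[7], 7: d[9],
    }
    structural = (
        d[1] == d[2]
        and d[2] + d[8] == d[3] + d[9]
        and d[3] == d[4] + d[5]
        and d[4] == d[6]
        and d[5] == d[7]
        and d[6] + d[7] == d[8]
        and d[9] == d[10]
    )
    additional = x[3] <= 10 * x[1] and x[5] <= 1 * x[1]
    once = d[1] == 1                     # program is executed once
    return (structural and additional and once), x


best_cost, best_x = -1, None
# Free choices: how often each branch of the IF is taken (d4, d5).
for d4, d5 in product(range(11), range(2)):
    d = {1: 1, 2: 1, 3: d4 + d5, 4: d4, 5: d5, 6: d4, 7: d5,
         8: d4 + d5, 9: 1, 10: 1}
    feasible, x = check(d)
    if feasible:
        cost = sum(c[i] * x[i] for i in c)
        if cost > best_cost:
            best_cost, best_x = cost, x
```

With these costs the worst case takes the expensive B5 branch once and the B4 branch nine times. Enumeration does not scale, which is why the constraints are handed to an ILP solver in practice.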
5 - 26
Contents
Introduction
problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
must, may analysis
Pipelines
Abstract pipeline models
Integrated analyses
5-2
Abstract Interpretation (AI)
semantics-based method for static program analysis
basic idea of AI: perform the program's computations
using value descriptions or abstract values in place of the
concrete values, start with a description of all possible
inputs
supports correctness proofs
5-2
Abstract Interpretation: the Ingredients
abstract domain: related to concrete domain by
abstraction and concretization functions
e.g. L → Intervals,
where Intervals = LB × UB, LB = UB = Int ∪ {-∞, ∞}
instead of L → Int
abstract transfer functions for each statement type:
abstract versions of their semantics
e.g. + : Intervals × Intervals → Intervals where
[a,b] + [c,d] = [a+c, b+d] with + extended to -∞, ∞
a join function combining abstract values from different
control-flow paths
e.g. ⊔ : Interval × Interval → Interval where
[a,b] ⊔ [c,d] = [min(a,c),max(b,d)]
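The interval domain can be sketched in a few lines, with intervals as (lower, upper) pairs and `math.inf` standing in for the infinite bounds. The function names `add` and `join` and the constant `TOP` are chosen here for illustration.

```python
import math

TOP = (-math.inf, math.inf)   # no information: any value is possible


def add(i1, i2):
    """Abstract transfer function for +: [a,b] + [c,d] = [a+c, b+d]."""
    (a, b), (c, d) = i1, i2
    return (a + c, b + d)


def join(i1, i2):
    """Join of two control-flow paths: [a,b] ⊔ [c,d] = [min(a,c), max(b,d)]."""
    (a, b), (c, d) = i1, i2
    return (min(a, c), max(b, d))


# A variable is in [0,4] on one path and in [10,12] on the other;
# after the join a constant 1 is added.
merged = join((0, 4), (10, 12))
result = add(merged, (1, 1))
```

The join over-approximates: values 5 to 9 are included in `merged` although no path produces them, which is safe but imprecise.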
5-2
Value Analysis
Motivation:
Provide access information to data-cache/pipeline analysis
Detect infeasible paths
Derive loop bounds
Method: calculate intervals at all program points, i.e. lower

and upper bounds for the set of possible values occurring
in the machine program (addresses, register contents,
local and global variables).
5 - 30
Value Analysis
Intervals are computed along the edges
At joins, intervals are unioned
move #4,D0
add D1,D0
move (A0,D0),D1
Which address is accessed here?
[Figure: intervals for D0, D1 and A0 before and after each instruction, and the resulting access interval; bounds not recoverable]
5-3
Contents
Introduction
Program Path Analysis
Value Analysis
Caches
must, may analysis
Pipelines
Abstract pipeline models
Integrated analyses
5-3
Caches: Fast Memory on Chip
Caches are used, because
Fast main memory is too expensive
The speed gap between CPU and memory is too large and
increasing
Caches work well in the average case:
Programs access data locally (many hits)
Programs reuse items (instructions, data)
Access patterns are distributed evenly across the cache
5 - 33
Caches
[Figure: Processor, Cache (fast, small, expensive; access takes ~ 1 cycle), Bus, Memory ((relatively) slow, large, cheap; access takes ~ 100 cycles)]
5-3
Caches: How They Work
CPU wants to read/write at memory address a,
sends a request for a to the bus.
Cases:
Block m containing a in the cache (hit):
request for a is served in the next cycle.
Block m not in the cache (miss):
m is transferred from main memory to the cache,
m may replace some block in the cache,
request for a is served asap while transfer still continues.
Several replacement strategies: LRU, PLRU, FIFO,...
determine which line to replace.
5 - 35
A-Way Set Associative Cache
5-3
LRU Strategy
Each cache set has its own replacement logic => Cache sets
are independent: everything explained in terms of one set
LRU-Replacement Strategy:
Replace the block that has been Least Recently Used
Modeled by Ages
Example: 4-way set associative cache
[Table: block held at each of the four ages after a sequence of accesses (miss, hit, miss); block indices not recoverable]
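LRU with ages can be sketched by keeping one cache set as a list whose index is the age of the block. The block names and the access sequence are illustrative assumptions, not the values of the original table.

```python
def lru_access(cache_set, block, ways=4):
    """Access `block` in one cache set; list index = age (0 = youngest).

    Returns True on a hit, False on a miss. The list is updated in place.
    """
    hit = block in cache_set
    if hit:
        cache_set.remove(block)      # younger blocks age by one, older ones keep their age
    elif len(cache_set) == ways:
        cache_set.pop()              # evict the oldest (least recently used) block
    cache_set.insert(0, block)       # the accessed block gets age 0
    return hit


cache = ["m0", "m1", "m2", "m3"]     # ages 0..3
results = [lru_access(cache, b) for b in ("m4", "m1", "m5")]
```

On a miss every block ages by one and the oldest is evicted; on a hit only the blocks younger than the accessed one age.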
5-3
Cache Analysis
How to statically precompute cache contents:
Must Analysis:
For each program point (and calling context), find out which
blocks are in the cache.
Determines safe information about cache hits. Each
predicted cache hit reduces WCET.
May Analysis:
For each program point (and calling context), find out which
blocks may be in the cache. Complement says what is not in
the cache.
Determines safe information about cache misses. Each
predicted cache miss increases BCET.
5-3
Contexts
Cache contents depend on the context, i.e. calls and
loops
First iteration loads the cache:
Intersection loses most of the information.
[Figure: while loop (while cond do) with the must-analysis join at the loop head]
Distinguish as many contexts as useful:
unrolling for caches
unrolling for branch prediction (pipeline)
5 - 50
Contents
Introduction
Program Path Analysis
Value Analysis
Caches
must, may analysis
Pipelines
Abstract pipeline models
Integrated analyses
5-5
Comparison of Architectures
single cycle
multiple cycle
pipelining
[Figure: timing diagrams of instruction execution for the three architectures]
5-5
Hardware Features: Pipelines
[Figure: four-stage pipeline (Fetch, Decode, Execute, WB); successive instructions start one cycle apart and overlap]
Ideal Case: 1 Instruction per Cycle
5 - 53
Datapath of a Pipeline Architecture
5-5
Hardware Features: Pipelines
Several instructions can be executed in parallel.
Some pipelines can begin more than one instruction per
cycle: VLIW, Superscalar.
Some CPUs can execute instructions out of order.
Hazards and cache misses.
5 - 55
Pipeline Hazards
Pipeline Hazards:
Data hazards: operands not yet available
(data dependences)
Resource hazards: consecutive instructions use same
resource
Control hazards: conditional branch
Instruction-cache hazards: instruction fetch causes cache
miss
5-5
Control Hazard
5-5
Data Hazard
5-5
t tc o h d
prediction of cache hits on instruction or
operand fetch or store
analysis of data/control hazards
analysis of resource hazards
[Figure: instruction sequence (lwz, add, lwz, add) annotated with cache hits and operand reads; operands not recoverable]
5-5
Concrete State Machine
Processor (pipeline, cache, memory, inputs) viewed as a
state machine, performing transitions every clock cycle.
Starting in an initial state for an instruction, transitions are
performed until a final state is reached:
final state: instruction has left the pipeline
number of transitions: execution time of instruction
function exec (b: basic block, s: concrete pipeline state) t: trace
interprets instruction stream of b starting in state s, producing trace t
successor basic block is interpreted starting in initial state last(t)
length(t) gives number of cycles
5-
Abstract Pipeline for Basic Block
function exec (b: basic block, s: abstract pipeline state)
t: trace
interprets instruction stream of b (annotated with cache
information) starting in state s, producing trace t
length(t) gives number of cycles
Abstract states may lack information, e.g. about cache
contents.
Assume local worst cases (is safe
in the case of no timing anomalies)
Traces may be longer (but never shorter).
5-
What is Different?
Initial state for successor basic block? In particular, if
there are several predecessor blocks
Alternatives:
sets of states
combine by assuming that local worst case is safe
5- 2
Summary of Steps
using statically computed effective
addresses and loop bounds
assume cache hits where predicted
assume cache misses where predicted or not excluded.
Only the worst result states of an instruction need to be
considered as input states for successor instructions
5- 3
Hardware/Software Codesign
Multi-Criteria Optimization
doc. dr. Gregor Papa
Jožef Stefan International Postgraduate School
-
System Design
Specification
SW Compilation, Instruction Set, HW Synthesis
Prop. Code, Prop. Block
Machine Code, Net lists
-2
Design Space Exploration
Application, Architecture
Estimation
multi-objective optimization
-3
Lecture Synopsis
Introduction
Optimization
Design
Implementation
-
Evolutionary Multiobjective Optimization Algorithms
What are Evolutionary Algorithms?
randomized, problem-independent search heuristics
→ applicable to black-box optimization problems
How do they work?
by iteratively improving a population of solutions by
variation and selection
→ can find many different optimal solutions in a single run
-5
The Knapsack Problem
[Figure: four items, each labeled with a weight (in g) and a profit; values not recoverable]
Goal: choose subset that
maximizes overall profit
minimizes total weight
-
The Solution Space
[Figure: all subsets plotted as profit over weight (500 g to 3500 g)]
-
The Trade-off Front
Observations:
1. there is no single optimal solution, but
2. some solutions are better than others
[Figure: profit over weight; two tasks are marked: finding the good solutions and selecting a solution]
Decision Making: Selecting a Solution
Approaches:
profit more important than cost (ranking)
weight must not exceed 2400 g (constraint)
[Figure: profit over weight with the region beyond the weight limit marked as too heavy]
When to Make the Decision
Before optimization:
ranks objectives, defines constraints
searches for one (green) solution
After optimization:
searches for a set of (green) solutions
selects one solution considering constraints
decision making often easier
evolutionary algorithms well suited
[Figure: profit over weight for both variants, with the too-heavy region marked]
-
Optimization Alternatives
Use of classical single-objective optimization methods
simulated annealing, tabu search
integer linear program
other constructive or iterative heuristic methods
Decision making (weighting the different objectives) is
done before the optimization.
Population-based optimization methods
evolutionary algorithms
genetic algorithms
Decision making is done after the optimization.
-
Ft e d ut e ect e
[Figure: aggregation-based fitness (weighted sum): parameter-oriented, scaling-dependent; dominance-based fitness: set-oriented, scaling-independent; both shown in the (y1, y2) plane]
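Weighted Cost Function
parameters (w1, w2, ..., wk)
multiple objectives (y1, y2, ..., yk) → transformation → single objective y
example: weighting approach
y = w_1 y_1 + \dots + w_k y_k
[Figure: weighted-sum direction in the (y1, y2) plane for a maximization problem]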
- 3
Outline of a Simple Evolutionary Algorithm
[Figure: cycle over bit-string solutions: fitness evaluation, mating selection, variation (recombination, mutation), environmental selection]
-
t t o t E cod o o ut o
1 1
item 1 item 2 item 3 item 4
subset
- 5
A Generic Multiobjective EA
[Figure: loop between population and archive: sample, vary (population); update, truncate (archive); select; result: new population and new archive]
An Evolutionary Algorithm in Action
[Figure: population in objective space (max. y2, min. y1) approaching a hypothetical trade-off front]
Black-Box Optimization
objective function: decision vector → objective vector
e.g. simulation model
Optimization algorithm:
only allowed to evaluate the objective function
(direct search)
[Figure: vehicle simulation model (stretch module, decision module, handling module, vehicle module) as an example of a black-box objective function]
-
Design Space Exploration
Specification, Optimization, Evaluation, Implementation
[Figure: design points evaluated by power consumption, latency and cost]
-
Packet Processing in Networks
[Figure: network domains: Embedded Internet Devices, Wearable Computing, Mobile Internet, Access, Core; remaining labels not recoverable]
-2
Network Processors
Network processor = high-performance, programmable device
designed to efficiently execute
communication
workloads [Crowley et al.]
[Figure: incoming flows (packet streams) enter the network processor (routing/forwarding, transcoding, encryption/decryption) and leave as outgoing flows (processed packets); real-time flows, e.g. voice; non-real-time flows, e.g. sftp]
-2
Optimization Scenario: Overview
Given:
1. specification of the task structure (task model)
= for each flow the corresponding tasks to be executed
2. different usage scenarios (flow model)
= sets of flows with different characteristics
Sought: network processor implementation
= architecture + task mapping + scheduling
Objectives:
1. maximize performance
2. minimize cost
(performance model)
Subject to:
1. memory constraint
2. delay constraints
- 22
Exploration Strategy
[Figure: for each usage scenario separately, an architecture template, task graph and binding restrictions are given; allocation and bindings are evaluated for performance, giving a cost/performance vector that feeds multiobjective optimization over architecture and performance]
- 23
Lecture Synopsis
Introduction
Optimization
Design
Implementation
-2
Dominance, Pareto Points
A design point is dominated by another design point if the other is
better or equal in all criteria and
better in at least one criterion.
A point is Pareto-optimal or a Pareto-point if it is not
dominated.
The domination relation imposes a partial order on all
design points
We are faced with a set of optimal solutions.
Divergence of solutions vs. convergence.
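Dominance and the set of non-dominated points can be sketched directly from the definition. The example assumes that all objectives are maximized, and the sample points are made up.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives maximized)."""
    better_or_equal = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return better_or_equal and strictly_better


def pareto_front(points):
    """Points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
front = pareto_front(points)
```

Points such as (1, 5) and (3, 3) are incomparable: neither dominates the other. That is why dominance is only a partial order and the result is a set of optimal solutions.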
- 25
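Multiobjective Optimization
-2
Multiobjective Optimization
Maximize (y1, y2, ..., yk) = f(x1, x2, ..., xn)
[Figure: objective space (y1, y2). Left: Pareto optimal = not dominated, versus dominated points. Right: relative to a given point, the other points are better, worse or incomparable]
Pareto set = set of all Pareto-optimal solutions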
-2
Randomized Black-Box Search Algorithms
Idea: find good solutions without investigating all solutions
Assumptions: better solutions can be found in the neighborhood
of good solutions
information available only by function evaluations
Randomized search algorithm:
first step: randomly choose a solution to start with
step t → t+1: randomly choose a solution using the solutions chosen so far
-2
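Types of Randomized Search Algorithms
[Figure: randomized search built from memory, mating selection, variation and environmental selection]
evolutionary algorithm: memory ≥ 1, both mating and environmental selection
tabu search: memory 1, no mating selection
simulated annealing: memory 1, no mating selection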
-2
Limitations of Randomized Search Algorithms
The No-Free-Lunch Theorem
All search algorithms provide on average the same
performance on all possible functions
with finite search and objective spaces.
[Wolpert, Macready: 1997]
Remarks:
Not all functions are equally likely and realistic
We cannot expect to design the algorithm beating all others
Ongoing research: which algorithm is suited for which class of
problem?
6 - 30
Course Synopsis
Introduction
Optimization
Implementation
6-3
Design Choices
representation, fitness assignment, mating selection
parameters
environmental selection, variation operators
6 - 32
Comparison of Three Implementations
Multiobjective knapsack problem
extended
Trade-off between
distance and diversity?
6 - 33
Design Choices
fitness assignment, mating selection
parameters
6-3
Representation
search space → (decoder) → solution space → (objectives) → objective space
solutions encoded by vectors, matrices, trees, lists, ...
fixed length, variable length
Issues:
completeness: each solution has an encoding
uniformity: all solutions are represented equally often
redundancy: cardinality of search space vs. solution space
feasibility: each encoding maps to a feasible solution
6-3
Example: Binary Vector Encoding
Given: graph
Goal: find minimum subset of nodes such that each edge
is connected to at least one node of this subset
(minimum vertex cover)
[Figure: graph and its encoding as a bit vector with one bit (selected?) per node]
6 - 36
Example: Integer Vector Encoding
Given: graph, k colors
Goal: assign each node one of the k colors such that the
number of connected nodes with the same color is
minimized (graph coloring problem)
[Figure: graph and its encoding as an integer vector with one color per node]
6-3
Example: Real Vector Encoding
[Figure: vector of real values, one per parameter x1, x2, ..., xn; values not recoverable]
Tree Example: Parking a Truck
[Figure: truck with cab and trailer backing up to a dock; steering angle u, cab angle d, trailer angle t, position (x, y)]
Goal:
find function c with u = c(x, y, d, t)
constant speed
6 - 39
Search Space for the Truck Problem
Operators: [operator symbols not recoverable]
Arguments: position x
position y
cab angle d
trailer angle t
Search space: set of symbolic expressions using the above
operators and arguments
Example Solution: Tree Representation
[Figure: expression tree; node labels not recoverable]
encodes a function (symbolic expression) u of x, y, d and t

6-
A Solution Found by an EA
[Figure: truck simulation and the encoded tree]
6- 2
Design Choices
representation, mating selection
parameters
6- 3
Fitness Assignment
Fitness F = scalar value representing quality of an individual
The simple case:
single-objective optimization:
solution in search space → solution in solution space → solution in objective space
More difficult cases:
fitness not only takes into account the different objectives
(compliance to Pareto optimality) but also properties of the whole
population
multiple optima need to be approximated (diversity)
constraints are involved which have to be met
6-
Simple Example: Pareto Ranking
Fitness function:
F(1) = 3
F(2) = 1
F(3) = 1
F(4) = 2
F(5) = 1
F(6) = 0
[Figure: six solutions plotted as execution time over cost]
6-
Constraint Handling
Constraint: g(x1, x2, ..., xn) ≥ 0
solution in solution space: feasible if ≥ 0, infeasible if < 0
Approaches:
construct initialization and variation such that infeasible
solutions are not generated (resp. not inserted)
representation is such that decoding always yields a feasible
solution
calculate constraint violation g(x1, x2, ..., xn) and incorporate it into
fitness (fitness to be maximized), e.g. by use of a penalty function
penalty(y) that applies if y < 0
include the constraints as new objectives
6- 6
Design Choices
[Figure: EA cycle with the design choices representation, fitness assignment, parameters, and variation operators]
6-47
Selection
Two types of selection:
mating selection: select for variation
environmental selection: select for survival
6-48
Tournament Selection
[Figure: population on the left, mating pool on the right]
uniformly choose a fixed number of individuals (the tournament size) at random, independently of fitness
compare fitness and copy the best individual into the mating pool
binary tournament selection means tournament size 2
6-49
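The tournament step above can be sketched in C as follows. This is a minimal illustration, not the course's implementation: the names tournament_pick and fill_mating_pool are hypothetical, larger fitness is assumed to be better, and rand() % n is used for brevity although it is slightly biased.

```c
#include <assert.h>
#include <stdlib.h>

/* One tournament: draw t individuals uniformly at random (independently of
 * fitness) and return the index of the best one. Assumption: larger is better. */
static size_t tournament_pick(const double *fitness, size_t n, size_t t) {
    assert(fitness != NULL && n > 0 && t > 0);
    size_t best = (size_t)rand() % n;
    for (size_t round = 1; round < t; round++) {
        size_t candidate = (size_t)rand() % n;
        if (fitness[candidate] > fitness[best]) {
            best = candidate;
        }
    }
    return best;
}

/* Fill the mating pool with the winners of pool_size independent tournaments.
 * t = 2 gives binary tournament selection. */
void fill_mating_pool(const double *fitness, size_t n, size_t t,
                      size_t *pool, size_t pool_size) {
    for (size_t i = 0; i < pool_size; i++) {
        pool[i] = tournament_pick(fitness, n, t);
    }
}
```

A larger tournament size increases selection pressure, because weak individuals win a tournament less often.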
Design Choices
[Figure: EA cycle with the design choices representation, fitness assignment, mating selection, parameters, and environmental selection]
6-50
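Vector Mutation: Examples
Bit vectors: each bit is flipped with a fixed probability [value not recoverable]
Permutations: swap two elements, or rearrange a part of the permutation
[Figure: example vectors before and after mutation]
6-51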
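Mutation Operators on Trees: Grow
[Figure: a leaf is replaced by a new subtree (grow)]
6-52
Mutation Operators on Trees: Shrink
[Figure: a subtree is replaced by a leaf (shrink)]
6-53
Mutation Operators on Trees: Switch
[Figure: two subtrees of a node are exchanged (switch)]
6-54
Mutation Operators on Trees: Replace
[Figure: a node is replaced by another node (replace)]
6-55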
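Vector Recombination: Examples
Bit vectors:
[Figure: two parent bit vectors and the resulting child]
Permutations:
[Figure: two parent permutations and the resulting child]
6-56
Recombination of Trees
[Figure: two parent trees exchange subtrees (exchange)]
6-57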
A Generic Multiobjective EA
[Figure: loop over population and archive with the steps sample, vary, update, truncate, and select, producing the new population and the new archive]
6-58
SPEA2 Algorithm
Step 1: Generate initial population P0 and empty archive (external set) A0. Set t = 0.
Step 2: Calculate fitness values of individuals in Pt and At.
Step 3: At+1 = non-dominated individuals in Pt and At. If size of At+1 > N then reduce At+1, else if size of At+1 < N then fill At+1 with dominated individuals in Pt and At.
Step 4: If t ≥ T then output the non-dominated set of At+1. Stop.
Step 5: Fill mating pool by binary tournament selection.
Step 6: Apply recombination and mutation operators to the mating pool and set Pt+1 to the resulting population. Set t = t + 1 and go to Step 2.
6-59
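SPEA2 Fitness Assignment
Idea (Step 2): calculate dominance rank weighted by dominance count
[Figure: solutions in the (y1, y2) objective plane, labeled with their values]
non-dominated solutions: value based on the number of dominated solutions
dominated solutions: value based on the number of non-Pareto solutions and ∑ strengths of dominators
Note: higher objective function value is better; smaller fitness is better
6-60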
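The idea of weighting dominance rank by dominance count can be sketched in C. This is an assumption-laden sketch in the spirit of the published SPEA2 raw fitness, not a transcription of the slide: the strength of a solution is taken as the number of solutions it dominates, and the fitness as the sum of the strengths of its dominators. The function names and the flat objective array layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* a dominates b under maximization: a is nowhere worse and somewhere better. */
static int dominates(const double *a, const double *b, size_t k) {
    int strictly_better = 0;
    for (size_t o = 0; o < k; o++) {
        if (a[o] < b[o]) return 0;
        if (a[o] > b[o]) strictly_better = 1;
    }
    return strictly_better;
}

/* y holds n objective vectors of length k, stored row by row.
 * strength[i] = number of solutions dominated by i.
 * fitness[i]  = sum of strengths of all solutions that dominate i
 *               (0 for non-dominated solutions; smaller is better). */
void strength_fitness(const double *y, size_t n, size_t k,
                      int *strength, int *fitness) {
    assert(y != NULL && strength != NULL && fitness != NULL && k > 0);
    for (size_t i = 0; i < n; i++) {
        strength[i] = 0;
        for (size_t j = 0; j < n; j++) {
            if (dominates(y + i * k, y + j * k, k)) strength[i]++;
        }
    }
    for (size_t i = 0; i < n; i++) {
        fitness[i] = 0;
        for (size_t j = 0; j < n; j++) {
            if (dominates(y + j * k, y + i * k, k)) fitness[i] += strength[j];
        }
    }
}
```

With this definition a solution that is dominated by strong solutions is penalized more than one dominated by a single weak solution.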
Course Synopsis
Introduction
Optimization
Design
6 - 62
Implementation: Components
A framework that
provides ready-to-use modules (algorithms, applications)
is simple to use
is independent of programming language and OS
comes with minimum overhead
Idea: separate the problem-dependent from the problem-independent part (cut)
[Figure: components Representation, Selection, Recombination, Objective functions, Fitness assignment, Archiving, Mutation, divided by the cut]
6-63
The Concept of PISA
[Figure: algorithms on one side and applications (e.g., knapsack, network processor design) on the other, coupled through a text-based interface]
PISA: Platform and programming language independent Interface for Search Algorithms [Bleuler et al.]
6-64
PISA: Implementation
[Figure: selector process and variator process exchange text files through a shared file system]
application independent (selector): mating and environmental selection; individuals are described by IDs and objective vectors
handshake protocol: state/action, individual IDs, objective vectors, parameters
application dependent (variator): variation operators; stores and manages individuals
6 - 66
Hardware/Software Codesign
7. Mapping Applications To Architectures
doc. dr. Gregor Papa
Jožef Stefan International Postgraduate School
7-1
System Design
[Figure: design flow from Specification through System Synthesis (with Estimation) to SW Compilation, Instruction Set, and HW Synthesis, producing Machine Code and Net lists; Prop. Code and Prop. Block enter as inputs]
7-2
Synthesis
Synthesis transforms behavior into structure.
allocation: select components
binding: assign functions to components
scheduling: determine execution order
mapping: (allocation and) binding; sometimes called partitioning
7-3
Application Specification
Depends on the underlying model of computation.
Examples (see also next slides):
Task graphs (data flow graph, control flow graph)
Process Networks (Kahn Process Network, Synchronous
Dataflow)
State Machine Representations (SpecCharts, StateCharts,
Polis) [not covered in this course].
For the mapping, very often only the network structure

and abstract properties of the processes are relevant
(abstraction from detailed process function).
7-4
Data Flow Graph
x = 3*a + b*b - c;
y = a + b*x;
z = b - c*(a + b);
[Figure: data flow graph of the three assignments with inputs a, b, c and outputs x, y, z]
7-5
Control Flow Graph
what_is_this {
1 read (a,b);
2 done = FALSE;
3 repeat {
4 if (a>b)
5 a = a-b;
6 elseif (b>a)
7 b = b-a;
8 else done = TRUE;
9 } until done;
10 write (a);
}
[Figure: control flow graph over the numbered statements; edges labeled a>b, a<=b, a<b, a=b, !done, done]
7-6
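As a hedged reading aid, the what_is_this pseudocode can be written as a runnable C function. The loop subtracts the smaller value from the larger until both are equal, which is Euclid's subtraction scheme for the greatest common divisor. The precondition a > 0 and b > 0 is an assumption added here; with a zero operand the loop would not terminate.

```c
#include <assert.h>

/* C rendering of the what_is_this pseudocode: repeated subtraction.
 * Assumption: both inputs are positive, otherwise the loop never ends. */
unsigned what_is_this(unsigned a, unsigned b) {
    assert(a > 0 && b > 0);
    int done = 0;
    do {
        if (a > b) {
            a = a - b;
        } else if (b > a) {
            b = b - a;
        } else {
            done = 1;   /* a == b: the common value is the result */
        }
    } while (!done);
    return a;
}
```

Each branch of the if/else chain corresponds to one labeled edge in the control flow graph.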
Kahn Process Network
Hierarchical network for M P application:
[Figure: hierarchical process network]
7-7
Architecture Specification
Depends on the underlying model of the platform.
Usually a graph notation is used to describe the elements; properties of the underlying platform are usually attached.
7-8
Example Architecture Specification
[Figure: DSP and RISC processors and a DXM memory connected by a bus]
- <processor name="processor1" type="DSP">
<port name="processor_port" type="duplex" />
<configuration name="clock" value="100 MHz" />
</processor>
+ <processor name="processor2" type="RISC">
+ <memory name="sharedmemory" type="DXM">
- <hw_channel name="in_tile_link" type="bus">
<port name="port1" type="duplex" />
<port name="port2" type="duplex" />
<port name="port3" type="duplex" />
<configuration name="buswidth" value="32bit" />
</hw_channel>
- <connection name="processor1link">
<origin name="processor1">
<port name="processor_port" />
</origin>
<target name="in_tile_link">
<port name="port1" />
</target>
</connection>
+ <connection name="processor2link">
+ <connection name="memorylink">
7-9
Mapping Specification
Relates application and architecture specification:
maps processes to computing resources
maps communication between processes (in case of process networks) to communication paths of the architecture
specifies resource sharing disciplines and scheduling
7-10
Example
Basic model with a data flow graph and static scheduling
Problem graph GP(VP,EP):
[Figure: data flow graph with nodes 1 to 7 and the corresponding problem graph]
Interpretation:
• VP consists of functional nodes VPf (task, procedure) and communication nodes VPc.
• EP represent data dependencies.
7-11
Example (2)
Architecture graph GA(VA,EA):
[Figure: architecture with RISC, HWM1, and HWM2 connected by a shared bus and a PTP bus, and the corresponding architecture graph]
• VA consists of functional resources VAf (RISC, ASIC) and bus resources VAc. These components are potentially allocatable.
• EA model directed communication.
7 - 12
Example (3)
Definition: A specification graph is a graph GS=(VS,ES) consisting of a problem graph GP, an architecture graph GA, and edges EM. In particular, VS=VP∪VA, ES=EP∪EA∪EM.
[Figure: problem graph GP (nodes 1 to 7) and architecture graph GA (RISC, SB, HWM1, PTP, HWM2), connected by the mapping edges EM]
7-13
Example (4)
Three main tasks of synthesis:
• Allocation α is a subset of VA.
• Binding β is a subset of EM, i.e., a mapping of functional nodes of VP onto resource nodes of VA.
• Schedule τ is a function that assigns a number (start time) to each functional node.
7-14
Example (5)
Definition: Given a specification graph GS, an implementation is a triple (α,β,τ), where α is a feasible allocation, β is a feasible binding, and τ is a schedule.
[Figure: specification graph annotated with an allocation α, a binding β, and start times τ; the allocated architecture uses RISC, HWM1, HWM2, a shared bus, and a PTP bus]
7-15
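System Design
[Figure: design flow from Specification through System Synthesis (with Estimation) to Compilation, Instruction Set, and Synthesis, producing Machine Code and Net lists]
7-16
Design Space Exploration
Determine mapping
Determine important parameters (end-to-end delay, throughput, buffer space, output jitter, ...)
Give feedback to optimization
[Figure: application and architecture feed the mapping, followed by estimation]
7-17
[Slide title and content not recoverable]
7-18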
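Hardware/Software Codesign
8. System Partitioning
doc. dr. Gregor Papa
Jožef Stefan International Postgraduate School
8-1
System Design
[Figure: design flow from Specification through System Synthesis (with Estimation) to Compilation, Instruction Set, and Synthesis, producing Machine Code and Net lists]
8-2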
Partitioning
low level: at the register transfer (RT) level, at the netlist level
split a digital circuit and map it to several devices (FPGAs, ASICs)
system parameters are relatively well known (area, delay)
high level: at the system level
comparison of design alternatives is mandatory (design space exploration)
system parameters are unknown
importance of estimation (analysis, simulation, rapid prototyping)
8-3
Model
(see previous lecture)
model application
define architectural template
identify possible bindings
Very often, parameters are attached to the above models that simply allow to [text missing] of the partitioning (allocation and binding)
Sometimes, [text missing] (simulation, analysis) are applied to give more accurate predictions
allocation gives cost as the sum of the allocated component costs
scheduling gives latency
constraints: feasible schedule, feasible allocation, and maximum values [partly not recoverable]
8-4
The Partitioning Problem
The partitioning problem is to assign n objects O = {o1, ..., on} to m blocks (also called partitions) P = {p1, ..., pm}, such that
• p1 ∪ p2 ∪ ... ∪ pm = O
• pi ∩ pj = ∅ for all i, j with i ≠ j, and
• the cost c(P) is minimized
In the (simple) model:
• objects = data flow graph nodes
• blocks = architecture graph nodes
8-5
Cost of a design point
may include: C, system cost; L, latency in sec; P, power consumption [units of C and P not recoverable]
requires estimation to find C, L, P
linear cost function with penalty:
f(C, L, P) = k1·hC(C,Cmax) + k2·hL(L,Lmax) + k3·hP(P,Pmax)
hC, hL, hP denote how strongly C, L, P violate the design constraints Cmax, Lmax, Pmax
k1, k2, k3: weighting and normalization
8-6
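The linear cost function with penalty can be sketched in C. The slides do not define hC, hL, hP beyond "how strongly the constraint is violated", so the relative overshoot used below, max(0, (v - vmax)/vmax), is an assumption made only for illustration; the function names are hypothetical as well.

```c
#include <assert.h>

/* Assumed violation measure: relative overshoot above the limit, 0 if the
 * constraint is met. Stands in for hC, hL, hP, which the slide leaves open. */
static double violation(double value, double limit) {
    assert(limit > 0.0);
    return value > limit ? (value - limit) / limit : 0.0;
}

/* f(C, L, P) = k1*hC(C,Cmax) + k2*hL(L,Lmax) + k3*hP(P,Pmax) */
double design_cost(double C, double L, double P,
                   double Cmax, double Lmax, double Pmax,
                   double k1, double k2, double k3) {
    return k1 * violation(C, Cmax)
         + k2 * violation(L, Lmax)
         + k3 * violation(P, Pmax);
}
```

A design point that meets all three constraints gets cost 0 under this choice, so the weights k1, k2, k3 only matter for infeasible points.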
Partitioning Methods
exact methods
enumeration
Integer Linear Programs (ILP)
constructive methods
random mapping
hierarchical clustering
iterative methods
Kernighan-Lin Algorithm
Simulated Annealing
Evolutionary Algorithms (EA), see next lecture
8-7
Integer Programming Model
Ingredients:
Cost function and constraints, both involving linear expressions of integer variables from a set X
Cost function: $C = \sum_{x_i \in X} a_i x_i$ with $a_i \in \mathbb{R}$, $x_i \in \mathbb{N}$ (1)
Constraints: $\forall j \in J: \sum_{x_i \in X} b_{i,j} x_i \ge c_j$ with $b_{i,j}, c_j \in \mathbb{R}$ (2)
Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer programming (IP) problem.
If all x_i are constrained to be either 0 or 1, the IP problem is said to be a 0/1 integer programming problem.
8-8
Example
minimize C = a1·x1 + a2·x2 + a3·x3 [coefficients not recoverable]
subject to x1 + x2 + x3 ≥ 2
x1, x2, x3 ∈ {0,1}
Optimal: [solution not recoverable]
8-9
Remarks on Integer Programming
Maximizing the cost function can be done by setting C' = -C
Integer programming is NP-complete.
In practice, running times can increase exponentially with the size of the problem, but problems of some thousands of variables can still be solved with commercial solvers, depending on the size and structure of the problem.
IP models can be a good starting point for modeling, even if in the end heuristics have to be used to solve them.
8-10
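Integer Linear Program for Partitioning (1)
Binary variables x_{i,k}
x_{i,k} = 1: object o_i in block p_k
x_{i,k} = 0: object o_i not in block p_k
Cost c_{i,k}, if object o_i is in block p_k
Integer linear program:
$x_{i,k} \in \{0,1\}$, $1 \le i \le n$, $1 \le k \le m$
$\sum_{k=1}^{m} x_{i,k} = 1$, $1 \le i \le n$
minimize $\sum_{k=1}^{m} \sum_{i=1}^{n} x_{i,k} \cdot c_{i,k}$, $1 \le k \le m$, $1 \le i \le n$
8-11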
Integer Linear Program for Partitioning (2)
Additional constraints
example: maximum number of h_k objects in block k
$\sum_{i=1}^{n} x_{i,k} \le h_k$, $1 \le k \le m$
The idea of mapping the synthesis problem to an ILP is very popular:
Scheduling can be integrated.
Various additional constraints can be added.
If not solving to optimality, run times are acceptable and a solution with a guaranteed quality can be determined.
Finding the right equations to model the constraints is an art.
8-12
Constructive Methods
Random mapping
each object is assigned to a block randomly
Hierarchical clustering
stepwise grouping of objects
closeness function determines how desirable it is to group two objects
Constructive methods
are often used to generate a starting partition for iterative methods
show the difficulty of finding proper closeness functions
8-13
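Hierarchical Clustering - Example (1)
[Figure: weighted graph over v1, v2, v3, v4; the two closest nodes v1 and v3 are merged into v5 = v1 ∪ v3 and the weights of the merged edges are recomputed]
closeness function: arithmetic mean of weights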
8 - 14
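Hierarchical Clustering - Example (2)
[Figure: v2 and v5 are merged into v6 = v2 ∪ v5]
8-15
Hierarchical Clustering - Example (3)
[Figure: v6 and v4 are merged into v7 = v6 ∪ v4]
8-16
Hierarchical Clustering - Example (4)
[Figure: clustering tree over the leaves v1, v2, v3, v4]
step 1: v5 = v1 ∪ v3
step 2: v6 = v2 ∪ v5
step 3: v7 = v6 ∪ v4
cut lines through the tree correspond to partitions
8-17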
Iterative Methods - Kernighan-Lin (1)
Simple greedy heuristic:
Until there is no improvement in cost: re-group a pair of objects which leads to the largest gain in cost
[Figure: graph over v1 to v9 split into two partitions]
example: cost = number of edges crossing the partitions
[cost before re-group, cost after re-group, and gain: values not recoverable]
8 - 18
Iterative Methods - Kernighan-Lin (2)
Problem: the simple greedy heuristic can get stuck in a local minimum
Improved algorithm (Kernighan-Lin):
as long as a better partition is found:
from all possible pairs of objects, virtually re-group the "best" pair (lowest cost of the resulting partition); then, from the remaining not yet touched objects, virtually re-group the "best" pair, etc., until all objects have been re-grouped
from these n/2 partitions, take the one with the smallest cost and actually perform the corresponding re-group operations
8-19
Iterative Methods - Simulated Annealing
From physics:
metal and gas take on a minimal-energy state during cooling down (under certain constraints):
at each temperature, the system reaches a thermodynamic equilibrium
the temperature is decreased (sufficiently) slowly
Probability that a particle "jumps" to a higher-energy state:
$P(e_i, e_{i+1}, T) = e^{\frac{e_i - e_{i+1}}{k_B T}}$
Application to Combinatorial Optimization:
energy = cost of a solution (partition)
cost decreases with temperature; sometimes (with a certain probability) increases in cost are accepted
8-20
Iterative Methods - Simulated Annealing
temp = temp_start;
cost = c(P);
while (Frozen() == FALSE) {
while (Equilibrium() == FALSE) {
P’ = RandomMove(P);
cost’ = c(P’);
deltacost = cost’ - cost;
if (Accept(deltacost, temp) > random[0,1)) {
P = P’;
cost = cost’;
}
}
temp = DecreaseTemp (temp);
}
with $Accept(deltacost, temp) = e^{-\frac{deltacost}{k \cdot temp}}$
8-21
Iterative Methods - Simulated Annealing
Cooling Down: DecreaseTemp(), Frozen()
• temp_start = 1.0
• temp = α • temp (typical: 0.8 ≤ α ≤ 0.99)
• terminate when temp < temp_min or there is no more improvement
Equilibrium: Equilibrium()
• after defined number of iterations or when there is no more
improvement
Complexity
from exponential to constant, depending on the implementation of
the functions Equilibrium(), DecreaseTemp(), and Frozen()
the longer the runtime, the better the quality of results
typical: construct functions to get polynomial runtimes
8 - 22
Hardware/Software Codesign
9. Allocation
doc. dr. Gregor Papa
Jožef Stefan International Postgraduate School
9-1
Integer programming models
Ingredients:
Cost function and constraints, both involving linear expressions of integer variables from a set X
Cost function: $C = \sum_{x_i \in X} a_i x_i$ with $a_i \in \mathbb{R}$, $x_i \in \mathbb{N}$ (1)
Constraints: $\forall j \in J: \sum_{x_i \in X} b_{i,j} x_i \ge c_j$ with $b_{i,j}, c_j \in \mathbb{R}$ (2)
Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer linear programming (ILP) problem.
If all x_i are constrained to be either 0 or 1, the ILP problem is said to be a 0/1 integer linear programming problem.
9-2
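Example
[Equations not recoverable: a cost function to be minimized subject to one ≥-constraint over 0/1 variables, together with the optimal solution]
9-3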
Remarks on integer programming
Maximizing the cost function: just set C' = -C
Integer programming is NP-complete.
Running times depend exponentially on problem size, but problems of >1000 vars are solvable with a good solver (depending on the size and structure of the problem)
The case of $x_i \in \mathbb{R}$ is called linear programming (LP).
LP has polynomial complexity, but most algorithms are exponential, still in practice faster than for ILP problems.
The case of some $x_i \in \mathbb{R}$ and some $x_i \in \mathbb{N}$ is called mixed integer-linear programming.
ILP/LP models can be a good starting point for modeling, even if in the end heuristics have to be used to solve them.
9-4
Simulated Annealing
General method for solving combinatorial optimization problems.
Based on the model of slowly cooling crystal liquids.
Some configuration is subject to changes.
Special property of simulated annealing: changes leading to a poorer configuration (with respect to some cost function) are accepted with a certain probability.
This probability is controlled by a temperature parameter: the probability is smaller for smaller temperatures.
9-5
Explanation
Initially, some random initial configuration is created.
Current temperature is set to a large value.
Outer loop:
• Temperature is reduced for each iteration
• Terminated if (temperature ≤ lower limit) or (number of iterations ≥ upper limit).
Inner loop: For each iteration:
• New configuration generated from current configuration
• Accepted if (new cost ≤ cost of current configuration)
• Accepted with temperature-dependent probability if (cost of new config. > cost of current configuration).
9-6
Multiobjective Optimization
Maximize (y1, y2, …, yk) = g(x1, x2, …, xn)
Pareto optimal = not dominated
[Figure: two plots over the (y1, y2) plane; relative to a given point, the other points are better, worse (dominated), or incomparable]
Pareto set = set of all Pareto-optimal solutions
9-7
Summary
Single objective optimization methods
decision is performed during optimization
Examples: integer programming, simulated annealing
Multiple objective optimization methods
decision is done after optimization
Example: Evolutionary algorithms
Refer to publications of Thiele or Schwefel et al. for more
information
Concept of Pareto points
eliminates large set of non-relevant design points
allows separating optimization and decision
9-8
Improving predictability for caches
Loop caches
Mapping code to less used part(s) of the index space
Cache locking/freezing
Changing the memory allocation for code or data
Mapping pieces of software to specific ways
Methods:
- Generating appropriate way in software
- Allocation of certain parts of the address space to a specific way
- Including way-identifiers in virtual to real-address translation
⇒ Caches behave almost like a scratch pad
9-9
Summary
Allocation strategies for SPM
Dynamic sets of processes
Multiprocessors
MMUs
Sharing between SPMs in a multi-processor
Optimizations for Caches
Code Layout transformations
Way prediction
9-10
Hardware/Software Codesign
10. Code Optimization
doc. dr. Gregor Papa
Jožef Stefan International Postgraduate School
10-1
Task level concurrency management
Granularity: size of tasks (e.g. in instructions)
Readable specifications and efficient implementations can possibly require different task structures.
⇒ Granularity changes
10-2
Merging of tasks
Reduced overhead of context switches,
More global optimization of machine code,
Reduced overhead for inter-process/task communication.
10-3
Splitting of tasks
No blocking of resources while waiting for input,
more flexibility for scheduling, possibly improved result.
10-4
Merging and splitting of tasks
The most appropriate task graph granularity depends upon the context ⇒ merging and splitting may be required.
Merging and splitting of tasks should be done automatically, depending upon the context.
10-5
Example system
[Figure not recoverable]
10-6
Attributes of a system that needs rewriting
Tasks blocking after they have already started running
10-7
Work by Cortadella et al.
1. Transform each of the tasks into a Petri net,
2. Generate one global Petri net from the nets of the tasks,
3. Partition the global net into "sequences of transitions",
4. Generate one task from each such sequence.
Mature, commercial approach not yet available
10-8
Result, as published by Cortadella
[Figure: generated code with an initialization task; the code reads only at the beginning; conditions annotated "never true" and "always true"]
10-9
Optimized version
[Figure: optimized code of the merged task (reading a sample, accumulating a sum, writing a result), with conditions annotated "never true" and "always true"; the code text is not recoverable]
10-10
Task level concurrency management
The dynamic behavior of applications is getting more attention.
Energy consumption reduction is the main target.
Some classes of applications (e.g. video processing) have a considerable variation in processing power requirements depending on input data.
Static design-time methods are becoming insufficient.
Runtime-only methods are not feasible for embedded systems.
⇒ How about mixed approaches?
10-11
Example of a mixed TCM
[Figure: schedules of Task1, Task2, and Task3 over time t against a deadline, for the static and the mixed case]
Static (compile-time) methods can ensure WCET feasible schedules, but waste energy in the average case.
…or they can define a probability for violating the deadline.
Mixed methods use compile-time analysis to define a set of possible execution parameters for each task.
Runtime scheduler selects the most energy saving, deadline preserving combination.
[source citation not recoverable]
10-12
Floating-point to fixed-point conversion
Pros:
Lower cost
Faster
Lower power consumption
Sufficient SQNR, if properly scaled
Suitable for portable applications
Cons:
Decreased dynamic range
Finite word-length effect, unless properly scaled
Overflow and excessive quantization noise
Extra programming effort
© Ki-Il Kum, et al. (Seoul National University): A Floating-point To Fixed-point C Converter For Fixed-point Digital Signal Processors, 2nd SUIF Workshop [year not recoverable]
10-13
Fixed-Point Data Format
Floating-Point vs. Fixed-Point
Floating-Point: exponent, mantissa; automatic computation and update of each exponent at run-time
Fixed-Point: implicit exponent, determined off-line
Integer vs. Fixed-Point
[Figure: word layouts with sign bit S for (a) Integer and (b) Fixed-Point, the latter with a hypothetical binary point]
© Ki-Il Kum, et al
10-14
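Assignment and Addition/Subtraction
[Figure: bit-level alignment for an assignment y = x and for an addition or subtraction of x and y; the operands have different integer word-lengths, which are equalized by shifting; the concrete word-lengths are not recoverable]
© Ki-Il Kum, et al
10-15
Multiplication
[Figure: bit-level view of result = x * y for operands with given integer word-lengths; the result carries two sign bits; the concrete word-lengths are not recoverable]
© Ki-Il Kum, et al
10-16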
Development Procedure
[Figure: flow from the Floating-Point C Program through Range Estimation and a manual specification step to the Fixed-Point C Program]
10-17
© Ki-Il Kum, et al
Range Estimator
[Figure: the Floating-Point C Program is transformed into a Range Estimation C Program by inserting range-estimation subroutine calls]
Range Estimation C Program:
float iir1(float x)
{
static float s = 0;
float y;
y = <coefficient> * s + x; [coefficient not recoverable]
range(y, 0);
s = y;
range(s, 1);
return y;
}
10-18
© Ki-Il Kum, et al
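Operations in fixed-point program
[Figure: fixed-point multiplication of two operands with given integer word-lengths (iwl); an overflow occurs if the result exceeds the representable range; the concrete values are not recoverable]
10-19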
Floating-Point to Fixed-Point Program Converter
Fixed-Point C Program:
int iir1(int x)
{
static int s = 0;
int y;
y=sll(mulh(29491,s)+ (x>> 5),1);
s=y;
return y;
}
mulh: to access the upper half of the multiplied result; target dependent implementation
sll: to remove 2nd sign bit; optional overflow check
© Ki-Il Kum, et al
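The two helpers named on the slide can be sketched in portable C. This is an illustration under assumptions: 16-bit words in a Q15-style format, mulh as the upper half of the 32-bit product, and sll as a left shift without the optional overflow check. The helper q15_mul and the name iir1_fixed are hypothetical; the scaling inside iir1_fixed is copied from the slide as-is. The constant 29491 is approximately 0.9 · 2^15.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 16 bits of the 32-bit product. Right-shifting a negative value is
 * implementation-defined in C; an arithmetic shift is assumed here. */
static int16_t mulh(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * (int32_t)b) >> 16);
}

/* Shift left by n bits (no overflow check in this sketch). */
static int16_t sll(int16_t v, int n) {
    assert(n >= 0 && n < 16);
    return (int16_t)((int32_t)v * (1 << n));
}

/* Q15 x Q15 -> Q15: the product has two sign bits, sll removes the second. */
int16_t q15_mul(int16_t a, int16_t b) {
    return sll(mulh(a, b), 1);
}

/* Fixed-point filter step following the slide's expression. */
int16_t iir1_fixed(int16_t x) {
    static int16_t s = 0;
    int16_t y = sll((int16_t)(mulh(29491, s) + (x >> 5)), 1);
    s = y;
    return y;
}
```

For example, q15_mul(29491, 16384) multiplies roughly 0.9 by 0.5 and yields 14744, slightly below the exact 0.45 · 2^15 ≈ 14745.6 because the low product bits are truncated.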
-
Performance Comparison - Machine Cycles -
[Figure: cycle counts of a fourth-order filter for Fixed-Point (16b) and Floating-Point; values not recoverable]
© Ki-Il Kum, et al
10-21
Performance Comparison - Machine Cycles -
[Figure: cycle counts for Fixed-Point (16b), Fixed-Point (32b), and Floating-Point; values not recoverable]
© Ki-Il Kum, et al
10-22
Performance Comparison - SQNR -
[Figure: SQNR (dB) for Fixed-Point (16b), Fixed-Point (32b), and Floating-Point; values not recoverable]
© Ki-Il Kum, et al
10-23
Impact of memory allocation on efficiency
Array p[j][k]
Row major order (C)
Column major order (FORTRAN)
10-24
Best performance if innermost loop corresponds to rightmost array index
Two loops, assuming row major order (C):
for (k=0; k<=m; k++)
for (j=0; j<=n; j++)
p[j][k] = ...
and
for (j=0; j<=n; j++)
for (k=0; k<=m; k++)
p[j][k] = ...
Same behavior for homogeneous memory access, but for row major order:
the first version has poor cache behavior, the second has good cache behavior
⇒ memory architecture dependent optimization
10-25
⇒ Program transformation "Loop interchange"
⇒ Improved locality
Example:
#define iter 400000
int a[20][20][20];
void computeijk() {int i,j,k;
for (i = 0; i < 20; i++) {
for (j = 0; j < 20; j++) {
for (k = 0; k < 20; k++) {
a[i][j][k] += a[i][j][k];}}}}
void computeikj() {int i,j,k;
for (i = 0; i < 20; i++) {
for (j = 0; j < 20; j++) {
for (k = 0; k < 20; k++) {
a[i][k][j] += a[i][k][j] ;}}}}…
start=time(&start);for(z=0;z<iter;z++)computeijk();
end=time(&end);
printf("ijk=%16.9f\n",1.0*difftime(end,start));
(SUIF interchanges array indexes instead of loops)
-
Results: strong influence of the memory architecture
Loop structure: i j k
[Figure: runtime (reduction to [%]) per processor, including Sun SPARC and Intel Pentium]
Dramatic impact of locality
Not always the same impact ...
[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004]
10-27
Loop fusion (merging), loop fission
for (j=0; j<=n; j++)
p[j] = ... ;
for (j=0; j<=n; j++)
p[j] = p[j] + ...
versus
for (j=0; j<=n; j++)
{p[j] = ... ;
p[j] = p[j] + ...}
Separate loops: small enough to allow zero overhead loops
Merged loop: better locality for access to p; better chances for parallel execution
Which of the two versions is best?
Architecture-aware compiler should select best version.
10-28
Example: simple loops
#define size 30
#define iter 40000
int a[size][size];
float b[size][size];
void ss1() {int i,j;
for (i=0;i<size;i++){
for (j=0;j<size;j++){
a[i][j]+= 17;}}
for(i=0;i<size;i++){
for (j=0;j<size;j++){
b[i][j]-=13;}}}
void ms1() {int i,j;
for (i=0;i<size;i++){
for (j=0;j<size;j++){
a[i][j]+=17; }
for (j=0;j<size;j++){
b[i][j]-=13; }}}
void mm1() {int i,j;
for(i=0;i<size;i++){
for(j=0;j<size;j++){
a[i][j] += 17;
b[i][j] -= 13;}}}
10-29
Results: simple loops
[Figure: runtime of ss1, ms1, and mm1 (100% ≙ max) per platform; the platforms are x86 and Sparc with different gcc versions and optimization levels, labels partly not recoverable]
Merged loops superior, except on Sparc with one optimization setting [setting not recoverable]
10-30
Loop unrolling
for (j=0; j<=n; j++)
p[j] = ... ;
unrolled with factor = 2:
for (j=0; j<=n; j+=2)
{p[j] = ... ; p[j+1] = ...}
Better locality for access to p.
Fewer branches per execution of the loop. More opportunities for optimizations.
Tradeoff between code size and improvement.
Extreme case: completely unrolled loop (no branch)
10-31
Example: matrixmult
#define s 30
#define iter 4000
int a[s][s],b[s][s],c[s][s];
void compute(){int i,j,k;
for(i=0;i<s;i++){
for(j=0;j<s;j++){
for(k=0;k<s;k++){
c[i][k]+= a[i][j]*b[j][k];
}}}}
extern void compute2()
{int i, j, k;
for (i = 0; i < 30; i++) {
for (j = 0; j < 30; j++) {
for (k = 0; k <= 28; k += 2)
{{int *suif_tmp;
suif_tmp = &c[i][k];
*suif_tmp= *suif_tmp+a[i][j]*b[j][k];}
{int *suif_tmp;
suif_tmp=&c[i][k+1];
*suif_tmp=*suif_tmp +a[i][j]*b[j][k+1];
}}}}
return;}
10-32
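Results
[Figure: runtime over the unrolling factor for several processors, including Sun SPARC and Intel Pentium]
Benefits quite small; penalties may be large
[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004]
10-33
Results: benefits for loop dependences
[Figure: runtime (reduction to [%]) over the unrolling factor per processor, for the following code]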
#define s 50
#define iter 150000
int a[s][s], b[s][s];
void compute() {
int i,k;
for (i = 0; i < s; i++) {
for (k = 1; k < s; k++) {
a[i][k] = b[i][k];
b[i][k] = a[i][k-1];
}}}
Small benefits
[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004]
10-34
Loop tiling/loop blocking - Original version -
for (i=1; i<=N; i++)
for (k=1; k<=N; k++){
r=X[i,k]; /* to be allocated to a register */
for (j=1; j<=N; j++)
Z[i,j] += r* Y[k,j]
}
⇒ Never reusing information in the cache for Y and Z if N is large or cache is small (2 N³ references for Z).
10-35
Loop tiling/loop blocking - Tiled version -
for (kk=1; kk<=N; kk+=B)
for (jj=1; jj<=N; jj+=B)
for (i=1; i<=N; i++)
for (k=kk; k<=min(kk+B-1,N); k++){
r=X[i][k]; /* to be allocated to a register */
for (j=jj; j<=min(jj+B-1,N); j++)
Z[i][j] += r* Y[k][j]
}
Reuse factor of B for Z, N for Y
O(N³/B) accesses to main memory
Same elements for next iteration of i
Compiler should select best option
Monica Lam: The Cache Performance and Optimization of Blocked Algorithms, ASPLOS, 1991
10-36
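The tiled loop nest can be checked against the plain version with a small C sketch. The array size N = 8, the 0-based indexing, and the function names are assumptions for illustration; the loop structure follows the tiled version, with the tile size B passed as a parameter.

```c
#include <assert.h>

#define N 8

static int imin(int a, int b) { return a < b ? a : b; }

/* Untiled reference: Z += X * Y. */
void matmul_plain(int X[N][N], int Y[N][N], int Z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            int r = X[i][k];            /* candidate for a register */
            for (int j = 0; j < N; j++)
                Z[i][j] += r * Y[k][j];
        }
}

/* Tiled version: the kk/jj loops select a B x B tile of Y, which is then
 * reused for every row i before moving to the next tile. */
void matmul_tiled(int X[N][N], int Y[N][N], int Z[N][N], int B) {
    assert(B > 0);
    for (int kk = 0; kk < N; kk += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = 0; i < N; i++)
                for (int k = kk; k < imin(kk + B, N); k++) {
                    int r = X[i][k];
                    for (int j = jj; j < imin(jj + B, N); j++)
                        Z[i][j] += r * Y[k][j];
                }
}
```

Both functions compute the same result for any positive B, including tile sizes that do not divide N; only the order of the memory accesses differs.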
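Example
In practice, results by Buchwald are disappointing.
One of the few cases where an improvement was achieved:
Source: similar to matrix mult.
[Figure: runtime over the tiling factor for SPARC and Pentium]
[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004]
10-37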
Summary
Task concurrency management
Re-partitioning of computations into tasks
Dynamic exploitation of slack
Floating-point to fixed-point conversion
Range estimation
Conversion
Analysis of the results
High-level loop transformations
Fusion
Unrolling
Tiling
- 8
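Transformation "Loop nest splitting"
Example: Separation of margin handling
[Figure: image with margin region; many if-statements for margin-checking, but only few margin elements to be processed; the inner region needs no checking and is processed efficiently]
10-39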
Loop nest from MPEG-4 full search motion estimation
[Listing not recoverable: the original loop nest over z, x, y, k, l, i, j with if-statements in the innermost loop (then-block-1/else-block-1, then-block-2/else-block-2), and the transformed loop nest in which an additional if-statement on x and y selects a copy of the inner loops without the checks]
analysis of polyhedral domains, selection with genetic algorithm
[H. Falk et al., Inf 12, UniDo, 2002]
10-40
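Results for loop nest splitting - Execution times -
[Figure: execution times for the benchmarks Cavity, Motion Estimation, and QSDPCM on several processors; platform labels and values not recoverable]
[H. Falk et al., Inf 12, UniDo, 2002]
10-41
Results for loop nest splitting - Code sizes -
[Figure: code sizes for the benchmarks Cavity, Motion Estimation, and QSDPCM on several processors; platform labels and values not recoverable]
[Falk, 2002]
10-42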
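Array folding
[Figure: initial arrays]
10-43
Array folding
[Figure: unfolded arrays]
10-44
Intra-array folding
Inter-array folding
[Figure: arrays folded within one array and across arrays]
10-45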
Application
Array folding is implemented in the DTSE optimization proposed by IMEC. Array folding adds div and mod ops.
Optimizations are required to remove these costly operations.
At IMEC, ADOPT address optimizations perform this task.
For example, modulo operations are replaced by pointers (indexes) which are incremented and reset.
10-46
Results (cycles for the cavity benchmark)
[Figure: cycle counts for several optimization variants on several processors, including Pentium; labels and values not recoverable]
ADOPT and DTSE are both required to achieve a real benefit
[C.Ghez et al.: Systematic high-level Address Code Transformations for Piece-wise Linear Indexing: Illustration on a Medical Imaging Algorithm, IEEE WS on Signal Processing System: design & implementation, 2000, pp. 623-632]
10-47
Code adaptation
porting the description from ANSI-C to Handel-C
VHDL requires substantially more changes
the description of the algorithm in C code has to be adapted appropriately before the hardware implementation
SystemC and Handel-C contain only a subset of the statements of ordinary C
floating-point arithmetic has to be realized differently, since hardware implementations generally do not support it
• it takes up too many of the available resources
• it reduces the operating frequency
insertion of statements for parallel execution of parts of the code
adjustment of the sizes of all variables
10 - 48
Adaptation of the program code
replacement for floating-point arithmetic
use of fixed point
use of integer values (smaller unit of measure)
fixed-point values are multiplied and represented as integer values
si = 5625 for si = 5.625
the integer part and the fractional part are represented as the upper and the lower part of an integer variable
signed int 8 var1, var2;
signed int 16 si;
si = 0x05a0; (si = 5.625)
var1 = si[15:8]; (var1 = 0x05)
var2 = si[7:0]; (var2 = 0xa0)
10 - 49
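statements for parallel execution of parts of the code
the par statement instead of for
• where possible, depending on the contents of the loop
[Listing partly not recoverable: a for loop over i = 0..2 whose body updates a[i], b[2+i], and c[i] sequentially (seq), next to the same loop written with par so that the iterations run in parallel while the statements inside each iteration stay sequential]
10 - 50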
adjustment of the sizes of all variables
all sizes must be defined in advance
• they should be minimized to reduce resource usage
signed/unsigned has to be determined in advance
when computing with variables of different sizes
• use of the concatenation operator: the missing digits are added to the smaller variable
• use of the low-order digits of the larger variable
[signed | unsigned] int n: an n-bit variable
[Listing partly not recoverable: wider variables var1, var3 and narrower variables var2, var4 are declared as unsigned int with explicit widths; var3 is computed from var1 and a widened var2, and var4 from the low-order bits of var1 and var2]
10 - 51
Hardware/Software Codesign
11. Compilation
doc. dr. Gregor Papa
Jožef Stefan International Postgraduate School
11 - 1
Compilers for embedded systems
Why are compilers an issue?
Many reports about low efficiency of standard compilers
- Special features of embedded processors have to be exploited.
- High levels of optimization more important than compilation
speed.
- Compilers can help to reduce the energy consumption.
- Compilers could help to meet real-time constraints.
Less legacy problems than for PCs.
- There is a large variety of instruction sets.
- Design space exploration for optimized processors makes
sense
11 - 2
3 key problems for future memory systems
1. (Average) Speed
2. Energy/Power
3. Predictability
[Figure: energy and access times of memories]
11 - 3
Optimization for low energy the same as optimization for high performance?
No!
• High performance if the available memory bandwidth is fully used; low energy consumption if memories are in stand-by mode
• Reduced energy if more values are kept in registers
[Listing partly not recoverable: a C loop over int a[1000] that accumulates array elements through a pointer, with two ARM-style assembly versions (sequences of LDR, ADD, MOV, CMP, and a conditional branch); one version keeps more values in registers]
11 - 4
Compiler optimizations for improving energy efficiency
Energy-aware scheduling
Energy-aware instruction selection
Operator strength reduction: e.g. replace multiplications by additions and shifts
Minimize the bitwidth of loads and stores
Standard compiler optimizations with energy as a cost function
E.g.: Register pipelining:
for i := 0 to 10 do
C := 2 * a[i] + a[i-1];
becomes
R2 := a[0];
for i := 1 to 10 do
begin
R1 := a[i];
C := 2 * R1 + R2;
R2 := R1;
end;
Exploitation of the memory hierarchy
11 - 5
Using scratch pad memories (SPM)
Hierarchy
[Figure: address space starting at 0 with the scratch pad memory (SPM) and main memory; the scratch pad memory needs no tag memory]
Example: ARM7TDMI processor cores, well-known for low power consumption
11 - 6
a ed tool lo
e rag a i o r e to allo ate to e i i e tio
or example:
#pragma arm section rwdata = "foo", rodata = "bar"
int x2 = 5; // in foo (data part of region)
int const z2[3] = {1,2,3}; // in bar
Input a scatter loading file to the linker for allocating the section to a specific address range.
http://www.arm.com/documentation/Software_Development_Tools/index.html
Global optimization model (Dortmund)
Which memory objects (array, loop, etc.) are to be stored in the SPM?
Non-overlaying ("static") allocation:
- Gain $g_k$ and size $s_k$ for each segment k.
- Maximise the gain $G = \sum_k g_k$, respecting the size of the SPM: $S_{SP} \ge \sum_k s_k$.
- Solution: knapsack algorithm.
Overlaying ("dynamic") allocation:
- Moving objects back and forth.
[Figure: example program (nested for loops, a while loop, a repeat loop with a call, arrays and an int variable) together with a processor, a scratch pad memory of capacity $S_{SP}$ and main memory.]
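The static allocation problem above is a 0/1 knapsack. A minimal sketch follows, assuming integer sizes and using hypothetical object names and numbers; the function name allocate_to_spm is also an assumption and not part of the original tool flow. It uses the textbook dynamic program over the capacity.

```python
def allocate_to_spm(objects, capacity):
    """Solve the static SPM allocation as a 0/1 knapsack.

    objects:  list of (name, gain, size) tuples with integer sizes
    capacity: scratch pad size S_SP in the same unit as the sizes
    Returns (total_gain, names of the objects placed in the SPM).
    """
    # best[c] holds (gain, chosen names) for the objects seen so far
    # and a capacity of c.
    best = [(0, [])] * (capacity + 1)
    for name, gain, size in objects:
        new = list(best)
        for c in range(size, capacity + 1):
            # Read from 'best' (the state before this object) so that
            # every object is used at most once.
            candidate = best[c - size][0] + gain
            if candidate > new[c][0]:
                new[c] = (candidate, best[c - size][1] + [name])
        best = new
    return best[capacity]


if __name__ == "__main__":
    # Hypothetical memory objects: (name, gain g_k, size s_k)
    objs = [("array_a", 60, 10), ("loop_1", 100, 20), ("func_f", 120, 30)]
    print(allocate_to_spm(objs, 50))
```

The dynamic program runs in time proportional to the number of objects times the capacity, which is practical for the small capacities typical of scratch pads.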
ILP representation: migrating functions and variables
Symbols:
- $S(var_k)$: size of variable k
- $n(var_k)$: number of accesses to variable k
- $e(var_k)$: energy saved per variable access, if $var_k$ is migrated
- $E(var_k)$: energy saved if variable $var_k$ is migrated, $E(var_k) = e(var_k)\, n(var_k)$
- $x(var_k)$: decision variable, 1 if variable k is migrated to the SPM, 0 otherwise
- K: set of variables
Similar for functions (index set I, functions $F_i$).
Integer programming formulation:
Maximize $\sum_{k \in K} x(var_k)\, E(var_k) + \sum_{i \in I} x(F_i)\, E(F_i)$
Subject to the constraint
$\sum_{k \in K} S(var_k)\, x(var_k) + \sum_{i \in I} S(F_i)\, x(F_i) \le S_{SP}$
Reduction in energy and average run-time
[Figure: cycles and energy for the benchmark Multi sort (mix of sort algorithms). Feasible with a standard compiler and postpass optimization.]
Measured processor / external memory energy plus CACTI values for the SPM (combined model).
Numbers will change with technology, algorithms remain unchanged.
Allocation of basic blocks
Fine-grained granularity smoothens the dependency on the size of the scratch pad.
Requires additional jump instructions to return to main memory.
[Figure: basic blocks in main memory with inserted jumps. Statically two jumps, but only one is taken; jumps are also needed for consecutive basic blocks.]
Allocation of basic blocks, sets of adjacent basic blocks and the stack
Requires generation of additional jumps (special compiler).
[Figure: cycles and energy.]
Savings for memory system energy alone
Combined model for memories
Timing predictability
aiT: WCET analysis tool
- support for scratchpad memories by specifying different memory access times
- also features experimental cache analysis for ARM7
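Architectures considered
ARM7TDMI with 3 different memory architectures ("?" marks a digit that is illegible in the source):
Main memory
- LDR cycles: (CPU, IF, DF) = (3, 2, 2)
- STR cycles: (2, 2, 2)
- other instructions: (?, 2, 0)
Main memory + unified cache
- LDR cycles: (CPU, IF, DF) = (3, ?2, 6)
- STR cycles: (2, ?2, 3)
- other instructions: (?, ?2, 0)
Main memory + scratch pad
- LDR cycles: (CPU, IF, DF) = (3, 0, 2)
- STR cycles: (2, 0, 0)
- other instructions: (?, 0, 0)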
Results
[Figure: WCET results using a scratchpad and using a unified cache.]
References:
• Wehmeyer, Marwedel: Influence of On-chip Scratchpad Memories on WCET, ?th Intl. Workshop on Worst-Case Execution Time (WCET) Analysis, Catania, Sicily, Italy, June 2?, 200?
• Second paper on SP/Cache and WCET at DATE, March 200?
Multiple scratch pads
Optimization for multiple scratch pads
Minimize $C = \sum_j e_j \sum_i x_{j,i}\, n_i$
with $e_j$: energy per access to memory j,
$x_{j,i} = 1$ if object i is mapped to memory j, $x_{j,i} = 0$ otherwise,
and $n_i$: number of accesses to memory object i,
subject to the constraints:
$\forall j: \sum_i x_{j,i}\, S_i \le SSP_j$
$\forall i: \sum_j x_{j,i} = 1$
with $S_i$: size of memory object i,
$SSP_j$: size of memory j.
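For a handful of memory objects the formulation above can be checked by exhaustive search. The sketch below is an illustration only: the function name, the example numbers and the use of an unbounded "main memory" entry are assumptions, and a real tool would hand the same model to an ILP solver instead of enumerating all assignments.

```python
from itertools import product


def assign_objects(energy_per_access, mem_sizes, obj_sizes, obj_accesses):
    """Minimise C = sum_j e_j * sum_i x_ji * n_i by enumeration.

    energy_per_access[j]: e_j, energy per access to memory j
    mem_sizes[j]:         SSP_j (float('inf') models main memory)
    obj_sizes[i]:         S_i
    obj_accesses[i]:      n_i
    Returns (minimal cost, tuple mapping object i to memory j),
    or None if no assignment satisfies the capacity constraints.
    """
    n_mem = len(mem_sizes)
    best = None
    # Each tuple maps every object to exactly one memory (sum_j x_ji = 1).
    for assignment in product(range(n_mem), repeat=len(obj_sizes)):
        used = [0] * n_mem
        for i, j in enumerate(assignment):
            used[j] += obj_sizes[i]
        if any(used[j] > mem_sizes[j] for j in range(n_mem)):
            continue  # violates sum_i x_ji * S_i <= SSP_j
        cost = sum(energy_per_access[j] * obj_accesses[i]
                   for i, j in enumerate(assignment))
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


if __name__ == "__main__":
    # Two scratch pads plus main memory; all numbers are made up.
    print(assign_objects([1.0, 2.0, 10.0], [4, 8, float("inf")],
                         [4, 4, 8], [100, 50, 10]))
```

The enumeration grows exponentially with the number of objects, so it only serves to make the objective and the two constraints concrete.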
Considered partitions
Results for parts of the coder/decoder
[Figure label: working set.]
A key advantage of partitioned scratchpads for multiple applications is their ability to adapt to the size of the current working set.
Dynamic replacement within scratch pad
- Effectively results in a kind of compiler-controlled segmentation/paging for the SPM
- Address assignment within the SPM required (paging or segmentation-like)
[Figure: CPU, SPM and memory.]
Reference: Verma, Marwedel: Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS 200?
Scratch pad: based on liveness analysis
[Figure: memory objects MO = {A, T1, T2, T3, T4}; the scratch pad has the size of one such object. Solution: A and T3 are assigned to the scratch pad.]
Summary
High-level transformations
- Loop nest splitting
- Array folding
Impact of memory architecture on execution times & energy.
The SPM provides
- Runtime efficiency
- Energy efficiency
- Timing predictability
Achieved savings are sometimes dramatic, for example in the energy of the memory system.
Hardware/Software Codesign
12. Performance Estimation
doc. dr. Gregor Papa
Jožef Stefan International Postgraduate School
System Design
[Figure: design flow from the specification to SW compilation (machine code) and HW synthesis (net lists), with the instruction set between the software and the hardware side.]
Motivation
The values of the objective functions that should guide the design space exploration are obtained through estimation.
Design space exploration intends to change:
- mapping (binding and resource sharing)
- architecture (hardware platform)
- application (choice between different algorithms and/or partitioning into concurrent components)
[Figure: application and architecture are combined by the mapping, which is evaluated by estimation.]
Outline
Overview
Performance Metrics
Subsystems
Abstraction Levels
Performance Estimation Methods
Performance Estimation – Global Picture
[Figure: performance estimation classified along four dimensions:
- performance estimation method: analytic (e.g. x(y) = x0 * exp(-k0*y) with x0 = 105, k0 = 1.2593), simulation, statistic
- metric: time, power, area, cost, other (quality, SNR, …)
- abstraction level: low-level (e.g. RTL, ISA), intermediary level (e.g. TLM, OS), high-level (e.g. functional, HLL)
- subsystem to analyze: HW subsystem, CPU subsystem, SW subsystem, interconnect subsystem, MPSoC]
Note:
RTL – Register Transfer Level
ISA – Instruction Set Architecture
TLM – Transaction-Level Model
OS – Operating System
HLL – High-Level Language
Position in the System Design Flow
High-level estimation
- Advantages: short simulation time, no details of implementation necessary
- Drawbacks: limited accuracy, e.g. no information about timing
Low-level estimation (closer to the implementation)
- Advantages: higher accuracy
- Drawbacks: long simulation time, many implementation details need to be known
[Figure: design flow from the functional specification via mapping and partitioning and stepwise refinement down to the implementation, with estimation at each level.]
Use of the Estimation
Prerequisite for:
- part of the feedback cycle (see global flow)
- functional and non-functional validation (e.g. power, energy, timing, memory consumption)
- showing equivalence of specification and implementation (functional and non-functional aspects)
Outline
Overview
Performance Metrics
Subsystems
Abstraction Levels
Performance Estimation Methods
Performance Metrics
Performance metric = function defined on relevant non-functional properties of a system which indicates a quantitative performance of the system.
Time [second]
for example end-to-end delay, throughput, latency
Power, Energy, Temperature [mW, mJ, °C]
for example power consumed by the network, energy to execute a task, maximal temperature
Area [mm2]
for example area of an integrated circuit
Cost [$]
for example cost of parts, labor, development cost
Other metrics:
SNR (signal to noise ratio), quality of the video image/sound, size of the hardware platform
Usually, performance metrics are conflicting.
Examples of Performance Trade-offs
Mapping Domain
change the mapping of the application to the architecture
→ see example 1
Architecture Domain
change the hardware platform
→ see example 2
Application Domain
change the application implementation (e.g. degree of parallelization, partitioning into concurrent processes, use of different algorithms with a similar functional behavior)
Example 1: Trade-offs in the Mapping Domain
[Figure: mapping optimization space with two objectives: objective 1, worst load of a computation node; objective 2, worst load of a communication node (worst bus load).]
Example 2: Trade-offs in the Hardware Platform
From high flexibility to high timing performance and energy efficiency:
- General-purpose processors
- Application-specific instruction set processors (ASIPs): microcontrollers, DSPs (digital signal processors)
- Programmable hardware: FPGA (field-programmable gate arrays)
- Application-specific integrated circuits (ASICs)
Outline
Overview
Performance Metrics
Subsystems
Abstraction Levels
Performance Estimation Methods

System Composition
[Figure: an architecture is composed from communication templates, computation templates, and scheduling and arbitration templates (e.g. proportional share, dynamic and static fixed priority).]
Why Is Estimation Difficult?
Computation and communication
- (Non-deterministic) computations in processing nodes
- (Non-deterministic) communication delays
- Complex resource interaction via scheduling and arbitration policies
Cyclic timing dependencies
- Internal data streams interact on computing and communication resources
- Interaction determines stream characteristics
Uncertain environment
- Different load scenarios
- Unknown (worst-case) inputs
Illustration of Evaluation Difficulties
[Figure: an input stream of events (a, b, c) passes through a buffer and is processed by a task on a processor.]
- Task communication
- Task scheduling
- Complex input: timing (jitter, bursts), different event types
- Variable resource availability
- Variable execution demand: input (different event types), internal state (program, cache)
Requirements for Performance Estimation
- Estimation should be composable in terms of subsystems and their interactions, i.e. HW, SW, interconnect; computation, communication, and scheduling/arbitration
- Estimation should cover different metrics, for example power, energy, delay, memory, throughput
- Estimation method should represent a reasonable trade-off between (a) estimation effort in terms of computation/simulation time and set-up time and (b) accuracy
Outline
Overview
Performance Metrics
Subsystems
Abstraction Levels
Performance Estimation Methods

Brief History in Abstraction
[Figure: successive abstraction steps in design: transistor model (t = RC), gate-level model, register-transfer level model (critical path latency), and system-level models with SW tasks, OS/drivers, HW adaptation and an on-chip communication network. The corresponding tools range from transistor-level and gate-level simulators to HW/SW codesign/cosimulation tools, SystemC and formal methods.]
Outline
Overview
Performance Metrics
Subsystems
Abstraction Levels
Performance Estimation Methods

System-Level Performance Estimation Methods
[Figure: estimates (e.g. of delay) obtained by measurement, simulation, probabilistic estimation and worst-case (formal) analysis, compared with the best case and worst case of the real system.]
Some of these methods are presented in Lecture 6 (next lecture), others are presented later.
Overview
How to evaluate a system?
- Measurements: use an existing instance of the system to perform performance measurements.
- Formal Analysis: develop a mathematical abstraction of the system and derive formulas which describe the system performance.
- Simulation: develop a program which implements a model of the system. Perform experiments by running the program.
- Statistics: develop a statistical abstraction of the system and derive statistic performance via analysis or simulation.
[Figure: inputs of an estimation tool (method): designer's experience, component simulation, data sheets, platform benchmarks and traces feed a model of the application, a model of the environment (spec. of inputs, input data) and a model of the architecture (platform); together these form the system model. The output of the tool is the estimation results.]
Analytic Models
Static analytic (symbolic) models
- Describe computing, communication, and memory resources by algebraic equations, e.g.
  $delay = \left\lceil \frac{\#words}{burst\_size} \right\rceil \cdot comm\_time$
- Describe system properties by parameters, e.g. data rate
- Combine relations
Fast and simple estimation
Generally inaccurate modeling, e.g. resource sharing not modeled
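As a worked example of such an algebraic model, the burst-transfer delay formula can be evaluated directly. This is a minimal sketch; the function name and the example numbers are assumptions, and comm_time is taken to be the time of one burst transfer.

```python
def transfer_delay(num_words, burst_size, comm_time):
    """delay = ceil(#words / burst_size) * comm_time.

    num_words and burst_size are positive integers; comm_time is the
    time needed for one burst in any consistent time unit.
    """
    if num_words < 0 or burst_size <= 0:
        raise ValueError("num_words must be >= 0 and burst_size > 0")
    bursts = -(-num_words // burst_size)  # integer ceiling division
    return bursts * comm_time


if __name__ == "__main__":
    # 100 words in bursts of 16 words need 7 bursts of 5 time units each.
    print(transfer_delay(100, 16, 5))
```

The model is fast to evaluate but, as noted on the slide, it ignores effects such as contention for a shared bus.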
Dynamic Analytic Models
Combination between
- Static models, possibly extended by non-determinism in run-time and event processing
- Dynamic models for describing e.g. resource sharing mechanisms (scheduling and arbitration)
Existing approaches
- Stochastic queuing theory (statistical bounds)
- Non-deterministic queuing theory (worst-case/best-case behavior)
Example - Queuing Systems
Clients request some service from a server over a network.
- Performance of the server
- Performance of the network
Stochastic Models - Queuing Systems
A queuing system is described by
- Arrival rate
- Service mechanism
- Queuing discipline
Performance measures
- Customer point of view: average delay in queue
- System point of view: time-average number of customers in queue, proportion of time the server is busy
The classical M/M/1 queuing system (M = Markovian (exp.) distribution)
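For the M/M/1 system the three performance measures have well-known closed forms. The sketch below evaluates them for an arrival rate lambda and a service rate mu with lambda < mu; the function name and dictionary keys are assumptions chosen for illustration.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state measures of an M/M/1 queue (requires lambda < mu)."""
    if not 0 < arrival_rate < service_rate:
        raise ValueError("need 0 < arrival_rate < service_rate for stability")
    rho = arrival_rate / service_rate  # proportion of time the server is busy
    return {
        "utilization": rho,
        # average waiting time in the queue: Wq = rho / (mu - lambda)
        "avg_delay_in_queue": rho / (service_rate - arrival_rate),
        # time-average number of customers in the queue: Lq = rho^2 / (1 - rho)
        "avg_number_in_queue": rho ** 2 / (1 - rho),
    }


if __name__ == "__main__":
    m = mm1_metrics(arrival_rate=2.0, service_rate=4.0)
    print(m)
    # Little's law links both views: Lq = lambda * Wq
    print(m["avg_number_in_queue"], 2.0 * m["avg_delay_in_queue"])
```

Note how both queue measures grow without bound as the utilization approaches 1, which is why stochastic models give statistical rather than hard bounds.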
Nondeterministic Models - Queuing Systems
A queuing system is described by
- Arrival function (bounds on arrival times)
- Service functions (bounds on server behavior)
- Resource interaction
Performance measures
- worst-case delay in queue
- worst-case number of customers in queue
- worst-case and best-case end-to-end delay in the system
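A common special case shows how such bounds are obtained. The sketch below assumes shapes that the slide does not specify: a token-bucket arrival function alpha(t) = burst + rate * t and a rate-latency service function beta(t) = service_rate * max(0, t - latency). For these two curves the standard network-calculus results give a worst-case delay of latency + burst / service_rate and a worst-case backlog of burst + rate * latency, provided rate <= service_rate.

```python
def worst_case_bounds(burst, rate, service_rate, latency):
    """Delay and backlog bounds for a token-bucket arrival curve served
    by a rate-latency server (both curve shapes are assumptions).

    Returns (worst-case delay, worst-case number of queued items).
    """
    if rate > service_rate:
        raise ValueError("unbounded: long-term arrival rate exceeds service rate")
    delay_bound = latency + burst / service_rate
    backlog_bound = burst + rate * latency
    return delay_bound, backlog_bound


if __name__ == "__main__":
    # A burst of 4 items, 1 item per time unit on average, served at
    # 2 items per time unit after an initial latency of 3 time units.
    print(worst_case_bounds(burst=4, rate=1, service_rate=2, latency=3))
```

Unlike the stochastic measures, these values are hard bounds that hold for every arrival pattern consistent with the arrival function.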
Simulation
- Consider the underlying hardware platform and the mapping of the application onto that architecture
- Combine functional simulation and performance data
- Evaluate average-case behavior for one simulation scenario
- Complex set-up and extensive runtimes
- ... but accurate results and good debugging possibilities
[Figure: input trace → model (application, hardware platform, mapping) → output trace]
Example: Trace-Based Simulation
Abstract simulation at system level without timing
Faster than simulation, but still based on a single input trace
Abstraction
- Application: represented by abstract execution traces → graph of events: read, write, and execute
- Architecture: represented by "virtual machines" and "virtual channels" including non-functional properties (timing, power, energy)
Steps
- Execution trace determined by functional application simulation
- Extension of the event graph by non-functional properties
- Simulation of the extended model
[Figure: complete functional application model → trace → abstract event graph; together with the architecture description → trace simulation → estimation results]
e.g. Lahiri et al., Pimentel et al.
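The three steps can be sketched in a few lines. This is a strongly simplified illustration and not the method of the cited tools: the trace format, the cost table and the function name are assumptions, and events are replayed strictly one after another, so overlap between computation and communication is ignored.

```python
def simulate_trace(trace, cost_model):
    """Replay an abstract execution trace on an architecture cost model.

    trace:      list of (event_type, amount) pairs with event types
                'read', 'write' and 'execute'
    cost_model: event_type -> (time per unit, energy per unit); this is
                the annotation with non-functional properties
    Returns (total time, total energy) of a purely sequential replay.
    """
    total_time = 0.0
    total_energy = 0.0
    for event, amount in trace:
        time_per_unit, energy_per_unit = cost_model[event]
        total_time += amount * time_per_unit
        total_energy += amount * energy_per_unit
    return total_time, total_energy


if __name__ == "__main__":
    # Step 1: a trace as it might come out of a functional simulation.
    trace = [("read", 4), ("execute", 10), ("write", 2)]
    # Step 2: hypothetical timing/energy annotations of one architecture.
    costs = {"read": (2.0, 1.0), "execute": (1.0, 0.5), "write": (3.0, 1.5)}
    # Step 3: simulate the extended model.
    print(simulate_trace(trace, costs))
```

Because the trace is recorded once and only the cost table changes, several candidate architectures can be compared without rerunning the functional simulation.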
