
HW SW Codesign


Hardware/Software Codesign

0. Organization

doc. dr. Gregor Papa

Jožef Stefan International Postgraduate School


0-1
Overview
Administration
Course synopsis
Introduction and motivation

0-2
Organization (1)
Lecture: introductory course + consultations
Exercises: delivered during consultations

Contact: Gregor Papa


gregor.papa@ijs.si
01-477-3514

Web page: http://csd.ijs.si/papa/courses.php

0-3
Organization (2)
Course materials:
ƒ slide copies, exercise sheets, papers
ƒ the slides contain material from Marco Platzner, Peter
Marwedel, Lothar Thiele, Frank Vahid, Reinhard Wilhelm

References:
ƒ P. Marwedel: Embedded System Design, Springer, 2006.
ƒ F. Vahid, T. Givargis: Embedded System Design: A Unified
Hardware/Software Introduction, John Wiley & Sons, 2002.

Exam: written seminar + oral, Slovenian or English

0-4
Textbook & slides
course based
ƒ on the book and the slides
“Embedded System Design” by
Peter Marwedel

ƒ on the slides “Hardware/Software


Codesign” by Lothar Thiele

0-5
Overview
Administration
Course synopsis
Introduction and motivation

0-6
Course Synopsis
Different Levels of Model Representation
ƒ Specifications
ƒ Models
ƒ Abstraction Levels
Dealing with Contradictory Constraints
ƒ Exploration
ƒ Simulation
• Worst-Case Execution Time
ƒ Optimization
Hardware/Software Mapping
ƒ Partitioning
ƒ Scheduling
ƒ Allocation
Software Code Optimizations
ƒ Compilation
Estimation

0-7
Benefits ? Learn about …
… challenges and approaches in modern system design
… useful optimization methods
… performance estimation of embedded systems
… a current research area

0-8
Overview
Administration
Course synopsis
Introduction and motivation

0-9
What is HW/SW Codesign?
... integrated design of systems that consist of hardware-
and software-components

ƒ Analysis of HW/SW boundaries and interfaces


ƒ Evaluation of design alternatives

0 - 10
Hardware/Software Boundaries
General purpose systems (PC, workstation)
ƒ processor design:
processor ↔ compiler, operating system

Embedded systems (cell phone, automotive electronics)


ƒ design of specialized processors:
processor ↔ compiler, operating system
ƒ system design:
processors ↔ dedicated hardware devices

0 - 11
Target Architectures

0 - 12
Why Codesign? (1)
Modern embedded systems require “design” optimization
ƒ many functions, great variability, high flexibility
ƒ heterogeneous target systems
• processors, ASICs, FPGAs, systems-on-chip, …
ƒ many design goals
• performance, cost, power consumption, reliability, ...

Advances in formal / automated design methods


ƒ automation on the system level becomes possible
ƒ reduction of cost and time-to-market

0 - 13
Why Codesign? (2)
Optimization of the “design process”

classic design co-design

0 - 14
Codesign methodologies
Different Levels of Model Representation
Dealing with Contradictory Constraints
Hardware/Software Mapping
Software Code Optimizations
Estimation

0 - 15
System Design

0 - 16
System Design

0 - 17
Motivation (1)
According to forecasts, the future of IT is characterized
by terms such as
ƒ Disappearing computer,
ƒ Ubiquitous computing,
ƒ Pervasive computing,
ƒ Ambient intelligence,
ƒ Post-PC era,
ƒ Cyber-physical systems.
Basic technologies:
ƒ Embedded Systems
ƒ Communication technologies

0 - 18
Motivation (2)

“Information technology (IT) is on the verge of another revolution. …..


networked systems of embedded computers ... have the potential to change
radically the way people interact with their environment by linking together a
range of devices and sensors that will allow information to be collected,
shared, and processed in unprecedented ways. ...
The use … throughout society could well dwarf previous milestones in the
information revolution.”
Source. Edward A. Lee, UC Berkeley, ARTEMIS
Embedded Systems Conference, Graz, 5/2006

0 - 19
Embedded Systems & Cyber-Physical
Systems
“Dortmund“ Definition: [Peter Marwedel]
Information processing systems embedded into a larger
product

Berkeley: [Edward A. Lee]:


Embedded software is software integrated with physical*
processes. The technical problem is managing time and
concurrency in computational systems.

⇒ Definition: Cyber-Physical (cy-phy) Systems (CPS) are
integrations of computation with physical processes [Edward Lee,
2006].
0 - 20
Embedded Systems and ubiquitous
computing
Ubiquitous computing: Information anytime, anywhere.
Embedded systems provide fundamental technology.

[Figure: Pervasive/ubiquitous computing, distributed systems, and embedded web systems combine two fields.
Communication technology: optical networking, network management, distributed applications, service provision, UMTS, DECT, Hiperlan, ATM.
Embedded systems: robots, control systems, feature extraction and recognition, sensors/actors, A/D-converters.
Shared concerns: quality of service, real-time, dependability.]
0 - 21
Growing importance of embedded systems

ƒ Spending on GPS units exceeded $100 mln during Thanksgiving week, up 237%
from 2006 … More people bought GPS units than bought PCs, NPD found.
[www.itfacts.biz, Dec. 6th, 2007]

ƒ …, the market for remote home health monitoring is expected to generate $225
mln revenue in 2011, up from less than $70 mln in 2006, according to Parks
Associates. [www.itfacts.biz, Sep. 4th, 2007]
ƒ According to IDC the identity and access management (IAM) market in Australia
and New Zealand (ANZ) … is expected to increase at a compound annual growth
rate (CAGR) of 13.1% to reach $189.3 mln by 2012 [www.itfacts.biz, July 26th, 2008].
ƒ Accessing the Internet via a mobile device up by 82% in the US, by 49% in
Europe, from May 2007 to May 2008 [www.itfacts.biz, July 29th, 2008]

0 - 22
Automotive electronics
Functions by embedded processing:
ƒ ABS: Anti-lock braking systems
ƒ ESP: Electronic stability control
ƒ Airbags
ƒ Efficient automatic gearboxes
ƒ Theft prevention with smart keys
ƒ Blind-angle alert systems
ƒ ... etc ...
Multiple networks
ƒ Body, engine, telematics, media, safety
Multiple processors
ƒ Up to 100
• 8-bit – door locks, lights, etc.
• 16-bit – most functions
• 32-bit – engine control, airbags
ƒ Processing where the action is
ƒ Sensors and actuators distributed all over the vehicle
ƒ Networked together

0 - 23
Avionics

ƒ Flight control systems,


ƒ anti-collision systems,
ƒ pilot information systems,
ƒ power supply system,
ƒ flap control system,
ƒ entertainment system,
ƒ …
Dependability is of utmost
importance.

0 - 24
Railways

ƒ Safety features
contribute significantly
to the total value of
trains, and dependability
is extremely important

0 - 25
Telecommunication
ƒ Mobile phones have been one of the fastest growing
markets in recent years:
• Multiprocessor
• 8-bit/32-bit for UI
• DSP for signals
• 32-bit in IR port
• 32-bit in Bluetooth
• 8-100 MB of memory
• All custom chips
• Power consumption & battery life depend on
software
ƒ base stations
• Massive signal processing
• Several processing tasks per connected
mobile phone
• Based on DSPs
• Standard or custom
• 100s of processors
ƒ Geo-positioning systems,
ƒ Fast Internet connections,
ƒ Closed systems for police, ambulances, rescue staff.

0 - 26
Medical systems
ƒFor example:
• Artificial eye: several approaches,
e.g.:
• Camera attached to glasses;
computer worn at belt; output
directly connected to the brain,
“pioneering work by William
Dobelle”. Previously at
[www.dobelle.com]

ƒ Translation into sound; claiming much


better resolution.
[http://www.seeingwithsound.com/etumble.htm]

0 - 27
Extremely Large
Functions requiring computers:
ƒ Radar
ƒ Weapons
ƒ Damage control
ƒ Navigation
ƒ basically everything
Computers:
ƒ Large servers
ƒ 1000s of processors

0 - 28
Inside your PC
Custom processors
ƒ Graphics, sound
32-bit processors
ƒ IR, Bluetooth
ƒ Network, WLAN
ƒ Harddisk
ƒ RAID controllers
8-bit processors
ƒ USB
ƒ Keyboard, mouse

0 - 29
Authentication systems

ƒ Finger print sensors


ƒ Access control
ƒ Airport security systems
ƒ Smartpen®
ƒ Smart cards
ƒ ….

0 - 30
Consumer electronics
Examples

0 - 31
Industrial automation

Examples

0 - 32
Forestry Machines
Networked computer system
ƒ Controlling arms & tools
ƒ Navigating the forest
ƒ Recording the trees harvested
ƒ Crucial to efficient work
Operator panel
ƒ Graphical display
ƒ Touch panel
ƒ Joystick
ƒ Buttons
ƒ Keyboard
“Tough enough to be out in the woods”

0 - 33
© Jakob Engblom
Smart buildings

Examples
ƒ Integrated cooling,
lighting, room
reservation, emergency
handling,
communication
ƒ Goal: “Zero-energy
building”

0 - 34
Robotics
“Pipe-climber”

Robot “Johnnie“

Lego mindstorms
ƒ Standard controller
• 8-bit processor
• 64 kB of memory
ƒ Electronics to interface
to motors and sensors

0 - 35
Estimation
Hardware, software and system as a whole suitability

0 - 36
Hardware/Software Codesign

1. Introduction

doc. dr. Gregor Papa

Jožef Stefan International Postgraduate School


-
Contents
What is an Embedded System?

Levels of Abstraction in Electronic System Design

Typical Design Flow of Hardware-Software Systems

-
Embedded Systems
Embedded systems (ES) = information processing
systems embedded into a larger product

Examples:

Main reason for buying is not information processing

-3
Embedded Systems

[Figure: an embedded system connected to an external process through sensors and actuators, and to the user through a human interface]

-
Parallel and Distributed Target Platforms

[Figure: automotive example with networked control functions: ABS, ASR, ACC, ESP, engine control, powertrain control]

-
Example: Cell Processor
Cell Processor (IBM) combines
ƒ a general-purpose architecture core with
ƒ coprocessing elements which greatly accelerate multimedia
and vector processing applications, as well as many other
forms of dedicated computation

-6
Communicating Embedded Systems
ƒ sensor networks (civil engineering, buildings, environmental
monitoring, traffic, emergency situations)
ƒ smart products, wearable/ubiquitous computing

-
Trends in Information and Communication

[Figure: evolution from centralized systems to networked systems to large-scale distributed systems; the Internet drives new applications and system paradigms]
-
Comparison
Embedded Systems
ƒ Few applications that are known at design-time
ƒ Not programmable by end user
ƒ Fixed run-time requirements (additional computing power not useful)
ƒ Criteria:
• cost
• power consumption
• predictability
• meeting time bounds
General Purpose Computing
ƒ Broad class of applications
ƒ Programmable by end user
ƒ Faster is better
ƒ Criteria:
• cost
• average speed

-
Design Challenges

ƒ increasing application complexity even in standard and large
volume products
• large systems with legacy functions
• mixture of event driven and data flow tasks
• examples: multimedia, automotive, mobile communication
ƒ increasing target system complexity
• mixture of different technologies, processor types, and design styles
• large systems-on-a-chip combining components from different
sources, distributed system implementations
ƒ numerous constraints and design objectives
• examples: cost, power consumption, timing constraints, dependability

- 0
Challenges for Embedded Software
Dynamic environments
Capture the required behaviour
Validate specifications
Efficient translation of specifications
into implementations
How can we check that we meet real-
time constraints?
How do we validate embedded real-
time software? (large volumes of data,
testing may be safety-critical)

-
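Implementation Alternatives

ƒ General-purpose processors
ƒ Application-specific instruction set processors (ASIPs)
• Microcontroller
• DSPs (digital signal processors)
ƒ Programmable hardware
• FPGA (field-programmable gate arrays)
ƒ Application-specific integrated circuits (ASICs)

[Figure axes: performance and power efficiency versus flexibility]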

-
Dependability
ES must be dependable:
ƒ Reliability: probability of system working correctly
provided that it was working at t = 0
ƒ Maintainability: probability of system working correctly d
time units after error occurred
ƒ Availability: probability of system working at time t
ƒ Safety: no harm to be caused
ƒ Security: confidential and authentic communication
Even perfectly designed systems can fail if the assumptions
about the workload and possible errors turn out to be wrong.
Making the system dependable must not be an after-thought;
it must be considered from the very beginning.

- 3
Efficiency

ES must be efficient
ƒ Code-size efficient
(especially for systems on a chip)
ƒ Run-time efficient
ƒ Weight efficient
ƒ Cost efficient
ƒ Energy efficient

-
Real-time Constraints
Many ES must meet real-time constraints
ƒ A real-time system must react to stimuli from the controlled
object (or the operator) within the time interval dictated by the
environment
ƒ For real-time systems, right answers arriving too late are wrong
ƒ “A real-time constraint is called hard, if not meeting that
constraint could result in a catastrophe” [Kopetz, 1997]
ƒ All other time-constraints are called soft
ƒ A guaranteed system response has to be explained without
statistical arguments

-
Real-Time Systems

Embedded and Real-Time Synonymous?
ƒ Most embedded systems are real-time
ƒ Most real-time systems are embedded

[Figure: overlapping sets “embedded” and “real-time”. © Jakob Engblom]

- 6
Reactive & hybrid systems

Typically, ES are reactive systems
ƒ Behavior depends on input and current state
⇒ automata model appropriate,
model of computable functions inappropriate
Hybrid systems
(analog + digital parts)

-
Dedicated systems
Dedicated towards a certain application
ƒ Knowledge about behavior at design time
can be used to minimize resources and to
maximize robustness
Dedicated user interface
ƒ no mouse, keyboard and screen

-
Contents
What is an Embedded System?

Levels of Abstraction in Electronic System Design

Typical Design Flow of Hardware-Software Systems

-
Abstraction, Models and Synthesis
Models
ƒ Formal description of selected properties of a system or subsystem
ƒ A model consists of data and associated methods
Abstraction
ƒ Degree of abstraction, granularity
• system, architecture, logic, transistor, ...
• module, block, function, ...
ƒ View
• behavior, structural, physical
Synthesis
ƒ Linking adjacent levels of abstraction (refinement)
ƒ Stepwise adding of structural information

- 0
Levels of Abstractions

[Figure: levels of abstraction (system, process, module, function; architecture, RTL, and lower-level models) arranged between the behavior and structure views]
-
Contents
What is an Embedded System?

Levels of Abstraction in Electronic System Design

Typical Design Flow of Hardware-Software Systems

-
Design Approaches
Definition: Synthesis is the process of generating the
description of a system in terms of related lower-level
components from some high-level description of the expected
behavior.

⇒ “describe-and-synthesize” paradigm by Gajski, 1994

In contrast to the traditional “specify-explore-refine” approach,
also known as “design-and-simulate” approach

Manual design steps are more error-prone than automatic
synthesis and, therefore, simulation is more important

- 3
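System Design

[Figure: design flow from Specification through System Synthesis (supported by Estimation) to SW-Compilation, Instruction Set, and HW-Synthesis; intellectual property code and intellectual property blocks feed in; the results are machine code and net lists]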

-
i o sso it t
[Figure: the system design flow from Specification to machine code and net lists, as on the System Design slide]


-
i ation ii o
[Figure: the system design flow from Specification to machine code and net lists, as on the System Design slide]


- 6
Application Specific Instruction Set Processor

[Figure: the system design flow from Specification to machine code and net lists, as on the System Design slide]


-
System Design
Hardware-software codesign is a complex synthesis task:
ƒ software synthesis and code generation
ƒ hardware synthesis
ƒ interface and communication synthesis
ƒ hardware/software partitioning and component selection
ƒ hardware/software scheduling

It also involves:
ƒ application specification
ƒ design space exploration and system optimization
ƒ estimation

-
Mapping Problem

-
a in an in

ƒ Partitioning of system function to programmable components
(software), hard-wired or parameterized components (hardware)
or application specific instruction set processors
Similarity to scheduling and load distribution problem in real-
time operating systems
ƒ time constraints, context switch and context switch overhead,
process synchronization and communication
Differences to real-time operating systems
ƒ larger design space with very different solutions
ƒ high optimization requirements (motivation for hardware design)
ƒ underlying hardware is not fixed

- 30
a in an in
Similarity to allocation or load distribution problem in high-
le el synthesis or real-time operating systems

dedicated
P2 HWcomponents
P1 P4

P3
SW
(processors)

-3
Estimation
The principle of synthesis based on abstraction only makes
sense if there are estimation methods available
ƒ Estimate properties of the next layer(s) of abstraction
ƒ Design decisions are based on these estimated properties. If
the estimation is not correct (or not accurate enough), the
design will be sub-optimal or even not working correctly

[Figure: design space exploration and estimation across levels of abstraction]
-3
Hardware/Software Codesign

2. Specification and Models of Computation

doc. dr. Gregor Papa

Jožef Stefan International Postgraduate School


-
System Design
Specification

System Synthesis Estimation

SW-Compilation Instruction Set HW-Synthesis

Intellectual Intellectual
Prop. Code Prop. Block

Machine Code Net lists


2-2
Consider a simple example

“The Observer pattern defines a one-to-many dependency
between a subject object and any number of observer
objects so that when the subject object changes state, all
its observer objects are notified and updated
automatically.”

Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides: Design Patterns, Addison-
Wesley, 1995

2-
Example: Observer pattern in Java
public void addListener(listener) {…}

public void setValue(newValue) {
    myValue = newValue;
    for (int i = 0; i < myListeners.length; i++) {
        myListeners[i].valueChanged(newValue);
    }
}

Will this work in a multithreaded context?

2-
Observer pattern with mutexes

public synchronized void addListener(listener) {…}

public synchronized void setValue(newValue) {
    myValue = newValue;
    for (int i = 0; i < myListeners.length; i++) {
        myListeners[i].valueChanged(newValue);
    }
}

Javasoft recommends against this.

What's wrong with it?

2-
Mutexes using monitors are minefields
public synchronized void addListener(listener) {…}

public synchronized void setValue(newValue) {
    myValue = newValue;
    for (int i = 0; i < myListeners.length; i++) {
        myListeners[i].valueChanged(newValue);
    }
}

[Figure annotation: a thread that calls addListener() waits for the mutex, while valueChanged() requests a lock held by that thread]

valueChanged() may attempt to acquire a lock on
some other object and stall. If the holder of that lock
calls addListener(): deadlock!

2-
Simple observer pattern gets complicated
public synchronized void addListener(listener) {…}

public void setValue(newValue) {
    synchronized (this) {
        myValue = newValue;
        // while holding lock, make a copy of listeners to avoid race conditions
        listeners = myListeners.clone();
    }
    for (int i = 0; i < listeners.length; i++) {
        // notify each listener outside of the synchronized block to avoid deadlock
        listeners[i].valueChanged(newValue);
    }
}

This still isn't right.

What's wrong with it?
2-
Simple observer pattern: How to make it right?

public synchronized void addListener(listener) {…}

public void setValue(newValue) {
    synchronized (this) {
        myValue = newValue;
        listeners = myListeners.clone();
    }
    for (int i = 0; i < listeners.length; i++) {
        listeners[i].valueChanged(newValue);
    }
}

Suppose two threads call setValue(). One of them will set the value last,
leaving that value in the object, but listeners may be notified in the
opposite order. The listeners may be alerted to the value-changes in the
wrong order!
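
The copy-then-notify variant discussed on the last two slides can be sketched in Python (a minimal sketch; the class and method names are hypothetical and not taken from the slides). It holds the lock only while updating the value and copying the listener list, so a listener that re-enters the subject cannot deadlock. As the slide points out, it still gives no guarantee about the order of notifications when several threads call the setter.

```python
import threading

class ValueHolder:
    """Subject of the observer pattern (hypothetical name)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._listeners = []
        self._value = None

    def add_listener(self, listener):
        with self._lock:
            self._listeners.append(listener)

    def set_value(self, new_value):
        # Hold the lock only while updating state and copying the list.
        with self._lock:
            self._value = new_value
            snapshot = list(self._listeners)
        # Notify outside the lock: a listener that calls add_listener()
        # cannot deadlock against this object.
        for listener in snapshot:
            listener(new_value)


seen = []
holder = ValueHolder()
holder.add_listener(seen.append)
# This listener re-enters add_listener(); it would deadlock if set_value()
# still held the (non-reentrant) lock while notifying.
holder.add_listener(lambda v: holder.add_listener(lambda _v: None))
holder.set_value(1)
holder.set_value(2)
```

With a single calling thread the listeners see the values in the order they were set; with two concurrent callers, the two notification loops may interleave.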

2-
Problems with thread-based concurrency

Nontrivial software written with
threads, semaphores, and
mutexes is incomprehensible to
humans.

⇒ Search for non-thread-based models: which are the
requirements for appropriate specification techniques?

2-
Contents
Models of Computation

StateCharts

Data-Flow Models

2-
Requirements for Specification Techniques (1)

Humans not capable to understand systems
containing more than a few objects.

Most actual systems require more objects
⇒ Hierarchy

ƒ Behavioral hierarchy
Examples: states, processes, procedures.

ƒ Structural hierarchy
Examples: processors, racks,
printed circuit boards

2-
Requirements for Specification Techniques (2)

State-oriented behavior
Required for reactive systems.

Dataflow-oriented behavior
Components send streams of data
to each other.

No obstacles for efficient implementation

2- 2
Models of Computation: Definition

What does it mean, “to compute”?

Models of computation define:
ƒ Components and an execution model for
computations for each component
ƒ Communication model for exchange of
information between components.
• Shared memory
• Message passing

2-
Shared memory
Potential race conditions (⇒ inconsistent results possible)
⇒ Critical sections = sections at which exclusive access to
resource r (e.g. shared memory) must be guaranteed.

process a {                    process b {
  ..                             ..
  P(S)  // obtain lock           P(S)  // obtain lock
  ..    // critical section      ..    // critical section
  V(S)  // release lock          V(S)  // release lock
}                              }

Race-free access to shared memory protected by S possible.

This model may be supported by
ƒ mutual exclusion for critical sections
ƒ cache coherency protocols
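
The P(S)/V(S) pattern above can be tried out with a binary semaphore from Python's standard `threading` module (a minimal sketch; the shared counter and the iteration count are illustrative assumptions). Two threads increment the same variable, and the semaphore keeps each read-modify-write inside a critical section.

```python
import threading

S = threading.Semaphore(1)   # binary semaphore guarding the shared data
shared = {"count": 0}        # stands in for the shared memory

def process(iterations):
    for _ in range(iterations):
        S.acquire()                  # P(S): obtain lock
        try:
            shared["count"] += 1     # critical section
        finally:
            S.release()              # V(S): release lock

threads = [threading.Thread(target=process, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every increment runs under the semaphore, no update is lost and the final count equals the total number of iterations.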
2-
Non-blocking/asynchronous message passing

Sender does not have to wait until message has arrived;
potential problem: buffer overflow

send () … receive ()

2-
Blocking/synchronous message passing

Sender will wait until receiver has received message

send () … receive ()
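
Both communication styles can be imitated with Python's `queue.Queue` (a sketch under assumptions: the buffer size of 2 and the message names are made up, and `join()`/`task_done()` only approximate a true rendezvous). The first part shows the buffer overflow risk of asynchronous sending; the second part makes the sender wait until the receiver has taken the message.

```python
import queue
import threading

# Asynchronous: the sender never waits, so a bounded buffer can overflow.
buf = queue.Queue(maxsize=2)
overflowed = False
for msg in ("m1", "m2", "m3"):
    try:
        buf.put_nowait(msg)
    except queue.Full:
        overflowed = True     # the third message does not fit

# Synchronous (rendezvous-like): the sender waits until the receiver
# has taken the message.
chan = queue.Queue()
log = []

def receiver():
    msg = chan.get()          # blocks until a message arrives
    log.append(("received", msg))
    chan.task_done()          # lets the sender continue

def sender():
    chan.put("hello")
    chan.join()               # blocks until the receiver called task_done()
    log.append(("sender resumed", "hello"))

r = threading.Thread(target=receiver)
s = threading.Thread(target=sender)
r.start()
s.start()
s.join()
r.join()
```

In the synchronous part the sender's log entry always comes after the receiver's, regardless of how the threads are scheduled.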

2-
Synchronous message passing: CSP

ƒ CSP (communicating sequential processes)
[Hoare, 1985],
rendez-vous-based communication:
Example:

process A                  process B
  ..                         ..
  var a ...                  var b ...
  a := 3;                    ...
  c!a;   -- output           c?b;   -- input
end                        end

2-
Components (1)

ƒ Von Neumann model
Sequential execution, program memory etc.

ƒ Discrete event model

[Figure: event queue ordered by time; each entry triggers an action]

2-
Components (2)
ƒ Finite state machines

ƒ Differential equations

$$\frac{\partial^2 x}{\partial t^2} = b$$

2-
Example: Discrete Event: VHDL

VHDL (hardware description language) is commonly used
as a design-entry language for digital circuits.

2-2
Sensitivity lists in VHDL
Sensitivity lists are a shorthand for a single wait on-statement
at the end of the process body:
process (x, y)
begin
  prod <= x and y;
end process;
is equivalent to
process
begin
  prod <= x and y;
  wait on x, y;
end process;
2-2
No language that meets all language requirements
⇒ using compromises

2 - 22
Contents
Models of Computation

StateCharts

Data-Flow Models

2-2
Classical Automata

[Figure: classical automaton with input X, internal state Z, output Y, and a clock]

Moore and Mealy automata are finite state machines (FSMs).
Next state $Z^+$ computed by function $\delta$
Output computed by function $\lambda$

ƒ Moore automata: $Y = \lambda(Z)$; $Z^+ = \delta(X, Z)$
ƒ Mealy automata: $Y = \lambda(X, Z)$; $Z^+ = \delta(X, Z)$
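
The difference between the two output functions can be made concrete with a small Python sketch. The rising-edge detector used here is a hypothetical example, not one from the slides; `delta` and `lam` play the roles of the next-state and output functions.

```python
def run_moore(delta, lam, z0, inputs):
    """Moore automaton: the output depends on the state only, Y = lam(Z)."""
    z, outputs = z0, []
    for x in inputs:
        outputs.append(lam(z))
        z = delta(x, z)          # Z+ = delta(X, Z)
    return outputs

def run_mealy(delta, lam, z0, inputs):
    """Mealy automaton: the output depends on input and state, Y = lam(X, Z)."""
    z, outputs = z0, []
    for x in inputs:
        outputs.append(lam(x, z))
        z = delta(x, z)          # Z+ = delta(X, Z)
    return outputs

inputs = [0, 1, 1, 0, 1]

# Mealy version: the state is the previous input bit.
mealy_out = run_mealy(
    delta=lambda x, z: x,
    lam=lambda x, z: int(z == 0 and x == 1),
    z0=0,
    inputs=inputs,
)

# Moore version: the state is (previous bit, edge flag); the flag is the output.
moore_out = run_moore(
    delta=lambda x, z: (x, int(z[0] == 0 and x == 1)),
    lam=lambda z: z[1],
    z0=(0, 0),
    inputs=inputs,
)
```

The Mealy machine reports the edge in the same step in which the input changes, while the Moore machine needs an extra state component and reports it one step later.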

2-2
StateCharts

Classical automata not useful for complex systems (complex
graphs cannot be understood by humans).

⇒ Introduction of hierarchy ⇒ StateCharts [Harel, 1987]

2-2
Introducing Hierarchy

FSM will be in exactly one
of the substates of S if S is
active
(either in A or in B or ..)

2-2
Definitions
Current states of FSMs are also called active states.
States which are not composed of other states are called
basic states.
States containing other states are called super-states.
For each basic state s, the super-states containing s are called
ancestor states.
Super-states S are called OR-super-states, if exactly one of the
sub-states of S is active whenever S is active.

[Figure: a superstate (ancestor state of E) and its substates]
2-2
Default State Mechanism

Try to hide internal
structure from outside
world!
⇒ Default state
Filled circle
indicates sub-state
entered whenever
super-state is entered.
Not a state by itself!

2-2
History Mechanism

(behavior different from last slide)

[Figure: super-state S with a history entry; transitions labeled k and m]

For input m, S enters the state it was in before S was left (can be
A, B, C, D, or E). If S is entered for the very first time, the default
mechanism applies.
History and default mechanisms can be used hierarchically.

2-2
Combining History and Default State

[Figure: two notations with the same meaning]

2-
Concurrency
Convenient ways of describing concurrency are required.
AND-super-states: FSM is in all (immediate) sub-states of a
super-state.

2-
Entering and Leaving AND-Super-States

incl.

Line-monitoring and key-monitoring are entered and left, when


service switch is operated.
2 - 32
Tree representation of state sets

[Figure: state hierarchy drawn as a tree whose nodes are basic states, OR-super-states, and AND-super-states]
2 - 33
Computation of state sets
Computation of state sets by traversing the tree from
leaves to root:
ƒ basic states: state set = state
ƒ OR-super-states: state set = union of children
ƒ AND-super-states: state set = Cartesian product of children
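
The three rules translate directly into a recursive traversal. The sketch below assumes a hypothetical tuple encoding of the state tree (`("basic", name)`, `("or", children)`, `("and", children)`), which is not taken from the slides.

```python
from itertools import product

def state_sets(node):
    """Return the set of configurations of a StateCharts tree node.

    A node is ("basic", name), ("or", [children]) or ("and", [children]).
    Each configuration is a tuple of active basic-state names.
    """
    kind, body = node
    if kind == "basic":
        return {(body,)}
    child_sets = [state_sets(child) for child in body]
    if kind == "or":
        # exactly one child is active: union of the children's sets
        return set().union(*child_sets)
    if kind == "and":
        # all children are active: Cartesian product of the children's sets
        return {sum(combo, ()) for combo in product(*child_sets)}
    raise ValueError(f"unknown node kind: {kind}")

# Hypothetical hierarchy: an AND-super-state with two OR-super-states.
tree = ("and", [
    ("or", [("basic", "A"), ("basic", "B")]),
    ("or", [("basic", "C"), ("basic", "D"), ("basic", "E")]),
])
configs = state_sets(tree)
```

For this tree the result has 2 × 3 = 6 configurations, which illustrates how AND-super-states describe a large flat state space compactly.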

F M

2-3
Types of States

In StateCharts, states are either
ƒ basic states, or
ƒ AND-super-states, or
ƒ OR-super-states.

2-3
Timers
Since time needs to be modeled in embedded systems,
timers need to be modeled.
In StateCharts, special edges can be used for timeouts.

If event a does not happen while the system is in the left
state for the specified time, a timeout will take place.

2-3
Using Timers in an Answering Machine

2-3
Representation of Computations
Besides states, arbitrary many other variables can be
defined. This way, not all states of the system are
modeled explicitly.
These variables can be changed as a result of a state
transition (“action”). State transitions can be dependent
on these variables (“condition”).

[Figure: unstructured state space (variables) linked to the state diagram by actions and conditions]

2-3
General Form of Edge Labels
event [condition] / action

Events:
ƒ Exist only for the next evaluation of the model
ƒ Can be either internally or externally generated

Conditions:
ƒ Refer to values of variables that keep their value until they are
reassigned

Actions:
ƒ Can either be assignments for variables or creation of events

Example: service-off [not in Lproc] / service := 0

2-3
Events and actions
Events can be composed of several events:
ƒ (e1 and e2): event that corresponds to the simultaneous
occurrence of e1 and e2.
ƒ (e1 or e2): event that corresponds to the occurrence of either
e1 or e2 or both.
ƒ (not e): event that corresponds to the absence of event e.

Actions can also be composed:
ƒ (a1; a2): actions a1 and a2 are executed in parallel.

All events, states and actions are globally visible.

2-
Example

[Figure: states x, y, z with edge labels e/a1 and [c]/a2, and timing diagrams of e, a1, and a2 for c = true and for c = false]

2-
The StateCharts Simulation Phases

How are edge labels evaluated?

Three phases:
1. Effect of external changes on events and conditions is
evaluated,
2. The set of transitions to be made in the current step and right
hand sides of assignments are computed,
3. Transitions become effective, variables obtain new values.

2- 2
Example

In phase 2, variables a and b are assigned to temporary
variables. In phase 3, these are assigned to a and b. As a result,
variables a and b are swapped.
In a single phase environment, executing the left state first would
assign the old value of b to a and b. Executing the right state
first would assign the old value of a to a and b. The
execution would be non-deterministic.
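
A small Python sketch shows why separating evaluation from commitment makes the swap deterministic. The helper name `step` and the initial values a = 1, b = 0 are assumptions made for the illustration.

```python
def step(variables, assignments):
    """One StateCharts-like step for variable assignments.

    Phase 2: evaluate all right-hand sides against the old values.
    Phase 3: commit the results, so the order of assignments is irrelevant.
    """
    temporaries = {name: rhs(variables) for name, rhs in assignments.items()}
    new_values = dict(variables)
    new_values.update(temporaries)
    return new_values

old = {"a": 1, "b": 0}
swapped = step(old, {"a": lambda v: v["b"], "b": lambda v: v["a"]})

# Single-phase execution for comparison: the result depends on the order.
left_first = dict(old)
left_first["a"] = left_first["b"]
left_first["b"] = left_first["a"]

right_first = dict(old)
right_first["b"] = right_first["a"]
right_first["a"] = right_first["b"]
```

The two-phase step swaps the values, whereas the single-phase variants end with both variables equal, and which value survives depends on the execution order.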
2- 3
Steps
Execution of a StateChart model consists of a sequence of
(status, step) pairs

Status = values of all variables + set of events + current time

Step = execution of the three phases

[Figure: successive statuses connected by steps, each step consisting of phases 1 to 3]
2-
Reflects Model of Clocked Hardware

In an actual clocked (synchronous) hardware system, both
registers would be swapped as well.

Same separation into phases found in other languages as
well, especially those that are intended to model hardware.

2-
More on Semantics of StateCharts
Unfortunately, there are several time-semantics of
StateCharts in use. This is another possibility:
ƒ A step is executed in arbitrarily small time.
ƒ Internal (generated) events exist only within the next step.
ƒ External events can only be detected after a stable state
has been reached.

[Figure: time axis t with external events arriving at stable states; between two stable states, a sequence of steps performs state transitions and transports internal events]

2-
Examples

[Figure: example with its state diagram; stable states are marked]

2-
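Example
Non-determinism

[Figure: concurrent state machines over the states A to H, all reacting to event a; in the resulting state diagram, (A,B) leads via a to (C,D), which leads via a to either (E,H) or (F,G)]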

2-
Example
State diagram (only stable states are represented, only a and b are external):

[Figure: state diagram with transitions triggered by a, b, and the internal event c]
2-
Evaluation of StateCharts (1)

Pros:
ƒ Hierarchy allows arbitrary nesting of AND- and OR-super
states.
ƒ Semantics defined in a follow-up paper to original paper.
ƒ Large number of commercial simulation tools available
(StateMate, StateFlow (Matlab), BetterState, UML, ...)
ƒ Available “back-ends” translate StateCharts into
C or VHDL, thus enabling software or hardware
implementations.

2-
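Evaluation of StateCharts (2)

Cons:
ƒ Generated programs may be inefficient,
ƒ Not useful for distributed applications,
ƒ No description of non-functional behavior,
ƒ No object-orientation,
ƒ No description of structural hierarchy.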

2-
SDL

Specification and Description Language (SDL) is a
specification language targeted at the unambiguous
specification and description of the behaviour of reactive
and distributed systems.

Used here as a (prominent) example of a model of
computation based on asynchronous message passing.

⇒ appropriate also for distributed systems

2- 2
Communication among SDL-FSMs
Communication between FSMs (or “processes”) is based on
message-passing, assuming a potentially indefinitely large
FIFO-queue.
ƒ Each process fetches
next entry from FIFO,
ƒ checks if input enables
transition,
ƒ if yes: transition takes
place,
ƒ if no: input is discarded
(exception: SAVE-
mechanism).

2- 3
Deterministic?
Let tokens be arriving at FIFO at the same time:
ƒ Order in which they are stored, is unknown

All orders are legal: simulators can show different behaviors for
the same input, all of which are correct.
2-
Contents
Models of Computation

StateCharts

Data-Flow Models

2-
Data-flow Language Model
Processes communicating through FIFO buffers

[Figure: three processes connected by FIFO buffers]

2-
Philosophy of Data-flow Languages
A different way of looking at computation:
ƒ Imperative language style: program counter is king
ƒ Dataflow language: movement of data is the priority
ƒ Scheduling responsibility of the system, not the programmer

Basic characteristics:
ƒ All processes run “simultaneously”
ƒ Processes can be described with imperative code
ƒ Processes can only communicate through buffers
ƒ Sequence of read tokens is identical to the sequence of written
tokens

2-
Data-flow Languages
Appropriate for applications that deal with streams of data:
ƒ Fundamentally concurrent: maps easily to parallel hardware
ƒ Perfect fit for block-diagram specifications (control systems, signal
processing)
ƒ Matches well current and future trend towards multimedia
applications

Representation:
ƒ Host Language (process description), e.g. C, C++, Java, ...
ƒ Coordination Language (network description), usually “home made”,
e.g. XML.

2-
Example: MPEG Video Decoder

2-
Kahn Process Networks

Proposed by Kahn in 1974 as a general-purpose scheme
for parallel programming:
ƒ read: destructive and blocking (reading an empty channel
blocks until data is available)
ƒ write: non-blocking
ƒ FIFO: infinite size
Unique attribute: determinate

2-
A Kahn Process
From Kahn's original paper:

process f(in int u, in int v, out int w)
{
    int i; bool b = true;
    for (;;) {
        i = b ? wait(u) : wait(v);
        printf("%i\n", i);
        send(i, w);
        b = !b;
    }
}

What does this do?
Process alternately reads from u
and v, prints the data value, and
writes it to w
2-
A Kahn Process
From Kahn's original paper:

process g(in int u, out int v, out int w)
{
    int i; bool b = true;
    for (;;) {
        i = wait(u);
        if (b) send(i, v); else send(i, w);
        b = !b;
    }
}

What does this do?
Process reads from u and
alternately copies it to v and w

2- 2
A Kahn Process
From Kahn's original paper:

process h(in int u, out int v, int init)
{
    int i = init;
    send(i, v);
    for (;;) {
        i = wait(u);
        send(i, v);
    }
}

What does this do?
Process sends initial value, then
passes through values.

2- 3
A Kahn Process Network
What does this do?
Prints an alternating sequence of 0s and 1s.

[Figure: f feeds g; g alternately feeds two instances of h with different init values; both instances feed back into f. Each h emits its initial value once and then copies input to output.]
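
The network can be simulated with Python threads and unbounded `queue.Queue` channels, whose `get()` blocks on an empty channel just like a Kahn read. This is a sketch under assumptions: the channel names, the assignment of the initial values 0 and 1 to the two `h` instances, and the bound of six iterations in `f` (added so the demo terminates) are not taken from the slides.

```python
import queue
import threading

def f(u, v, w, printed, n):
    b = True
    for _ in range(n):                  # bounded here so the demo terminates
        i = u.get() if b else v.get()   # blocking, destructive read
        printed.append(i)               # stands in for printf
        w.put(i)                        # non-blocking write
        b = not b

def g(u, v, w):
    b = True
    while True:
        i = u.get()
        (v if b else w).put(i)          # alternate between the two outputs
        b = not b

def h(u, v, init):
    v.put(init)                         # emit the initial value once
    while True:
        v.put(u.get())                  # then copy input to output

# Channels are unbounded FIFOs, as in the Kahn model.
X, Y, Z, T1, T2 = (queue.Queue() for _ in range(5))
printed = []
workers = [
    threading.Thread(target=g, args=(X, T1, T2), daemon=True),
    threading.Thread(target=h, args=(T1, Y, 0), daemon=True),
    threading.Thread(target=h, args=(T2, Z, 1), daemon=True),
]
for t in workers:
    t.start()
main = threading.Thread(target=f, args=(Y, Z, X, printed, 6))
main.start()
main.join()
```

Because reads block and each channel has a single writer and a single reader, the printed sequence does not depend on how the threads are scheduled.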
2-
Determinacy
Random:
ƒ A system is random if the information about the
system and its inputs is not sufficient to determine its outputs.
Determinate:
ƒ Define the history of a channel to be the sequence of tokens
that have been both written and read. A process network is
said to be determinate if the histories of all channels depend
only on the histories of the input channels.
Importance:
ƒ Functional behavior is independent of timing (scheduling,
communication time, execution time of processes).
ƒ Separation of functional properties and timing.

2-
Determinacy
[x1,x2,x3,…] [y1,y2,y3,…]
F

monotonic mapping

ƒ x x

y
ƒ ,

2 - 66
Determinacy
[x1,x2,x3,…] [y1,y2,y3,…]
F

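Formal definition
ƒ Sequence (stream): $X = [x_1, x_2, x_3, \ldots]$
ƒ Prefix ordering: $[x_1] \sqsubseteq [x_1, x_2] \sqsubseteq [x_1, x_2, x_3, \ldots]$
ƒ p-tuple of sequences: $X = (X_1, X_2, \ldots, X_p)$
ƒ Ordering of tuples: $X \sqsubseteq X' \Leftrightarrow \forall i: X_i \sqsubseteq X'_i$
ƒ Process: $F$ maps a p-tuple of input sequences to a tuple of output sequences
ƒ Monotonic: $X \sqsubseteq X' \Rightarrow F(X) \sqsubseteq F(X')$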
2-6
r Determini m

determinate
ƒ

ƒ y
Rea oning
ƒ , y y
y
ƒ y y, ,
ƒ y
ƒ , y
y

2-6
in n eterminacy
y
ƒ
ƒ
ƒ

amp e

2-6
in n eterminacy
1 [ ]
F [ , ]
F 1, 2

2 [ ]

1 [ , ]
F [ , , ]
F 1 , 2

2 [ ]

Ž [ ], [ ] Ž [ , ], [ ]
F ŽF [ , ]Ž[ , , ]
2-
c e in a n et r

2-
Deman ri en c e in
y y

2- 2
m ar rit m
o nded memor

ƒ tart ith o nded er si es

ƒ any s hed in te hni e x

ƒ itho t dead o
y ontin e
ƒ y dead o , in rease si e

2-
Fr m n inite t Finite er i e

ƒ y

ƒ n,
n
ƒ y

y
2-
Dea c am e
x y
2

, ,

1,
1,
1,

2- 5
am e Finite i e er in

2
1

1, 1, 1,
1,
1, 1,
1, 1,
1,

2- 6
ar rit m in cti n
1
, , ,

1 1 1
1 2 1 1 1
3 1 1

2 y

2-
ar rit m in cti n

y y

1 1 1 1
2 1 1 1 1 1
1
3 1 1

2 y

2-
a ati n a n r ce et r
ro
ƒ y

ƒ
ƒ x
ƒ
on
ƒ

ƒ y y

ƒ x y

2-
ync r n Data DF
, y, 1
ƒ estri tion
ƒ i ed n m er o token

amp e 1

1 1 2 3 2 1

2-
DF c e in
c ed e y at compi e time
y

sta ish re ati e e e tion rates y y

etermine eriodi s hed e y y

Re t x y

2-
a ancin ati n
3a 2
3d
1
3
2 2 a
3 3
d 2a

2 1
3
1 2
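
For a synchronous data flow graph, the balancing equations require that on every edge the number of tokens produced per period equals the number consumed: `r[src] * produce == r[dst] * consume`. The sketch below solves them for the smallest positive integer repetition vector. The function name and the three-actor graph are hypothetical and are not meant to reproduce the example on the slide.

```python
from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    """Solve the SDF balance equations r[src]*produce == r[dst]*consume.

    edges: list of (src, dst, produce, consume). Returns the smallest
    positive integer solution, or None if the rates are inconsistent.
    Assumes the graph is connected.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, p, c in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * p / c
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * c / p
                changed = True
            elif src in rate and dst in rate and rate[src] * p != rate[dst] * c:
                return None            # no non-trivial solution exists
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(rate[a] * scale) for a in actors}

# Hypothetical graph: A -(2:3)-> B -(1:2)-> C
reps = repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)])
```

Here A fires 3 times, B twice, and C once per period: A produces 3 × 2 = 6 tokens and B consumes 2 × 3 = 6, and B produces 2 × 1 = 2 tokens, which C consumes in one firing. A graph with contradictory rates has no such vector and cannot be scheduled periodically with bounded buffers.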

2- 2
in t e a ancin ati n
ain D c ed ing t eorem
ƒ n
y x n1
ƒ n1 x

ƒ y y
ƒ y

amp e

2-
Determine eri ic c e e
1
o i e c ed e
ƒ
3 2 3
ƒ
ƒ
ƒ …
2 1
3
1 2

y y
e i i it
ƒ , y
ƒ

2-
Hardware/Software Codesign

3. Design Space Exploration

doc. dr. Gregor Papa

-
System Design

y y

-2
Design Space Exploration

Application Architecture

Mapping

Estimation

-
Detailed View: Design Space Exploration

[Figure: exploration loop with evaluation (construct, map, estimate), inputs architecture and application, output performance, driven by multiobjective optimization]

-
am e im e e

, 2, 
1

1 3

, y,
-5
Example 1: Evolutionary Algorithms for DSE
individual

allocation decode allocation

n selection binding
o recombination decode binding
p mutation
scheduling

design point
“chromosome” = encoded
allocation + binding (implementation)

fitness evaluation
fitness

3-6 user constraints


Example 1: Basic Model

Definition: A specification graph is a graph GS=(VS,ES) consisting
of a data flow graph GP of a problem, an architecture graph GA,
and edges EM. In particular, VS=VP∪VA, ES=EP∪EA∪EM

[Figure: specification graph with problem graph GP (nodes 1 to 7), mapping edges EM, and architecture graph GA (RISC, SB, HWM1, PTP, HWM2)]

3-
Example 1: Mapping

1
0 1
0
1 5 RISC
8
21 3 SB
1
29 7 HWM1
20
α
1 2
1
21 6 β RISC HWM1
2 PTP bus
30 4 shared
bus
HWM2
τ

3-
Example 1: Challenges
Encoding of (allocation+binding)
ƒ simple encoding
e.g. one bit per resource, one variable per binding
easy to implement
many infeasible partitioning solutions
ƒ encoding + repair
e.g. simple encoding and modify such that for each vp ∈ VP there
exists at least one va ∈ VA with a binding of vp to va
reduces number of infeasible partitioning solutions
Generation of the initial population, mutation
Recombination

3-
Example 1: Case Study

3-
Example 1: Case Study

3-
Example 1: Case Study
frame memory, dual-ported frame memory, block matching module
input module

output module
subtract/add module
DCT/IDCT module, Huffman coder
3 - 12
am le olut o

INM
INM OUTM
OUTM FM
FM RISC2
RISC2

SBS

3 - 13
am le olut o

INM
INM OUTM
OUTM DPFM
DPFM HC
HC DCTM
DCTM BMM
BMM SAM
SAM

SBF

3-1
am le o t are t es s

2 2
A B C D F
CD DAT

ec s o s

nS oC
I
ABABABCCABABA CODE(A) CALL(A) FOR 1 TO 2
CODE(B) CALL(B) CODE(A)
CODE(A) CALL(A) CALL(B)
CODE(B) CALL(B) CODE(C)
CODE(C) CALL(C) CODE(A)
3-1
am le t m at o r ter a

P
PROCEDURE A
FOR 1 TO 3
CALL(A)
CODE(B)
CODE(B)

2
A B

3-1
am le rade o s

3-1
am le rade o ur aces

3-1
am le lorat o trate

3-1
am le a rocess et or

3-2
am le ard are rc tecture

3 - 21
am le esult o u ct o al mulat o
n(p)

b(s)
3 - 22
am le esult o lat orm e c mar s
P
ƒ ( )

p
(p )

3 - 23
am le ac o t e e elo e al s s

3-2
am le am le lorat o esult

2 2

3-2
Hardware/Software Codesign

System Simulation

doc. dr. Gregor Papa

Jožef Stefan International Postgraduate School
-1
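System Design
Specification

System Synthesis   Estimation

SW Compilation   Instruction Set   HW Synthesis

Intellectual Prop. Code   Intellectual Prop. Block

Machine Code   Net lists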
-2
Outline

System Classification

Discrete Event Simulation

Example SystemC

Simulation at High Abstraction Levels

-3
System and Model
A

I S D
T
A

-
tate
T
t
t
T

-
tate
I
A

p s
ƒ

-
me
I

p s
ƒ

-
Events and Discrete Event Systems
A
T
I
I

-
Discrete Event Systems
AD S

A D S

T D S
I D S

-
Time-Driven vs. Event-Driven

()

-1
Time-Driven vs. Event-Driven
- -
ƒ T
T

ƒ A

- 11
Time-Driven vs. Event-Driven
-
ƒ S
ƒ A

- 12
Outline

System Classification

Discrete Event Simulation

Example SystemC

Simulation at High Abstraction Levels

- 13
Discrete Event Model and Simulation

ƒ
ƒ
ƒ
ƒ

T Æ -

-1
Components of a Discrete Event Simulation
ƒ
ƒ I

ƒ
ƒ T
ƒ

ƒ C
ƒ C
ƒ A

ƒ P

-1
screte e t mulat o e
t rout e
ƒ I
le

set to
ƒ D
rocess b
call subs stem
module s remo e
e e t rom

u date stat st cal


ƒ U ormat o
e erate s mulat o
re ort

-1
Discrete Event Simulation
A
ƒ P
ƒ s n Æ may “produce” new events.

Problem: Within the same simulation cycle, “cause” and “effect” events share
the same time of occurrence
Solution: The simulator uses a zero-duration virtual time interval, called delta-
cycle (δ)
ƒ The role of a delta-cycle is to order “simultaneous” events within a simulation
cycle, i.e. identifying which event caused another; “causes” and “effects” are
separated by delta-cycles.
Simulation cycles may be composed of several delta-cycles (δ)
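The ordering role of delta-cycles can be sketched with a small event queue. This is a minimal illustration, not a model of any particular simulator: events are ordered by the pair (simulation time, delta count), and an event produced with zero delay is placed one delta-cycle after its cause at the same simulation time. The function name `simulate` and the handler format are assumptions made for this example.

```python
import heapq


def simulate(initial_events, handlers):
    """Run a tiny event-driven simulation with delta-cycles.

    initial_events: list of (time, name).
    handlers: maps an event name to a function(time) that returns a list of
              (delay, new_event_name) pairs caused by that event.
    Returns the processed events as a list of (time, delta, name).
    """
    queue = []
    seq = 0  # tie-breaker so the heap never has to compare beyond the key
    for time, name in initial_events:
        heapq.heappush(queue, (time, 0, seq, name))
        seq += 1
    log = []
    while queue:
        time, delta, _, name = heapq.heappop(queue)
        log.append((time, delta, name))
        produce = handlers.get(name, lambda t: [])
        for delay, new_name in produce(time):
            if delay == 0:
                key = (time, delta + 1)   # same time, next delta-cycle
            else:
                key = (time + delay, 0)   # later time, delta starts at 0
            heapq.heappush(queue, (*key, seq, new_name))
            seq += 1
    return log


# Hypothetical causality: A causes B (no delay) and D (5 time units later),
# B causes C (no delay).
trace = simulate(
    [(0, "A")],
    {"A": lambda t: [(0, "B"), (5, "D")], "B": lambda t: [(0, "C")]},
)
```

In this assumed example A, B and C all occur at simulation time 0, but in delta-cycles 0, 1 and 2, so each cause is processed before its effect.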
A C D BC E

δ 2δ δ
-1
Outline

System Classification

Discrete Event Simulation

Example SystemC

Simulation at High Abstraction Levels

4 - 18
SystemC Overview

4-1
le

4-
le O

4- 1
Module

processes

4-
Processes

4-
Module

4- 4
Process Communication
Processes can directly communicate through signals.

Module
Input
ports

port
Process
Output
ports
sensitivity

Process

Internal
signal

4-
n e o uni tion
SystemC introduces general-purpose primitives
ƒ Channel
A container for communication and synchronization, e.g. can
have state and private data, transport data, transport events.
Channels implement one or more interfaces
ƒ Interface
Specifies a set of access methods to the channel
But it does not implement those methods
ƒ Event
Flexible, low-level synchronization primitive, used to construct
other forms of synchronization
Events have no type and no value
Other communication and synchronization models can be built based on the
above primitives
4-
nnel n ot

4-
Wait and Notify
Wait: halt process execution until event is raised
ƒ wait() with arguments => dynamic sensitivity
• wait(sc_event)
• wait(time)
• wait(time_out, sc_event)

Notify: raise an event


ƒ notify() with arguments => delayed notification
• my_event.notify(); // notify immediately
• my_event.notify(SC_ZERO_TIME); // notify in the next delta cycle
• my_event.notify(time); // notify after time

4 - 28
i lation le ents ain Pro ra
nitiali ation Phase

e te all the ro esses


ntil a lo in oint

date si nals

lo y le Co te the set of delta y le


ready ro esses

N er of
ready
ro esses
e te all the ro esses
d an e si lation ti e
ntil a lo in oint

date si nals

4-2
Example: Simple Channel

4-
Example: Simple Channel Interface

4-
Example: Simple Channel

4- 2
Example: Simple Producer/Consumer

4-
Example: Kahn Process Network

4- 4
Example: Kahn Process Network

4-
Example: Kahn Process Network

The network will deadlock unless


an initial token is put into the loop:

output1.write(0.0);

4-
Example: Kahn Process Network

4- 7
Example: Kahn Process Network

4- 8
SystemC and Models of Computation

4-
Outline

System Classification

Discrete Event Simulation

Example SystemC

Simulation at High Abstraction Levels

4-4
Multiple Levels of Abstraction
Functional level (un-timed functional level)
ƒ Use: model un-timed functionality
ƒ Communication: shared variables, messages
ƒ Typical languages: Matlab

Transaction level
ƒ Use: SoC architecture analysis, early
development, timing estimation
ƒ Communication: method calls to channels
ƒ Typical languages: SystemC

Register transfer level (pin level)
ƒ Use: design and verification
ƒ Communication: wires and registers
ƒ Typical languages: Verilog
4-4
Abstraction Models
Time granularity for communication and computation objects can be classified
into 3 basic categories: Un-Timed, Approximate-Timed, Cycle-Timed.
Models B, C, D and E could be classified as Transaction Level Models
(TLM).

A. "Un-timed functional model"
B. "Timed functional model"
C. "Transaction model"
D. "Cycle-accurate communication model"
E. "Cycle-accurate computation model"
F. "Register transfer model"

System Modeling Graph (2003 Dan Gajski and Lukai Cai)
                                  Computation
Communication        Un-timed   Approximate-timed   Cycle-timed
Cycle-timed                     D                   F
Approximate-timed               C                   E
Un-timed             A          B
4 - 42
A "Un-Timed Functional Model"
Computation
ƒ Un-timed behavior: sequential execution, parallel execution
Communication
ƒ Un-timed transfer
ƒ Variables

Behaviors in the example:
B1: v1 = a*a;
B2: v2 = v1 + b*b;
B3: v3 = v1 - b*b;
B4: v4 = v2 + v3; c = sequ(v4);
4-4
B: “Timed Functional Model”
Computation (on processing elements - PEs)
ƒ Time annotation (estimate)
Communication
ƒ Message-passing, no protocol implementation
ƒ Un-timed transfer
Mapping
ƒ PEs (architecture) allocation and process-to-PE mapping
ƒ Code-time estimates, e.g. DELAY() or wait()

[Figure: behaviors B1 to B4 (v1 = a*a; v2 = v1 + b*b; v3 = v1 - b*b; v4 = v2 + v3; c = sequ(v4);) mapped to PEs and connected by message-passing channels]
4 - 44
Example B: Software Code Annotation
Input: C source code
ƒ Analyze basic blocks, compute delays (delay characterization of instructions such as ld, li, op, br)
ƒ Annotate C code: model C code execution delay
ƒ Compile generated C and run natively
Output: functionally equivalent C code augmented by execution times (performance estimation)

Specification:
v__st_tmp = v__st;
startup(proc);
if(events[proc][0] & 1)
execute(proc);

Annotated C code:
v__st_tmp = v__st;
__DELAY(LI+LI+LI+LI+LI+LI+OPc);
startup(proc);
if(events[proc][0] & 1) {
__DELAY(OPi+LD+LI+OPc+LD+OPi+OPi+IF);
execute(proc);
}
4-4
C: “Transaction Model”
Computation
ƒ Approximate-timed (estimate)
Communication
ƒ Approximate-timed (estimate) using simplified
(abstract) bus protocols
Mapping
ƒ Mapping of computation and communication

[Figure: behaviors B1 (v1 = a*a;), B2 (v2 = v1 + b*b;), B3 (v3 = v1 - b*b;) and B4 (v4 = v2 + v3; c = sequ(v4);) on PE1 to PE3, PE4 as arbiter, connected by abstract bus channels cv11, cv12, cv2; interfaces: 1 master, 2 slave, 3 arbiter]
4 - 46
D: “Cycle-Accurate Communication Model”
Computation
ƒ Approximate-timed (estimate)
Communication
ƒ Protocol bus channels (time/cycle-accurate and pin-accurate)
Mapping
ƒ Mapping of computation and communication

[Figure: behaviors B1 to B4 on PE1 to PE3, PE4 as arbiter, connected by a protocol bus with address, data, ready and ack signals; interfaces: 1 master, 2 slave, 3 arbiter]

4-4
E: “Cycle-Accurate Computation Model”
Computation
ƒ Cycle-accurate
Communication
ƒ Approximate-timed (estimate) using simplified
(abstract) bus protocols
Wrappers
ƒ Simulation interfaces between cycle-accurate PEs and abstract bus
channel interfaces

[Figure: cycle-accurate and pin-accurate PE1 to PE4 (instruction sequences and state machines S0 to S4) attached through wrappers to abstract bus channels cv11, cv12, cv2; interfaces: 1 master, 2 slave, 3 arbiter, 4 wrapper]

4-4
Example E: What is an ISS?
ƒ An Instruction Set Simulator (ISS) is a simulation model, coded in a
high-level language, which mimics the behavior of a processor by
“reading” instructions and maintaining internal variables which represent
the processor's registers

ƒ Instruction-accurate
ƒ Cycle-accurate

ƒ Simulate (execute and monitor) machine code instructions, compiled for a
target processor

4-4
Example E: Types of ISS
original C code                       original assembly code
…                                     …
a = b+c;        -- compilation -->    add r1, r2, r3
…                                     …

Interpretive ISS (ISS code):
int Reg[32];
…
while(1) {
Fetch();
Decode();
Execute();
InterruptHandler();
}
switch INSN {
case ADD: r3=r1+r2;
case SUB: ...
}

Compiled ISS (intermediary C code generation and recompilation):
…
add(r1, r2, r3);
…
#define Add(r1, r2, r3)\
r3=r1+r2
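The interpretive ISS loop (fetch, decode, execute) can be sketched in a few lines. This is a minimal, instruction-accurate illustration only: the tuple-based instruction format, the function name `run_iss` and the two supported opcodes are assumptions for this example, and the operand order follows the slide's `add r1, r2, r3` meaning r3 = r1 + r2. Interrupt handling and timing are left out.

```python
def run_iss(program, registers=None, max_steps=1000):
    """Interpret a list of (opcode, src1, src2, dst) instructions.

    Returns the register file and the number of executed instructions.
    """
    reg = list(registers) if registers is not None else [0] * 32
    pc = 0
    steps = 0
    while pc < len(program) and steps < max_steps:
        opcode, src1, src2, dst = program[pc]      # fetch and decode
        if opcode == "ADD":                        # execute
            reg[dst] = reg[src1] + reg[src2]
        elif opcode == "SUB":
            reg[dst] = reg[src1] - reg[src2]
        else:
            raise ValueError(f"unknown opcode {opcode!r} at pc={pc}")
        pc += 1
        steps += 1
    return reg, steps


# Hypothetical program: r3 = r1 + r2, then r4 = r3 - r1
start = [0] * 32
start[1], start[2] = 2, 3
final_regs, executed = run_iss(
    [("ADD", 1, 2, 3), ("SUB", 3, 1, 4)], registers=start
)
```

A compiled ISS would instead translate each target instruction into host code once, which removes the decode step from the simulation loop.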
4 - 50
F: Register Transfer Model
Computation and Communication
ƒ cycle-timed
ƒ modeled on the level of registers,
combinatorial (stateless)
functions, memory
and digital signals

PE1, PE2: microprocessors
PE3, PE4: custom hardware

[Figure: PE1 and PE2 as instruction sequences, PE3 and PE4 as state machines S0 to S4, connected by wires and interrupt lines]

4-5
Different Abstraction Models
Model                                     Communication time   Computation time   Communication scheme    PE interface
A. Un-timed functional model              No                   No                 Variables               (no PE)
B. Timed functional model                 No                   Approximate        Abstract channel        Abstract
C. Transaction model                      Approximate          Approximate        Abstract bus channel    Abstract
D. Cycle-accurate communication model     Cycle-accurate       Approximate        Protocol bus channel    Abstract
E. Cycle-accurate computation model       Approximate          Cycle-accurate     Abstract bus channel    Pin-accurate
F. Register transfer model                Cycle-accurate       Cycle-accurate     Bus (wires)             Pin-accurate

4-5
Trace-Based Simulation
(Un-timed Functional Model) and (Transaction Model)
ƒ Higher simulation speed (for large hardware/software systems,
multiprocessors)
ƒ Uses estimates of non-functional behavior

4-5
Trace-Based Simulation: 2 Phases
Phase 1: Trace generation
Input: application specification
Output: execution traces = sequence of
events ∈ {read; write; compute}
Method: un-timed functional simulation

Phase 2: Trace-based simulation
Input:
ƒ execution traces
ƒ architecture specification
ƒ mapping specification
Output: performance estimation results, e.g.
execution time, processor load and bus load
Method: map abstract read, write and
compute primitives onto virtual machines that
reflect binding and resource sharing (mapping)
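The second phase can be sketched as follows. This is a deliberately simplified illustration: it only accumulates the busy time that a trace of read, write and compute events causes on the processor and the bus it is mapped to, and it does not model contention or event ordering. The function name `simulate_traces`, the cost parameters and the example traces are all hypothetical.

```python
def simulate_traces(traces, mapping, compute_cycles, bus_cycles_per_token):
    """Accumulate resource load from execution traces.

    traces:  process name -> list of (kind, amount) with kind in
             {"read", "write", "compute"}.
    mapping: process name -> processor name (binding).
    compute_cycles: processor name -> cycles per abstract compute unit.
    bus_cycles_per_token: cycles the shared bus is busy per token.
    Returns (busy cycles per processor, busy cycles on the bus).
    """
    processor_busy = {}
    bus_busy = 0
    for process, events in traces.items():
        pe = mapping[process]
        for kind, amount in events:
            if kind == "compute":
                cycles = amount * compute_cycles[pe]
                processor_busy[pe] = processor_busy.get(pe, 0) + cycles
            elif kind in ("read", "write"):
                bus_busy += amount * bus_cycles_per_token
            else:
                raise ValueError(f"unknown trace event {kind!r}")
    return processor_busy, bus_busy


# Hypothetical traces of two processes sharing one processor
load, bus = simulate_traces(
    {"P1": [("read", 2), ("compute", 10), ("write", 1)],
     "P2": [("compute", 5)]},
    mapping={"P1": "risc", "P2": "risc"},
    compute_cycles={"risc": 2},
    bus_cycles_per_token=4,
)
```

Because both processes are bound to the same processor, their compute cycles add up on that resource, which is the resource-sharing effect the mapping is meant to expose.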

4 - 54
Cosimulation: Motivation: Mixed Models
The simulation is very much dependent on the
system description model
How to simulate several abstraction levels or several models of computation?
Motivating:
1. Different abstraction levels
2. Different description languages (e.g. C, C++)
3. Different models of computation

more abstract less abstract


address
data
pac et cmd
cnfg
status

4 - 55
Cosimulation: Example
Environments for multiprocessor system cosimulation:
ƒ Several ISSs coupled with HW/RTL simulation: accurate, but
slow (especially for multiple ISSs running in parallel)
ƒ ISSs are replaced with higher-level simulation models: speeds
up simulation time

[Figure: left, ISSs and HW simulators (SystemC) attached through cosimulation interfaces to an interconnect; right, native execution (UNIX) of tasks T1, T2, T3 in place of the ISSs, attached through the same cosimulation interfaces]

4-5
Cosimulation: Single vs. Multiple Engines

[Figure: single simulation engine (models 1, 2, ..., n combined in a unified model on one simulator) versus multiple simulation engines (simulators #1, #2, ..., #n connected by a cosimulation bus)]

4-5
Hardware/Software Codesign

Worst-Case Execution Time Analysis

doc. dr. Gregor Papa

Jožef Stefan International Postgraduate School


5-
System Design
Specification

System Synthesis   Estimation

SW Compilation   Instruction Set   HW Synthesis

Intellectual Prop. Code   Intellectual Prop. Block

Machine Code   Net lists


5-
Contents
Introduction
ƒ problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
ƒ must, may analysis
Pipelines
ƒ Abstract pipeline models
ƒ Integrated analyses

5-
Industrial Needs
Hard real-time systems, often in safety-critical
applications abound
ƒ Aeronautics, automotive, train industries, manufacturing
control

Sideairbag in car,
Reaction in 1 mSec

Wing vibration of airplane,


sensing every mSec

5-4
Hard Real-Time Systems
Embedded controllers are expected to finish their tasks
reliably within time bounds.

Task scheduling must be performed.

Essential: upper bound on the execution times of all
tasks statically known.

Commonly called the Worst-Case Execution Time (WCET)

Analogously, Best-Case Execution Time (BCET)

5-5
Measurement: Industry's best practice

[Figure: distribution of execution times over the execution-time axis, marking the best-case execution time, the (unsafe) execution time measurement, the worst-case execution time and the upper bound]

Works if either the worst-case input can be determined, or
exhaustive measurement is performed.
Otherwise, determine an upper bound from the execution times of
instructions.
5-
(Most of) Industry's Best Practice
Measurements: determine execution times directly by
observing the execution or a simulation on a set of inputs.
ƒ Does not guarantee an upper bound to all executions.
Exhaustive execution in general not possible!
ƒ Too large space of input domain x set of initial execution
states.

Compute upper bounds along the structure of the


program:
ƒ Programs are hierarchically structured.
ƒ Statements are nested inside statements.
ƒ So, compute the upper bound for a statement from the upper
bounds of its constituents

5-
Sequence of Statements

A ≡ A1; A2;
Constituents of A: A1 and A2

Upper bound for A is the sum of the upper
bounds for A1 and A2

ub(A) = ub(A1) + ub(A2)

5-8
Conditional Statement
A ≡ if B
    then A1
    else A2
Constituents of A: condition B and statements A1 and A2

ub(A) = ub(B) + max(ub(A1), ub(A2))

5-
Loops
A ≡ for i ← 1 to 100 do
      A1

ub(A) = ub(i ← 1) + 100 × (ub(i ≤ 100) + ub(A1)) + ub(i ≤ 100)
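The three structural rules (sequence, conditional, loop) can be combined into one recursive computation over the program structure. The sketch below is an illustration under assumptions: the tuple-based statement representation, the function name `ub` used as a Python function, and all cycle counts are hypothetical; only the three formulas come from the slides.

```python
def ub(stmt, cycles):
    """Upper bound of the execution time of a structured statement.

    stmt is one of:
      ("basic", name)                  -> cycles[name]
      ("seq", s1, s2, ...)             -> sum of the parts
      ("if", cond, then_s, else_s)     -> ub(cond) + max(ub(then), ub(else))
      ("for", init, test, body, n)     -> ub(init) + n*(ub(test)+ub(body)) + ub(test)
    """
    kind = stmt[0]
    if kind == "basic":
        return cycles[stmt[1]]
    if kind == "seq":
        return sum(ub(part, cycles) for part in stmt[1:])
    if kind == "if":
        _, cond, then_s, else_s = stmt
        return ub(cond, cycles) + max(ub(then_s, cycles), ub(else_s, cycles))
    if kind == "for":
        _, init, test, body, iterations = stmt
        return (ub(init, cycles)
                + iterations * (ub(test, cycles) + ub(body, cycles))
                + ub(test, cycles))       # final, failing loop test
    raise ValueError(f"unknown statement kind {kind!r}")


# Hypothetical cycle counts for the constituents
costs = {"init": 1, "test": 2, "A1": 10, "A2": 4, "B": 3}
loop = ("for", ("basic", "init"), ("basic", "test"), ("basic", "A1"), 100)
branch = ("if", ("basic", "B"), ("basic", "A1"), ("basic", "A2"))
sequence = ("seq", ("basic", "A1"), ("basic", "A2"))
```

With these assumed costs the loop bound is 1 + 100 × (2 + 10) + 2 = 1203 cycles. The scheme relies on constant execution times per constituent, which is the assumption the following slides call into question for modern processors.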

5-
How to start?
Assignment x ← a + b:
load a
load b
add
store x

Assumes constant execution times for instructions.

ub(x ← a + b) =
cycles(load a) +
cycles(load b) +
cycles(add) +
cycles(store x)

[Table: cycles per instruction (add, load, store, move)]

Not applicable to modern processors.
5-
Modern Hardware Features
Modern processors increase performance by using:
caches, pipelines, branch prediction, speculation

These features make WCET computation difficult:


Execution times of instructions vary widely.
ƒ Best case - everything goes smoothly: no cache miss,
operands ready, needed resources free, branch correctly
predicted.
ƒ Worst case - everything goes wrong: all loads miss the
cache, resources needed are occupied, operands are not
ready.
ƒ Span may be several hundred cycles

5-
Access Times
LOAD r2, _a
x = a + b; LOAD r1, _b
ADD r3,r2,r1

Execution Time (Clock Cycles)

2
Clock Cycles
1

Best Case Worst Case

5-
Timing Accidents and Penalties
Timing Accident: cause for an increase of the execution
time of an instruction
Timing Penalty: the associated increase
Types of timing accidents
ƒ Cache misses
ƒ Pipeline stalls
ƒ Branch mispredictions
ƒ Bus collisions
ƒ Memory refresh of DRAM
ƒ TLB miss

5-
Overall Approach: Modularization
Micro-architecture Analysis:
ƒ Uses Abstract Interpretation
ƒ Excludes as many Timing Accidents as possible
ƒ Determines WCET for basic blocks (in contexts)

Worst-case Path Determination


ƒ Maps control flow graph to an integer linear program
ƒ Determines upper bound and associated path

5- 5
Contents
Introduction
ƒ problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
ƒ must, may analysis
Pipelines
ƒ Abstract pipeline models
ƒ Integrated analyses

5-
Control Flow Graph
what_is_this {
  read (a,b);
  done = FALSE;
  repeat {
    if (a>b)
      a = a-b;
    elseif (b>a)
      b = b-a;
    else done = TRUE;
  } until done;
  write (a);
}

[Figure: control flow graph of the program; branch edges are labelled a>b, a<=b, a<b, a=b, !done, done]

5-
Program Path Analysis
Program Path Analysis
ƒ which sequence of instructions is executed in the worst case
(longest runtime)?
ƒ problem: the number of possible program paths grows
exponentially with the program length
Model
ƒ fixed number of cycles for each basic block (from static
analysis)
ƒ loops must be bounded
Concept
ƒ Transform structure of CFG into a set of (integer) linear
equations.
ƒ Solution of the Integer Linear Program (ILP) yields bound on
the WCET.

5- 8
Basic Block
Definition: A basic block is a sequence of instructions
where the control flow enters at the beginning and exits at
the end, without stopping in between or branching (except at
the end).

t1 := c - d
t2 := e * t1
t3 := b * t1
t4 := t2 + t3
if t4 < 10 goto L

5-
Basic Blocks
Determine basic blocks of a program
1. Determine the block beginnings:
the first instruction
targets of un/conditional jumps
instructions that follow un/conditional jumps
2. Determine the basic blocks:
there is a basic block for each block beginning
the basic block consists of the block beginning and runs
until the next block beginning (exclusive) or until the
program ends
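The two steps above can be sketched directly. The instruction representation (a dictionary with optional `label` and `jump` fields) and the function name `basic_blocks` are assumptions made for this illustration; the example program is the small loop from the next slide.

```python
def basic_blocks(instructions):
    """Partition a list of instructions into basic blocks.

    Each instruction is a dict with "text" and optionally "label" (a name
    other instructions can jump to) and "jump" (the label it may jump to).
    Returns a list of blocks, each a list of instruction indices.
    """
    labels = {ins["label"]: i
              for i, ins in enumerate(instructions) if ins.get("label")}
    # Step 1: determine the block beginnings.
    leaders = {0} if instructions else set()
    for i, ins in enumerate(instructions):
        target = ins.get("jump")
        if target is not None:
            leaders.add(labels[target])           # target of a jump
            if i + 1 < len(instructions):
                leaders.add(i + 1)                # instruction after a jump
    # Step 2: each block runs up to the next beginning (exclusive).
    starts = sorted(leaders)
    ends = starts[1:] + [len(instructions)]
    return [list(range(s, e)) for s, e in zip(starts, ends)]


program = [
    {"text": "i := 0"},
    {"text": "t2 := 0"},
    {"text": "t2 := t2 + i", "label": "L"},
    {"text": "i := i + 1"},
    {"text": "if i < 10 goto L", "jump": "L"},
    {"text": "x := t2"},
]
blocks = basic_blocks(program)
```

For this program the sketch yields three blocks: the initialization, the loop body ending in the conditional jump, and the final assignment.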

5-
Control Flow Graph with Basic Blocks
Degenerated control flow graph (CFG)
ƒ the nodes are the basic blocks

i := 0
t2 := 0
L t2 := t2 + i
i := i + 1
if i < 10 goto L i < 10
x := t2 i >= 10

5-
Example
/* k >= 0 */
s = k;
WHILE (k < 10) {
  IF (ok)
    j++;
  ELSE {
    j = 0;
    ok = true;
  }
  k++;
}
r = j;

[Figure: control flow graph of the program with the basic blocks s = k; WHILE (k<10); if (ok); j++; j = 0; ok = true; k++; r = j;]
5-
Calculation of the WCET
Definition: A program consists of N basic blocks, where
each basic block Bi has a worst-case execution time ci and
is executed for exactly xi times. Then, the WCET is given by

WCET = \sum_{i=1}^{N} c_i \cdot x_i
ƒ the ci values are determined using the static analysis.
ƒ how to determine xi ?
• structural constraints given by the program structure
• additional constraints provided by the programmer (bounds for
loop counters, etc.; based on knowledge of the program context)

5-
Structural Constraints
Flow equations:
d1 = d2 = x1
d2 + d8 = d3 + d9 = x2
d3 = d4 + d5 = x3
d4 = d6 = x4
d5 = d7 = x5
d6 + d7 = d8 = x6
d9 = d10 = x7

[Figure: control flow graph with basic blocks B1 (s = k;), B2 (WHILE (k<10)), B3 (if (ok)), B4 (j++;), B5 (j = 0; ok = true;), B6 (k++;), B7 (r = j;) and edges d1 to d10]
5 - 24
Additional Constraints
ƒ loop is executed for at most 10 times: x3 ≤ 10 · x1
ƒ B5 is executed for at most one time: x5 ≤ 1 · x1

[Figure: control flow graph with basic blocks B1 to B7 and edges d1 to d10, as on the previous slide]
5 - 25
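WCET - ILP
ILP with structural and additional constraints:

WCET = \max \{ \sum_{i=1}^{N} c_i \cdot x_i \mid d_1 = 1 \wedge \sum_{j \in in(B_i)} d_j = \sum_{k \in out(B_i)} d_k = x_i,\ i = 1 \dots N \wedge \text{additional constraints} \}

ƒ d_1 = 1: program is executed once
ƒ the flow equations over in(B_i) and out(B_i) are the structural constraints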
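For the seven-block example, the ILP is small enough to solve by enumeration, which makes the role of each constraint visible. The sketch below is not an ILP solver: it enumerates the two free edge counts d4 and d5, derives all other counts from the flow equations, and keeps the best objective value. The block costs c1 to c7 and the function name `wcet_by_enumeration` are hypothetical.

```python
from itertools import product


def wcet_by_enumeration(cost, loop_bound=10, else_bound=1):
    """Maximize sum(c_i * x_i) for the example CFG with blocks B1..B7.

    cost: list of 7 per-block costs [c1, ..., c7].
    Returns (wcet, (x1, ..., x7)).
    """
    best_value, best_counts = None, None
    # x4 = d4 (then-branch count), x5 = d5 (else-branch count);
    # the range of x5 enforces x5 <= else_bound * x1 with x1 = 1.
    for x4, x5 in product(range(loop_bound + 1), range(else_bound + 1)):
        x1 = 1                      # d1 = 1: program is executed once
        x3 = x4 + x5                # d3 = d4 + d5 = x3
        if x3 > loop_bound * x1:    # additional constraint x3 <= 10 * x1
            continue
        x6 = x4 + x5                # d6 + d7 = d8 = x6
        x7 = x1                     # d2 + d8 = d3 + d9 with d8 = d3 gives d9 = d2
        x2 = x3 + x7                # d3 + d9 = x2
        counts = (x1, x2, x3, x4, x5, x6, x7)
        value = sum(c * x for c, x in zip(cost, counts))
        if best_value is None or value > best_value:
            best_value, best_counts = value, counts
    return best_value, best_counts


# Hypothetical costs for B1..B7 in cycles
wcet, counts = wcet_by_enumeration([2, 3, 2, 4, 6, 2, 1])
```

With these assumed costs the worst case takes the more expensive else-branch B5 once and the then-branch B4 nine times. A real tool would pass the same objective and constraints to an ILP solver instead of enumerating.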

5 - 26
Contents
Introduction
ƒ problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
ƒ must, may analysis
Pipelines
ƒ Abstract pipeline models
ƒ Integrated analyses

5-2
Abstract Interpretation (AI)
Semantics-based method for static program analysis

Basic idea of AI: Perform the program's computations


using value descriptions or abstract values in place of the
concrete values, start with a description of all possible
inputs

AI supports correctness proofs

5-2
Abstract Interpretation: the Ingredients
abstract domain: related to concrete domain by
abstraction and concretization functions,
e.g. L → Intervals,
where Intervals = LB × UB, LB = UB = Int ∪ {-∞, ∞}
instead of L → Int
abstract transfer functions for each statement type:
abstract versions of their semantics,
e.g. + : Intervals × Intervals → Intervals where
[a,b] + [c,d] = [a+c, b+d] with + extended to -∞, ∞
a join function combining abstract values from different
control-flow paths,
e.g. ⊔ : Interval × Interval → Interval where
[a,b] ⊔ [c,d] = [min(a,c), max(b,d)]
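The interval domain with its abstract addition and join can be written down in a few lines. This sketch represents an interval as a pair (lower, upper) and uses floating-point infinities for -∞ and ∞; the function names `interval_add` and `interval_join` are chosen for this illustration only.

```python
import math

INF = math.inf


def interval_add(x, y):
    """Abstract addition: [a,b] + [c,d] = [a+c, b+d]."""
    (a, b), (c, d) = x, y
    return (a + c, b + d)


def interval_join(x, y):
    """Join at control-flow merges: [a,b] join [c,d] = [min(a,c), max(b,d)]."""
    (a, b), (c, d) = x, y
    return (min(a, c), max(b, d))


# A value known to be in [1,4] plus a value known to be in [-2,3]
total = interval_add((1, 4), (-2, 3))
# The same variable arriving from two paths with different intervals
merged = interval_join((0, 4), (10, 12))
# Unknown bounds are represented by infinities
unbounded = interval_add((-INF, 0), (1, INF))
```

The join shows where precision is lost: after merging [0,4] and [10,12] the analysis only knows [0,12], which also contains values that neither path can produce.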
5-2
Value Analysis
Motivation:
ƒ Provide access information to data-cache/pipeline analysis
ƒ Detect infeasible paths
ƒ Derive loop bounds

Method: calculate intervals at all program points, i.e. lower


and upper bounds for the set of possible values occurring
in the machine program (addresses, register contents,
local and global variables).

5 - 30
Value Analysis
D :[- , ], :[ x , x ]
move #4,D0 Intervals are computed along
the edges
D :[ , ], D :[- , ], At joins, intervals are unioned
:[ x , x ]

add D1,D0

D : [- ,+ ] D : [- , ]
D :[ , ], D :[- , ],
:[ x , x ]
D : [- ,+ ]
move (A0,D0),D1
Which address is accessed here?
access [ x , x ]
5-3
Contents
Introduction
ƒ problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
ƒ must, may analysis
Pipelines
ƒ Abstract pipeline models
ƒ Integrated analyses

5-3
Caches: Fast Memory on Chip
Caches are used, because
ƒ Fast main memory is too expensive
ƒ The speed gap between CPU and memory is too large and
increasing

Caches work well in the average case:


ƒ Programs access data locally (many hits)
ƒ Programs reuse items (instructions, data)
ƒ Access patterns are distributed evenly across the cache

5 - 33
Caches

access Processor
takes
~ 1 cycle

fast, small,
Cache expensive
access
takes Bus
~ 100 cycles
(relatively)
Memory slow, large,
cheap

5-3
Caches: How They Work
CPU wants to read/write at memory address a,
sends a request for a to the bus.
Cases:
ƒ Block m containing a in the cache (hit):
request for a is served in the next cycle.
ƒ Block m not in the cache (miss):
m is transferred from main memory to the cache,
m may replace some block in the cache,
request for a is served asap while transfer still continues.
Several replacement strategies: LRU, PLRU, FIFO, ...
determine which line to replace.

5 - 35
A-Way Set Associative Cache

5-3
LRU Strategy
Each cache set has its own replacement logic => Cache sets
are independent. Everything explained in terms of one set
LRU-Replacement Strategy:
ƒ Replace the block that has been Least Recently Used
ƒ Modeled by Ages
Example: 4-way set associative cache
access age 0 age 1 age 2 age 3
m m m m
m (miss) m m m m
m (hit) m m m m
m (miss) m m m m
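The age-based LRU update for one cache set can be sketched as follows. The set is modeled as a list ordered by age, with the youngest block at index 0; the function name `lru_access` and the block names in the example are hypothetical. The sketch models the concrete cache behavior only, not the abstract must or may analysis.

```python
def lru_access(cache_set, block, ways=4):
    """Access `block` in one cache set under LRU replacement.

    cache_set: list of blocks ordered by age (index 0 = youngest).
    Returns (new_set, hit). On a hit the block becomes youngest and the
    blocks that were younger age by one; on a miss all blocks age by one
    and the oldest is evicted if the set is full.
    """
    hit = block in cache_set
    others = [b for b in cache_set if b != block]
    new_set = [block] + others
    return new_set[:ways], hit


# Hypothetical access sequence on an initially empty 4-way set
state = []
hits = []
for access in ["a", "b", "c", "d", "b", "e"]:
    state, was_hit = lru_access(state, access)
    hits.append(was_hit)
```

After the four initial misses the set holds d, c, b, a by increasing age. The access to b is a hit and rejuvenates it, and the miss on e then evicts a, the least recently used block.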

5-3
Cache Analysis
How to statically precompute cache contents:

ƒ Must Analysis:
For each program point (and calling context), find out which
blocks are in the cache.
Determines safe information about cache hits. Each
predicted cache hit reduces WCET.

ƒ May Analysis:
For each program point (and calling context), find out which
blocks may be in the cache. Complement says what is not in
the cache.

Determines safe information about cache misses. Each


predicted cache miss increases BCET.

5-3
Contexts
Cache contents depends on the context, i.e. calls and
loops

First iteration loads the cache:


ƒ Intersection loses most of the
information. while cond
join (must)
Distinguish as many contexts as useful:
ƒ unrolling for caches
ƒ unrolling for branch prediction (pipeline)

5 - 50
Contents
Introduction
ƒ problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
ƒ must, may analysis
Pipelines
ƒ Abstract pipeline models
ƒ Integrated analyses

5-5
Comparison of Architectures

[Figure: instruction execution with single-cycle processing, multiple-cycle processing, and pipelined processing]

5-5
Hardware Features: Pipelines
t t t t
Fetch Fetch
Decode Decode Fetch
Execute Execute Decode Fetch
WB WB Execute Decode Fetch
WB Execute Decode
WB Execute
WB

Ideal Case: 1 Instruction per Cycle

5 - 53
D t th o e e ch tectu e

5-5
Hardware Features: Pipelines
.

Several instructions can be executed in parallel.

Some pipelines can begin more than one instruction per


cycle: VLIW, Superscalar.

Some CPUs can execute instructions out of order.

: Hazards and cache misses.

5 - 55
Pipeline Hazards
Pipeline Hazards:
ƒ Data Hazards: Operands not yet available
(Data Dependences)

ƒ Resource Hazards: Consecutive instructions use same


resource

ƒ Control Hazards: Conditional branch

ƒ Instruction-Cache Hazards: Instruction fetch causes cache


miss

5-5
o to d

5-5
D t d

5-5
t tc o h d
: prediction of cache hits on instruction or
operand fetch or store

l z r4 2 r1 Hi
: analysis o data control hazards
add r4 r5 r6
l z r7 1 r1 pera d
add r8 r4 r4 read

: analysis o resource hazards


F
E

5-5
Concrete State Machine
Processor (pipeline, cache, memory, inputs) viewed as a state machine
performing transitions every clock cycle.
Starting in an initial state for an instruction, transitions are
performed until a final state is reached:
ƒ final state: instruction has left the pipeline
ƒ number of transitions: execution time of instruction

function exec (b: basic block, s: concrete pipeline state) t: trace
ƒ interprets instruction stream of b starting in state s, producing trace t
ƒ successor basic block is interpreted starting in initial state last(t)
ƒ length(t) gives number of cycles

5-
Abstract Pipeline for Basic Block
function exec (b: basic block, s: abstract pipeline state)
t: trace
ƒ interprets instruction stream of b (annotated with cache
information) starting in state s, producing trace t
ƒ length(t) gives number of cycles

ƒ Abstract states may lack information, e.g. about cache


contents.
ƒ Assuming local worst cases is safe
(in the case of no timing anomalies)
ƒ Traces may be longer (but never shorter).

5-
Wh t d ee t
for successor basic block? In particular, if
there are several predecessor blocks.
:
ƒ sets of states
ƒ combine by assuming that local worst case is safe

s s
s

5- 2
u o te

using statically computed effective


addresses and loop bounds

ƒ assume cache hits where predicted


ƒ assume cache misses where predicted or not excluded.
ƒ Only the worst result states of an instruction need to be
considered as input states for successor instructions

5- 3
Hardware/Software Codesign

Multi-Criteria Optimization

doc. dr. Gregor Papa

Jožef Stefan International Postgraduate School


-
System Design
Specification

System Synthesis   Estimation

SW Compilation   Instruction Set   HW Synthesis

Intellectual Prop. Code   Intellectual Prop. Block

Machine Code   Net lists


-2
Design Space Exploration

Application Architecture

Estimation

multi-objective optimization

-3
ectu e o

Optimization
Design
Implementation

-
Evolutionary Multiobjective Optimization
Algorithms
What are Evolutionary Algorithms?

randomized, problem-independent search heuristics


→ applicable to black-box optimization problems

How do they work?

by iteratively improving a population of solutions by


variation and selection
→ can find many different optimal solutions in a single run

-5
The Knapsack Problem
eight 75 g eight eight 3 g eight
pro it 5 15 g pro it 7 1 g
pro it 8 pro it 3

Goal: choose subset that


maximizes overall profit
minimizes total weight
-
he o ut o ce

pr i

15

ei h
5 g 1 g 15 g 2 g 25 g 3 g 35 g

-
The Trade-off Front
Observations: there is no single optimal solution, but
some solutions are better than others
pr i

selecting a
2 solution

15
inding the good
1 solutions

ei h
5 g 1 g 15 g 2 - g 25 g 3 g 35 g
Dec o e ect o ut o
o che pro it more important than cost ranking
eight must not e ceed 24 g constraint
pr i

15
too heavy
1

ei h
5 g 1 g 15 g 2 - g 25 g 3 g 35 g
Whe to e the Dec o
Be o e t to te t to

searches or a set o
ranks ob ectives green solutions
de ines constraints

pr i
searches or one selects one solution
green solution 2 considering constraints
15
too heavy
1

5 decision making o ten easier


ei h
5 g 1
evolut. algorithms
g 15 g 2 g 25 g 3
ell suited
g 35 g
-
Optimization Alternatives
Use of classical single-objective optimization methods
ƒ simulated annealing, tabu search
ƒ integer linear program
ƒ other constructive or iterative heuristic methods
Decision making (weighting the different objectives) is
done before the optimization.

Population-based optimization methods


ƒ evolutionary algorithms
ƒ genetic algorithms
Decision making is done after the optimization.

-
Ft e d ut e ect e
e to ed do ce ed
ei h ed sum

y2 y2

y1 y1

parameter oriented set oriented


scaling dependent - 2
scaling independent
We hted o t Fu ct o
parameters
multiple objectives (y1, y2, …, yk) → transformation → single objective y

example: weighting approach (maximization problem)

weights (w1, w2, …, wk)

y = w1·y1 + … + wk·yk

y1
- 3
ut eo e E o ut o o th
11 1 solution 111 itness 19

itness
evaluation mating
selection

11 1 11
environmental
selection
recombination
1 11

mutation

recombination mutation variation


-
t t o t E cod o o ut o

1 1
item 1 item 2 item 3 item 4

subset
- 5
e e c u t o ect e E
population archive

sample
update
vary
truncate
select

ne population - ne archive
E o ut o o th ct o
ma . y2
hypothetical trade o ront

-
min. y1
B c Box t to
o ect e u ct o
Stretch Module ecision Module andling Module
X lane \
1

lane a re uire
v a
D Br
v D r
v t point o
gear

ob ective
change D r

decision
t n gear
2 1
gear 4 3
lane R 5

vector
v a

vector ehicle Module gear


s clutch

lane v
a gear n

e.g. simulation model

ptimization lgorithm:
only allo ed to evaluate
direct search
-
Design Space Exploration

Specification, Optimization, Evaluation, Implementation

[Figure: trade-off among power consumption, latency and cost]

-
c et oce et o
B e
Embedded
Embedded
Internet
Internet
evices
evices o
ethod
to
d

do oth
c co d e
Wearable
WearableComputing
Computing
e d o

ccess Core

Mobile
MobileInternet
Internet

-2
Network Processor
Network processor = high-performance, programmable device
designed to efficiently execute
communication
workloads [Crowley et al.]

[Figure: incoming flows (packet streams; real-time flows, e.g. voice; non-real-time flows, e.g. sftp) enter the network processor (routing/forwarding, transcoding, encryption/decryption), which emits outgoing flows (processed packets)]

-2
t to ce o e e
Given:
ƒ specification of the task structure (task model):
for each flow the corresponding tasks to be executed
ƒ different usage scenarios (flow model):
sets of flows with different characteristics
Sought: network processor implementation
(architecture + task mapping + scheduling)
Objectives (performance model):
ƒ maximize performance
ƒ minimize cost
Subject to:
ƒ memory constraint
ƒ delay constraints

- 22
Ex o t o t te
or each usage
scenario separately
architecture task binding
template graph restrictions
e u to
co t uct e t te
allocation per ormance ch tectu e o e o ce
bindings cost vector

u t o ect e
o t to

architecture per ormance

- 23
ectu e o
Introduction

esign
Implementation

-2
Dominance, Pareto Points
A design point is dominated by another design point if the other point is
ƒ better or equal in all criteria and
ƒ better in at least one criterion.

A point is Pareto-optimal or a Pareto-point, if it is not


dominated.

The domination relation imposes a partial order on all


design points
ƒ We are faced with a set of optimal solutions.
ƒ Divergence of solutions vs. convergence.
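The dominance relation and the resulting set of non-dominated points can be sketched directly from the definition. The sketch assumes that all criteria are to be maximized (a criterion to be minimized, such as weight or cost, can be negated first); the function names and the example points are hypothetical.

```python
def dominates(a, b):
    """True if point a dominates point b (all criteria maximized):
    a is better or equal in all criteria and better in at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


# Hypothetical design points (y1, y2), both to be maximized
designs = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]
front = pareto_front(designs)
```

The points (1, 5), (2, 4) and (3, 3) are mutually incomparable, which illustrates that dominance is only a partial order and that the result is a set of optimal solutions rather than a single one.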

- 25
Multiobjective Optimization

-2
Multiobjective Optimization
Maximize (y1, y2, …, yk) = g(x1, x2, …, xn)

y2 y2

Pareto optimal not dominated

better

incomparable dominated

orse incomparable

y1 y1

Pareto set = set of all Pareto-optimal solutions


-2
Randomized Black-Box Search Algorithms
Idea: find good solutions without investigating all solutions
Assumptions: better solutions can be found in the neighborhood
of good solutions;
information available only by function evaluations

Randomized
search algorithm

t g t•t
randomly choose a randomly choose a
solution 1 to start ith solution t 1 using solutions
1 t

-2
e o do ed e ch o th
e o e ect o to

mating
selection

environmental
selection

E ≥1 both
evolutionary algorithm

1 no mating selection
tabu search

1 no mating selection
simulated annealing
-2
Limitations of Randomized Search Algorithms

The No-Free-Lunch Theorem

All search algorithms provide on average the same


performance on all possible functions
with finite search and objective spaces.

[Wolpert, Macready: 1997]

Remarks:
Not all functions are equally likely and realistic
We cannot expect to design an algorithm that beats all others
Ongoing research: which algorithm is suited for which class of
problem?
6 - 30
Course Synopsis
Introduction
Optimization
Design
Implementation
Design Choices
ƒ representation
ƒ fitness assignment
ƒ mating selection
ƒ environmental selection
ƒ variation operators
ƒ parameters
Comparison of Three Implementations
2-objective knapsack problem
[Figure: results of three implementations on the knapsack problem]
Trade-off between distance and diversity?
Representation
[Figure: search space → (decoder) → solution space → (objectives) → objective space]

Solutions are encoded by vectors, matrices, trees, lists, ...
ƒ fixed length or variable length
Issues:
ƒ completeness: each solution has an encoding
ƒ uniformity: all solutions are represented equally often
ƒ redundancy: cardinality of search space vs. solution space
ƒ feasibility: each encoding maps to a feasible solution
Example: Binary Vector Encoding
Given: graph
Goal: find minimum subset of nodes such that each edge is connected to at least one node of this subset (minimum vertex cover)
[Figure: example graph and its encoding as a bit vector with one "selected?" bit per node]
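
A minimal sketch of how such a bit-vector encoding could be decoded and checked, assuming the graph is given as an edge list. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct edge { size_t u, v; };   /* indices of the two end nodes */

/* Feasibility: every edge has at least one selected end node. */
bool is_vertex_cover(const bool *selected, const struct edge *edges,
                     size_t n_edges)
{
    assert(selected != NULL && edges != NULL);
    for (size_t e = 0; e < n_edges; e++)
        if (!selected[edges[e].u] && !selected[edges[e].v])
            return false;
    return true;
}

/* Objective to be minimized: number of selected nodes. */
size_t cover_size(const bool *selected, size_t n_nodes)
{
    size_t count = 0;
    for (size_t i = 0; i < n_nodes; i++)
        count += selected[i] ? 1 : 0;
    return count;
}
```

Not every bit vector is a feasible cover, so this encoding is complete but needs one of the constraint handling approaches discussed later.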
Example: Integer Vector Encoding
Given: graph, k colors
Goal: assign each node one of the k colors such that the number of connected nodes with the same color is minimized (graph coloring problem)
[Figure: example graph and its encoding as an integer vector with one color per node]
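Example: Real Vector Encoding
[Figure: vector of parameters x1, x2, …, xn with one real value per parameter]

Tree Example: Parking a Truck
[Figure: truck with cab and trailer backing up to a dock; steering angle u, cab angle d, trailer angle t, position (x, y)]
Goal: find function c with u = c(x, y, d, t)
ƒ constant speed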
Search Space for the Truck Problem
Operators: [list of operators lost with the figure]
Arguments:
ƒ position x
ƒ position y
ƒ cab angle d
ƒ trailer angle t
Search space: set of symbolic expressions using the above operators and arguments

Example Solution: Tree Representation
[Figure: expression tree that encodes a function (symbolic expression) u of x, y, d, t]

A Solution Found by an EA
[Figure: truck simulation and the encoded tree]
Fitness Assignment
Fitness F: scalar value representing the quality of an individual

The simple case: single-objective optimization
[Figure: solution in search space → solution in solution space → solution in objective space]

More difficult cases:
ƒ fitness takes into account not only the different objectives (compliance to Pareto optimality) but also properties of the whole population
ƒ multiple optima need to be approximated (diversity)
ƒ constraints are involved which have to be met
Simple example: Pareto Ranking
Fitness function, shown for six design points in the cost / execution time plane:
F(1) = 3
F(2) = 1
F(3) = 1
F(4) = 2
F(5) = 1
F(6) = 0
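
One common way to obtain such a ranking is to count, for every design point, how many other points dominate it. The sketch below assumes that rule (the slide's exact formula did not survive extraction) and assumes both cost and execution time are minimized; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Design point with the two objectives of the example; both are minimized. */
struct design_point { double cost; double exec_time; };

static int dominates_min(struct design_point a, struct design_point b)
{
    return a.cost <= b.cost && a.exec_time <= b.exec_time &&
           (a.cost < b.cost || a.exec_time < b.exec_time);
}

/* Assumed ranking rule: fitness[i] = number of points that dominate point i.
 * Non-dominated points get fitness 0, so smaller is better. */
void pareto_rank(const struct design_point *p, size_t n, int *fitness)
{
    assert(p != NULL && fitness != NULL);
    for (size_t i = 0; i < n; i++) {
        fitness[i] = 0;
        for (size_t j = 0; j < n; j++)
            if (j != i && dominates_min(p[j], p[i]))
                fitness[i]++;
    }
}
```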
Constraint Handling
Constraint: g(x1, x2, …, xn) ≥ 0
[Figure: a solution in solution space is feasible if g ≥ 0 and infeasible if g < 0]

Approaches:
ƒ construct initialization and variation such that infeasible solutions are not generated (resp. not inserted)
ƒ choose the representation such that decoding always yields a feasible solution
ƒ calculate the constraint violation from g(x1, x2, …, xn) and incorporate it into the fitness (fitness to be maximized), using a penalty function
ƒ include the constraints as new objectives
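
The penalty approach can be sketched as follows. The linear penalty and the `weight` parameter are assumptions made for illustration; the slide's own penalty function is not recoverable.

```c
#include <assert.h>

/* Violation of a constraint of the form g(x) >= 0:
 * zero when feasible, -g when infeasible. */
double violation(double g_value)
{
    return g_value >= 0.0 ? 0.0 : -g_value;
}

/* Hypothetical linear penalty: the fitness (to be maximized) is reduced
 * in proportion to the constraint violation. */
double penalized_fitness(double raw_fitness, double g_value, double weight)
{
    assert(weight >= 0.0);
    return raw_fitness - weight * violation(g_value);
}
```

A feasible solution keeps its raw fitness; an infeasible one is pushed down the ranking without being excluded outright, so the search can still pass through infeasible regions.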
Selection

Two types of selection:
ƒ mating selection = select for variation
ƒ environmental selection = select for survival
Tournament Selection
[Figure: population → tournament → mating pool]
1. uniformly choose individuals at random, independently of fitness (their number is the tournament size)
2. compare fitness and copy the best individual into the mating pool

Binary tournament selection means a tournament size of 2.
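
A minimal sketch of tournament selection, assuming larger fitness is better and using the C library's `rand()` as random source. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Picks tournament_size individuals uniformly at random (with replacement,
 * independently of fitness) and returns the index of the fittest one.
 * Assumption: larger fitness is better. */
size_t tournament_select(const double *fitness, size_t pop_size,
                         size_t tournament_size)
{
    assert(pop_size > 0 && tournament_size > 0);
    size_t best = (size_t)rand() % pop_size;
    for (size_t t = 1; t < tournament_size; t++) {
        size_t challenger = (size_t)rand() % pop_size;
        if (fitness[challenger] > fitness[best])
            best = challenger;
    }
    return best;
}

/* Fills the mating pool with the indices of tournament winners. */
void fill_mating_pool(const double *fitness, size_t pop_size,
                      size_t tournament_size, size_t *pool, size_t pool_size)
{
    for (size_t i = 0; i < pool_size; i++)
        pool[i] = tournament_select(fitness, pop_size, tournament_size);
}
```

A larger tournament size increases the selection pressure, because weak individuals win a tournament less often.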
Vector Mutation: Examples
Bit vectors:
ƒ each bit is flipped with a given probability
[Figure: bit vector before and after mutation]
Permutations:
ƒ swap
ƒ rearrange
[Figure: permutation before and after each operation]
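Mutation Operators on Trees: Grow
[Figure: tree before and after the grow operation]

Mutation Operators on Trees: Shrink
[Figure: tree before and after the shrink operation]

Mutation Operators on Trees: Switch
[Figure: tree before and after the switch operation]

Mutation Operators on Trees: Replace
[Figure: tree before and after the replace operation]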
Vector Recombination: Examples
Bit vectors:
[Figure: two parent bit vectors and the resulting child]
Permutations:
[Figure: two parent permutations and the resulting child]

Recombination of Trees
[Figure: two parent trees exchange subtrees]
A Generic Multiobjective EA
[Figure: the population is sampled, varied and selected to form the new population; the archive is updated and truncated to form the new archive]

SPEA2 Algorithm
Step 1: Generate initial population P0 and empty archive (external set) A0. Set t = 0.
Step 2: Calculate fitness values of individuals in Pt and At.
Step 3: At+1 = non-dominated individuals in Pt ∪ At. If size of At+1 > N then reduce At+1, else if size of At+1 < N then fill At+1 with dominated individuals in Pt and At.
Step 4: If t ≥ T then output the non-dominated set of At+1. Stop.
Step 5: Fill mating pool by binary tournament selection.
Step 6: Apply recombination and mutation operators to the mating pool and set Pt+1 to the resulting population. Set t = t + 1 and go to Step 2.
SPEA2 Fitness Assignment
Idea (Step 2): calculate dominance rank weighted by dominance count
[Figure: solutions in the (y1, y2) plane annotated with strength and fitness values]
ƒ strength of a solution: number of solutions it dominates
ƒ fitness of a solution: ∑ strengths of its dominators (0 for non-dominated solutions)

Note: higher objective function value = better; smaller fitness = better
Course Synopsis
Introduction
Optimization
Design
Implementation
Implementation: Components
A framework that
ƒ provides ready-to-use modules (algorithms, applications)
ƒ is simple to use
ƒ is independent of programming language and OS
ƒ comes with minimum overhead

Idea: separate the problem-dependent from the problem-independent part (cut)
[Figure: the components Representation, Selection, Recombination, Objective functions, Fitness assignment, Archiving and Mutation, divided by the cut]
The Concept of PISA
[Figure: algorithms on one side and applications (knapsack, network processor design) on the other, coupled through a text-based interface]
Platform and programming language independent Interface for Search Algorithms [Bleuler et al.]
PISA: Implementation
[Figure: selector process and variator process exchange text files through a shared file system]
ƒ selector process (application independent): mating and environmental selection; individuals are described by IDs and objective vectors
ƒ handshake protocol: state / action, individual IDs, objective vectors, parameters
ƒ variator process (application dependent): variation operators; stores and manages individuals
Hardware/Software Codesign

7. Mapping Applications To Architectures

doc. dr. Gregor Papa

Jožef Stefan International Postgraduate School
System Design
[Figure: design flow. Specification → System Synthesis (with Estimation) → SW Compilation, Instruction Set, HW Synthesis; Intellectual Prop. Code and Intellectual Prop. Block feed in; outputs are Machine Code and Net lists.]
Synthesis
Synthesis transforms behavior into structure.
ƒ allocation: select components
ƒ binding: assign functions to components
ƒ scheduling: determine execution order
Together these steps form the mapping; (allocation and) binding is sometimes called partitioning.
Application Specification
Depends on the underlying model of computation.
Examples (see also next slides):
ƒ Task graphs (data flow graph, control flow graph)
ƒ Process Networks (Kahn Process Network, Synchronous
Dataflow)
ƒ State Machine Representations (SpecCharts, StateCharts,
Polis) [not covered in this course].

For the mapping, very often only the network structure


and abstract properties of the processes are relevant
(abstraction from detailed process function).

Data Flow Graph
x = 3*a + b*b - c;
y = a + b*x;
z = b - c*(a + b);
[Figure: data flow graph of the three assignments with inputs a, b, c]
Control Flow Graph
what_is_this {
  read (a,b);
  done = FALSE;
  repeat {
    if (a>b)
      a = a-b;
    elseif (b>a)
      b = b-a;
    else done = TRUE;
  } until done;
  write (a);
}
[Figure: control flow graph of the program; the edges are labeled a>b, a<=b, a<b, a=b, !done, done]
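
For reference, a runnable C rendering of the pseudocode above. It assumes positive inputs (the subtraction loop would not terminate for zero), and returning the result replaces the slide's read/write calls.

```c
#include <assert.h>

/* C version of the slide's loop: repeated subtraction until a == b.
 * Assumption: both inputs are positive, otherwise the loop never ends. */
unsigned what_is_this(unsigned a, unsigned b)
{
    assert(a > 0 && b > 0);
    int done = 0;
    do {
        if (a > b)
            a = a - b;
        else if (b > a)
            b = b - a;
        else
            done = 1;
    } while (!done);
    return a;   /* the greatest common divisor of the two inputs */
}
```

Each branch of the if/else chain corresponds to one labeled edge of the control flow graph, and the do/while condition to the !done / done edges.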
Kahn Process Network
Hierarchical network for an MPEG application:
[Figure: hierarchical process network]
Architecture Specification
Depends on the underlying model of the platform.
Usually a graph notation is used for the elements; properties of the underlying platform are usually attached.
Example Architecture Specification
<processor name="processor1" type="DSP">
  <port name="processor_port" type="duplex" />
  <configuration name="clock" value="100 MHz" />
</processor>
<processor name="processor2" type="RISC"> … </processor>
<memory name="sharedmemory" type="DXM"> … </memory>
<hw_channel name="in_tile_link" type="bus">
  <port name="port1" type="duplex" />
  <port name="port2" type="duplex" />
  <port name="port3" type="duplex" />
  <configuration name="buswidth" value="32bit" />
</hw_channel>
<connection name="processor1link">
  <origin name="processor1">
    <port name="processor_port" />
  </origin>
  <target name="in_tile_link">
    <port name="port1" />
  </target>
</connection>
<connection name="processor2link"> … </connection>
<connection name="memorylink"> … </connection>
[Figure: DSP, RISC and DXM connected by a bus]
Mapping Specification
Relates application and architecture specification:
ƒ maps processes to computing resources
ƒ maps communication between processes (in case of process
networks) to communication paths of the architecture
ƒ specifies resource sharing disciplines and scheduling

Example (1)
Basic model with a data flow graph and static scheduling
Problem graph (data flow graph) GP(VP,EP):
[Figure: data flow graph and the corresponding problem graph with nodes 1 to 7]
Interpretation:
• VP consists of functional nodes VPf (task, procedure) and communication nodes VPc.
• EP represent data dependencies
Example (2)
Architecture graph GA(VA,EA):
[Figure: architecture (RISC, HWM1, HWM2 connected by a shared bus and a PTP bus) and the corresponding architecture graph]
• VA consists of functional resources VAf (RISC, ASIC) and bus resources VAc. These components are potentially allocatable.
• EA model directed communication.
Example (3)
Definition: A specification graph is a graph GS=(VS,ES) consisting of a problem graph GP, an architecture graph GA, and edges EM. In particular, VS=VP∪VA, ES=EP∪EA∪EM
[Figure: problem graph GP (nodes 1 to 7) linked by mapping edges EM to architecture graph GA (RISC, SB, HWM1, PTP, HWM2)]
Example (4)
Three main tasks of synthesis:
• Allocation α is a subset of VA.
• Binding β is a subset of EM, i.e., a mapping of functional nodes of VP onto resource nodes of VA.
• Schedule τ is a function that assigns a number (start time) to each functional node.
Example (5)
Definition: Given a specification graph GS, an implementation is a triple (α,β,τ), where α is a feasible allocation, β is a feasible binding, and τ is a schedule.
[Figure: specification graph annotated with start times τ, the selected bindings β, and the allocated resources α (RISC, HWM1, PTP bus, shared bus, HWM2)]
Design Space Exploration

Determine mapping
Determine important parameters (end-to-end delay, throughput, buffer space, output jitter, …)
Give feedback to optimization

[Figure: application and architecture are combined by the mapping and passed to estimation]
Hardware/Software Codesign

8. System Partitioning

doc. dr. Gregor Papa

Jožef Stefan International Postgraduate School
Partitioning

ƒ low level: at the register transfer (RT) level, at the netlist level
• split a digital circuit and map it to several devices (FPGAs, ASICs)
• system parameters are relatively well known (area, delay)

ƒ high level: at the system level
• comparison of design alternatives is mandatory (design space exploration)
• system parameters are unknown
• importance of estimation (analysis, simulation, rapid prototyping)
Model
(see previous lecture)
ƒ model application
ƒ define architectural template
ƒ identify possible bindings

ƒ Very often, parameters are attached to the above models that allow a simple assessment of the partitioning (allocation and binding)
ƒ Sometimes, more elaborate methods (simulation, analysis) are applied to give more accurate predictions
ƒ Example:
• allocation gives cost as the sum of the allocated component costs
• scheduling gives latency
• constraints: a feasible schedule stays within the maximum latency, a feasible allocation within the maximum cost
The Partitioning Problem
The partitioning problem is to assign n objects O = {o1, ..., on} to m blocks (also called partitions) P = {p1, ..., pm}, such that
ƒ p1 ∪ p2 ∪ ... ∪ pm = O
ƒ pi ∩ pj = { } ∀ i,j: i ≠ j and
ƒ cost c(P) is minimized

In system synthesis (simple model):
ƒ objects = data flow graph nodes
ƒ blocks = architecture graph nodes
Cost of a design point
ƒ may include C (system cost), L (latency in sec), P (power consumption)
ƒ requires estimation to find C, L, P

Example: linear cost function with penalty

f(C, L, P) = k1·hC(C,Cmax) + k2·hL(L,Lmax) + k3·hP(P,Pmax)

ƒ hC, hL, hP denote how strongly C, L, P violate the design constraints Cmax, Lmax, Pmax
ƒ k1, k2, k3: weighting and normalization
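
The linear cost function with penalty can be sketched in C as follows. The slide does not define hC, hL, hP; this sketch assumes a relative-overshoot measure for all three, and the struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical violation measure h(value, max): relative overshoot of the
 * bound, 0 if the constraint is met. */
double overshoot(double value, double max)
{
    assert(max > 0.0);
    return value > max ? (value - max) / max : 0.0;
}

struct weights { double k1, k2, k3; };   /* weighting and normalization */

/* f(C, L, P) = k1*h(C, Cmax) + k2*h(L, Lmax) + k3*h(P, Pmax) */
double design_cost(double C, double L, double P,
                   double Cmax, double Lmax, double Pmax, struct weights k)
{
    return k.k1 * overshoot(C, Cmax)
         + k.k2 * overshoot(L, Lmax)
         + k.k3 * overshoot(P, Pmax);
}
```

With this choice every design point that meets all three constraints has cost 0, so a secondary criterion would be needed to rank the feasible points.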
Partitioning Methods

ƒ enumeration
ƒ Integer Linear Programs (ILP)

ƒ constructive methods
• random mapping
• hierarchical clustering
ƒ iterative methods
• Kernighan-Lin Algorithm
• Simulated Annealing
• Evolutionary Algorithms (EA): see next lecture
Integer Programming Model
Ingredients:
ƒ Cost function
ƒ Constraints
Both involve linear expressions of integer variables from a set X.

Cost function: $C = \sum_{x_i \in X} a_i x_i$ with $a_i \in \mathbb{R},\ x_i \in \mathbb{N}$ (1)

Constraints: $\forall j \in J: \sum_{x_i \in X} b_{i,j}\, x_i \ge c_j$ with $b_{i,j}, c_j \in \mathbb{R}$ (2)

Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer programming (IP) problem.
If all $x_i$ are constrained to be either 0 or 1, the IP problem is said to be a 0/1 integer programming problem.
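Example

minimize: $C = a_1 x_1 + a_2 x_2 + a_3 x_3$ [numeric coefficients not recoverable]
subject to: $x_1 + x_2 + x_3 \ge 2$, $x_1, x_2, x_3 \in \{0, 1\}$

Optimal: [solution not recoverable]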
Remarks on Integer Programming
Maximizing the cost function can be done by minimizing −C.

Integer programming is NP-complete.

In practice, running times can increase exponentially with the size of the problem, but problems of some thousands of variables can still be solved with commercial solvers, depending on the size and structure of the problem.

IP models can be a good starting point for modeling, even if in the end heuristics have to be used to solve them.
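Integer Linear Program for Partitioning (1)
Binary variables $x_{i,k}$
ƒ $x_{i,k} = 1$: object $o_i$ in block $p_k$
ƒ $x_{i,k} = 0$: object $o_i$ not in block $p_k$

Cost $c_{i,k}$, if object $o_i$ is in block $p_k$

Integer linear program:
$x_{i,k} \in \{0, 1\}, \quad 1 \le i \le n,\ 1 \le k \le m$
$\sum_{k=1}^{m} x_{i,k} = 1, \quad 1 \le i \le n$
minimize $\sum_{k=1}^{m} \sum_{i=1}^{n} x_{i,k} \cdot c_{i,k}$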
Integer Linear Program for Partitioning (2)
Additional constraints
ƒ example: maximum number of $h_k$ objects in block $k$
$\sum_{i=1}^{n} x_{i,k} \le h_k, \quad 1 \le k \le m$

The idea of mapping the synthesis problem to an ILP is very popular:
ƒ Scheduling can be integrated.
ƒ Various additional constraints can be added.
ƒ If not solving to optimality, run times are acceptable and a solution with a guaranteed quality can be determined.
ƒ Finding the right equations to model the constraints is an art.
Constructive Methods
Random mapping
ƒ each object is assigned to a block randomly

Hierarchical clustering
ƒ stepwise grouping of objects
ƒ closeness function determines how desirable it is to group two objects

Constructive methods
ƒ are often used to generate a starting partition for iterative methods
ƒ show the difficulty of finding proper closeness functions
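Hierarchical Clustering - Example (1)
[Figure: graph with nodes v1, v2, v3, v4 and edge weights 10, 20, 10, 4; after merging v1 and v3 the remaining edge weights are 10, 7, 4]
v5 = v1 ∪ v3
closeness function: arithmetic mean of weights

Hierarchical Clustering - Example (2)
[Figure: graph with nodes v2, v4, v5 and edge weights 10, 7, 4; after merging, one edge of weight 5.5 remains]
v6 = v2 ∪ v5

Hierarchical Clustering - Example (3)
[Figure: graph with nodes v4 and v6 joined by an edge of weight 5.5]
v7 = v6 ∪ v4

Hierarchical Clustering - Example (4)
[Figure: cluster tree over the leaves v1, v2, v3, v4; cut lines through the tree yield the partitions]
step 1: v5 = v1 ∪ v3
step 2: v6 = v2 ∪ v5
step 3: v7 = v6 ∪ v4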
Iterative Methods - Kernighan-Lin (1)
Simple greedy heuristic:
ƒ Until there is no improvement in cost: re-group a pair of objects which leads to the largest gain in cost

[Figure: graph with nodes v1 to v9 split into two partitions]

Example: cost = number of edges crossing the partitions
gain = cost before re-group − cost after re-group [numeric values not recoverable]
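
The example's cost and gain can be sketched as follows for a two-way partition given as an edge list. The representation (`part[i]` is 0 or 1) and the function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct edge { size_t u, v; };   /* indices of the two end nodes */

/* Cost from the example: number of edges crossing the two partitions.
 * part[i] is 0 or 1, the block that node i belongs to. */
size_t cut_cost(const int *part, const struct edge *edges, size_t n_edges)
{
    size_t crossing = 0;
    for (size_t e = 0; e < n_edges; e++)
        if (part[edges[e].u] != part[edges[e].v])
            crossing++;
    return crossing;
}

/* Gain of re-grouping (swapping) nodes a and b that lie in different blocks:
 * cost before minus cost after. The swap is undone before returning, so
 * this is the "virtual" re-group used to compare candidate pairs. */
long swap_gain(int *part, const struct edge *edges, size_t n_edges,
               size_t a, size_t b)
{
    assert(part[a] != part[b]);
    long before = (long)cut_cost(part, edges, n_edges);
    int tmp = part[a]; part[a] = part[b]; part[b] = tmp;
    long after = (long)cut_cost(part, edges, n_edges);
    tmp = part[a]; part[a] = part[b]; part[b] = tmp;
    return before - after;
}
```

The greedy heuristic would evaluate `swap_gain` for all pairs, apply the best swap, and stop once no pair has a positive gain.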
Iterative Methods - Kernighan-Lin (2)
Problem:
ƒ Simple greedy heuristic can get stuck in a local minimum

Improved algorithm (Kernighan-Lin):
ƒ as long as a better partition is found:
• from all possible pairs of objects, virtually re-group the "best" (lowest cost of the resulting partition); then from the remaining, not yet touched objects virtually re-group the "best" pair, etc., until all objects have been re-grouped
• from these n/2 partitions take the one with smallest cost and actually perform the corresponding re-group operations
Iterative Methods - Simulated Annealing
From physics:
ƒ metal and gas take on a minimal-energy state during cooling down (under certain constraints):
• at each temperature, the system reaches a thermodynamic equilibrium
• the temperature is decreased sufficiently slowly
ƒ Probability that a particle "jumps" to a higher energy state:
$P(e_i, e_{i+1}, T) = e^{\frac{e_i - e_{i+1}}{k_B T}}$

Application to Combinatorial Optimization:
ƒ energy = cost of a solution (partition)
ƒ cost decreases with temperature; sometimes (with a certain probability) increases in cost are accepted
Iterative Methods - Simulated Annealing
temp = temp_start;
cost = c(P);
while (Frozen() == FALSE) {
  while (Equilibrium() == FALSE) {
    P' = RandomMove(P);
    cost' = c(P');
    deltacost = cost' - cost;
    if (Accept(deltacost, temp) > random[0,1)) {
      P = P';
      cost = cost';
    }
  }
  temp = DecreaseTemp (temp);
}

$\mathrm{Accept}(deltacost, temp) = e^{-\frac{deltacost}{k \cdot temp}}$
Iterative Methods - Simulated Annealing
Cooling Down: DecreaseTemp(), Frozen()
• temp_start = 1.0
• temp = α · temp (typical: 0.8 ≤ α ≤ 0.99)
• terminate when temp < temp_min or there is no more improvement

Equilibrium: Equilibrium()
• after defined number of iterations or when there is no more
improvement

Complexity
ƒ from exponential to constant, depending on the implementation of
the functions Equilibrium(), DecreaseTemp(), and Frozen()
ƒ the longer the runtime, the better the quality of results
ƒ typical: construct functions to get polynomial runtimes
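
The effect of the cooling schedule on the runtime can be illustrated with a small sketch of DecreaseTemp() and Frozen(). Only geometric cooling and the temp_min termination test are modeled; the struct and the iteration counter are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>

struct cooling { double temp; double alpha; double temp_min; };

/* Frozen(): stop once the temperature has dropped below temp_min. */
bool frozen(const struct cooling *c) { return c->temp < c->temp_min; }

/* DecreaseTemp(): geometric cooling, temp = alpha * temp. */
void decrease_temp(struct cooling *c) { c->temp *= c->alpha; }

/* Number of outer-loop iterations until frozen; the inner Equilibrium()
 * loop would run once per counted step. */
unsigned outer_iterations(double temp_start, double alpha, double temp_min)
{
    assert(alpha > 0.0 && alpha < 1.0 && temp_min > 0.0);
    struct cooling c = { temp_start, alpha, temp_min };
    unsigned steps = 0;
    while (!frozen(&c)) {
        decrease_temp(&c);
        steps++;
    }
    return steps;
}
```

A value of α close to 1 cools slowly and gives many more outer iterations than α = 0.8, which is the runtime versus quality trade-off noted above.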
Hardware/Software Codesign

9. Allocation

doc. dr. Gregor Papa

Jožef Stefan International Postgraduate School
Integer programming models
Ingredients:
ƒ Cost function
ƒ Constraints
Both involve linear expressions of integer variables from a set X.

Cost function: $C = \sum_{x_i \in X} a_i x_i$ with $a_i \in \mathbb{R},\ x_i \in \mathbb{N}$ (1)

Constraints: $\forall j \in J: \sum_{x_i \in X} b_{i,j}\, x_i \ge c_j$ with $b_{i,j}, c_j \in \mathbb{R}$ (2)

Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer linear programming (ILP) problem.
If all $x_i$ are constrained to be either 0 or 1, the ILP problem is said to be a 0/1 integer linear programming problem.
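Example

minimize: $C = a_1 x_1 + a_2 x_2 + a_3 x_3$ [numeric coefficients not recoverable]
subject to: $x_1 + x_2 + x_3 \ge 2$, $x_1, x_2, x_3 \in \{0, 1\}$

Optimal: [solution not recoverable]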
Remarks on integer programming

ƒ Maximizing the cost function: just minimize −C
ƒ Integer programming is NP-complete.
ƒ Running times depend exponentially on problem size, but problems of >1000 variables are solvable with a good solver (depending on the size and structure of the problem)
ƒ The case of $x_i \in \mathbb{R}$ is called linear programming (LP). LP has polynomial complexity, but most algorithms are exponential; in practice they are still faster than for ILP problems.
ƒ The case of some $x_i \in \mathbb{R}$ and some $x_i \in \mathbb{N}$ is called mixed integer-linear programming.
ƒ ILP/LP models can be a good starting point for modeling, even if in the end heuristics have to be used to solve them.
Simulated Annealing
ƒ General method for solving combinatorial optimization problems.
ƒ Based on the model of slowly cooling crystal liquids.
ƒ Some configuration is subject to changes.
ƒ Special property of simulated annealing: changes leading to a poorer configuration (with respect to some cost function) are accepted with a certain probability.
ƒ This probability is controlled by a temperature parameter: the probability is smaller for smaller temperatures.
Explanation
ƒ Initially, some random initial configuration is created.
ƒ Current temperature is set to a large value.
ƒ Outer loop:
• Temperature is reduced for each iteration
• Terminated if (temperature ≤ lower limit) or (number of iterations ≥ upper limit).
ƒ Inner loop: For each iteration:
• New configuration generated from current configuration
• Accepted if (new cost ≤ cost of current configuration)
• Accepted with temperature-dependent probability if (cost of new config. > cost of current configuration).
Multiobjective Optimization
Maximize (y1, y2, …, yk) = g(x1, x2, …, xn)
y2 y2

Pareto optimal = not dominated

better

incomparable dominated

worse incomparable

y1 y1

Pareto set = set of all Pareto-optimal solutions


-
Summary
Single objective optimization methods
ƒ decision is performed during optimization
ƒ Examples: integer programming, simulated annealing
Multiple objective optimization methods
ƒ decision is done after optimization
ƒ Example: Evolutionary algorithms
ƒ Refer to publications of Thiele or Schwefel et al. for more
information
Concept of Pareto points
ƒ eliminates large set of non-relevant design points
ƒ allows separating optimization and decision

Improving predictability for caches
ƒ Loop caches
ƒ Mapping code to less used part(s) of the index space
ƒ Cache locking/freezing
ƒ Changing the memory allocation for code or data
ƒ Mapping pieces of software to specific ways
Methods:
- Generating appropriate way in software
- Allocation of certain parts of the address space to a specific way
- Including way-identifiers in virtual to real-address translation
⇒ Caches behave almost like a scratch pad
Summary

ƒ Allocation strategies for SPM
• Dynamic sets of processes
• Multiprocessors
• MMUs
• Sharing between SPMs in a multi-processor
ƒ Optimizations for Caches
• Code layout transformations
• Way prediction
Hardware/Software Codesign

10. Code Optimization

doc. dr. Gregor Papa

Jožef Stefan International Postgraduate School
Task-level concurrency management

Granularity: size of tasks (e.g. in instructions)

Readable specifications and efficient implementations can possibly require different task structures.
⇒ Granularity changes

Merging of tasks

ƒ Reduced overhead of context switches
ƒ More global optimization of machine code
ƒ Reduced overhead for inter-process/task communication

Splitting of tasks

ƒ No blocking of resources while waiting for input
ƒ More flexibility for scheduling, possibly improved result

Merging and splitting of tasks

The most appropriate task graph granularity depends upon the context ⇒ merging and splitting may be required.
Merging and splitting of tasks should be done automatically, depending upon the context.
Automated rewriting of the task system: Example
[Figure: example task system]

Attributes of a system that needs rewriting
Tasks blocking after they have already started running

Work by Cortadella et al.
1. Transform each of the tasks into a Petri net,
2. Generate one global Petri net from the nets of the tasks,
3. Partition global net into "sequences of transition",
4. Generate one task from each such sequence.

Mature, commercial approach not yet available

Result, as published by Cortadella
ƒ Reads only at the beginning
ƒ Initialization task
[Figure: generated task code with branch conditions annotated "never true" and "always true"]
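Optimized version of Tin
[Code listing garbled in extraction: the optimized task Tin() reads one sample (READ), accumulates it in sum, tests a condition on i, computes d, writes it (WRITE) and resets sum and i; the branch conditions are annotated "never true" and "always true"]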
Task-level concurrency management

ƒ The dynamic behavior of applications is getting more attention.
ƒ Energy consumption reduction is the main target.
ƒ Some classes of applications (e.g. video processing) have a considerable variation in processing power requirements depending on input data.
ƒ Static design-time methods are becoming insufficient.
ƒ Runtime-only methods are not feasible for embedded systems.

⇒ How about mixed approaches?
Example of a mixed TCM
[Figure: schedules of Task1, Task2 and Task3 against a deadline over time t]
ƒ Static (compile-time) methods can ensure WCET feasible schedules, but waste energy in the average case …
ƒ … or they can define a probability for violating the deadline.
ƒ Mixed methods use compile-time analysis to define a set of possible execution parameters for each task.
ƒ Runtime scheduler selects the most energy saving, deadline preserving combination.
o ers o
Pros:
ƒ ower cost
ƒ aster
ƒ ower power consumption
ƒ Sufficient S R, if properly scaled
ƒ Suitable for portable applications
Cons:
ƒ ecreased dynamic range
ƒ inite word-length effect, unless properly scaled
verflow and excessive uantization noise
ƒ Extra programming effort

© Ki-Il Kum, et al. (Seoul ational niversity): loating-point To ixed-point C Converter


or ixed-point igital Signal Processors, 2nd S orkshop, 1
-
Fixed-Point Data Format
Floating-Point vs. Fixed-Point
ƒ exponent, mantissa
ƒ Floating-Point: automatic computation and update of each exponent at run-time
ƒ Fixed-Point: implicit exponent determined off-line

Integer vs. Fixed-Point
[Figure: word layout with sign bit S. (a) Integer. (b) Fixed-Point, with a hypothetical binary point inside the word.]

© Ki-Il Kum, et al
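Assignment and Addition/Subtraction
[Figure: operands x and y with different integer word lengths (IWL). For the assignment y = x and for result = x + y, the operands are shifted to equalize each IWL before the operation; numeric IWL values not recoverable.]

© Ki-Il Kum, et al

Multiplication
[Figure: result = x * y for operands x and y with given IWLs; the IWL of the result follows from those of the operands, and the product carries a second sign bit; numeric IWL values not recoverable.]

© Ki-Il Kum, et al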
Development Procedure
[Figure: Floating-Point C Program → Range Estimation C Program → (IWL information, manual specification) → Fixed-Point C Program]

© Ki-Il Kum, et al

Range Estimator
[Figure: Floating-Point C Program → C preprocessor → front-end → ID assignment → subroutine call insertion → SUIF-to-C converter → Range Estimation C Program → IWL information]

Range Estimation C Program:
float iir1(float x)
{
  static float s;
  float y;
  y = 0.9 * s + x;
  range(y, 0);
  s = y;
  range(s, 1);
  return y;
}

© Ki-Il Kum, et al
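Operations in fixed point program
[Figure: fixed-point evaluation of the filter expression; the operands have different IWLs and are aligned by shifting; overflow occurs if the result exceeds the representable range]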
Floating-Point to Fixed-Point Program Converter

Fixed-Point C Program:
int iir1(int x)
{
  static int s;
  int y;
  y = sll(mulh(29491,s) + (x>>5), 1);
  s = y;
  return y;
}

mulh
ƒ to access the upper half of the multiplied result
ƒ target dependent implementation
sll
ƒ to remove 2nd sign bit
ƒ opt. overflow check

© Ki-Il Kum, et al
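
The two helpers can be sketched in portable C for 16-bit operands. This is an assumption about their behavior based on the slide's description, not the converter's actual target-dependent implementation; `q15_mul` is a hypothetical convenience wrapper.

```c
#include <assert.h>
#include <stdint.h>

/* mulh: upper 16 bits of the 32-bit product of two 16-bit operands.
 * Note: >> on a negative value is implementation-defined in C; common
 * compilers perform an arithmetic shift. */
int16_t mulh(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)(product >> 16);
}

/* sll: shift left by n bits; with n = 1 this removes the second sign bit
 * of a fractional product. No overflow check in this sketch. */
int16_t sll(int16_t v, unsigned n)
{
    assert(n < 16);
    return (int16_t)(uint16_t)((uint16_t)v << n);
}

/* Multiplication of two fractional values scaled by 2^15 (1 sign bit,
 * 15 fraction bits), built from the two helpers. */
int16_t q15_mul(int16_t a, int16_t b)
{
    return sll(mulh(a, b), 1);
}
```

With 16384 standing for 0.5, `q15_mul(16384, 16384)` gives 8192, i.e. 0.25. The constant 29491 in iir1 is 0.9 in the same scaling (0.9 · 32768 ≈ 29491).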
Performance Comparison: Machine Cycles
[Figure: cycle counts for a fourth-order IIR filter, Fixed-Point (16b) vs. Floating-Point; values not recoverable]
© Ki-Il Kum, et al

Performance Comparison: Machine Cycles
[Figure: cycle counts for ADPCM, Fixed-Point (16b), Fixed-Point (32b) and Floating-Point; values not recoverable]
© Ki-Il Kum, et al

Performance Comparison: SNR
[Figure: SNR (dB) for ADPCM, Fixed-Point (16b), Fixed-Point (32b) and Floating-Point; values not recoverable]
© Ki-Il Kum, et al
Impact of memory allocation on efficiency
Array p[j][k]
ƒ Row major order (C)
ƒ Column major order (FORTRAN)

Best performance if innermost loop corresponds to rightmost array index

Two loops, assuming row major order (C):
for (k=0; k<=m; k++)          for (j=0; j<=n; j++)
  for (j=0; j<=n; j++)          for (k=0; k<=m; k++)
    p[j][k] = ...                 p[j][k] = ...

Same behavior for homogeneous memory access, but:
for row major order, the left version has poor cache behavior and the right version has good cache behavior

⇒ memory architecture dependent optimization
⇒ Program transformation "Loop interchange"
Example: ⇒ Improved locality
…#define iter 400000
int a[20][20][20];
void computeijk() {int i,j,k;
for (i = 0; i < 20; i++) {
for (j = 0; j < 20; j++) {
for (k = 0; k < 20; k++) {
a[i][j][k] += a[i][j][k];}}}}
void computeikj() {int i,j,k;
for (i = 0; i < 20; i++) {
for (j = 0; j < 20; j++) {
for (k = 0; k < 20; k++) {
a[i][k][j] += a[i][k][j] ;}}}}…
start=time(&start);for(z=0;z<iter;z++)computeijk();
end=time(&end);
printf("ijk=%16.9f\n",1.0*difftime(end,start));
(SUIF interchanges array indexes instead of loops)

Results: strong influence of the memory architecture
Loop structure: i j k
Dramatic impact of locality
[Figure: runtime per processor (among them Sun SPARC and Intel Pentium), shown as reduction to a percentage of the original time]
Not always the same impact ..
[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12]
Loop fusion (merging), loop fission
for (j=0; j<=n; j++)            for (j=0; j<=n; j++)
  p[j] = ... ;                    {p[j] = ... ;
for (j=0; j<=n; j++)               p[j] = p[j] + ...}
  p[j] = p[j] + ...

Left (separate loops): loops small enough to allow zero overhead loops.
Right (merged loop): better locality for access to p; better chances for parallel execution.

Which of the two versions is best?
Architecture-aware compiler should select best version.
Example: simple loops

#define size 30
#define iter 40000
int a[size][size];
float b[size][size];

void ss1() {int i,j;
  for (i=0;i<size;i++){
    for (j=0;j<size;j++){
      a[i][j]+= 17;}}
  for(i=0;i<size;i++){
    for (j=0;j<size;j++){
      b[i][j]-=13;}}}

void ms1() {int i,j;
  for (i=0;i<size;i++){
    for (j=0;j<size;j++){
      a[i][j]+=17; }
    for (j=0;j<size;j++){
      b[i][j]-=13; }}}

void mm1() {int i,j;
  for(i=0;i<size;i++){
    for(j=0;j<size;j++){
      a[i][j] += 17;
      b[i][j] -= 13;}}}
Results: simple loops
[Figure: runtime of ss1, ms1 and mm1 (100% = max) on several platforms (x86 and Sparc, gcc with different optimization flags)]
Merged loops superior, except in one Sparc configuration
Loop unrolling
for (j=0; j<=n; j++)          for (j=0; j<=n; j+=2)
  p[j] = ... ;                  {p[j] = ... ; p[j+1] = ...}
factor = 2
ƒ Better locality for access to p.
ƒ Fewer branches per execution of the loop. More opportunities for optimizations.
ƒ Tradeoff between code size and improvement.
ƒ Extreme case: completely unrolled loop (no branch)
Example: matrixmult
#define s 30
#define iter 4000
int a[s][s],b[s][s],c[s][s];

void compute(){int i,j,k;
  for(i=0;i<s;i++){
    for(j=0;j<s;j++){
      for(k=0;k<s;k++){
        c[i][k]+= a[i][j]*b[j][k];
}}}}

extern void compute2()
{int i, j, k;
  for (i = 0; i < 30; i++) {
    for (j = 0; j < 30; j++) {
      for (k = 0; k <= 28; k += 2)
      {{int *suif_tmp;
        suif_tmp = &c[i][k];
        *suif_tmp= *suif_tmp+a[i][j]*b[j][k];}
       {int *suif_tmp;
        suif_tmp=&c[i][k+1];
        *suif_tmp=*suif_tmp+a[i][j]*b[j][k+1];
}}}}
  return;}
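Results
[Figure: runtime versus unrolling factor for several processors (among them Sun SPARC and Intel Pentium)]
Benefits quite small; penalties may be large
[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12]

Results: benefits for loop dependences
[Figure: runtime per processor, shown as reduction to a percentage of the original time]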
#define s 50
#define iter 150000
int a[s][s], b[s][s];
void compute() {
int i,k;
for (i = 0; i < s; i++) {
for (k = 1; k < s; k++) {
a[i][k] = b[i][k];
b[i][k] = a[i][k-1];
}}}

[Figure: runtime versus unrolling factor]
Small benefits
[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12]

Loop tiling/loop blocking: original version
for (i=1; i<=N; i++)
  for (k=1; k<=N; k++) {
    r = X[i,k]; /* to be allocated to a register */
    for (j=1; j<=N; j++)
      Z[i,j] += r * Y[k,j]
  }
Never reusing information in the cache for Y and Z if N is large or cache is small (2 N³ references for Z).
Loop tiling/loop blocking: tiled version
for (kk=1; kk<=N; kk+=B)
  for (jj=1; jj<=N; jj+=B)
    for (i=1; i<=N; i++)
      for (k=kk; k<=min(kk+B-1,N); k++) {
        r = X[i][k]; /* to be allocated to a register */
        for (j=jj; j<=min(jj+B-1,N); j++)
          Z[i][j] += r * Y[k][j]
      }
ƒ Reuse factor of B for Z, N for Y
ƒ O(N³/B) accesses to main memory
ƒ Same elements for next iteration of i
ƒ Compiler should select best option

Monica Lam: The Cache Performance and Optimization of Blocked Algorithms, ASPLOS, 1991
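
A compilable sketch of the tiled loop nest next to an untiled reference, using 0-based C arrays. The sizes N and B are hypothetical small values chosen only so that the two versions can be compared; a real tile size would be tuned to the cache.

```c
#include <assert.h>
#include <stddef.h>

#define N 6   /* hypothetical matrix size */
#define B 4   /* hypothetical tile size; N need not be a multiple of B */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Untiled reference: Z += X * Y. */
void matmul_plain(int X[N][N], int Y[N][N], int Z[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++) {
            int r = X[i][k];               /* candidate for a register */
            for (size_t j = 0; j < N; j++)
                Z[i][j] += r * Y[k][j];
        }
}

/* Tiled version: the kk/jj loops walk over B x B tiles of Y, so that a
 * tile can stay in the cache while all rows i are processed. */
void matmul_tiled(int X[N][N], int Y[N][N], int Z[N][N])
{
    assert(B > 0);
    for (size_t kk = 0; kk < N; kk += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t i = 0; i < N; i++)
                for (size_t k = kk; k < min_sz(kk + B, N); k++) {
                    int r = X[i][k];
                    for (size_t j = jj; j < min_sz(jj + B, N); j++)
                        Z[i][j] += r * Y[k][j];
                }
}
```

Both functions compute the same result; only the order of the memory accesses differs, which is what makes the transformation legal here.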
Example
In practice, results by Buchwald are disappointing.
One of the few cases where an improvement was achieved:
Source: similar to matrix mult.
[Figure: runtime versus tiling factor on SPARC and Pentium]
[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12]
Summary
ƒ Task concurrency management
Re-partitioning of computations into tasks
Dynamic exploitation of slack
ƒ Floating-point to fixed point conversion
Range estimation
Conversion
Analysis of the results
ƒ High-level loop transformations
Fusion
Unrolling
Tiling
Transformation "Loop nest splitting"

Example: Separation of margin handling
ƒ before: many if-statements for margin-checking
ƒ after: no checking, efficient, for the inner region; only few margin elements to be processed separately

Loop nest from MPEG-4 full search motion estimation
[Code listing garbled in extraction: a loop nest over z, x, y, k, l, i, j whose body evaluates if-conditions on x and y to choose between then-block-1 / else-block-1 and then-block-2 / else-block-2; after the transformation the nest is split so that the conditions are tested once outside the inner loops]
analysis of polyhedral domains, selection with genetic algorithm
[H. Falk et al., Inf 12, UniDo]
Results for loop nest splitting: Execution times
[Figure: execution times for the benchmarks Cavity, Motion Estimation and QSDPCM on several processors]
[H. Falk et al., Inf 12, UniDo]

Results for loop nest splitting: Code sizes
[Figure: code sizes for the benchmarks Cavity, Motion Estimation and QSDPCM on several processors]
[Falk]
Array folding
[Figure: initial arrays]

Array folding
[Figure: unfolded arrays]

Intra-array folding
[Figure]

Inter-array folding
[Figure]
Application
ƒ Array folding is implemented in the DTSE optimization proposed by IMEC. Array folding adds div and mod ops. Optimizations are required to remove these costly operations.
ƒ At IMEC, "ADOPT" address optimizations perform this task. For example, modulo operations are replaced by pointers (indexes) which are incremented and reset.
Results (Mcycles for cavity benchmark)
[Figure: cycle counts for several code versions on several processors]
ADOPT & DTSE required to achieve real benefit
[C. Ghez et al.: Systematic high-level Address Code Transformations for Piece-wise Linear Indexing: Illustration on a Medical Imaging Algorithm, IEEE WS on Signal Processing System: design & implementation, 2000, pp. 623-632]
Code adaptation
porting the description from ANSI-C to Handel-C
ƒ VHDL requires substantially more changes

the description of the algorithm in C code must be suitably adapted before the hardware implementation
ƒ SystemC and Handel-C contain only a subset of the statements of ordinary C
ƒ floating-point arithmetic, which hardware implementations generally do not support, has to be realized differently
  • it occupies too many of the available resources
  • it lowers the operating frequency
ƒ insertion of statements for parallel execution of parts of the code
ƒ adaptation of the sizes of all variables
Program code adaptation
replacement for floating-point arithmetic
ƒ use of fixed point
ƒ use of integer values → smaller unit of measure
fixed-point values are multiplied and represented as integer values
    si = 1625  ⇔  si = 1.625

the integer part and the fractional part are represented as the upper and the lower part of an integer variable
    signed int 8 var1, var2;
    signed int 16 si;

    si = 0x01a0  ⇔  si = 1.625
    var1 = si[15:8]  →  var1 = 0x01
    var2 = si[7:0]   →  var2 = 0xa0
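A minimal C sketch of the fixed-point idea, assuming an unsigned 16-bit value with 8 integer and 8 fractional bits (the format, the function names and the example values are assumptions chosen for illustration; the slide itself uses Handel-C and signed variables):

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8  /* assumed split: 8 integer bits, 8 fractional bits */

/* Scale a non-negative real value by 2^8 and round to the nearest integer. */
uint16_t to_q8_8(double value)
{
    assert(value >= 0.0 && value < 255.0);
    return (uint16_t)(value * (1 << FRAC_BITS) + 0.5);
}

/* Upper byte: integer part. Lower byte: fractional part in units of 1/256. */
uint8_t integer_part(uint16_t q)    { return (uint8_t)(q >> FRAC_BITS); }
uint8_t fractional_part(uint16_t q) { return (uint8_t)(q & 0xFFu); }

/* Fixed-point multiply: 32-bit intermediate, then drop 8 fractional bits. */
uint16_t q8_8_mul(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> FRAC_BITS);
}
```

With this format, 1.625 is stored as 0x01A0: the upper byte 0x01 is the integer part and the lower byte 0xA0 (160/256 = 0.625) the fractional part. Only integer operations remain after the conversion, which is what makes the representation attractive for hardware.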
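Program code adaptation
statements for parallel execution of parts of the code
ƒ the par statement instead of for or seq
  • where possible, depending on the contents of the loop
[Code example: a loop for (i = 0; i < 3; i++) with an assignment to a[i] rewritten as a replicated par (i = 0; i < 3; i++), and a seq block of three assignments involving elements of a, b and c shown next to the corresponding par block.]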
Program code adaptation
adaptation of the sizes of all variables
ƒ all sizes must be defined in advance
  • they should be minimized to reduce resource usage
ƒ signed / unsigned must be specified in advance
ƒ when computing with variables of different sizes
  • use of the concatenation operator: the missing bits are added to the smaller variable
  • use of the lower bits of the larger variable
[signed | unsigned] int n   →   n-bit
[Code example: Handel-C declarations of unsigned int variables of two different widths, one assignment to var3 that combines var1 with a widened var2, and one assignment that combines the lower bits of var1 with var2.]
Hardware/Software Codesign

11. Compilation

doc. dr. Gregor Papa

Jožef Stefan International Postgraduate School
Compilers for embedded systems
Why are compilers an issue?
ƒ Many reports about low efficiency of standard compilers
- Special features of embedded processors have to be exploited.
- High levels of optimization more important than compilation
speed.
- Compilers can help to reduce the energy consumption.
- Compilers could help to meet real-time constraints.
ƒ Less legacy problems than for PCs.
- There is a large variety of instruction sets.
- Design space exploration for optimized processors makes
sense

Key problems for future memory systems
ƒ (Average) speed
ƒ Energy/power
ƒ Predictability
[Figure: energy and access times of memories]

Optimization for low energy the same as optimization for high performance?
No!
• High performance if available memory bandwidth fully used; low energy consumption if memories are in stand-by mode
• Reduced energy if more values are kept in registers
[Code figure: two ARM assembly versions (sequences of LDR, ADD, MOV, CMP and BLT) of a C loop over int a[1000] that accumulates array elements into b through a pointer c; the second version replaces loads by moves between registers.]
Compiler optimizations for improving energy efficiency
ƒ Energy-aware scheduling
ƒ Energy-aware instruction selection
ƒ Operator strength reduction: e.g. replace * by + and <<
ƒ Minimize the bitwidth of loads and stores
ƒ Standard compiler optimizations with energy as a cost function
  E.g.: Register pipelining:
      for i := 0 to 10 do
        C := 2 * a[i] + a[i-1];
  becomes
      R2 := a[0];
      for i := 1 to 10 do
        begin
          R1 := a[i];
          C := 2 * R1 + R2;
          R2 := R1;
        end;
ƒ Exploitation of the memory hierarchy
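Register pipelining can be sketched in C as below. The function names and the accumulation into `total` are assumptions made so that the two variants can be compared; the slide's loop simply assigns to C in every iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: two array reads per iteration (a[i] and a[i-1]). */
long weighted_sum_naive(const int *a, size_t n)
{
    assert(n >= 1);
    long total = 0;
    for (size_t i = 1; i < n; i++)
        total += 2L * a[i] + a[i - 1];
    return total;
}

/* Register pipelining: the previous element is carried in a local
 * variable, so each iteration reads the array only once. */
long weighted_sum_pipelined(const int *a, size_t n)
{
    assert(n >= 1);
    long total = 0;
    int prev = a[0];
    for (size_t i = 1; i < n; i++) {
        int cur = a[i];
        total += 2L * cur + prev;
        prev = cur;          /* shift the "pipeline" by one element */
    }
    return total;
}
```

A compiler that keeps `prev` and `cur` in registers halves the number of memory reads in the loop, which is where the energy saving is expected to come from.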
Using scratch pad memories (SPM)
Hierarchy: processor, scratch pad memory (no tag memory), main memory; the SPM occupies part of the address space
Example: ARM7TDMI cores, well-known for low power consumption
Very limited support in ARMcc-based tool flows
1. Use pragma in C-source to allocate to specific section:
   For example:
   #pragma arm section rwdata = "foo", rodata = "bar"
   int x2 = 5; // in foo (data part of region)
   int const z2[3] = {1,2,3}; // in bar
2. Input scatter loading file to linker for allocating section to specific address range

http://www.arm.com/documentation/Software_Development_Tools/index.html
Global optimization model (U. Dortmund)
Which memory object (array, loop, etc.) is to be stored in SPM?
Non-overlaying (“static”) allocation:
  Gain $g_k$ and size $s_k$ for each segment $k$. Maximise gain $G = \sum_k g_k$, respecting the size of the SPM: $\mathit{SSP} \ge \sum_k s_k$.
  Solution: knapsack algorithm.
Overlaying (“dynamic”) allocation:
  Moving objects back and forth
[Figure: example program (for i ..., for j ..., while ..., repeat, call ..., Array ..., Int ...) whose objects are distributed between main memory and a scratch pad memory of capacity SSP next to the processor.]
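The static allocation problem is a 0/1 knapsack problem. The following C sketch solves it by dynamic programming over the capacity; the function name, the array-based interface and the bound `MAX_CAPACITY` are assumptions for illustration, not the tool flow used in the cited work.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CAPACITY 1024  /* assumed upper bound on the SPM size in this sketch */

/* 0/1 knapsack: maximum total gain of a subset of segments whose sizes
 * sum to at most `capacity`. size[k] and gain[k] describe segment k. */
long best_gain(const size_t *size, const long *gain, size_t n, size_t capacity)
{
    assert(capacity <= MAX_CAPACITY);
    long best[MAX_CAPACITY + 1] = {0};   /* best[c]: max gain within size c */
    for (size_t k = 0; k < n; k++) {
        assert(size[k] > 0);
        /* walk capacities downwards so each segment is chosen at most once */
        for (size_t c = capacity; c >= size[k]; c--) {
            long candidate = best[c - size[k]] + gain[k];
            if (candidate > best[c])
                best[c] = candidate;
        }
    }
    return best[capacity];
}
```

For three segments of sizes 3, 4 and 2 with gains 30, 50 and 15, a scratch pad of size 5 is best filled with the second segment alone, while size 6 admits the second and third segments together.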
IP representation: migrating functions and variables
Symbols:
S(var_k)   size of variable k
n(var_k)   number of accesses to variable k
e(var_k)   energy saved per variable access, if var_k is migrated
E(var_k)   energy saved if variable var_k is migrated (= e(var_k) · n(var_k))
x(var_k)   decision variable, = 1 if variable k is migrated to SPM, = 0 otherwise
K          set of variables

Similar for functions (I)

Integer programming formulation:
Maximize $\sum_{k \in K} x(\mathit{var}_k)\,E(\mathit{var}_k) + \sum_{i \in I} x(F_i)\,E(F_i)$
subject to the constraint
$\sum_{k \in K} S(\mathit{var}_k)\,x(\mathit{var}_k) + \sum_{i \in I} S(F_i)\,x(F_i) \le \mathit{SSP}$
Reduction in energy and average run-time
[Figure: cycles and energy for the Multi_sort benchmark (mix of sort algorithms).]
Feasible with standard compiler & postpass optimization
Measured processor / external memory energy + CACTI values for SPM (combined model)
Numbers will change with technology, algorithms remain unchanged.
Allocation of basic blocks
Fine-grained granularity smoothens dependency on the size of the scratch pad.
Requires additional jump instructions to return to main memory.
[Figure: basic blocks placed in main memory and in the scratch pad, connected by jump instructions. Statically 2 jumps, but only one is taken. For consecutive basic blocks.]
Allocation of basic blocks, sets of adjacent basic blocks and the stack
Requires generation of additional jumps (special compiler)
[Figure: plot of cycles and energy.]
Savings for memory system energy alone
[Figure]
Combined model for memories
Timing predictability

aiT:
ƒ WCET analysis tool
ƒ support for scratchpad memories by specifying different memory access times
ƒ also features experimental cache analysis for ARM7
Architectures considered
ARM7TDMI with 3 different memory architectures:
1. Main memory
   LDR-cycles: (CPU,IF,DF) = (3,2,2)
   STR-cycles: (2,2,2)
   * = (1,2,0)
2. Main memory + unified cache
   LDR-cycles: (CPU,IF,DF) = (3,12,6)
   STR-cycles: (2,12,3)
   * = (1,12,0)
3. Main memory + scratch pad
   LDR-cycles: (CPU,IF,DF) = (3,0,2)
   STR-cycles: (2,0,0)
   * = (1,0,0)
Results
[Figures: results using scratchpad and using unified cache]

References:
• Wehmeyer, Marwedel: Influence of Onchip Scratchpad Memories on WCET: 4th Intl Workshop on worst-case execution time (WCET) analysis, Catania, Sicily, Italy, June 29, 2004
• Second paper on SP/Cache and WCET at DATE, March 2005
Multiple scratch pads
[Figure]

Optimization for multiple scratch pads
Minimize $C = \sum_j e_j \cdot \sum_i x_{j,i} \cdot n_i$
with $e_j$: energy per access to memory $j$,
$x_{j,i} = 1$ if object $i$ is mapped to memory $j$, $= 0$ otherwise,
and $n_i$: number of accesses to memory object $i$,
subject to the constraints:
$\forall j: \sum_i x_{j,i} \cdot S_i \le \mathit{SSP}_j$
$\forall i: \sum_j x_{j,i} = 1$
with $S_i$: size of memory object $i$,
$\mathit{SSP}_j$: size of memory $j$.
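For very small instances, the model can be checked by exhaustive search instead of an integer programming solver. The following C sketch enumerates all mappings of objects to memories; the function name, the bounds and the convention of modelling main memory as one more memory with a large capacity are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJECTS  16  /* assumed bounds that keep the search small */
#define MAX_MEMORIES 8

/* Try every mapping of n objects to m memories and return the minimum
 * total access energy, or -1 if no mapping respects the capacities.
 * e[j]: energy per access to memory j, cap[j]: size of memory j,
 * size[i], acc[i]: size and number of accesses of object i. */
long min_energy(const long *e, const long *cap, size_t m,
                const long *size, const long *acc, size_t n)
{
    assert(n <= MAX_OBJECTS && m >= 1 && m <= MAX_MEMORIES);
    size_t choice[MAX_OBJECTS] = {0};   /* choice[i]: memory of object i */
    long best = -1;
    for (;;) {
        long used[MAX_MEMORIES] = {0};
        long cost = 0;
        int feasible = 1;
        for (size_t i = 0; i < n; i++) {
            size_t j = choice[i];       /* each object is in exactly one memory */
            used[j] += size[i];
            cost += e[j] * acc[i];
            if (used[j] > cap[j])       /* capacity constraint of memory j */
                feasible = 0;
        }
        if (feasible && (best < 0 || cost < best))
            best = cost;
        /* advance to the next mapping like an odometer in base m */
        size_t i = 0;
        while (i < n && ++choice[i] == m)
            choice[i++] = 0;
        if (i == n)
            break;
    }
    return best;
}
```

The search visits m^n mappings, so it is only useful for validating a formulation on toy inputs; realistic instances need an ILP solver or a heuristic.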
Considered partitions
[Figure]

Results for parts of GSM coder/decoder
[Figure: working set]
A key advantage of partitioned scratchpads for multiple applications is their ability to adapt to the size of the current working set.
Dynamic replacement within scratch pad
[Figure: CPU, SPM and memory]
ƒ Effectively results in a kind of compiler-controlled segmentation/paging for SPM
ƒ Address assignment within SPM required (paging or segmentation-like)

Reference: Verma, Marwedel: Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS 2004
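Dynamic replacement of data within scratch pad: based on liveness analysis
[Figure: control flow graph with program points P and memory objects MO = {A, T1, T2, T3, T4}; SP size = |A| = |T1| = ... = |T4|.]
Solution: A → SP & T3 → SP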
Summary
High-level transformations
ƒ Loop nest splitting
ƒ Array folding
Impact of memory architecture on execution times & energy.
The SPM provides
ƒ Runtime efficiency
ƒ Energy efficiency
ƒ Timing predictability
Achieved savings are sometimes dramatic, for example:
ƒ savings of … of the memory system energy
Hardware/Software Codesign

12. Performance Estimation

doc. dr. Gregor Papa

Jožef Stefan International Postgraduate School
System Design
[Figure: design flow from Specification via System Synthesis (with Estimation) to SW-Compilation, Instruction Set and HW-Synthesis, using Intellectual Prop. Code and Intellectual Prop. Block, and ending in Machine Code and Net lists.]
Motivation
The values of the objective functions that should guide the design space exploration are obtained through estimation.

Design space exploration intends to change
ƒ mapping (binding and resource sharing)
ƒ architecture (hardware platform)
ƒ application (choice between different algorithms and/or partitioning into concurrent components)
[Figure: application and architecture feed the mapping, which feeds the estimation.]
Outline

Overview

Performance Metrics

Subsystems

Abstraction Levels

Performance Estimation Methods
Performance Estimation – Global Picture
[Figure: the dimensions of performance estimation]
ƒ Performance estimation method: analytic (e.g. x(y) = x0 * exp(-k0*y) with x0 = 105, k0 = 1.2593), simulation, statistic
ƒ Metric: time, power, area, cost, other (quality, SNR, …)
ƒ Abstraction level: low-level (e.g. RTL, ISA), intermediary level (e.g. TLM, OS), high-level (e.g. functional, HLL)
ƒ Subsystem to analyze: HW subsystem, CPU subsystem, SW subsystem, interconnect subsystem, MPSoC
Note:
RTL – Register Transfer Level
ISA – Instruction Set Architecture
TLM – Transaction-Level Model
OS – Operating System
HLL – High-Level Language
Position in the System Design Flow
High-level estimation
ƒ Advantages: short simulation time, no details of implementation necessary
ƒ Drawbacks: limited accuracy, e.g. no information about timing
Low-level estimation
ƒ Advantages: higher accuracy, closer to the implementation
ƒ Drawbacks: long simulation time, many implementation details need to be known
[Figure: design flow from the functional specification via mapping and partitioning to parallel module specifications, and via compilation and refinement to the implementation.]
Use of the Estimation

Prerequisite for
ƒ part of the feedback cycle (see global flow)
ƒ functional and non-functional validation (e.g. power, energy, timing, memory consumption)
ƒ show equivalence of specification and implementation
ƒ functional and non-functional aspects
Outline

Overview

Performance Metrics

Subsystems

Abstraction Levels

Performance Estimation Methods
Performance Metrics
Performance metric = function defined on relevant non-functional properties of a system which indicates a quantitative performance of the system.

Time [second]
ƒ for example end-to-end delay, throughput, latency

Power, Energy, Temperature [mW, mJ, °C]
ƒ for example power consumed by the network, energy to execute a task, maximal temperature

Area [mm²]
ƒ for example area of an integrated circuit

Cost [$]
ƒ for example cost of parts, labor, development cost

Other metrics:
ƒ SNR (signal to noise ratio), quality of the video image/sound, size of the hardware platform

Usually, performance metrics are conflicting.
Examples of Performance Trade-offs
Mapping Domain
ƒ change the mapping of the application to the architecture
  → see example 1

Architecture Domain
ƒ change the hardware platform
  → see example 2

Application Domain
ƒ change the application implementation (e.g. degree of parallelization, partitioning into concurrent processes, use of different algorithms with a similar functional behavior)
Ex. 1: Trade-offs in the Mapping Domain

MPEG Mapping Optimization

2D mapping optimization space
ƒ Obj. 1: Worst load of computation node
ƒ Obj. 2: Worst load of communication node
[Figure: mapping alternatives plotted over Obj. 1 and Obj. 2 (worst bus load).]
Ex. 2: Trade-offs in the Hardware Platform
[Figure: hardware platforms arranged between timing performance / energy efficiency and flexibility]
ƒ General-purpose processors
ƒ Application-specific instruction set processors (ASIPs)
  • Microcontroller
  • DSPs (digital signal processors)
ƒ Programmable hardware
  • FPGA (field-programmable gate arrays)
ƒ Application-specific integrated circuits (ASICs)

Outline

Overview

Performance Metrics

Subsystems

Abstraction Levels

Performance Estimation Methods


System Composition
[Figure: an architecture assembled from communication templates, computation templates, and scheduling and arbitration templates (labels include proportional share, fixed priority, dynamic and static).]
Why is Estimation Difficult?
Computation and Communication
ƒ (Non-deterministic) computations in processing nodes
ƒ (Non-deterministic) communication delays
ƒ Complex resource interaction via scheduling and arbitration policies

Cyclic timing dependencies
ƒ Internal data streams interact on computing and communication resources
ƒ Interaction determines stream characteristics

Uncertain environment
ƒ Different load scenarios
ƒ Unknown (worst case) inputs
Illustration of Evaluation Difficulties
[Figure: an input stream of events (a b a c c b) enters a buffer and is processed by tasks on a processor.]
ƒ Task communication
ƒ Task scheduling
ƒ Complex input: timing (jitter, bursts), different event types
ƒ Variable resource availability
ƒ Variable execution demand: input (different event types), internal state (program, cache)
Requirements for Performance Estimation

Estimation should be composable in terms of
ƒ subsystems and their interactions, i.e. HW, SW, interconnect
ƒ computation, communication, and scheduling/arbitration

Estimation should cover different metrics, for example power, energy, delay, memory, throughput

Estimation method should represent a reasonable trade-off between (a) estimation effort in terms of computation/simulation time and set-up time and (b) accuracy
Outline

Overview

Performance Metrics

Subsystems

Abstraction Levels

Performance Estimation Methods


Brief History in Abstraction
[Figure: successive abstraction steps and the corresponding technology]
ƒ Transistor model (t = RC): transistors, layouts; simulator
ƒ Gate-level model (delays in ns): gates, schematics; simulator
ƒ Register-transfer level model (data[], critical path latency): signals, RTL, HW systems; simulator
ƒ Clusters of SW tasks and HW connected through SW and HW adaptation layers, OS/drivers, communication interfaces and an on-chip communication network: transactions, tokens, SW tasks, communication backbones; SystemC simulator, HW/SW codesign and cosimulation tools, formal methods
Outline

Overview

Performance Metrics

Subsystems

Abstraction Levels

Performance Estimation Methods


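System-Level Performance Estimation Methods
[Figure: the values of a metric (e.g. delay) obtained by each method, relative to the best case and the worst case]
ƒ Real system
ƒ Measurement
ƒ Simulation
ƒ Probabilistic estimation
ƒ Worst-case (formal) analysis
[Annotations in the figure: → presented in Lecture 6 (next lecture); → presented later]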
Overview
System: how to evaluate?
ƒ Measurements: use an existing instance of the system to perform performance measurements.
ƒ Formal analysis: develop a mathematical abstraction of the system and derive formulas which describe the system performance.
ƒ Simulation: develop a program which implements a model of the system. Perform experiments by running the program.
ƒ Statistics: develop a statistical abstraction of the system and derive statistic performance via analysis or simulation.
Performance Estimation Methods
[Figure: an estimation tool (method) takes a system model, together with a model of the application, a model of the environment and a model of the architecture, and produces estimation results. Sources of information shown: designer's experience, component simulation, input data traces, data sheets, spec. of inputs, platform benchmarks.]
Analytic Models
Static analytic (symbolic) models
ƒ Describe computing, communication, and memory resources by algebraic equations, e.g.
  $\mathit{delay} = \left\lceil \frac{\#\mathit{words}}{\mathit{burst\_size}} \right\rceil \cdot \mathit{comm\_time}$
ƒ Describe system properties by parameters, e.g. data rate
ƒ Combine relations

Fast and simple estimation

Generally inaccurate modeling, e.g. resource sharing not modeled
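The delay equation of the static analytic model translates directly into integer arithmetic. A minimal C sketch, assuming that `comm_time` is the time of one burst in an arbitrary time unit (the function name and the unit are assumptions):

```c
#include <assert.h>

/* delay = ceil(words / burst_size) * comm_time, in the unit of comm_time */
unsigned long transfer_delay(unsigned long words, unsigned long burst_size,
                             unsigned long comm_time)
{
    assert(burst_size > 0);
    /* integer ceiling of words / burst_size */
    unsigned long bursts = (words + burst_size - 1) / burst_size;
    return bursts * comm_time;
}
```

For example, 10 words with a burst size of 4 need 3 bursts, so with 5 time units per burst the estimated delay is 15. The model ignores contention, which is exactly the inaccuracy noted above: resource sharing is not modeled.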
Dynamic Analytic Models
Combination between
ƒ Static models, possibly extended by non-determinism in run-time and event processing
ƒ Dynamic models for describing e.g. resource sharing mechanisms (scheduling and arbitration)

Existing approaches
ƒ Classical real-time scheduling theory
ƒ Stochastic queuing theory (statistical bounds)
ƒ Non-deterministic queuing theory (worst-case/best-case behavior)
Example - Queuing Systems
ƒ Clients request some service from a server over a network.
[Figure: performance of the server; performance of the network]
Stochastic Models - Queuing Systems
A queuing system is described by
ƒ Arrival rate
ƒ Service mechanism
ƒ Queuing discipline

Performance measures
ƒ average delay in queue
  • Customer point of view
ƒ time-average number of customers in queue
  • System point of view
ƒ proportion of time server is busy

The classical M/M/1 queuing system:
(M = Markovian (exp.) distribution)
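For the M/M/1 system the three performance measures have closed forms: with arrival rate λ and service rate μ (λ < μ), the utilization is ρ = λ/μ, the mean number of customers waiting is ρ²/(1 − ρ), and the mean delay in the queue is ρ/(μ − λ). A small C sketch of these textbook formulas (the struct and function names are assumptions):

```c
#include <assert.h>

typedef struct {
    double utilization;       /* rho = lambda / mu: proportion of time server is busy */
    double mean_queue_len;    /* Lq = rho^2 / (1 - rho): time-average number in queue */
    double mean_queue_delay;  /* Wq = rho / (mu - lambda): average delay in queue */
} mm1_metrics;

mm1_metrics mm1_analyze(double lambda, double mu)
{
    assert(lambda >= 0.0 && mu > lambda);  /* stable only if lambda < mu */
    mm1_metrics m;
    m.utilization = lambda / mu;
    m.mean_queue_len = m.utilization * m.utilization / (1.0 - m.utilization);
    m.mean_queue_delay = m.utilization / (mu - lambda);
    return m;
}
```

With one arrival and two services per time unit, the server is busy half of the time, and both the mean queue length and the mean waiting time are 0.5, consistent with Little's law (Lq = λ · Wq).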
Nondeterministic Models - Queuing Systems
A queuing system is described by
ƒ Arrival function (bounds on arrival times)
ƒ Service functions (bounds on server behavior)
ƒ Resource interaction

Performance measures
ƒ worst-case delay in queue
ƒ worst-case number of customers in queue
ƒ worst-case and best-case end-to-end delay in the system
Simulation
Consider the underlying hardware platform and the mapping of the application onto that architecture
Combine functional simulation and performance data
Evaluate average-case behavior for one simulation scenario

Complex set-up and extensive runtimes ...
... but accurate results and good debugging possibilities

[Figure: input trace → model (application, hardware platform, mapping) → output trace]
Example: Trace-Based Simulation
Abstract simulation at system-level without timing
ƒ Faster than simulation, but still based on a single input trace
Abstraction
ƒ Application - represented by abstract execution traces → graph of events: read, write, and execute
ƒ Architecture - represented by “virtual machines” and “virtual channels” including non-functional properties (timing, power, energy)
Steps
ƒ Execution trace determined by functional application simulation
ƒ Extension of the event graph by non-functional properties
ƒ Simulation of the extended model
[Figure: application functional model → complete trace → abstract event graph; together with the architecture description → trace simulation → estimation results]
e.g. [Lahiri et al.], [Pimentel et al.]
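The three steps can be illustrated with a deliberately simplified C sketch. It assumes a linear trace instead of a general event graph and a hypothetical architecture model that assigns a fixed cost per unit to each event kind; all type and function names are made up for this example and do not come from the cited tools.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EV_READ, EV_WRITE, EV_EXECUTE } event_kind;

/* One event of the abstract execution trace. */
typedef struct {
    event_kind kind;
    unsigned long amount;   /* words for read/write, operations for execute */
} trace_event;

/* Hypothetical architecture model: non-functional cost per unit. */
typedef struct {
    unsigned long read_cost;
    unsigned long write_cost;
    unsigned long exec_cost;
} arch_model;

/* Replay the trace on the architecture model and return the estimated cycles. */
unsigned long replay_trace(const trace_event *trace, size_t n,
                           const arch_model *arch)
{
    assert(arch != NULL);
    unsigned long cycles = 0;
    for (size_t i = 0; i < n; i++) {
        switch (trace[i].kind) {
        case EV_READ:    cycles += trace[i].amount * arch->read_cost;  break;
        case EV_WRITE:   cycles += trace[i].amount * arch->write_cost; break;
        case EV_EXECUTE: cycles += trace[i].amount * arch->exec_cost;  break;
        }
    }
    return cycles;
}
```

Because the trace is recorded once by the functional simulation, the same trace can be replayed on several candidate architecture models, which is what makes the approach faster than repeating a full simulation for every design point.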
