Model-Driven Design of Embedded Multimedia Applications on SoCs
Adolf Abdallah, Abdoulaye Gamatié and Jean-Luc Dekeyser
LIFL-USTL/CNRS, INRIA Lille Nord Europe,
{Adolf.Abdallah, Abdoulaye.Gamatie, Jean-Luc.Dekeyser}@lifl.fr
Abstract
This paper addresses the design issue of System-on-Chip by elevating the design abstraction levels, through a model-driven approach. It considers the standard Marte profile, which is dedicated to the Modeling and Analysis of Real-Time Embedded systems. Information is extracted from user-defined models and serves for the analysis of these models. The adopted analysis technique relies on the synchronous reactive approach, which strongly favors formal validation.
1 Introduction
The degree of complexity of Systems-on-Chip (SoCs) is increasing at a tremendous pace. One of the main reasons is the exponential growth of current technological advances, driven by the evolution of targeted applications and the corresponding architectures. As a result, the design of such systems is becoming more challenging than ever.
To deal with complex systems, Model-Driven Engineering (MDE) [4] is an adequate solution. It enables one to abstract the complexity of systems via high-level specifications, and facilitates design automation via systematic model transformations up to code generation. The models manipulated at different abstraction levels must conform to certain grammars, referred to as metamodels. From one abstraction level to another, the models are often refined by enriching them with more precise information that allows the production of executable models.
Another important aspect is the execution performance of the resulting systems. So, beyond the correctness of the systems’ functional properties, non-functional aspects, e.g., execution time, must also be addressed.
In this paper, we mainly concentrate on the design of SoCs by taking into account the above aspects.
2 System modeling using Marte
We consider a simple multimedia processing module that consists of reducing a flow of images of size 1920×1080 to images of size 640×540. Such an image size reduction is known as image downscaling in the multimedia processing domain.
Functional part. The downscaling algorithm is described as a compound component in which the HorizontalFilter and VerticalFilter are distinguished, as well as the pixel Reordering component. This is illustrated by Figure 1. In addition, two other components are considered: ColorInformation extracts the color space information of the input images whereas the PixelConstructor component reassembles images from their color space information. All these components are connected according to their functional interaction. At the bottom of Figure 1, the FrameGenerator component creates high definition images that are treated by DownscalingProcess. The resulting images are low definition images read by the FrameConsumer component. All the processed data are allocated to the memory component SRAM. In order to express data parallelism, we use the repetitive structure modeling (RSM) package of Marte. The concepts of this package are based on the repetitive model of computation [1]. In the downscaling example, the VerticalFilter is repeated 16 times.
Hardware architecture part. The architecture model is based on shared memory and defines four processors. We also define SRAM and ROM memories to write, store and retrieve data and program instructions. A sensor generating data and an actuator consuming the processed data are also defined. All communications between these physical resources pass through a Bus component via Master and Slave ports.
Figure 1. Specification of the system’s functionality.
Association and platform deployment. The association model defines the mapping between the functionality of the system and the physical resources. The
HorizontalFilter component from the functional description model is associated with a processor Proc1
via the allocate connector of Marte. The ports
FrameIn and FrameOut of the DownscalingProcess
component are also associated with the DataMemory.
In order to refine the models towards a specific technology, we have to instantiate the abstract descriptions
with concrete information. In [3], this phase is referred
to as platform “deployment”. During this phase, each
resource is assigned to a particular hardware implementation. Typically, the type and characteristics of
processors are defined. Figure 2 shows how, in our case, the hardware processor Proc1 is deployed on a virtualIP called VProc1, which contains two hardware implementations at the SystemC transaction-level modeling (TLM) level, with different frequency values. The component VProc1-TLM1 (resp. VProc1-TLM2), stereotyped hardwareIP and hwProcessor, is assigned a frequency of 15 Hz (resp. 30 Hz).
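The deployment step can be pictured as a small data structure. The following is only a sketch under stated assumptions: the class names `HardwareIP` and `VirtualIP`, the `select` method, and the helper `duration_s` are hypothetical and are not part of Marte or Gaspard. The helper applies the relation d = InstCycle/f that the paper gives in its description of the temporal behavior, with the 16 instruction cycles reported there for Proc1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class HardwareIP:
    """One concrete implementation of a resource (hypothetical record type)."""
    name: str
    frequency_hz: float


@dataclass
class VirtualIP:
    """Groups the candidate implementations of one abstract resource."""
    name: str
    implementations: List[HardwareIP] = field(default_factory=list)

    def select(self, impl_name: str) -> HardwareIP:
        """Return the implementation chosen by the designer."""
        for impl in self.implementations:
            if impl.name == impl_name:
                return impl
        raise KeyError(f"{impl_name} is not an implementation of {self.name}")


def duration_s(inst_cycles: int, frequency_hz: float) -> float:
    """Duration d = InstCycle / f, in seconds."""
    return inst_cycles / frequency_hz


# VProc1 with its two TLM implementations, as described for Figure 2.
vproc1 = VirtualIP("VProc1", [HardwareIP("VProc1-TLM1", 15.0),
                              HardwareIP("VProc1-TLM2", 30.0)])
chosen = vproc1.select("VProc1-TLM1")
print(chosen.name, duration_s(16, chosen.frequency_hz))
```

Keeping both candidates inside the virtual IP makes the alternative explicit: selecting VProc1-TLM2 instead would halve the duration of the same 16 instruction cycles.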
Figure 2. Hardware deployment of Proc1.
3 Analysis of temporal properties
We analyze the temporal behavior of the previously modeled system in order to extract information for design assessment. Our analysis relies on [2], which defines the N-synchronous Kahn networks. In this approach, periodic clocks are defined and associated with system components, and the analysis of clock synchronizability makes it possible to verify properties regarding, e.g., QoS.
3.1 Binary clocks
The specification of the functional behavior corresponds to a graph C representing a set of components describing the system functional behavior. Clk is the set of logical clocks clki expressing the activations of components. At each instant of a given clock clk1, the associated component C1 is activated and the encapsulated component is executed by the corresponding processor. Given a component C specified in the functional description, clk is the associated binary clock of C and is deduced from the information given in the model defined in Section 2. In order to synthesize clk, the model should contain the number of instruction cycles that the processor should execute and the frequency of the processor.
Logical binary clocks consist of a sequence of binary values, associated with components describing the system functionality, where 0 and 1 respectively denote the idle and active states. Regarding processors, the value 1 in a clock clk means that a processor is active; the value 0 means that a processor is either active and executing 0-cycle or idle.
From now on, we assume the existence of an ideal clock, which is fast enough to capture any binary clock in our system. Given a binary clock clk, the function length(clk) returns the number k ∈ N∗ of binary values in clk.
Definition 1 (Activation rank) Let C ∈ C be a component such that clk ∈ Clk(C) contains at least one activation. Then, ∀n ∈ N∗, the nth activation of C w.r.t. clk is given by pos(n, clk) = q, where q ∈ N∗ is at most equal to length(clk).
3.2 Description of the temporal behavior
In the downscaler, we analyze the system based on four processors, where each one executes instruction cycles based on the functional description. As we have seen in Figure 2, Proc1 is deployed at the TLM level with two implementation choices. These implementations are specified as hardwareIP in Marte and are encapsulated in a virtualIP. The first hardwareIP has a frequency of 15 Hz whereas the second one has a frequency of 30 Hz. We have chosen 15 Hz (expressed in Figure 2 by the implements arrow going from VProc1-TLM1 towards Proc1). Proc1, to which HorizontalFilter is allocated in the functional description, has 16 instruction cycles to execute since the component HorizontalFilter is repeated 4×4 times.
If a given processor has a frequency f and is executing InstCycle instruction cycles, then the duration d, expressed in seconds, that the processor takes to complete the computation is given by d = InstCycle/f. Below are the values extracted from the specification of the downscaler:
Proc1: F1 = 15 Hz, InstCycle1 = 16 ⇒ d1 ≃ 1.06 s
Proc2: F2 = 45 Hz, InstCycle2 = 30 ⇒ d2 ≃ 0.66 s
Proc3: F3 = 45 Hz, InstCycle3 = 30 ⇒ d3 ≃ 0.66 s
Proc4: F4 = 40 Hz, InstCycle4 = 60 ⇒ d4 = 1.5 s
Given the number of occurrences of 1 at each position of the defined clock, we are able to deduce the delay between two clocks at each position.
Definition 3 (Delay) Let clk1, clk2 be two binary clocks such that length(clk1) = length(clk2). Then, ∀i ∈ [1, length(clk1)]:
Diff(clk1, clk2, i) = |Quant(clk1, i) − Quant(clk2, i)|.
We can determine the minimal buffer size that enables such communication.
Definition 4 (Minimal buffer length) Let clk1 and clk2 be two binary clocks such that length(clk1) = length(clk2). Then:
MinSize(clk1, clk2) = Max { Diff(clk1, clk2, i) | i ∈ [1, length(clk1)] }.
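The Quant, Diff and MinSize functions can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: clocks are represented as lists of 0/1 values, Quant is computed iteratively as the number of 1s among the first i positions (the value the recursive definition yields), and the two sample clocks `clk_a` and `clk_b` are hypothetical inputs chosen for the example.

```python
from typing import Sequence


def quant(clk: Sequence[int], i: int) -> int:
    """Quant(clk, i): number of 1s among the first i positions (1-indexed)."""
    if not 1 <= i <= len(clk):
        raise ValueError("i must lie in [1, length(clk)]")
    return sum(clk[:i])


def diff(clk1: Sequence[int], clk2: Sequence[int], i: int) -> int:
    """Diff(clk1, clk2, i) = |Quant(clk1, i) - Quant(clk2, i)|."""
    if len(clk1) != len(clk2):
        raise ValueError("both clocks must have the same length")
    return abs(quant(clk1, i) - quant(clk2, i))


def min_size(clk1: Sequence[int], clk2: Sequence[int]) -> int:
    """MinSize(clk1, clk2): largest Diff over all positions."""
    return max(diff(clk1, clk2, i) for i in range(1, len(clk1) + 1))


# Hypothetical sample clocks of length 8.
clk_a = [0, 1, 0, 1, 0, 1, 0, 1]  # active at instants 2, 4, 6, 8
clk_b = [0, 0, 0, 0, 1, 0, 1, 0]  # active at instants 5, 7

print([diff(clk_a, clk_b, i) for i in (2, 4, 6)])  # [1, 2, 2]
print(min_size(clk_a, clk_b))                      # 2
```

Since Diff takes an absolute value, it is symmetric in its two clocks: the buffer size does not depend on which clock is passed first, only on the largest gap between the two activation counts.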
Distribution function. The scheduling of the repeated components allocated to processors is highly related to the type of instruction being processed. Therefore, we define affine and linear functions to periodically distribute the activation instants of components.
Figure 3 shows the distribution of the activation instants of the binary clocks corresponding to the four processors, with respect to an ideal clock.
IdealClk: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
clk1:     0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0
clk2:     0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
clk3:     0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
clk4:     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
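The affine distribution of activation instants can be illustrated with a short sketch. It assumes that instants are the 1-indexed positions of the ideal clock and that the variable x of the affine function ranges over 1, 2, 3, and so on; both conventions are assumptions of this example, and the function name `affine_clock` is hypothetical. The two calls use the functions x → 2x and x → 2x + 3 that the paper associates with clk2 and clk3 in its synchronizability analysis.

```python
from typing import List


def affine_clock(a: int, b: int, length: int) -> List[int]:
    """Binary clock with a 1 at every instant a*x + b (x = 1, 2, ...).

    Instants are 1-indexed positions of the ideal clock; instants that
    fall outside [1, length] are ignored.
    """
    if a <= 0:
        raise ValueError("a must be positive so that instants increase")
    clock = [0] * length
    x = 1
    while a * x + b <= length:
        instant = a * x + b
        if instant >= 1:
            clock[instant - 1] = 1
        x += 1
    return clock


clk2 = affine_clock(2, 0, 8)  # activations at instants 2, 4, 6, 8
clk3 = affine_clock(2, 3, 8)  # activations at instants 5, 7
print(clk2)
print(clk3)
```

Both clocks have the same period (a = 2) and differ only by their offset b, which is why one is a shifted copy of the other.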
3.3 Analysis of the temporal behavior
We reason on the temporal behavior of the components in order to explore the architecture specified
earlier. This allows us to redesign the system for better
performance.
Let clk1 (resp. clk2 ) be the clock associated with
component C1 (resp. C2 ) and let nb1 (resp. nb2 ) be the
total number of component activations in clk1 (resp.
clk2). Assuming that nb1 = nb2, C1 and C2 are activated the same number of times. The algorithm shown in Figure 4 details the semantics of the synchronizability analysis on clk1 and clk2 and its meaning at a low level of description.
∀n ∈ N∗:
if pos(n, clk1) = pos(n, clk2) then
  the channel connecting C1 to C2 does not need any buffering, since the communication between these two ports is synchronous.
else if pos(n, clk1) < pos(n, clk2) then
  C1 and C2 have the same number of activations, but C2 is activated with a certain delay d. This means that C1 writes data and C2 consumes the same amount of data d logical instants later.
else if pos(n, clk1) > pos(n, clk2) then
  C2 consumes patterns before they have been produced by C1, which makes the system incoherent since the data are not present when C2 is activated.
end if
Figure 3. Trace of clocks clk1 , clk2 , clk3 , clk4 .
Now that we have synthesized logical binary clocks associated with components describing functional behavior, we analyze these clocks and deduce information concerning certain aspects of the system architecture.
Below, we define four functions that help us to reason on the synchronizability of the clocks.
Definition 2 (Quantity) Let clk be a binary clock and i ∈ [1, length(clk)]. We write (1.clk, i) when position i of clk is 1, and (0.clk, i) when position i of clk is 0. The recursive function Quant, which counts the occurrences of 1 in clk up to position i, is defined as follows:
• Quant(1.clk, 1) = 1
• Quant(0.clk, 1) = 0
• Quant(1.clk, i) = 1 + Quant(clk, i − 1), for i > 1
• Quant(0.clk, i) = 0 + Quant(clk, i − 1), for i > 1
Figure 4. Clock synchronizability analysis.
When nb1 ≠ nb2, clk1 and clk2 cannot be synchronized since they do not have the same number of activations, i.e., they cannot both be active during all logical instants.
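The synchronizability analysis can be sketched as follows. This is an illustration under assumptions, not the authors' tool: clocks are lists of 0/1 values, `pos` follows the activation rank of Definition 1, the function name `analyze` and its string verdicts are hypothetical, and the delay d is taken here to be pos(n, clk2) − pos(n, clk1), which the paper does not state explicitly.

```python
from typing import List, Sequence


def pos(n: int, clk: Sequence[int]) -> int:
    """pos(n, clk): 1-indexed position of the n-th activation (n-th 1) in clk."""
    count = 0
    for q, value in enumerate(clk, start=1):
        if value == 1:
            count += 1
            if count == n:
                return q
    raise ValueError(f"clk has fewer than {n} activations")


def analyze(clk1: Sequence[int], clk2: Sequence[int]) -> List[str]:
    """Classify each pair of n-th activations of a writer (clk1) and a reader (clk2)."""
    nb1, nb2 = sum(clk1), sum(clk2)
    if nb1 != nb2:
        # Different activation counts: the clocks cannot be synchronized.
        raise ValueError("not synchronizable: nb1 != nb2")
    verdicts = []
    for n in range(1, nb1 + 1):
        p1, p2 = pos(n, clk1), pos(n, clk2)
        if p1 == p2:
            verdicts.append("synchronous")            # no buffering needed
        elif p1 < p2:
            verdicts.append(f"delayed by {p2 - p1}")  # reader lags: buffer the data
        else:
            verdicts.append("incoherent")             # reader runs before the writer
    return verdicts


print(analyze([1, 0, 1, 0], [1, 0, 0, 1]))
print(analyze([0, 1], [1, 0]))
```

The first call reports a synchronous first activation followed by a one-instant delay; the second reports an incoherent pair because the reader is activated before the writer.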
3.4 Synchronizability analysis
Figure 5 shows a synchronization issue between clk2 and clk3. clk2 is characterized by an affine function f2 : x → 2x w.r.t. the ideal clock, while clk3 is characterized by f3 : x → 2x + 3. Therefore, they are not synchronous. To store the data produced by Proc2 and not yet consumed by Proc3, we have to determine the minimal buffer size that enables data storage without any loss of information. By applying the Quant function to clk2 and clk3, we calculate the number of executed instruction cycles up to a given instant. Then, by applying the Diff function, we can determine the gap between clk2 and clk3 at each logical instant:
Quant(clk2, 2) = 1, Quant(clk3, 2) = 0 ⇒ Diff(clk2, clk3, 2) = 1
Quant(clk2, 4) = 2, Quant(clk3, 4) = 0 ⇒ Diff(clk2, clk3, 4) = 2
Quant(clk2, 6) = 3, Quant(clk3, 6) = 1 ⇒ Diff(clk2, clk3, 6) = 2
...
Figure 5. Image downscaling.
From the results obtained above, and by applying the MinSize function, we can conclude that the communication between clk2 (associated with Proc2) and clk3 (associated with Proc3) needs a delay that can be represented as a buffer of size 2.
4 Towards design space exploration
From the previous analysis, useful information is extracted that can be used to redesign the system with more accurate values of physical resources. However, it is the designer’s task to choose the most suitable system refactoring based on the physical resources and QoS requirements. If a component C1 is writing data and another component C2 is consuming the same data but after several logical instants, then the designer could proceed in two ways to deal with the synchronization issue.
First, the designer could consider an inter-processor communication mechanism that allows buffer sharing, which is used to pass data from one processor to another. This delaying mechanism makes it possible to store the processed data that have not yet been read by a reading component. Our approach proposes to estimate, early in the design, a minimal buffer size based on [2], which would achieve such communication. However, if the estimated buffer length cannot be supported by the current specification, the designer must consider allocating additional memory to the system, which leads to a higher cost and more used space.
Changing the processors’ frequency is another solution to the synchronization problem. Frequency scaling might help reduce the buffer size needed to synchronize the communication. This solution is also constrained by QoS issues and cannot be taken for granted. For instance, decreasing the processor’s frequency can affect the quality of the produced images. Increasing the frequency can be accompanied by an increase in temperature.
5 Conclusion
This paper addresses the design issue of System-on-Chip by elevating the design abstraction levels, through a model-driven approach. It adopts an analysis technique that relies on the synchronous reactive approach, which strongly favors formal validation. After analyzing the temporal behavior of the system, we were able to extract information regarding the system’s architecture that may help the designer to better redesign the system. Among the perspectives is the effective implementation of the approach in the Gaspard framework [3], which is dedicated to the design of high-performance embedded systems.
References
[1] P. Boulet. Array-OL revisited, multidimensional intensive signal processing specification. Research Report RR-6113, INRIA, February 2007. http://hal.inria.fr/inria-00128840/en.
[2] A. Cohen, M. Duranton, C. Eisenbeis, C. Pagetti, F. Plateau, and M. Pouzet. N-synchronous Kahn networks. In ACM Symp. on Principles of Programming Languages (PoPL’06), Charleston, South Carolina, USA, January 2006.
[3] A. Gamatié, S. Le Beux, É. Piel, A. Etien, and R. Ben-Atitallah. A model driven design framework for high performance embedded systems. Research Report 6614, INRIA, 2008. http://hal.inria.fr/inria-00311115/en.
[4] D.C. Schmidt. Guest editor’s introduction: Model-driven engineering. Computer, 39(2):25–31, 2006.