Model-Driven Design of Embedded Multimedia Applications on SoCs

Jean-Luc Dekeyser


2009 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, 2009


Adolf Abdallah, Abdoulaye Gamatié and Jean-Luc Dekeyser
LIFL-USTL/CNRS, INRIA Lille Nord Europe
{Adolf.Abdallah, Abdoulaye.Gamatie, Jean-Luc.Dekeyser}@lifl.fr

Abstract

This paper addresses the design of Systems-on-Chip by raising the level of design abstraction through a model-driven approach. It considers the standard Marte profile, which is dedicated to the Modeling and Analysis of Real-Time Embedded systems. Information is extracted from user-defined models and serves for the analysis of these models. The adopted analysis technique relies on the synchronous reactive approach, which strongly favors formal validation.

1 Introduction

The degree of complexity of Systems-on-Chip (SoCs) is increasing at a tremendous pace. One of the main reasons is the current technological advances, which are growing exponentially with the evolution of the targeted applications and the corresponding architectures. As a result, the design of such systems is becoming more challenging than ever.

To deal with complex systems, Model-Driven Engineering (MDE) [4] is an adequate solution. It enables one to abstract the complexity of systems via high-level specifications, and it facilitates design automation via systematic model transformations up to code generation. The models manipulated at different abstraction levels must conform to certain grammars, referred to as metamodels. From one abstraction level to another, the models are often refined by enriching them with more precise information that allows the production of executable models.

Another important aspect is the execution performance of the resulting systems. So, beyond the correctness of the system's functional properties, non-functional aspects, e.g., execution time, must also be addressed. In this paper, we mainly concentrate on the design of SoCs by taking into account the above aspects.

2 System modeling using Marte

We consider a simple multimedia processing module that reduces a flow of images of size 1920×1080 to images of size 640×540. Such an image size reduction is known as image downscaling in the multimedia processing domain.

Functional part. The downscaling algorithm is described as a compound component in which the HorizontalFilter and VerticalFilter are distinguished, as well as the pixel Reordering component. This is illustrated by Figure 1. In addition, two other components are considered: ColorInformation extracts the color space information of the input images, whereas the PixelConstructor component reassembles images from their color space information. All these components are connected according to their functional interaction. At the bottom of Figure 1, the FrameGenerator component creates high-definition images that are processed by DownscalingProcess. The resulting images are low-definition images read by the FrameConsumer component. All the processed data are allocated to the memory component SRAM. In order to express data parallelism, we use the repetitive structure modeling (RSM) package of Marte. The concepts of this package are based on the repetitive model of computation [1]. In the downscaling example, the VerticalFilter is repeated 16 times.

Figure 1. Specification of the system's functionality.

Hardware architecture part. The architecture model is shared memory-based, with four processors. We also define SRAM and ROM memories to write, store and retrieve data and program instructions. A sensor generating data and an actuator consuming the processed data are also defined. All communications between these physical resources pass through a Bus component via Master and Slave ports.

Association and platform deployment. The association model defines the mapping between the functionality of the system and the physical resources. The HorizontalFilter component from the functional description model is associated with a processor Proc1 via the allocate connector of Marte. The ports FrameIn and FrameOut of the DownscalingProcess component are also associated with the DataMemory.

In order to refine the models towards a specific technology, we have to instantiate the abstract descriptions with concrete information. In [3], this phase is referred to as platform "deployment". During this phase, each resource is assigned to a particular hardware implementation. Typically, the type and characteristics of processors are defined. Figure 2 shows in our case how the hardware processor Proc1 is deployed on a virtualIP, called VProc1, which contains two hardware implementations at the SystemC transaction level modeling (TLM) level, with different frequency values. The component VProc1-TLM1 (resp. VProc1-TLM2), stereotyped hardwareIP and hwProcessor, is assigned a frequency of 15 Hz (resp. 30 Hz).

Figure 2. Hardware deployment of Proc1.

3 Analysis of temporal properties

We analyze the temporal behavior of the previously modeled system in order to extract information for design assessment. Our analysis relies on [2], which defines the N-synchronous Kahn networks. In this approach, periodic clocks are defined and associated with system components, and the analysis of clock synchronizability makes it possible to verify properties regarding, e.g., QoS.

3.1 Binary clocks

The specification of the functional behavior corresponds to a graph C representing a set of components describing the system functional behavior. Clk is the set of logical clocks clki expressing the activations of components. At each instant of a given clock clk1, the associated component C1 is activated and the encapsulated component is executed by the corresponding processor.

Given a component C specified in the functional description, clk is the associated binary clock of C and is deduced from the information given in the model defined in Section 2. In order to synthesize clk, the model should contain the number of instruction cycles that the processor should execute and the frequency of the processor.

Logical binary clocks consist of a sequence of binary values, associated with components describing the system functionality, where 0 and 1 respectively denote the idle and active states. Regarding processors, the value 1 in a clock clk means that a processor is active; the value 0 means that a processor is either active and executing a 0-cycle, or idle. From now on, we assume the existence of an ideal clock, which is fast enough to capture any binary clock in our system.

Given a binary clock clk, the function length(clk) returns the number k ∈ N∗ of binary values in clk.

Definition 1 (Activation rank) Let C ∈ C be a component such that clk ∈ Clk(C) contains at least one activation. Then, ∀n ∈ N∗, the nth activation of C w.r.t. clk is given by pos(n, clk) = q, where q ∈ N∗ is at most equal to length(clk).

3.2 Description of the temporal behavior

In the downscaler, we analyze the system based on four processors, where each one executes instruction cycles based on the functional description. As we have seen in Figure 2, Proc1 is deployed to the TLM level with two implementation choices. These implementations are specified as hardwareIP in Marte and are encapsulated in a virtualIP. The first hardwareIP has a frequency of 15 Hz whereas the second one has a frequency of 30 Hz. We have chosen 15 Hz (expressed in Figure 2 by the implements arrow going from VProc1-TLM1 towards Proc1). Proc1, to which HorizontalFilter is allocated in the functional description, has 16 instruction cycles to execute since the component HorizontalFilter is repeated 4×4 times.

If a given processor has a frequency f and is executing InstCycle instruction cycles, then the duration d, expressed in seconds, that the processor takes to accomplish the computation is given by: d = InstCycle/f. Below are the values extracted from the specification of the downscaler:

Proc1: F1 = 15 Hz, InstCycle1 = 16 ⇒ d1 ≃ 1.06 s
Proc2: F2 = 45 Hz, InstCycle2 = 30 ⇒ d2 ≃ 0.66 s
Proc3: F3 = 45 Hz, InstCycle3 = 30 ⇒ d3 ≃ 0.66 s
Proc4: F4 = 40 Hz, InstCycle4 = 60 ⇒ d4 = 1.5 s

Distribution function. The scheduling of the repeated components allocated to processors is highly related to the type of the instruction that is being processed. Therefore, we define affine and linear functions to periodically distribute the activation instants of components. Figure 3 shows the distribution of the activation instants of the binary clocks corresponding to the four processors, associated with an ideal clock.

IdealClk: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
clk1:     0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 ...
clk2:     0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 ...
clk3:     0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 ...
clk4:     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...

Figure 3. Trace of clocks clk1, clk2, clk3, clk4.

Now that we have synthesized logical binary clocks associated with components describing functional behavior, we analyze these clocks and deduce information concerning certain aspects of the system's architecture. Below, we define four functions that help us to reason on the synchronizability of the clocks.

Definition 2 (Quantity) Let clk be a binary clock and i ∈ [1, length(clk)], where (1.clk, i) expresses that the position i of clk is 1 and (0.clk, i) expresses that the position i of clk is 0. We define the recursive function Quant as follows:

• Quant(1.clk, i) = 1 + Quant(clk, i − 1)
• Quant(0.clk, i) = 0 + Quant(clk, i − 1)
• Quant(1.clk, 1) = 1
• Quant(0.clk, 1) = 0

Given the number of occurrences of 1 at each position of the defined clock, we are able to deduce the delay between two clocks at each position.

Definition 3 (Delay) Let clk1, clk2 be two binary clocks such that length(clk1) = length(clk2). Then, ∀i ∈ [1, length(clk1)]:

Diff(clk1, clk2, i) = |Quant(clk1, i) − Quant(clk2, i)|.

We can determine the minimal buffer size that enables such communication.

Definition 4 (Minimal buffer length) Let clk1 and clk2 be two binary clocks such that length(clk1) = length(clk2). Then, over all i ∈ [1, length(clk1)]:

MinSize(clk1, clk2) = Max(Diff(clk1, clk2, i)).

3.3 Analysis of the temporal behavior

We reason on the temporal behavior of the components in order to explore the architecture specified earlier. This allows us to redesign the system for better performance.

Let clk1 (resp. clk2) be the clock associated with component C1 (resp. C2) and let nb1 (resp. nb2) be the total number of component activations in clk1 (resp. clk2). Assuming that nb1 = nb2, C1 and C2 are activated the same number of times. The algorithm shown in Figure 4 details the semantics of the synchronizability analysis on clk1 and clk2 and its meaning at a low description level.

∀n ∈ N
if pos(n, clk1) = pos(n, clk2) then
    the channel connecting C1 to C2 does not need any buffering, since the communication between these two ports is synchronous.
else if pos(n, clk1) < pos(n, clk2) then
    the activation number of C1 is the same as that of C2, but with a certain delay d. This means that C1 is writing data and C2 is consuming the same amount of data after d logical instants.
else if pos(n, clk1) > pos(n, clk2) then
    C2 is consuming patterns before they have been produced by C1, which makes the system incoherent since data is not present when C2 is activated.
end if

Figure 4. Clock synchronizability analysis.

When nb1 ≠ nb2, clk1 and clk2 cannot be synchronized since they do not have the same number of activations, i.e., they cannot both be active during all logical instants.

3.4 Synchronizability analysis

Figure 5 shows a synchronization issue between clk2 and clk3. clk2 is characterized by an affine function f2 : x → 2x w.r.t. the ideal clock, while clk3 is characterized by f3 : x → 2x + 3. Therefore, they are not synchronous. To store the data produced by Proc2 and not yet consumed by Proc3, we have to determine the minimal buffer size that enables data storage without any loss of information. By applying the Quant function on clk2 and clk3, we calculate the number of executed instruction cycles up to a given instant. Then, by applying the Diff function, we can determine the gap between clk2 and clk3 at each logical instant:

Quant(clk2, 2) = 1, Quant(clk3, 2) = 0 ⇒ Diff(clk2, clk3, 2) = 1
Quant(clk2, 4) = 2, Quant(clk3, 4) = 0 ⇒ Diff(clk2, clk3, 4) = 2
Quant(clk2, 6) = 3, Quant(clk3, 6) = 1 ⇒ Diff(clk2, clk3, 6) = 2
...

Figure 5. Image downscaling.

From the results obtained above, and by applying the MinSize function, we can conclude that the communication between clk2 (associated with Proc2) and clk3 (associated with Proc3) needs a delay that can be represented as a buffer of size 2.

4 Towards design space exploration

From the previous analysis, useful information is extracted that can be used to redesign the system with more accurate values of physical resources. However, it is the designer's task to choose the most suitable system refactoring based on the physical resources and QoS requirements. If a component C1 is writing data and another component C2 is consuming the same data but after several logical instants, then the designer could proceed in two ways to deal with the synchronization issue.

First, the designer could consider an inter-processor communication mechanism that allows buffer sharing, which is used to pass data from one processor to another. This delaying mechanism makes it possible to store the processed data that have not yet been read by a reading component. Our approach proposes to estimate early a minimal buffer size, based on [2], that would achieve such communication. However, if the estimated buffer length cannot be supported by the current specification, the designer must consider allocating additional memory to the system, which leads to a higher cost and more used space.

Second, changing the processors' frequency is another solution to the synchronization problem. Frequency scaling might help reduce the buffer size needed to synchronize the communication. This solution is also constrained by QoS issues and cannot be taken for granted. For instance, decreasing the processor's frequency can affect the quality of the produced images, while increasing the frequency can be accompanied by an increase in temperature.

5 Conclusion

This paper addresses the design of Systems-on-Chip by raising the level of design abstraction through a model-driven approach. It adopts an analysis technique that relies on the synchronous reactive approach, which strongly favors formal validation. After analyzing the temporal behavior of the system, we were able to extract information regarding the system's architecture that may help the designer to better redesign the system. Among the perspectives is the effective implementation of the approach in the Gaspard framework [3], which is dedicated to the design of high-performance embedded systems.

References

[1] P. Boulet. Array-OL revisited, multidimensional intensive signal processing specification. Research Report RR-6113, INRIA, February 2007. http://hal.inria.fr/inria-00128840/en.

[2] A. Cohen, M. Duranton, C. Eisenbeis, C. Pagetti, F. Plateau, and M. Pouzet. N-synchronous Kahn networks. In ACM Symp. on Principles of Programming Languages (PoPL'06), Charleston, South Carolina, USA, January 2006.

[3] A. Gamatié, S. Le Beux, É. Piel, A. Etien, and R. Ben-Atitallah. A model driven design framework for high performance embedded systems. Research Report 6614, INRIA, 2008. http://hal.inria.fr/inria-00311115/en.

[4] D.C. Schmidt. Guest editor's intro.: Model-driven engineering. Computer, 39(2):25–31, 2006.
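Illustrative sketch. The duration formula d = InstCycle/f, the functions pos, Quant, Diff and MinSize (Definitions 1 to 4) and the case analysis of Figure 4 can be tried out on the clk2/clk3 example of Section 3.4 with a few lines of Python. This is only a minimal sketch under stated assumptions, not the implementation used by the authors: clocks are represented as Python lists of 0/1 values with 1-based positions, the helper affine_clock and the function classify are hypothetical names introduced here, and the affine functions are assumed to be evaluated for x = 1, 2, ...

```python
def duration(inst_cycles, freq_hz):
    """d = InstCycle / f, in seconds."""
    return inst_cycles / freq_hz


def affine_clock(a, b, length):
    """Hypothetical helper: binary clock whose activations fall at the
    instants a*x + b (x = 1, 2, ...), truncated to `length` positions."""
    active = {a * x + b for x in range(1, length + 1)}
    return [1 if i in active else 0 for i in range(1, length + 1)]


def quant(clk, i):
    """Quant(clk, i): number of 1s in the first i positions of clk."""
    return sum(clk[:i])


def diff(clk1, clk2, i):
    """Diff(clk1, clk2, i) = |Quant(clk1, i) - Quant(clk2, i)|."""
    return abs(quant(clk1, i) - quant(clk2, i))


def min_size(clk1, clk2):
    """MinSize(clk1, clk2): maximum of Diff over all positions."""
    if len(clk1) != len(clk2):
        raise ValueError("clocks must have the same length")
    return max(diff(clk1, clk2, i) for i in range(1, len(clk1) + 1))


def pos(n, clk):
    """pos(n, clk): 1-based position of the n-th activation of clk."""
    seen = 0
    for q, bit in enumerate(clk, start=1):
        seen += bit
        if bit and seen == n:
            return q
    raise ValueError("clock has fewer than n activations")


def classify(n, clk1, clk2):
    """Case analysis of Figure 4 for the n-th activation (labels are ours)."""
    p1, p2 = pos(n, clk1), pos(n, clk2)
    if p1 == p2:
        return "synchronous"   # no buffering needed
    if p1 < p2:
        return "delayed"       # producer ahead: buffering needed
    return "incoherent"        # consumer activated before the producer


# clk2 follows f2: x -> 2x and clk3 follows f3: x -> 2x + 3 (Section 3.4).
clk2 = affine_clock(2, 0, 10)
clk3 = affine_clock(2, 3, 10)

print([diff(clk2, clk3, i) for i in (2, 4, 6)])  # gaps at instants 2, 4, 6
print(min_size(clk2, clk3))                      # estimated buffer size
print(classify(1, clk2, clk3))
print(duration(60, 40))                          # Proc4: 60 cycles at 40 Hz
```

On this ten-instant window the gaps at instants 2, 4 and 6 are 1, 2 and 2, and the estimated buffer size is 2. Computing Quant with a slice sum rather than the recursive form of Definition 2 is a design choice made here for brevity; both count the activations up to position i.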

