This document summarizes a proposed method for synthesizing the data path and control path of a CPU. The method uses a graphical representation called a Register Transfer Graph (RTG) to represent the data transfer operations and processing operations between components. It proposes using synthesis parameters like resource sharing, multiport memory, multicycled operations, and pipelined operations to transform the architecture represented by the RTG. The RTG is scheduled at the micro-operation level to optimize performance under the selected parameters. Data transfer paths are also reduced by replacing paths with bypass routes to reduce the connection cost.
IEEE Asia Pacific Conference on Circuits and Systems '96
member 28- 21, 1996,
[...] Hardware-Software Codesign [...]

[...]
Dept. of Electrical and Electronic Engineering

Hiroaki Kunieda
Dept. of Electrical and Electronic Engineering
Tokyo Institute of Technology
2-12-1, Ookayama, Meguro-ku, [...]
Tel: +81-3-5734-257
[...]: +81-3-5734-2842
unieda@ss.titech.ac.jp

0-7803-3702-6/96/$5.00 © 1996 IEEE

Abstract

We propose a systematic method which synthesizes the data path and control path of a CPU [...]. We use a graphical representation [...] explore the design space more broadly, [...] change the architecture of the data path. The number of data transfer paths is reduced by replacing [...]

1. INTRODUCTION

One of the fast time-to-market design solutions for [...] the embedded system. While the software part is used for providing the behav[...] flexibility of the system, the hardware part [...]. The Application Specific Instruction Processor (ASIP) [...] has been studied popularly. [...] is a processor [...] CPU core of an ASIP. [...] The other is to synthesize the micro-architecture optimized for the given application, which is described with instructions. These works use initial architectures whose data path has an almost fixed connection topology. In those cases, since architectural flexibility is limited by the initial topology, it is not easy to explore the design space widely.

Compared with the previous works, our approach is more aggressive in pursuing high performance of the ASIP. The instruction sequence is decomposed into micro-operations (MOPs). They are scheduled at the MOP level in order to achieve higher performance with an optimized micro-architecture. To explore the design space broadly, we transform the architecture by the selection of synthesis parameters. We assume a virtual machine as the initial architectural template, in which there is no limit in the [...] of control path on the [...] time.

[...]ed microprocessor. The [...] part and hardware part is done [...] sequence. In the hardware synthesis part, the assembly codes are translated into a graphical form, called the Register Transfer Graph (RTG), which is a graphical representation of data transfer and data processing between RTL components. [...] and the topology synthesis process begin, the combinations of synthesis parameters are applied selectively to enhance the performance or to reduce the area. This results in the transformation of the data path topology. The scheduling is performed at the MOP level under the selected synthesis parameters. Additionally, to reduce the connection area cost, the data transfer paths are reduced by replacing a path with its bypass route.

Fig. 1: Synthesis Flow

3. REGISTER TRANSFER GRAPH

Since an instruction is composed of a series of MOPs, which are data transfer operations or data processing operations between RTL components, an instruction can be considered as the ordered execution of such RTL operations. In order to represent operations at RTL, we introduce a new RTL-level graph RTG(V, A), in which V is the set of RTL components and A is the set of MOPs. Compared with a CDFG, the RTG is more useful for predicting the usage of RTL components and the connection topology between them. Ease of prediction is important for evaluating the usage of RTL components before resource allocation is performed.

Multiple executions of an operation are denoted by the execution order set $O_{i,j}$, attached to the arc from vertex $i$ to vertex $j$ and simply called the order set. The elements of the order set are the control step numbers at which the operation has to be executed. Initially, all operations are assumed to have a propagation delay of one unit time. The numbering of registers follows the register assignment table. A functional operation is represented by two incoming arcs to a vertex with the same execution order. Each functional operation is given a unique number.

The RTG for the LOAD instruction and its MOP definition are shown in Fig. 2.

MOP definition of LOAD:
  MAR <- PC
  MBR <- mem[MAR], PC <- PC + 1
  IR <- MBR
  MAR <- IR.addr
  MBR <- mem[MAR]
  Rd <- MBR

Fig. 2: RTG of LOAD Instruction

Fig. 3: RTG for Sample Instruction Sequence

The order sets are as follows: $O_{1,3} = \{1\}$, $O_{1,1} = \{2\}$, $O_{2,3} = \{4\}$, $O_{3,4} = \{2, 5\}$, $O_{4,2} = \{3\}$, $O_{4,R_n} = \{6\}$.
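As a minimal sketch of how RTG(V, A) and its order sets might be held in memory, the following Python fragment stores each arc as a (source, destination) pair mapped to its set of control steps. The class name `RTG` and the methods `add_mop`, `mops_at`, and `num_steps` are hypothetical and not from the paper. Register names are used in place of the numeric vertex labels, and the memory read is modeled as an arc from MAR to MBR, which is an assumption inferred from the order set $O_{3,4} = \{2, 5\}$.

```python
from collections import defaultdict


class RTG:
    """Hypothetical container for a Register Transfer Graph.

    Vertices are RTL components; each arc (src, dst) carries an order set,
    i.e. the control steps at which that MOP has to be executed.
    """

    def __init__(self):
        self.vertices = set()
        self.order_sets = defaultdict(set)  # (src, dst) -> {control steps}

    def add_mop(self, src, dst, step):
        """Record that the transfer src -> dst executes at control step `step`."""
        self.vertices.update((src, dst))
        self.order_sets[(src, dst)].add(step)

    def mops_at(self, step):
        """Return the arcs (MOPs) executed at a given control step, sorted."""
        return sorted(arc for arc, steps in self.order_sets.items() if step in steps)

    def num_steps(self):
        """Return the last control step used by any arc."""
        return max(s for steps in self.order_sets.values() for s in steps)


# LOAD instruction, following the MOP list of Fig. 2.
# Assumption: mem[MAR] -> MBR is drawn as the arc MAR -> MBR.
load = RTG()
for src, dst, step in [
    ("PC", "MAR", 1),   # MAR <- PC
    ("MAR", "MBR", 2),  # MBR <- mem[MAR]
    ("PC", "PC", 2),    # PC <- PC + 1 (self arc)
    ("MBR", "IR", 3),   # IR <- MBR
    ("IR", "MAR", 4),   # MAR <- IR.addr
    ("MAR", "MBR", 5),  # MBR <- mem[MAR]
    ("MBR", "Rd", 6),   # Rd <- MBR
]:
    load.add_mop(src, dst, step)

print(dict(load.order_sets))
print(load.mops_at(2))
```

Keeping the order set on the arc, rather than one arc per execution, mirrors the text: a connection that is reused at several control steps stays a single arc, so the number of arcs directly reflects the number of connections in the data path.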
According to the execution sequence of the given instruction sequence, the RTG of each instruction is integrated into a representative RTG. When a different kind of instruction is integrated, a new vertex or a new arc may be added, and the order sets may change. The order of arcs is updated sequentially according to the execution sequence. Fig. 3 shows the representative RTG of the example instruction sequence for x = (w + x) - y.

4. SELECTION OF SYNTHESIS PARAMETER

Depending on the selected synthesis parameters, the RTG is modified to accommodate them, and in turn the resulting architecture changes. Each parameter, or a combination of them, is applied to the initial RTG, in which all MOPs are assumed to be executed sequentially without any execution overlap or component sharing. We use four synthesis parameters: Resource Sharing (C1), Multiport Memory (C2), Multicycled Operation (C3), and Pipelined Operation (C4).

In this paper, these synthesis parameters are applied additively, as shown in Table I. Multicycled functional operation and functional pipelining are alternatives, selected according to the application or the objective function. Fig. 4 shows the RTG as modified in each case. The elements of the order sets and the connection topology are changed. In (b), there are two pairs of MBR and MAR: one for instruction fetch (3, 4), the other for data fetch (3*, 4*). In (c), two operand registers (p1, p2) are included for multicycled functional operation (Case III) or pipelined functional operation (Case IV).

Fig. 4: Modified RTG under (a) Case I, (b) Case II, (c) Case III, IV

TABLE I
COMBINATIONS OF SYNTHESIS PARAMETERS

  Case   Synthesis parameters
  I      [...]
  II     [...]
  III    C1 + C2 + C3
  IV     C1 + C2 + C4

5. LIST SCHEDULING WITH INSTRUCTION ORDER

In order to schedule MOPs so that correct execution of instructions is guaranteed, the dependencies between instructions or MOPs must be kept.
There are two kinds of dependencies in an instruction sequence: inter-instruction and intra-instruction. Inter-instruction dependency is the dependency between instructions. Since the concurrent execution of multiple instructions, as in a superscalar processor, is not allowed in the current system, an instruction can be executed only after all instructions before it are executed. Hence, [...] instruction implies the [...] dependency between them, called inter-MOP dependency. The operation code field of an instruction specifies which kind of operation is to be executed. Therefore, the type of execution is known only after the instruction is decoded properly. The MOPs of the execution [...] scheduled before the MOPs of [...] cycle. We call this constraint the cycle boundary. We use list scheduling to schedule [...] into a control step. The instruction order is used as the priority function which resolves resource contention.

6. TRANSFER PATH REDUCTION

For the scheduled RTG, we apply a heuristic technique to reduce the number of data transfer paths without increasing the number of control steps. A data transfer operation is a register-to-register operation which transfers data directly without modification. If we can find an alternative path for a data transfer path in the RTG, the data transfer path can be replaced with the alternative path (b[...]), which results in the reduction of connection cost. The direct connections between registers and the functional units with bypass operation are used as the bypass resources.

Path replacement is performed in three steps: [...], refinement, and selection. The [...] step is to find the candidates [...]. The refinement step is to refine the current [...] path replacement. The selection step is to select only candidates with [...] total number of control steps after [...]. The RTGs for each case are refined as shown in Fig. 5. In (a), $a_{4,60}$ and $a_{4,61}$ are removed, and $a_{4^*,60}$ and $a_{4^*,61}$ are removed.

Fig. 5: Refined RTGs by Transfer Path Reduction

7. DATA/CONTROL PATH GENERATION

After the scheduling and the transfer path reduction are completed, the vertices of the RTG are mapped into RTL components and the arcs are mapped into connection resources. [...] can be implemented in bus-oriented [...]ed. Since the connection geometry [...] explicitly in the connectivity graph, the multiplexer-type [...] path is derived straightforwardly. When buses are used as the connection resources, the occupancy of the connection resources has to be carefully investigated. When more than one MOP is executed simultaneously, all resources required to execute those concurrent MOPs should be reserved so as to avoid resource conflicts and data collisions. Also, the operand paths and the result path of a functional unit should be reserved during its operation time. In the case of memory access, both the data path and the address path must be reserved together in order to ensure correct memory access.

The control path consists of a condition register, a state register, a decoder, and micro-instructions stored in a PLA, ROM, or wired logic. Micro-instructions are generated from the scheduled time table. A different combination of MOPs is executed at each control step. We define such a combination of MOPs as an M-set. Among the M-sets, there are common M-sets which are executed more than once. A common M-set is unique across all control steps and has a unique micro-instruction. The decoder associates the current control step with the micro-instruction which has to be executed. In order to reduce the hardware, a new instruction set tuned to the derived topology must be generated. Currently, our system does not include such a procedure.

8. EXPERIMENTAL RESULT

To verify the feasibility of the proposed method, the basic block of diffeq, the differential equation benchmark, is chosen as the example. Table II shows the resulting component utilization of five cases. We generate the data path and control path under 4 combinations of synthesis parameters. We assume the delay of the multiplier is three times that of addition.
Mark † denotes the multicycled multiplier with a propagation delay of 3 control cycles, and mark [...] denotes the 4-stage pipelined multiplier. Currently, the number of pipeline stages is fixed to 4 and the propagation delay of the multicycled operator is 3 control steps. As future work, the optimal number of pipeline stages and the propagation delay of the multicycled operator will be determined so that the given design goal can be satisfied. [...]re, which is the product of the number of control steps and the maximum register-to-register delay, is calculated under the assumption that the control cycle time of the initial RTG is the nominal cycle time. [...] are the number of control steps, the number of M-sets, and the width of the micro-instruction, respectively.

Note that even if the number of functional components and the number of storage components are the same, the usage of connection resources differs according to the connection topology and the implementation method. This result indicates that the connection geometry, as well as the utilization of storage units and functional units, has to be considered at the design evaluation step. Also, the width of the micro-instruction varies with the implementation method even within the same case. So, in order to select a more practical solution, the effect of the control path has to be considered together with the data path.

9. CONCLUSION

We proposed a systematic method which synthesizes the data path and control path of a CPU core for hardware-software codesign. We first proposed a graphical representation method to describe instructions at the register transfer level. By using the RTG, we can derive the topology of the data path directly. In order to transform the architecture of the data path, we applied synthesis parameters selectively. As a result, we can explore the design space more efficiently. The optimization of the data path topology and the maximization of resource utilization are considered simultaneously.
By replacing rarely used data transfer paths with their bypass routes, the number of data transfer paths and hence the connection cost are reduced. To select the best among the candidate CPU cores, the data path cost and the control path cost are considered together.

10. ACKNOWLEDGEMENT

This work has been carried out as a project of the CAD21 Research Body of Tokyo Institute of Technology. We wish to thank all the members of CAD21 for their suggestions and cooperation.