RTL-to-Gates Synthesis Using Synopsys Design Compiler: 6.375 Tutorial 4 February 12, 2007
Getting started
Before using the 6.375 toolflow you must add the course locker and run the course setup script with the following two commands.

  % add 6.375
  % source /mit/6.375/setup.csh

For this tutorial we will be using an unpipelined SMIPSv1 processor as our example RTL design. You should create a working directory and check out the SMIPSv1 example project from the course CVS repository using the following commands.

  % mkdir tut4
  % cd tut4
  % cvs checkout examples/smipsv1-1stage-v
  % cd examples/smipsv1-1stage-v
Before starting, take a look at the subdirectories in the smipsv1-1stage-v project directory. Figure 1 shows the system diagram which is implemented by the example code. When pushing designs through the physical toolflow we will often refer to the core. The core module contains everything which will be on-chip, while blocks outside the core are assumed to be off-chip. For this tutorial we are assuming that the processor and a combinational memory are located within the core. A combinational memory means that the read address is specified at the beginning of the cycle, and the read data returns during the same cycle. Building large combinational memories is relatively inefficient. It is much more common to use synchronous memories. A synchronous memory means that the read address is specified at the end of a cycle, and the read data returns during the next cycle. From Figure 1 it should be clear that the unpipelined SMIPSv1 processor requires combinational memories (or else it would turn into a four-stage pipeline). For this tutorial we will not be using a real combinational memory, but instead we will use a dummy memory to emulate the combinational delay through the memory.

[Figure 1: system diagram implemented by the example code. Labeled blocks: PC, Instruction Mem, Decoder, Reg File, Add, Cmp, Data Mem, tohost. Labeled signals: pc_sel, rf_wen, wb_sel, tohost_en, testrig_tohost.]

Examine the source code in src and compare smipsCore_rtl with smipsCore_synth. The smipsCore_rtl module is used for simulating the RTL of the SMIPSv1 processor and it includes a functional model for a large on-chip combinational memory. The smipsCore_synth module is used for synthesizing the SMIPSv1 processor and it uses a dummy memory. The dummy memory combinationally connects the memory request bus to the memory response bus with a series of standard-cell buffers. Obviously, this is not functionally correct, but it will help us illustrate more reasonable critical paths in the design. In later tutorials, we will start using memory generators which will create synchronous on-chip SRAMs.

Now examine the build directory. This directory will contain all generated content including simulators, synthesized gate-level Verilog, and final layout. In this course we will always try to keep generated content separate from our source RTL. This keeps our project directories well organized, and helps prevent us from unintentionally modifying our source RTL. There are subdirectories in the build directory for each major step in the 6.375 toolflow. These subdirectories contain scripts and configuration files for running the tools required for that step in the toolflow. For this tutorial we will work exclusively in the dc-synth directory.
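The difference between the combinational and synchronous memories described earlier in this section can be sketched with a small behavioral model. This is only an illustration in Python, not part of the course code: the class names, the trace helper, and the memory contents are all hypothetical, and the model ignores writes and real timing.

```python
class CombinationalMemory:
    """Read data is available in the same cycle the address is presented."""

    def __init__(self, contents):
        self.contents = list(contents)

    def read(self, addr):
        return self.contents[addr]


class SynchronousMemory:
    """The address is captured at the clock edge; data is visible in the next cycle."""

    def __init__(self, contents):
        self.contents = list(contents)
        self.rdata = None  # nothing valid before the first clock edge

    def clock_edge(self, addr):
        # The data for addr becomes visible only after this edge.
        self.rdata = self.contents[addr]


def trace(addrs, contents):
    """Return (cycle, addr, combinational data, synchronous data) for each cycle."""
    comb = CombinationalMemory(contents)
    sync = SynchronousMemory(contents)
    rows = []
    for cycle, addr in enumerate(addrs):
        # Sample what each memory shows during this cycle, then clock the edge.
        rows.append((cycle, addr, comb.read(addr), sync.rdata))
        sync.clock_edge(addr)
    return rows


if __name__ == "__main__":
    for row in trace([0, 2, 1], [10, 11, 12, 13]):
        print(row)
```

In the trace, the synchronous column always lags the combinational column by one cycle, which is why swapping in synchronous memories would force extra pipeline stages in the unpipelined processor.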
Figure 2: Output from the Design Compiler elaborate command

timing/area information for each standard cell. DC will use this information to try and optimize the synthesis process. You can now load your Verilog design into Design Compiler with the analyze and elaborate commands. Executing these commands will result in a great deal of log output as the tool elaborates some Verilog constructs and starts to infer some high-level components. Try executing the commands as follows.

  dc_shell-xg-t> analyze -library WORK -format verilog \
      { vcMuxes.v vcStateElements.v vcRAMs.v vcArith.v smipsInst.v \
        smipsProcCtrl.v smipsProcDpathRegfile.v smipsProcDpath_pstr.v \
        smipsProc.v smipsCore_synth.v }
  dc_shell-xg-t> elaborate smipsCore_synth -architecture verilog -library WORK

Take a closer look at the output during elaboration. DC will report all state inferences. This is a good way to verify that latches and flip-flops are not being accidentally inferred. You should be able to check that the only inferred state elements are the PC, the tohost register, a one-bit reset register, and the register file. DC will also note information about inferred muxes. Figure 2 shows a fragment from the elaboration output text. From this output you can see that DC is inferring 32-bit flip-flops for the register file and two 32-input 32-bit muxes for the register file read ports. See the Presto HDL Compiler Reference Manual (presto-HDL-compiler.pdf) for more information on the output from the elaborate command and more generally how DC infers combinational and sequential hardware elements.

After reading our design into DC we can use the check_design command to check that the design is consistent. A consistent design is one which does not contain any errors such as unconnected ports, constant-valued ports, cells with no input or output pins, mismatches between a cell and its reference, multiple driver nets, connection class violations, or recursive hierarchy definitions.
You will not be able to synthesize your design until you eliminate any errors. Run the check_design command as follows and take a look at the warnings. Many of these warnings are obviously not an issue, but it is still useful to skim through this output.

  dc_shell-xg-t> check_design
Before you can synthesize your design, you must specify some constraints; most importantly you must tell the tool your target clock period. The following command tells the tool that the pin named clk is the clock and that your desired clock period is 5 nanoseconds. We need to set the clock period constraint carefully. If the period is unrealistically small, then the tools will spend a very long time trying to meet timing and ultimately fail. If the period is too large, then the tools will have no trouble but you will get a very conservative implementation. For more information about constraints consult the Design Compiler Constraints and Timing Reference Manual (dc-constraints.pdf).

  dc_shell-xg-t> create_clock clk -name ideal_clock1 -period 5

Now we are ready to use the compile command to actually synthesize our design into a gate-level netlist. Two of the most important options for the compile command are map_effort and area_effort. Both of these can be set to one of none, low, medium, or high. They specify how much time to spend on technology mapping and area reduction.

DC will attempt to synthesize your design while still meeting the constraints. DC considers two types of constraints: user-specified constraints and design rule constraints. User-specified constraints can be used to constrain the clock period (as we saw with the create_clock command) but they can also be used to constrain the arrival of certain input signals, the drive strength of the input signals, and the capacitive load on the output signals. Design rule constraints are fixed constraints which are specified by the standard cell library. For example, there are restrictions on the loads specific gates can drive and on the transition times of certain pins. Note that the compile command does not optimize across module boundaries. You can use the set_flatten command to enable inter-module optimization.
For more information on the compile command consult the Design Compiler User Guide (dc-user-guide.pdf) or use man compile at the DC shell prompt. Run the following command and take a look at the output.

  dc_shell-xg-t> compile -map_effort medium -area_effort medium

The compile command will report how the design is being optimized. You should see DC performing technology mapping, delay optimization, and area reduction. Figure 3 shows a fragment from the compile output. Each line is an optimization pass. The area column is in units specific to the standard cell library, but for now you should just use the area numbers as a relative metric. The worst negative slack column shows how much room there is between the critical path in your design and the clock constraint. Larger negative slack values are worse since this means that your design is missing the desired clock frequency by a greater amount. Total negative slack is the sum of all negative slack across all endpoints in the design. If this is a large negative number, it indicates that not only is the design not making timing, but it is possible that many paths are too slow. If the total negative slack is a small negative number, then this indicates that only a few paths are too slow. The design rule cost is an indication of how many cells violate one of the standard cell library design rule constraints. Figure 3 shows that on the first iteration, the tool makes timing but at a high area cost, so on the second iteration it optimizes area but this causes the design to no longer meet timing. On the third through fifth iterations the tool is trying to increase performance and decrease the design rule cost while maintaining the same area.

Figure 3: Output from the Design Compiler compile command

Once the synthesis is complete, you will get the following warning about a very high-fanout net.

  Warning: Design smipsCore_synth contains 1 high-fanout nets.
           A fanout number of 1000 will be used for delay calculations
           involving these nets. (TIM-134)
           Net clk: 1034 load(s), 1 driver(s)

The synthesis tool is noting that the clock is driving 1034 gates. Normally this would be a serious problem, and it would require special steps to be taken during place+route so that the net is properly driven. However, this is the clock node and we already handle the clock specially by building a clock tree during place+route, so there is no need for concern.

We can now use various commands to examine timing paths, display reports, and further optimize our design. Entering these commands by hand can be tedious and error prone, and doing so makes it difficult to reproduce a result. Thus we will mostly use TCL scripts to control the tool. Even so, using the shell directly is useful for finding out more information about a specific command or playing with various options. Before continuing, exit the DC shell and delete your build directory with the following commands.

  dc_shell-xg-t> exit
  % pwd
  tut4/examples/smipsv1-1stage-v/build/dc-synth/build
  % cd ..
  % rm -rf build
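The worst and total negative slack columns described for the compile output can be made concrete with a short calculation. The following is a minimal Python sketch, not anything DC itself runs: the function names and the per-endpoint slack values are made up for illustration, and it follows the sign convention used in the text, where a negative slack means the endpoint misses timing.

```python
def worst_negative_slack(slacks):
    """Most negative endpoint slack, or 0.0 if every endpoint meets timing."""
    return min(0.0, min(slacks, default=0.0))


def total_negative_slack(slacks):
    """Sum of the slack over all endpoints that miss timing."""
    return sum((s for s in slacks if s < 0.0), 0.0)


if __name__ == "__main__":
    # Hypothetical endpoint slacks in ns; negative values violate the clock constraint.
    few_slow_paths = [0.40, -0.05, 0.10, -0.02]
    many_slow_paths = [-0.05] * 200

    for name, slacks in [("few", few_slow_paths), ("many", many_slow_paths)]:
        wns = worst_negative_slack(slacks)
        tns = total_negative_slack(slacks)
        print(f"{name}: WNS = {wns:.2f} ns, TNS = {tns:.2f} ns")
```

Both hypothetical designs have the same worst negative slack, but the second has a far larger total negative slack, which is the signal that many paths, and not just one, are too slow.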
First take a look at the libs.tcl script. You will see that it sets up several library variables, creates the search path, and instructs DC to use a work directory. The first line of the libs.tcl script loads the make_generated_vars.tcl script. This script is generated by the makefile and it contains variables which are defined by the makefile and used by the TCL scripts. We will take a closer look at it in a moment. Now examine the synth.tcl script. You will see many familiar commands which we executed by hand in the previous section. You will also see some new commands. Take a closer look at the bottom of this TCL script where we write out several text reports. Remember that you can get more information on any command by using man <command> at the DC shell prompt. The synth.sdc file contains various user-specified constraints. This is where you constrain the clock period. We also specify that DC should assume that minimum sized inverters are driving the inputs to the design and that the outputs must drive 4 fF of capacitance.

Now that we are more familiar with the various TCL scripts, we will see how to use the makefile to drive synthesis. Look inside the makefile and identify where the Verilog sources are defined. Notice that we are using smipsCore_synth.v instead of smipsCore_rtl.v and that the test harness is not included. You should only list those Verilog files which are part of the core; do not include non-synthesizable test harness modules. Also notice that we must identify the toplevel Verilog module in the design. We also specify several modules in the dont_touch make variable. Any modules which we list here will be marked with the DC set_dont_touch command. DC will not optimize any modules which are marked dont_touch. In this tutorial we are marking the dummy memories dont_touch so that DC does not completely optimize away the buffer chain we are using to model the combinational delay through the memory.

The build rules in the makefile will create new build directories, copy the TCL scripts into these build directories, and then run DC. Use the following make target to create a new build directory.

  % pwd
  tut4/examples/smipsv1-1stage-v/build/dc-synth
  % make new-build-dir

You should now see a new build directory named build-<date> where <date> represents the time and date. The current symlink always points to the most recent build directory.
If you look inside the build directory, you will see the libs.tcl, synth.tcl, and synth.sdc scripts but you will also see an additional make_generated_vars.tcl script. Various variables inside make_generated_vars.tcl are used to specify the search path, which Verilog files to read in, which modules should be marked dont_touch, the toplevel Verilog name, etc. After using make new-build-dir you can cd into the current directory, start the DC shell, and run DC commands by hand. For example, the following sequence will perform the same steps as in the previous section.

  % pwd
  tut4/examples/smipsv1-1stage-v/build/dc-synth
  % cd current
  % dc_shell-xg-t
  dc_shell-xg-t> source libs.tcl
  dc_shell-xg-t> analyze -library WORK -format verilog ${VERILOG_SRCS}
  dc_shell-xg-t> elaborate ${VERILOG_TOPLEVEL} -architecture verilog -library WORK
  dc_shell-xg-t> check_design
  dc_shell-xg-t> source synth.sdc
  dc_shell-xg-t> compile -map_effort medium -area_effort medium
  dc_shell-xg-t> exit
  % cd ..
The new-build-dir make target is useful when you want to conveniently run through some DC commands by hand to try them out. To completely automate our synthesis we can use the synth make target (which is also the default make target). For example, the following commands will automatically synthesize the design and save several text reports to the build directory.

  % pwd
  tut4/examples/smipsv1-1stage-v/build/dc-synth
  % make synth

You should see Design Compiler start and then execute the commands located in the synth.tcl script. Once synthesis is finished try running make synth again. The makefile will detect that nothing has changed (i.e. the Verilog source files and DC scripts are the same) and so it does nothing. Let's make a change to one of the TCL scripts. Edit synth.sdc and change the clock period constraint to 15 ns. Now use make synth to resynthesize the design. Since a TCL script has changed, make will correctly run DC again. Take a look at the current contents of dc-synth.

  % pwd
  tut4/examples/smipsv1-1stage-v/build/dc-synth
  % ls -l
  build-2007-02-26_16-15
  build-2007-02-26_16-25
  build-2007-02-26_16-31
  current -> build-2007-02-26_16-31
  CVS
  libs.tcl
  Makefile
  synth.sdc
  synth.tcl

Notice that the makefile does not overwrite build directories. It always creates new build directories. This makes it easy to change your synthesis scripts or source Verilog, resynthesize your design, and compare your results to previous designs. We can use symlinks to keep track of what various build directories correspond to. For example, the following commands label the build directory which corresponds to a 5 ns clock period constraint and the build directory which corresponds to a 15 ns clock period constraint.

  % pwd
  tut4/examples/smipsv1-1stage-v/build/dc-synth
  % ln -s build-2007-02-26_16-25 build-5ns
  % ln -s build-2007-02-26_16-31 build-15ns

Every so often you should delete old build directories to save space. The make clean command will delete all build directories, so use it carefully.
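The build directory scheme described above (a fresh timestamped directory for every run, plus a current symlink that tracks the newest one) can be sketched in a few lines. This Python sketch is only an illustration of the idea and is not how the course makefile is actually written: the new_build_dir function and its stamp parameter are hypothetical, and the sketch assumes a filesystem that supports symlinks.

```python
import os
import tempfile
import time


def new_build_dir(root, stamp=None):
    """Create root/build-<stamp> and repoint root/current at it."""
    stamp = stamp or time.strftime("%Y-%m-%d_%H-%M")
    build = os.path.join(root, f"build-{stamp}")
    # os.makedirs raises FileExistsError if the directory already exists,
    # so an earlier build is never silently overwritten.
    os.makedirs(build)

    link = os.path.join(root, "current")
    if os.path.islink(link):
        os.remove(link)
    # Use a relative target, mirroring "current -> build-<date>".
    os.symlink(os.path.basename(build), link)
    return build


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        new_build_dir(root, "2007-02-26_16-25")
        new_build_dir(root, "2007-02-26_16-31")
        print(sorted(os.listdir(root)))
        print("current ->", os.readlink(os.path.join(root, "current")))
```

Because each run gets its own directory, results from different constraint settings stay side by side and can be compared or labeled with extra symlinks such as build-5ns.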
Sometimes you want to force the makefile to resynthesize the design even though it does not detect any changes. To force a resynthesis without doing a make clean, simply remove the current symlink. For example, the following commands will force a resynthesis without actually changing any of the source TCL scripts or Verilog.

  % pwd
  tut4/examples/smipsv1-1stage-v/build/dc-synth
  % rm -rf current
  % make synth
In this section we will discuss the synth_area.rpt and the synth_timing.rpt reports. The next section will discuss the synth_resources.rpt report. The synth_area.rpt report contains area information for each module in the design. Figure 5 shows a fragment from synth_area.rpt for the SMIPSv1 unpipelined processor. We can use the synth_area.rpt report to gain insight into how various modules are being implemented. For example, just as with the synthesized.v gate-level netlist, we can use the area report to see that the vcMux2_W32_3 module includes only 30 mux cells and uses bit-level optimizations for the remaining two bits. We can also use the area report to measure the relative area of the various modules. The report clearly shows that the majority of the processor area is in the datapath. More specifically we can see that the register file consumes 90% of the total processor area. The area report reveals that the
  module vcMux2_W32_3 ( in0, in1, sel, out );
    input [31:0] in0;
    input [31:0] in1;
    output [31:0] out;
    input sel;
    wire N0, n1, n2, n3, n4, n5, n6;
    assign N0 = sel;

    buffd1 U8 ( .I(N0), .Z(n4) );
    buffd1 U9 ( .I(N0), .Z(n5) );
    buffd1 U10 ( .I(N0), .Z(n6) );
    buffd1 U11 ( .I(N0), .Z(n3) );
    nr02d0 U1 ( .A1(n3), .A2(n1), .ZN(out[0]) );
    inv0d0 U2 ( .I(in0[0]), .ZN(n1) );
    nr02d0 U3 ( .A1(n3), .A2(n2), .ZN(out[1]) );
    inv0d0 U4 ( .I(in0[1]), .ZN(n2) );
    mx02d1 U5 ( .I0(in0[31]), .I1(in1[31]), .S(n6), .Z(out[31]) );
    mx02d1 U6 ( .I0(in0[29]), .I1(in1[29]), .S(n6), .Z(out[29]) );
    // ... 26 additional mx02d1 instantiations ...
    mx02d1 U37 ( .I0(in0[10]), .I1(in1[10]), .S(n4), .Z(out[10]) );
    mx02d1 U38 ( .I0(in0[9]), .I1(in1[9]), .S(n4), .Z(out[9]) );
  endmodule

  module vcMux2_W32_2 ( in0, in1, sel, out );
    input [31:0] in0;
    input [31:0] in1;
    output [31:0] out;
    input sel;
    wire N0, n1, n2, n3, n4;
    assign N0 = sel;

    buffd1 U2 ( .I(N0), .Z(n1) );
    buffd1 U3 ( .I(N0), .Z(n2) );
    buffd1 U4 ( .I(N0), .Z(n3) );
    buffd1 U5 ( .I(N0), .Z(n4) );
    mx02d1 U1 ( .I0(in0[7]), .I1(in1[7]), .S(n1), .Z(out[7]) );
    mx02d1 U6 ( .I0(in0[1]), .I1(in1[1]), .S(n1), .Z(out[1]) );
    // ... 28 additional mx02d1 instantiations ...
    mx02d1 U35 ( .I0(in0[30]), .I1(in1[30]), .S(n4), .Z(out[30]) );
    mx02d1 U36 ( .I0(in0[31]), .I1(in1[31]), .S(n4), .Z(out[31]) );
  endmodule
Figure 5: Fragment from synth_area.rpt

register file is being implemented with approximately 1000 enable flip-flops and two large 32-input muxes (for the read ports). This is a very inefficient way to implement a register file, but it is the best the synthesizer can do. Real ASIC designers rarely synthesize memories and instead turn to memory generators. A memory generator is a tool which takes an abstract description of the memory block as input and produces a memory in formats suitable for various tools. Memory generators use custom cells and procedural place+route to achieve an implementation which can be an order of magnitude better in terms of performance and area than synthesized memories.

Figure 6 illustrates a fragment of the timing report found in synth_timing.rpt. The report lists the critical path of the design. The critical path is the slowest logic path between any two registers and is therefore the limiting factor preventing you from decreasing the clock period constraint (and thus increasing performance). The report is generated from a purely static worst-case timing analysis (i.e. independent of the actual signals which are active when the processor is running). The first column lists various nodes in the design. Note that several nodes internal to higher level
Point                                                             Incr     Path
------------------------------------------------------------------------------
clock ideal_clock1 (rise edge)                                    0.00    0.00
clock network delay (ideal)                                       0.00    0.00
proc/dpath/pc_pf/q_np_reg[21]/CP (dfnrq4)                         0.00 #  0.00 r
proc/dpath/pc_pf/q_np_reg[21]/Q (dfnrq4)                          0.25    0.25 f
proc/dpath/pc_pf/q_np[21] (vcRDFF_pf_W32_RESET_VALUE00001000)     0.00    0.25 f
proc/dpath/imemreq_bits_addr[21] (smipsProcDpath_pstr)            0.00    0.25 f
proc/imemreq_bits_addr[21] (smipsProc)                            0.00    0.25 f
dmem/imem_read_delay/row[0].bit[21].delay/Z (bufbdk)              0.16    0.41 f
...
dmem/imem_read_delay/row[3].bit[21].delay/Z (bufbdk)              0.16    0.87 f
proc/imemresp_bits_data[21] (smipsProc)                           0.00    0.87 f
proc/ctrl/imemresp_bits_data[21] (smipsProcCtrl)                  0.00    0.87 f
proc/ctrl/rf_raddr0[0] (smipsProcCtrl)                            0.00    0.87 f
proc/dpath/rf_raddr0[0] (smipsProcDpath_pstr)                     0.00    0.87 f
proc/dpath/rfile/raddr0[0] (smipsProcDpathRegfile)                0.00    0.87 f
...
proc/dpath/rfile/rdata0[5] (smipsProcDpathRegfile)                0.00    1.92 f
proc/dpath/op1_mux/in1[5] (vcMux2_W32_2)                          0.00    1.92 f
proc/dpath/op1_mux/U11/Z (mx02d1)                                 0.15    2.07 f
proc/dpath/op1_mux/out[5] (vcMux2_W32_2)                          0.00    2.07 f
proc/dpath/adder/in1[5] (vcAdder_simple_W32)                      0.00    2.07 f
proc/dpath/adder/add_29/B[5] (vcAdder_simple_W32_DW01_add_0)      0.00    2.07 f
...
proc/dpath/adder/add_29/SUM[31] (vcAdder_simple_W32_DW01_add_0)   0.00    3.71 f
proc/dpath/adder/out[31] (vcAdder_simple_W32)                     0.00    3.71 f
proc/dpath/dmemreq_bits_addr[31] (smipsProcDpath_pstr)            0.00    3.71 f
proc/dmemreq_bits_addr[31] (smipsProc)                            0.00    3.71 f
...
proc/dmemresp_bits_data[31] (smipsProc)                           0.00    4.45 f
proc/dpath/dmemresp_bits_data[31] (smipsProcDpath_pstr)           0.00    4.45 f
proc/dpath/wb_mux/in1[31] (vcMux2_W32_1)                          0.00    4.45 f
proc/dpath/wb_mux/U2/Z (mx02d2)                                   0.16    4.61 f
proc/dpath/wb_mux/out[31] (vcMux2_W32_1)                          0.00    4.61 f
proc/dpath/rfile/wdata_p[31] (smipsProcDpathRegfile)              0.00    4.61 f
...
proc/dpath/rfile/registers_reg[10][31]/D (denrq1)                 0.00    4.80 f
data arrival time                                                         4.80

clock ideal_clock1 (rise edge)                                    5.00    5.00
clock network delay (ideal)                                       0.00    5.00
proc/dpath/rfile/registers_reg[10][31]/CP (denrq1)                0.00    5.00 r
library setup time                                               -0.18    4.82
data required time                                                        4.82
------------------------------------------------------------------------------
data required time                                                        4.82
data arrival time                                                        -4.80
------------------------------------------------------------------------------
slack (MET)                                                               0.01
modules have been cut out to save space. The last column lists the cumulative delay to that node, while the middle column shows the incremental delay. We can see that the critical path starts at bit 21 of the PC register; goes through the combinational read of the instruction memory; goes through the read address of the register file and out the read data port; goes through the operand mux; through the adder; out the data memory address port and back in the data memory response port; through the writeback mux; and finally ends at bit 31 of register 10 in the register file. The large buffers in the memory (the bufbdk cell in the dmem module) model the combinational delay through these memories. We can use the delay column to get a feel for how much each module contributes to the critical path: the combinational memories contribute about 0.6 ns; the register file read contributes about 1.1 ns; the adder contributes 1.7 ns; and the register file write requires 0.2 ns. The data arrives at the end of the critical path after 4.80 ns, which is less than the 5 ns clock period constraint. Notice, however, that the final register file flip-flop has a setup time of 0.18 ns, so the data must arrive by 4.82 ns (5 ns - 0.18 ns). The critical path plus the setup time (4.80 ns + 0.18 ns = 4.98 ns) is just fast enough to meet the clock period constraint.
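The arithmetic behind the timing report can be reproduced by hand. The following Python sketch uses only numbers that appear in the report fragment; the function and variable names are hypothetical, and the segment boundaries are our own reading of which report rows bracket each module.

```python
CLOCK_PERIOD_NS = 5.00   # from the create_clock constraint
SETUP_TIME_NS = 0.18     # library setup time of the final flip-flop
DATA_ARRIVAL_NS = 4.80   # cumulative delay at the register file D pin

# (start, end) cumulative delays in ns, read off the Path column of the report.
SEGMENTS_NS = {
    "instruction memory read": (0.25, 0.87),
    "register file read": (0.87, 1.92),
    "adder": (2.07, 3.71),
    "register file write": (4.61, 4.80),
}


def required_time(period, setup):
    """Latest time the data may arrive and still be captured."""
    return period - setup


def slack(period, setup, arrival):
    """Positive slack means the path meets timing."""
    return required_time(period, setup) - arrival


def contributions(segments):
    """Delay contributed by each segment, rounded to the report's precision."""
    return {name: round(end - start, 2) for name, (start, end) in segments.items()}


if __name__ == "__main__":
    for name, delay in contributions(SEGMENTS_NS).items():
        print(f"{name}: {delay:.2f} ns")
    print(f"required time: {required_time(CLOCK_PERIOD_NS, SETUP_TIME_NS):.2f} ns")
    print(f"slack: {slack(CLOCK_PERIOD_NS, SETUP_TIME_NS, DATA_ARRIVAL_NS):.2f} ns")
```

With the rounded values printed in the report this gives a slack of 0.02 ns, while the report itself shows 0.01 ns. The small difference is presumably because the tool works with unrounded delays internally; either way the slack is positive and the path meets timing.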
Verilog model so that VCS can simulate the component. You can do this by adding the following command line parameter to VCS.

  -y $(SYNOPSYS)/dw/sim_ver +libext+.v+

We suggest only using direct instantiation as a last resort since it creates a dependency between your high-level design and the DesignWare libraries, and it limits the options available to Design Compiler during synthesis.
****************************************
Design : smipsCore_synth/proc/dpath/adder (vcAdder_simple_W32)

Resource Sharing Report
|          |          |            | Contained |                      |
| Resource | Module   | Parameters | Resources | Contained Operations |
===============================================================================
| r242     | DW01_add | width=32   |           | add_29               |

Implementation Report
|        |          | Current        | Set            |
| Cell   | Module   | Implementation | Implementation |
=============================================================================
| add_29 | DW01_add | pparch         |                |

****************************************
Design : smipsCore_synth/proc/dpath/adder (vcAdder_simple_W32)

Resource Sharing Report
|          |          |            | Contained |                      |
| Resource | Module   | Parameters | Resources | Contained Operations |
===============================================================================
| r242     | DW01_add | width=32   |           | add_29               |

Implementation Report
|        |          | Current        | Set            |
| Cell   | Module   | Implementation | Implementation |
=============================================================================
| add_29 | DW01_add | rpl            |                |
Review
The following sequence of commands will set up the 6.375 toolflow, check out the SMIPSv1 processor example, and synthesize the design.

  % add 6.375
  % source /mit/6.375/setup.csh
  % mkdir tut4
  % cd tut4
  % cvs checkout examples/smipsv1-1stage-v
  % cd examples/smipsv1-1stage-v/build/dc-synth
  % make