
Key features of the design methodology enabling a multi-core SoC implementation of a first-generation CELL processor

Asia and South Pacific Conference on Design Automation, 2006., 2006
Key Features of the Design Methodology Enabling a Multi-Core SoC Implementation of a First-Generation CELL Processor

Dac Pham, Hans-Werner Anderson, Erwin Behnen, Mark Bolliger, Sanjay Gupta, Peter Hofstee, Paul Harvey, Charles Johns, Jim Kahle, Atsushi Kameyama¹, John Keaty, Bob Le, Sang Lee, Tuyen Nguyen, John Petrovick, Mydung Pham, Juergen Pille, Stephen Posluszny, Mack Riley, Joseph Verock, James Warnock, Steve Weitzel, Dieter Wendel
IBM Systems and Technology Group, Austin, TX
¹Toshiba America Electronic Components, Austin, TX

Abstract— This paper reviews the design challenges that current and future processors must face, with stringent power limits and high frequency targets, and the design methods required to overcome these challenges and address the continuing Giga-scale system integration trend. This paper then describes the details behind the design methodology that was used to successfully implement a first-generation CELL processor, a multi-core SoC. Key features of this methodology are broad optimization with fast rule-based analysis engines using macro-level abstraction for constraint propagation up and down the design hierarchy, coupled with accurate transistor-level simulation for detailed analysis. The methodology fostered the modular design concept that is inherent to the CELL architecture, enabling a high frequency design by maximizing custom circuit content through re-use, and balanced power, frequency, and die size targets through global convergence capabilities. The design has roughly 241 million transistors implemented in 90nm SOI technology with 8 levels of copper interconnect and one local interconnect layer. The chip has been tested at various temperatures, voltages, and frequencies. Correct operation has been observed in the lab on first-pass silicon at frequencies well over 4GHz.
Index Terms—CELL Processor, multi-core, SOC, SOI, modularity, re-use, 64-bit Power Architecture, multi-threading, synergistic processor, flexible IO, Linux, multi-operating system, virtualization technology, real-time system, hardware content protection, correct-by-construction, thermal management, power management, clock distribution, high-performance latch, local clock buffer, design hierarchy, design environment, design dependency solution, linear sensor, digital thermal sensor.

I. INTRODUCTION

The architectural vision of "bringing supercomputer power to everyday life" is the driving force behind the CELL processor design, setting a new performance standard by exploiting parallelism while achieving high frequency [1]. CELL is designed for natural human interaction: photo-realistic rendering, predictable real-time response, and virtualized resources for concurrent activities. CELL supports multiple operating systems, including Linux, and is designed for flexibility across a wide variety of application domains. Other attributes include hardware content protection and extensive single-precision floating-point capability. By extending the Power Architecture with Synergistic Processor Elements (SPEs) having coherent DMA access to system storage, and with multi-operating-system resource management, CELL supports concurrent real-time and conventional computing. With a dual-threaded Power Processor Element (PPE) and eight SPEs, this implementation is capable of 10 simultaneous threads and over 128 outstanding memory requests. The first-generation CELL processor consists of the PPE and its L2 cache, eight SPEs [2] each with its own local memory (LS) [3], a high-bandwidth internal Element Interconnect Bus (EIB) [4], two configurable non-coherent I/O interfaces, a Memory Interface Controller (MIC), and a Pervasive unit that supports extensive test, monitoring, and debug functions. The high-level chip diagram is shown in figure 1 below.

Fig. 1: Processor high level diagram
II. THE DESIGN CHALLENGES FOR GIGA-SCALE INTEGRATION

II.1. Power & Frequency Walls

Over the last decade, technology scaling has resulted in leakage power increases of over 1000X (fig. 2). With gate dielectrics and other device features fast approaching fundamental limits, a continuation of historical trends would see passive power surpassing active power within the next few years. Furthermore, the technique of increasing frequency by deepening the pipeline has reached a point of diminishing performance returns once power is taken into consideration. In the face of this power/performance wall, increased design efficiency becomes essential. These factors drove the decision to support a wider processor issue width (e.g. multi-threading) and to increase the number of architected registers.

Fig. 2: Power Wall (power density in W/cm² vs. gate length in microns, 1994-2004, showing active and passive power trends and the gate dielectric approaching a fundamental limit of a few atomic layers)

II.2. System Trends and Giga-Scale Integration

Increased system integration is driving processors to take on many of the functions typically associated with the system: off-load and acceleration, and integration of bridge chips, as shown in figure 3.

Fig. 3: System Trends

III. DESIGN IMPLEMENTATION TO ADDRESS POWER AND FREQUENCY WALLS

III.1. Components and Libraries Design

Given a short cycle time target, a significant amount of the chip power is consumed by latches, flip-flops, and other clocked elements. However, the delay overhead imposed by standard flip-flops is considerable. Therefore, a rich set of latches and flip-flops was developed to allow for both power and delay optimization. The basic local clock splitter components are shown in figure 4. In addition to test controls, the base block accepts a local clock gating signal, with a small setup time relative to the falling global clock (cycle boundary). Input setup and hold times are specified against the falling clock edge, as a result of the built-in latching action of the base block. Local clocks, to drive typical master-slave flip-flops, are derived from the common output point of the base block. For timing-critical paths, a high-performance latch (HPL) [5, 6] was designed which combines a wide mux (up to 10-way), relying on a dynamic NOR gate, with a set-reset latch (fig. 5). The dynamic NOR starts evaluating with the launch clock, and the input data hold time is limited when all sel_b inputs are forced high after a fixed delay.

Fig. 4: Local Clock Generation

Dynamic circuits were used in several critical macros, in the arrays, and in PLAs. All dynamic macros were latch-bounded (macro-to-macro signals are static). Signals feeding dynamic logic were usually launched from the master portion of a flip-flop and ANDed with the slave clock (lclk) to provide a signal which resets to 0 every cycle when lclk is low. Dynamic logic was always followed by a set-reset latch similar to that used for the HPL shown earlier. In addition, various rules were adopted to ensure a "correct-by-construction" design methodology. All circuits used a common set of clocking components to ensure uniformity across the design, with no rotation of the components allowed. An extensive set of electrical and physical checks and audits was put in place. Finally, a customized series of yield-related checking rules was employed to ensure manufacturability of the chip. The 90nm PD SOI technology offers three oxide thicknesses (thin oxide, thick oxide for high-voltage devices, and decoupling capacitors) and four different VT settings for the thin-oxide devices. Since power was such a critical design issue, static circuit implementations were favored for the majority of the design. A variety of static circuit families were used in full custom designs, with tuners and device width optimizers used for power-performance tuning. Higher threshold voltage devices were inserted wherever possible to cut down on leakage current (no low-VT devices were used), and the threshold voltage for the array devices was adjusted independently from that of the logic devices. Approximately 40% of the logic was implemented as synthesized random logic macros (RLMs), with the rest being full custom design. The local clocking described in figure 4 has several important features.
Overall clock latency and absolute clock uncertainty are minimized by this scheme, since there are only three gate delays between the global clock input and the data launch clock (lclk). Also, the common point for both launch and capture clocks is at the output of the base block, minimizing the relative uncertainty between launch and capture clocks. When clocks are in the gated state, lclk is held inactive and the capture clock is held high. The system state is therefore stored in the slave latch.

Fig. 5: High performance latch

For power reduction, the standard flip-flop can be run in pulsed mode, with the clock configuration shown in figure 6. In this case the slave clock is pulsed in normal operation, and the master clock is held high. There is also a "chicken switch" which allows running in normal master-slave clocked mode if race problems are seen in the hardware.
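Pulsed operation trades the master latch's race protection for lower power, which is why the chicken switch exists: the capture latch is transparent for the whole pulse, so short paths must cover the pulse width in addition to the usual hold requirement. The check below is an illustrative sketch of that constraint with hypothetical numbers, not the IBM timing rule itself:

```python
def pulsed_hold_ok(t_cq_min_ps, t_logic_min_ps, pulse_width_ps,
                   t_hold_ps, skew_ps):
    """Min-delay (race) check for a pulsed capture latch.

    The earliest-arriving new data must not reach the capture latch
    before its transparency pulse closes:
        t_cq_min + t_logic_min >= pulse_width + t_hold + skew
    A conventional master-slave flip-flop only needs to cover
    t_hold + skew, which is why pulsed mode is riskier and is backed
    by a "chicken switch" returning to master-slave clocking.
    """
    return t_cq_min_ps + t_logic_min_ps >= pulse_width_ps + t_hold_ps + skew_ps


# A short path that is safe under master-slave clocking (hold 10ps,
# skew 12ps) can fail once a 60ps transparency pulse is added:
print(pulsed_hold_ok(35, 20, 0, 10, 12))   # master-slave-like check
print(pulsed_hold_ok(35, 20, 60, 10, 12))  # pulsed mode
```

All numbers here (pulse width, hold time, minimum delays) are made up for illustration; the paper's methodology additionally scales the margin with the racing path delay, as described below.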
A non-scannable pulsed latch was also supported, minimizing area, power, and latency in situations where a longer hold time could be tolerated.

Fig. 6: Standard Flip-Flops

With the widespread use of pulsed latches, and the controlled use of clock delay elements, it was very important to have a robust methodology to check for race conditions. The timing methodology required a design margin to be applied which scaled with the total path delay of the racing paths (in addition to a certain fixed margin), as measured from the common point of divergence. This ensured that race conditions with larger uncertainties were designed with correspondingly larger margins.

III.2. Clock Distribution

The chip contains three distinct clock distribution systems, each sourced by an independent PLL, to support processor, bus interface, and memory interface requirements. A main high-frequency clock grid covers over 85% of the chip, delivering the clock signal to the processors and miscellaneous circuits. Second and third clock grids, each operating at a fraction of the main clock frequency, are interleaved with the main clock grid structure, creating multiple clock frequency islands within the chip. All clock grids were constructed on the lowest-impedance final two layers of metal, and were supported by a matrix of over 850 individually tuned buffers. This enabled control of the clock arrival times and skews, especially on the main clock grid, which supports regions of widely varying clock load densities. As shown in figure 7, the final worst-case clock skew across the chip was less than 12ps. High-frequency clock distribution optimization and verification needed models which included frequency-sensitive inductance and resistance phenomena [7]. These models were built from data extracted from combined clock and chip power distributions, two-dimensional cross sections, and capacitance models extracted from three-dimensional sections. Reduced clock grid power dissipation was achieved through optimization of buffer drive strengths, grid wire periodicity, clock wire to return path spacing, and clock twig wire widths. Together, these techniques lowered clock distribution power dissipation by more than 20% compared to the previous design [8].

Fig. 7: Clock skew map

III.3. Thermal and Power Management

This SOC presented new challenges in chip thermal design. The higher heat flux from smaller hot spots hindered spreading of the heat across the silicon substrate [9]. Extensive thermal analysis carried out early in the design cycle ensured that the maximum junction temperature, as well as the average temperature of the die, would end up within design specifications. Various workloads were simulated for each component and power maps were constructed. From these maps, a matrix of small power sources was created for use with package and heat sink models. Thermal models were then created and used to simulate both steady-state and transient thermal behavior. These data were analyzed to improve the design and floorplan of the chip, and also provided feedback for improved thermal sensor design (fig. 8). Due to local heating caused by individual processing units, sophisticated local thermal sensing strategies and thermal control mechanisms were used to allow an aggressive low-cost thermal design. The processor contains a linear sensor and 10 local digital thermal sensors. The linear sensor is essentially a diode connected to two external I/Os, used to measure the die's global temperature and to adjust the system cooling. The digital thermal sensors provide early warning of any temperature increase and thermal protection.

Fig. 8: Chip thermal map

III.4. Pervasive Design

The pervasive logic comprises all the functions necessary for initialization, clock control, test, performance monitoring, and error checking and reporting. For a complex multi-core processor, the design of the pervasive logic is a key emphasis early in the design cycle. The pervasive function is implemented as a centralized controller and in distributed units across the chip. A Performance Monitor (PFM) is provided to assist with the debug and tuning of software applications, and an on-board logic analyzer (LA) assists with hardware debug. Both the PFM and LA are capable of capturing information at speed from all units across the chip. The PFM and LA also provide the capability to view single or multiple units. Extensive debug and control capabilities are provided that can be accessed via an IEEE 1149.1 interface. For manufacturing support, the pervasive unit provides array and logic built-in self-test (BIST) engines. The ability to scan at speed is provided to assist with detection of AC-related faults. Electronic fuses are used extensively for array repair and selected chip personalization.

III.5. Physical Design

Figure 9 shows the die photo, with roughly 234M transistors from 17 physical entities, 580K repeaters, and 1.4M nets implemented in 90nm SOI technology with 8 levels of copper interconnect and one local interconnect layer.

Fig. 9: Die Photo

At the center of the chip is the EIB, composed of four 128-bit data rings plus a 64-bit tag, operated at half the processor clock rate. The wires were arranged in groups of four, interleaved with GND and VDD shields twisted at the center to reduce coupling noise on the two unshielded wires. To ensure signal integrity, over 50% of global nets were engineered with 32K repeaters. The SOC uses 2965 C4s with four regions of different row/column pitches, attached to a low-cost organic package.
This structure supports 15 separate power domains on the chip, many of which overlap physically on the die. The processor element design, power and clock grids, global routing, and chip assembly support a modular design in a building-block-like construction.

IV. KEY FEATURES OF DESIGN METHODOLOGY

IV.1. Hierarchical Design and Rule-Based Optimization Methodology

IV.1.1. Design Environment and Database Structure

There were many challenges in meeting the defined objectives for setting up the design environment and database system for the first-generation CELL processor project. First, the methodology had to support concurrent design execution of each partition (a major core of the design such as the PPE or SPE), meaning design work had to be done simultaneously and independently by different teams located in different geographical areas. The inherently hierarchical nature of the design was carefully considered when defining the physical partitions in order to minimize the impact of creating discrete physical partitions. Strict naming convention schemata were applied to the entire design hierarchy to facilitate parallelization and to prevent collisions. Second, the database structure had to support both the hierarchical objective and multiple design disciplines, namely logic design and verification, physical design and verification, integration, timing, etc., to allow for efficient schedule interlocking. An AFS network file system was used to allow transparent access to design data by all team members across multiple geographical locations. Additionally, the database structure had to support common design libraries and many "shared" macros used in multiple units or partitions. Any dependency conflicts caused by usage of different levels of these libraries and macros across the design hierarchy were resolvable by design dependency solution algorithms, supported by the design environment. Third, the design environment had to fully support the custom processor design methodology by providing tools, processes, and a workspace for every designer. The design environment fills the vital linkage between designers and the supporting database, and must do so in a simple and effective way. Design environment initialization was done with a single command that set up all required tools and environment variables necessary for design work and the database interface.

IV.1.2. Front End Design and Verification Methodology

The front-end logic design is captured in VHDL, with all verification done at the behavioral level. The chip verification uses a top-down specification / bottom-up implementation strategy. For custom circuits, the schematic netlists and behavioral VHDL are verified for correctness with equivalency checking tools. The design is divided into partitions, islands, units, and modules. All the verification environments and test coverage needed to create a high-quality chip are planned during the High Level Design phase. The verification process is hierarchical, with all the environments and checkers created at lower levels being used in the higher-level environments. For performance and throughput purposes, there are options to turn off some checkers at run time. The test plan is based on the coverage plan to guarantee 100% coverage with written tests. The coverage is also hierarchical; i.e., lower environments designate what portions of the lower-level coverage need to be hit at higher levels. Extra coverage and checkers are also added at higher levels for corner cases. For a complex and large design such as CELL, a cycle-based simulator is used for all simulation. Both C++-based and Specman languages are supported in the verification flow. Apart from the specialized test case generators used in processor core verification, Specman, C++, and Perl test cases are used for the rest of the design. Formal verification is also done at the module level in various parts of the chip. Special tools were employed for asynchronous clock boundary verification, since the simulator used is cycle-based. In addition to functional verification, the pervasive design is also verified at various levels. This includes scan verification, POR verification, test mode verification, RAS verification, trace and debug bus verification, etc. Hardware-based accelerators are also used for software workloads, boot code, and OS boot verification. The use of grid computing for processor design was demonstrated in this project: over 1.5 trillion simulation cycles, or about 2 million hours of simulation, were completed on multiple simulation farms spanning IBM US and Germany. This was a key attribute: over 98% of total logic bugs were found, the processor core VHDL model booted Linux, and chip bring-up exercisers ran in simulation prior to design tape-out.

IV.1.3. Physical Synthesis

The increased volume of synthesized logic on the CELL processor required maximizing the productivity of the random logic macro (RLM) designers. This was accomplished by accelerating timing closure and automating the build process. The design of RLMs used physical synthesis to accelerate timing closure. Physical footprints were imported into the synthesis tool to allow accurate timing estimation during netlist creation and placement. The placed, optimized netlists were then fed into the physical build process. Early estimated abstracts allowed for synthesis and sizing before final contracts were available from the unit integrator.

Fig. 10: Automated RLM Build Process

The RLM physical build process was streamlined and automated as shown in figure 10.
This was accomplished by creating a supervisor program to "drive" over 30 individual design steps, from netlist import through final checking. The supervisor script used customizable templates to control the individual tool interfaces, giving designers an automated solution with the flexibility of a manual build flow. Job management was further improved by the Report Generation Tool (XRG), which generated web-based reports that allowed designers to easily identify failed job steps and quickly access log files. To ensure high availability of the tool set, automated daily regression tests were performed that exercised the build process and evaluated the results. This helped identify problems before they were encountered by designers. Custom methodology checks were implemented to ensure that RLMs met design specifications before being delivered to unit integrators.

IV.1.4. Static Timing Methodology

To simplify timing closure and reduce runtime, all latches were modeled for the late-mode timing run in the nominal process corner as non-transparent, to remove timing loops. Custom designers, however, were still able to use cycle-stealing techniques with an internally developed algorithm, which allowed the designer to specify the effective cycle boundary point within a given window of transparency. A timing adjust could be applied for all latches connected to a given LCB, allowing for an improved setup time but delaying the launch of the data out of the L2 latch by a corresponding amount beyond that which the actual non-transparent latch modeling would require. To lower the cost of high-volume production, all Local Clock Buffers (LCBs) and latches are designed to support at-speed scan to reduce manufacturing test time. For power management, each LCB included global and local clock gating signals. These signals have to work correctly on a cycle-by-cycle basis to allow switching from scan to functional mode in one clock cycle.
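The cycle-stealing adjustment described above effectively moves slack across a latch boundary: the capture side gains exactly what the next stage's launch loses. A minimal sketch of that bookkeeping, with hypothetical picosecond values (this is not the internal IBM algorithm):

```python
def apply_cycle_steal(setup_slack_ps, next_stage_slack_ps, adjust_ps, window_ps):
    """Move up to `adjust_ps` of cycle time across a latch boundary.

    Improving the capture-side setup slack by `adjust_ps` delays the
    launch of data out of the latch by the same amount, so the next
    stage loses exactly what this stage gains. The adjustment must
    stay inside the latch's window of transparency.
    """
    if not (0 <= adjust_ps <= window_ps):
        raise ValueError("adjust must lie within the transparency window")
    return setup_slack_ps + adjust_ps, next_stage_slack_ps - adjust_ps


# Example: a path failing by 30ps borrows 30ps from a next stage
# that has 80ps of slack to spare (window of transparency: 50ps).
before = (-30, 80)
after = apply_cycle_steal(*before, adjust_ps=30, window_ps=50)
print(after)  # both stages now meet timing
```

The conservation property (total slack is unchanged) is what makes cycle stealing safe to reason about at the chip level; only the boundary point moves.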
At-speed scan operation allows both functional and scan paths to be timed in a single timing run, without the need to apply different phases to distinguish scan signals from regular ones. For the early-mode timing run in the fast corner, we wanted to ensure enough margin to cover the wide process window needed for a high-volume product. To achieve this, we used the Linear Combination of Delays (LCD) feature of our gate-level static timer, which allows different process corners to be combined [10]. Usually, the coefficients for the three corners, best, nominal, and worst, add up to 1, e.g. 15% best, 70% nominal, and 15% worst case. We used a coefficient of 1.27, plus a fixed offset, for the worst-case calculated timing delays. This increases the hold time margin by slowing down the clock propagation.

IV.1.5. Chip Integration & Physical Verification Methodology

The chip integration methodology was created to support parallel, concurrent design at high clock frequencies. Multiple levels of hierarchy were used to manage the design problem and enable concurrency. The high-level design process consisted of top-down constraint setting, which led to the division of the design into functional islands and units. The constraints became a design budget for each floorplannable object. Those budgets dictated the size, aspect ratio, rectilinear outline, pin locations, and routing layers used for each object. The implementation process fulfilled the constraints passed down the hierarchy. The integration methodology was tightly woven with timing throughout the design process. Very early in the floorplanning process, timing shells represented each object in the hierarchy. These shells enabled early timing feedback to drive partitioning, pipelining, and buffering decisions from the outset.
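The LCD calculation described above is a weighted blend of per-corner delays. A minimal sketch of the idea, with hypothetical delay values; the worst-case coefficient of 1.27 plus a fixed offset follows the hold-margin recipe from the text:

```python
def lcd_delay(best_ps, nominal_ps, worst_ps,
              coeffs=(0.15, 0.70, 0.15), offset_ps=0.0):
    """Linear Combination of Delays: blend per-corner delays.

    For typical analysis the three coefficients sum to 1. For the
    early-mode hold check described in the text, the clock path can be
    pessimistically slowed with a worst-case coefficient greater than 1
    plus a fixed offset, increasing the effective hold margin.
    """
    cb, cn, cw = coeffs
    return cb * best_ps + cn * nominal_ps + cw * worst_ps + offset_ps


# Typical blended delay (coefficients sum to 1):
blended = lcd_delay(80.0, 100.0, 130.0)
# Pessimistic clock-path delay for hold checks (hypothetical 10ps offset):
hold_clock = lcd_delay(0.0, 0.0, 130.0, coeffs=(0.0, 0.0, 1.27), offset_ps=10.0)
print(blended, hold_clock)
```

Slowing only the calculated clock propagation in this way makes early-mode (race) checks harder to pass, which is exactly the conservatism a high-volume part needs across its process window.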
As the data evolved through the design process, shell timing rules and Steiner estimates became schematic-based timing rules with 2-D extracted parasitics, and finally fully extracted timing rules with 3-D extracted routing parasitics. Buffering of signals is performed by an internally developed algorithm: unit floorplans are filled with 4-, 8-, 16-, or 32-bit buffer packs with all bits initially unused, and a process based on Dijkstra's algorithm finds the shortest path from source to sinks across the available buffer packs. Routing was performed using a gridded router and 13 distinct non-default routing rules. Timing estimates that used particular non-default rules carried directly into the routing process, ensuring that actual routes would mirror the estimates. Later in the design cycle, each partition would analyze and correct coupled-noise events predicted on closely routed nets; noisy nets were fixed either through rerouting or by buffering. Electromigration and missing-via analysis on the power bus was also performed to ensure that the power distribution met design requirements. Physical verification of all floorplan blocks consisted mainly of LVS, DRC, methodology checks, and formal netlist verification. All physical verification is done with cover cells that represent fixed obstructions pushed down from the parent or the routing contract. Checking with these views ensures that the object will not create a conflict when stitched into each level of hierarchy. Special methodology checks enforce specific design requirements beyond traditional design rules, including checks for pin accessibility, design shapes properly within the boundary, and power pins on the proper pitch, among others. The formal verification process ensured that the final, buffered netlist was Boolean equivalent to the original VHDL description.

IV.2. Transistor Level Analysis

IV.2.1.
Circuit & Array Methodology for an 11FO4 Design

For an 11FO4 design within an air-cooled power envelope, special emphasis was placed on power distribution, power consumption, clock distribution, signal distribution, variation due to hot spots, and inductance effects. Furthermore, the chip team also had to plan early in the design cycle for multiple clock domains, cross-chip variations in delay, leakage, intra-chip interconnections, and array bit-cell stability. Strict design guidelines in layout and circuit topology were enforced to minimize design variations. A major focus of the circuit methodology is on array design, since memory arrays occupy an increasingly large share of chip area and are where the more aggressive design techniques are used to ensure performance. There are three major challenges for array design at low voltage levels: stable cell operation (for functionality), leakage-current reduction (for low power), and management of speed variations (for yield). A critical part of the circuit/array methodology is a detailed statistical analysis of cell stability, leakage, and yield in the early design phase. This analysis determines the optimal cell size for a given technology to achieve stability, power, and yield goals while reducing chip area. The analysis also helps the design and manufacturing teams decide on the device menu for the technology. A sample result of the statistical analysis is shown in Figure 11 below, which plots cell failures at various voltage levels for the peripheral logic circuit (Vdd1) and the core cells (Vdd2). Note that as the design enters the sub-1V operating range, it may be necessary to have a separate supply for the array core cells. This design decision has significant impact across the whole methodology: chip planning/integration, packaging, libraries, and tools all have to be adapted to support multiple supply domains.
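The kind of statistical cell-stability analysis described above can be illustrated with a toy Monte Carlo sketch. The linear noise-margin model and every constant below are invented for illustration only; the production analysis was far more detailed than this:

```python
import random

def cell_fail_probability(vdd_cell, trials=20000, seed=1):
    """Toy Monte Carlo estimate of SRAM cell failure rate under
    device-threshold variation. The margin model and all constants
    are illustrative assumptions, not production data."""
    rng = random.Random(seed)
    sigma_vt = 0.08  # assumed Vt mismatch sigma, in volts
    fails = 0
    for _ in range(trials):
        dvt = rng.gauss(0.0, sigma_vt)
        # assumed static noise margin: grows with the cell supply,
        # shrinks with threshold mismatch
        snm = 0.25 * vdd_cell - abs(dvt)
        if snm <= 0.0:
            fails += 1
    return fails / trials

# The failure rate falls steeply as the cell supply (Vdd2) rises,
# which is why a separate array supply becomes attractive below 1 V.
p_low, p_high = cell_fail_probability(0.7), cell_fail_probability(1.3)
```

Even this crude model reproduces the qualitative shape of Figure 11: sweeping the two supplies independently traces out a pass/fail boundary, and raising only the cell supply recovers stability without paying the full power cost in the periphery.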
Fig. 11: Statistical analysis of SRAM cells – pass/fail regions over Vdd1 (peripheral) versus Vdd2 (cells), each swept from 0.7 V to 1.5 V

Transistor-level analysis also plays an important role in the array verification methodology. For arrays, the high-level design begins with the RTL and the implementation begins with the schematic design, and there is a long lead time before layout is completed. It is therefore very important to have a methodology that provides accurate parasitic and interconnect models at the schematic level. The design methodology allows for structured early floorplanning with accurate wire-load models or Steiner-based routing approximations to provide sufficient accuracy for schematic transistor-level analysis. Logic extraction from the array schematic is performed to build the test model, except for the array core, which is synthesized from the high-level RTL to reduce the model size. Symbolic switch-level simulation is then run on the schematic and verified against the RTL as well as the test model. ATPG and marching patterns are also run on the array schematic using a fast circuit simulator to verify the test patterns against the schematic.

IV.2.2. Transistor Level Timing

Static transistor-level timing (TLT) was an integral part of the design methodology, with all custom macros, arrays, and even standard-cell-based RLMs running through the tool, thereby providing comprehensive and consistent timing analysis and models. To meet aggressive frequency goals while satisfying area and power constraints, designers need to be able to quickly determine critical paths and delays in a circuit. Static timing at the transistor level using TLT helped achieve these goals [11].
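The critical-path determination that static timing provides rests on a longest-path computation over the circuit graph. A toy gate-level sketch of that computation, with a hypothetical three-gate netlist; this is not the gate- or transistor-level timer actually used on CELL:

```python
def arrival_times(netlist, primary_inputs):
    """Late-mode static timing on a combinational circuit graph:
    the arrival time at each gate output is its gate delay plus the
    latest fanin arrival (a longest-path computation). netlist maps
    each gate name to (delay, [fanin names]); primary inputs arrive
    at t = 0."""
    at = {pi: 0.0 for pi in primary_inputs}

    def arrival(node):
        if node not in at:  # memoized depth-first traversal
            delay, fanins = netlist[node]
            at[node] = delay + max(arrival(f) for f in fanins)
        return at[node]

    for gate in netlist:
        arrival(gate)
    return at

# Hypothetical three-gate path: a/b -> n1 -> n2 -> out
nets = {"n1": (10.0, ["a", "b"]), "n2": (15.0, ["n1", "c"]), "out": (5.0, ["n2"])}
times = arrival_times(nets, ["a", "b", "c"])  # times["out"] == 30.0
```

A transistor-level timer such as TLT replaces the fixed per-gate delays here with delays simulated through channel-connected groups of transistors, but the graph traversal that exposes the critical path is the same idea.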
The TLT team for the CELL project improved the existing transistor-level timing methodology [12] in four areas: improved timing-margin calculation [13], local latch-transparency modeling [14], pulse-waveform timing in TLT templates [15], and an improved method of timing-model abstraction for simultaneously switching signals [16]. TLT is a transistor-level static timing tool that extends the capabilities of a gate-level static timer to the transistor level. These extensions include a state-analysis engine, used to understand the timing behavior of groups of transistors and build timing models for them, and a fast circuit simulator [17], used for calculating propagation delays and waveforms through these transistor groups. TLT uses piecewise-linear waveforms (rather than ramps) for timing accuracy and a modified version of AWE/RICE [18, 19] to propagate these waveforms through RC interconnect. TLT runs on flattened netlists from either schematics or extracted physical data. In addition to generating transistor-level timing reports, it compiles a timing model, or rule, that is used for static timing at higher levels of the design hierarchy.

IV.2.3. Modularity and Integration of Black Box IP

The architectural modularity of the CELL processor also projects into the physical domain, where all 8 SPEs are instantiations of a single SPE design partition. To make this work correctly, the interaction between the local SPE layout and the global physical design structures had to be identical at all 8 locations where each SPE is instantiated. The C4 footprint, power busses, clock sector buffers, pervasive elements, and EIB components all had to be designed up front to fit into this scheme in a modular way. The other extreme was taken with the integration of the high-speed IO and memory interfaces on the left and right sides of CELL [20].
These partitions were designed by a third-party vendor as “black box” IP and used all layout resources, from the silicon up to the C4 pins, over their area. The only interaction with the core/chip happened at the boundary, where predetermined power and signal pins were provided to cross the interface.

V. CONCLUSION

In conclusion, special circuit techniques, rules for modularity and reuse, customized clocking structures, and unique power and thermal management concepts were applied to optimize the design [21]. Correct operation has been observed in the lab on first-pass silicon at frequencies well over 4 GHz, as shown in Figure 12.

Fig. 12: First-pass hardware in the lab – measured Fmax frequency (GHz) versus supply voltage (V) at 85°C

VI. ACKNOWLEDGMENTS

The authors gratefully acknowledge the many contributions from the entire Sony-Toshiba-IBM team, who worked tirelessly side-by-side on the design of this processor.

VII. REFERENCES

[1] D. Pham et al., “The Design and Implementation of a First-Generation CELL Processor,” ISSCC 2005 Digest of Technical Papers, Feb. 2005, pp. 184-185.
[2] B. Flachs et al., “The Microarchitecture of the Streaming Processor for a CELL Processor,” ISSCC 2005 Digest of Technical Papers, Feb. 2005, pp. 134-135.
[3] T. Asano et al., “A 4.8GHz Fully Pipelined Embedded SRAM in the Streaming Processor of a CELL Processor,” ISSCC 2005 Digest of Technical Papers, Feb. 2005, pp. 486-487.
[4] S. Clark et al., “IBM CELL Interconnect Unit, Bus and Memory Controller,” Hot Chips ’05, Aug. 2005, Paper #1.2.
[5] F. Klass, C. Amir, A. Das, K. Aingaran, C. Truong, R. Wang, A. Mehta, R. Heald, and G. Yee, “A New Family of Semi-dynamic and Dynamic Flip-Flops with Embedded Logic for High-Performance Processors,” IEEE J. Solid-State Circuits, vol. 34, pp. 712-716, 1999.
[6] L. Sigal, J.D. Warnock, B.W. Curran, Y.H. Chan, P.J. Camporese, M.D. Mayo, W.V. Huott, D.R. Knebel, C.T. Chuang, J.P. Eckhardt, and P.T. Wu, “Circuit Design Techniques for the High-Performance CMOS IBM S/390 Parallel Enterprise Server G4 Microprocessor,” IBM J. Res. & Dev., vol. 41, pp. 489-503, 1997.
[7] P. J. Restle et al., “A Clock Distribution Method for Microprocessors,” IEEE J. Solid-State Circuits, vol. 36, pp. 792-799, May 2001.
[8] P. J. Restle et al., “The Clock Distribution of the Power4 Microprocessor,” ISSCC 2002 Digest of Technical Papers, vol. 45, pp. 144-145.
[9] K. Yazawa and M. Ishizuka, “Thermal Modeling with Transfer Function for the Transient Chip-On-Substrate Problem,” Thermal Science and Engineering, vol. 13, no. 1, Heat Transfer Society of Japan, 2005, pp. 37-40.
[10] S. Posluszny et al., “Timing Closure by Design,” Proceedings of the 37th Design Automation Conference, vol. 37, pp. 712-717, June 2000.
[11] V. Rao, J. Soreff, T. Brodnax, and R. Mains, “EinsTLT: Transistor Level Timing with EinsTimer,” Proc. of Int. Workshop on Timing Issues (TAU), 1999.
[12] S. Y. Lee, J. Warnock, E. Behnen, J. Soreff, V. Rao, and S. Posluszny, “Improved Transistor-Level Timing Methodology for a CELL Microprocessor,” ASPDAC 2006 (submitted for publication).
[13] J.D. Warnock, E. Behnen, S. Y. Lee, and J. Soreff, “Improved Method for Timing Margin Calculation,” IBM Invention Publish, Feb. 2004.
[14] E. Behnen, J. Soreff, J.D. Warnock, and D. Wendel, “Method to Apply Latch Transparency Locally While Avoiding It Globally During Timing,” filed with U.S. Patent Office, May 2004.
[15] J. Soreff, V. Rao, J.D. Warnock, S. Y. Lee, and D. Winston, “Pulse Waveform Timing in EinsTLT Templates,” filed with U.S. Patent Office, May 2004.
[16] J.D. Warnock and J. Soreff, “Improved Method of Timing Model Abstraction for Circuits Potentially Simultaneously Switching Internal Signals,” filed with U.S. Patent Office, May 2004.
[17] A. Devgan and R.A. Rohrer, “Adaptively Controlled Explicit Simulation,” IEEE Trans. Computer-Aided Design, vol. 13, pp. 746-762, June 1994.
[18] L.T. Pillage and R.A. Rohrer, “Asymptotic Waveform Evaluation for Timing Analysis,” IEEE Trans. Computer-Aided Design, vol. 9, no. 4, pp. 352-366, April 1990.
[19] C.L. Ratzlaff, N. Gopal, and L.T. Pillage, “RICE: Rapid Interconnect Circuit Evaluator,” IEEE Trans. Computer-Aided Design, vol. 13, no. 6, pp. 763-776, June 1994.
[20] K. Chang et al., “Clocking and Circuit Design for a Parallel I/O on a First-Generation CELL Processor,” ISSCC 2005, Paper #28.9.
[21] D. Pham et al., “Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation CELL Processor,” IEEE J. Solid-State Circuits, October 2005 special issue (submitted for publication).