
The design and implementation of a first-generation CELL processor

2005 IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers, 2005
ISSCC 2005 / SESSION 10 / MICROPROCESSORS AND SIGNAL PROCESSING / 10.2

10.2 The Design and Implementation of a First-Generation CELL Processor

D. Pham¹, S. Asano³, M. Bolliger¹, M. N. Day¹, H. P. Hofstee¹, C. Johns¹, J. Kahle¹, A. Kameyama³, J. Keaty¹, Y. Masubuchi³, M. Riley¹, D. Shippy¹, D. Stasiak¹, M. Suzuoki², M. Wang¹, J. Warnock¹, S. Weitzel¹, D. Wendel¹, T. Yamazaki², K. Yazawa²

¹ IBM, Austin, TX; ² Sony, Tokyo, Japan; ³ Toshiba, Austin, TX

The first-generation CELL processor supports multiple operating systems, including Linux. Its implementation consists of a 64b Power processor element (PPE) with its L2 cache, multiple synergistic processor elements (SPEs) [1], each with its own local memory (LS) [2], a high-bandwidth internal element interconnect bus (EIB), two configurable non-coherent I/O interfaces, a memory interface controller (MIC), and a pervasive unit that supports extensive test, monitoring, and debug functions. The high-level chip diagram is shown in Fig. 10.2.1. Key attributes include hardware content protection, virtualization, and real-time support, combined with extensive single-precision floating-point capability. By extending the Power architecture with SPEs that have coherent DMA access to system storage and with multi-operating-system resource management, CELL supports concurrent real-time and conventional computing. With a dual-threaded PPE and 8 SPEs, this implementation can handle 10 simultaneous threads and over 128 outstanding memory requests.

Figure 10.2.7 shows the die micrograph. The chip integrates roughly 234M transistors across 17 physical entities, with 580k repeaters and 1.4M nets, and is implemented in 90nm SOI technology with 8 levels of copper interconnect and one local interconnect layer. At the center of the chip is the EIB, composed of four 128b data rings plus a 64b tag, operated at half the processor clock rate. The wires are arranged in groups of four, interleaved with GND and VDD shields twisted at the center to reduce coupling noise on the two unshielded wires. To ensure signal integrity, over 50% of the global nets are engineered with 32k repeaters. The SoC uses 2965 C4s, arranged in four regions of different row-column pitches, attached to a low-cost organic package. This structure supports 15 separate power domains on the chip, many of which overlap physically on the die. The processor element design, power and clock grids, global routing, and chip assembly support a modular design with building-block-like construction.
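As a rough consistency check on the aggregate EIB figure annotated in Fig. 10.2.1 (up to 96 Bytes/cycle), consider a worked estimate under the assumption, drawn from later Cell BE descriptions rather than stated in this paper, that each data ring can carry up to three concurrent non-overlapping 16B (128b) transfers:

    4 rings × 3 transfers/ring × 16 B = 192 B per bus cycle,

and since the EIB runs at half the processor clock rate, this corresponds to 192 B / 2 = 96 B per processor cycle.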
The chip contains three distinct clock-distribution systems, each sourced by an independent PLL, to support processor, bus-interface, and memory-interface requirements. The main high-frequency clock grid covers over 85% of the chip, delivering the clock signal to the processors and miscellaneous circuits. The second and third clock grids, each operating at a fraction of the main clock frequency, are interleaved with the main clock-grid structure, creating multiple clock-frequency islands within the chip. All clock grids are constructed on the lowest-impedance final two layers of metal and are supported by a matrix of over 850 individually tuned buffers. This enables control of clock-arrival times and skews, especially on the main clock grid, which supports regions of widely varying clock-load density. High-frequency clock-signal distribution optimization and verification rely on wire simulation models that include frequency-sensitive inductance and resistance effects. As shown in Fig. 10.2.2, the final worst-case clock skew across the chip is less than 12 ps.

Given a short cycle-time target, a significant fraction of the power is consumed by latches, flip-flops, and other clocked elements, yet the delay overhead imposed by standard flip-flops is considerable. Therefore, a variety of latches and flip-flops are developed to allow both power and delay optimization. The base clock components are shown in Fig. 10.2.3. In addition to test controls, the base block accepts a local clock-gate signal with a small setup time relative to the falling global clock (the cycle boundary). Input setup-and-hold times are specified against the falling clock edge as a result of the built-in latching action of the base block. Local clocks for a typical master-slave flip-flop are derived from the base block. For timing-critical paths, a high-performance latch (HPL) is designed. It combines a wide MUX (up to 10-way), built on a dynamic NOR gate, with a set-reset latch (Fig. 10.2.4). The dynamic NOR starts evaluating with the launch of the clock, and the input-data hold time is limited by forcing all sel_b inputs high after a fixed delay.

Dynamic circuits are used in several critical macros, in the arrays, and in PLAs. All dynamic macros are latch-bounded (macro-to-macro signals are static). Signals that feed the dynamic logic are usually launched from the master portion of a flip-flop and ANDed with lclk to provide a signal that resets to 0 every cycle while lclk is low. Dynamic logic is always followed by a set-reset latch similar to that used in the HPL. In addition, various rules are adopted to ensure a "correct-by-construction" design methodology. All circuits use a common set of clocking components to ensure uniformity across the design, with no rotation of the components allowed. An extensive set of electrical and physical checks and audits is performed. Finally, a customized series of yield-related checking rules is employed to ensure manufacturability of the chip.

This SoC presented new challenges in the thermal design of the chip: the higher heat flux from smaller hot spots hinders spreading of the heat across the silicon substrate. Extensive thermal analysis carried out early in the design cycle ensures that the maximum junction temperature, as well as the average temperature of the die, stays within the design specifications. Various workloads are simulated for each component and power maps are constructed. From these maps, a matrix of small power sources is created for use with package and heat-sink models. Thermal models are then created and used to simulate both steady-state and transient thermal behavior. The results are analyzed to improve the design and floorplan of the chip and also to provide feedback for improved thermal-sensor design (Fig. 10.2.5).
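To make this flow concrete, the following is a minimal sketch of a steady-state thermal relaxation over a power map, in the spirit of the matrix of small power sources described above. Every number in it (grid size, power values, thermal resistances, sink temperature) is an invented placeholder rather than the chip's data, and the actual analysis used full package and heat-sink models.

# A minimal sketch with invented numbers throughout; not the models or data
# described in the paper.
GRID = 8          # die abstracted as an 8x8 matrix of small power sources
R_SPREAD = 2.0    # assumed lateral thermal resistance between neighboring cells (K/W)
R_SINK = 1.0      # assumed thermal resistance from each cell to the heat sink (K/W)
T_SINK = 45.0     # assumed heat-sink reference temperature (deg C)

# Power map (W per cell): uniform background power plus one hot spot standing
# in for a heavily loaded processing element.
power = [[0.2] * GRID for _ in range(GRID)]
power[1][1] = 3.0

def solve_steady_state(power, iters=5000):
    """Jacobi relaxation: each cell balances heat flow to its neighbors and the sink."""
    t = [[T_SINK] * GRID for _ in range(GRID)]
    for _ in range(iters):
        nxt = [row[:] for row in t]
        for i in range(GRID):
            for j in range(GRID):
                nbrs = [t[a][b] for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                        if 0 <= a < GRID and 0 <= b < GRID]
                # Node equation: sum((Tn - T)/R_SPREAD) + (T_SINK - T)/R_SINK + P = 0
                g = len(nbrs) / R_SPREAD + 1.0 / R_SINK
                nxt[i][j] = (sum(nbrs) / R_SPREAD + T_SINK / R_SINK + power[i][j]) / g
        t = nxt
    return t

temps = solve_steady_state(power)
print("estimated peak junction temperature: %.1f C" % max(max(row) for row in temps))
# A transient variant would add a per-cell heat capacity and step the same
# balance forward in time instead of iterating to a fixed point.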
Because of local heating caused by the individual processing units, sophisticated local thermal-sensing and thermal-control mechanisms are used to allow an aggressive, low-cost thermal design. The processor contains one linear sensor and 10 local digital thermal sensors. The linear sensor is essentially a diode connected to two external I/Os; it is used to measure the global temperature of the die and to adjust the system cooling. The digital thermal sensors provide early warning of any temperature increase and provide thermal protection.

In conclusion, special circuit techniques, rules for modularity and reuse, customized clocking structures, and unique power- and thermal-management concepts are applied to optimize the design. Correct operation is observed in the lab on first-pass silicon at frequencies well over 4GHz, as shown in Fig. 10.2.6.

Acknowledgements:
The authors gratefully acknowledge the many contributions from the entire Sony-Toshiba-IBM team, who worked tirelessly side-by-side on the design of this processor.

References:
[1] B. Flachs et al., "A Streaming Processor Unit for a CELL Processor," ISSCC Dig. Tech. Papers, Paper 7.4, pp. 134-135, Feb. 2005.
[2] T. Asano et al., "A 4.8GHz Fully Pipelined Embedded SRAM in the Streaming Processor of a CELL Processor," ISSCC Dig. Tech. Papers, Paper 26.7, pp. 486-487, Feb. 2005.
Figure 10.2.1: Processor high-level diagram.
Figure 10.2.2: Clock-skew map.
Figure 10.2.3: Local clock generation.
Figure 10.2.4: High-performance latch (HPL).
Figure 10.2.5: Die thermal map.
Figure 10.2.6: First-pass hardware measurement in the lab (Fmax vs. supply voltage at 85°C).
Figure 10.2.7: Die micrograph with high-level floorplan overlay.