  CPU consists of multiple components
  performance improving at 20-35% p.a.
  often ECL or other exotic technology
  huge I/O and memory bandwidth

Microprocessors
  usually a single CMOS part
  performance improving at 35-50% p.a.
  enabled through improvements in fabrication technology
    huge investment
    physical advantages of smaller size
  General Purpose Processors
    desktop / server
    SMP / Parallel supercomputers
  Embedded controllers / SoCs
  DSPs / Graphics Processors

Developments in CMOS Fabrication
  line size reduction
    0.8, 0.5, 0.35, 0.25, 0.18, 0.15, 0.13, 0.09 µm
    10-20% reduction p.a.
  switching delay reduces with line size
    increases in clock speed
    Pentium: 66MHz @ 0.8µm, 150MHz @ 0.6µm, 233MHz @ 0.35µm
  density increases at square of 1/line size
  Die size increases at 10-29% p.a.
  Transistor count increases at 55% p.a.
    enables architectural jumps
      8, 16, 32, 64, 128 bit ALUs
      large caches (PA-8500: 1.5MB on-chip)
      new functional units (e.g. multiplier)
      duplicated functional units (multi-issue)
      whole System On a Chip (SoC)

Developments in DRAM Technology
  DRAM density increases at 40-60% p.a.
    equivalent to 0.5-1 address bits p.a.
    cost dropping at same rate
    16M, 64M, 256M, 1G
  Consequences for processor architectures:
    May not be able to address whole of memory from a single pointer
      segmentation
    May run out of physical address bits
      banked (windowed) memory
  DRAM performance
    just 35% latency improvement in 10 years!
    new bus interfaces make more sequential b/w available
      SDRAM, RAMBUS, DDR, DDR2

Processor Development Cycle
  Fabrication technology has huge influence on power and performance
    must use the latest fabrication process
    Full custom design vs. semi custom
  Keep development cycle short (3-4 years)
    Non-CMOS technology leads to complications
  Advance teams to research:
    process characteristics
    key circuit elements
    packaging
    floor plan
    required performance
    microarchitecture
  investigate key problems
  Hope ISA features don't prove to be a handicap
  Keep up or die!
  Alpha architects planned for 1000x performance improvement over 25 years

Power Consumption
  Important for laptops, PDAs, mobile phones, set-top boxes, etc.
    155W for Digital Alpha 21364 @ 1150MHz
    130W for Itanium-2 @ 1500MHz
    90W for AMD Opteron 148 @ 2GHz
    81W for Pentium-IV @ 3GHz
    12W for Intel Mobile Pentium M @ 1100MHz
    420mW for Digital StrongARM @ 233MHz, 2.0V
    130mW for Digital StrongARM @ 100MHz, 1.65V
  Smaller line size results in lower power
    lower core voltage, reduced capacitance
    greater integration avoids inter-chip signalling
  Reduce clock speed to scale power
    P = C V^2 f
    may allow lower voltage
    potential for cubic scaling (power falls as f^3 if V can be reduced in step with f)
    better than periodic HALTing
  Performance per Watt

Dynamic Clock Gating
  Divide chip into a hundred or more clock zones.
  Only clock a zone when a clock cycle will change a registered value.
  Can save a factor of four in power, even under heavy CPU load.

    // Registers in this zone are clocked from the gated clock only.
    always @(posedge gated_clk) begin
      r1 <= a + b;
      r2 <= ...
    end
    // A clock edge is needed only if some register would change value.
    wire clock_needed = r1 != (a+b) || ... || ... ...;
    // The gating cell passes clk through to gated_clk when clock_needed is set.
    CLOCKGATECELL g1(gated_clk, clk, clock_needed);

Cost and Price
  E.g.:
    $0.50: 8-bit microcontroller
    $3: XScale (ARM) (400MHz, 0.18µm, 20mm^2, 2.1M[1M])
    $500: Pentium IV Celeron (1.2GHz, 0.13µm, 131mm^2, 28M[4M])
    $150: Pentium IV (3.2GHz, 0.09µm, 180mm^2, 42M[7M])
    $2200: Itanium2 (1GHz, 0.18µm, 421mm^2, 221M[15M])
  Costs influenced by die size, packaging, testing
  Large influence by manufacturing volume
  Costs reduce over product life (e.g. 40% p.a.)
    Yield improves
    Speed grade binning
    Fab shrinks and steppings

Compatibility
  Pin Compatibility (second sourcing)
  Backwards Binary Compatibility
    8086, 80286, 80386, 80486, Pentium, Pentium Pro, Pentium II/III/IV, Itanium
    NexGen, Cyrix, AMD, Transmeta
    typically need to re-optimize
  Typically hard to change architecture
    Users have huge investment in s/w
    Binary translators, e.g.
      FX!32, WABI
      typically interface to native OS
    Need co-operation from s/w vendors
      multi-platform support costs $s
    Most computer sales are upgrades
  Platform independence initiatives
    Source, p-Code, Java bytecode, .NET
  Compatibility is very important

Performance Measurement
  Try before you buy! (often not possible)
    System may not even exist yet
      use cycle-level simulation
    Real workloads often hard to characterize and measure improvements
      especially interactive
  Marketing hype
    MHz, MIPS, MFLOPS
  Algorithm kernels
    Livermore Loops, Linpack
  Synthetic benchmarks
    Dhrystones, Whetstones, iCOMP
  Benchmark suites
    SPEC-INT, SPEC-FP, SPEC-HPC, NAS
  Application Benchmarks
    TPC-C/H/R, SPECNFS, SPECWeb, Quake
  Performance is application dependent

Standard Performance Evaluation Corporation
  SPEC is the most widely used benchmark
    processor manufacturers
    workstation vendors
  CPU INT / FP 89, 92, 95, 2000, (2004)
  Suite updated to reflect current workloads
  CINT95/2K: 8/12 integer C programs
  CFP95/2K: 10/14 floating point in C & Fortran
  measures:
    processor
    memory system
    compiler
    NOT OS, libc, disk, graphics, network

Choosing programs for SPEC2000
  More programs than SPEC95
  Bigger programs than SPEC95
    Don't fit in on-chip caches
  Reflect some real workloads
  Run for several minutes
    Amortize startup overhead & timing inaccuracies
  Not susceptible to trick transformations
    Vendors invest huge s/w effort
  Fit in 256MB (95 was 64MB)
  Moving target...
    SPEC92, 95, 2K results not translatable

CINT95 suite (C)
  099.go         An AI go-playing program
  124.m88ksim    A chip simulator for the Motorola 88100
  126.gcc        Based on the GNU C compiler version 2.5.3
  129.compress   An in-memory version of the utility
  130.li         Xlisp interpreter
  132.ijpeg      De/compression on in-memory images
  134.perl       An interpreter for the Perl language
  147.vortex     An object oriented database

CFP95 suite (Fortran)
  101.tomcatv    Vectorized mesh generation
  102.swim       Shallow water equations
  103.su2cor     Monte-Carlo method
  104.hydro2d    Navier Stokes equations
  107.mgrid      3d potential field
  110.applu      Partial differential equations
  125.turb3d     Turbulence modelling
  141.apsi       Weather prediction
  145.fpppp      Quantum chemistry
  146.wave5      Maxwell's equations

SPEC reporting
  Time each program to run
  Reproducibility is paramount
    Take mean of 3 runs
    Full disclosure
  Baseline measurements
    SPECint_base95
    Same compiler optimizations for whole suite
  Peak measurements
    SPECint95
    Each benchmark individually tweaked
    Unsafe optimizations can be enabled!
  Rate measurements for multiprocessors
    SPECint_rate95, SPECfp_rate95
    time for N copies to complete x N

Totalling Results
  How to present results?
    Present individual results?
    Arithmetic mean?
    Weighted harmonic mean?
    SPEC uses geometric mean, normalised against a reference platform
      allows normalisation before or after mean
      performance ratio can be predicted by dividing means
      SPEC95 uses Sun SS10/40 as reference platform

SPEC CINT95 Results
Copyright 1995, Standard Performance Evaluation Corporation
Volume: 7 Issue: 4

Intel Corporation, Alder System (200MHz, 256KB L2)
  SPECint95 = 8.09        SPECint_base95 = 8.09
  SPEC license #: 14      Tested By: Intel        Test Date: Oct-95
  Hardware Avail: May-96  Software Avail: Feb-96

  [Bar chart: SPEC ratio (0 to 10) for each benchmark, 099.go to 147.vortex]

  Benchmark       Reference   Base Run   Base SPEC   Run     SPEC
  # and Name      Time        Time       Ratio       Time    Ratio
  099.go          4600        567        8.11        567     8.11
  124.m88ksim     1900        243        7.81        243     7.81
  126.gcc         1700        222        7.65        222     7.65
  129.compress    1800        258        6.99        258     6.99
  130.li          1900        220        8.62        220     8.62
  132.ijpeg       2400        285        8.43        285     8.43
  134.perl        1900        232        8.21        232     8.21
  147.vortex      2700        295        9.14        295     9.14
  SPECint_base95 (G. Mean)               8.09
  SPECint95 (G. Mean)                                        8.09

  Hardware
    Model Name:        Alder
    CPU:               200MHz Pentium Pro Processor
    FPU:               Integrated
    Number of CPU(s):  1
    Primary Cache:     8KBI+8KBD
    Secondary Cache:   256KB(I+D)
    Other Cache:       None
    Memory:            128MB (60ns fast page)
    Disk Subsystem:    2GB ST32550W
    Other Hardware:    AHA-2940W Controller
  Software
    Operating System:  UnixWare 2.0, SDK
    Compiler:          Intel C Reference Compiler 2.2 Beta
    File System:       ufs, vxfs (/tmp as 8MB /tmpfs)
    System State:      Single user (root + killall)

  Notes/Tuning Information
    Base and non-base flags are the same and use Feedback Directed Optimization
      Pass1: -tp p6 -ipo -xi -prof_gen -ircdb_dir /tmp/IRCDB
      Pass2: -tp p6 -ipo -xi -prof_use -ircdb_dir /tmp/IRCDB
      -ircdb_dir is a location flag and not an optimization flag
    Portability:
      124: -DSYSV -DLEHOST
      130, 134, 147: -lm
      132: -DSYSV
      126: -lm -lc -L/usr/ucblib -lucb -lmalloc
    Memory subsystem is four-way interleaved.

SPEC CINT95 Results

Intel 440LX motherboard, Pentium Pro 200
  SPECint95 = --          SPECint_base95 = 6.37
  SPEC license #: 1178    Tested By: Ian Pratt, CUCL

  [Bar chart: SPEC ratio (0 to 8) for each benchmark, 099.go to 147.vortex]

  Benchmark       Reference   Base Run   Base SPEC   Run     SPEC
  # and Name      Time        Time       Ratio       Time    Ratio
  099.go          4600        595        7.73        --      --
  124.m88ksim     1900        310        6.12        --      --
  126.gcc         1700        276        6.16        --      --
  129.compress    1800        357        5.04        --      --
  130.li          1900        277        6.85        --      --
  132.ijpeg       2400        384        6.26        --      --
  134.perl        1900        279        6.81        --      --
  147.vortex      2700        427        6.32        --      --
  SPECint_base95 (G. Mean)               6.37
  SPECint95 (G. Mean)                                        --

  Hardware
    Model Name:        Intel 440LX
    CPU:               Pentium Pro 200
    FPU:
    Number of CPU(s):  1
    Primary Cache:     8KB+8KB
    Secondary Cache:   256KB
    Other Cache:
    Memory:            128MB
    Disk Subsystem:    4GB
    Other Hardware:
  Software
    Operating System:  Linux 2.0.30
    Compiler:          gcc 2.7.2p
    File System:       ext2
    System State:      multiuser

  Notes/Tuning Information
    Portability flags were:
    Baseline flags were: -O2 -fomit-frame-pointer
    Nonbase flags were:

SPEC CINT2000 Result
Copyright 1999-2000, Standard Performance Evaluation Corporation
info@spec.org  http://www.spec.org

Compaq Computer Corporation, AlphaServer ES40 Model 6/833
  SPECint2000 = 544       SPECint_base2000 = 518
  SPEC license #: 2       Tested by: Compaq NH    Test date: Oct-2000
  Hardware Avail: Jan-2001  Software Avail: Nov-2000

  [Bar chart: ratio (axis 200 to 800) for each benchmark, 164.gzip to 300.twolf]

  Benchmark       Reference   Base       Base     Runtime   Ratio
                  Time        Runtime    Ratio
  164.gzip        1400        358        392      357       393
  175.vpr         1400        309        452      307       456
  176.gcc         1100        178        617      160       687
  181.mcf         1800        408        441      340       529
  186.crafty      1000        144        694      157       637
  197.parser      1800        500        360      409       440
  252.eon         1300        202        645      202       644
  253.perlbmk     1800        342        526      332       543
  254.gap         1100        301        365      303       363
  255.vortex      1900        282        673      249       763
  256.bzip2       1500        268        560      264       568
  300.twolf       3000        456        658      451       666

  Hardware
    CPU:               Alpha 21264B
    CPU MHz:           833
    FPU:               Integrated
    CPU(s) enabled:    1
    CPU(s) orderable:  1 to 4
    Parallel:          No
    Primary Cache:     64KB(I)+64KB(D) on chip
    Secondary Cache:   8MB off chip
    L3 Cache:          None
    Other Cache:       None
    Memory:            16GB
    Disk Subsystem:    1x8GB BD0096349A
    Other Hardware:    Ethernet
  Software
    Operating System:  Tru64 UNIX V5.1 + Patch Kit 1 libc
    Compiler:          Compaq C V6.3-129-44A8I
                       Compaq C++ V6.2-033-4298H
    File System:       AdvFS
    System State:      Multi-user

  Notes/Tuning Information
    Baseline C  : cc -arch ev6 -fast GEMFB ONESTEP
             C++: cxx -arch ev6 -O2 ONESTEP
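
The way SPEC totals results can be checked against the two CINT95 sheets above. The following is a minimal Python sketch, not SPEC's own tooling: it takes the per-benchmark base ratios from the Intel (Alder) and gcc/Linux Pentium Pro 200 sheets, recomputes the geometric means, and illustrates the "dividing means" property. The names geometric_mean, spec_ratio, ALDER_BASE and LINUX_BASE are invented for this illustration.

```python
import math

# Base SPEC ratios copied from the two CINT95 result sheets.
ALDER_BASE = {
    "099.go": 8.11, "124.m88ksim": 7.81, "126.gcc": 7.65, "129.compress": 6.99,
    "130.li": 8.62, "132.ijpeg": 8.43, "134.perl": 8.21, "147.vortex": 9.14,
}
LINUX_BASE = {
    "099.go": 7.73, "124.m88ksim": 6.12, "126.gcc": 6.16, "129.compress": 5.04,
    "130.li": 6.85, "132.ijpeg": 6.26, "134.perl": 6.81, "147.vortex": 6.32,
}


def spec_ratio(reference_time, run_time):
    """SPEC ratio: how many times faster than the reference platform."""
    return reference_time / run_time


def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


alder_mean = geometric_mean(ALDER_BASE.values())
linux_mean = geometric_mean(LINUX_BASE.values())

# Property from the slide: the ratio of the two geometric means equals the
# geometric mean of the per-benchmark speedups, so the choice of reference
# platform cancels out when comparing two machines.
speedups = [ALDER_BASE[name] / LINUX_BASE[name] for name in ALDER_BASE]
mean_of_speedups = geometric_mean(speedups)
ratio_of_means = alder_mean / linux_mean

print(f"099.go ratio from times: {spec_ratio(4600, 567):.2f}")
print(f"Alder geometric mean:    {alder_mean:.2f}")
print(f"Linux geometric mean:    {linux_mean:.2f}")
print(f"ratio of means:          {ratio_of_means:.3f}")
print(f"mean of speedups:        {mean_of_speedups:.3f}")
```

The recomputed means should agree with the published 8.09 and 6.37 to within the rounding of the per-benchmark ratios. An arithmetic mean would not have the cancellation property shown in the last two lines, which is one reason to prefer the geometric mean for normalised results.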