Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Analyzing Behavior Specialized Acceleration

Published: 25 March 2016 Publication History

Abstract

Hardware specialization has become a promising paradigm for overcoming the inefficiencies of general purpose microprocessors. Of significant interest are Behavioral Specialized Accelerators (BSAs), which are designed to efficiently execute code with only certain properties, but remain largely configurable or programmable. The most important strength of BSAs -- their ability to target a wide variety of codes -- also makes their interactions and analysis complex, raising the following questions: can multiple BSAs be composed synergistically, what are their interactions with the general purpose core, and what combinations favor which workloads? From a methodological standpoint, BSAs are also challenging, as they each require ISA development, compiler and assembler extensions, and either simulator or RTL models.
To study the potential of BSAs, we propose a novel modeling technique called the Transformable Dependence Graph (TDG) - a higher level alternative to the time-consuming traditional compiler+simulator approach, while still enabling detailed microarchitectural models for both general cores and accelerators. We then propose a multi-BSA organization, called ExoCore, which we model and study using the TDG. A design space exploration reveals that an ExoCore organization can push designs beyond the established energy-performance frontiers for general purpose cores. For example, a 2-wide OOO processor with three BSAs matches the performance of a conventional 6-wide OOO core, has 40% lower area, and is 2.6x more energy efficient.

References

[1]
Parboil Benchmark Suite. impact.crhc.illinois.edu/parboil/parboil.aspx.
[2]
Vertical Microbenchmarks. http://cs.wisc.edu/vertical/microbench.
[3]
Renee St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger. General-purpose code acceleration with limited-precision analog computation. In ISCA, 2014.
[4]
Thomas Ball and James R. Larus. Efficient path profiling. In MICRO, 1996.
[5]
Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 2011.
[6]
Shekhar Borkar and Andrew A. Chien. The future of micro- processors. Commun. ACM, 54(5):67--77, 2011.
[7]
Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine- learning. In ASPLOS, 2014.
[8]
Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, and Krisztian Flautner. Application-specific processing on a general-purpose core via transparent instruction set customization. In MICRO, 2004.
[9]
Rajagopalan Desikan, Doug Burger, and Stephen W. Keckler. Measuring experimental error in microprocessor simulation. In ISCA, 2001.
[10]
Lieven Eeckhout. Computer architecture performance evaluation methods. Synthesis Lectures on Computer Architecture, 2010.
[11]
Peter A. Milder Eric S. Chung, James C. Hoe, and Ken Mai. Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPUs? In MICRO '10.
[12]
Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. SIGARCH Comput. Archit. News, 2011.
[13]
Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. Neural acceleration for general-purpose approximate programs. In MICRO, 2012.
[14]
Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. A mechanistic performance model for superscalar out-of-order processors. ACM Trans. Comput. Syst., 2009.
[15]
Brian Fields, Shai Rubin, and Rastislav Bodik. Focusing processor policies via critical-path prediction. In ISCA, 2001.
[16]
Saturnino Garcia, Donghwan Jeon, Chris Louie, and Michael Bedford Taylor. Kremlin: Rethinking and rebooting gprof for the multicore age. In PLDI, 2011.
[17]
Venkatraman Govindaraju, Chen-Han Ho, Tony Nowatzki, Jatin Chhugani, Nadathur Satish, Karthikeyan Sankaralingam, and Changkyu Kim. Dyser: Unifying functionality and parallelism specialization for energy efficient computing. IEEE Micro, 2012.
[18]
Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke, and David August. Bundled execution of recurring traces for energy-efficient general purpose processing. In MICRO, 2011.
[19]
Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. In ISCA '10.
[20]
Mark Hempstead, Gu-Yeon Wei, and David Brooks. Navigo: An early-stage model to study power-contrained architectures and specialization. In Proceedings of Workshop on Modeling, Benchmarking, and Simulations (MoBS), 2009.
[21]
Sunpyo Hong and Hyesoon Kim. An integrated GPU power and performance model. In ISCA '10.
[22]
R. Iyer. Accelerator-rich architectures: Implications, opportunities and challenges. In Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, 2012.
[23]
Donghwan Jeon, Saturnino Garcia, Chris Louie, and Michael Bedford Taylor. Kismet: Parallel Speedup Estimates for Serial Programs. In OOPSLA, 2011.
[24]
Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. Meet the walkers: Accelerating index traversals for in-memory databases. In MICRO, 2013.
[25]
Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, and Dean M. Tullsen. Single-isa heterogeneous multi-core architectures: The potential for processor power reduction. In MICRO, 2003.
[26]
Rakesh Kumar, Dean M. Tullsen, and Norman P. Jouppi. Core architecture optimization for heterogeneous chip multiprocessors. In PACT, 2006.
[27]
Chunho Lee, M. Potkonjak, and W.H. Mangione-Smith. Me- diaBench: a tool for evaluating and synthesizing multimedia and communications systems. In MICRO, 1997.
[28]
Jaewon Lee, Hanhwi Jang, and Jangwoo Kim. Rpstacks: Fast and accurate processor design space exploration using representative stall-event stacks. In MICRO, 2014.
[29]
Sheng Li, Jung-Ho Ahn, R.D. Strong, J.B. Brockman, D.M. Tullsen, and N.P. Jouppi. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In MICRO, 2009.
[30]
Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Teman, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. PuDianNao: a polyvalent machine learning accelerator. In ASPLOS, 2015.
[31]
Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke. Composite Cores: Pushing heterogeneity into a core. In MICRO, 2012.
[32]
J. Meng, V.A. Morozov, K. Kumaran, V. Vishwanath, and T.D. Uram. GROPHECY: GPU performance projection from CPU code skeletons. In SC'11. ACM, 2011.
[33]
Tipp Moseley, Dirk Grunwald, Daniel A Connors, Ram Ra- manujam, Vasanth Tovinkere, and Ramesh Peri. Loopprof: Dynamic techniques for loop detection and profiling. In Proceedings of the 2006 Workshop on Binary Instrumentation and Applications (WBIA), 2006.
[34]
Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P Jouppi. Cacti 6.0: A tool to model large caches. HP Laboratories, 2009.
[35]
Sandeep Navada, Niket K. Choudhary, Salil V. Wadhavkar, and Eric Rotenberg. A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors. In PACT, 2013.
[36]
Tony Nowatzki, Vinay Gangadhar, and Karthikeyan Sankaralingam. Exploring the potential of heterogeneous Von Neumann/Dataflow execution models. In ISCA, 2015.
[37]
Tony Nowatzki, Vinay Gangadhar, and Karthikeyan Sankaralingam. Pushing the limits of accelerator efficiency while retaining programmability. In HPCA, 2016.
[38]
Tony Nowatzki, Venkatraman. Govindaraju, and Karthikeyan Sankaralingam. A graph-based program representation for analyzing hardware specialization approaches. Computer Architecture Letters, 2015.
[39]
Cedric Nugteren and Henk Corporaal. A modular and parameterisable classification of algorithms. Technical Report ESR-2011-02, Eindhoven University of Technology, 2011.
[40]
Cedric Nugteren and Henk Corporaal. The boat hull model: adapting the roofline model to enable performance prediction for parallel computing. In PPOPP, 2012.
[41]
Shruti Padmanabha, Andrew Lukefahr, Reetuparna Das, and Scott Mahlke. Trace based phase prediction for tightly-coupled heterogeneous cores. In MICRO, 2013.
[42]
Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, and Mark A. Horowitz. Convolution engine: Balancing efficiency and flexibility in specialized computing. In ISCA, 2013.
[43]
Eric Rotenberg, Quinn Jacobson, Yiannakis Sazeides, and Jim Smith. Trace processors. In MICRO, 1997.
[44]
Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. Aladdin: A pre-rtl, power-performance accelerator simulator enabling large design space exploration of customized architectures. In ISCA, 2014.
[45]
M. Shoaib Bin Altaf and D.A. Wood. LogCA: a performance model for hardware accelerators. Computer Architecture Letters, 2015.
[46]
Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, and Richard Vuduc. A performance analysis framework for identifying potential benefits in GPGPU applications. In PPoPP, 2012.
[47]
H. Singh, Ming-Hau Lee, Guangming Lu, F.J. Kurdahi, N. Bagherzadeh, and E.M. Chaves Filho. Morphosys: an integrated reconfigurable system for data-parallel and computation-intensive applications. Computers, IEEE Trans- actions on, 2000.
[48]
Tyler Sondag and Hridesh Rajan. Phase-based tuning for better utilization of performance-asymmetric multicore processors. In CGO, 2011.
[49]
Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang, and Christopher Batten. Architectural specialization for inter-iteration loop dependence patterns. In MICRO, 2014.
[50]
Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin. Wavescalar. In MICRO, pages 291--, 2003.
[51]
Kenzo Van Craeynest, Aamer Jaleel, Lieven Eeckhout, Paolo Narvaez, and Joel Emer. Scheduling heterogeneous multicores through performance impact estimation (pie). In ISCA, 2012.
[52]
Ashish Venkat and Dean M. Tullsen. Harnessing isa diversity: Design of a heterogeneous-isa chip multiprocessor. In ISCA, 2014.
[53]
Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford Taylor. Conservation Cores: Reducing the Energy of Mature Computations. In ASPLOS '10.
[54]
Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multi-core architectures. Commun. ACM, 2009.
[55]
Jonathan A. Winter, David H. Albonesi, and Christine A. Shoemaker. Scalable thread scheduling and global power management for heterogeneous many-core architectures. In PACT, 2010.
[56]
Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. Q100: The architecture and design of a database processing unit. In ASPLOS, 2014.
[57]
T. Zidenberg, I. Keslassy, and U. Weiser. Optimal resource allocation with multiamdahl. Computer, 2013.

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices
ACM SIGPLAN Notices  Volume 51, Issue 4
ASPLOS '16
April 2016
774 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2954679
  • Editor:
  • Andy Gill
Issue’s Table of Contents
  • cover image ACM Conferences
    ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2016
    824 pages
    ISBN:9781450340915
    DOI:10.1145/2872362
    • General Chair:
    • Tom Conte,
    • Program Chair:
    • Yuanyuan Zhou
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 March 2016
Published in SIGPLAN Volume 51, Issue 4

Check for updates

Author Tags

  1. accelerators
  2. dependence graphs
  3. modeling
  4. program behaviors
  5. specialization

Qualifiers

  • Research-article

Funding Sources

  • NSF

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)42
  • Downloads (Last 6 weeks)15
Reflects downloads up to 24 Dec 2024

Other Metrics

Citations

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media