research-article

ASC: automatically scalable computation

Authors:

Amos Waterland,

Elaine Angelino,

Jonathan Appavoo,

Margo Seltzer

ACM SIGPLAN Notices, Volume 49, Issue 4

Pages 575 - 590

https://doi.org/10.1145/2644865.2541985

Published: 24 February 2014

Abstract

We present an architecture designed to transparently and automatically scale the performance of sequential programs as a function of the hardware resources available. The architecture is predicated on a model of computation that views program execution as a walk through the enormous state space composed of the memory and registers of a single-threaded processor. Each instruction execution in this model moves the system from its current point in state space to a deterministic subsequent point. We can parallelize such execution by predictively partitioning the complete path and speculatively executing each partition in parallel. Accurately partitioning the path is a challenging prediction problem. We have implemented our system using a functional simulator that emulates the x86 instruction set, including a collection of state predictors and a mechanism for speculatively executing threads that explore potential states along the execution path. While the overhead of our simulation makes it impractical to measure speedup relative to native x86 execution, experiments on three benchmarks show scalability of up to a factor of 256 on a 1024-core machine when executing unmodified sequential programs.
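
The model described in the abstract can be illustrated with a toy sketch. This is not the authors' simulator or their predictors: the two-field state, the transition function `step`, and the helpers `run`, `speculative_run`, `closed_form_predictor`, and `poor_predictor` are all hypothetical names chosen for illustration. The sketch guesses the state at each partition boundary, executes every partition from its guessed start, and keeps a partition's result only if the guess matches the state actually reached.

```python
from concurrent.futures import ThreadPoolExecutor

def step(state):
    """One toy 'instruction': a deterministic move to the next state."""
    counter, acc = state
    return (counter + 1, (acc + counter) % 97)

def run(state, n_steps):
    """Walk n_steps transitions sequentially from the given state."""
    for _ in range(n_steps):
        state = step(state)
    return state

def closed_form_predictor(start, k):
    """Hypothetical accurate predictor: the state k steps after start."""
    counter, acc = start
    return (counter + k, (acc + k * counter + k * (k - 1) // 2) % 97)

def poor_predictor(start, k):
    """Hypothetical inaccurate predictor: right counter, guessed accumulator."""
    return (start[0] + k, 0)

def speculative_run(start, n_segments, seg_len, predictor):
    """Guess each segment's start state, run all segments concurrently,
    then accept a segment's result only if its guess is verified."""
    guesses = [start] + [predictor(start, k * seg_len)
                         for k in range(1, n_segments)]
    with ThreadPoolExecutor() as pool:
        ends = list(pool.map(lambda s: run(s, seg_len), guesses))
    state, hits = start, 0
    for guess, end in zip(guesses, ends):
        if guess == state:
            # Guess verified: reuse the speculative result (segment 0 always hits).
            state, hits = end, hits + 1
        else:
            # Misprediction: discard the speculative work and re-execute.
            state = run(state, seg_len)
    return state, hits

if __name__ == "__main__":
    print(speculative_run((0, 0), 4, 1000, closed_form_predictor))
    print(speculative_run((0, 0), 4, 1000, poor_predictor))
```

Both predictors yield the same final state as plain sequential execution; only the number of verified segments differs, which is why prediction accuracy governs how much parallel work is useful. CPython threads will not show a real speedup for this CPU-bound loop, so the sketch demonstrates the control flow only, not the scalability reported in the abstract.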

Published In

ACM SIGPLAN Notices Volume 49, Issue 4

ASPLOS '14

April 2014

729 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2644865

Editors:
Mark W. Bailey, Hamilton College, Clinton, NY
Rajeev Balasubramonian, University of Utah
Al Davis, University of Utah
Sarita Adve, University of Illinois at Urbana-Champ

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
February 2014
780 pages
ISBN:9781450323055
DOI:10.1145/2541940
General Chairs:
Rajeev Balasubramonian, University of Utah
Al Davis, University of Utah
Program Chair:
Sarita Adve, University of Illinois at Urbana-Champ

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 February 2014

Published in SIGPLAN Volume 49, Issue 4

Qualifiers

Research-article
