research-article

Complementing user-level coarse-grain parallelism with implicit speculative parallelism

Authors: Nikolas Ioannou, Marcelo Cintra

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 284 - 295

https://doi.org/10.1145/2155620.2155654

Published: 03 December 2011

Abstract

Multi-core and many-core systems are the norm in contemporary processor technology and are expected to remain so for the foreseeable future. Programs that use parallel programming primitives such as PThreads or OpenMP often exploit coarse-grain parallelism, because it offers a good trade-off between programming effort and performance gain. Some parallel applications, however, show limited or no scaling beyond a certain number of cores. Given the abundance of cores expected in future many-cores, several cores would remain idle in such cases while execution performance stagnates. This paper proposes using the cores that do not contribute to performance improvement to run implicit fine-grain speculative threads. In particular, we present a many-core architecture and protocol that allow applications with coarse-grain explicit parallelism to further exploit implicit speculative parallelism within each thread. Implicit speculative parallelism frees the programmer from the additional effort of explicitly partitioning the work into finer and properly synchronized tasks. Our results show that, for a many-core comprising 128 cores and supporting implicit speculative parallelism in clusters of 2 or 4 cores, performance improves on top of the highest scalability point by 41% on average for the 4-core cluster and by 27% on average for the 2-core cluster. These performance improvements come with an energy consumption that is close to, and sometimes better than, the baseline. This approach often leads to better performance and energy efficiency than existing alternatives such as Core Fusion and Frequency Boosting. We also investigate the trade-offs between explicit and implicit threads as input dataset sizes vary. Finally, we present a dynamic mechanism to choose the number of explicit and implicit threads, which performs within 6% of the static oracle selection of threads.
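To make the core-budget arithmetic in the abstract concrete, the following is a minimal sketch of a static oracle selection over (explicit threads, cluster size) pairs on a 128-core machine. It is not the paper's dynamic mechanism, which the abstract does not describe. The function names, the power-of-two thread counts, and the `toy_time` cost model with its speedup factors are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator, Tuple

TOTAL_CORES = 128           # many-core size mentioned in the abstract
CLUSTER_SIZES = (1, 2, 4)   # 1 = no implicit speculation; 2 and 4 as in the abstract

Config = Tuple[int, int]    # (explicit_threads, cores_per_cluster)


def candidate_configs(total_cores: int = TOTAL_CORES,
                      cluster_sizes: Iterable[int] = CLUSTER_SIZES) -> Iterator[Config]:
    """Yield every (explicit_threads, cluster_size) pair that fits the core budget.

    Assumption: explicit thread counts are restricted to powers of two.
    """
    for cluster in cluster_sizes:
        explicit = 1
        while explicit * cluster <= total_cores:
            yield explicit, cluster
            explicit *= 2


def oracle_select(run_time: Callable[[int, int], float],
                  total_cores: int = TOTAL_CORES) -> Config:
    """Return the configuration with the lowest run time (exhaustive search)."""
    return min(candidate_configs(total_cores), key=lambda cfg: run_time(*cfg))


def toy_time(explicit: int, cluster: int) -> float:
    """Hypothetical cost model: explicit scaling saturates at 32 threads,
    and each cluster size adds a fixed, made-up speedup factor."""
    implicit_speedup = {1: 1.0, 2: 1.2, 4: 1.4}[cluster]
    return 100.0 / min(explicit, 32) / implicit_speedup
```

Under this toy model, `oracle_select(toy_time)` picks 32 explicit threads with 4-core clusters, which uses all 128 cores: once explicit scaling stalls, the otherwise idle cores are spent on implicit speculative threads instead.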

Index Terms

Complementing user-level coarse-grain parallelism with implicit speculative parallelism
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. General and reference
  1. Cross-computing tools and techniques
    1. Design

Published In

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

December 2011

519 pages

ISBN:9781450310536

DOI:10.1145/2155620

Conference Chair: Carlo Galuzzi (Technische Universiteit Delft, The Netherlands)

General Chair: Luigi Carro (Universidade Federal do Rio Grande do Sul, Brazil)

Program Chairs: Andreas Moshovos (University of Toronto, Canada), Milos Prvulovic (Georgia Institute of Technology, United States)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IEEE
ACM: Association for Computing Machinery
UFRGS: Universidade Federal do Rio Grande do Sul
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MICRO-44

Sponsor:

ACM
UFRGS
SIGMICRO
IEEE-CS

MICRO-44: The 44th Annual IEEE/ACM International Symposium on Microarchitecture

December 3 - 7, 2011

Porto Alegre, Brazil

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

Kumar S, Singh S, Aggarwal N (2023). Sustainable Data Dependency Resolution Architectural Framework to Achieve Energy Efficiency Using Speculative Parallelization. 2023 3rd International Conference on Innovative Sustainable Computational Technologies (CISCT), pages 1-6. Online publication date: 8-Sep-2023.
https://doi.org/10.1109/CISCT57197.2023.10351343

Kumar S, Singh S, Aggarwal N, Gupta B, Alhalabi W, Band S (2022). An efficient hardware supported and parallelization architecture for intelligent systems to overcome speculative overheads. International Journal of Intelligent Systems 37:12, pages 11764-11790. Online publication date: 8-Sep-2022.
https://doi.org/10.1002/int.23062

Estebanez A, Llanos D, Gonzalez-Escribano A (2016). A Survey on Thread-Level Speculation Techniques. ACM Computing Surveys 49:2, pages 1-39. Online publication date: 30-Jun-2016.
https://dl.acm.org/doi/10.1145/2938369

Fu Y, Nguyen T, Wentzlaff D, Prvulovic M (2015). Coherence domain restriction on large scale systems. Proceedings of the 48th International Symposium on Microarchitecture, pages 686-698. Online publication date: 5-Dec-2015.
https://dl.acm.org/doi/10.1145/2830772.2830832

Yanhua L, Youhui Z, Weimin Z, Napoli C, Salapura V, Franke H, Hou R (2015). Position-aware thread-level speculative parallelization for large-scale chip-multiprocessor. Proceedings of the 12th ACM International Conference on Computing Frontiers, pages 1-8. Online publication date: 6-May-2015.
https://dl.acm.org/doi/10.1145/2742854.2742866

Xu F, Shen L, Wang Z, Guo H, Su B, Chen W (2014). Improving Speculation Accuracy with Inter-thread Fetching Value Prediction. Algorithms and Architectures for Parallel Processing, pages 245-258. Online publication date: 2014.
https://doi.org/10.1007/978-3-319-11194-0_19

Xu F, Shen L, Wang Z, Guo H, Su B, Chen W (2013). HEUSPEC. Proceedings of the 2013 42nd International Conference on Parallel Processing, pages 621-630. Online publication date: 1-Oct-2013.
https://dl.acm.org/doi/10.1109/ICPP.2013.76

Xekalakis P, Ioannou N, Cintra M (2012). Mixed speculative multithreaded execution models. ACM Transactions on Architecture and Code Optimization 9:3, pages 1-26. Online publication date: 5-Oct-2012.
https://dl.acm.org/doi/10.1145/2355585.2355591
