
Energy-efficient multithreading for a hierarchical heterogeneous multicore through locality-cognizant thread generation

Authors:

Patrick A. La Fratta,

Peter M. Kogge

Journal of Parallel and Distributed Computing, Volume 73, Issue 12

Pages 1551 - 1562

https://doi.org/10.1016/j.jpdc.2013.07.011

Published: 01 December 2013

Abstract

Energy costs have become increasingly problematic for high-performance processors, but the rising number of on-chip cores offers promising opportunities for energy reduction. Further, emerging architectures such as heterogeneous multicores present new opportunities for improved energy efficiency. While previous work has presented novel memory architectures, multithreading techniques, and data mapping strategies for reducing energy, thread generation mechanisms that account for data locality for this purpose have received limited attention. This study presents methodologies for the joint partitioning of data and threads to parallelize sequential codes across an innovative heterogeneous multicore processor called the Passive/Active Multicore (PAM), with the goal of reducing the energy consumed by on-chip data transport and cache access components while also improving execution time. Experimental results show that the design with automatic thread partitioning reduced the energy-delay product (EDP) by up to 48%.
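For readers unfamiliar with the metric, the energy-delay product is conventionally defined as energy multiplied by execution time, so a design can lower EDP by saving energy, by running faster, or both. The sketch below illustrates that arithmetic only. The function names and all numeric values are hypothetical and are not taken from the paper; the inputs were chosen so that the result lands at 48% for illustration.

```python
def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """Return EDP = energy x delay (lower is better)."""
    return energy_joules * delay_seconds


def edp_reduction(baseline_edp: float, new_edp: float) -> float:
    """Return the fractional EDP reduction relative to the baseline."""
    return 1.0 - new_edp / baseline_edp


# Hypothetical measurements, NOT results from the paper.
baseline = energy_delay_product(energy_joules=10.0, delay_seconds=2.0)
improved = energy_delay_product(energy_joules=6.5, delay_seconds=1.6)

reduction = edp_reduction(baseline, improved)
print(f"EDP reduction: {reduction:.0%}")  # prints "EDP reduction: 48%"
```

In this made-up case a 35% energy saving combined with a 20% shorter run time gives a 48% lower EDP, because the two factors multiply (0.65 x 0.80 = 0.52).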

Cited By

Wang Z, Xiong N, Wang H, Cheng L, Zhao W (2022). Whole procedure heterogeneous multiprocessors low-power optimization at algorithm-level. Cluster Computing 22:1 (2407-2423). DOI: 10.1007/s10586-018-1920-x. Online publication date: 10-Mar-2022.
https://dl.acm.org/doi/10.1007/s10586-018-1920-x

Singh J, Betha S, Mangipudi B, Auluck N (2015). Contention Aware Energy Efficient Scheduling on Heterogeneous Multiprocessors. IEEE Transactions on Parallel and Distributed Systems 26:5 (1251-1264). DOI: 10.1109/TPDS.2014.2322354. Online publication date: 1-May-2015.
https://dl.acm.org/doi/10.1109/TPDS.2014.2322354

Published In

Journal of Parallel and Distributed Computing Volume 73, Issue 12

December, 2013

193 pages

ISSN:0743-7315

Copyright © 2013.

Publisher

Academic Press, Inc.

United States
