Energy-efficient multithreading for a hierarchical heterogeneous multicore through locality-cognizant thread generation

Published: 01 December 2013

Abstract

Energy costs have become increasingly problematic for high-performance processors, but the rising number of on-chip cores offers promising opportunities for energy reduction. Emerging architectures such as heterogeneous multicores present further opportunities for improved energy efficiency. While previous work has presented novel memory architectures, multithreading techniques, and data mapping strategies for reducing energy, thread generation mechanisms that take data locality into account for this purpose have received limited consideration. This study presents methodologies for jointly partitioning data and threads to parallelize sequential codes across an innovative heterogeneous multicore processor called the Passive/Active Multicore (PAM), with the goal of reducing the energy consumed by on-chip data transport and cache accesses while also improving execution time. Experimental results show that the design with automatic thread partitioning reduced the energy-delay product (EDP) by up to 48%.
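For readers unfamiliar with the metric, the energy-delay product is conventionally the energy consumed multiplied by the execution time, so a lower value is better and a design can improve it by saving energy, by running faster, or both. The sketch below illustrates that arithmetic only. The function names and all numeric values are hypothetical and are not taken from the paper's experiments.

```python
def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """Return EDP = energy * execution time (lower is better)."""
    return energy_joules * delay_seconds


def relative_reduction(baseline: float, improved: float) -> float:
    """Return the fractional reduction of `improved` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical measurements, chosen only to illustrate the calculation.
baseline_edp = energy_delay_product(energy_joules=10.0, delay_seconds=2.0)
improved_edp = energy_delay_product(energy_joules=8.0, delay_seconds=1.5)

reduction = relative_reduction(baseline_edp, improved_edp)
print(f"EDP reduction: {reduction:.0%}")
```

Because the two factors multiply, a 20% energy saving combined with a 25% shorter run time yields a larger EDP reduction than either improvement alone.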

Cited By

  • (2022) Whole procedure heterogeneous multiprocessors low-power optimization at algorithm-level. Cluster Computing 22:1 (2407-2423). doi:10.1007/s10586-018-1920-x. Online publication date: 10-Mar-2022
  • (2015) Contention Aware Energy Efficient Scheduling on Heterogeneous Multiprocessors. IEEE Transactions on Parallel and Distributed Systems 26:5 (1251-1264). doi:10.1109/TPDS.2014.2322354. Online publication date: 1-May-2015

Published In

Journal of Parallel and Distributed Computing, Volume 73, Issue 12
December 2013
193 pages

Publisher

Academic Press, Inc.

United States

Author Tags

  1. Energy-efficient processor design
  2. Heterogeneous multicore
  3. Parallel computer architecture

Qualifiers

  • Article
