research-article
Open access

Chameleon: Virtualizing idle acceleration cores of a heterogeneous multicore processor for caching and prefetching

Published: 07 May 2010

Abstract

Heterogeneous multicore processors have emerged as an energy- and area-efficient architectural solution for improving the performance of domain-specific applications, such as those with abundant data-level parallelism. These processors typically contain a large number of small, compute-centric cores for acceleration while keeping one or two high-performance ILP cores on the die to guarantee single-thread performance. Although the acceleration cores occupy a major portion of the transistors, these resources sit idle when running unparallelized legacy code or the sequential part of an application. To address this underutilization, we introduce Chameleon, a flexible heterogeneous multicore architecture that virtualizes these resources to enhance memory performance when running sequential programs. The Chameleon architecture can dynamically virtualize the idle acceleration cores into a last-level cache, a data prefetcher, or a hybrid of the two. In addition, Chameleon can operate in an adaptive mode that dynamically switches the acceleration cores between the hybrid mode and the prefetch-only mode by monitoring the effectiveness of the Chameleon cache mode. In our evaluation with the SPEC2006 benchmark suite, the performance improvement varied with the mode and the application. In the adaptive mode, Chameleon improves the performance of SPECint06 and SPECfp06 by 31% and 15% on average, respectively. Considering only memory-intensive applications, Chameleon improves system performance by 50% and 26% for SPECint06 and SPECfp06, respectively.
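
The adaptive mode described in the abstract can be pictured as a small decision rule that runs once per monitoring interval. The sketch below is only an illustration under stated assumptions: the abstract does not say which metric measures the "effectiveness of the Chameleon cache mode", so the hit-rate metric, the 10% threshold, the interval-based sampling, and the names `Mode`, `next_mode`, and `run_adaptive` are all hypothetical and are not taken from the article. It also assumes the hit and access counters are available in both modes.

```python
from enum import Enum


class Mode(Enum):
    HYBRID = "hybrid"                # idle cores serve as cache and prefetcher
    PREFETCH_ONLY = "prefetch-only"  # idle cores serve as prefetcher only


def next_mode(cache_hits: int, cache_accesses: int,
              min_hit_rate: float = 0.10) -> Mode:
    """Choose the mode for the next interval from the last interval's counters.

    Assumption: cache-mode effectiveness is approximated by its hit rate,
    and 0.10 is an arbitrary illustrative threshold.
    """
    if cache_accesses == 0:
        # No evidence either way; this sketch arbitrarily stays in hybrid mode.
        return Mode.HYBRID
    hit_rate = cache_hits / cache_accesses
    if hit_rate >= min_hit_rate:
        return Mode.HYBRID
    return Mode.PREFETCH_ONLY


def run_adaptive(samples, min_hit_rate: float = 0.10):
    """Return the mode chosen after each (hits, accesses) sample."""
    return [next_mode(hits, accesses, min_hit_rate)
            for hits, accesses in samples]
```

Under these assumptions, an interval in which the virtualized cache rarely hits moves the acceleration cores to prefetch-only mode, and a later interval with a useful hit rate moves them back to hybrid mode.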

Cited By

  • (2017) A comparative study of input port and crossbar configurations in NoC router microarchitectures. 2017 4th International Conference on Signal Processing and Integrated Networks (SPIN), 121-125. DOI: 10.1109/SPIN.2017.8049928. Online publication date: Feb-2017
  • (2014) ad-heap. Proceedings of Workshop on General Purpose Processing Using GPUs, 54-63. DOI: 10.1145/2588768.2576786. Online publication date: 1-Mar-2014
  • (2014) ad-heap. Proceedings of Workshop on General Purpose Processing Using GPUs, 54-63. DOI: 10.1145/2576779.2576786. Online publication date: 1-Mar-2014

      Published In

      ACM Transactions on Architecture and Code Optimization  Volume 7, Issue 1
      April 2010
      151 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/1736065
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 07 May 2010
      Accepted: 01 September 2009
      Revised: 01 September 2009
      Received: 01 December 2008
      Published in TACO Volume 7, Issue 1

      Author Tags

      1. Heterogeneous multicore
      2. cache
      3. idle core
      4. prefetching

      Qualifiers

      • Research-article
      • Research
      • Refereed
