
Multicore-based vector coprocessor sharing for performance and energy gains

Published: 30 September 2013

Abstract

In most applications that use a dedicated vector coprocessor, the coprocessor's resources are underutilized because data parallelism is not sustained, often as a result of vector-length variations in dynamic environments. Our work is motivated by: (a) the mandate for multicore designs to make efficient use of on-chip resources for low power and high performance; (b) the omnipresence of vector operations in high-performance scientific and emerging embedded applications; (c) the frequent need to handle a variety of vector sizes; and (d) the diverse computation needs of vector kernels in application suites. We present a robust design framework for vector coprocessor sharing in multicore environments that maximizes vector unit utilization and performance at substantially reduced energy costs. For our adaptive vector unit, which is attached to multiple cores, we propose three basic sharing policies that enforce coarse-grain, fine-grain, and vector-lane sharing. We benchmark these vector coprocessor sharing policies on a dual-core system and evaluate them using floating-point performance, resource utilization, and power/energy consumption as metrics. Benchmarking with FIR filtering, FFT, matrix multiplication, and LU factorization shows that these coprocessor sharing policies yield high utilization and performance with low energy costs. The proposed policies provide speedups of 1.2x to 2x and reduce energy needs by about 50% compared with a system that has a single core with an attached vector coprocessor. With performance expressed in clock cycles, the sharing policies demonstrate speedups of 3.62x to 7.92x compared with optimized Xeon runs. We also introduce performance and empirical power models that the runtime system can use to estimate the effectiveness of each policy in a hybrid system that can simultaneously implement this suite of shared coprocessor policies.
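The link the abstract draws between vector-length variation, utilization, and energy can be illustrated with a toy calculation. The sketch below is not the paper's performance or power model. It assumes a hypothetical lock-step vector unit in which each instruction occupies all assigned lanes for a whole number of passes, and it uses the generic identity energy = average power x time. The function names, lane counts, and numbers are illustrative assumptions only.

```python
from math import ceil


def lane_utilization(vector_length: int, lanes: int) -> float:
    """Fraction of lane slots doing useful work for one vector instruction.

    Assumption (not from the paper): the unit processes `lanes` elements
    per pass, and a partially filled last pass still occupies every lane.
    """
    if vector_length <= 0 or lanes <= 0:
        raise ValueError("vector_length and lanes must be positive")
    passes = ceil(vector_length / lanes)
    return vector_length / (passes * lanes)


def relative_energy(relative_power: float, speedup: float) -> float:
    """Energy relative to a baseline, using energy = average power * time.

    `relative_power` is the average power divided by the baseline's power;
    `speedup` is baseline time divided by the new time.
    """
    if relative_power <= 0 or speedup <= 0:
        raise ValueError("relative_power and speedup must be positive")
    return relative_power / speedup


if __name__ == "__main__":
    # One core issuing 8-element vectors to a 16-lane unit fills half the lanes.
    print(lane_utilization(8, 16))
    # Hypothetical lane split: two cores with 8 lanes each keep every lane busy.
    print(lane_utilization(8, 8))
    # Illustrative only: equal average power and a 2x speedup halve the energy.
    print(relative_energy(1.0, 2.0))
```

In this toy setting, giving each of two cores half of the lanes raises utilization from 0.5 to 1.0 for short vectors, which is the kind of effect a vector-lane sharing policy targets. Real behavior depends on memory bandwidth, instruction scheduling, and the power characteristics of the actual hardware.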

      Published In

      ACM Transactions on Embedded Computing Systems  Volume 13, Issue 2
      Special issue on application-specific processors
      September 2013
      254 pages
      ISSN:1539-9087
      EISSN:1558-3465
      DOI:10.1145/2514641

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 30 September 2013
      Accepted: 01 January 2012
      Revised: 01 September 2011
      Received: 01 March 2011
      Published in TECS Volume 13, Issue 2

      Author Tags

      1. FPGA prototyping
      2. Xilinx MicroBlaze
      3. coprocessor sharing
      4. multicore
      5. power
      6. vector coprocessor

      Qualifiers

      • Research-article
      • Research
      • Refereed
