Abstract
We evaluate the scalability of a Polymorphic Register File using the Conjugate Gradient method as a case study. We focus on a heterogeneous multi-processor architecture, taking into consideration critical parameters such as cache bandwidth and memory latency. We compare the performance of 256 Polymorphic Register File-augmented workers against a single Cell PowerPC Processor Unit (PPU). In such a scenario, simulation results suggest that for the Sparse Matrix Vector Multiplication kernel, absolute speedups of up to 200 times can be obtained. Moreover, when an equal number of workers in the range 1-256 is employed, our design is between 1.7 and 4.2 times faster than a Cell PPU-based system. Furthermore, we study the impact of memory latency and cache bandwidth on the sustainable speedups of the system considered. Our tests suggest that a 128-worker configuration requires the caches to deliver 1638.4 GB/sec in order to preserve 80% of its peak speedup.
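For readers unfamiliar with the kernel named above, the following is a minimal scalar sketch of Sparse Matrix Vector Multiplication, followed by the per-worker share implied by the quoted aggregate bandwidth. The abstract does not state the sparse storage format, so the CSR layout and the names `spmv_csr`, `values`, `col_idx`, and `row_ptr` are assumptions made for illustration only. The even split of bandwidth across workers is likewise an assumption; the abstract gives only the aggregate figure.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A held in CSR form (assumed format)."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Non-zeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


# Hypothetical 3x3 example: A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])

# Aggregate cache bandwidth from the abstract, divided evenly over
# 128 workers (the even split is an assumption, not a stated result).
per_worker_gbs = 1638.4 / 128

print(y, per_worker_gbs)
```

In Conjugate Gradient, one such product is computed per iteration, which is why this kernel is the natural focus of a scalability study.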
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ciobanu, C.B., Martorell, X., Kuzmanov, G.K., Ramirez, A., Gaydadjiev, G.N. (2011). Scalability Evaluation of a Polymorphic Register File: A CG Case Study. In: Berekovic, M., Fornaciari, W., Brinkschulte, U., Silvano, C. (eds) Architecture of Computing Systems - ARCS 2011. ARCS 2011. Lecture Notes in Computer Science, vol 6566. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19137-4_2
DOI: https://doi.org/10.1007/978-3-642-19137-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19136-7
Online ISBN: 978-3-642-19137-4
eBook Packages: Computer Science, Computer Science (R0)