Open access

NoCMsg: A Scalable Message-Passing Abstraction for Network-on-Chips

Published: 09 March 2015

Abstract

The number of cores in contemporary processors is constantly increasing and thus continues to deliver ever higher peak performance (following Moore's law for transistor counts). Yet high core counts present a challenge to hardware and software alike. Following this trend, the network-on-chip (NoC) topology has evolved from buses via rings and fully connected meshes to 2D meshes.
This work contributes NoCMsg, a low-level message-passing abstraction over NoCs, which is specifically designed for large core counts in 2D meshes. NoCMsg ensures deadlock-free messaging for wormhole Manhattan-path routing over the NoC via a polling-based message abstraction and non-flow-controlled communication for selective communication patterns. Experimental results on the TilePro hardware platform show that NoCMsg can significantly reduce communication times, by up to 86% for single-packet messages and up to 40% for larger messages, compared to other NoC-based messaging approaches. On the TilePro platform, NoCMsg outperforms shared memory abstractions by up to 93% as core counts and interprocess communication increase. Results for fully pipelined double-precision numerical codes show speedups of up to 64% for message passing over shared memory at 32 cores. Overall, we observe that shared memory scales up to about 16 cores on this platform, whereas message passing performs well beyond that threshold. These results generalize to similar NoC-based platforms.


Published In

ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 1
April 2015, 201 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2744295
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 March 2015
Accepted: 01 December 2014
Revised: 01 November 2014
Received: 01 May 2014
Published in TACO Volume 12, Issue 1

Author Tags

  1. multicore architectures
  2. message passing
  3. shared memory

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • subcontract from SecurBoration
  • NSF
