Open access

NoCMsg: A Scalable Message-Passing Abstraction for Network-on-Chips

Published: 09 March 2015

Abstract

The number of cores in contemporary processors is constantly increasing and thus continues to deliver ever higher peak performance (following Moore's law for transistor counts). Yet high core counts present a challenge to hardware and software alike. Following this trend, the network-on-chip (NoC) topology has evolved from buses via rings and fully connected meshes to 2D meshes.
This work contributes NoCMsg, a low-level message-passing abstraction over NoCs, which is specifically designed for large core counts in 2D meshes. NoCMsg ensures deadlock-free messaging for wormhole Manhattan-path routing over the NoC via a polling-based message abstraction and non-flow-controlled communication for selective communication patterns. Experimental results on the TilePro hardware platform show that NoCMsg can significantly reduce communication times, by up to 86% for single-packet messages and up to 40% for larger messages, compared to other NoC-based messaging approaches. On the TilePro platform, NoCMsg outperforms shared memory abstractions by up to 93% as core counts and interprocess communication increase. Results for fully pipelined double-precision numerical codes show speedups of up to 64% for message passing over shared memory at 32 cores. Overall, we observe that shared memory scales up to about 16 cores on this platform, whereas message passing performs well beyond that threshold. These results generalize to similar NoC-based platforms.


Published In

ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 1
April 2015, 201 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2744295
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 March 2015
Accepted: 01 December 2014
Revised: 01 November 2014
Received: 01 May 2014
Published in TACO Volume 12, Issue 1

Author Tags

  1. multicore architectures
  2. message passing
  3. shared memory

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • subcontract from SecurBoration
  • NSF
