Hardware-Assisted Thread and Data Mapping in Hierarchical Multicore Architectures

Authors:

Eduardo H. M. Cruz,

Matthias Diener,

Laércio L. Pilla,

Philippe O. A. Navaux

ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 3

Article No.: 28, Pages 1 - 28

https://doi.org/10.1145/2975587

Published: 17 September 2016

Abstract

The performance and energy efficiency of modern architectures depend on memory locality, which can be improved by thread and data mappings that take into account the memory access behavior of parallel applications. In this article, we propose intense pages mapping, a mechanism that analyzes memory access behavior using the time that each page's entry resides in the translation lookaside buffer (TLB). It provides accurate information with very low overhead. We present experimental results from simulation and real machines, with average performance improvements of 13.7% and energy savings of 4.4%, which come from reductions in cache misses and interconnection traffic.
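As a rough illustration of the idea of using translation lookaside buffer residency time as a proxy for access intensity, the following Python sketch simulates one small buffer per core and assigns each page to the memory node whose cores kept its entry resident the longest. This is not the mechanism proposed in the article: the LRU replacement policy, the one-access-per-tick time model, the "longest residency wins" placement rule, and all names (`TinyTLB`, `map_pages`, `core_to_node`) are assumptions made only for this example.

```python
from collections import OrderedDict, defaultdict


class TinyTLB:
    """Toy LRU TLB that records how long each page's entry stays resident.

    Hypothetical model for illustration; not the hardware design from the article.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()       # page -> tick at which the entry was inserted
        self.residency = defaultdict(int)  # page -> total ticks spent resident

    def access(self, page, now):
        if page in self.entries:
            self.entries.move_to_end(page)  # hit: refresh the LRU position
            return
        if len(self.entries) >= self.capacity:
            # Miss with a full TLB: evict the least recently used entry
            # and credit the time it spent resident.
            victim, inserted = self.entries.popitem(last=False)
            self.residency[victim] += now - inserted
        self.entries[page] = now

    def flush(self, now):
        """Credit residency for all entries still present, then empty the TLB."""
        for page, inserted in self.entries.items():
            self.residency[page] += now - inserted
        self.entries.clear()


def map_pages(trace, core_to_node, tlb_capacity=2):
    """Return {page: node} from a trace of (core, page) accesses, one per tick.

    Assumed policy: place each page on the node whose cores accumulated
    the most TLB residency time for it.
    """
    tlbs = {core: TinyTLB(tlb_capacity) for core in core_to_node}
    for tick, (core, page) in enumerate(trace):
        tlbs[core].access(page, tick)

    per_page = defaultdict(lambda: defaultdict(int))  # page -> node -> ticks
    for core, tlb in tlbs.items():
        tlb.flush(len(trace))
        for page, ticks in tlb.residency.items():
            per_page[page][core_to_node[core]] += ticks

    return {page: max(nodes, key=nodes.get) for page, nodes in per_page.items()}


if __name__ == "__main__":
    trace = [(0, "A"), (0, "A"), (1, "B"), (0, "A"), (1, "A"), (1, "B")]
    print(map_pages(trace, core_to_node={0: 0, 1: 1}))
```

In this toy trace, page `A` stays resident on core 0 for the whole run while core 1 touches it only near the end, so the sketch places `A` on node 0 and `B` on node 1. A real system would also have to weigh migration cost and thread placement, which this sketch ignores.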

Index Terms

Hardware-Assisted Thread and Data Mapping in Hierarchical Multicore Architectures
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        1. Memory management
          1. Main memory

Published In

ACM Transactions on Architecture and Code Optimization Volume 13, Issue 3

September 2016

207 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2988523

Editor:
Koen De Bosschere
Ghent University

Copyright © 2016 ACM.

© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 September 2016

Accepted: 01 July 2016

Revised: 01 July 2016

Received: 01 February 2016

Published in TACO Volume 13, Issue 3

Qualifiers

Research-article
Research
Refereed
