DOI: 10.1145/2597652.2597674
research-article

Scalable analysis of multicore data reuse and sharing

Published: 10 June 2014

Abstract

The performance and energy efficiency of multicore systems are increasingly dominated by the costs of communication. As hardware parallelism grows, developers require more powerful tools to assess the data sharing and reuse properties of their algorithms. Reuse distance is an effective metric for studying the temporal locality of programs and for modeling private and shared caches, but applying it is challenging. First, generating memory traces is expensive in storage and intrusive on execution, possibly distorting the parallel schedule. Second, the reuse distance algorithm is computationally expensive, which limits the length, memory size and parallelism of the programs that can be analyzed.
This paper introduces a novel coarse-grained reuse distance method, called Kernel Reuse Distance (KRD), which addresses these challenges. KRD enables a quick assessment of data locality by studying the reuse characteristics of the kernels' inputs and outputs. We analyze the performance of the initial prototype implementation and show two use cases comparing different parallel implementations. On a 24-core system, analyzing a trace from a matrix multiplication representing 24 threads, 1.37 terabytes of streamed data and 800 million distinct accesses, the parallel KRD implementation is able to compute the coherence-aware kernel reuse distance histogram for one socket (six cores) in 11.1 seconds.
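
As a rough illustration of the metric the abstract builds on, the following minimal sketch computes classic reuse distances with a naive LRU stack, and then a coarse-grained variant in which each access names a whole data block with a size in bytes. This is not the paper's KRD implementation: the function names, the trace layout, and the choice to measure block distances in bytes are assumptions made only for this example.

```python
from collections import Counter
from math import inf


def reuse_distances(trace):
    """Reuse distance of each access: the number of distinct addresses
    touched since the previous access to the same address (inf on first use).
    Naive LRU-stack version, quadratic in the worst case."""
    stack = []  # LRU stack, most recently used address last
    out = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            out.append(len(stack) - 1 - pos)  # entries above addr in the stack
            stack.pop(pos)
        else:
            out.append(inf)
        stack.append(addr)
    return out


def hit_ratio(distances, capacity):
    """Fraction of accesses that would hit in a fully associative LRU cache
    holding `capacity` addresses (an access hits when distance < capacity)."""
    return sum(1 for d in distances if d < capacity) / len(distances)


def block_reuse_distances(trace):
    """Hypothetical coarse-grained variant: `trace` holds (block_id, size_bytes)
    pairs, for example the inputs and outputs of a kernel. The distance is the
    total size of the distinct blocks touched since the last use of the block."""
    stack = []  # (block_id, size) pairs, most recently used last
    out = []
    for block_id, size in trace:
        ids = [b for b, _ in stack]
        if block_id in ids:
            pos = ids.index(block_id)
            out.append(sum(s for _, s in stack[pos + 1:]))
            stack.pop(pos)
        else:
            out.append(inf)
        stack.append((block_id, size))
    return out


def histogram(distances):
    """Count how often each distance occurs."""
    return Counter(distances)
```

For the trace `a b c a b b`, `reuse_distances` assigns an infinite distance to the three first uses and finite distances to the three reuses, so `hit_ratio` can be read off for any cache capacity. Working at block granularity shortens the trace considerably, which is the general motivation for a coarse-grained analysis.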


Published In

ICS '14: Proceedings of the 28th ACM international conference on Supercomputing
June 2014
378 pages
ISBN: 9781450326421
DOI: 10.1145/2597652
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. data reuse and sharing
  2. instrumentation
  3. multithreaded runtime systems
  4. reuse distance

Qualifiers

  • Research-article

Conference

ICS'14

Acceptance Rates

ICS '14 Paper Acceptance Rate 34 of 160 submissions, 21%;
Overall Acceptance Rate 629 of 2,180 submissions, 29%


Cited By

  • (2022) ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer. ACM Transactions on Architecture and Code Optimization 19:1 (1-25). DOI: 10.1145/3484199. Online publication date: 31-Mar-2022
  • (2022) Low-Overhead Reuse Distance Profiling Tool for Multicore. Euro-Par 2021: Parallel Processing Workshops (555-559). DOI: 10.1007/978-3-031-06156-1_49. Online publication date: 9-Jun-2022
  • (2020) A Locality Optimizer for Loop-dominated Applications Based on Reuse Distance Analysis. ACM Transactions on Design Automation of Electronic Systems 25:6 (1-26). DOI: 10.1145/3398189. Online publication date: 2-Sep-2020
  • (2018) Global Dead-Block Management for Task-Parallel Programs. ACM Transactions on Architecture and Code Optimization 15:3 (1-25). DOI: 10.1145/3234337. Online publication date: 4-Sep-2018
  • (2018) Elastic Places. ACM Transactions on Architecture and Code Optimization 15:2 (1-26). DOI: 10.1145/3185458. Online publication date: 1-May-2018
  • (2017) Runtime-Assisted Global Cache Management for Task-Based Parallel Programs. IEEE Computer Architecture Letters 16:2 (145-148). DOI: 10.1109/LCA.2016.2606593. Online publication date: 1-Jul-2017
  • (2017) CDLP: A Core Distributing Policy Based on Logic Partitioning. Green, Pervasive, and Cloud Computing (443-459). DOI: 10.1007/978-3-319-57186-7_33. Online publication date: 13-Apr-2017
