Dynamic cache contention detection in multi-threaded applications

Authors: Derek Bruening, Weng-Fai Wong, and Saman Amarasinghe

VEE '11: Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments

March 2011

Pages 27 - 38

https://doi.org/10.1145/1952682.1952688

Published: 09 March 2011

Abstract

In today's multi-core systems, cache contention due to true and false sharing can cause unexpected and significant performance degradation. A detailed understanding of a given multi-threaded application's behavior is required to precisely identify such performance bottlenecks. Traditionally, however, such diagnostic information can only be obtained after lengthy simulation of the memory hierarchy.

In this paper, we present a novel approach that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing. It is based on the following key insight: although the slowdown caused by cache contention depends on factors including the thread-to-core binding and parameters of the memory hierarchy, the amount of data sharing is primarily a function of the cache line size and application behavior. Using memory shadowing and dynamic instrumentation, we implemented a tool that obtains detailed sharing information between threads without simulating the full complexity of the memory hierarchy. The runtime overhead of our approach, a 5x slowdown on average relative to native execution, is significantly less than that of detailed cache simulation. The information collected allows programmers to identify the degree of cache contention in an application, the correlation among its threads, and the sources of significant false sharing. Using our approach, we were able to improve the performance of some applications by up to a factor of 12. For other contention-intensive applications, we were able to shed light on the obstacles that prevent their performance from scaling to many cores.
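The key insight above, that sharing depends mainly on the cache line size and the application's access pattern, can be illustrated with a small sketch. The following Python model is not the authors' tool: the class names `LineShadow` and `SharingDetector`, the 64-byte line size, and the classification rule are all assumptions made for illustration. It keeps one shadow record per cache line and classifies an access as true or false sharing by checking whether it overlaps the bytes most recently written by another thread.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional, Set

CACHE_LINE_SIZE = 64  # bytes; an assumed value, the abstract names no size


@dataclass
class LineShadow:
    """Hypothetical shadow record for one cache line."""
    last_writer: Optional[int] = None               # thread that wrote last
    written: Set[int] = field(default_factory=set)  # byte offsets it wrote
    holders: Set[int] = field(default_factory=set)  # threads with a valid copy


class SharingDetector:
    """Toy detector: no cache capacity, associativity, or core binding."""

    def __init__(self, line_size: int = CACHE_LINE_SIZE):
        self.line_size = line_size
        self.shadow = {}         # cache line index -> LineShadow
        self.events = Counter()  # (kind, cache line index) -> count

    def access(self, tid: int, addr: int, size: int, is_write: bool):
        """Record one access; return 'true', 'false', or None."""
        line, start = divmod(addr, self.line_size)
        # Simplification: the access is assumed not to cross a line boundary.
        touched = set(range(start, start + size))
        rec = self.shadow.setdefault(line, LineShadow())
        kind = None
        if tid not in rec.holders:
            # This thread has no valid copy. If another thread wrote the
            # line, the access is a sharing event: true sharing when the
            # bytes overlap, false sharing when they do not.
            if rec.last_writer is not None and rec.last_writer != tid:
                kind = "true" if touched & rec.written else "false"
                self.events[(kind, line)] += 1
            rec.holders.add(tid)
        if is_write:
            if rec.last_writer != tid:
                rec.written = set()  # new owner: restart byte tracking
            rec.last_writer = tid
            rec.written |= touched
            rec.holders = {tid}      # a write invalidates all other copies
        return kind


if __name__ == "__main__":
    d = SharingDetector()
    # Two threads update adjacent 4-byte counters in the same 64-byte line.
    d.access(0, 0x1000, 4, True)
    print(d.access(1, 0x1004, 4, True))   # different bytes, same line
    print(d.access(0, 0x1000, 4, True))   # different bytes, same line
    print(d.access(1, 0x1000, 4, False))  # reads bytes thread 0 just wrote
```

Under these assumptions the two counter updates are reported as false sharing and the final read as true sharing. Padding each counter to its own cache line would make the first two events disappear, which is the kind of fix such a report points toward.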



Index Terms

Dynamic cache contention detection in multi-threaded applications
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments



Published In


VEE '11: Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments

March 2011

250 pages

ISBN:9781450306874

DOI:10.1145/1952682

General Chair: Erez Petrank, The Technion, Israel
Program Chair: Doug Lea, SUNY Oswego, USA

ACM SIGPLAN Notices Volume 46, Issue 7
VEE '11
July 2011
231 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2007477

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

VEE '11

VEE '11: ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments

March 9 - 11, 2011

Newport Beach, California, USA

Acceptance Rates

Overall Acceptance Rate 80 of 235 submissions, 34%

