DOI: 10.1145/1952682.1952688
research-article

Dynamic cache contention detection in multi-threaded applications

Published: 09 March 2011

    Abstract

    In today's multi-core systems, cache contention due to true and false sharing can cause unexpected and significant performance degradation. A detailed understanding of a given multi-threaded application's behavior is required to precisely identify such performance bottlenecks. Traditionally, however, such diagnostic information can only be obtained after lengthy simulation of the memory hierarchy.
    In this paper, we present a novel approach that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing. It is based on the following key insight: although the slowdown caused by cache contention depends on factors including the thread-to-core binding and parameters of the memory hierarchy, the amount of data sharing is primarily a function of the cache line size and application behavior. Using memory shadowing and dynamic instrumentation, we implemented a tool that obtains detailed sharing information between threads without simulating the full complexity of the memory hierarchy. The runtime overhead of our approach, a 5x slowdown on average relative to native execution, is significantly less than that of detailed cache simulation. The information collected allows programmers to identify the degree of cache contention in an application, the correlation among its threads, and the sources of significant false sharing. Using our approach, we were able to improve the performance of some applications by up to a factor of 12. For other contention-intensive applications, we were able to shed light on the obstacles that prevent their performance from scaling to many cores.
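
The distinction between true and false sharing at cache-line granularity can be illustrated with a small sketch. This is not the tool described in the abstract, which relies on dynamic binary instrumentation. It is a simplified model under stated assumptions: a 64-byte cache line, accesses identified only by their starting byte address, and a hypothetical `SharingDetector` class whose shadow table records, for every byte offset in a line, which threads touched it and whether they wrote.

```python
from collections import defaultdict

CACHE_LINE_SIZE = 64  # assumed line size in bytes; real hardware may differ


class SharingDetector:
    """Hypothetical shadow-memory model: one shadow record per cache line."""

    def __init__(self, line_size=CACHE_LINE_SIZE):
        self.line_size = line_size
        # shadow[line][offset] maps thread_id -> True if that thread wrote there
        self.shadow = defaultdict(lambda: defaultdict(dict))

    def record(self, thread_id, address, is_write):
        """Record one memory access by its starting byte address."""
        line, offset = divmod(address, self.line_size)
        accessors = self.shadow[line][offset]
        accessors[thread_id] = accessors.get(thread_id, False) or is_write

    def classify(self, line):
        """Return 'true', 'false', or 'none' for the given cache line index."""
        threads = set()
        any_write = False
        same_location_conflict = False
        for accessors in self.shadow.get(line, {}).values():
            threads.update(accessors)
            wrote_here = any(accessors.values())
            any_write = any_write or wrote_here
            # Several threads on the same bytes, at least one writing:
            # the threads really do communicate through this location.
            if len(accessors) > 1 and wrote_here:
                same_location_conflict = True
        # A single thread, or read-only sharing, causes no invalidations.
        if len(threads) < 2 or not any_write:
            return "none"
        # Simplification: a line with both kinds of sharing is reported as 'true'.
        return "true" if same_location_conflict else "false"


detector = SharingDetector()

# Two threads write different words of the same line: false sharing.
detector.record(0, 0x1000, True)
detector.record(1, 0x1008, True)

# One thread writes a word that another thread reads: true sharing.
detector.record(0, 0x2000, True)
detector.record(1, 0x2000, False)

# Two threads only read the same line: no contention.
detector.record(0, 0x3000, False)
detector.record(1, 0x3004, False)

for address in (0x1000, 0x2000, 0x3000):
    line = address // CACHE_LINE_SIZE
    print(hex(address), detector.classify(line))
```

The classification depends only on the line size and the access pattern, not on thread-to-core binding or cache parameters, which mirrors the key insight stated above. A real tool would also need to account for access sizes and the ordering of accesses over time, both of which this sketch ignores.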



      Published In

      VEE '11: Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
      March 2011
      250 pages
      ISBN: 9781450306874
      DOI: 10.1145/1952682
      • ACM SIGPLAN Notices, Volume 46, Issue 7
        VEE '11
        July 2011
        231 pages
        ISSN: 0362-1340
        EISSN: 1558-1160
        DOI: 10.1145/2007477

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. cache contention
      2. dynamic instrumentation
      3. false sharing
      4. shadow memory

      Qualifiers

      • Research-article

      Conference

      VEE '11

      Acceptance Rates

      Overall Acceptance Rate 80 of 235 submissions, 34%
