Research article, public access. DOI: 10.1145/3064176.3064186

SyncPerf: Categorizing, Detecting, and Diagnosing Synchronization Performance Bugs

Published: 23 April 2017

Abstract

    Despite their obvious importance, performance issues related to synchronization primitives still receive inadequate attention. No prior work extensively investigates the categories, root causes, and fixing strategies of such issues. Existing work focuses primarily on one type of problem while ignoring other important categories, and it leaves the burden of identifying root causes to programmers. This paper first conducts an extensive study of the categories, root causes, and fixing strategies of performance issues related to explicit synchronization primitives. Based on this study, we develop two tools to identify the root causes of a range of performance issues. Compared with existing work, our proposal, SyncPerf, has three unique advantages. First, SyncPerf's detection is very lightweight, with 2.3% performance overhead on average. Second, SyncPerf integrates information based on callsites, lock variables, and types of threads; this integration helps identify more latent problems. Last but not least, when multiple root causes generate the same behavior, SyncPerf provides a second analysis tool that collects detailed accesses inside critical sections and helps identify possible root causes. SyncPerf discovers many previously unknown but significant synchronization performance issues; fixing them yields performance gains ranging from 2.5% to 42%. Low overhead, better coverage, and informative reports make SyncPerf an effective tool for finding synchronization performance bugs in production environments.
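
To make the idea of combining per-lock and per-callsite information concrete, the following is a minimal sketch in Python. It is not SyncPerf's implementation, which the abstract does not describe in detail. The class `ProfiledLock`, its `callsite` argument, and the use of a non-blocking try-acquire to detect contention are all assumptions chosen for illustration.

```python
import threading
from collections import defaultdict


class ProfiledLock:
    """Hypothetical lock wrapper (an illustration, not SyncPerf's code).

    For each callsite it counts how many times the lock was acquired and
    how many of those acquisitions found the lock already held.
    """

    def __init__(self, name):
        self.name = name                      # identifies the lock variable
        self._lock = threading.Lock()
        self.stats = defaultdict(lambda: {"acquisitions": 0, "contended": 0})

    def acquire(self, callsite):
        # Assumption: a failed non-blocking attempt is treated as contention.
        contended = not self._lock.acquire(blocking=False)
        if contended:
            self._lock.acquire()              # fall back to a blocking acquire
        # The lock is held here, so the counters are updated by one thread at a time.
        entry = self.stats[callsite]
        entry["acquisitions"] += 1
        entry["contended"] += int(contended)

    def release(self):
        self._lock.release()

    def contention_rate(self, callsite):
        """Fraction of acquisitions at this callsite that had to wait."""
        entry = self.stats.get(callsite)
        if not entry or entry["acquisitions"] == 0:
            return 0.0
        return entry["contended"] / entry["acquisitions"]
```

A callsite with many acquisitions but a low contention rate and one with few acquisitions but a high contention rate would suggest different fixes, which is one reason to keep both counters per callsite rather than a single total per lock.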



      Published In

      EuroSys '17: Proceedings of the Twelfth European Conference on Computer Systems
      April 2017
      648 pages
      ISBN: 9781450349383
      DOI: 10.1145/3064176
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      EuroSys '17
      EuroSys '17: Twelfth EuroSys Conference 2017
      April 23 - 26, 2017
      Belgrade, Serbia

      Acceptance Rates

      Overall Acceptance Rate 241 of 1,308 submissions, 18%

      Cited By

      • (2024) Identification of Java lock contention anti-patterns based on run-time performance data. Proceedings of the 5th ACM/IEEE International Conference on Automation of Software Test (AST 2024), pages 209-213. DOI: 10.1145/3644032.3644466. Online publication date: 15-Apr-2024
      • (2024) An Adaptive Logging System (ALS): Enhancing Software Logging with Reinforcement Learning Techniques. Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering, pages 37-47. DOI: 10.1145/3629526.3645033. Online publication date: 7-May-2024
      • (2024) TCSA: Efficient Localization of Busy-Wait Synchronization Bugs for Latency-Critical Applications. IEEE Transactions on Parallel and Distributed Systems, 35:2, pages 297-309. DOI: 10.1109/TPDS.2023.3342573. Online publication date: Feb-2024
      • (2023) MemPerf: Profiling Allocator-Induced Performance Slowdowns. Proceedings of the ACM on Programming Languages, 7:OOPSLA2, pages 1418-1441. DOI: 10.1145/3622848. Online publication date: 16-Oct-2023
      • (2023) Performance Bug Analysis and Detection for Distributed Storage and Computing Systems. ACM Transactions on Storage, 19:3, pages 1-33. DOI: 10.1145/3580281. Online publication date: 19-Jun-2023
      • (2023) Effective Performance Issue Diagnosis with Value-Assisted Cost Profiling. Proceedings of the Eighteenth European Conference on Computer Systems, pages 1-17. DOI: 10.1145/3552326.3587444. Online publication date: 8-May-2023
      • (2023) A Large-Scale Empirical Study of Real-Life Performance Issues in Open Source Projects. IEEE Transactions on Software Engineering, 49:2, pages 924-946. DOI: 10.1109/TSE.2022.3167628. Online publication date: 1-Feb-2023
      • (2023) A Lock Contention Classifier Based on Java Lock Contention Anti-Patterns. 2023 International Conference on Machine Learning and Applications (ICMLA), pages 1106-1113. DOI: 10.1109/ICMLA58977.2023.00165. Online publication date: 15-Dec-2023
      • (2022) On debugging the performance of configurable software systems. Proceedings of the 44th International Conference on Software Engineering, pages 1571-1583. DOI: 10.1145/3510003.3510043. Online publication date: 21-May-2022
      • (2022) PerfJIT: Test-Level Just-in-Time Prediction for Performance Regression Introducing Commits. IEEE Transactions on Software Engineering, 48:5, pages 1529-1544. DOI: 10.1109/TSE.2020.3023955. Online publication date: 1-May-2022
