DOI: 10.1145/3178487.3178499
research-article
Public Access

Featherlight on-the-fly false-sharing detection

Published: 10 February 2018

Abstract

Shared-memory parallel programs routinely suffer from false sharing: a performance degradation that occurs when different threads access different variables residing on the same CPU cache line and at least one of those variables is modified. State-of-the-art tools detect false sharing via a heavyweight process of logging memory accesses and feeding the resulting access traces to an offline cache simulator. We have developed Feather, a lightweight, on-the-fly false-sharing detection tool. Feather achieves low overhead by exploiting two hardware features ubiquitous in commodity CPUs: performance monitoring units (PMUs) and debug registers. Additionally, Feather is a first-of-its-kind tool to detect false sharing in multi-process applications that use shared memory. Feather allowed us to scale false-sharing detection to myriad codes. Feather detected several false-sharing cases in important multi-core and multi-process codes, including previous PPoPP artifacts. Eliminating false sharing resulted in dramatic (up to 16x) speedups.
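
The definition of false sharing given above can be made concrete with a small sketch. The code below is not Feather's detection mechanism; it only encodes the definition as a predicate over two memory accesses. The `Access` record, the helper names, and the 64-byte line size are all assumptions made for illustration (64 bytes is common on x86 but not universal).

```python
from dataclasses import dataclass

CACHE_LINE_BYTES = 64  # assumed line size; common on x86, but not universal


@dataclass(frozen=True)
class Access:
    """Hypothetical record of one memory access (illustration only)."""
    thread_id: int
    address: int    # starting byte address
    size: int       # number of bytes touched
    is_write: bool


def cache_lines(access, line_bytes=CACHE_LINE_BYTES):
    """Return the set of cache-line indices that an access touches."""
    first = access.address // line_bytes
    last = (access.address + access.size - 1) // line_bytes
    return set(range(first, last + 1))


def overlaps(a, b):
    """True if the two accesses touch at least one common byte."""
    return a.address < b.address + b.size and b.address < a.address + a.size


def is_false_sharing(a, b, line_bytes=CACHE_LINE_BYTES):
    """Different threads, a shared line, disjoint bytes, at least one write."""
    return (
        a.thread_id != b.thread_id
        and bool(cache_lines(a, line_bytes) & cache_lines(b, line_bytes))
        and not overlaps(a, b)
        and (a.is_write or b.is_write)
    )


# Two 8-byte counters packed side by side fall on the same 64-byte line.
t0_write = Access(thread_id=0, address=0x1000, size=8, is_write=True)
t1_read = Access(thread_id=1, address=0x1008, size=8, is_write=False)
print(is_false_sharing(t0_write, t1_read))    # True

# Moving the second counter to the next line boundary removes the conflict.
t1_padded = Access(thread_id=1, address=0x1040, size=8, is_write=False)
print(is_false_sharing(t0_write, t1_padded))  # False
```

The second case mirrors the usual remedy: padding or aligning per-thread data so that each thread's variable occupies its own cache line. Accesses to overlapping bytes are deliberately excluded, since those constitute true sharing rather than false sharing.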

Supplementary Material

Artifacts Available, Part 1 of 5 (feather-v0.1-02-07-2018.zip)
A featherlight on-the-fly false-sharing detection tool
Artifacts Available, Part 3 of 5 (hpctoolkit-externals-v0.1-02-07-2018.zip)
HPCToolkit performance tools: essential third party libraries for hpctoolkit
Artifacts Available, Part 2 of 5 (hpctoolkit-v0.1-02-07-2018.zip)
HPCToolkit performance tools: measurement and analysis components
Artifacts Available, Part 4 of 5 (libmonitor-v0.1-02-07-2018.zip)
HPCToolkit performance tools: libmonitor - a substrate for monitoring tools

Published In

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2018
442 pages
ISBN:9781450349826
DOI:10.1145/3178487
  • ACM SIGPLAN Notices, Volume 53, Issue 1
    PPoPP '18
    January 2018
    426 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/3200691
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Author Tags

  1. PMU
  2. debug registers
  3. false sharing
  4. profiling
  5. sampling

Qualifiers

  • Research-article

Conference

PPoPP '18

Article Metrics

  • Downloads (Last 12 months): 421
  • Downloads (Last 6 weeks): 26
Reflects downloads up to 16 Jan 2025

Cited By

  • (2024) Scaler: Efficient and Effective Cross Flow Analysis. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 907-918. DOI: 10.1145/3691620.3695473. Online publication date: 27-Oct-2024
  • (2024) ParaShareDetect: Dynamic Instrumentation and Runtime Analysis for False Sharing Detection in Parallel Computing. 2024 4th International Conference on Computer, Control and Robotics (ICCCR), 230-235. DOI: 10.1109/ICCCR61138.2024.10585404. Online publication date: 19-Apr-2024
  • (2024) EasyView: Bringing Performance Profiles into Integrated Development Environments. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, 386-398. DOI: 10.1109/CGO57630.2024.10444840. Online publication date: 2-Mar-2024
  • (2023) MemPerf: Profiling Allocator-Induced Performance Slowdowns. Proceedings of the ACM on Programming Languages 7:OOPSLA2, 1418-1441. DOI: 10.1145/3622848. Online publication date: 16-Oct-2023
  • (2023) Precise Event Sampling on AMD Versus Intel: Quantitative and Qualitative Comparison. IEEE Transactions on Parallel and Distributed Systems 34:5, 1594-1608. DOI: 10.1109/TPDS.2023.3257105. Online publication date: May-2023
  • (2022) SlowCoach: Mutating Code to Simulate Performance Bugs. 2022 IEEE 33rd International Symposium on Software Reliability Engineering (ISSRE), 274-285. DOI: 10.1109/ISSRE55969.2022.00035. Online publication date: Oct-2022
  • (2022) Raptor: Mitigating CPU-GPU False Sharing Under Unified Memory Systems. 2022 IEEE 13th International Green and Sustainable Computing Conference (IGSC), 1-8. DOI: 10.1109/IGSC55832.2022.9969376. Online publication date: 24-Oct-2022
  • (2021) Break dancing: low overhead, architecture neutral software branch tracing. Proceedings of the 22nd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, 122-133. DOI: 10.1145/3461648.3463853. Online publication date: 22-Jun-2021
  • (2021) Ghostwriter: A Cache Coherence Protocol for Error-Tolerant Applications. 50th International Conference on Parallel Processing Workshop, 1-10. DOI: 10.1145/3458744.3474045. Online publication date: 9-Aug-2021
  • (2019) Huron: hybrid false sharing detection and repair. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, 453-468. DOI: 10.1145/3314221.3314644. Online publication date: 8-Jun-2019
