DOI: 10.1145/3332466.3374535
research-article

No barrier in the road: a comprehensive study and optimization of ARM barriers

Published: 19 February 2020

    Abstract

    In this paper, we present the first comprehensive performance characterization and optimization of ARM barriers on both mobile and server platforms. We draw a set of observations through several abstracted models and validate them in scenarios where barriers are intensively used. We find that (1) order-preserving approaches without involving the bus significantly outperform other approaches, and (2) the tremendous overhead mostly comes from barriers strictly following remote memory references. Usually, such barriers are inserted when threads are exchanging data, and they are used to ensure the relative order between storing the data to a shared buffer and setting a flag to inform the receiver. Based on the observations, we propose a new mechanism, Pilot, to remove such barriers by leveraging the single-copy atomicity to piggyback the flag with the data. Applying Pilot only requires minor changes to applications and provides 10%-360% performance improvements in multiple benchmarks, which are close to the ideal performance without barriers.
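
    The data-then-flag pattern described above, and the idea of folding the flag into the data word, can be sketched with C11 atomics. This is a minimal illustration under stated assumptions, not the authors' Pilot implementation: the names `flagged_slot`, `piggyback_slot`, and their send/recv helpers are hypothetical, the payload is assumed to fit in 32 bits so that payload and a sequence tag share one aligned 64-bit word, and the piggybacked variant omits flow control (the sender is assumed not to overwrite an unread message).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Baseline: payload and flag live in separate locations, so the flag
 * store must be ordered after the payload store. */
typedef struct {
    uint64_t data;      /* plain (non-atomic) payload */
    atomic_uint ready;  /* 0 = empty, 1 = payload valid */
} flagged_slot;

static bool flagged_send(flagged_slot *s, uint64_t value) {
    /* Acquire pairs with the receiver's release when it frees the slot. */
    if (atomic_load_explicit(&s->ready, memory_order_acquire) != 0)
        return false;  /* previous message not consumed yet */
    s->data = value;
    /* This release ordering is the kind of barrier the abstract refers to:
     * it keeps the payload store ahead of the flag store. */
    atomic_store_explicit(&s->ready, 1, memory_order_release);
    return true;
}

static bool flagged_recv(flagged_slot *s, uint64_t *out) {
    if (atomic_load_explicit(&s->ready, memory_order_acquire) == 0)
        return false;  /* nothing published yet */
    *out = s->data;
    atomic_store_explicit(&s->ready, 0, memory_order_release);
    return true;
}

/* Piggybacked variant (assumption of this sketch): the high 32 bits hold a
 * sequence tag that doubles as the flag, the low 32 bits hold the payload.
 * A single aligned 64-bit atomic store makes both visible together, so no
 * ordering between two separate stores is required. */
typedef struct {
    _Atomic uint64_t word;
} piggyback_slot;

static void piggyback_send(piggyback_slot *s, uint32_t seq, uint32_t payload) {
    assert(seq != 0);  /* tag 0 is reserved for "empty" in this sketch */
    uint64_t w = ((uint64_t)seq << 32) | payload;
    atomic_store_explicit(&s->word, w, memory_order_relaxed);
}

static bool piggyback_recv(piggyback_slot *s, uint32_t expected_seq,
                           uint32_t *out) {
    uint64_t w = atomic_load_explicit(&s->word, memory_order_relaxed);
    if ((uint32_t)(w >> 32) != expected_seq)
        return false;  /* message with this tag has not arrived */
    *out = (uint32_t)w;
    return true;
}
```

    In the baseline, the receiver can only trust `data` because the flag store is ordered after it; in the piggybacked variant the receiver either sees the whole new word or the whole old one, which is the property the abstract attributes to single-copy atomicity. The cost in this sketch is a narrower payload, a trade-off chosen here for illustration only.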

    Published In

    PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    February 2020
    454 pages
    ISBN:9781450368186
    DOI:10.1145/3332466
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication Notes

    Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

    Author Tags

    1. barrier
    2. concurrency
    3. lock
    4. synchronization

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China

    Conference

    PPoPP '20

    Acceptance Rates

    PPoPP '20 Paper Acceptance Rate 28 of 121 submissions, 23%;
    Overall Acceptance Rate 230 of 1,014 submissions, 23%

    Article Metrics

    • Downloads (last 12 months): 39
    • Downloads (last 6 weeks): 12
    Reflects downloads up to 11 Aug 2024

    Cited By

    • (2024) Brief Announcement: Work Stealing through Partial Asynchronous Delegation. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, 281-283. DOI: 10.1145/3626183.3660261. Online publication date: 17-Jun-2024
    • (2023) AtoMig: Automatically Migrating Millions Lines of Code from TSO to WMM. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 61-73. DOI: 10.1145/3575693.3579849. Online publication date: 27-Jan-2023
    • (2023) Probabilistic Concurrency Testing for Weak Memory Programs. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 603-616. DOI: 10.1145/3575693.3575729. Online publication date: 27-Jan-2023
    • (2023) Risotto: A Dynamic Binary Translator for Weak Memory Model Architectures. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 107-122. DOI: 10.1145/3567955.3567962. Online publication date: 25-Mar-2023
    • (2022) Asymmetry-aware scalable locking. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 294-308. DOI: 10.1145/3503221.3508420. Online publication date: 2-Apr-2022
    • (2022) Software-Level Memory Regulation to Reduce Execution Time Variation on Multicore Real-Time Systems. IEEE Access 10, 93799-93811. DOI: 10.1109/ACCESS.2022.3203702. Online publication date: 2022
    • (2021) CLoF. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, 851-865. DOI: 10.1145/3477132.3483557. Online publication date: 26-Oct-2021
    • (2021) VSync: push-button verification and optimization for synchronization primitives on weak memory models. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 530-545. DOI: 10.1145/3445814.3446748. Online publication date: 19-Apr-2021
    • (2021) Performance Analysis and Optimization for SpMV Based on Aligned Storage Formats on an ARM Processor. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2021.08.002. Online publication date: Aug-2021
    • (2021) Verifying and Optimizing the HMCS Lock for Arm Servers. Networked Systems, 240-260. DOI: 10.1007/978-3-030-91014-3_17. Online publication date: 2-Dec-2021
